Last updated: 2021-04-30

Knit directory: CassavaNIRS/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e05f210. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:

Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/Cassavabase_phenotypes_20210419.csv
    Ignored:    data/Corrected_metadata/
    Ignored:    data/README.html
    Ignored:    data/README.txt
    Ignored:    data/Spectra/
    Ignored:    data/TrialNameKey.csv
    Ignored:    data/raw_pheno.csv
    Ignored:    data/raw_scans.csv
    Ignored:    output/.DS_Store
    Ignored:    output/Figure2_DMC_distributions.png
    Ignored:    output/Figure4_within_predictions.png
    Ignored:    output/Figure5_Subsamples.png
    Ignored:    output/Figure6_RF_Importance.png
    Ignored:    output/Figure7_CV_predictions.png
    Ignored:    output/FigureS2_within_trial_prediction_all.png
    Ignored:    output/S1_overlapping_accession_counts.csv
    Ignored:    output/S3_removed_scans.csv
    Ignored:    output/Table2_DMC_statistics.csv
    Ignored:    output/Table3_performance_summary.csv
    Ignored:    output/TableS2_within_trial_predictions.csv
    Ignored:    output/TableS4_cv_results.csv
    Ignored:    output/cv_base.png
    Ignored:    output/cv_results.csv
    Ignored:    output/full_filtered_plots.csv
    Ignored:    output/full_filtered_subsamples.csv
    Ignored:    output/full_filtered_unaggregated.csv
    Ignored:    output/subsampling_prediction_results_2021.csv
    Ignored:    output/within_trial_waves_PLSR.csv
    Ignored:    output/within_trial_waves_RF.csv
    Ignored:    output/within_trial_waves_RF_importance.csv
    Ignored:    output/within_trial_waves_SVM.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/index.Rmd) and HTML (docs/index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e05f210 Jenna Hershberger 2021-04-30 Update authors
html 56b612c Jenna Hershberger 2021-04-30 Build site.
Rmd 88fee14 Jenna Hershberger 2021-04-30 Build workflowr site
html 88fee14 Jenna Hershberger 2021-04-30 Build workflowr site
Rmd 759b463 Jenna Hershberger 2021-04-22 Update curation code
Rmd 8f143af Jenna Hershberger 2021-04-21 Add content
html 8f143af Jenna Hershberger 2021-04-21 Add content
Rmd fecea09 Jenna Hershberger 2021-04-19 Start workflowr project.


This repository documents all analyses, summaries, tables, and figures associated with the following preprint: Low-cost, handheld near-infrared spectroscopy for root dry matter content prediction in cassava


Over 800 million people across the tropics rely on cassava as a major source of calories. While the root dry matter content (RDMC) of this starchy root crop is important for both producers and consumers, characterization of RDMC by traditional methods is time-consuming and laborious for breeding programs. Alternate phenotyping methods have been proposed but lack the accuracy, cost, or speed ultimately needed for cassava breeding programs. For this reason, we investigated the use of a low-cost, handheld NIR spectrometer for field-based RDMC prediction in cassava. Oven-dried measurements of RDMC were paired with 21,044 scans of roots of 376 diverse clones from 10 field trials in Nigeria and grouped into training and test sets based on cross-validation schemes relevant to plant breeding programs. Mean partial least squares regression model performance ranged from R²p = 0.62 to 0.89 for within-trial predictions, which is within the range achieved with laboratory-grade spectrometers in previous studies. Relative to other factors, model performance was highly impacted by the inclusion of samples from the same environment in both the training and test sets. Random forest variable importance analysis of root spectra revealed increased importance in a region previously identified as predictive of water content in plants (~950 to 990 nm). With appropriate model calibration, the tested spectrometer will allow for field-based collection of spectral data with a smartphone for accurate RDMC prediction and potentially other quality traits, a step that could be easily integrated into existing harvesting workflows of cassava breeding programs.
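
As a rough illustration of the R² of prediction reported in the abstract, the sketch below computes the squared Pearson correlation between observed and predicted test-set values. This is an assumption about how the statistic is defined; the manuscript and analysis code are the authoritative source. The function name `r2p` and the numeric values are hypothetical and are not taken from the study data.

```r
# Squared Pearson correlation between observed and predicted test-set values.
# Assumption: this is the definition behind the reported statistic; check the
# manuscript and analysis scripts for the exact formula used.
r2p <- function(observed, predicted) {
  stats::cor(observed, predicted)^2
}

# Hypothetical oven-dried RDMC values (%) and model predictions for a test set.
observed  <- c(31.2, 35.8, 28.4, 40.1, 33.0)
predicted <- c(30.5, 36.9, 29.8, 38.7, 33.6)

r2p(observed, predicted)
```

Because this version is a squared correlation, it always lies between 0 and 1 and does not penalize systematic bias in the predictions.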

Data availability and reproducibility

The R package workflowr was used to document this study reproducibly.

Much of the supporting data and output from the analyses documented here are too large for GitHub.

The raw data for this repository are stored on CyVerse. Download the data folder from CyVerse and add its contents to the data/ folder in this repository to run the analysis code. When running the code, follow the order listed below.
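
Before running the analysis, it may help to confirm that the downloaded files landed in the right place. The sketch below is one way to do that from the repository root. The paths are taken from the ignored-files list in the workflowr status report above; whether the CyVerse folder contains exactly this set is an assumption, and `missing_data` is a hypothetical helper, not part of the repository.

```r
# Assumption: these paths come from the ignored-files list in the workflowr
# status report; the CyVerse folder may contain a different set of files.
expected <- c(
  "data/raw_pheno.csv",
  "data/raw_scans.csv",
  "data/TrialNameKey.csv",
  "data/Spectra",
  "data/Corrected_metadata"
)

# Return the expected paths that do not exist under `root`.
missing_data <- function(paths, root = ".") {
  paths[!file.exists(file.path(root, paths))]
}

missing <- missing_data(expected)
if (length(missing) > 0) {
  message("Missing from data/: ", paste(missing, collapse = ", "))
} else {
  message("All expected data files found.")
}
```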

Analysis overview

Some of the analyses in this manuscript were more efficiently run from the command line on a server with more memory than is common on desktop/laptop machines. The scripts for these analyses are located in the code/ sub-folder of this repository with names starting with “server”. Results from these analyses are used in subsequent html / Rmd files to generate figures and tables for the manuscript.

  1. Filter and aggregate: Remove outliers and prepare raw data for model training
  2. Summary figures: Generate overview figures and tables
  3. code/server_within_trial_predictions_PLSR_RF_SVM.R: Command line script that performs within-trial predictions with plot mean scans
  4. code/server_within_trial_predictions_RF_var_importance.R: Command line script that calculates within-trial random forest variable importance
  5. code/ Command line shell script that calls code/server_subsampling_generalized.R to subsample sets of scans within each plot and then performs within-trial predictions on those sets with code/server_subsample_plsr.R. Uses functions from code/subsampling_functions.R.
  6. code/server_CV.R: Command line R script that performs predictions according to four cross-validation schemes relevant to plant breeding
  7. Predictions: Generate figures from output of within-trial and cross-validation scheme prediction scripts
  8. Subsampling: Generate figures from output of subsampling scripts
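
The server-side R scripts in steps 3, 4, and 6 can be launched in the listed order with a small wrapper such as the sketch below. It assumes the scripts take no command-line arguments and are run from the repository root, neither of which is confirmed here. Step 5 is driven by a shell script whose name is not given above, so it is left out. `run_in_order` is a hypothetical helper, and it defaults to a dry run that only prints what would be executed.

```r
# Server-side R scripts named in the overview, in the listed order.
# Step 5 (the subsampling shell script) is omitted because its file name
# is not given in the overview.
server_scripts <- c(
  "code/server_within_trial_predictions_PLSR_RF_SVM.R",
  "code/server_within_trial_predictions_RF_var_importance.R",
  "code/server_CV.R"
)

# Run each script with Rscript and stop at the first failure.
# Assumption: the scripts need no arguments and are called from the
# repository root.
run_in_order <- function(scripts, dry_run = TRUE) {
  for (s in scripts) {
    if (dry_run) {
      message("Would run: Rscript ", s)
      next
    }
    status <- system2("Rscript", shQuote(s))
    if (status != 0) stop("Script failed: ", s)
  }
  invisible(scripts)
}

run_in_order(server_scripts)  # dry run by default
```

Calling `run_in_order(server_scripts, dry_run = FALSE)` executes the scripts; on a memory-constrained machine it is safer to keep the dry run and launch each script manually on the server.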