Wrapper that trains models based on spectral data to predict reference values and reports model performance statistics

test_spectra(
  train.data,
  num.iterations,
  test.data = NULL,
  pretreatment = 1,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  seed = 1,
  verbose = TRUE,
  wavelengths = deprecated(),
  preprocessing = deprecated(),
  output.summary = deprecated(),
  rf.variable.importance = deprecated()
)

Arguments

train.data

data.frame object of spectral data for input into a spectral prediction model. The first column contains unique identifiers and the second contains reference values, followed by the spectral columns. Include no other columns to the right of the spectra! Column names of spectra must start with "X" and the reference column must be named "reference".
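As an illustration of this layout, the sketch below builds a small made-up data.frame in base R. The wavelengths, values, and the `check_layout()` helper are hypothetical and not part of the package; the helper only mirrors the rules stated above.

```r
# Toy data with the column layout described above (all values are made up).
set.seed(42)
n.samples <- 6
wavelengths <- 350:354
spectra <- matrix(runif(n.samples * length(wavelengths)), nrow = n.samples)
colnames(spectra) <- paste0("X", wavelengths)

toy.train <- data.frame(
  unique.id = paste0("sample_", seq_len(n.samples)),
  reference = rnorm(n.samples, mean = 35, sd = 4),
  spectra
)

# Hypothetical helper (not a package function): TRUE if the first column holds
# unique identifiers, the second column is named "reference", and every
# remaining column name starts with "X".
check_layout <- function(df) {
  spectral.cols <- names(df)[-(1:2)]
  names(df)[2] == "reference" &&
    !anyDuplicated(df[[1]]) &&
    all(startsWith(spectral.cols, "X"))
}

check_layout(toy.train)
```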

num.iterations

Number of training iterations to perform

test.data

data.frame with the same specifications as train.data. Use if a specific test set is desired for hyperparameter tuning. If NULL, the function will automatically train with a stratified sample of 70% of train.data. Default is NULL.

pretreatment

Number or list of numbers 1:13 corresponding to desired pretreatment method(s):

  1. Raw data (default)

  2. Standard normal variate (SNV)

  3. SNV and first derivative

  4. SNV and second derivative

  5. First derivative

  6. Second derivative

  7. Savitzky–Golay filter (SG)

  8. SNV and SG

  9. Gap-segment derivative (window size = 11)

  10. SG and first derivative (window size = 5)

  11. SG and first derivative (window size = 11)

  12. SG and second derivative (window size = 5)

  13. SG and second derivative (window size = 11)
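To make one of these options concrete, standard normal variate (option 2) centers and scales each spectrum by its own mean and standard deviation. The base R sketch below shows the idea on made-up data; `snv()` is a hypothetical name, and the package's own implementation (reached through pretreat_spectra) may differ in detail.

```r
# Hypothetical SNV sketch: each row is one spectrum, each column one wavelength.
snv <- function(spectra) {
  # apply() over rows returns spectra as columns, so transpose back.
  t(apply(spectra, 1, function(x) (x - mean(x)) / sd(x)))
}

# Made-up spectra: 4 samples measured at 6 wavelengths.
set.seed(1)
raw <- matrix(runif(24, min = 0.2, max = 0.9), nrow = 4,
              dimnames = list(NULL, paste0("X", 350:355)))

treated <- snv(raw)

# After SNV every spectrum has mean 0 and standard deviation 1.
round(rowMeans(treated), 10)
round(apply(treated, 1, sd), 10)
```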

k.folds

Number indicating the number of folds for k-fold cross-validation during model training. Default is 5.

proportion.train

Fraction of samples to include in the training set. Default is 0.7.

tune.length

Number delineating the search space for tuning of the PLSR hyperparameter ncomp. Must be set to 5 when using the random forest algorithm (model.method = "rf"). Default is 50.

model.method

Model type to use for training. Valid options include:

  • "pls": Partial least squares regression (Default)

  • "rf": Random forest

  • "svmLinear": Support vector machine with linear kernel

  • "svmRadial": Support vector machine with radial kernel

best.model.metric

Metric used to decide which model is best. Must be either "RMSE" or "Rsquared". Default is "RMSE".

stratified.sampling

If TRUE, training and test sets will be selected using stratified random sampling. This argument is only used if test.data is NULL. Default is TRUE.
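One common way to stratify on a continuous reference value is to bin it by quantiles and sample the same proportion within each bin. The sketch below does this in base R. It assumes quantile-based strata, which this page does not confirm for the package, and `stratified_split()`, `n.bins`, and the toy values are hypothetical.

```r
# Hypothetical stratified split: sample proportion.train of the rows within
# each quantile bin of the reference values.
stratified_split <- function(reference, proportion.train = 0.7,
                             n.bins = 4, seed = 1) {
  set.seed(seed)
  breaks <- unique(quantile(reference, probs = seq(0, 1, length.out = n.bins + 1)))
  bins <- cut(reference, breaks = breaks, include.lowest = TRUE)
  train.idx <- unlist(lapply(split(seq_along(reference), bins), function(idx) {
    idx[sample.int(length(idx), size = round(proportion.train * length(idx)))]
  }))
  sort(unname(train.idx))
}

# Made-up reference values for 20 samples.
reference <- c(27.3, 28.9, 30.8, 31.1, 33.4, 34.3, 34.6, 35.9, 36.4, 36.5,
               38.4, 38.5, 38.7, 39.4, 41.8, 43.7, 29.2, 32.4, 37.1, 42.0)

train.idx <- stratified_split(reference)
test.idx <- setdiff(seq_along(reference), train.idx)
```

Because the draw happens within each bin, both sets span the full range of reference values, which a plain random split does not guarantee for small data sets.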

cv.scheme

A cross-validation (CV) scheme from Jarquín et al., 2017. Options for cv.scheme include:

  • "CV1": untested lines in tested environments

  • "CV2": tested lines in tested environments

  • "CV0": tested lines in untested environments

  • "CV00": untested lines in untested environments

trial1

data.frame object for use only when cv.scheme is provided. Contains the trial to be tested in subsequent model training functions. The first column contains unique identifiers, the second contains genotypes, and the third contains reference values, followed by the spectral columns. Include no other columns to the right of the spectra! Column names of spectra must start with "X", the reference column must be named "reference", and the genotype column must be named "genotype".

trial2

data.frame object for use only when cv.scheme is provided. This data.frame contains a trial whose genotypes overlap with those of trial1 but were grown in a different site/year (different environment). Formatting must be consistent with trial1.

trial3

data.frame object that is for use only when cv.scheme is provided. This data.frame contains a trial that may or may not contain genotypes that overlap with trial1. Formatting must be consistent with trial1.

split.test

Boolean that allows for a fixed training set and a split test set. For example, train a model on data from two breeding programs plus a stratified subset (70%) of a third, and test on the remaining samples (30%) of the third. If FALSE, the entire provided test set test.data will remain as a testing set or, if none is provided, 30% of the provided train.data will be used for testing. Default is FALSE.

seed

Integer to be used internally as input for set.seed(). Only used if stratified.sampling = TRUE. In all other cases, seed is set to the current iteration number. Default is 1.

verbose

If TRUE, the number of rows removed through filtering will be printed to the console. Default is TRUE.

wavelengths

DEPRECATED wavelengths is no longer supported; this information is now inferred from the train.data column names.

preprocessing

DEPRECATED please use pretreatment to specify the specific pretreatment(s) to test. For behavior identical to that of preprocessing = TRUE, set pretreatment = 1:13.

output.summary

DEPRECATED output.summary = FALSE is no longer supported; a summary of output is always returned alongside the full performance statistics.

rf.variable.importance

DEPRECATED rf.variable.importance = FALSE is no longer supported; variable importance results are always returned if the model.method is set to `pls` or `rf`.

Value

list of 5 objects:

  1. `model.list` is a list of trained model objects, one for each pretreatment method specified by the pretreatment argument. Each model is trained with all rows of train.data.

  2. `summary.model.performance` is a data.frame containing summary statistics across all model training iterations and pretreatments. See below for a description of the summary statistics provided.

  3. `model.performance` is a data.frame containing performance statistics for each iteration of model training separately (see below).

  4. `predictions` is a data.frame containing both reference and predicted values for each test set entry in each iteration of model training.

  5. `importance` is a data.frame containing variable importance results for each wavelength at each iteration of model training. If model.method is not "pls" or "rf", this list item is NULL.

`summary.model.performance` and `model.performance` data.frames

Summary statistics include:

  • Tuned parameters depending on the model algorithm:

    • best.ncomp, the best number of components

    • best.ntree, the best number of trees in an RF model

    • best.mtry, the best number of variables to include at every decision point in an RF model

  • RMSEcv, the root mean squared error of cross-validation

  • R2cv, the coefficient of multiple determination of cross-validation for PLSR models

  • RMSEp, the root mean squared error of prediction

  • R2p, the squared Pearson’s correlation between predicted and observed test set values

  • RPD, the ratio of standard deviation of observed test set values to RMSEP

  • RPIQ, the ratio of performance to interquartile difference

  • CCC, the concordance correlation coefficient

  • Bias, the average difference between the predicted and observed values

  • SEP, the standard error of prediction

  • R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
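Several of the test set statistics follow directly from the definitions above. The sketch below computes a subset of them in base R. `prediction_stats()` is a hypothetical helper, not a package function, and the package's exact formulas may differ (for example in the denominator used for standard deviations), so treat the results as illustrative. The input values are the first six rows of the `predictions` table in the Examples output, rounded to two decimals.

```r
# Hypothetical helper computing a subset of the listed statistics.
prediction_stats <- function(observed, predicted) {
  residuals <- predicted - observed
  rmsep <- sqrt(mean(residuals^2))
  c(
    RMSEP = rmsep,                                   # root mean squared error of prediction
    R2p   = cor(predicted, observed)^2,              # squared Pearson correlation
    RPD   = sd(observed) / rmsep,                    # SD of observed values over RMSEP
    RPIQ  = IQR(observed) / rmsep,                   # interquartile range over RMSEP
    Bias  = mean(residuals),                         # mean of predicted minus observed
    R2sp  = cor(predicted, observed, method = "spearman")^2  # squared Spearman correlation
  )
}

observed  <- c(42.04, 35.23, 42.24, 36.38, 36.63, 37.61)
predicted <- c(39.73, 37.06, 41.24, 38.47, 38.39, 37.59)

round(prediction_stats(observed, predicted), 3)
```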

Details

Calls pretreat_spectra, format_cv, and train_spectra functions.

Author

Jenna Hershberger jmh579@cornell.edu

Examples

# \donttest{
library(magrittr)
ikeogu.2017 %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  test_spectra(
    train.data = .,
    tune.length = 3,
    num.iterations = 3,
    pretreatment = 1
  )
#> Pretreatment initiated.
#> Training models...
#> Working on Raw_data 
#> Returning model...
#> $model
#> Partial least squares regression, fitted with the kernel algorithm.
#> Call:
#> plsr(formula = reference ~ spectra, ncomp = tune.length, data = df.plsr)
#> 
#> $summary.model.performance
#>   SummaryType ModelType     RMSEp        R2p      RPD      RPIQ        CCC
#> 1        mean       pls 2.0440540 0.77824381 2.085018 2.6038009 0.85155095
#> 2          sd       pls 0.2661027 0.02093217 0.183391 0.3062501 0.04696515
#> 3        mode       pls 1.8971232 0.78270780 2.139263 2.7700624 0.86628031
#>          Bias       SEP     RMSEcv      R2cv       R2sp best.ncomp best.ntree
#> 1 -0.05516781 2.0652364 1.95463427 0.7755627 0.77670789          3         NA
#> 2  0.09468830 0.2688604 0.08412512 0.0157923 0.01211986          0         NA
#> 3 -0.13588174 1.9167831 1.96423744 0.7807426 0.76866372          3         NA
#>   best.mtry
#> 1        NA
#> 2        NA
#> 3        NA
#> 
#> $model.performance
#>   Iteration ModelType    RMSEp       R2p      RPD     RPIQ       CCC
#> 1         1       pls 1.897123 0.7827078 2.139263 2.770062 0.8662803
#> 2         2       pls 1.883812 0.7965839 2.235167 2.790961 0.8893859
#> 3         3       pls 2.351227 0.7554397 1.880623 2.250380 0.7989866
#>          Bias      SEP   RMSEcv      R2cv      R2sp best.ncomp best.ntree
#> 1 -0.13588174 1.916783 1.964237 0.7807426 0.7686637          3         NA
#> 2  0.04906262 1.903334 2.033546 0.7578310 0.7708123          3         NA
#> 3 -0.07868431 2.375593 1.866120 0.7881145 0.7906476          3         NA
#>   best.mtry
#> 1        NA
#> 2        NA
#> 3        NA
#> 
#> $predictions
#>     Iteration ModelType   unique.id  reference        predicted
#> 1           1       pls   C16Mcal_3 42.0446205 39.7341295535482
#> 2           1       pls  C16Mcal_11 35.2299995 37.0584291524637
#> 3           1       pls  C16Mcal_14 42.2379684 41.2378709425078
#> 4           1       pls  C16Mcal_16 36.3796349 38.4719648775467
#> 5           1       pls  C16Mcal_17 36.6281929 38.3893405413672
#> 6           1       pls  C16Mcal_21 37.6122742 37.5876338455539
#> 7           1       pls  C16Mcal_23 37.1399994 36.1622873913981
#> 8           1       pls  C16Mcal_24 42.1911201 40.0020104063215
#> 9           1       pls  C16Mcal_28 29.2099991  31.808857469154
#> 10          1       pls  C16Mcal_34 42.5299988 38.7208091511507
#> 11          1       pls  C16Mcal_36 36.4031143  36.246812495919
#> 12          1       pls  C16Mcal_37 36.7437744 37.0942122518994
#> 13          1       pls  C16Mcal_38 33.7883987 35.8131640124789
#> 14          1       pls  C16Mcal_40 36.1300011 36.9447531214695
#> 15          1       pls  C16Mcal_46 32.3829803 33.3493674228586
#> 16          1       pls  C16Mcal_50 37.7099991 34.0582110463825
#> 17          1       pls  C16Mcal_52   39.11203 37.9982998877743
#> 18          1       pls  C16Mcal_53 41.4599991 39.5928476620954
#> 19          1       pls  C16Mcal_64 27.2600002  28.690815010551
#> 20          1       pls  C16Mcal_68 33.1699982 34.0802252970327
#> 21          1       pls  C16Mcal_70 39.5100403 37.6869323999689
#> 22          1       pls  C16Mcal_77 33.8568802 29.7188404153879
#> 23          1       pls  C16Mcal_78 39.3699989 37.6632226605092
#> 24          1       pls  C16Mcal_79 35.9900017 38.7002568474304
#> 25          1       pls  C16Mcal_82 34.8699989 30.4030934524671
#> 26          1       pls  C16Mcal_83 34.1562042 33.3061746818153
#> 27          1       pls  C16Mcal_85 33.1330147 32.4015416610514
#> 28          1       pls  C16Mcal_89 40.6085129 39.9318954639371
#> 29          1       pls  C16Mcal_92 34.4248657 34.7700364144324
#> 30          1       pls C16Mcal_107 38.8699989 37.1385397881858
#> 31          1       pls C16Mcal_109 28.9400005 31.3652388059175
#> 32          1       pls C16Mcal_111 37.6399994 36.8955816318098
#> 33          1       pls C16Mcal_113 33.4700012 35.5350097538434
#> 34          1       pls C16Mcal_116 43.2331314 41.7466397448865
#> 35          1       pls   C16Mval_2 43.7411308 41.3697207058367
#> 36          1       pls   C16Mval_5 41.7549706 42.0425597521298
#> 37          1       pls  C16Mval_10 39.3955345 37.6258211459777
#> 38          1       pls  C16Mval_13 38.6752853 37.2510250165379
#> 39          1       pls  C16Mval_15 38.5099792 39.0789254447938
#> 40          1       pls  C16Mval_17 38.3962402 38.2769583394128
#> 41          1       pls  C16Mval_25 36.5172691 36.4620805976155
#> 42          1       pls  C16Mval_28 36.3800011 36.8373837961633
#> 43          1       pls  C16Mval_32 35.9099998 37.1844223289279
#> 44          1       pls  C16Mval_39 34.5699997  37.112799121684
#> 45          1       pls  C16Mval_40 34.2991219 34.3335500159472
#> 46          1       pls  C16Mval_46 33.4192848 34.1640920673306
#> 47          1       pls  C16Mval_47 31.1025772 34.4603465331052
#> 48          1       pls  C16Mval_49 30.8113632 33.0971258868265
#> 49          1       pls  C16Mval_53 27.3490391  28.008487137467
#> 50          2       pls   C16Mcal_4 39.0099869 36.9721445534137
#> 51          2       pls  C16Mcal_12 41.9791336 40.8186959650909
#> 52          2       pls  C16Mcal_19 39.7091141  38.626203555885
#> 53          2       pls  C16Mcal_22 41.2799988 40.5777161867026
#> 54          2       pls  C16Mcal_23 37.1399994 36.6069430215017
#> 55          2       pls  C16Mcal_24 42.1911201 39.7716650588323
#> 56          2       pls  C16Mcal_25 31.7656345 34.1048344379296
#> 57          2       pls  C16Mcal_29 39.6450653 40.5048095505648
#> 58          2       pls  C16Mcal_33 34.9757233 34.0910898624338
#> 59          2       pls  C16Mcal_41 39.9312401 39.2384874302655
#> 60          2       pls  C16Mcal_42 34.7200012 34.2219028602118
#> 61          2       pls  C16Mcal_45 29.9400005  31.665033656594
#> 62          2       pls  C16Mcal_49 38.2799988 37.9975383380893
#> 63          2       pls  C16Mcal_50 37.7099991 34.5142103098015
#> 64          2       pls  C16Mcal_53 41.4599991 39.2719913629993
#> 65          2       pls  C16Mcal_63 30.8999996  31.735227722246
#> 66          2       pls  C16Mcal_67 36.2628212  37.821088342659
#> 67          2       pls  C16Mcal_69 31.9607944 30.1513898880451
#> 68          2       pls  C16Mcal_71 40.0252266 38.7768290358187
#> 69          2       pls  C16Mcal_73 29.9218559 31.2831883942071
#> 70          2       pls  C16Mcal_74 32.0927048 34.9271074692052
#> 71          2       pls  C16Mcal_80 31.5499992 33.8737717856748
#> 72          2       pls  C16Mcal_81 37.6691475 38.7673884545834
#> 73          2       pls  C16Mcal_84 34.1899986  33.678669967893
#> 74          2       pls  C16Mcal_87 35.0520935  38.026179600089
#> 75          2       pls  C16Mcal_89 40.6085129 39.6314522574414
#> 76          2       pls  C16Mcal_93 34.8271332 35.7453158525005
#> 77          2       pls  C16Mcal_96 36.2366486 39.1518658417376
#> 78          2       pls C16Mcal_103 39.3523369 39.2231309364537
#> 79          2       pls C16Mcal_109 28.9400005 31.6287841708249
#> 80          2       pls C16Mcal_110 23.5921307 19.2719586527373
#> 81          2       pls C16Mcal_115 37.8899994 34.0844299539589
#> 82          2       pls C16Mcal_121 34.3344879 35.4271153607386
#> 83          2       pls   C16Mval_8 39.8222618 37.7519771007149
#> 84          2       pls  C16Mval_10 39.3955345 37.5400073024118
#> 85          2       pls  C16Mval_11  38.898819 38.0622226951498
#> 86          2       pls  C16Mval_16 38.4863548 38.2158385619491
#> 87          2       pls  C16Mval_18 38.1199989 39.7560516216508
#> 88          2       pls  C16Mval_20 37.9215317 37.6329377937141
#> 89          2       pls  C16Mval_22 37.4799995 35.8860319129793
#> 90          2       pls  C16Mval_23 37.3499985 37.8586343778337
#> 91          2       pls  C16Mval_31 36.0321198 35.0370286129604
#> 92          2       pls  C16Mval_32 35.9099998 37.6957989409211
#> 93          2       pls  C16Mval_33 35.4645767  33.942650482815
#> 94          2       pls  C16Mval_38 34.6531563 37.3115954645114
#> 95          2       pls  C16Mval_44 33.7523422 35.3383323430282
#> 96          2       pls  C16Mval_49 30.8113632 33.0505325050558
#> 97          2       pls  C16Mval_51 28.3097172  32.020432547462
#> 98          2       pls  C16Mval_53 27.3490391 28.0155555233523
#> 99          3       pls   C16Mcal_9 38.1199989 37.6136268303616
#> 100         3       pls  C16Mcal_10 31.7993336 33.9960798905344
#> 101         3       pls  C16Mcal_12 41.9791336 40.7223193329217
#> 102         3       pls  C16Mcal_14 42.2379684 40.0322858749734
#> 103         3       pls  C16Mcal_24 42.1911201 38.7410403428496
#> 104         3       pls  C16Mcal_32 37.4700012 37.8085759518241
#> 105         3       pls  C16Mcal_34 42.5299988 38.7895688013236
#> 106         3       pls  C16Mcal_41 39.9312401 39.2703570299875
#> 107         3       pls  C16Mcal_44 43.2962227 38.9882580121138
#> 108         3       pls  C16Mcal_45 29.9400005 33.0227030066139
#> 109         3       pls  C16Mcal_54 35.0544205 34.7842135784515
#> 110         3       pls  C16Mcal_56 38.4799995  37.650172422375
#> 111         3       pls  C16Mcal_59 28.9799995 35.7692039040073
#> 112         3       pls  C16Mcal_60 36.6251221 37.1083182285589
#> 113         3       pls  C16Mcal_64 27.2600002 28.3550462617746
#> 114         3       pls  C16Mcal_65 38.8069687  37.039754641379
#> 115         3       pls  C16Mcal_72 40.1213608 36.9984744249914
#> 116         3       pls  C16Mcal_73 29.9218559 31.5143515084898
#> 117         3       pls  C16Mcal_76  35.254776 34.5368763296041
#> 118         3       pls  C16Mcal_79 35.9900017 39.1099151476543
#> 119         3       pls  C16Mcal_80 31.5499992 33.9930967456893
#> 120         3       pls  C16Mcal_85 33.1330147 33.5299983423354
#> 121         3       pls  C16Mcal_88 34.6599998  36.358740645514
#> 122         3       pls  C16Mcal_92 34.4248657 35.4589905398041
#> 123         3       pls  C16Mcal_93 34.8271332 35.7826209715778
#> 124         3       pls  C16Mcal_97 34.2159386 35.2912693881478
#> 125         3       pls C16Mcal_100 34.8311043 36.1969705259602
#> 126         3       pls C16Mcal_101 41.6001511 39.3333420265112
#> 127         3       pls C16Mcal_104 35.1094627 34.3670903613702
#> 128         3       pls C16Mcal_105 41.1385422 37.4734946714685
#> 129         3       pls C16Mcal_108 39.1699982 38.2254508879645
#> 130         3       pls C16Mcal_109 28.9400005 32.3303651691149
#> 131         3       pls C16Mcal_111 37.6399994 36.6151814290969
#> 132         3       pls C16Mcal_112 40.9370956 38.8778069984782
#> 133         3       pls C16Mcal_115 37.8899994  36.007553828791
#> 134         3       pls C16Mcal_119 31.1599998 29.0166052416485
#> 135         3       pls   C16Mval_4 43.1399994 40.3666684724336
#> 136         3       pls  C16Mval_10 39.3955345 37.1111538375656
#> 137         3       pls  C16Mval_12 38.6911316 38.9907619623887
#> 138         3       pls  C16Mval_20 37.9215317 37.6017574596303
#> 139         3       pls  C16Mval_22 37.4799995 35.7117317305532
#> 140         3       pls  C16Mval_24 36.7746239 36.8383894043584
#> 141         3       pls  C16Mval_28 36.3800011  36.744908213705
#> 142         3       pls  C16Mval_29 36.2344589 36.2652962480481
#> 143         3       pls  C16Mval_34 35.1879196 32.7954860929976
#> 144         3       pls  C16Mval_42 33.8788452 35.6942086569854
#> 145         3       pls  C16Mval_49 30.8113632 33.4093703124484
#> 146         3       pls  C16Mval_52 28.0000305 34.2648127570954
#> 147         3       pls  C16Mval_53 27.3490391 28.1015099612564
#> 
#> $importance
#> # A tibble: 3 × 2,153
#>   Iteration ModelType   X350   X351   X352   X353   X354   X355   X356   X357
#>       <int> <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1         1 pls       0.0258 0.0245 0.0278 0.0268 0.0265 0.0311 0.0343 0.0312
#> 2         2 pls       0.0351 0.0344 0.0329 0.0315 0.0328 0.0365 0.0416 0.0404
#> 3         3 pls       0.0241 0.0251 0.0256 0.0258 0.0253 0.0292 0.0324 0.0310
#> # ℹ 2,143 more variables: X358 <dbl>, X359 <dbl>, X360 <dbl>, X361 <dbl>,
#> #   X362 <dbl>, X363 <dbl>, X364 <dbl>, X365 <dbl>, X366 <dbl>, X367 <dbl>,
#> #   X368 <dbl>, X369 <dbl>, X370 <dbl>, X371 <dbl>, X372 <dbl>, X373 <dbl>,
#> #   X374 <dbl>, X375 <dbl>, X376 <dbl>, X377 <dbl>, X378 <dbl>, X379 <dbl>,
#> #   X380 <dbl>, X381 <dbl>, X382 <dbl>, X383 <dbl>, X384 <dbl>, X385 <dbl>,
#> #   X386 <dbl>, X387 <dbl>, X388 <dbl>, X389 <dbl>, X390 <dbl>, X391 <dbl>,
#> #   X392 <dbl>, X393 <dbl>, X394 <dbl>, X395 <dbl>, X396 <dbl>, X397 <dbl>, …
#> 
# }