R/format_cv.R
format_cv.Rd
Standalone function that is also used within
train_spectra
to divide trials or studies into training and
test sets based on overlap in trial environments and genotype entries
format_cv(
trial1,
trial2,
trial3 = NULL,
cv.scheme,
stratified.sampling = TRUE,
proportion.train = 0.7,
seed = NULL,
remove.genotype = FALSE
)
data.frame
object that is for use only when
cv.scheme
is provided. Contains the trial to be tested in subsequent
model training functions. The first column contains unique identifiers,
second contains genotypes, third contains reference values, followed by
spectral columns. Include no other columns to right of spectra! Column
names of spectra must start with "X", reference column must be named
"reference", and genotype column must be named "genotype".
data.frame
object that is for use only when
cv.scheme
is provided. This data.frame contains a trial that has
overlapping genotypes with trial1
but that were grown in a different
site/year (different environment). Formatting must be consistent with
trial1
.
data.frame
object that is for use only when
cv.scheme
is provided. This data.frame contains a trial that may or
may not contain genotypes that overlap with trial1
. Formatting must
be consistent with trial1
.
A cross validation (CV) scheme from Jarquín et al., 2017.
Options for cv.scheme
include:
"CV1": untested lines in tested environments
"CV2": tested lines in tested environments
"CV0": tested lines in untested environments
"CV00": untested lines in untested environments
If TRUE
, training and test
sets will be selected using stratified random sampling. Default is
TRUE
.
Fraction of samples to include in the training set. Default is 0.7.
Number used in the function set.seed()
for reproducible
randomization. If NULL
, no seed is set. Default is NULL
.
boolean that, if TRUE
, removes the "genotype"
column is removed from the output data.frame
. Default is
FALSE
.
List of data.frames ($train.set, $test.set) compiled according to user-provided cross validation scheme.
Use of a cross-validation scheme requires a column in the input
data.frame
named "genotype" to ensure proper sorting of training and
test sets. Variables trial1
and trial2
are required, while
trial 3
is optional.
Jarquín, D., C. Lemes da Silva, R. C. Gaynor, J. Poland, A. Fritz, R. Howard, S. Battenfield, and J. Crossa. 2017. Increasing genomic-enabled prediction accuracy by modeling genotype × environment interactions in Kansas wheat. Plant Genome 10(2):1-15. <doi:10.3835/plantgenome2016.12.0130>
# Must have a column called "genotype", so we'll create a fake one for now
# We will use CV00, which does not require any overlap in genotypes
# In real scenarios, CV schemes that rely on genotypes should not be applied
# when genotypes are unknown, as in this case.
library(magrittr)
trials <- ikeogu.2017 %>%
dplyr::mutate(genotype = 1:nrow(ikeogu.2017)) %>% # fake for this example
dplyr::rename(reference = DMC.oven) %>%
dplyr::select(
study.name, sample.id, genotype, reference,
tidyselect::starts_with("X")
)
trial1 <- trials %>%
dplyr::filter(study.name == "C16Mcal") %>%
dplyr::select(-study.name)
trial2 <- trials %>%
dplyr::filter(study.name == "C16Mval") %>%
dplyr::select(-study.name)
cv.list <- format_cv(
trial1 = trial1, trial2 = trial2, cv.scheme = "CV00",
stratified.sampling = FALSE, remove.genotype = TRUE
)
cv.list$train.set[1:5, 1:5]
#> # A tibble: 5 × 5
#> sample.id reference X350 X351 X352
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 C16Mval_1 44.1 0.497 0.474 0.489
#> 2 C16Mval_2 43.7 0.546 0.556 0.557
#> 3 C16Mval_3 43.4 0.513 0.521 0.529
#> 4 C16Mval_4 43.1 0.565 0.570 0.583
#> 5 C16Mval_5 41.8 0.497 0.502 0.510
cv.list$test.set[1:5, 1:5]
#> # A tibble: 5 × 5
#> sample.id reference X350 X351 X352
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 C16Mcal_1 39.6 0.488 0.495 0.506
#> 2 C16Mcal_3 42.0 0.599 0.627 0.624
#> 3 C16Mcal_4 39.0 0.517 0.516 0.514
#> 4 C16Mcal_5 33.4 0.519 0.548 0.554
#> 5 C16Mcal_16 36.4 0.517 0.514 0.530