R/filter_spectra.R
filter_spectra.Rd
Determine Mahalanobis distances of observations (rows) within a
given data.frame
with spectral data. Option to filter out
observations based on these distances.
filter_spectra(df, filter, return.distances, num.col.before.spectra,
window.size, verbose)
data.frame
object containing columns of spectra and rows of
observations. Spectral columns must be labeled with an "X" and then the
wavelength (example: "X740" = 740nm). Left-most column must be unique ID.
May also contain columns of metadata between the unique ID and spectral
columns. Cannot contain any missing values. Metadata column names may not
start with "X".
boolean that determines whether or not the input
data.frame
will be filtered. If TRUE
, df
will be
filtered according to squared Mahalanobis distance with a 95% cutoff from
a chi-square distribution with degrees of freedom = number of spectral
columns. If FALSE
, a column of squared Mahalanobis distances
h.distance
will be added to the right side of df and all rows will
be returned. Default is TRUE
.
boolean that determines whether a column of squared
Mahalanobis distances will be included in output data.frame
. If
TRUE
, a column of Mahalanobis distances for each row will be added
to the right side of df
. Default is FALSE
.
number of columns to the left of the spectral
matrix in df
. Default is 4.
number defining the size of window to use when calculating the covariance of the spectra (required to calculate Mahalanobis distance). Default is 10.
If TRUE
, the number of rows removed through filtering
will be printed to the console. Default is TRUE
.
If filter
is TRUE
, returns filtered data frame
df
and reports the number of rows removed. The Mahalanobis distance
with a cutoff of 95% of chi-square distribution (degrees of freedom =
number of wavelengths) is used as filtering criteria. If filter
is
FALSE
, returns full input df with column h.distances
containing the Mahalanobis distance for each row.
This function uses a chi-square distribution with 95% cutoff where
degrees of freedom = number of wavelengths (columns) in the input
data.frame
.
Johnson, R.A., and D.W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th Edition). pg 189
library(magrittr)
filtered.test <- ikeogu.2017 %>%
dplyr::select(-TCC) %>%
na.omit() %>%
filter_spectra(
df = .,
filter = TRUE,
return.distances = TRUE,
num.col.before.spectra = 5,
window.size = 15
)
#>
#> Removed 0 rows.
filtered.test[1:5, c(1:5, (ncol(filtered.test) - 5):ncol(filtered.test))]
#> study.name sample.id DMC.oven X350 X351 X2496 X2497 X2498
#> 1 C16Mcal C16Mcal_1 39.62109 0.4881079 0.4951843 1.868554 1.866739 1.867465
#> 2 C16Mcal C16Mcal_2 35.52017 0.5727389 0.5682541 1.891462 1.893840 1.901451
#> 3 C16Mcal C16Mcal_3 42.04462 0.5989934 0.6266454 1.832336 1.834644 1.828793
#> 4 C16Mcal C16Mcal_4 39.00999 0.5169374 0.5164186 1.839129 1.837023 1.836635
#> 5 C16Mcal C16Mcal_5 33.44273 0.5189608 0.5477946 1.899751 1.900873 1.897076
#> X2499 X2500 h.distances
#> 1 1.870405 1.870702 139.8898
#> 2 1.891114 1.888507 149.5721
#> 3 1.826562 1.832022 140.5149
#> 4 1.835856 1.834857 148.3881
#> 5 1.899430 1.896130 152.2514