This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and \(R^2\) as performance measure are implemented.
The pengls package is available from BioConductor, and can be installed as follows:
Once installed, it can be loaded and version info printed.
suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.0.0
We first create a toy dataset with spatial coordinates.
library(nlme)
n <- 75 #Sample size
p <- 100 #Number of features
g <- 10 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
df <- data.frame("a" = a, "b" = b, GridSample)
The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.
# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5))
Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter \(\lambda=0.2\).
#Fit the pengls model, for simplicity for a simple lambda
penglsFit <- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE),
glsSt = corStruct, lambda = 0.2, verbose = TRUE)
## Starting iterations...
## Iteration 1
## Iteration 2
## Iteration 3
Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.
## pengls model with correlation structure: corGaus
## and 29 non-zero coefficients
The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:
dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)
The fitting command is similar, this time the \(\lambda\) parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose \(\alpha=0.5\) this time, fitting an elastic net model.
penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1
## Iteration 2
## Iteration 3
Show the output
## pengls model with correlation structure: corAR1
## and 42 non-zero coefficients
The pengls package also provides cross-validation for finding the optimal \(\lambda\) value. If the tuning parameter \(\lambda\) is not supplied, the optimal \(\lambda\) according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :
The function is called similarly to cv.glmnet:
penglsFitCV <- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE),
glsSt <- corStruct, nfolds = nfolds)
Check the result:
## Cross-validated pengls model with correlation structure: corGaus
## and 37 non-zero coefficients.
## 10 fold cross-validation yielded an estimated R2 of 0.7682299 .
By default, the 1 standard error is used to determine the optimal value of \(\lambda\) :
## [1] 0.0268676
## [1] 0.7682299
Extract coefficients and fold IDs.
## [1] 0.01530464 0.00000000 0.00000000 0.00000000 0.96883687 0.00000000
## 38 74 41 49 93 18 26 17 63 25 92 61 82 90 83 31 76 89 1 47
## 6 9 9 1 8 6 1 6 9 9 8 9 9 3 8 8 3 4 5 7
## 20 91 11 84 65 88 44 24 40 100 56 50 36 78 12 66 58 30 27 71
## 6 2 5 8 2 10 7 5 6 3 7 7 6 1 5 1 7 4 7 9
## 42 5 16 85 96 64 59 79 54 57 21 67 29 3 37 6 60 28 13 46
## 8 6 7 10 10 2 4 3 5 3 5 1 1 5 1 6 3 7 5 4
## 10 45 7 32 70 35 15 95 87 97 2 81 53 69 19
## 6 1 6 5 3 9 6 10 10 3 5 2 9 1 6
By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:
set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
To perform random cross-validation
penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE),
glsSt <- corStructTime, nfolds = nfolds, cvType = "random")
To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:
penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE),
glsSt <- corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance
## [1] 0.9828069
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.28.0 nlme_3.1-153 pengls_1.0.0
##
## loaded via a namespace (and not attached):
## [1] knitr_1.36 magrittr_2.0.1 splines_4.1.1 lattice_0.20-45
## [5] R6_2.5.1 rlang_0.4.12 fastmap_1.1.0 foreach_1.5.1
## [9] highr_0.9 stringr_1.4.0 tools_4.1.1 parallel_4.1.1
## [13] grid_4.1.1 glmnet_4.1-2 xfun_0.27 jquerylib_0.1.4
## [17] htmltools_0.5.2 iterators_1.0.13 yaml_2.2.1 survival_3.2-13
## [21] digest_0.6.28 Matrix_1.3-4 sass_0.4.0 codetools_0.2-18
## [25] shape_1.4.6 evaluate_0.14 rmarkdown_2.11 stringi_1.7.5
## [29] compiler_4.1.1 bslib_0.3.1 jsonlite_1.7.2