TileDBArray 1.8.0
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray
backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
TileDBArray
Creating a TileDBArray
is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -3.19740384 0.02366832 1.43882154 . 0.74027861 0.16222372
## [2,] -0.54855146 0.82797330 0.46894609 . -1.35604757 -1.88506855
## [3,] -1.71184926 1.03913644 1.10329503 . 1.41260597 0.10774741
## [4,] -0.17607240 0.85886833 1.31960227 . -0.50843544 -0.06373862
## [5,] 1.19107241 0.49569529 0.77627958 . -0.55301858 -0.85780744
## ... . . . . . .
## [96,] 0.54650623 0.10385205 -0.10166375 . 0.3948168 1.0171282
## [97,] -0.22359107 0.23612072 -0.83489274 . 3.8008054 -0.5671609
## [98,] 0.25223756 0.22192724 0.67664159 . -1.3718886 0.6820337
## [99,] -0.74772160 -0.47449242 0.63376284 . -1.9433443 -0.4596154
## [100,] -0.31072732 -1.12952122 -0.07455359 . 0.9120615 -0.5355549
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -3.19740384 0.02366832 1.43882154 . 0.74027861 0.16222372
## [2,] -0.54855146 0.82797330 0.46894609 . -1.35604757 -1.88506855
## [3,] -1.71184926 1.03913644 1.10329503 . 1.41260597 0.10774741
## [4,] -0.17607240 0.85886833 1.31960227 . -0.50843544 -0.06373862
## [5,] 1.19107241 0.49569529 0.77627958 . -0.55301858 -0.85780744
## ... . . . . . .
## [96,] 0.54650623 0.10385205 -0.10166375 . 0.3948168 1.0171282
## [97,] -0.22359107 0.23612072 -0.83489274 . 3.8008054 -0.5671609
## [98,] 0.25223756 0.22192724 0.67664159 . -1.3718886 0.6820337
## [99,] -0.74772160 -0.47449242 0.63376284 . -1.9433443 -0.4596154
## [100,] -0.31072732 -1.12952122 -0.07455359 . 0.9120615 -0.5355549
This process works also for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 0
## [4,] 0 0 0 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [996,] 0.00 0.00 0.00 . 0 0
## [997,] 0.00 0.00 0.00 . 0 0
## [998,] 0.45 0.00 0.00 . 0 0
## [999,] 0.00 0.00 0.00 . 0 0
## [1000,] 0.00 0.00 0.00 . 0 0
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . FALSE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] TRUE FALSE FALSE . FALSE FALSE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 -3.19740384 0.02366832 1.43882154 . 0.74027861 0.16222372
## GENE_2 -0.54855146 0.82797330 0.46894609 . -1.35604757 -1.88506855
## GENE_3 -1.71184926 1.03913644 1.10329503 . 1.41260597 0.10774741
## GENE_4 -0.17607240 0.85886833 1.31960227 . -0.50843544 -0.06373862
## GENE_5 1.19107241 0.49569529 0.77627958 . -0.55301858 -0.85780744
## ... . . . . . .
## GENE_96 0.54650623 0.10385205 -0.10166375 . 0.3948168 1.0171282
## GENE_97 -0.22359107 0.23612072 -0.83489274 . 3.8008054 -0.5671609
## GENE_98 0.25223756 0.22192724 0.67664159 . -1.3718886 0.6820337
## GENE_99 -0.74772160 -0.47449242 0.63376284 . -1.9433443 -0.4596154
## GENE_100 -0.31072732 -1.12952122 -0.07455359 . 0.9120615 -0.5355549
TileDBArray
sTileDBArray
s are simply DelayedArray
objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## GENE_1 GENE_2 GENE_3 GENE_4 GENE_5 GENE_6
## -3.1974038 -0.5485515 -1.7118493 -0.1760724 1.1910724 -0.7879552
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix
object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 -3.19740384 0.02366832 1.43882154 0.14682888 -1.00587440
## GENE_2 -0.54855146 0.82797330 0.46894609 -2.38622853 -0.04772880
## GENE_3 -1.71184926 1.03913644 1.10329503 1.07415513 0.03067260
## GENE_4 -0.17607240 0.85886833 1.31960227 1.48146798 -0.02191980
## GENE_5 1.19107241 0.49569529 0.77627958 0.61041258 0.74381318
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 -6.39480767 0.04733664 2.87764308 . 1.4805572 0.3244474
## GENE_2 -1.09710292 1.65594660 0.93789218 . -2.7120951 -3.7701371
## GENE_3 -3.42369852 2.07827289 2.20659006 . 2.8252119 0.2154948
## GENE_4 -0.35214480 1.71773666 2.63920455 . -1.0168709 -0.1274772
## GENE_5 2.38214482 0.99139057 1.55255917 . -1.1060372 -1.7156149
## ... . . . . . .
## GENE_96 1.0930125 0.2077041 -0.2033275 . 0.7896337 2.0342564
## GENE_97 -0.4471821 0.4722414 -1.6697855 . 7.6016107 -1.1343219
## GENE_98 0.5044751 0.4438545 1.3532832 . -2.7437772 1.3640675
## GENE_99 -1.4954432 -0.9489848 1.2675257 . -3.8866885 -0.9192308
## GENE_100 -0.6214546 -2.2590424 -0.1491072 . 1.8241230 -1.0711098
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6
## -11.47185841 3.23025302 6.81328555 3.56418547 -2.06624438 1.69513440
## SAMP_7 SAMP_8 SAMP_9 SAMP_10
## -4.99499561 0.07331444 8.05382192 -2.01524549
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 -0.03783534
## GENE_2 -0.44031880
## GENE_3 0.07813549
## GENE_4 1.53970394
## GENE_5 1.71854825
## ... .
## GENE_96 0.9407611
## GENE_97 0.3292237
## GENE_98 0.4010207
## GENE_99 -0.2996166
## GENE_100 1.3346309
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray()
.
For example, the example below allows us to control the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 2.1503318 -0.3608012 -0.6308427 . -0.4766048 0.1967894
## [2,] 1.7653689 0.6983906 0.3750172 . 0.6250747 0.4924169
## [3,] -0.1763985 0.9444900 -2.4345577 . -1.2473727 -0.4020099
## [4,] -1.9657527 1.2072019 0.3480542 . -1.0896892 -0.1820832
## [5,] 1.5241314 -1.6844846 0.1816733 . -0.4003402 -0.4688211
## ... . . . . . .
## [96,] 0.15053918 -0.28666526 0.18716210 . 0.25664918 0.84351329
## [97,] 0.07221993 -1.91933479 -0.54193589 . -1.81587018 -0.59655751
## [98,] 1.73447852 -0.95471990 -0.10241656 . 0.97917996 -0.03063937
## [99,] 0.50316422 0.67961357 0.28239582 . 1.17155834 -0.66700277
## [100,] -1.38325707 -0.66323155 0.32773349 . -0.90349532 -0.67769656
As these arguments cannot be passed during coercion, we instead provide global variables that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 2.1503318 -0.3608012 -0.6308427 . -0.4766048 0.1967894
## [2,] 1.7653689 0.6983906 0.3750172 . 0.6250747 0.4924169
## [3,] -0.1763985 0.9444900 -2.4345577 . -1.2473727 -0.4020099
## [4,] -1.9657527 1.2072019 0.3480542 . -1.0896892 -0.1820832
## [5,] 1.5241314 -1.6844846 0.1816733 . -0.4003402 -0.4688211
## ... . . . . . .
## [96,] 0.15053918 -0.28666526 0.18716210 . 0.25664918 0.84351329
## [97,] 0.07221993 -1.91933479 -0.54193589 . -1.81587018 -0.59655751
## [98,] 1.73447852 -0.95471990 -0.10241656 . 0.97917996 -0.03063937
## [99,] 0.50316422 0.67961357 0.28239582 . 1.17155834 -0.66700277
## [100,] -1.38325707 -0.66323155 0.32773349 . -0.90349532 -0.67769656
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 20348)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] TileDBArray_1.8.0 DelayedArray_0.24.0 IRanges_2.32.0
## [4] S4Vectors_0.36.0 MatrixGenerics_1.10.0 matrixStats_0.62.0
## [7] BiocGenerics_0.44.0 Matrix_1.5-1 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 bslib_0.4.0 compiler_4.2.1
## [4] BiocManager_1.30.19 jquerylib_0.1.4 tools_4.2.1
## [7] digest_0.6.30 bit_4.0.4 jsonlite_1.8.3
## [10] evaluate_0.17 lattice_0.20-45 nanotime_0.3.7
## [13] rlang_1.0.6 cli_3.4.1 RcppCCTZ_0.2.11
## [16] yaml_2.3.6 xfun_0.34 fastmap_1.1.0
## [19] stringr_1.4.1 knitr_1.40 sass_0.4.2
## [22] bit64_4.0.5 grid_4.2.1 data.table_1.14.4
## [25] R6_2.5.1 rmarkdown_2.17 bookdown_0.29
## [28] tiledb_0.16.0 magrittr_2.0.3 htmltools_0.5.3
## [31] stringi_1.7.8 cachem_1.0.6 zoo_1.8-11