TileDBArray 1.4.0
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray
backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
TileDBArray
Creating a TileDBArray
is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.83261511 -1.00342817 -0.94001167 . -1.83622791 0.70146513
## [2,] 1.07392143 1.86069942 -0.02992467 . 0.95219348 -2.44056496
## [3,] -0.61544844 -0.28697673 0.21149619 . 0.31475884 0.01818955
## [4,] 0.54524703 -0.36441386 1.04791156 . -0.80146561 0.78412271
## [5,] 1.06011757 -0.22867717 0.46661335 . 1.83630410 0.67930119
## ... . . . . . .
## [96,] 1.197771060 -0.245390199 0.422046615 . 0.8300042 0.1278515
## [97,] -0.607855376 0.976175564 0.522377669 . 0.8446801 -0.8081294
## [98,] 1.406040871 1.212108604 1.890741981 . -1.4388907 1.6705455
## [99,] 0.700357985 1.004665556 -0.062667146 . 0.6439672 0.2651410
## [100,] -0.008339175 0.087676294 0.407465583 . 1.8500237 -1.4865219
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.83261511 -1.00342817 -0.94001167 . -1.83622791 0.70146513
## [2,] 1.07392143 1.86069942 -0.02992467 . 0.95219348 -2.44056496
## [3,] -0.61544844 -0.28697673 0.21149619 . 0.31475884 0.01818955
## [4,] 0.54524703 -0.36441386 1.04791156 . -0.80146561 0.78412271
## [5,] 1.06011757 -0.22867717 0.46661335 . 1.83630410 0.67930119
## ... . . . . . .
## [96,] 1.197771060 -0.245390199 0.422046615 . 0.8300042 0.1278515
## [97,] -0.607855376 0.976175564 0.522377669 . 0.8446801 -0.8081294
## [98,] 1.406040871 1.212108604 1.890741981 . -1.4388907 1.6705455
## [99,] 0.700357985 1.004665556 -0.062667146 . 0.6439672 0.2651410
## [100,] -0.008339175 0.087676294 0.407465583 . 1.8500237 -1.4865219
This process works also for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 0
## [4,] 0 0 0 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [996,] 0 0 0 . 0 0
## [997,] 0 0 0 . 0 0
## [998,] 0 0 0 . 0 0
## [999,] 0 0 0 . 0 0
## [1000,] 0 0 0 . 0 0
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . FALSE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] FALSE FALSE FALSE . FALSE FALSE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 0.83261511 -1.00342817 -0.94001167 . -1.83622791 0.70146513
## GENE_2 1.07392143 1.86069942 -0.02992467 . 0.95219348 -2.44056496
## GENE_3 -0.61544844 -0.28697673 0.21149619 . 0.31475884 0.01818955
## GENE_4 0.54524703 -0.36441386 1.04791156 . -0.80146561 0.78412271
## GENE_5 1.06011757 -0.22867717 0.46661335 . 1.83630410 0.67930119
## ... . . . . . .
## GENE_96 1.197771060 -0.245390199 0.422046615 . 0.8300042 0.1278515
## GENE_97 -0.607855376 0.976175564 0.522377669 . 0.8446801 -0.8081294
## GENE_98 1.406040871 1.212108604 1.890741981 . -1.4388907 1.6705455
## GENE_99 0.700357985 1.004665556 -0.062667146 . 0.6439672 0.2651410
## GENE_100 -0.008339175 0.087676294 0.407465583 . 1.8500237 -1.4865219
TileDBArray
sTileDBArray
s are simply DelayedArray
objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## GENE_1 GENE_2 GENE_3 GENE_4 GENE_5 GENE_6
## 0.8326151 1.0739214 -0.6154484 0.5452470 1.0601176 0.4075643
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix
object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 0.83261511 -1.00342817 -0.94001167 -1.51812602 -0.91440586
## GENE_2 1.07392143 1.86069942 -0.02992467 1.04611050 0.04954743
## GENE_3 -0.61544844 -0.28697673 0.21149619 1.69035158 -0.51138204
## GENE_4 0.54524703 -0.36441386 1.04791156 -1.07102621 -0.35983500
## GENE_5 1.06011757 -0.22867717 0.46661335 0.69073243 0.13377653
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 1.66523022 -2.00685635 -1.88002334 . -3.67245581 1.40293026
## GENE_2 2.14784287 3.72139883 -0.05984933 . 1.90438696 -4.88112991
## GENE_3 -1.23089688 -0.57395346 0.42299237 . 0.62951767 0.03637911
## GENE_4 1.09049406 -0.72882773 2.09582311 . -1.60293123 1.56824541
## GENE_5 2.12023514 -0.45735433 0.93322670 . 3.67260821 1.35860237
## ... . . . . . .
## GENE_96 2.39554212 -0.49078040 0.84409323 . 1.6600083 0.2557030
## GENE_97 -1.21571075 1.95235113 1.04475534 . 1.6893601 -1.6162588
## GENE_98 2.81208174 2.42421721 3.78148396 . -2.8777813 3.3410910
## GENE_99 1.40071597 2.00933111 -0.12533429 . 1.2879344 0.5302821
## GENE_100 -0.01667835 0.17535259 0.81493117 . 3.7000474 -2.9730437
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6 SAMP_7
## -4.544744 -13.273259 6.840967 14.439052 -9.161462 -1.080214 3.538718
## SAMP_8 SAMP_9 SAMP_10
## 5.621184 8.687487 -8.279503
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 -3.4214147
## GENE_2 1.6686974
## GENE_3 0.2200603
## GENE_4 0.5250646
## GENE_5 3.3152727
## ... .
## GENE_96 -0.0473812
## GENE_97 1.4865229
## GENE_98 2.1544115
## GENE_99 3.1063345
## GENE_100 0.6349174
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray()
.
For example, the example below allows us to control the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.2804715 0.4075265 2.0777639 . -0.64076770 -0.27726149
## [2,] 0.1387126 1.6095495 -0.7829232 . -0.68911505 1.27515735
## [3,] 0.3621432 -1.0874491 -0.1261416 . 0.06978909 0.11965051
## [4,] -0.8058191 1.3357310 -1.0584065 . 0.21967996 -0.06772459
## [5,] 1.3403321 -0.5837746 0.3348681 . 1.57612840 -0.30323554
## ... . . . . . .
## [96,] -0.63083875 0.49984502 -1.69881001 . 0.26362404 -1.65421555
## [97,] 1.15434481 0.83036381 0.37897064 . -0.06797047 -0.78945270
## [98,] -1.58400409 0.25061301 -0.91197292 . -1.05748402 0.19295069
## [99,] -0.57070364 2.60279073 0.03152329 . -0.66345660 0.08810165
## [100,] 0.15439597 -2.43557143 -0.30193418 . 0.79697415 -1.11795946
As these arguments cannot be passed during coercion, we instead provide global variables that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.2804715 0.4075265 2.0777639 . -0.64076770 -0.27726149
## [2,] 0.1387126 1.6095495 -0.7829232 . -0.68911505 1.27515735
## [3,] 0.3621432 -1.0874491 -0.1261416 . 0.06978909 0.11965051
## [4,] -0.8058191 1.3357310 -1.0584065 . 0.21967996 -0.06772459
## [5,] 1.3403321 -0.5837746 0.3348681 . 1.57612840 -0.30323554
## ... . . . . . .
## [96,] -0.63083875 0.49984502 -1.69881001 . 0.26362404 -1.65421555
## [97,] 1.15434481 0.83036381 0.37897064 . -0.06797047 -0.78945270
## [98,] -1.58400409 0.25061301 -0.91197292 . -1.05748402 0.19295069
## [99,] -0.57070364 2.60279073 0.03152329 . -0.66345660 0.08810165
## [100,] 0.15439597 -2.43557143 -0.30193418 . 0.79697415 -1.11795946
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] TileDBArray_1.4.0 DelayedArray_0.20.0 IRanges_2.28.0
## [4] S4Vectors_0.32.0 MatrixGenerics_1.6.0 matrixStats_0.61.0
## [7] BiocGenerics_0.40.0 Matrix_1.3-4 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 bslib_0.3.1 compiler_4.1.1
## [4] BiocManager_1.30.16 jquerylib_0.1.4 tools_4.1.1
## [7] digest_0.6.28 bit_4.0.4 jsonlite_1.7.2
## [10] evaluate_0.14 lattice_0.20-45 nanotime_0.3.3
## [13] rlang_0.4.12 RcppCCTZ_0.2.9 yaml_2.2.1
## [16] xfun_0.27 fastmap_1.1.0 stringr_1.4.0
## [19] knitr_1.36 sass_0.4.0 bit64_4.0.5
## [22] grid_4.1.1 R6_2.5.1 rmarkdown_2.11
## [25] bookdown_0.24 tiledb_0.9.7 magrittr_2.0.1
## [28] htmltools_0.5.2 stringi_1.7.5 zoo_1.8-9