TileDBArray 1.10.0
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
Creating a TileDBArray is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.63549257 0.46533113 -0.64362260 . 0.3182148 -0.3746532
## [2,] -0.68873934 -1.65071111 -1.43557121 . -0.5230861 0.4910293
## [3,] 1.14204030 -0.44399339 0.29166305 . 0.6314542 -0.8057484
## [4,] 1.26460430 -1.36323948 1.72579495 . -0.7376756 0.4030362
## [5,] 0.05788121 0.36403968 0.99787726 . 0.4713108 0.1343959
## ... . . . . . .
## [96,] -0.702913394 -0.390246848 0.883118963 . -0.2340585 -1.2547414
## [97,] -1.036727316 -0.176501976 -0.742875173 . -0.5115610 -0.4422906
## [98,] 0.086703553 -1.983729259 -1.877994787 . -0.3552945 -0.1713032
## [99,] -0.419210661 -0.719555109 1.200409705 . -0.2065206 0.6432031
## [100,] 0.878642020 -0.713506377 0.007804023 . 0.0868919 2.6659455
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.63549257 0.46533113 -0.64362260 . 0.3182148 -0.3746532
## [2,] -0.68873934 -1.65071111 -1.43557121 . -0.5230861 0.4910293
## [3,] 1.14204030 -0.44399339 0.29166305 . 0.6314542 -0.8057484
## [4,] 1.26460430 -1.36323948 1.72579495 . -0.7376756 0.4030362
## [5,] 0.05788121 0.36403968 0.99787726 . 0.4713108 0.1343959
## ... . . . . . .
## [96,] -0.702913394 -0.390246848 0.883118963 . -0.2340585 -1.2547414
## [97,] -1.036727316 -0.176501976 -0.742875173 . -0.5115610 -0.4422906
## [98,] 0.086703553 -1.983729259 -1.877994787 . -0.3552945 -0.1713032
## [99,] -0.419210661 -0.719555109 1.200409705 . -0.2065206 0.6432031
## [100,] 0.878642020 -0.713506377 0.007804023 . 0.0868919 2.6659455
This process also works for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 0
## [4,] 0 0 0 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [996,] 0 0 0 . 0 0
## [997,] 0 0 0 . 0 0
## [998,] 0 0 0 . 0 0
## [999,] 0 0 0 . 0 0
## [1000,] 0 0 0 . 0 0
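To check that the sparse representation carries over to the backend, one option is to query the result directly. This is a minimal sketch: the variable name `sp` is illustrative, and it assumes that `is_sparse()` and `type()` from the DelayedArray framework are available once TileDBArray is attached.

```r
library(TileDBArray)

# Write a sparse matrix to a TileDB backend (a temporary location by default).
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
sp <- writeTileDBArray(Y)

# Ask the DelayedArray framework whether the object is treated as sparse,
# and which data type it reports.
is_sparse(sp)
type(sp)
```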
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . FALSE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] FALSE FALSE FALSE . FALSE FALSE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 1.63549257 0.46533113 -0.64362260 . 0.3182148 -0.3746532
## GENE_2 -0.68873934 -1.65071111 -1.43557121 . -0.5230861 0.4910293
## GENE_3 1.14204030 -0.44399339 0.29166305 . 0.6314542 -0.8057484
## GENE_4 1.26460430 -1.36323948 1.72579495 . -0.7376756 0.4030362
## GENE_5 0.05788121 0.36403968 0.99787726 . 0.4713108 0.1343959
## ... . . . . . .
## GENE_96 -0.702913394 -0.390246848 0.883118963 . -0.2340585 -1.2547414
## GENE_97 -1.036727316 -0.176501976 -0.742875173 . -0.5115610 -0.4422906
## GENE_98 0.086703553 -1.983729259 -1.877994787 . -0.3552945 -0.1713032
## GENE_99 -0.419210661 -0.719555109 1.200409705 . -0.2065206 0.6432031
## GENE_100 0.878642020 -0.713506377 0.007804023 . 0.0868919 2.6659455
TileDBArrays are simply DelayedArray objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## GENE_1 GENE_2 GENE_3 GENE_4 GENE_5 GENE_6
## 1.63549257 -0.68873934 1.14204030 1.26460430 0.05788121 0.39245721
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 1.63549257 0.46533113 -0.64362260 -1.39333410 -1.30100081
## GENE_2 -0.68873934 -1.65071111 -1.43557121 -0.85907667 1.32968018
## GENE_3 1.14204030 -0.44399339 0.29166305 0.35838801 -0.88944242
## GENE_4 1.26460430 -1.36323948 1.72579495 1.07793464 -0.53582947
## GENE_5 0.05788121 0.36403968 0.99787726 0.33434179 0.40851717
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 3.2709851 0.9306623 -1.2872452 . 0.6364296 -0.7493064
## GENE_2 -1.3774787 -3.3014222 -2.8711424 . -1.0461721 0.9820586
## GENE_3 2.2840806 -0.8879868 0.5833261 . 1.2629085 -1.6114968
## GENE_4 2.5292086 -2.7264790 3.4515899 . -1.4753513 0.8060724
## GENE_5 0.1157624 0.7280794 1.9957545 . 0.9426216 0.2687918
## ... . . . . . .
## GENE_96 -1.40582679 -0.78049370 1.76623793 . -0.4681170 -2.5094828
## GENE_97 -2.07345463 -0.35300395 -1.48575035 . -1.0231219 -0.8845812
## GENE_98 0.17340711 -3.96745852 -3.75598957 . -0.7105890 -0.3426064
## GENE_99 -0.83842132 -1.43911022 2.40081941 . -0.4130413 1.2864061
## GENE_100 1.75728404 -1.42701275 0.01560805 . 0.1737838 5.3318910
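To make the delayed behaviour concrete, the sketch below chains a subset and a multiplication, inspects the pending operations, and only then realizes the values in memory. It is an illustration rather than part of the package's documented workflow: the names `delayed` and `realized` are made up here, and `showtree()` and `as.matrix()` are assumed to behave as they do for any other DelayedArray object.

```r
library(TileDBArray)

# Write a small matrix to a TileDB backend (a temporary location by default).
X <- matrix(rnorm(1000), ncol=10)
out <- writeTileDBArray(X)

# Subsetting and arithmetic only record the operations; the backend is untouched.
delayed <- out[1:5, 1:5] * 2
class(delayed)

# showtree() from DelayedArray prints the pending operations stacked on the seed.
showtree(delayed)

# Realization: as.matrix() reads from TileDB and applies the delayed operations.
realized <- as.matrix(delayed)
```

Until `as.matrix()` is called, no values are computed; the realized result should match `X[1:5, 1:5] * 2` computed on the in-memory matrix.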
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6 SAMP_7
## -8.512292 -10.569756 3.879029 -1.539773 -9.382439 13.111470 12.682793
## SAMP_8 SAMP_9 SAMP_10
## -1.327717 -1.146122 9.818975
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 -3.08589177
## GENE_2 0.45263269
## GENE_3 -0.06259064
## GENE_4 1.00299188
## GENE_5 2.36269477
## ... .
## GENE_96 -0.9005265
## GENE_97 -2.9140180
## GENE_98 -2.1437842
## GENE_99 2.3402809
## GENE_100 4.5998286
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray().
For example, the code below controls the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -0.40852544 0.13130618 0.44756166 . 0.20834613 0.02483394
## [2,] 0.25715519 0.28722914 0.92322308 . -1.01607874 0.85291550
## [3,] -1.01189611 0.92951281 0.08694537 . -0.30865970 -1.75405486
## [4,] 1.14735806 -0.87960176 -0.35593901 . -1.90836841 -0.84561608
## [5,] -0.48078192 0.52409937 0.16005338 . 0.80144236 1.48227694
## ... . . . . . .
## [96,] -0.27072319 0.01640110 -0.32287880 . -1.45454530 0.78641165
## [97,] 2.46154122 0.15107511 0.03311423 . 0.04732815 -0.56739924
## [98,] -0.30806969 0.91446168 -0.42627074 . -0.88207217 0.85783061
## [99,] 0.63632936 -0.37757959 0.79860133 . 0.45079283 0.77615910
## [100,] -1.01750245 -0.24932846 -0.90623449 . 0.74356716 -0.08969766
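Knowing the path and attribute name also makes it possible to reopen the same backend later without rewriting it. The following is a sketch under the assumption that the `TileDBArray()` constructor accepts a path together with an `attr` argument naming the attribute to read; the variable names `written` and `reopened` are illustrative.

```r
library(TileDBArray)

# Write the matrix to a known location under a custom attribute name.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
written <- writeTileDBArray(X, path=path, attr="WHEE")

# Reopen the existing backend from its path
# (assumption: 'attr' selects the attribute that was written above).
reopened <- TileDBArray(path, attr="WHEE")

# Compare the values read back from TileDB with the original matrix.
all.equal(as.matrix(reopened), X, check.attributes=FALSE)
```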
Because these arguments cannot be passed during coercion, we instead provide global settings, such as setTileDBPath(), that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -0.40852544 0.13130618 0.44756166 . 0.20834613 0.02483394
## [2,] 0.25715519 0.28722914 0.92322308 . -1.01607874 0.85291550
## [3,] -1.01189611 0.92951281 0.08694537 . -0.30865970 -1.75405486
## [4,] 1.14735806 -0.87960176 -0.35593901 . -1.90836841 -0.84561608
## [5,] -0.48078192 0.52409937 0.16005338 . 0.80144236 1.48227694
## ... . . . . . .
## [96,] -0.27072319 0.01640110 -0.32287880 . -1.45454530 0.78641165
## [97,] 2.46154122 0.15107511 0.03311423 . 0.04732815 -0.56739924
## [98,] -0.30806969 0.91446168 -0.42627074 . -0.88207217 0.85783061
## [99,] 0.63632936 -0.37757959 0.79860133 . 0.45079283 0.77615910
## [100,] -1.01750245 -0.24932846 -0.90623449 . 0.74356716 -0.08969766
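The global setting can be inspected and cleared again once it is no longer needed. The sketch below assumes that `getTileDBPath()` returns the current value and that `setTileDBPath(NULL)` unsets it, in which case later writes presumably fall back to a temporary location. The variable names are illustrative.

```r
library(TileDBArray)

X <- matrix(rnorm(1000), ncol=10)

# Point coercions at a specific (not yet existing) location.
path2 <- tempfile()
setTileDBPath(path2)
current <- getTileDBPath()   # assumed to return the value set above

stored <- as(X, "TileDBArray")

# Unset the global (assumption: NULL restores the default behaviour),
# so that later coercions do not try to reuse the same location.
setTileDBPath(NULL)
cleared <- getTileDBPath()
```

Clearing the setting after use keeps it from silently affecting unrelated coercions later in the same session.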
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2022 x64 (build 20348)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RcppSpdlog_0.0.12 TileDBArray_1.10.0 DelayedArray_0.26.0
## [4] IRanges_2.34.0 S4Vectors_0.38.0 MatrixGenerics_1.12.0
## [7] matrixStats_0.63.0 BiocGenerics_0.46.0 Matrix_1.5-4
## [10] BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 knitr_1.42 rlang_1.1.0
## [4] xfun_0.39 data.table_1.14.8 jsonlite_1.8.4
## [7] zoo_1.8-12 bit_4.0.5 htmltools_0.5.5
## [10] nanotime_0.3.7 sass_0.4.5 rmarkdown_2.21
## [13] grid_4.3.0 evaluate_0.20 jquerylib_0.1.4
## [16] fastmap_1.1.1 yaml_2.3.7 bookdown_0.33
## [19] BiocManager_1.30.20 compiler_4.3.0 Rcpp_1.0.10
## [22] RcppCCTZ_0.2.12 lattice_0.21-8 digest_0.6.31
## [25] R6_2.5.1 tiledb_0.19.0 bslib_0.4.2
## [28] bit64_4.0.5 tools_4.3.0 spdl_0.0.4
## [31] cachem_1.0.7