Contents

1 Overview

DelayedMatrixStats ports the matrixStats API to work with DelayedMatrix objects from the DelayedArray package. It provides high-performing functions operating on rows and columns of DelayedMatrix objects, including all subclasses such as RleArray (from the DelayedArray package) and HDF5Array (from the HDF5Array) as well as supporting all types of seeds, such as matrix (from the base package) and Matrix (from the Matrix package).

2 How can DelayedMatrixStats help me?

The DelayedArray package allows developers to store array-like data using in-memory or on-disk representations (e.g., in HDF5 files) and provides a common and familiar array-like interface for interacting with these data.

The DelayedMatrixStats package is designed to make life easier for Bioconductor developers wanting to use DelayedArray by providing a rich set of column-wise and row-wise summary functions.

We briefly demonstrate and explain these two features using a simple example. We’ll simulate some (unrealistic) RNA-seq read counts data from 10,000 genes and 20 samples and store it on disk as a HDF5Array:

library(DelayedArray)

x <- do.call(cbind, lapply(1:20, function(j) {
  rpois(n = 10000, lambda = sample(20:40, 10000, replace = TRUE))
}))
colnames(x) <- paste0("S", 1:20)
x <- realize(x, "HDF5Array")
x
#> DelayedMatrix object of 10000 x 20 integers:
#>           S1  S2  S3  S4 ... S17 S18 S19 S20
#>     [1,]  29  22  31  29   .  43  27  29  28
#>     [2,]  27  33  22  25   .  25  32  27  36
#>     [3,]  42  26  31  35   .  40  20  35  37
#>     [4,]  27  41  35  32   .  42  26  19  42
#>     [5,]  25  29  26  30   .  36  36  34  32
#>      ...   .   .   .   .   .   .   .   .   .
#>  [9996,]  26  46  21  29   .  25  28  56  16
#>  [9997,]  30  21  29  33   .  32  33  36  28
#>  [9998,]  48  34  27  15   .  28  33  38  48
#>  [9999,]  35  27  26  18   .  25  18  25  45
#> [10000,]  35  40  22  40   .  20  25  25  18

Suppose you wish to compute the standard deviation of the read counts for each gene.

You might think to use apply() like in the following:

system.time(row_sds <- apply(x, 1, sd))
#>    user  system elapsed 
#>  121.67    8.18  129.85
head(row_sds)
#> [1] 5.114222 7.055233 7.835647 6.681475 5.153180 8.690436

This works, but takes quite a while.

Or perhaps you already know that the matrixStats package provides a rowSds() function:

matrixStats::rowSds(x)
#> Error in rowVars(x, rows = rows, cols = cols, na.rm = na.rm, center = center, : Argument 'x' must be a matrix or a vector.

Unfortunately (and perhaps unsurprisingly) this doesn’t work. matrixStats is designed for use on in-memory matrix objects. Well, why don’t we just first realize our data in-memory and then use matrixStats

system.time(row_sds <- matrixStats::rowSds(as.matrix(x)))
#>    user  system elapsed 
#>    0.02    0.00    0.01
head(row_sds)
#> [1] 5.114222 7.055233 7.835647 6.681475 5.153180 8.690436

This works and is many times faster than the apply()-based approach! However, it rather defeats the purpose of using a HDF5Array for storing the data since we have to bring all the data into memory at once to compute the result.

Instead, we can use DelayedMatrixStats::rowSds(), which has the speed benefits of matrixStats::rowSds()1 but without having to load the entire data into memory at once2:

library(DelayedMatrixStats)

system.time(row_sds <- rowSds(x))
#>    user  system elapsed 
#>    0.15    0.00    0.16
head(row_sds)
#> [1] 5.114222 7.055233 7.835647 6.681475 5.153180 8.690436

Finally, by using DelayedMatrixStats we can use the same code, (colMedians(x)) regardless of whether the input is an ordinary matrix or a DelayedMatrix. This is useful for packages wishing to support both types of objects, e.g., packages wanting to retain backward compatibility or during a transition period from matrix-based to DelayeMatrix-based objects.

3 Supported methods

The initial release of DelayedMatrixStats supports the complete set of column-wise and row-wise matrixStats API3. Please see the matrixStats vignette (available online) for a summary these methods. The following table documents the API coverage and availability of ‘seed-aware’ methods in the current version of DelayedMatrixStats.

Method Block processing base::matrix optimized Matrix::Matrix optimized DelayedArray::RleArray (SolidRleArraySeed) optimized DelayedArray::RleArray (ChunkedRleArraySeed) optimized HDF5Array::HDF5Matrix optimized base::data.frame optimized S4Vectors::DataFrame optimized
colAlls() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colAnyMissings() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colAnyNAs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colAnys() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colAvgsPerRowSet() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCollapse() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCounts() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCummaxs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCummins() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCumprods() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colCumsums() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colIQRDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colIQRs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colLogSumExps() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colMadDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colMads() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colMaxs() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colMeans2() <U+2714> <U+2714> <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C>
colMedians() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colMins() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colOrderStats() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colProds() <U+2714> <U+2714> <U+274C> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C>
colQuantiles() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colRanges() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colRanks() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colSdDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colSds() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colSums2() <U+2714> <U+2714> <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C>
colTabulates() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colVarDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colVars() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colWeightedMads() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colWeightedMeans() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colWeightedMedians() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colWeightedSds() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
colWeightedVars() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowAlls() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowAnyMissings() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowAnyNAs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowAnys() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowAvgsPerColSet() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCollapse() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCounts() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCummaxs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCummins() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCumprods() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowCumsums() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowIQRDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowIQRs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowLogSumExps() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMadDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMads() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMaxs() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMeans2() <U+2714> <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMedians() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowMins() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowOrderStats() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowProds() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowQuantiles() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowRanges() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowRanks() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowSdDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowSds() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowSums2() <U+2714> <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowTabulates() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowVarDiffs() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowVars() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowWeightedMads() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowWeightedMeans() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowWeightedMedians() <U+2714> <U+2714> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowWeightedSds() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>
rowWeightedVars() <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C> <U+274C>

4 ‘Seed-aware’ methods

As well as offering a familiar API, DelayedMatrixStats provides ‘seed-aware’ methods that are optimized for specific types of DelayedMatrix objects.

To illustrate this idea, we will compare two ways of computing the column sums of a DelayedMatrix object:

  1. The ‘block-processing’ strategy. This was developed in the DelayedArray package and is available for all methods in the DelayedMatrixStats through the force_block_processing argument
  2. The ‘seed-aware’ strategy. This is implemented in the DelayedMatrixStats and is optimized for both speed and memory but only for DelayedMatrix objects with certain types of seed.

We will demonstrate this by computing the column sums matrices with 20,000 rows and 600 columns where the data have different structure and are stored in DelayedMatrix objects with different types of seed:

We use the microbenchmark package to measure running time and the profmem package to measure the total memory allocations of each method.

In each case, the ‘seed-aware’ method is many times faster and allocates substantially lower total memory.

library(DelayedMatrixStats)
library(Matrix)
library(microbenchmark)
library(profmem)

set.seed(666)

# -----------------------------------------------------------------------------
# Dense with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#

# Generate some data
dense_matrix <- matrix(runif(20000 * 600), 
                       nrow = 20000,
                       ncol = 600)

# Benchmark
dm_matrix <- DelayedArray(dense_matrix)
class(seed(dm_matrix))
#> [1] "matrix"
dm_matrix
#> DelayedMatrix object of 20000 x 600 doubles:
#>                [,1]       [,2]       [,3] ...     [,599]     [,600]
#>     [1,]  0.7743685  0.6601787  0.4098798   . 0.89118118 0.05776471
#>     [2,]  0.1972242  0.8436035  0.9198450   . 0.31799523 0.63099417
#>     [3,]  0.9780138  0.2017589  0.4696158   . 0.31783791 0.02830454
#>     [4,]  0.2013274  0.8797239  0.6474768   . 0.55217184 0.09678816
#>     [5,]  0.3612444  0.8158778  0.5928599   . 0.08530977 0.39224147
#>      ...          .          .          .   .          .          .
#> [19996,] 0.19490291 0.07763570 0.56391725   . 0.09703424 0.62659353
#> [19997,] 0.61182993 0.01910121 0.04046034   . 0.59708388 0.88389731
#> [19998,] 0.12932744 0.21155070 0.19344085   . 0.51682032 0.13378223
#> [19999,] 0.18985573 0.41716539 0.35110782   . 0.62939661 0.94601427
#> [20000,] 0.87889047 0.25308041 0.54666920   . 0.81630322 0.73272217
microbenchmark(
  block_processing = colSums2(dm_matrix, force_block_processing = TRUE),
  seed_aware = colSums2(dm_matrix),
  times = 10)
#> Unit: milliseconds
#>              expr       min       lq     mean    median        uq       max
#>  block_processing 437.74471 465.8673 496.4901 483.59275 485.03869 719.37619
#>        seed_aware  18.18125  18.7199  21.3551  19.91232  25.40625  26.50277
#>  neval cld
#>     10   b
#>     10  a
total(profmem(colSums2(dm_matrix, force_block_processing = TRUE)))
#> [1] 389679904
total(profmem(colSums2(dm_matrix)))
#> [1] 171312

# -----------------------------------------------------------------------------
# Sparse (60% zero) with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#

# Generate some data
sparse_matrix <- dense_matrix
zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix))
sparse_matrix[zero_idx] <- 0

# Benchmark
dm_dgCMatrix <- DelayedArray(Matrix(sparse_matrix, sparse = TRUE))
class(seed(dm_dgCMatrix))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dm_dgCMatrix
#> DelayedMatrix object of 20000 x 600 doubles:
#>               [,1]      [,2]      [,3] ...     [,599]     [,600]
#>     [1,] 0.7743685 0.0000000 0.4098798   .  0.8911812  0.0000000
#>     [2,] 0.0000000 0.0000000 0.9198450   .  0.3179952  0.6309942
#>     [3,] 0.9780138 0.0000000 0.4696158   .  0.0000000  0.0000000
#>     [4,] 0.0000000 0.8797239 0.0000000   .  0.0000000  0.0000000
#>     [5,] 0.0000000 0.0000000 0.5928599   .  0.0000000  0.3922415
#>      ...         .         .         .   .          .          .
#> [19996,] 0.1949029 0.0000000 0.5639173   . 0.09703424 0.62659353
#> [19997,] 0.6118299 0.0000000 0.0000000   . 0.00000000 0.88389731
#> [19998,] 0.0000000 0.0000000 0.1934408   . 0.51682032 0.00000000
#> [19999,] 0.0000000 0.0000000 0.0000000   . 0.62939661 0.94601427
#> [20000,] 0.8788905 0.0000000 0.0000000   . 0.81630322 0.00000000
microbenchmark(
  block_processing = colSums2(dm_dgCMatrix, force_block_processing = TRUE),
  seed_aware = colSums2(dm_dgCMatrix),
  times = 10)
#> Unit: milliseconds
#>              expr       min        lq      mean    median        uq
#>  block_processing 714.21011 716.46637 781.44165 758.72519 777.69690
#>        seed_aware  26.45579  26.66552  29.41674  27.79558  30.18533
#>         max neval cld
#>  1046.79463    10   b
#>    42.17232    10  a
total(profmem(colSums2(dm_dgCMatrix, force_block_processing = TRUE)))
#> [1] 445519080
total(profmem(colSums2(dm_dgCMatrix)))
#> [1] 14320

# -----------------------------------------------------------------------------
# Dense with values in {0, 100} featuring runs of identical values
# Fast, memory-efficient column sums of DelayedMatrix with Rle-based seed
#

# Generate some data
runs <- rep(sample(100, 500000, replace = TRUE), rpois(500000, 100))
runs <- runs[seq_len(20000 * 600)]
runs_matrix <- matrix(runs, 
                      nrow = 20000,
                      ncol = 600)

# Benchmark
dm_rle <- RleArray(Rle(runs),
                   dim = c(20000, 600))
class(seed(dm_rle))
#> [1] "SolidRleArraySeed"
#> attr(,"package")
#> [1] "DelayedArray"
dm_rle
#> RleMatrix object of 20000 x 600 integers:
#>            [,1]   [,2]   [,3]   [,4] ... [,597] [,598] [,599] [,600]
#>     [1,]     72     75     47     89   .     46     45     91     99
#>     [2,]     72     75     47     89   .     46     45     91     99
#>     [3,]     72     75     47     89   .     46     45     91     99
#>     [4,]     72     75     47     89   .     46     45     91     99
#>     [5,]     72     75     47     89   .     46     45     91     99
#>      ...      .      .      .      .   .      .      .      .      .
#> [19996,]     75     47     89     86   .     45     60     99     50
#> [19997,]     75     47     89     86   .     45     60     99     50
#> [19998,]     75     47     89     86   .     45     60     99     50
#> [19999,]     75     47     89     86   .     45     60     99     50
#> [20000,]     75     47     89     86   .     45     91     99     50
microbenchmark(
  block_processing = colSums2(dm_rle, force_block_processing = TRUE),
  seed_aware = colSums2(dm_rle),
  times = 10)
#> Unit: milliseconds
#>              expr       min         lq      mean   median          uq
#>  block_processing 956.72433 969.562647 986.59291 983.4776 1001.330607
#>        seed_aware   5.83331   6.044327  18.84198   6.5330    8.107515
#>        max neval cld
#>  1025.7177    10   b
#>   117.0049    10  a
total(profmem(colSums2(dm_rle, force_block_processing = TRUE)))
#> [1] 436698568
total(profmem(colSums2(dm_rle)))
#> [1] 47296

The development of ‘seed-aware’ methods is ongoing work (see the Roadmap), and for now only a few methods and seed-types have a ‘seed-aware’ method.

An extensive set of benchmarks is under development at http://peterhickey.org/BenchmarkingDelayedMatrixStats/.

5 Delayed operations

A key feature of a DelayedArray is the ability to register ‘delayed operations’. For example, let’s compute sin(dm_matrix):

system.time(sin_dm_matrix <- sin(dm_matrix))
#>    user  system elapsed 
#>       0       0       0

This instantaneous because the operation is not actually performed, rather it is registered and only performed when the object is realized. All methods in DelayedMatrixStats will correctly realise these delayed operations before computing the final result. For example, let’s compute
colSums2(sin_dm_matrix) and compare check we get the correct answer:

all.equal(colSums2(sin_dm_matrix), colSums(sin(as.matrix(dm_matrix))))
#> [1] TRUE

6 Roadmap

The initial version of DelayedMatrixStats provides complete coverage of the matrixStats column-wise and row-wise API4, allowing package developers to use these functions with DelayedMatrix objects as well as with ordinary matrix objects. This should simplify package development and assist authors to support to their software for large datasets stored in disk-backed data structures such as HDF5Array. Such large datasets are increasingly common with the rise of single-cell genomics.

Future releases of DelayedMatrixStats will improve the performance of these methods, specifically by developing additional ‘seed-aware’ methods. The plan is to prioritise commonly used methods (e.g.,
colMeans2()/rowMeans2(), colSums2()/rowSums2(), etc.) and the development of ‘seed-aware’ methods for the HDF5Matrix class. To do so, we will leverage the beachmat package. Proof-of-concept code has shown that this can greatly increase the performance when analysing such disk-backed data.

Importantly, all package developers using methods from DelayedMatrixStats will immediately gain from performance improvements to these low-level routines. By using DelayedMatrixStats, package developers will be able to focus on higher level programming tasks and address important scientific questions and technological challenges in high-throughput biology.