stream-stats {matter} | R Documentation
These functions allow calculation of streaming statistics. They are useful, for example, for calculating summary statistics on small chunks of a larger dataset, and then combining them to calculate the summary statistics for the whole dataset.
This is not particularly interesting for simple statistics like sum(), where per-chunk results combine trivially, but it is useful for statistics like running sd() or var(), where per-chunk results cannot simply be re-summarized to obtain the value for the larger dataset.
# calculate streaming univariate statistics
s_range(x, ..., na.rm = FALSE)
s_min(x, ..., na.rm = FALSE)
s_max(x, ..., na.rm = FALSE)
s_prod(x, ..., na.rm = FALSE)
s_sum(x, ..., na.rm = FALSE)
s_mean(x, ..., na.rm = FALSE)
s_var(x, ..., na.rm = FALSE)
s_sd(x, ..., na.rm = FALSE)
s_any(x, ..., na.rm = FALSE)
s_all(x, ..., na.rm = FALSE)
s_nnzero(x, ..., na.rm = FALSE)

# calculate streaming matrix statistics
colstreamStats(x, stat, na.rm = FALSE, ...)
rowstreamStats(x, stat, na.rm = FALSE, ...)

# calculate combined summary statistics
stat_c(x, y, ...)
x, y, ... | Object(s) on which to calculate a summary statistic, or a summary statistic to combine.
stat | The name of a summary statistic to compute over the rows or columns of a matrix. Allowable values include: "range", "min", "max", "prod", "sum", "mean", "var", "sd", "any", "all", and "nnzero".
na.rm | If
These summary statistics methods are intended to be applied to chunks of a larger dataset. The results can then be combined, either with the individual summary statistic functions or with stat_c(), to produce the combined summary statistic for the full dataset. This is most useful for calculating running variances and standard deviations iteratively, when computing them on the full dataset in a single pass would be difficult or impossible.
The variances and standard deviations are calculated using running sum-of-squares formulas, which can be updated iteratively and are accurate for large floating-point datasets (see references).
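As an illustration of the kind of update involved, here is a standard pairwise form of the running sum-of-squares formula. It is written in the spirit of the Welford reference and is not necessarily the exact expression used internally. The symbols are introduced here for illustration: chunks a and b have sizes n_a and n_b, means \bar{x}_a and \bar{x}_b, and sums of squared deviations from their own means M_a and M_b.

```latex
\begin{aligned}
n       &= n_a + n_b, \\
\delta  &= \bar{x}_b - \bar{x}_a, \\
\bar{x} &= \bar{x}_a + \delta \, \frac{n_b}{n}, \\
M       &= M_a + M_b + \delta^2 \, \frac{n_a n_b}{n}, \\
s^2     &= \frac{M}{n - 1}.
\end{aligned}
```

Only the triple (n, \bar{x}, M) needs to be kept per chunk, so the raw data never has to be revisited when chunks are combined.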
For all univariate functions except s_range(), a single number giving the summary statistic. For s_range(), two numbers giving the minimum and the maximum values.
For colstreamStats() and rowstreamStats(), a vector of summary statistics.
Kylie A. Bemis
B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419-420, Aug. 1962.
B. O'Neill, “Some Useful Moment Results in Sampling Problems,” The American Statistician, vol. 68, no. 4, pp. 282-296, Sep. 2014.
set.seed(1)
x <- sample(1:100, size=10)
y <- sample(1:100, size=10)

sx <- s_var(x)
sy <- s_var(y)

var(c(x, y))
stat_c(sx, sy) # should be the same

sxy <- stat_c(sx, sy)

# calculate with 1 new observation
var(c(x, y, 99))
stat_c(sxy, 99)

# calculate over rows of a matrix
set.seed(2)
A <- matrix(rnorm(100), nrow=10)
B <- matrix(rnorm(100), nrow=10)

sx <- rowstreamStats(A, "var")
sy <- rowstreamStats(B, "var")

apply(cbind(A, B), 1, var)
stat_c(sx, sy) # should be the same
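To show what combining two variance summaries involves, the following base R sketch reproduces the first example without the package. It assumes a standard pairwise sum-of-squares update and may not match the package's internal implementation. The helper names chunk_stats(), combine_stats(), and summary_var() are hypothetical and are not part of matter.

```r
# Hypothetical helpers for illustration only; not part of the matter package.

# Summarize one chunk as its size, mean, and sum of squared deviations.
chunk_stats <- function(x) {
  m <- mean(x)
  list(n = length(x), mean = m, M2 = sum((x - m)^2))
}

# Combine two chunk summaries without revisiting the raw data.
combine_stats <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n = n,
    mean = a$mean + delta * b$n / n,
    M2 = a$M2 + b$M2 + delta^2 * a$n * b$n / n
  )
}

# Sample variance recovered from a summary.
summary_var <- function(s) s$M2 / (s$n - 1)

set.seed(1)
x <- sample(1:100, size = 10)
y <- sample(1:100, size = 10)

# Combine the two chunk summaries and compare with the direct calculation.
sxy <- combine_stats(chunk_stats(x), chunk_stats(y))
all.equal(summary_var(sxy), var(c(x, y)))

# Add one new observation by treating it as a chunk of size 1.
sxy1 <- combine_stats(sxy, chunk_stats(99))
all.equal(summary_var(sxy1), var(c(x, y, 99)))
```

Carrying the count and mean alongside the sum of squares is what makes the combination possible; a bare variance value alone would not be enough to merge two chunks.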