The 2-Wasserstein distance \(W\) is a metric that quantifies the distance between two distributions, here representing two different conditions \(A\) and \(B\).
For continuous distributions, it is given by
\[W := W(F_A, F_B) = \bigg( \int_0^1 \big|F_A^{-1}(u) - F_B^{-1}(u) \big|^2 du \bigg)^\frac{1}{2},\]
where \(F_A\) and \(F_B\) are the corresponding cumulative distribution functions (CDFs) and \(F_A^{-1}\) and \(F_B^{-1}\) the respective quantile functions.
We specifically consider the squared 2-Wasserstein distance \(d := W^2\), which offers the following decomposition into location, size, and shape terms: \[d := d(F_A, F_B) = \int_0^1 \big|F_A^{-1}(u) - F_B^{-1}(u) \big|^2 du = \underbrace{\big(\mu_A - \mu_B\big)^2}_{\text{location}} + \underbrace{\big(\sigma_A - \sigma_B\big)^2}_{\text{size}} + \underbrace{2\sigma_A \sigma_B \big(1 - \rho^{A,B}\big)}_{\text{shape}},\]
where \(\mu_A\) and \(\mu_B\) are the respective means, \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations, and \(\rho^{A,B}\) is the Pearson correlation of the points in the quantile-quantile plot of \(F_A\) and \(F_B\).
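As a worked illustration of the decomposition, consider the special case of two normal distributions. The parameter values in the last line are chosen here purely for illustration and are not taken from the example data. With \(\Phi^{-1}\) denoting the standard normal quantile function, the quantile functions are affine transformations of each other, so the points of the quantile-quantile plot lie on a straight line with positive slope, \(\rho^{A,B} = 1\), and the shape term vanishes:
\[F_A^{-1}(u) = \mu_A + \sigma_A \Phi^{-1}(u), \quad F_B^{-1}(u) = \mu_B + \sigma_B \Phi^{-1}(u) \quad \Longrightarrow \quad d(F_A, F_B) = \big(\mu_A - \mu_B\big)^2 + \big(\sigma_A - \sigma_B\big)^2.\]
For instance, for \(F_A = N(0, 1)\) and \(F_B = N(2, 1)\) this gives \(d = (0 - 2)^2 + (1 - 1)^2 = 4\) and hence \(W = 2\).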
If the distributions \(F_A\) and \(F_B\) are not explicitly given and information is only available in the form of samples from \(F_A\) and \(F_B\), respectively, we use the corresponding empirical CDFs \(\hat{F}_A\) and \(\hat{F}_B\). The squared 2-Wasserstein distance is then computed by
\[d(\hat{F}_A, \hat{F}_B) \approx \frac{1}{K} \sum_{k=1}^K \big(Q_A^{\alpha_k} - Q_B^{\alpha_k} \big)^2 \approx \big(\hat{\mu}_A - \hat{\mu}_B\big)^2 + \big(\hat{\sigma}_A - \hat{\sigma}_B\big)^2 + 2\hat{\sigma}_A \hat{\sigma}_B \big(1 - \hat{\rho}^{A,B}\big).\]
Here, \(Q_A^{\alpha_k}\) and \(Q_B^{\alpha_k}\) denote equidistant quantiles of \(F_A\) and \(F_B\), respectively, at the levels \(\alpha_k := \frac{k-0.5}{K}, k = 1, \dots , K\); our implementation uses \(K=1000\). Moreover, \(\hat{\mu}_A, \hat{\mu}_B, \hat{\sigma}_A, \hat{\sigma}_B,\) and \(\hat{\rho}^{A,B}\) denote the empirical versions of the corresponding quantities from the original decomposition of \(d\).
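The two approximations can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function names `approx_squared_wass` and `decomp_squared_wass`, the seed, and the sample parameters are hypothetical, and all moments are computed from the vectors of equidistant quantiles with the population form of the standard deviation.

```r
# Hypothetical helper: equidistant levels alpha_k = (k - 0.5) / K
quantile_levels <- function(K) (seq_len(K) - 0.5) / K

# First approximation: mean squared difference of equidistant quantiles
approx_squared_wass <- function(x, y, K = 1000) {
  alpha <- quantile_levels(K)
  qa <- quantile(x, probs = alpha, names = FALSE)
  qb <- quantile(y, probs = alpha, names = FALSE)
  mean((qa - qb)^2)
}

# Second approximation: sum of location, size, and shape terms
decomp_squared_wass <- function(x, y, K = 1000) {
  alpha <- quantile_levels(K)
  qa <- quantile(x, probs = alpha, names = FALSE)
  qb <- quantile(y, probs = alpha, names = FALSE)
  # Population standard deviation (divides by K, not K - 1); an assumption
  sd_pop <- function(v) sqrt(mean((v - mean(v))^2))
  location <- (mean(qa) - mean(qb))^2
  size <- (sd_pop(qa) - sd_pop(qb))^2
  # Pearson correlation of the points in the quantile-quantile plot
  shape <- 2 * sd_pop(qa) * sd_pop(qb) * (1 - cor(qa, qb))
  list(distance = location + size + shape,
       location = location, size = size, shape = shape)
}

# Illustrative samples; seed and parameters are assumptions
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(150, mean = 2, sd = 1)

approx_squared_wass(x, y)
decomp_squared_wass(x, y)
```

Because the moments in this sketch are taken from the same quantile vectors, the three terms add up to the mean squared quantile difference up to floating-point error. Estimating the means and standard deviations from the raw samples instead would make the second relation an approximation, as written in the formula above.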
The package waddR offers three functions to compute the 2-Wasserstein distance in two-sample settings.
We will use samples from normal distributions to illustrate them.
The first function, wasserstein_metric, offers a faster reimplementation in C++ of the function wasserstein1d from the R package transport, which computes the original 2-Wasserstein distance \(W\).
The corresponding value of the squared 2-Wasserstein distance \(d\) is obtained by squaring the result.
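To make the relation between \(W\) and \(d\) concrete, the following base-R sketch computes the empirical 2-Wasserstein distance in the special case of two samples of equal size, where the integral over the quantile functions reduces to the mean squared difference of the order statistics. The function name `empirical_wass2`, the seed, and the sample parameters are assumptions for illustration. The restriction to equal sample sizes is a simplification of this sketch and does not describe how wasserstein_metric works internally.

```r
# Hypothetical helper: empirical 2-Wasserstein distance for equal-sized samples
empirical_wass2 <- function(x, y) {
  stopifnot(length(x) == length(y))
  # With equal sizes and equal weights, both empirical quantile functions
  # jump at the same levels, so sorted values are matched one to one
  sqrt(mean((sort(x) - sort(y))^2))
}

# Illustrative samples; seed and parameters are assumptions
set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)
y <- rnorm(200, mean = 2, sd = 1)

W <- empirical_wass2(x, y)  # 2-Wasserstein distance W
d <- W^2                    # squared 2-Wasserstein distance d
c(W = W, d = d)
```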
The second function, squared_wass_approx, computes the squared 2-Wasserstein distance by calculating the mean squared difference of the equidistant quantiles (first approximation in the previous formula). This function is currently used to compute the 2-Wasserstein distance in the testing procedures.
The third function, squared_wass_decomp, approximates the squared 2-Wasserstein distance by adding the location, size, and shape terms from the above decomposition (second approximation in the previous formula). It also returns the respective decomposition values.
squared_wass_decomp(x, y)
#> $distance
#> [1] 4.180458
#>
#> $location
#> [1] 4.114983
#>
#> $size
#> [1] 0.002307
#>
#> $shape
#> [1] 0.06316766
The decomposition results show that, in the considered example, the two distributions differ with respect to location (mean) but hardly in terms of size and shape, which is consistent with the underlying normal model.
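One way to quantify this reading is to express each term as a share of the total distance. The sketch below simply reuses the values printed in the output above; the variable names are chosen here for illustration.

```r
# Values copied from the squared_wass_decomp output shown above
decomp <- list(distance = 4.180458,
               location = 4.114983,
               size     = 0.002307,
               shape    = 0.06316766)

# Collect the three terms and express each as a share of the total
terms <- c(location = decomp$location,
           size     = decomp$size,
           shape    = decomp$shape)
shares <- terms / decomp$distance

# Percentage contribution of each term
round(100 * shares, 2)
```

The location term accounts for about 98% of the squared distance, while the size and shape terms together contribute less than 2%.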
The waddR package
Two-sample test based on a decomposition of the Wasserstein distance between two distributions to check for differences
Detect differential gene expression distributions in scRNAseq data
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.16.1 tidyselect_1.0.0
#> [3] xfun_0.12 purrr_0.3.3
#> [5] splines_3.6.3 lattice_0.20-40
#> [7] vctrs_0.2.4 htmltools_0.4.0
#> [9] stats4_3.6.3 BiocFileCache_1.10.2
#> [11] yaml_2.2.1 blob_1.2.1
#> [13] rlang_0.4.5 nloptr_1.2.2.1
#> [15] pillar_1.4.3 glue_1.3.2
#> [17] DBI_1.1.0 BiocParallel_1.20.1
#> [19] rappdirs_0.3.1 SingleCellExperiment_1.8.0
#> [21] BiocGenerics_0.32.0 bit64_0.9-7
#> [23] dbplyr_1.4.2 matrixStats_0.56.0
#> [25] GenomeInfoDbData_1.2.2 stringr_1.4.0
#> [27] zlibbioc_1.32.0 coda_0.19-3
#> [29] memoise_1.1.0 evaluate_0.14
#> [31] Biobase_2.46.0 knitr_1.28
#> [33] IRanges_2.20.2 GenomeInfoDb_1.22.0
#> [35] parallel_3.6.3 curl_4.3
#> [37] Rcpp_1.0.4 arm_1.10-1
#> [39] DelayedArray_0.12.2 S4Vectors_0.24.3
#> [41] XVector_0.26.0 abind_1.4-5
#> [43] bit_1.1-15.2 lme4_1.1-21
#> [45] digest_0.6.25 stringi_1.4.6
#> [47] dplyr_0.8.5 GenomicRanges_1.38.0
#> [49] grid_3.6.3 tools_3.6.3
#> [51] bitops_1.0-6 magrittr_1.5
#> [53] RCurl_1.98-1.1 tibble_2.1.3
#> [55] RSQLite_2.2.0 crayon_1.3.4
#> [57] pkgconfig_2.0.3 MASS_7.3-51.5
#> [59] Matrix_1.2-18 minqa_1.2.4
#> [61] assertthat_0.2.1 rmarkdown_2.1
#> [63] httr_1.4.1 boot_1.3-24
#> [65] R6_2.4.1 nlme_3.1-145
#> [67] compiler_3.6.3