1 Introduction to ALDEx2

This guide provides an overview of the R package ALDEx version 2 (ALDEx2) for differential (relative) abundance analysis of high throughput sequencing count-compositional data; all high throughput sequencing data are compositional (Gloor et al. 2017) because of constraints imposed by the instruments. The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq; Macklaim et al. 2013), but testing showed that it performed very well with traditional RNA-Seq datasets (Quinn, Crowley, and Richardson 2018), 16S rRNA gene variable region sequencing (Bian et al. n.d.) and selective growth-type (SELEX) experiments (McMurrough et al. 2014; Wolfs et al. 2016). In principle, the method should be applicable to nearly any type of high-throughput sequencing data that generates a table of per-feature counts for each sample (Fernandes et al. 2014): in addition to the examples above, this includes ChIP-Seq and metagenome sequencing.

1.1 What ALDEx2 does differently

The ALDEx2 package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns the posterior probability distribution of the observed data under a repeated sampling model. All outputs from ALDEx2 are derived from these posterior distributions, either as expected values or as confidence intervals.

ALDEx2 uses the centred log-ratio (clr) transformation (or closely related log-ratio transforms), which ensures the data are scale invariant and sub-compositionally coherent (Aitchison 1986). The scale invariance property removes the need for a between-sample normalization step, since the data are all placed on a consistent numerical coordinate system. The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-Seq studies or of rare OTUs from 16S rRNA gene amplicon studies). All feature abundance values are expressed relative to the geometric mean abundance of the other features in a sample. This is conceptually similar to quantitative PCR, where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description.
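
For readers unfamiliar with the clr, a minimal sketch of the transformation for a single hypothetical sample is shown below; the count values and the 0.5 prior are purely illustrative (ALDEx2 performs this transformation internally on each Dirichlet Monte-Carlo instance, so no user call is required).

# sketch of the clr transform for one hypothetical sample
counts <- c(geneA=20, geneB=200, geneC=2000, geneD=5)
p <- counts + 0.5                      # add a uniform prior to avoid log(0)
clr <- log2(p) - mean(log2(p))         # each value relative to the geometric mean
clr                                    # clr values sum to zero by construction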

In contrast, most commonly used tools for differential abundance analysis of high throughput sequencing (HTS) data assume that the data are delivered as counts (Anders et al. 2013; Anders and Huber 2010; Gierliński et al. 2015). Much work has been done to normalize datasets so that they approximate this assumption (Bullard et al. 2010; Dillies et al. 2013). Even so, the assumption that the data are counts can be disproven simply by running the same library on instruments of different capacity and attempting to normalize: the relative relationships between the features (genes, OTUs, functions) are preserved, but the actual count values are not.

With this simple realization, a number of groups have recognized that HTS datasets are actually count-compositions (Lovell et al. 2011; Friedman and Alm 2012; Fernandes et al. 2013; Fernandes et al. 2014; Gloor et al. 2017; Thomas P. Quinn et al. 2017), which have dramatically different statistical properties than counts (Aitchison 1986; Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado 2015; Pawlowsky-Glahn and Buccianti 2011). Thus, ALDEx2 is an R package for differential relative abundance analysis that takes into account the count-compositional nature of these data.

2 Installation

There are two ways to download and install the most current version of ALDEx2. The most recent version of the package can be found at github.com/ggloor/ALDEx_bioc. The package will run with only the base R packages and a minimal Bioconductor install, and will make use of the BiocParallel package to run several functions in parallel if it is installed; otherwise ALDEx2 runs in serial mode. It has been tested with R version 3 and should run on version 2.12 onward as long as the dependencies are met, but we recommend running the package on the most up-to-date R and Bioconductor versions.

  • Install development version from github:
install.packages("devtools")
devtools::install_github("ggloor/ALDEx_bioc")
  • Install stable version from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ALDEx2")

3 Quick Start: aldex with 2 groups:

The ALDEx2 package in Bioconductor is modular and is suitable for the comparison of many different experimental designs. This is achieved by exposing the underlying centre log-ratio transformed Dirichlet Monte-Carlo replicate values to make it possible for anyone to add the specific R code for their experimental design — a guide to these values is outlined below.

However, ALDEx2 contains an aldex wrapper function that can perform many simple analyses. This wrapper links the modular elements together to emulate ALDEx2 as it was prior to the modular approach. In the simplest incarnation, which we will use below, aldex does a two-sample t-test and calculates effect sizes. If the test is ‘t’, then effect should be set to TRUE. The ‘t’ option evaluates the data as a two-group experiment using both the Welch’s t and the Wilcoxon rank tests. In more complex incarnations for multi-group tests using the ANOVA modules, the effect size should not be calculated and the effect parameter should be FALSE. The ‘kw’ option evaluates the data as a one-way ANOVA using the glm and Kruskal-Wallis tests. All tests include a Benjamini-Hochberg correction of the raw P values. For two-group tests, the data can be plotted as Bland-Altman (Altman and Bland 1983) (MA) or effect (MW) plots (Gloor, Macklaim, and Fernandes 2016) using the aldex.plot function. See the end of the modular section for examples of the plots.

3.1 Case study t-test of a growth selection type (SELEX) experiment.

This section contains an analysis of a dataset collected where a single gene library was made that contained 1600 sequence variants at 4 codons in the sequence (McMurrough et al. 2014). These variants were cloned into an expression vector at equimolar amounts. The wild-type version of the gene conferred resistance to a topoisomerase toxin. Seven independent growths of the gene library were conducted under selective and non-selective conditions, and the resulting abundances of each variant were read out by sequencing a pooled, barcoded library on an Illumina MiSeq. The data table is included as selex_table.txt in the package. In this data table, there are 1600 features and 14 samples. The full analysis takes approximately 2 minutes and memory usage tops out at less than 1 GB of RAM on a mobile i7 class processor when we use 128 Dirichlet Monte-Carlo instances (DMC). For speed we use only the first 400 features and perform only 16 DMC. The commands used for ALDEx2 are presented below.

First we load the library and the included selex dataset. Then we set the comparison groups. This must be a vector of conditions in the same order as the samples in the input counts table. The aldex command is calling several other functions in the background, and each of them returns diagnostics. These diagnostics are suppressed for this vignette.

library(ALDEx2)
data(selex)
# subset the first 400 features for efficiency
selex.sub <- selex[1:400,]

conds <- c(rep("NS", 7), rep("S", 7))
x.all <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE,
     include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE)

par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch", xlab="Log-ratio abundance",
    ylab="Difference")
aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion",
    ylab="Difference")

Figure 1: MA and Effect plots of ALDEx2 output
The left panel is a Bland-Altman or MA plot that shows the relationship between (relative) abundance and difference. The right panel is an effect plot that shows the relationship between difference and dispersion. In both plots, features that are not significant are in grey or black, and features that are statistically significant are in red. The log-ratio abundance axis is the clr value for the feature.

4 Using ALDEx2 modules

The modular approach exposes the underlying intermediate data so that users can generate their own tests. The simple approach outlined above just calls aldex.clr, aldex.ttest, aldex.effect in turn and then merges the data into one dataframe called x.all. We will show these modules in turn, and then examine additional modules.

4.1 The aldex.clr module

The workflow for the modular approach first generates random instances of the centred log-ratio transformed values. There are three inputs: the counts table, a vector of conditions, and the number of Dirichlet Monte-Carlo (DMC) instances; and several parameters: a string indicating whether the iqlr, zero or all features are used as the denominator, and the level of verbosity (TRUE or FALSE). We recommend 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA (in fact we recommend that the number of samples in the smallest group multiplied by the number of DMC be at least 1000 in order to generate a reasonably stable estimate of the posterior distribution).

This operation is fast.

# the output is in the S3 object 'x'
x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F)

4.2 The aldex.ttest module

The next operation performs the Welch’s t and Wilcoxon rank test for the situation when there are only two conditions. There are only two inputs: the aldex object from aldex.clr and whether a paired test should be conducted or not (TRUE or FALSE).

This operation is reasonably fast.

x.tt <- aldex.ttest(x, paired.test=FALSE, verbose=FALSE)

4.3 The aldex.kw and the aldex.glm modules

As an alternative to the t-test, the user can perform the glm and Kruskal-Wallis tests for a one-way ANOVA of two or more conditions. There is one input: the aldex object from aldex.clr, which already contains the vector of conditions that defines the groups. Note that this is slow and is not evaluated for this documentation.

The aldex.glm module is the preferred method for testing more than two conditions or for complex study designs. See the ‘Complex study designs’ section below.

x.kw <- aldex.kw(x)

4.4 The aldex.effect module

Finally, we estimate effect size and the within- and between-condition values in the case of two conditions. This step is required for plotting, and in our lab we base our conclusions primarily on the output of this function (Macklaim et al. 2013; McMurrough et al. 2014; Bian et al. n.d.). There is one input, the aldex object from aldex.clr, and several useful parameters. It is possible to include the 95% confidence interval information for the effect size estimate with the boolean flag CI=TRUE. This can be helpful when deciding whether to include or exclude specific features from consideration. We find that features with a large expected effect that are nevertheless outliers in the extremes of the effect distribution can be false positives. New is the option to calculate effect sizes for paired t-tests or repeated samples with the boolean paired.test=TRUE option. In this case the difference between and difference within values are calculated as the median paired difference and the median absolute deviation of that difference; the standardized effect size is their ratio and is usually slightly larger than the unpaired effect size. The supplied selex dataset is actually a paired dataset and can be used to explore this option, as sketched after the code block below.

x.effect <- aldex.effect(x, CI=T, verbose=FALSE, paired.test=FALSE)
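
Because the selex samples are paired, the paired effect size described above can be explored by re-running the module with paired.test=TRUE. A minimal sketch, assuming the samples in each group are in matched order:

# sketch: paired effect sizes for the selex data
x.effect.paired <- aldex.effect(x, CI=TRUE, paired.test=TRUE)
# the paired estimate is typically slightly larger than the unpaired one
plot(x.effect$effect, x.effect.paired$effect, xlab="unpaired effect", ylab="paired effect")
abline(0, 1, lty=2)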

4.5 The aldex.plot module

Finally, the t-test and effect data are merged into one object.

x.all <- data.frame(x.tt,x.effect)

And the data are plotted. We see that the plotted data in Figures 1 and 2 are essentially the same.

par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch")
aldex.plot(x.all, type="MW", test="welch")

Figure 2: Output from aldex.plot function
The left panel is the MA plot, the right is the MW (effect) plot. In both plots red represents features called as differentially abundant with q \(<0.1\); grey are abundant, but not differentially abundant; black are rare, but not differentially abundant. This function uses the combined output from the aldex.ttest and aldex.effect functions

4.5.1 The effect confidence interval

As noted above, the ALDEx2 package generates a posterior distribution of the probability of observing the count given the data collected. Here we show the importance of this approach by examining the 95% CI of the effect size. Throughout, we use a standardized effect size, similar to Cohen’s d, although our effect size is more robust and more conservative (being approximately 0.7 times Cohen’s d when the data are Normally distributed; Fernandes and Gloor, unpublished).

Comparing the outputs of Figures 2 and 3 shows a very important point: there is a tremendous amount of latent variation in sequencing data. We see in Figure 2 that there are a few features with an expected q value that is statistically significant but which are both relatively rare and have a relatively small difference. Even identifying features based on the expected effect size can be misleading. We find that the safest approach is to identify those features where the 95% CI of the effect size does not cross 0.

Examining the figures we see that the features that are the rarest are least likely to reproduce the effect size with simple random sampling. The behaviour of the 95% CI metric is exactly in line with our intuition: the precision of estimation of rare features is poor—if you want more confidence in rare features you must spend more money to sequence more deeply.

With this approach we are accepting the biological variation in the data as received (i.e., we are not inferring any additional biological variation; the experimental design is taken as given), but are identifying those features where simple random sampling of the library would be expected to give the same result every time. This is the approach that was used in Macklaim et al. (2013), the results of which were independently validated and found to be very robust (Nelson et al. 2015).
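
The code used to generate Figure 3 is not echoed in this document; a minimal sketch of this selection is given below. It assumes the CI=TRUE output of aldex.effect contains columns named effect.low and effect.high; check colnames(x.effect) before relying on these names.

# sketch: features whose 95% CI of the effect size excludes 0
# (column names effect.low and effect.high are assumed; verify with colnames(x.effect))
ci.sig <- which(x.effect$effect.low > 0 | x.effect$effect.high < 0)
# features with a large expected effect size
eff.sig <- which(abs(x.effect$effect) > 2)
plot(x.effect$diff.win, x.effect$diff.btw, pch=19, cex=0.3, col=rgb(0,0,0,0.2),
    xlab="Dispersion", ylab="Difference")
points(x.effect$diff.win[eff.sig], x.effect$diff.btw[eff.sig], pch=19, cex=0.5, col="red")
points(x.effect$diff.win[ci.sig], x.effect$diff.btw[ci.sig], cex=0.9, col="blue")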


Figure 3: Comparing the mean effect with the 95% CI
The left panel is the MA plot, the right is the MW (effect) plot. In both plots red points indicate features that have an expected effect value of \(>2\); those with a blue circle have a 95% CI that does not overlap 0; and grey points are features that are not of interest. This plot uses only the output of the aldex.effect function. The grey lines in the effect plot show the effect=2 isopleth.

4.6 Complex study designs and the aldex.glm module

The aldex.glm module is included so that the probabilistic compositional approach can be used for complex study designs. This module is substantially slower than the two-comparison tests above, but we think it is worth it if you have complex study designs.

Essentially, the approach is the modular approach above but using a model matrix and covariates supplied to the glm function in R. The values returned are the expected values of the glm function given the inputs. In the example below, we are measuring the predictive value of variables A and B independently. See the documentation for the R formula function, or http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf for more information.

Validation of features that are differential under any of the variables identified by the aldex.glm function should be performed using the aldex.effect function as a post-hoc test.

Note that aldex.clr accepts only denom="all" or a user-defined vector of denominator offsets when a model matrix is supplied. Therefore, when the intended downstream analysis is the aldex.glm function, only those two denominator options are available. This will be addressed in a future release.

data(selex)
selex.sub <- selex[1:500, ]
 covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE),
     "B" = c(rep(0, 7), rep(1, 7)),
     "Z" = sample(c(1,2,3), 14, replace=TRUE))
 mm <- model.matrix(~ A + Z + B, covariates)
 x <- aldex.clr(selex.sub, mm, mc.samples=4, denom="all", verbose=F)
 glm.test <- aldex.glm(x, mm)
## |------------(25%)----------(50%)----------(75%)----------|
 glm.effect <- aldex.glm.effect(x)

The aldex.glm.effect function will calculate the effect size for all binary variables in the model matrix. The effect size calculations for each binary predictor are output to a named list. These effect sizes, and the outputs from aldex.glm, can be plotted as in the example code block below, which overlays the features with significant BH-corrected glm P values for the test variable on the effect plot for that binary predictor.

# effect plot for the binary predictor "B"
aldex.plot(glm.effect[["B"]], test="effect", cutoff=2)
# highlight features significant at q < 0.05
# (column 20 is used here as the BH-corrected P value for the B coefficient; check colnames(glm.test))
sig <- glm.test[,20]<0.05
points(glm.effect[["B"]]$diff.win[sig],
    glm.effect[["B"]]$diff.btw[sig], col="blue")
# repeat with a more permissive cutoff of q < 0.2
sig <- glm.test[,20]<0.2
points(glm.effect[["B"]]$diff.win[sig],
    glm.effect[["B"]]$diff.btw[sig], col="blue")

5 ALDEx2 outputs

5.1 Expected values

ALDEx2 returns expected values for summary statistics. It is important to note that ALDEx2 uses Bayesian sampling from a Dirichlet distribution to estimate the underlying technical variation. This is controlled by the number of Dirichlet Monte-Carlo instances (mc.samples); in practice we find that setting this to 16 or 128 is sufficient for most cases, as ALDEx2 is estimating the expected value of the distributions (Fernandes et al. 2013; Fernandes et al. 2014; Gloor et al. 2016).

In practical terms, ALDEx2 takes the biological observations as given, but infers technical variation (sequencing the same sample again) multiple times using the aldex.clr function. Thus, the expected values returned are those that would likely have been observed if the same libraries were sequenced again. The user is cautioned that the number of features called as differential will vary somewhat between runs because of this sampling procedure; however, only features with values close to the chosen significance cutoff will vary between runs.
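
A simple way to see this run-to-run variation is to repeat the same call and compare which features cross the cutoff. The sketch below uses the selex subset and conditions vector from the Quick Start section, with a deliberately small number of Monte-Carlo instances to exaggerate the variation; the seeds are arbitrary.

# sketch: run-to-run variation caused by Dirichlet Monte-Carlo sampling
set.seed(11)
x.run1 <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=FALSE, verbose=FALSE)
set.seed(13)
x.run2 <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=FALSE, verbose=FALSE)
# calls agree for almost all features; disagreements sit near the significance cutoff
table(run1 = x.run1$wi.eBH < 0.05, run2 = x.run2$wi.eBH < 0.05)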

Several papers have suggested that ALDEx2 is unable to properly control the false discovery rate because the P values returned do not follow a random uniform distribution, but rather tend to cluster near a value of 0.5 (Hawinkel et al. 2018; Thorsen et al. 2016). These studies indicate that point-estimate approaches are very sensitive to particular experimental designs and to differences in sparsity and read depth; ALDEx2 is not sensitive to these characteristics of the data, but by these benchmarks appears to under-report the true FDR. The criticisms miss the mark on ALDEx2 because ALDEx2 reports the expected P value across the Dirichlet Monte-Carlo replicates. Features that are differential simply because of the vagaries of random sampling will indeed have random uniform P values as point estimates, but will have an expected P value of 0.5 after repeated random sampling. In contrast, features that are differential because of true biological variation are robust to repeated random sampling. Thus, ALDEx2 identifies as differential only those features where simple random sampling (the minimal null hypothesis) cannot explain the difference.

In our experience, ALDEx2 returns a set of features that is very similar to the set returned as the intersect of multiple independent tools, a common recommendation when examining HTS datasets (Soneson and Delorenzi 2013).

5.2 Explaining the outputs

variant we.ep we.eBH wi.ep wi.eBH kw.ep
A:D:A:D 4.03010e-01 0.63080705 0.239383012 0.43732819 0.21532060
A:D:A:E 1.15463e-01 0.34744596 0.040901806 0.15725841 0.03745315
A:E:A:D 8.98797e-05 0.00329076 0.000582750 0.00820759 0.00174511
variant kw.eBH glm.ep glm.eBH rab.all rab.win.NS rab.win.S
A:D:A:D 0.3932743 3.61061e-01 5.23582e-01 1.42494 1.30886 2.45384
A:D:A:E 0.1486590 8.12265e-02 1.92292e-01 1.71230 1.49767 4.23315
A:E:A:D 0.0245786 7.73660e-08 3.35492e-06 3.97484 1.41163 11.02154
variant diff.btw diff.win effect overlap
A:D:A:D 1.12261 1.72910 0.471043 0.267260701
A:D:A:E 2.73090 2.38134 1.034873 0.135857781
A:E:A:D 9.64287 2.85008 3.429068 0.000156632

In the list below, the aldex.ttest function returns the values highlighted with \(\ast\), the aldex.kw function returns the values highlighted with \(\circ\), and the aldex.effect function returns the values highlighted with \(\diamondsuit\).

  1. \(\ast\) we.ep - Expected P value of Welch’s t test
  2. \(\ast\) we.eBH - Expected Benjamini-Hochberg corrected P value of Welch’s t test
  3. \(\ast\) wi.ep - Expected P value of Wilcoxon rank test
  4. \(\ast\) wi.eBH - Expected Benjamini-Hochberg corrected P value of Wilcoxon test
  5. \(\circ\) kw.ep - Expected P value of Kruskal-Wallis test
  6. \(\circ\) kw.eBH - Expected Benjamini-Hochberg corrected P value of Kruskal-Wallis test
  7. \(\circ\) glm.ep - Expected P value of glm test
  8. \(\circ\) glm.eBH - Expected Benjamini-Hochberg corrected P value of glm test
  9. \(\diamondsuit\) rab.all - median clr value for all samples in the feature
  10. \(\diamondsuit\) rab.win.NS - median clr value for the NS group of samples
  11. \(\diamondsuit\) rab.win.S - median clr value for the S group of samples
  12. \(\diamondsuit\) rab.X1_BNS.q50 - median expression value of features in sample X1_BNS if include.sample.summary=TRUE
  13. \(\diamondsuit\) diff.btw - median difference in clr values between S and NS groups
  14. \(\diamondsuit\) diff.win - median of the largest difference in clr values within S and NS groups
  15. \(\diamondsuit\) effect - median effect size: diff.btw / max(diff.win) for all instances
  16. \(\diamondsuit\) overlap - proportion of effect size that overlaps 0 (i.e. no effect)

The output from the aldex.glm function is somewhat different, although it still returns the expected values of the test statistics.

For example, using the selex dataset and the model matrix from the aldex.glm help information:

conds <- data.frame("A" = sample(0:1, 14, replace = TRUE), "B" = c(rep(0, 7), rep(1, 7)))

which we turn into a model matrix: mm <- model.matrix(~ A + B, conds)

Note that we have generated a random variable “A” and that variable “B” is simply the correct comparison. We anticipate that variable A will give no statistically significant outputs, that B will give similar results to a t-test, and indeed this is the case.

The outputs are the expected values of the intercept and of the coefficients for each variable in the model, along with their associated test statistics. Also calculated is the family-wise error rate (FWER) for each P value, reported in the model.x Pr(>|t|).holm output. This is a Holm-Bonferroni FWER correction, a step-down correction that is more powerful than the Bonferroni correction but more stringent than a false discovery rate correction such as Benjamini-Hochberg. Note that for a generalized linear model, Tukey’s HSD is not applicable to control the FWER: Tukey’s HSD depends on an ANOVA in the background, and the R packages that calculate Tukey’s HSD essentially all perform an aov, so the FWER would be calculated on a different test statistic than is used by the glm.

The expected value of the Holm-Bonferroni correction can be used as a post-hoc stand-in for Tukey’s HSD. Note that this is appropriate only when there is a relatively small number of contrasts (up to about 10) in the model (Kim 2015).
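
Because the exact column names of the aldex.glm output depend on the model matrix supplied, a safe way to locate the Holm-corrected values is to match on the “holm” suffix. A minimal sketch using the glm.test object from the code block above:

# sketch: locate and use the Holm-corrected P value columns in the aldex.glm output
holm.cols <- grep("holm", colnames(glm.test), value=TRUE)
holm.cols
# count features below an FWER of 0.05 for each coefficient
colSums(glm.test[, holm.cols, drop=FALSE] < 0.05, na.rm=TRUE)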

5.3 A word about effect size and overlap

The effect size metric used by ALDEx2 is a standardized distributional effect size developed specifically for this package. The measure is somewhat robust, allowing up to 20% of the samples to be outliers before the value is affected; it returns an effect size that is approximately 71% the size of Cohen’s d on a Normal distribution; and it requires at worst twice the number of samples as fully parametric methods (which are not robust) to estimate values with the same precision. The metric is equally valid for Normal, random uniform and Cauchy distributions (manuscript submitted).

We prefer to use the effect size whenever possible rather than statistical significance, since an effect size tells the scientist what they want to know: “what is reproducibly different between groups”. This is emphatically not something that P values deliver. We find that using the effect size returns a consistent set of true positive features regardless of sample size, unlike P value based methods. Furthermore, over half of the false positive features observed at low sample sizes have an effect size \(> 0.5 \times \mathrm{E}\), where \(\mathrm{E}\) is the chosen effect size cutoff. This is true regardless of the source of the dataset (manuscript submitted).

We suggest that an effect size cutoff of 1 or greater be used when analyzing HTS datasets. If preferred the user can also set a fold-change cutoff as is commonly done with P value based methods.
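
In practice, a conservative candidate set can be obtained by requiring features to pass both the corrected P value cutoff and the effect-size cutoff. A minimal sketch using the merged x.all object from above:

# sketch: combine a BH-corrected P value cutoff with an effect size cutoff of 1
candidates <- rownames(x.all)[ x.all$wi.eBH < 0.05 & abs(x.all$effect) > 1 ]
length(candidates)
head(candidates)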

The plot below shows the relationship between effect size and P values and BH-adjusted P values in the test dataset.

par(mfrow=c(1,2))
plot(x.all$effect, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Effect size", ylab="P value", main="Effect size plot")
points(x.all$effect, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")
legend(15,1, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))

plot(x.all$diff.btw, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Difference", ylab="P value", main="Volcano plot")
points(x.all$diff.btw, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")

Figure 4: Relationship between effect, difference, and P values
We can see that the effect size has a much tighter relationship to the P value than does the raw difference. The effect size is relatively stable across datasets, but the P value becomes progressively smaller as the sample size increases.

5.3.1 Alternative plotting of outputs

The built-in aldex.plot function described above will usually be sufficient, but for more user control the example below (Figure 5) shows how to plot which features are found by the Welch’s or Wilcoxon test individually (blue) or by both (red).

# identify which values are significant in both the Welch's t and Wilcoxon tests
found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)

# identify which values are significant in at least one of the tests
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)

# plot the within and between variation of the data
plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=0.3, col=rgb(0,0,0,0.3),
 xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
 cex=0.7, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
 cex=0.7, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)

Figure 5: Differential abundance in the selex dataset using the Welch’s t-test or Wilcoxon rank test
Features identified by both tests are shown in red; features identified by only one test are shown in blue. Non-significant features are shown in black if rare and in grey if abundant.

6 Troubleshooting datasets

7 Correcting for asymmetric datasets

In some cases we observe that the data returned by the centred log-ratio transformation can be skewed. This occurs when the dataset is extremely asymmetric, such as when one group is largely composed of features that are absent in the other group. In this case the geometric mean will not accurately represent the appropriate basis of comparison for each group. An asymmetry can arise for many reasons: in RNA-Seq it could arise because samples in one group contain a plasmid and the samples in the other group do not; in metagenomics or 16S rRNA gene sequencing it can arise when the samples in the two groups are taken from different environments; in a SELEX experiment it can arise because the two groups are under different selective constraints. The asymmetry can manifest either as a difference in sparsity (i.e., one group contains more 0-value features than the other) or as a systematic difference in abundance. When this occurs the geometric means of the samples in the two groups can be markedly different, and the resulting skew in the dataset can lead to false positive and false negative feature calls. Asymmetry generally shows as the centre of mass of the histogram of x.all$diff.btw or x.all$effect not being centred around zero, as in the sketch below. We recommend that all datasets be examined for asymmetry.
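
A quick diagnostic, using the x.all object generated earlier, is sketched below.

# sketch: check whether the between-group differences are centred near zero
hist(x.all$diff.btw, breaks=50, xlab="Difference between groups", main="")
abline(v=0, lty=2, col="red")
# a median far from zero suggests an asymmetric dataset
median(x.all$diff.btw)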

The approach taken by ALDEx2 is to identify a set of features that is relatively invariant across the entire dataset, even though many features may be asymmetric between the groups. Fundamentally, the log-ratio approach requires that the denominator be comparable across all samples. The output of aldex.clr contains the offsets of the features used for the denominator in the @denom slot.

7.1 Methods to correct for asymmetry

The aldex.clr function incorporates multiple approaches to select the denominator that can best deal with asymmetric datasets:

IMPORTANT: no rows should contain all 0 values as they will be removed by the aldex.clr function

  1. all: The default is to calculate the geometric mean of all features, i.e., the centred log-ratio of Aitchison (1986). This is the default method for the compositional data analysis approach.

  2. iqlr: The iqlr method identifies those features that exhibit reproducible variance in the entire dataset. This is called the inter-quartile log-ratio or iqlr approach. For this, a uniform prior of 0.5 is applied to the dataset, the clr transformation is applied, and the variance of each feature is calculated. Those features whose variance falls between the first and third quartiles of variance for all features in all groups are retained. When aldex.clr is called, the geometric mean abundance of only the retained features is calculated and used as the denominator for the log-ratio calculations. Modelling shows that this approach is effective for datasets with up to 25% of the features being asymmetric. The approach has the advantage that it has little or no effect on symmetric datasets, and so is a safe choice if the user is unsure whether the data are mildly asymmetric.

  3. lvha: This method identifies those features that are in the bottom quartile for variance in each group and in the top quartile for relative abundance in each sample and across the entire dataset. This method is appropriate when the groups are very asymmetric but some features are expected to be relatively constant. The basic idea is to identify those features that are relatively constant across all samples, similar to the features that would be chosen as internal standards in qPCR. Experience suggests that metagenomic and metatranscriptomic datasets can benefit from this method of choosing the denominator. This method does not work with the selex dataset, since no features fit the criteria.

  4. zero: This approach identifies and uses only those features that are non-zero in each group. In this approach the per-group non-zero features are used when aldex.clr calculates the geometric mean in the clr transformation. This method is appropriate when the groups are very asymmetric, but the experimentalist must ask whether the comparison is valid in these extreme cases.

  5. user: This approach lets the user define the set of ‘invariant’ features. In the case of meta-RNA-Seq, it could be argued that the levels of housekeeping genes should be constant across all samples. In this case the user can supply the row indices that correspond to the particular set of housekeeping genes to use as the standard (see the sketch after this list).

  6. iterate: This method identifies those features that are not statistically significantly different between groups using the statistical test of choice. It may be combined with other methods.
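
A sketch of how these denominator options are passed to aldex.clr, using the selex subset and conditions vector from above, is given below; the row indices supplied for the user option are purely illustrative and are not a recommendation for the selex data.

# sketch: alternative denominators for the clr transformation
x.iqlr <- aldex.clr(selex.sub, conds, mc.samples=16, denom="iqlr")
x.zero <- aldex.clr(selex.sub, conds, mc.samples=16, denom="zero")
# user-defined denominator: row indices of presumed-invariant features (illustrative only)
x.user <- aldex.clr(selex.sub, conds, mc.samples=16, denom=c(1:10))
# the features actually used in the denominator are stored in the @denom slot
head(x.iqlr@denom)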

Figure 6 shows the effect of the iqlr correction on the example dataset. When the denominator is all, we see that the bulk of the points fall off the midpoint (dotted line), but the bulk of the points are centred around 0 for the iqlr and lvha analyses. Thus, we have demonstrably better centring of the data with the latter two methods. Practically speaking, the iqlr or lvha transformation alters the P values and effect sizes of features near the margin of significance; the effect is largest for those features that are close to the bulk of the datapoints.

First the code:

# small synthetic dataset for illustration
# the features used in the denominator are stored in the x@denom slot

data(synth2)
blocks <- c(rep("N", 10),rep("S", 10))
x <- aldex.clr(synth2, blocks, denom="all")
x.e <- aldex.effect(x)
plot(x.e$diff.win, x.e$diff.btw, pch=19, col=rgb(0,0,0,0.1), cex=0.5,
    xlab="dispersion", ylab="difference", main="all")
points(x.e$diff.win[x@denom], x.e$diff.btw[x@denom], pch=19, col=rgb(0.8,0.5,0,0.7), cex=0.5)
points(x.e$diff.win[47:86], x.e$diff.btw[47:86], col=rgb(0.8,0,0,0.7), cex=0.5)
points(x.e$diff.win[980:1000], x.e$diff.btw[980:1000], col=rgb(0.8,0,0,0.7), cex=0.5)
abline(0,1)
abline(0,-1)
abline(h=0, col="gray", lty=2)

Figure 6: Differential abundance in a synthetic dataset using different denominators for the clr calculation
In this data 2% of the features are modelled to be sparse in one group but not the other. Features modelled to be different between two groups are shown in red. Features that are non-significant are in grey (or brown). Features used in the denominator shown in brown: the geometric mean of these features is used as the denominator when calculating the clr transformation. Lines of constant effect are drawn at 0, and \(\pm\) 1. Note that the iqlr and lvha denominators place the middle of the non-asymmetric features at a between group difference of 0.

8 Contributors

I am grateful that ALDEx2 has taken on a life of its own.

  • Andrew Fernandes wrote the original ALDEx code, and designed ALDEx2.
  • Jean Macklaim found and squished many bugs, performed unit testing, did much of the original validation. Jean also made seminal contributions to simplifying and explaining the output of ALDEx2.
  • Matt Links incorporated several ALDEx2 functions into a multicore environment.
  • Adrianne Albert wrote the correlation and the one-way ANOVA modules in aldex.kw.
  • Ruth Grace Wong added function definitions and made the parallel code functional with Bioconductor.
  • Jia Rong Wu developed and implemented the alternate denominator method to correct for asymmetric datasets.
  • Andrew Fernandes, Jean Macklaim and Ruth Grace Wong contributed to the Sum-FunctionsAitchison.R code.
  • Thom Quinn rewrote the t-test and Wilcoxon functions to make them substantially faster, and wrote the current aldex.glm function. Thom’s propr R package (T. Quinn et al. 2017) is able to integrate with the output from aldex.clr.
  • Vladimir Mikryukov and everyone named above have contributed bug fixes.
  • Greg Gloor currently maintains ALDEx2 and had and has roles in documentation, design, testing and implementation.

9 Version information

Version 1.0.4 of ALDEx was the version used for the analysis in Macklaim et al. (2013) and Fernandes et al. (2013). This version was suitable only for two-sample, two-group comparisons, and provided only effect size estimates of the difference between groups. ALDEx v1.0.4 is available at:

No further changes are expected for that version, since it can be replicated completely within ALDEx2 by using only the aldex.clr and aldex.effect commands.

Versions 2.0 to 2.05 were development versions that enabled P value calculations. Version 2.06 of ALDEx2 was the version used for the analysis in Fernandes et al. (2014). This version enabled large sample comparisons by calculating effect size from a random sample of the data rather than from an exhaustive comparison.

Version 2.07 of ALDEx2 was the initial modular version that exposed the intermediate calculations so that investigators could write functions to analyze different experimental designs. As an example, this version contains a one-way ANOVA module. This is identical to the version submitted to Bioconductor as 0.99.1.

Releases of ALDEx2 now use the Bioconductor version numbering scheme.

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ALDEx2_1.32.0         zCompositions_1.4.0-1 truncnorm_1.0-9      
## [4] NADA_1.6-1.1          survival_3.5-5        MASS_7.3-59          
## [7] BiocStyle_2.28.0     
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.5-4                jsonlite_1.8.4             
##  [3] highr_0.10                  compiler_4.3.0             
##  [5] BiocManager_1.30.20         Rcpp_1.0.10                
##  [7] SummarizedExperiment_1.30.0 Biobase_2.60.0             
##  [9] magick_2.7.4                GenomicRanges_1.52.0       
## [11] bitops_1.0-7                parallel_4.3.0             
## [13] jquerylib_0.1.4             splines_4.3.0              
## [15] IRanges_2.34.0              BiocParallel_1.34.0        
## [17] yaml_2.3.7                  fastmap_1.1.1              
## [19] lattice_0.21-8              R6_2.5.1                   
## [21] XVector_0.40.0              GenomeInfoDb_1.36.0        
## [23] knitr_1.42                  BiocGenerics_0.46.0        
## [25] DelayedArray_0.26.0         bookdown_0.33              
## [27] MatrixGenerics_1.12.0       RcppZiggurat_0.1.6         
## [29] GenomeInfoDbData_1.2.10     bslib_0.4.2                
## [31] rlang_1.1.0                 cachem_1.0.7               
## [33] xfun_0.39                   sass_0.4.5                 
## [35] multtest_2.56.0             cli_3.6.1                  
## [37] magrittr_2.0.3              zlibbioc_1.46.0            
## [39] digest_0.6.31               grid_4.3.0                 
## [41] S4Vectors_0.38.0            Rfast_2.0.7                
## [43] evaluate_0.20               codetools_0.2-19           
## [45] stats4_4.3.0                RCurl_1.98-1.12            
## [47] rmarkdown_2.21              matrixStats_0.63.0         
## [49] tools_4.3.0                 htmltools_0.5.5

Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London, England: Chapman & Hall.

Altman, D. G., and J. M. Bland. 1983. “Measurement in Medicine: The Analysis of Method Comparison Studies.” Journal of the Royal Statistical Society. Series D (the Statistician) 32 (3): pp. 307–17. http://www.jstor.org/stable/2987937.

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biol 11 (10): R106. https://doi.org/10.1186/gb-2010-11-10-r106.

Anders, Simon, Davis J McCarthy, Yunshun Chen, Michal Okoniewski, Gordon K Smyth, Wolfgang Huber, and Mark D Robinson. 2013. “Count-Based Differential Expression Analysis of RNA Sequencing Data Using R and Bioconductor.” Nat Protoc 8 (9): 1765–86. https://doi.org/10.1038/nprot.2013.099.

Bian, Gaorui, Gregory B Gloor, Aihua Gong, Changsheng Jia, Wei Zhang, Jun Hu, Hong Zhang, et al. n.d. “The Gut Microbiota of Healthy Aged Chinese Is Similar to That of the Healthy Young.” mSphere 2 (5): e00327–17. https://doi.org/10.1128/mSphere.00327-17.

Bullard, James H, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in MRNA-Seq Experiments.” BMC Bioinformatics 11: 94. https://doi.org/10.1186/1471-2105-11-94.

Dillies, Marie-Agnès, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, et al. 2013. “A Comprehensive Evaluation of Normalization Methods for Illumina High-Throughput RNA Sequencing Data Analysis.” Brief Bioinform 14 (6): 671–83. https://doi.org/10.1093/bib/bbs046.

Fernandes, Andrew D, Jean M Macklaim, Thomas G Linn, Gregor Reid, and Gregory B Gloor. 2013. “ANOVA-Like Differential Expression (Aldex) Analysis for Mixed Population Rna-Seq.” PLoS One 8 (7): e67019. https://doi.org/10.1371/journal.pone.0067019.

Fernandes, Andrew D, Jennifer Ns Reid, Jean M Macklaim, Thomas A McMurrough, David R Edgell, and Gregory B Gloor. 2014. “Unifying the Analysis of High-Throughput Sequencing Datasets: Characterizing RNA-Seq, 16S RRNA Gene Sequencing and Selective Growth Experiments by Compositional Data Analysis.” Microbiome 2: 15.1–15.13. https://doi.org/10.1186/2049-2618-2-15.

Friedman, Jonathan, and Eric J Alm. 2012. “Inferring Correlation Networks from Genomic Survey Data.” PLoS Comput Biol 8 (9): e1002687. https://doi.org/10.1371/journal.pcbi.1002687.

Gierliński, Marek, Christian Cole, Pietà Schofield, Nicholas J Schurch, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, et al. 2015. “Statistical Models for Rna-Seq Data Derived from a Two-Condition 48-Replicate Experiment.” Bioinformatics 31 (22): 3625–30. https://doi.org/10.1093/bioinformatics/btv425.

Gloor, Gregory B, Jean M. Macklaim, and Andrew D. Fernandes. 2016. “Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes.” Journal of Computational and Graphical Statistics 25 (3C): 971–79. https://doi.org/10.1080/10618600.2015.1131161.

Gloor, Gregory B., Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8: 2224. https://doi.org/10.3389/fmicb.2017.02224.

Gloor, Gregory B, Jean M Macklaim, Michael Vu, and Andrew D Fernandes. 2016. “Compositional Uncertainty Should Not Be Ignored in High-Throughput Sequencing Data Analysis.” Austrian Journal of Statistics 45: 73–87. https://doi.org/doi:10.17713/ajs.v45i4.122.

Hawinkel, Stijn, Federico Mattiello, Luc Bijnens, and Olivier Thas. 2018. “A Broken Promise : Microbiome Differential Abundance Methods Do Not Control the False Discovery Rate.” BRIEFINGS IN BIOINFORMATICS. http://dx.doi.org/10.1093/bib/bbx104.

Kim, Hae-Young. 2015. “Statistical Notes for Clinical Researchers: Post-Hoc Multiple Comparisons.” Restorative Dentistry & Endodontics 40 (2): 172–76.

Lovell, David, Warren Müller, Jen Taylor, Alec Zwart, and Chris Helliwell. 2011. “Proportions, Percentages, Ppm: Do the Molecular Biosciences Treat Compositional Data Right?” In Compositional Data Analysis: Theory and Applications, edited by Vera Pawlowsky-Glahn and Antonella Buccianti, 193–207. London: John Wiley; Sons New York, NY.

Macklaim, M Jean, D Andrew Fernandes, M Julia Di Bella, Jo-Anne Hammond, Gregor Reid, and Gregory B Gloor. 2013. “Comparative Meta-RNA-Seq of the Vaginal Microbiota and Differential Expression by Lactobacillus Iners in Health and Dysbiosis.” Microbiome 1: 15. https://doi.org/doi: 10.1186/2049-2618-1-12.

McMurrough, Thomas A, Russell J Dickson, Stephanie M F Thibert, Gregory B Gloor, and David R Edgell. 2014. “Control of Catalytic Efficiency by a Coevolving Network of Catalytic and Noncatalytic Residues.” Proc Natl Acad Sci U S A 111 (23): E2376–83. https://doi.org/10.1073/pnas.1322352111.

Nelson, Tiffanie M, Joanna-Lynn C Borgogna, Rebecca M Brotman, Jacques Ravel, Seth T Walk, and Carl J Yeoman. 2015. “Vaginal Biogenic Amines: Biomarkers of Bacterial Vaginosis or Precursors to Vaginal Dysbiosis?” Frontiers in Physiology 6.

Pawlowsky-Glahn, Vera, and Antonella Buccianti. 2011. Compositional Data Analysis: Theory and Applications. John Wiley & Sons.

Pawlowsky-Glahn, Vera, Juan José Egozcue, and Raimon Tolosana-Delgado. 2015. Modeling and Analysis of Compositional Data. John Wiley & Sons.

Quinn, Thomas P, Tamsyn M Crowley, and Mark F Richardson. 2018. “Benchmarking Differential Expression Analysis Tools for Rna-Seq: Normalization-Based Vs. Log-Ratio Transformation-Based Methods.” BMC Bioinformatics 19 (1): 274. https://doi.org/10.1186/s12859-018-2261-8.

Quinn, Thomas P., Ionas Erb, Mark F. Richardson, and Tamsyn M. Crowley. 2017. “Understanding Sequencing Data as Compositions: An Outlook and Review.” bioRxiv. https://doi.org/10.1101/206425.

Quinn, Thomas, Mark F Richardson, David Lovell, and Tamsyn Crowley. 2017. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” bioRxiv. https://doi.org/10.1101/104935.

Soneson, Charlotte, and Mauro Delorenzi. 2013. “A Comparison of Methods for Differential Expression Analysis of RNA-seq Data.” BMC Bioinformatics 14: 91. https://doi.org/10.1186/1471-2105-14-91.

Thorsen, Jonathan, Asker Brejnrod, Martin Mortensen, Morten A Rasmussen, Jakob Stokholm, Waleed Abu Al-Soud, Søren Sørensen, Hans Bisgaard, and Johannes Waage. 2016. “Large-Scale Benchmarking Reveals False Discoveries and Count Transformation Sensitivity in 16S RRNA Gene Amplicon Data Analysis Methods Used in Microbiome Studies.” Microbiome 4 (1): 62. https://doi.org/10.1186/s40168-016-0208-8.

Wolfs, Jason M, Thomas A Hamilton, Jeremy T Lant, Marcon Laforet, Jenny Zhang, Louisa M Salemi, Gregory B Gloor, Caroline Schild-Poulter, and David R Edgell. 2016. “Biasing Genome-Editing Events Toward Precise Length Deletions with an Rna-Guided Tevcas9 Dual Nuclease.” Proc Natl Acad Sci U S A, December. https://doi.org/10.1073/pnas.1616343114.