Comparative Identification and Assessment of Single Nucleotide Variants Through Shifts in Base Call Frequencies

1 Motivation

The advances in high-throughput nucleotide sequencing have lead to new possibilities of identifying genomic alterations, which is of particular interest for cancer research. While the detection of genomic variants is often conducted by comparing against a population-based reference sequence (such as the human GRCh 37), in the investigation of tumors a comparison between matched samples is predominantly pursued. Considering a test sample $T$ and a control sample $C$, the aim is to identify single nucloetide variants (SNVs) which show significant changes between the two. Typically, this is applied for comparing a tumor sample against a matched normal/constitutive sample of the same patient, focusing on alterations that may have occurred in the transition to the cancerous tissue, and therefore are often termed somatic variants. However, a broader range of questions can be addressed by extending the focus to include genomic changes of temporally or spatially separated samples to each other, to investigate tumor evolution and subclonal shifts [11].

Extending the scope of comparative variant analyses in a cancer related setting may require an accompanying change in definitions. In the classical tumor versus normal setting, a somatic event is characterized by both the presence of the variant in the tumor and the absence in the matched normal [16]. This definition is not satisfactory for comparing different tumor stages against each other. If we imagine two stages of a tumor haboring multiple subclones, variants may be present in both stages with changing abundances. Here, the classical definition of a somatic variant would result in missing the event. In the following, we will therefore focus on estimating the shift in variant allele frequencies between a test and a control sample. The classical somatic definition requiring the variant to be absent in the control constitutes a special case of this.

Numerous approaches focus on more narrowly defined somatic comparisons, motivated by scientific questions in the field of cancer genomics [15] [16]. Most existing methods do not allow an explicit statement about the evidence for the absence of a putative variant in a sample, and a clear distinction is needed whether a negative call results from a true negative site or the lack of statistical power to detect it. As an example, imagine a variant calling approach using a hypothesis test with the null hypothesis that the variant frequencies at a locus are identical for both samples. While a significant p-value provides evidence for the presence of a variant, inverting this argument does not work: A non-significant p-value can both be indicative of the absence or insufficient power. Furthermore, most methods return only a point estimate for the variant frequency in the tumor, without stating the confidence of the estimate. However, this information would be critical to reliably differentiate between variants arising from competing subclonal populations.

With the Rariant package, we pursue a generalized approach for identifying and quantifying SNV alterations from high-throughput sequencing data in comparative settings. By focusing on shifts of the non-consensus base call frequencies, events like loss of heterozygosity and clonal expansions in addition to somatic variants can be detected. In the following, we will 1) outline the methodological framework, 2) provide an example workflow on how to obtain a call set starting from short-read alignments, and 3) illustrate potential benefits of the methods on a biological cancer data set. This methodology can be used for a high performance identification of variant sites as well as to quantitatively assess the presence or absence in comparisons between matched samples.

2 Methodology

2.2 Confidence intervals

Distinguishing biological variants from noise requires knowledge about the variance of the point estimate $\hat{d_{i}}$. By constructing a confidence interval (CI) for $d_{i}$ with confidence level $\beta$ [14], we assess the certainty of the estimated shift in non-consensus frequencies. The probability of the true value being outside the confidence interval is less than $\alpha = 1 - \beta$. This is in concordance with the type I or $\alpha$ error definition in statistical testing.

Under the assumption that the non-consensus counts $K^{S}_{i}$ in our samples follow binomial distributions with parameters $p^{S}_{i}$ and $N^{S}_{i}$, several methods have been established for estimating confidence intervals for the difference of two rate parameters [12] [7]. The performance of an approach is generally described in terms of its coverage probability indicating the probability of a confidence interval to cover the true value (see 4.2). Coverage probabilities greater and less than the confidence level $\beta$ describe conservative and liberal behaviors, respectively. Due to the conservative coverage probabilities and high computational effort of exact confidence interval estimates, approximate methods are generally preferred [1] [7].

The Agresti-Caffo (AC) confidence interval [2]

$$\tilde{p}^{T} - \tilde{p}^{C} \pm z \sqrt{ \frac{\tilde{p}^{T} (1 - \tilde{p}^{T})} {\tilde{N}^{T}} + \frac{\tilde{p}^{C}(1 - \tilde{p}^{C})} {\tilde{N}^{C}} }$$

with

$$\tilde{p}^{X} = \frac{K^{X}+\zeta}{N^{X}+2\zeta},$$

$$\tilde{N}^{X} = N^{X} + 2\zeta,$$

$$\zeta = \frac{1}{4} z^2,$$

and $z = z_{(1-\beta)/2}$ as the upper $(1-\beta)/2$ percentile of the standard normal distribution), is an approximation of the score test-based confidence interval. Several publications emphasize the usefulness and advantages of the AC method over related approaches [7] [3] [5].

3 Workflow

3.1 Multiple Sample Simulation Study

We want to further demostrate the usage and abilities of the Rariant package on a real-life data set. Due to legal and privacy issues, most human cancer sequencing data is not publicly accessible and therefore cannot serve as an example data set here. Alternatively, we conduct an analysis to mimic the characteristics of current cancer sequencing studies.

For the purpose of the analysis, we compare three samples from the 1000 Genomes project [6], serving as a control/normal (control) and two related test/tumor samples (test and test2). Further, we simulate a clonal mixture (mix) of the two test samples by combining their reads.

library(Rariant)

library(GenomicRanges)
library(ggbio)

tp53_region = GRanges("chr17", IRanges(7565097, 7590856))

control_bam = system.file("extdata", "platinum", "control.bam", package = "Rariant", mustWork = TRUE)
test1_bam = system.file("extdata", "platinum", "test.bam", package = "Rariant", mustWork = TRUE)
test2_bam = system.file("extdata", "platinum", "test2.bam", package = "Rariant", mustWork = TRUE)
mix_bam = system.file("extdata", "platinum", "mix.bam", package = "Rariant", mustWork = TRUE)

v_test1 = rariant(test1_bam, control_bam, tp53_region, select = FALSE)

   Warning: 'rbind_all' is deprecated.
   Use 'bind_rows()' instead.
   See help("Deprecated")

v_test2 = rariant(test2_bam, control_bam, tp53_region, select = FALSE)

   Warning: 'rbind_all' is deprecated.
   Use 'bind_rows()' instead.
   See help("Deprecated")

v_mix = rariant(mix_bam, control_bam, tp53_region, select = FALSE)

   Warning: 'rbind_all' is deprecated.
   Use 'bind_rows()' instead.
   See help("Deprecated")

v_all = GenomicRangesList(T1 = v_test1, T2 = v_test2, M = v_mix)

   Warning: 'GenomicRangesList' is deprecated.
   Use 'GRangesList(..., compress=FALSE)' instead.
   See help("Deprecated")

v_all = endoapply(v_all, updateCalls)

To better understand the evidence for the presence or absence of particular variants across samples, we plot the confidence intervals, colored according to the predicted event type, and abundance shifts for all sites of interest, colored according to the sign of the shift.

t_ci = tracks(lapply(v_all, plotConfidenceIntervals, color = "verdict")) + verdictColorScale()

print(t_ci)

   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.

   Warning: Removed 6 rows containing missing values (geom_pointrange).

Confidence intervals for simulation study

t_ci = tracks(lapply(v_all, plotConfidenceIntervals, color = "eventType"))

print(t_ci)

   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.

   Warning: Removed 6 rows containing missing values (geom_pointrange).

Confidence intervals for simulation study

t_rates = tracks(lapply(v_all, plotAbundanceShift))

print(t_rates)

   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.
   Coordinate system already present. Adding new coordinate system, which will replace the existing one.

   Warning: Removed 6 rows containing missing values (geom_segment).

   Warning: Removed 6 rows containing missing values (geom_point).

Abundance shifts for simulation study

In the following, we look at positions which showed a significant effect in at least one sample. This gives us 12 positions to consider in the following.

While most of the variants are somatic, i.e. they do not appear in the control sample, the last variant position shows a loss of a heterozygous SNP. Looking for example in more detail into the group of 5 variant sites around 7.85 Mbp: We can identify them as consistent with a heterozygous somatic variant in the first sample, since their 95% CIs overlap the value of 0.5. In contrast, we can show the absence of the same variants in the second sample. The third sample again shows the presence of the variants, as seen in the first case, but with lower abundance. Such a result could be expected in a mixture of subclones, in which some clones carry a somatic variant and others not. Further, we can also see the case of the next variant which consistently exists in all three samples with the same abundance.

z = filterCalls(v_all, verdict %in% c("present", "inbetween", "dontknow"))

elementNROWS(z)

   T1 T2  M 
   49 49 49

evidenceHeatmap(z, fill = "d", color = "verdict") + verdictColorScale()

4 Supplementary Information

4.1 Strand-specific analysis

By comparing the confidence intervals between the two strands, we can further detect and characterize effects such as variations in sequencing depth and strand biases. We illustrate this with a set of hypothetical cases for confidence intervals for two strands. The upper row (cases 4-7) corresponds to sites with overlapping CIs, whereas the lower row (cases 1-3) shows cases of disagreements between the CIs indicative of strand biases. When analyzing the probability for the overlap of confidence intervals, an adjustment of the confidence level has to be taken into account [8].

Illustrative cases of confidence intervals for somatic variant frequency estimates for two strands

Motivated by the analysis of different Illumina genome and exome sequencing, we consider strand-biases, in which the non-consensus base call rates differ significantly between strands at sites with sufficient sequencinq depth, a neglectable problem with current data sets and analysis pipelines (see also Best practices for short-read processing). In the presence of strand biases, pooling the counts of both plus and minus strand may be not desirable. A possible solution may be to perform a strand-specific analysis, and later combine the resulting statistics. Gerstung and colleagues discuss different approaches for combining p-values [9], in particular taking the minimum, maximum, average, or Fisher combination. These can be also applied for confidence intervals, with Fisher's method being equivalent to taking the sum of both strands.

4.2 Assessing performance of confidence interval methods

As outlined before, an important property for assessing confidence intervals is given by their coverage probabilities. Ideally, we would expect a method to have coverage probabilities close to the nominal confidence level β over a wide range in the parameter space. Previous publications analyzing the performance focus on parameter settings that deviate from those of sequencing data sets [7]. Therefore, we perform a simulation that demonstrates the behavior of the Agresti-Caffo methods for a whole-genome sequencing study. For a fixed sequencing depth of 30 in both test and control sample, the coverage probability of 95% AC confidence intervals is computed for all possible combinations of mismatch counts $K^{T}$ and $K^{C}$.

## WGS
n1 = 30
n2 = 30
k1 = 0:(n1-1)
k2 = 0:(n2-1)
cl = 0.95
n_sample = 1e4

pars = expand.grid(k1 = k1, k2 = k2, n1 = n1, n2 = n2, conf_level = cl)

cp_ac = coverageProbability(pars, fun = acCi, n_sample = n_sample)

p_ac = ggplot(cp_ac) + geom_tile(aes(x = k1, y = k2, fill = cp)) +
    scale_fill_gradient2(midpoint = 0.95, limits = c(0.9, 1)) +
    theme_bw() + xlab("kT") + ylab("kC")

print(p_ac)

Coverage probabilities for whole-genome setting

For mismatch rates close to 0 or 1 in both samples, the Agresti-Caffo method shows a conservative perfomance.

4.3 Benchmarking of performance and resources

For an analysis of two matched human tumor samples, we performed a benchmark to assess the computational time and memory usage on a standard laptop (Thinkpad X220 built in 2011). Both samples contain about 95M reads mapped to the 1000genomes reference sequence reads that are considered in the analysis. For the analysis of chromosome 22, the analysis with default parameters required ~873s and 600MB of RAM. For an analysis of all linear toplevel chromosomes (autosomes and allosomes), this would require ~15h of time. Please consider that the current version of Rariant is under active development and computational efficiency will increase with newer versions.

5 Frequently Asked Questions

5.1 Getting help

We welcome emails with questions or suggestions about our software, and want to ensure that we eliminate issues if and when they appear. We have a few requests to optimize the process:

All emails and follow-up questions should take place over the Bioconductor mailing list, which serves as a repository of information.
The subject line should contain Rariant and a few words describing the problem. First search the Bioconductor mailing list, for past threads which might have answered your question.
If you have a question about the behavior of a function, read the sections of the manual page for this function by typing a question mark and the function name, e.g. ?rariant. Additionally, read through the vignette to understand the interplay between different functions of the package. We spend a lot of time documenting individual functions and the exact steps that the software is performing.
Include all of your R code related to the question you are asking.
Include complete warning or error messages, and conclude your message with the full output of sessionInfo().

5.2 Installing the package

Before you want to install the Rariant package, please ensure that you have the latest version of R and Bioconductor installed. For details on this, please have a look at the help packages for R and Bioconductor. Then you can open R and run the following commands which will install the latest release version of Rariant:

source("http://bioconductor.org/biocLite.R")
biocLite("Rariant")

6 References

References

[1]	Alan Agresti and Brent A. Coull. Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52(2):119126, 1998. [ http ]
[2]	Alan Agresti and Brian Caffo. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician, 54(4):280288, 2000. [ http ]
[3]	Walter W. Piegorsch. Sample sizes for improved binomial confidence intervals. Computational Statistics & Data Analysis, 46(2):309-316, June 2004. [ DOI \| http ]
[4]	Yoav Benjamini and Daniel Yekutieli. False discovery rateadjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):7181, 2005. [ http ]
[5]	Frank Schaarschmidt, Martin Sill, and Ludwig A. Hothorn. Approximate simultaneous confidence intervals for multiple contrasts of binomial proportions. Biometrical Journal, 50(5):782792, 2008. [ DOI \| http ]
[6]	The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, October 2010. [ DOI \| .html ]
[7]	Morten W. Fagerland, Stian Lydersen, and Petter Laake. Recommended confidence intervals for two independent binomial proportions. Statistical methods in medical research, 2011. [ http ]
[8]	Mirjam J. Knol, Wiebe R. Pestman, and Diederick E. Grobbee. The (mis)use of overlap of confidence intervals to assess effect modification. European Journal of Epidemiology, 26(4):253-254, April 2011. PMID: 21424218 PMCID: PMC3088813. [ DOI \| http ]
[9]	Moritz Gerstung, Christian Beisel, Markus Rechsteiner, Peter Wild, Peter Schraml, Holger Moch, and Niko Beerenwinkel. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3:811, May 2012. [ DOI \| .html ]
[10]	Omkar Muralidharan, Georges Natsoulis, John Bell, Hanlee Ji, and Nancy R. Zhang. Detecting mutations in mixed sample sequencing data using empirical bayes. The Annals of Applied Statistics, 6(3):1047-1067, September 2012. Zentralblatt MATH identifier06096521, Mathematical Reviews number (MathSciNet) MR3012520. [ DOI \| http ]
[11]	Lucy R. Yates and Peter J. Campbell. Evolution of the cancer genome. Nature Reviews Genetics, 13(11):795, November 2012. [ DOI \| .html ]
[12]	Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013. [ http ]
[13]	Geoffrey Decrouez and Peter Hall. Split sample methods for constructing confidence intervals for binomial and poisson parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), page n/an/a, 2013. [ DOI \| http ]
[14]	Alan Agresti. Categorical data analysis. Wiley, Hoboken, NJ, 2013.
[15]	Su Y. Kim and Terence P. Speed. Comparing somatic mutation-callers: beyond venn diagrams. BMC Bioinformatics, 14(1):189, June 2013. PMID: 23758877. [ DOI \| http ]
[16]	Nicola D. Roberts, R. Daniel Kortschak, Wendy T. Parker, Andreas W. Schreiber, Susan Branford, Hamish S. Scott, Garique Glonek, and David L. Adelson. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18):2223-2230, September 2013. PMID: 23842810. [ DOI \| http ]

7 Session Info

   R version 4.0.0 (2020-04-24)
   Platform: x86_64-pc-linux-gnu (64-bit)
   Running under: Ubuntu 18.04.4 LTS
   
   Matrix products: default
   BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
   LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C            
   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
   
   attached base packages:
   [1] parallel  stats4    stats     graphics  grDevices utils    
   [7] datasets  methods   base     
   
   other attached packages:
    [1] ggbio_1.36.0                Rariant_1.24.0             
    [3] bigmemory_4.5.36            VariantAnnotation_1.34.0   
    [5] Rsamtools_2.4.0             Biostrings_2.56.0          
    [7] XVector_0.28.0              SummarizedExperiment_1.18.0
    [9] DelayedArray_0.14.0         matrixStats_0.56.0         
   [11] Biobase_2.48.0              GenomicRanges_1.40.0       
   [13] GenomeInfoDb_1.24.0         IRanges_2.22.0             
   [15] S4Vectors_0.26.0            BiocGenerics_0.34.0        
   [17] ggplot2_3.3.0               knitr_1.28                 
   
   loaded via a namespace (and not attached):
     [1] backports_1.1.6          Hmisc_4.4-0             
     [3] VGAM_1.1-2               BiocFileCache_1.12.0    
     [5] NMF_0.22.0               plyr_1.8.6              
     [7] lazyeval_0.2.2           splines_4.0.0           
     [9] BiocParallel_1.22.0      gridBase_0.4-7          
    [11] digest_0.6.25            foreach_1.5.0           
    [13] ensembldb_2.12.0         htmltools_0.4.0         
    [15] magrittr_1.5             checkmate_2.0.0         
    [17] memoise_1.1.0            BSgenome_1.56.0         
    [19] cluster_2.1.0            doParallel_1.0.15       
    [21] askpass_1.1              prettyunits_1.1.1       
    [23] jpeg_0.1-8.1             colorspace_1.4-1        
    [25] blob_1.2.1               rappdirs_0.3.1          
    [27] xfun_0.13                dplyr_0.8.5             
    [29] crayon_1.3.4             RCurl_1.98-1.2          
    [31] bigmemory.sri_0.1.3      graph_1.66.0            
    [33] SomaticSignatures_2.24.0 survival_3.1-12         
    [35] iterators_1.0.12         glue_1.4.0              
    [37] registry_0.5-1           gtable_0.3.0            
    [39] zlibbioc_1.34.0          scales_1.1.0            
    [41] DBI_1.1.0                GGally_1.5.0            
    [43] rngtools_1.5             bibtex_0.4.2.2          
    [45] Rcpp_1.0.4.6             xtable_1.8-4            
    [47] progress_1.2.2           htmlTable_1.13.3        
    [49] foreign_0.8-79           bit_1.1-15.2            
    [51] proxy_0.4-24             OrganismDbi_1.30.0      
    [53] Formula_1.2-3            htmlwidgets_1.5.1       
    [55] httr_1.4.1               RColorBrewer_1.1-2      
    [57] acepack_1.4.1            ellipsis_0.3.0          
    [59] pkgconfig_2.0.3          reshape_0.8.8           
    [61] XML_3.99-0.3             farver_2.0.3            
    [63] nnet_7.3-14              dbplyr_1.4.3            
    [65] tidyselect_1.0.0         labeling_0.3            
    [67] rlang_0.4.5              reshape2_1.4.4          
    [69] later_1.0.0              AnnotationDbi_1.50.0    
    [71] munsell_0.5.0            tools_4.0.0             
    [73] RSQLite_2.2.0            evaluate_0.14           
    [75] stringr_1.4.0            fastmap_1.0.1           
    [77] bit64_0.9-7              purrr_0.3.4             
    [79] AnnotationFilter_1.12.0  RBGL_1.64.0             
    [81] mime_0.9                 biomaRt_2.44.0          
    [83] compiler_4.0.0           rstudioapi_0.11         
    [85] curl_4.3                 png_0.1-7               
    [87] tibble_3.0.1             stringi_1.4.6           
    [89] highr_0.8                GenomicFeatures_1.40.0  
    [91] exomeCopy_1.34.0         lattice_0.20-41         
    [93] ProtGenerics_1.20.0      Matrix_1.2-18           
    [95] markdown_1.1             vctrs_0.2.4             
    [97] pillar_1.4.3             lifecycle_0.2.0         
    [99] BiocManager_1.30.10      data.table_1.12.8       
   [101] bitops_1.0-6             httpuv_1.5.2            
   [103] rtracklayer_1.48.0       R6_2.4.1                
   [105] latticeExtra_0.6-29      pcaMethods_1.80.0       
   [107] promises_1.1.0           gridExtra_2.3           
   [109] codetools_0.2-16         dichromat_2.0-0         
   [111] assertthat_0.2.1         openssl_1.4.1           
   [113] pkgmaker_0.31.1          withr_2.2.0             
   [115] GenomicAlignments_1.24.0 GenomeInfoDbData_1.2.3  
   [117] hms_0.5.3                grid_4.0.0              
   [119] rpart_4.1-15             biovizBase_1.36.0       
   [121] shiny_1.4.0.2            base64enc_0.1-3