Comparative Identification and Assessment of Single Nucleotide Variants Through Shifts in Base Call Frequencies
Table of Contents
4 Supplementary Information
4.1 Strand-specific analysis
By comparing the confidence intervals between the two strands, we can further detect and characterize effects such as variations in sequencing depth and strand biases. We illustrate this with a set of hypothetical cases for confidence intervals for two strands. The upper row (cases 4-7) corresponds to sites with overlapping CIs, whereas the lower row (cases 1-3) shows cases of disagreements between the CIs indicative of strand biases. When analyzing the probability for the overlap of confidence intervals, an adjustment of the confidence level has to be taken into account [8].
Illustrative cases of confidence intervals for somatic variant frequency estimates for two strands
Motivated by the analysis of different Illumina genome and exome sequencing, we consider strand-biases, in which the non-consensus base call rates differ significantly between strands at sites with sufficient sequencinq depth, a neglectable problem with current data sets and analysis pipelines (see also Best practices for short-read processing). In the presence of strand biases, pooling the counts of both plus and minus strand may be not desirable. A possible solution may be to perform a strand-specific analysis, and later combine the resulting statistics. Gerstung and colleagues discuss different approaches for combining p-values [9], in particular taking the minimum, maximum, average, or Fisher combination. These can be also applied for confidence intervals, with Fisher's method being equivalent to taking the sum of both strands.
4.2 Assessing performance of confidence interval methods
As outlined before, an important property for assessing confidence intervals is given by their coverage probabilities. Ideally, we would expect a method to have coverage probabilities close to the nominal confidence level β over a wide range in the parameter space. Previous publications analyzing the performance focus on parameter settings that deviate from those of sequencing data sets [7]. Therefore, we perform a simulation that demonstrates the behavior of the Agresti-Caffo methods for a whole-genome sequencing study. For a fixed sequencing depth of 30 in both test and control sample, the coverage probability of 95% AC confidence intervals is computed for all possible combinations of mismatch counts \(K^{T}\) and \(K^{C}\).
## WGS n1 = 30 n2 = 30 k1 = 0:(n1-1) k2 = 0:(n2-1) cl = 0.95 n_sample = 1e4 pars = expand.grid(k1 = k1, k2 = k2, n1 = n1, n2 = n2, conf_level = cl) cp_ac = coverageProbability(pars, fun = acCi, n_sample = n_sample)
p_ac = ggplot(cp_ac) + geom_tile(aes(x = k1, y = k2, fill = cp)) + scale_fill_gradient2(midpoint = 0.95, limits = c(0.9, 1)) + theme_bw() + xlab("kT") + ylab("kC") print(p_ac)
Coverage probabilities for whole-genome setting
For mismatch rates close to 0 or 1 in both samples, the Agresti-Caffo method shows a conservative perfomance.
4.3 Benchmarking of performance and resources
For an analysis of two matched human tumor samples, we performed a benchmark to
assess the computational time and memory usage on a standard laptop (Thinkpad
X220 built in 2011). Both samples contain about 95M reads mapped to the
1000genomes reference sequence reads that are considered in the analysis. For
the analysis of chromosome 22, the analysis with default parameters required
~873s and 600MB of RAM. For an analysis of all linear toplevel chromosomes
(autosomes and allosomes), this would require ~15h of time. Please consider
that the current version of Rariant
is under active development and
computational efficiency will increase with newer versions.
5 Frequently Asked Questions
5.1 Getting help
We welcome emails with questions or suggestions about our software, and want to ensure that we eliminate issues if and when they appear. We have a few requests to optimize the process:
- All emails and follow-up questions should take place over the Bioconductor mailing list, which serves as a repository of information.
- The subject line should contain Rariant and a few words describing the problem. First search the Bioconductor mailing list, for past threads which might have answered your question.
- If you have a question about the behavior of a function, read the sections of
the manual page for this function by typing a question mark and the function
name, e.g.
?rariant
. Additionally, read through the vignette to understand the interplay between different functions of the package. We spend a lot of time documenting individual functions and the exact steps that the software is performing. - Include all of your R code related to the question you are asking.
- Include complete warning or error messages, and conclude your message with the
full output of
sessionInfo()
.
5.2 Installing the package
Before you want to install the Rariant
package, please ensure that
you have the latest version of R
and Bioconductor
installed. For details on
this, please have a look at the help packages for R and Bioconductor. Then you
can open R
and run the following commands which will install the latest
release version of Rariant
:
source("http://bioconductor.org/biocLite.R") biocLite("Rariant")
6 References
References
[1] | Alan Agresti and Brent A. Coull. Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52(2):119126, 1998. [ http ] |
[2] | Alan Agresti and Brian Caffo. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician, 54(4):280288, 2000. [ http ] |
[3] | Walter W. Piegorsch. Sample sizes for improved binomial confidence intervals. Computational Statistics & Data Analysis, 46(2):309-316, June 2004. [ DOI | http ] |
[4] | Yoav Benjamini and Daniel Yekutieli. False discovery rateadjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):7181, 2005. [ http ] |
[5] | Frank Schaarschmidt, Martin Sill, and Ludwig A. Hothorn. Approximate simultaneous confidence intervals for multiple contrasts of binomial proportions. Biometrical Journal, 50(5):782792, 2008. [ DOI | http ] |
[6] | The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, October 2010. [ DOI | .html ] |
[7] | Morten W. Fagerland, Stian Lydersen, and Petter Laake. Recommended confidence intervals for two independent binomial proportions. Statistical methods in medical research, 2011. [ http ] |
[8] | Mirjam J. Knol, Wiebe R. Pestman, and Diederick E. Grobbee. The (mis)use of overlap of confidence intervals to assess effect modification. European Journal of Epidemiology, 26(4):253-254, April 2011. PMID: 21424218 PMCID: PMC3088813. [ DOI | http ] |
[9] | Moritz Gerstung, Christian Beisel, Markus Rechsteiner, Peter Wild, Peter Schraml, Holger Moch, and Niko Beerenwinkel. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3:811, May 2012. [ DOI | .html ] |
[10] | Omkar Muralidharan, Georges Natsoulis, John Bell, Hanlee Ji, and Nancy R. Zhang. Detecting mutations in mixed sample sequencing data using empirical bayes. The Annals of Applied Statistics, 6(3):1047-1067, September 2012. Zentralblatt MATH identifier06096521, Mathematical Reviews number (MathSciNet) MR3012520. [ DOI | http ] |
[11] | Lucy R. Yates and Peter J. Campbell. Evolution of the cancer genome. Nature Reviews Genetics, 13(11):795, November 2012. [ DOI | .html ] |
[12] | Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013. [ http ] |
[13] | Geoffrey Decrouez and Peter Hall. Split sample methods for constructing confidence intervals for binomial and poisson parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), page n/an/a, 2013. [ DOI | http ] |
[14] | Alan Agresti. Categorical data analysis. Wiley, Hoboken, NJ, 2013. |
[15] | Su Y. Kim and Terence P. Speed. Comparing somatic mutation-callers: beyond venn diagrams. BMC Bioinformatics, 14(1):189, June 2013. PMID: 23758877. [ DOI | http ] |
[16] | Nicola D. Roberts, R. Daniel Kortschak, Wendy T. Parker, Andreas W. Schreiber, Susan Branford, Hamish S. Scott, Garique Glonek, and David L. Adelson. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18):2223-2230, September 2013. PMID: 23842810. [ DOI | http ] |
7 Session Info
R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils [7] datasets methods base other attached packages: [1] ggbio_1.36.0 Rariant_1.24.0 [3] bigmemory_4.5.36 VariantAnnotation_1.34.0 [5] Rsamtools_2.4.0 Biostrings_2.56.0 [7] XVector_0.28.0 SummarizedExperiment_1.18.0 [9] DelayedArray_0.14.0 matrixStats_0.56.0 [11] Biobase_2.48.0 GenomicRanges_1.40.0 [13] GenomeInfoDb_1.24.0 IRanges_2.22.0 [15] S4Vectors_0.26.0 BiocGenerics_0.34.0 [17] ggplot2_3.3.0 knitr_1.28 loaded via a namespace (and not attached): [1] backports_1.1.6 Hmisc_4.4-0 [3] VGAM_1.1-2 BiocFileCache_1.12.0 [5] NMF_0.22.0 plyr_1.8.6 [7] lazyeval_0.2.2 splines_4.0.0 [9] BiocParallel_1.22.0 gridBase_0.4-7 [11] digest_0.6.25 foreach_1.5.0 [13] ensembldb_2.12.0 htmltools_0.4.0 [15] magrittr_1.5 checkmate_2.0.0 [17] memoise_1.1.0 BSgenome_1.56.0 [19] cluster_2.1.0 doParallel_1.0.15 [21] askpass_1.1 prettyunits_1.1.1 [23] jpeg_0.1-8.1 colorspace_1.4-1 [25] blob_1.2.1 rappdirs_0.3.1 [27] xfun_0.13 dplyr_0.8.5 [29] crayon_1.3.4 RCurl_1.98-1.2 [31] bigmemory.sri_0.1.3 graph_1.66.0 [33] SomaticSignatures_2.24.0 survival_3.1-12 [35] iterators_1.0.12 glue_1.4.0 [37] registry_0.5-1 gtable_0.3.0 [39] zlibbioc_1.34.0 scales_1.1.0 [41] DBI_1.1.0 GGally_1.5.0 [43] rngtools_1.5 bibtex_0.4.2.2 [45] Rcpp_1.0.4.6 xtable_1.8-4 [47] progress_1.2.2 htmlTable_1.13.3 [49] foreign_0.8-79 bit_1.1-15.2 [51] proxy_0.4-24 OrganismDbi_1.30.0 [53] Formula_1.2-3 htmlwidgets_1.5.1 [55] httr_1.4.1 RColorBrewer_1.1-2 [57] acepack_1.4.1 ellipsis_0.3.0 [59] pkgconfig_2.0.3 reshape_0.8.8 [61] XML_3.99-0.3 farver_2.0.3 [63] nnet_7.3-14 dbplyr_1.4.3 [65] tidyselect_1.0.0 labeling_0.3 [67] rlang_0.4.5 reshape2_1.4.4 [69] later_1.0.0 AnnotationDbi_1.50.0 [71] munsell_0.5.0 tools_4.0.0 [73] RSQLite_2.2.0 evaluate_0.14 [75] stringr_1.4.0 fastmap_1.0.1 [77] bit64_0.9-7 purrr_0.3.4 [79] AnnotationFilter_1.12.0 RBGL_1.64.0 [81] mime_0.9 biomaRt_2.44.0 [83] compiler_4.0.0 rstudioapi_0.11 [85] curl_4.3 png_0.1-7 [87] tibble_3.0.1 stringi_1.4.6 [89] highr_0.8 GenomicFeatures_1.40.0 [91] exomeCopy_1.34.0 lattice_0.20-41 [93] ProtGenerics_1.20.0 Matrix_1.2-18 [95] markdown_1.1 vctrs_0.2.4 [97] pillar_1.4.3 lifecycle_0.2.0 [99] BiocManager_1.30.10 data.table_1.12.8 [101] bitops_1.0-6 httpuv_1.5.2 [103] rtracklayer_1.48.0 R6_2.4.1 [105] latticeExtra_0.6-29 pcaMethods_1.80.0 [107] promises_1.1.0 gridExtra_2.3 [109] codetools_0.2-16 dichromat_2.0-0 [111] assertthat_0.2.1 openssl_1.4.1 [113] pkgmaker_0.31.1 withr_2.2.0 [115] GenomicAlignments_1.24.0 GenomeInfoDbData_1.2.3 [117] hms_0.5.3 grid_4.0.0 [119] rpart_4.1-15 biovizBase_1.36.0 [121] shiny_1.4.0.2 base64enc_0.1-3