katdetectr 1.0.0
To install this package, start R (version “4.2”) and enter:
# Install via BioConductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("katdetectr")
library(katdetectr)
katdetectr is an R package for the detection, characterization and visualization of localized hypermutated regions, often referred to as kataegis.
The general workflow of katdetectr can be summarized as follows:
Please see the Application Note (under submission) for additional background and details of katdetectr. The Application Note also contains a section on the performance of katdetectr and other kataegis detection packages: maftools, ClusteredMutations, SeqKat, kataegis, and SigProfilerClusters.
We have made katdetectr available on BioConductor as this ensures reliability and operability on common operating systems (Linux, Mac, and Windows).
Below, the katdetectr workflow is performed in a step-by-step manner on publicly available datasets which are included within this package.
Genomic variants from multiple common data formats (VCF/MAF and VRanges objects) can be imported into katdetectr.
# Genomic variants stored within the VCF format.
pathToVCF <- system.file(package = "katdetectr", "extdata/CPTAC_Breast.vcf")
# Genomic variants stored within the MAF format.
pathToMAF <- system.file(package = "katdetectr", "extdata/APL_primary.maf")
# In addition, we can generate synthetic genomic variants, including kataegis foci,
# using generateSyntheticData(). This will output a VRanges object.
syntheticData <- generateSyntheticData(nBackgroundVariants = 2500, nKataegisFoci = 1)
Using detectKataegis(), we can employ changepoint detection to detect distinct clusters of varying IMD and size.
Imported data can contain a single sample or multiple samples; in the latter case, records can be aggregated by setting aggregateRecords = TRUE. Overlapping genomic variants (e.g., an InDel and SNV) are reduced into a single record.
From the genomic variants, we calculate the intermutation distance (IMD). The IMD is defined as the genomic distance (in bp) between a genomic variant and its nearest upstream genomic variant (5’ A <- B 3’). Next, changepoint analysis is performed on the IMDs of the genomic variants, which results in segments. Lastly, a segment is labelled as a kataegis focus if it fits the following parameters: minSizeKataegis = 6 and maxMeanIMD = 1000.
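To make the IMD definition concrete, the following base R sketch computes IMDs for a handful of made-up variant positions. It is an illustration only and not the code katdetectr runs internally. The data frame, its column names and the positions are hypothetical. The first variant on each chromosome is measured from position 0, which matches the getGenomicVariants() output for the CPTAC sample, where the first variant at chr1:935222 has an IMD of 935222.

```r
# Hypothetical variant positions (bp) on two chromosomes.
variants <- data.frame(
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"),
    pos   = c(1000, 1500, 1600, 90000, 200, 5200)
)

# Variants must be in genomic order before distances are taken.
variants <- variants[order(variants$chrom, variants$pos), ]

# IMD per chromosome: distance to the nearest upstream variant.
# The first variant of a chromosome is measured from position 0.
variants$IMD <- ave(
    variants$pos, variants$chrom,
    FUN = function(p) diff(c(0, p))
)

variants
```

Small IMD values (here 500 and 100 on chr1) indicate variants that lie close together; runs of such values are what the changepoint analysis is meant to separate from the background.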
# Detect kataegis foci within the given VCF file.
kdVCF <- detectKataegis(genomicVariants = pathToVCF)
# Detect kataegis foci within the given MAF file.
# As this file contains multiple samples, we set aggregateRecords = TRUE.
kdMAF <- detectKataegis(genomicVariants = pathToMAF, aggregateRecords = TRUE)
# Detect kataegis foci within the synthetic data.
kdSynthetic <- detectKataegis(genomicVariants = syntheticData)
All relevant input and subsequent results are stored within KatDetect objects.
Using summary(), show() and/or print(), we can generate overviews of these KatDetect object(s).
summary(kdVCF)
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
print(kdVCF)
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
show(kdVCF)
## Class 'KatDetect' : KatDetect Object
## : S4 class containing 4 slots with names:
## kataegisFoci genomicVariants segments info
##
## Created on: Tue Nov 1 17:52:59 2022
## katdetectr version: 1.0.0
##
## summary:
## --------------------------------------------------------
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## --------------------------------------------------------
# Or simply:
kdVCF
## Class 'KatDetect' : KatDetect Object
## : S4 class containing 4 slots with names:
## kataegisFoci genomicVariants segments info
##
## Created on: Tue Nov 1 17:52:59 2022
## katdetectr version: 1.0.0
##
## summary:
## --------------------------------------------------------
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## --------------------------------------------------------
Underlying data can be retrieved from a KatDetect object using the following getter functions:
getGenomicVariants() returns: VRanges object. Processed genomic variants used as input for changepoint detection. This VRanges contains the genomic location, IMD, and kataegis status of each genomic variant.
getSegments() returns: GRanges object. Contains the segments as derived from changepoint detection. This GRanges contains the genomic location, total number of variants, mean IMD, and mutation rate of each segment.
getKataegisFoci() returns: GRanges object. Contains all segments designated as putative kataegis foci according to the specified parameters (minSizeKataegis and maxMeanIMD). This GRanges contains the genomic location, total number of variants and mean IMD of each putative kataegis focus.
getInfo() returns: List object. Contains supplementary information including used parameter settings.
getGenomicVariants(kdVCF)
## VRanges object with 3684 ranges and 5 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] chr1 935222 * C A 50
## [2] chr1 949608 * G A 50
## [3] chr1 981131 * A G 50
## [4] chr1 982722 * A G 50
## [5] chr1 1164015 * C A 50
## ... ... ... ... ... ... ...
## [3680] chrX 153594977 * G A 50
## [3681] chrX 153627839 * C T 50
## [3682] chrX 153629155 * A G 50
## [3683] chrX 153668757 * G A 50
## [3684] chrX 153764217 * C T 50
## refDepth altDepth sampleNames softFilterMatrix | revmap
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <list>
## [1] 20 30 CPTAC | 1
## [2] 20 30 CPTAC | 2
## [3] 20 30 CPTAC | 3
## [4] 20 30 CPTAC | 4
## [5] 20 30 CPTAC | 5
## ... ... ... ... ... . ...
## [3680] 20 30 CPTAC | 3683
## [3681] 20 30 CPTAC | 3684
## [3682] 20 30 CPTAC | 3685
## [3683] 20 30 CPTAC | 3686
## [3684] 20 30 CPTAC | 3687
## variantID IMD segmentID putativeKataegis
## <integer> <integer> <integer> <logical>
## [1] 1 935222 1 FALSE
## [2] 2 14386 1 FALSE
## [3] 3 31523 1 FALSE
## [4] 4 1591 1 FALSE
## [5] 5 181293 1 FALSE
## ... ... ... ... ...
## [3680] 3680 442 5 FALSE
## [3681] 3681 32862 6 FALSE
## [3682] 3682 1316 6 FALSE
## [3683] 3683 39602 6 FALSE
## [3684] 3684 95460 7 FALSE
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
## hardFilters: NULL
getSegments(kdVCF)
## GRanges object with 450 ranges and 7 metadata columns:
## seqnames ranges strand | sampleNames segmentID
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 1-3389727 * | CPTAC 1
## [2] chr1 3389728-3428608 * | CPTAC 2
## [3] chr1 3428609-19199400 * | CPTAC 3
## [4] chr1 19199401-19203725 * | CPTAC 4
## [5] chr1 19203726-19635011 * | CPTAC 5
## ... ... ... ... . ... ...
## [446] chrX 3248105-152721728 * | CPTAC 3
## [447] chrX 152721729-153577918 * | CPTAC 4
## [448] chrX 153577919-153594977 * | CPTAC 5
## [449] chrX 153594978-153668757 * | CPTAC 6
## [450] chrX 153668758-155270560 * | CPTAC 7
## totalVariants firstVariantID lastVariantID meanIMD mutationRate
## <integer> <integer> <integer> <numeric> <numeric>
## [1] 11 1 11 308157.00 3.24510e-06
## [2] 4 12 15 9720.25 1.02878e-04
## [3] 22 16 37 716854.18 1.39498e-06
## [4] 4 38 41 1081.25 9.24855e-04
## [5] 8 42 49 53910.75 1.85492e-05
## ... ... ... ... ... ...
## [446] 38 3634 3671 3933516.42 2.54225e-07
## [447] 3 3672 3674 285396.67 3.50390e-06
## [448] 6 3675 3680 2843.17 3.51720e-04
## [449] 3 3681 3683 24593.33 4.06614e-05
## [450] 1 3684 3684 1601802.00 6.24297e-07
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
getKataegisFoci(kdVCF)
## GRanges object with 9 ranges and 6 metadata columns:
## seqnames ranges strand | fociID sampleNames totalVariants
## <Rle> <IRanges> <Rle> | <integer> <character> <numeric>
## [1] chr3 58108856-58111467 * | 1 CPTAC 7
## [2] chr6 32489708-32489949 * | 2 CPTAC 13
## [3] chr6 32632598-32632770 * | 3 CPTAC 8
## [4] chr6 151669875-151674326 * | 4 CPTAC 7
## [5] chr8 144991205-144999107 * | 5 CPTAC 25
## [6] chr11 62285208-62298597 * | 6 CPTAC 25
## [7] chr14 105405599-105419557 * | 7 CPTAC 23
## [8] chr15 86122654-86124712 * | 8 CPTAC 6
## [9] chr19 4510560-4513559 * | 9 CPTAC 19
## firstVariantID lastVariantID meanIMD
## <numeric> <integer> <numeric>
## [1] 782 788 435.1667
## [2] 1251 1263 20.0833
## [3] 1273 1280 24.5714
## [4] 1358 1364 741.8333
## [5] 1659 1683 329.2500
## [6] 2112 2136 557.8750
## [7] 2591 2613 634.4545
## [8] 2687 2692 411.6000
## [9] 3139 3157 166.6111
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
getInfo(kdVCF)
## $sampleName
## [1] "CPTAC"
##
## $totalGenomicVariants
## [1] 3684
##
## $totalKataegisFoci
## [1] 9
##
## $totalVariantsInKataegisFoci
## [1] 133
##
## $version
## [1] "1.0.0"
##
## $date
## [1] "Tue Nov 1 17:52:59 2022"
##
## $parameters
## $parameters$minSizeKataegis
## [1] 6
##
## $parameters$maxMeanIMD
## [1] 1000
##
## $parameters$test.stat
## [1] "Exponential"
##
## $parameters$penalty
## [1] "BIC"
##
## $parameters$pen.value
## [1] 0
##
## $parameters$minseglen
## [1] 2
##
## $parameters$aggregateRecords
## [1] FALSE
Per sample, we can visualize the IMD, detected segments and putative kataegis foci as a rainfall plot. In addition, this allows for a per-chromosome approach which can highlight the putative kataegis foci.
rainfallPlot(kdVCF)
# With showSegmentation, the detected segments (changepoints) are visualized with their mean IMD.
rainfallPlot(kdMAF, showSegmentation = TRUE)
# With showSequence, we can display specific chromosomes or all chromosomes in which a putative kataegis focus has been detected.
rainfallPlot(kdSynthetic, showKataegis = TRUE, showSegmentation = TRUE, showSequence = "Kataegis")
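For readers unfamiliar with rainfall plots, the idea can be sketched in base R: each variant is plotted in genomic order against its IMD on a log scale, so that a cluster of closely spaced variants appears as a dip. This is not how rainfallPlot() is implemented; the simulated positions and all variable names below are hypothetical, and the dashed line simply marks the 1000 bp value used as the default maxMeanIMD.

```r
set.seed(1)

# Hypothetical positions: a sparse background plus one dense cluster.
background <- sample.int(1e6, 200)
cluster    <- 500000 + cumsum(sample.int(300, 10))
pos        <- sort(unique(c(background, cluster)))

# IMD of each variant, measured from position 0 for the first variant.
imd <- diff(c(0, pos))

# Rainfall-style plot: variant order on x, IMD on a log-scaled y axis.
plot(
    seq_along(imd), imd,
    log = "y", pch = 16, cex = 0.6,
    xlab = "Variant index (genomic order)",
    ylab = "IMD (bp, log scale)"
)
abline(h = 1000, lty = 2)  # reference line at 1000 bp
```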
katdetectr has been implemented flexibly, which allows its users to detect clustered mutations of different classes. The historical definition of a kataegis focus is a segment that harbors ≥6 variants and has a mean IMD ≤1000 bp. However, these parameters can be set differently in detectKataegis().
For example, other classes of clustered mutations are doublet base substitutions (DBS), multi base substitutions (MBS), and omikli.
Note that we did not evaluate the performance of katdetectr with regard to detecting these cluster types. The following only shows how to change the parameters of detectKataegis() if you want to use katdetectr for detecting these types of clusters.
# detect putative DBS
kdSyntheticDBS <- detectKataegis(genomicVariants = syntheticData, minSizeKataegis = 2, maxMeanIMD = 0)
# detect putative MBS, size = 3
kdSyntheticMBS <- detectKataegis(genomicVariants = syntheticData, minSizeKataegis = 3, maxMeanIMD = 0)
# detect putative Omikli, size = 3 and mean IMD = 500
kdSyntheticOmikli <- detectKataegis(genomicVariants = syntheticData, minSizeKataegis = 3, maxMeanIMD = 500)
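The effect of these two thresholds can be illustrated with a small base R sketch that applies the stated rule (at least minSizeKataegis variants and a mean IMD of at most maxMeanIMD) to a made-up table of segments. The helper labelSegments() and the segment values are hypothetical and only mimic the columns returned by getSegments(); detectKataegis() may perform additional processing that is not reproduced here.

```r
# Hypothetical segment summaries, loosely modelled on getSegments() output.
segments <- data.frame(
    segmentID     = 1:4,
    totalVariants = c(11, 7, 4, 3),
    meanIMD       = c(308157, 435.2, 1081.3, 320)
)

# Hypothetical helper: TRUE for segments that satisfy both thresholds.
labelSegments <- function(segments, minSizeKataegis = 6, maxMeanIMD = 1000) {
    segments$totalVariants >= minSizeKataegis &
        segments$meanIMD <= maxMeanIMD
}

# Historical kataegis definition (>= 6 variants, mean IMD <= 1000 bp).
segments$kataegis <- labelSegments(segments)

# Looser thresholds, as in the Omikli example (>= 3 variants, mean IMD <= 500 bp).
segments$omikli <- labelSegments(segments, minSizeKataegis = 3, maxMeanIMD = 500)

segments
```

With these toy values, only segment 2 meets the historical definition, while segments 2 and 4 meet the looser thresholds.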
We tested katdetectr with multiple parameter settings (test.stat, penalty, pen.value, minseglen) in order to obtain the highest performance with regard to kataegis classification. The best combination of parameters has been set as the default values. We recommend using these parameter settings!
If you are interested, you can experiment with different parameter settings. All these parameters are passed directly to the changepoint or changepoint.np package. For more information regarding these packages, see Killick2014 or Haynes2016.
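As a rough illustration of what these parameters control, the sketch below calls the changepoint package directly on simulated IMDs, using the values reported by getInfo() above (test.stat = "Exponential", penalty = "BIC", pen.value = 0, minseglen = 2). The choice of cpt.meanvar() with the PELT search method is an assumption made for this example, the simulated data are hypothetical, and this is not presented as the exact call made by detectKataegis().

```r
library(changepoint)

set.seed(42)

# Hypothetical IMDs (bp): sparse background, a dense cluster, sparse background.
imd <- c(
    rexp(100, rate = 1 / 50000),
    rexp(20,  rate = 1 / 200),
    rexp(100, rate = 1 / 50000)
)

# Assumption: PELT search. The remaining arguments mirror the parameter
# values listed in the getInfo() output.
fit <- cpt.meanvar(
    data      = imd,
    method    = "PELT",
    test.stat = "Exponential",
    penalty   = "BIC",
    pen.value = 0,
    minseglen = 2
)

# Indices at which a segment ends and the next one begins.
cpts(fit)
```

Changing penalty or minseglen alters how readily the series is split into segments, which in turn changes which segments can qualify as putative kataegis foci.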
utils::sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] katdetectr_1.0.0 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 rjson_0.2.21
## [3] ellipsis_0.3.2 BSgenome.Hsapiens.UCSC.hg38_1.4.4
## [5] DNAcopy_1.72.0 markdown_1.3
## [7] XVector_0.38.0 GenomicRanges_1.50.0
## [9] gridtext_0.1.5 ggtext_0.1.2
## [11] farver_2.1.1 bit64_4.0.5
## [13] AnnotationDbi_1.60.0 fansi_1.0.3
## [15] xml2_1.3.3 codetools_0.2-18
## [17] splines_4.2.1 cachem_1.0.6
## [19] knitr_1.40 maftools_2.14.0
## [21] jsonlite_1.8.3 Rsamtools_2.14.0
## [23] dbplyr_2.2.1 png_0.1-7
## [25] BiocManager_1.30.19 compiler_4.2.1
## [27] httr_1.4.4 backports_1.4.1
## [29] assertthat_0.2.1 Matrix_1.5-1
## [31] fastmap_1.1.0 cli_3.4.1
## [33] htmltools_0.5.3 prettyunits_1.1.1
## [35] tools_4.2.1 gtable_0.3.1
## [37] glue_1.6.2 GenomeInfoDbData_1.2.9
## [39] dplyr_1.0.10 rappdirs_0.3.3
## [41] Rcpp_1.0.9 Biobase_2.58.0
## [43] jquerylib_0.1.4 vctrs_0.5.0
## [45] Biostrings_2.66.0 rtracklayer_1.58.0
## [47] changepoint_2.2.3 xfun_0.34
## [49] stringr_1.4.1 plyranges_1.18.0
## [51] rbibutils_2.2.9 lifecycle_1.0.3
## [53] restfulr_0.0.15 XML_3.99-0.12
## [55] zlibbioc_1.44.0 zoo_1.8-11
## [57] scales_1.2.1 BSgenome_1.66.0
## [59] VariantAnnotation_1.44.0 hms_1.1.2
## [61] MatrixGenerics_1.10.0 parallel_4.2.1
## [63] SummarizedExperiment_1.28.0 RColorBrewer_1.1-3
## [65] yaml_2.3.6 curl_4.3.3
## [67] memoise_2.0.1 ggplot2_3.3.6
## [69] sass_0.4.2 biomaRt_2.54.0
## [71] stringi_1.7.8 RSQLite_2.2.18
## [73] highr_0.9 S4Vectors_0.36.0
## [75] BiocIO_1.8.0 checkmate_2.1.0
## [77] GenomicFeatures_1.50.0 BiocGenerics_0.44.0
## [79] filelock_1.0.2 BiocParallel_1.32.0
## [81] GenomeInfoDb_1.34.0 commonmark_1.8.1
## [83] Rdpack_2.4 rlang_1.0.6
## [85] pkgconfig_2.0.3 matrixStats_0.62.0
## [87] bitops_1.0-7 evaluate_0.17
## [89] lattice_0.20-45 purrr_0.3.5
## [91] labeling_0.4.2 GenomicAlignments_1.34.0
## [93] bit_4.0.4 tidyselect_1.2.0
## [95] BSgenome.Hsapiens.UCSC.hg19_1.4.3 magrittr_2.0.3
## [97] bookdown_0.29 R6_2.5.1
## [99] magick_2.7.3 IRanges_2.32.0
## [101] generics_0.1.3 DelayedArray_0.24.0
## [103] DBI_1.1.3 pillar_1.8.1
## [105] withr_2.5.0 survival_3.4-0
## [107] KEGGREST_1.38.0 RCurl_1.98-1.9
## [109] tibble_3.1.8 crayon_1.5.2
## [111] utf8_1.2.2 BiocFileCache_2.6.0
## [113] rmarkdown_2.17 progress_1.2.2
## [115] grid_4.2.1 data.table_1.14.4
## [117] blob_1.2.3 digest_0.6.30
## [119] tidyr_1.2.1 stats4_4.2.1
## [121] munsell_0.5.0 bslib_0.4.0