This vignette outlines a workflow of parsing and plotting structural variants from Variant Call Format (VCF) using the StructuralVariantAnnotation
package. StructuralVariantAnnotation
contains useful helper functions for reading and interpreting structural variants calls. The packages contains functions for parsing VCFs from a number of popular callers as well as functions for dealing with breakpoints involving two separate genomic loci encoded as GRanges
objects.
The StructuralVariationAnnotation package can be installed from Bioconductor as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("StructuralVariantAnnotation")
The VCF standard describes two types of SV notations. One is by SV types, i.e. insertions, deletions, inversions, translocations, etc. The other is by breakend notations, often labelled with SVTYPE=BND
. To describe a SV with breakend notations, each SV has two positions, each captured by one breakend (except for inversions, which has 4 separate records). Each breakend includes a genomic locus, as well as a half interval extending out to the partner breakend. In VCF BND notations, the ALT
field encodes directional information of the partner breakend.
Unlike breakpoint-centric data structures such as the Pairs
object that rtracklayer
uses to load BEDPE files, this package uses a breakend-centric notation. Breakends are stored in a GRanges object with strand used it indicate breakpoint orientation. Consistent with how breakends are encoded in VCF +
indicates that the breakpoint occurs immediately after the given position, with -
indicating the breakpoint occurs immedaitely before the given position. Breakpoints are represented using a partner
field containing the name of the breakend at the other side of the breakend. Both single breakends and breakpoints are supported but many-to-many breakend partner mapping supported by the VCF MATEID
field are not: each breakend must have 0 (single breakend) or 1 (breakpoint) partner breakends.
This notation was chosen as it simplifies many common operations and annotations are breakend-level annotations. These include annotation associated with genomic positions (e.g. genes, repeats, mappability), as well as breakend-level attributes of a breakpoint such as variant allele fractions (e.g. a structural variant can be homozygous at one breakend, but heterzygous at the other breakend).
VCF data is parsed into a VCF
object using readVCF
function from Bioconductor package VariantAnnotation
. Simple filters could be applied to a VCF
object to remove unwanted calls. More information about VCF
objects can be found by consulting the vignetts in the VariantAnnotation package with browseVignettes("VariantAnnotation")
.
StructuralVariantAnnotation support structural variants reported in the following VCF notations:
SVTYPE
of DEL
, INS
, and DUP
.SVTYPE=BND
In addition to parsing spec-compliant VCFs, additional logic has been added to enable parsing of non-compliant variants for the following callers:
SVTYPE=RPL
)INv3
, INV5
fields)SVTYPE=TRA
, CHR2
, CT
fields)SVTYPE=CTX
)Breakpoint ambiguity reported in using the spec-defined CIPOS
is, by default, incorporated into the GRanges breakend intervals.
suppressPackageStartupMessages(require(StructuralVariantAnnotation))
suppressPackageStartupMessages(require(VariantAnnotation))
vcf.file <- system.file("extdata", "gridss.vcf", package = "StructuralVariantAnnotation")
vcf <- VariantAnnotation::readVcf(vcf.file, "hg19")
gr <- breakpointRanges(vcf)
partner()
returns the breakpoint GRanges
object with the order rearranged such that the partner breakend on the other side of each breakpoint corresponds with the local breakend.
partner(gr)
#> GRanges object with 4 ranges and 12 metadata columns:
#> seqnames ranges strand | paramRangeID REF
#> <Rle> <IRanges> <Rle> | <factor> <character>
#> gridss2h chr12 84963533 - | <NA> G
#> gridss39h chr12 4886681 + | <NA> T
#> gridss39o chr12 84350 - | <NA> G
#> gridss2o chr1 18992158 + | <NA> C
#> ALT QUAL FILTER
#> <character> <numeric> <character>
#> gridss2h ]chr1:18992158]AG 55 LOW_QUAL;SINGLE_ASSEMBLY
#> gridss39h T[chr12:84350[ 627.96 .
#> gridss39o ]chr12:4886681]G 627.96 .
#> gridss2o CA[chr12:84963533[ 55 LOW_QUAL;SINGLE_ASSEMBLY
#> sourceId partner svtype svLen insSeq
#> <character> <character> <character> <numeric> <character>
#> gridss2h gridss2h gridss2o BND <NA> A
#> gridss39h gridss39h gridss39o BND 4802330
#> gridss39o gridss39o gridss39h BND 4802330
#> gridss2o gridss2o gridss2h BND <NA> A
#> insLen HOMLEN
#> <integer> <numeric>
#> gridss2h 1 0
#> gridss39h 0 0
#> gridss39o 0 0
#> gridss2o 1 0
#> -------
#> seqinfo: 84 sequences from hg19 genome
Single breakends are loaded using the breakendRanges()
function. The GRanges
object is of the same form as breakpointRanges()
but as the breakend partner is not specified, the partner is NA. A single GRanges object can contain both breakend and breakpoint variants.
colo829_vcf <- VariantAnnotation::readVcf(system.file("extdata", "COLO829T.purple.sv.ann.vcf.gz", package = "StructuralVariantAnnotation"))
colo829_bpgr <- breakpointRanges(colo829_vcf)
colo829_begr <- breakendRanges(colo829_vcf)
colo829_gr <- sort(c(colo829_begr, colo829_bpgr))
colo829_gr[seqnames(colo829_gr) == "6"]
#> GRanges object with 10 ranges and 12 metadata columns:
#> seqnames ranges strand | paramRangeID
#> <Rle> <IRanges> <Rle> | <factor>
#> gridss36_2106o 6 26194117 + | <NA>
#> gridss36_2106h 6 26194406 + | <NA>
#> gridss37_233635b 6 65298376 + | <NA>
#> gridss38_18403o 6 94917253-94917268 + | <NA>
#> gridss40_299o 6 138774180-138774182 + | <NA>
#> gridss41_7816o 6 168570432-168570448 + | <NA>
#> gridss17_45233h 6 26194039-26194041 - | <NA>
#> gridss38_18403h 6 94917320-94917335 - | <NA>
#> gridss40_35285o 6 138774059 - | <NA>
#> gridss41_7816h 6 168570465-168570481 - | <NA>
#> REF
#> <character>
#> gridss36_2106o G
#> gridss36_2106h A
#> gridss37_233635b G
#> gridss38_18403o T
#> gridss40_299o T
#> gridss41_7816o G
#> gridss17_45233h A
#> gridss38_18403h T
#> gridss40_35285o T
#> gridss41_7816h T
#> ALT
#> <character>
#> gridss36_2106o G]6:26194406]
#> gridss36_2106h A]6:26194117]
#> gridss37_233635b GTGTTTTTTTCTCTGTGTTGTCTTTTTTTTTTTTTCTTTTTAGAATTCCGTGTAAACAAAATCACGCAGTATTTAGCCTTTTTGTTTTAATTTTTGTTTGTTTAATAGAAATGGGAGGCTTTGTTATATTGCCTATGCTGGCCTCAAGCTCCTGGTCTCAAGGGATCCCCCTGCTTGACTCAGTATGTGGCTTCTTGAGACGGACGTCTTTTATTTAACCTAATGTATTTTGAGATTCATCCATGTTTTTATGTATCAGTAATTACTTTCTGTTGCTTAGTAGCATTGCGTTGTATGGATGTACAACAGGTTCTGTATTCATTCCCCAGTTCATTTGGGTTGTTTCCAGTTTTTGGTAATTACGAATAAAACTGCCATAAAAGCATTCATGCATACATACATACACACATGTGCCCTCGTATTTTCTTATGGTTTAAAAAGATATGGTGCCTAGAACTTTTATAACTTTACTACAGAACCTGAAAAAGCTGATGATTTTCACAGAACATTGTAAATTGCTTAGTAAACTTCATCCCCCAAAAAGCCCACTCTGGAATGAGAATAATGTGTTTGTATAAATAATCTTGTGGTATAAACTGTAAGTCATTAGAATTTTTTAAATTAAAGAAGTACATACACATATTTATTTAATGGGTAAATTTATATATAAAACTCCTTAGTGCA.
#> gridss38_18403o T[6:94917327[
#> gridss40_299o T]15:23712618]
#> gridss41_7816o G[6:168570473[
#> gridss17_45233h [3:26431918[A
#> gridss38_18403h ]6:94917260]T
#> gridss40_35285o [15:23717166[GTATATTATCT
#> gridss41_7816h ]6:168570440]T
#> QUAL FILTER sourceId svtype
#> <numeric> <character> <character> <character>
#> gridss36_2106o 2500.79 PASS gridss36_2106o BND
#> gridss36_2106h 2500.79 PASS gridss36_2106h BND
#> gridss37_233635b 2871.65 PASS gridss37_233635b BND
#> gridss38_18403o 411.4 PON gridss38_18403o BND
#> gridss40_299o 3009.75 PASS gridss40_299o BND
#> gridss41_7816o 4495.46 PON gridss41_7816o BND
#> gridss17_45233h 3842.54 PASS gridss17_45233h BND
#> gridss38_18403h 411.4 PON gridss38_18403h BND
#> gridss40_35285o 2213.96 PASS gridss40_35285o BND
#> gridss41_7816h 4495.46 PON gridss41_7816h BND
#> svLen
#> <numeric>
#> gridss36_2106o 288
#> gridss36_2106h 288
#> gridss37_233635b <NA>
#> gridss38_18403o -66
#> gridss40_299o <NA>
#> gridss41_7816o -32
#> gridss17_45233h <NA>
#> gridss38_18403h -66
#> gridss40_35285o <NA>
#> gridss41_7816h -32
#> insSeq
#> <character>
#> gridss36_2106o
#> gridss36_2106h
#> gridss37_233635b TGTTTTTTTCTCTGTGTTGTCTTTTTTTTTTTTTCTTTTTAGAATTCCGTGTAAACAAAATCACGCAGTATTTAGCCTTTTTGTTTTAATTTTTGTTTGTTTAATAGAAATGGGAGGCTTTGTTATATTGCCTATGCTGGCCTCAAGCTCCTGGTCTCAAGGGATCCCCCTGCTTGACTCAGTATGTGGCTTCTTGAGACGGACGTCTTTTATTTAACCTAATGTATTTTGAGATTCATCCATGTTTTTATGTATCAGTAATTACTTTCTGTTGCTTAGTAGCATTGCGTTGTATGGATGTACAACAGGTTCTGTATTCATTCCCCAGTTCATTTGGGTTGTTTCCAGTTTTTGGTAATTACGAATAAAACTGCCATAAAAGCATTCATGCATACATACATACACACATGTGCCCTCGTATTTTCTTATGGTTTAAAAAGATATGGTGCCTAGAACTTTTATAACTTTACTACAGAACCTGAAAAAGCTGATGATTTTCACAGAACATTGTAAATTGCTTAGTAAACTTCATCCCCCAAAAAGCCCACTCTGGAATGAGAATAATGTGTTTGTATAAATAATCTTGTGGTATAAACTGTAAGTCATTAGAATTTTTTAAATTAAAGAAGTACATACACATATTTATTTAATGGGTAAATTTATATATAAAACTCCTTAGTGCA
#> gridss38_18403o
#> gridss40_299o
#> gridss41_7816o
#> gridss17_45233h
#> gridss38_18403h
#> gridss40_35285o GTATATTATC
#> gridss41_7816h
#> insLen HOMLEN partner
#> <integer> <numeric> <character>
#> gridss36_2106o 0 0 gridss36_2106h
#> gridss36_2106h 0 0 gridss36_2106o
#> gridss37_233635b 683 0 <NA>
#> gridss38_18403o 0 15 gridss38_18403h
#> gridss40_299o 0 2 gridss40_299h
#> gridss41_7816o 0 16 gridss41_7816h
#> gridss17_45233h 0 2 gridss17_45233o
#> gridss38_18403h 0 15 gridss38_18403o
#> gridss40_35285o 10 0 gridss40_35285h
#> gridss41_7816h 0 16 gridss41_7816o
#> -------
#> seqinfo: 25 sequences from an unspecified genome
Functions such as findBreakpointOverlaps()
require the GRanges
object to be composed entirely of valid breakpoints. Subsetting a breakpoint GRanges
object can results in one side of a breakpoint getting filtered with the remaining orphaned record no longer valid as its partner no longer exists. Such record can be filtered
colo828_chr6_breakpoints <- colo829_gr[seqnames(colo829_gr) == "6"]
# A call to findBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints)
# will fail as there are a) single breakends, and b) breakpoints with missing partners
colo828_chr6_breakpoints <- colo828_chr6_breakpoints[colo828_chr6_breakpoints$partner %in% names(colo828_chr6_breakpoints)]
# As expected, each call on chr6 only overlaps with itself
countBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints)
#> [1] 1 1 1 1 1 1
Note that if you did want to include inter-chromosomal breakpoints involving chromosome 6, you would need to update the filtering criteria to include records with chr6 on either side. In such cases, the filtering logic can be simplified by the selfPartnerSingleBreakends
parameter of partner()
. When selfPartnerSingleBreakends=TRUE
, the partner of single breakend events is considered to be the single breakend itself.
colo828_chr6_breakpoints <- colo829_gr[
seqnames(colo829_gr) == "6" |
seqnames(partner(colo829_gr, selfPartnerSingleBreakends=TRUE)) == "6"]
# this way we keep the chr3<->chr6 breakpoint and don't create any orphans
head(colo828_chr6_breakpoints, 1)
#> GRanges object with 1 range and 12 metadata columns:
#> seqnames ranges strand | paramRangeID
#> <Rle> <IRanges> <Rle> | <factor>
#> gridss17_45233o 3 26431917-26431919 - | <NA>
#> REF ALT QUAL FILTER
#> <character> <character> <numeric> <character>
#> gridss17_45233o T [6:26194040[T 3882.79 PASS
#> sourceId svtype svLen insSeq
#> <character> <character> <numeric> <character>
#> gridss17_45233o gridss17_45233o BND <NA>
#> insLen HOMLEN partner
#> <integer> <numeric> <character>
#> gridss17_45233o 0 2 gridss17_45233h
#> -------
#> seqinfo: 25 sequences from an unspecified genome
findBreakpointOverlaps()
and countBreakpointOverlaps()
are functions for finding and counting overlaps between breakpoint objects. All breakends must have their partner breakend included in the GRanges. A valid overlap requires that breakends on boths sides overlap.
To demonstrate countBreakpointOverlaps()
function, we use a small subset of data from our structural variant caller benchmarking paper to construct precision recall curves for a pair of callers.
truth_vcf <- readVcf(system.file("extdata", "na12878_chr22_Sudmunt2015.vcf", package = "StructuralVariantAnnotation"))
truth_svgr <- breakpointRanges(truth_vcf)
truth_svgr <- truth_svgr[seqnames(truth_svgr) == "chr22"]
crest_vcf <- readVcf(system.file("extdata", "na12878_chr22_crest.vcf", package = "StructuralVariantAnnotation"))
# Some SV callers don't report QUAL so we need to use a proxy
VariantAnnotation::fixed(crest_vcf)$QUAL <- info(crest_vcf)$left_softclipped_read_count + info(crest_vcf)$left_softclipped_read_count
crest_svgr <- breakpointRanges(crest_vcf)
crest_svgr$caller <- "crest"
hydra_vcf <- readVcf(system.file("extdata", "na12878_chr22_hydra.vcf", package = "StructuralVariantAnnotation"))
hydra_svgr <- breakpointRanges(hydra_vcf)
hydra_svgr$caller <- "hydra"
svgr <- c(crest_svgr, hydra_svgr)
svgr$truth_matches <- countBreakpointOverlaps(svgr, truth_svgr,
# read pair based callers make imprecise calls.
# A margin around the call position is required when matching with the truth set
maxgap=100,
# Since we added a maxgap, we also need to restrict the mismatch between then
# size of the events. We don't want to match a 100bp deletion with a
# 5bp dupliaction. This will happen if we have a 100bp margin but don't also
# require an approximate size match as well
sizemargin=0.25,
# We also don't want to match a 20bp deletion with a 20bp deletion 80bp away
# by restricting the margin based on the size of the event, we can make sure
# that simple events actually do overlap
restrictMarginToSizeMultiple=0.5,
# HYDRA makes duplicate calls and will sometimes report a variant multiple
# times with slightly different bounds. countOnlyBest prevents these being
# double-counted as multiple true positives.
countOnlyBest=TRUE)
Once we know which calls match the truth set, we can generate Precision-Recall and ROC curves for each caller using one of the many ROC R packages, or directly with dplyr.
suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(require(ggplot2))
ggplot(as.data.frame(svgr) %>%
dplyr::select(QUAL, caller, truth_matches) %>%
dplyr::group_by(caller, QUAL) %>%
dplyr::summarise(
calls=n(),
tp=sum(truth_matches > 0)) %>%
dplyr::group_by(caller) %>%
dplyr::arrange(dplyr::desc(QUAL)) %>%
dplyr::mutate(
cum_tp=cumsum(tp),
cum_n=cumsum(calls),
cum_fp=cum_n - cum_tp,
Precision=cum_tp / cum_n,
Recall=cum_tp/length(truth_svgr))) +
aes(x=Recall, y=Precision, colour=caller) +
geom_point() +
geom_line() +
labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")
The package supports converting GRanges objects to BEDPE files. The BEDPE format is defined by bedtools
. This is achieved using breakpointgr2pairs
and pairs2breakpointgr
functions to convert to and from the GRanges Pairs
notation used by rtracklayer
suppressPackageStartupMessages(require(rtracklayer))
# Export to BEDPE
rtracklayer::export(breakpointgr2pairs(gr), con="gridss.bedpe")
# Import to BEDPE
bedpe.gr <- pairs2breakpointgr(rtracklayer::import("gridss.bedpe"))
One way of visualising paired breakpoints is by circos plots. Here we use package circlize
to demonstrate breakpoint visualisation. The bedpe2circos
function takes BEDPE-formatted dataframes (see breakpointgr2bedpe()
) and plotting parameters for circos.initializeWithIdeogram()
and circos.genomicLink()
functions from circlize
.
To generate a simple circos plot of paired breakpoints:
suppressPackageStartupMessages(require(circlize))
colo829_bpgr_with_chr_prefix <- colo829_bpgr
seqlevelsStyle(colo829_bpgr_with_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(colo829_bpgr_with_chr_prefix)
circos.initializeWithIdeogram()
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), as.data.frame(S4Vectors::second(pairs)))
circos.clear()
Alternatively, plotting package such as ggbio
provides flexible tracks function which binds with ggplot2
objects. It takes GRanges
objects as input and support circos plots. To plot structural variant breakpoints in a circos plot using ggbio
, we need to first prepare the breakpoint GRanges. The function requires a sepecial column, indicating the end of the link using GRanges format, which we can add to gr
using plyranges
.
suppressPackageStartupMessages(require(ggbio))
gr.circos <- colo829_bpgr[seqnames(colo829_bpgr) %in% seqlevels(biovizBase::hg19sub)]
seqlevels(gr.circos) <- seqlevels(biovizBase::hg19sub)
mcols(gr.circos)$to.gr <- granges(partner(gr.circos))
We can then plot the breakpoints against reference genomes.
p <- ggbio() +
circle(gr.circos, geom="link", linked.to="to.gr") +
circle(biovizBase::hg19sub, geom='ideo', fill='gray70') +
circle(biovizBase::hg19sub, geom='scale', size=2) +
circle(biovizBase::hg19sub, geom='text', aes(label=seqnames), vjust=0, size=3)
#> Warning in recycleSingleBracketReplacementValue(value, x, nsbs): number
#> of values supplied is not a sub-multiple of the number of values to be
#> replaced
p
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] ggbio_1.32.0 circlize_0.4.6
#> [3] ggplot2_3.1.1 dplyr_0.8.0.1
#> [5] StructuralVariantAnnotation_1.0.0 VariantAnnotation_1.30.0
#> [7] Rsamtools_2.0.0 Biostrings_2.52.0
#> [9] XVector_0.24.0 SummarizedExperiment_1.14.0
#> [11] DelayedArray_0.10.0 BiocParallel_1.18.0
#> [13] matrixStats_0.54.0 Biobase_2.44.0
#> [15] rtracklayer_1.44.0 GenomicRanges_1.36.0
#> [17] GenomeInfoDb_1.20.0 IRanges_2.18.0
#> [19] S4Vectors_0.22.0 BiocGenerics_0.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.16.0 bitops_1.0-6
#> [3] bit64_0.9-7 RColorBrewer_1.1-2
#> [5] progress_1.2.0 httr_1.4.0
#> [7] tools_3.6.0 backports_1.1.4
#> [9] R6_2.4.0 rpart_4.1-15
#> [11] Hmisc_4.2-0 DBI_1.0.0
#> [13] lazyeval_0.2.2 colorspace_1.4-1
#> [15] nnet_7.3-12 withr_2.1.2
#> [17] tidyselect_0.2.5 gridExtra_2.3
#> [19] prettyunits_1.0.2 GGally_1.4.0
#> [21] curl_3.3 bit_1.1-14
#> [23] compiler_3.6.0 graph_1.62.0
#> [25] htmlTable_1.13.1 labeling_0.3
#> [27] scales_1.0.0 checkmate_1.9.1
#> [29] RBGL_1.60.0 stringr_1.4.0
#> [31] digest_0.6.18 foreign_0.8-71
#> [33] rmarkdown_1.12 dichromat_2.0-0
#> [35] base64enc_0.1-3 pkgconfig_2.0.2
#> [37] htmltools_0.3.6 ensembldb_2.8.0
#> [39] BSgenome_1.52.0 htmlwidgets_1.3
#> [41] rlang_0.3.4 GlobalOptions_0.1.0
#> [43] rstudioapi_0.10 RSQLite_2.1.1
#> [45] shape_1.4.4 acepack_1.4.1
#> [47] RCurl_1.95-4.12 magrittr_1.5
#> [49] GenomeInfoDbData_1.2.1 Formula_1.2-3
#> [51] Matrix_1.2-17 Rcpp_1.0.1
#> [53] munsell_0.5.0 stringi_1.4.3
#> [55] yaml_2.2.0 zlibbioc_1.30.0
#> [57] plyr_1.8.4 grid_3.6.0
#> [59] blob_1.1.1 crayon_1.3.4
#> [61] lattice_0.20-38 splines_3.6.0
#> [63] GenomicFeatures_1.36.0 hms_0.4.2
#> [65] knitr_1.22 pillar_1.3.1
#> [67] reshape2_1.4.3 biomaRt_2.40.0
#> [69] XML_3.98-1.19 glue_1.3.1
#> [71] evaluate_0.13 biovizBase_1.32.0
#> [73] latticeExtra_0.6-28 BiocManager_1.30.4
#> [75] data.table_1.12.2 gtable_0.3.0
#> [77] purrr_0.3.2 reshape_0.8.8
#> [79] assertthat_0.2.1 xfun_0.6
#> [81] AnnotationFilter_1.8.0 survival_2.44-1.1
#> [83] OrganismDbi_1.26.0 tibble_2.1.1
#> [85] GenomicAlignments_1.20.0 AnnotationDbi_1.46.0
#> [87] memoise_1.1.0 cluster_2.0.9