Currently, VCF
objects of the VariantAnnotation package may be subsetted by indices or names of variant records. The TVTB package extends FilterRules
of the S4Vectors package to provide filter rules readily applicable to individual slots of VCF
objects. These new classes of filter rules provide containers for powerful expressions and functions to facilitate filtering of variants using combinations of filters applicable to FIXED fields, INFO fields, and Ensembl VEP predictions stored in a given INFO field of VCF files.
TVTB 1.2.0
VCF
objects of the VariantAnnotation package contain a plethora of information imported from specific fields of source VCF files and stored in dedicated slots (e.g. fixed
, info
, geno
), as well as optional Ensembl VEP predictions (McLaren et al. 2010) stored under a given key of their INFO slot.
This information may be used to identify and filter variants of interest for further analysis. However, the size of genetic data sets and the variety of filter rules—and their combinatorial explosion—create considerable challenges in terms of workspace memory and entropy (i.e. size and number of objects in the workspace, respectively).
The FilterRules
class implemented in the S4Vectors package provides a powerful tool to create flexible and lightweight filter rules defined in the form of expression
and function
objects that can be evaluated within given environments. The TVTB package extends this FilterRules
class into novel classes of VCF filter rules, applicable to information stored in the distinct slots of VCF
objects (i.e. CollapsedVCF
and ExpandedVCF
classes), as described below:
Class | Motivation |
---|---|
VcfFixedRules |
Filter rules applied to the fixed slot of a VCF object. |
VcfInfoRules |
Filter rules applied to the info slot of a VCF object. |
VcfVepRules |
Filter rules applied to the Ensembl VEP predictions stored in a given INFO key of a VCF object. |
VcfFilterRules |
Combination of VcfFixedRules , VcfInfoRules , and VcfVepRules applicable to a VCF object. |
Table: Motivation for each of the new classes extending FilterRules
to define VCF filter rules.
Note that FilterRules
objects themselves are applicable to VCF
objects, with two important difference from the above specialised classes:
VCF
slotsVCF
slots, for instance:fr <- S4Vectors::FilterRules(list(
mixed = function(x){
VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
VariantAnnotation::info(x)[,"MAF"] >= 0.05
}
))
fr
## FilterRules of length 1
## names(1): mixed
As they inherit from the FilterRules
class, these new classes benefit from accessors and methods defined for their parent class, including:
To account for the more complex structure of VCF
objects, some of the new VCF filter rules classes implemented in the TVTB package require additional information stored in new dedicated slots, associated with the appropriate accessors and setters. For instance:
VcfVepRules
require the INFO key where predictions of the Ensembl Variant Effect Predictor are stored in a VCF
object. The vep
accessor method may be used to access this slot.VcfFilterRules
—which may combine any number of filter rules stored in FixedRules
, VcfFixedRules
, VcfInfoRules
, VcfVepRules
, and other VcfFilterRules
objects— mark each filter rule with their type in the combined object. The information is stored in the type
slot, which may be accessed using the read-only accessor method type
.For the purpose of demonstrating the utility and usage of VCF filter rules, a set of variants and associated phenotype information was obtained from the 1000 Genomes Project Phase 3 release. It can be imported as a CollapsedVCF
object using the following code:
library(TVTB)
extdata <- system.file("extdata", package = "TVTB")
vcfFile <- file.path(extdata, "chr15.phase3_integrated.vcf.gz")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)
vcf <- VariantAnnotation::readVcf(file = tabixVcf)
VCF filter rules may be applied to ExpandedVCF
objects equally:
evcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)
As described in the documentation of the VariantAnnotation package, the key difference between CollapsedVCF
and ExpandedVCF
objects —both extending the VCF
class—is the expansion of multi-allelic records into bi-allelic records, respectively. In other words (quoting the VariantAnnotation documentation):
“
CollapsedVCF
objects contains the ALT data as aDNAStringSetList
allowing for multiple alleles per variant. In contrast, theExpandedVCF
stores the ALT data as aDNAStringSet
where the ALT column has been expanded to create a flat form of the data with one row per variant-allele combination.”
This difference has implications for filter rules using the "ALT"
field of the info
slot, as demonstrated in a later section.
First, let us examine which fields (i.e. column names) are available in the VCF
objects to create VCF filter rules:
fixedVcf <- colnames(fixed(vcf))
fixedVcf
## [1] "REF" "ALT" "QUAL" "FILTER"
infoVcf <- colnames(info(vcf))
infoVcf
## [1] "CIEND" "CIPOS" "CS" "END"
## [5] "IMPRECISE" "MC" "MEINFO" "MEND"
## [9] "MLEN" "MSTART" "SVLEN" "SVTYPE"
## [13] "TSD" "AC" "AF" "NS"
## [17] "AN" "EAS_AF" "EUR_AF" "AFR_AF"
## [21] "AMR_AF" "SAS_AF" "DP" "AA"
## [25] "VT" "EX_TARGET" "MULTI_ALLELIC" "CSQ"
csq <- ensemblVEP::parseCSQToGRanges(x = evcf)
vepVcf <- colnames(mcols(csq))
vepVcf
## [1] "Allele" "Consequence" "IMPACT"
## [4] "SYMBOL" "Gene" "Feature_type"
## [7] "Feature" "BIOTYPE" "EXON"
## [10] "INTRON" "HGVSc" "HGVSp"
## [13] "cDNA_position" "CDS_position" "Protein_position"
## [16] "Amino_acids" "Codons" "Existing_variation"
## [19] "DISTANCE" "STRAND" "FLAGS"
## [22] "VARIANT_CLASS" "SYMBOL_SOURCE" "HGNC_ID"
## [25] "CANONICAL" "TSL" "APPRIS"
## [28] "CCDS" "ENSP" "SWISSPROT"
## [31] "TREMBL" "UNIPARC" "GENE_PHENO"
## [34] "SIFT" "PolyPhen" "DOMAINS"
## [37] "HGVS_OFFSET" "GMAF" "AFR_MAF"
## [40] "AMR_MAF" "EAS_MAF" "EUR_MAF"
## [43] "SAS_MAF" "AA_MAF" "EA_MAF"
## [46] "ExAC_MAF" "ExAC_Adj_MAF" "ExAC_AFR_MAF"
## [49] "ExAC_AMR_MAF" "ExAC_EAS_MAF" "ExAC_FIN_MAF"
## [52] "ExAC_NFE_MAF" "ExAC_OTH_MAF" "ExAC_SAS_MAF"
## [55] "CLIN_SIG" "SOMATIC" "PHENO"
## [58] "PUBMED" "MOTIF_NAME" "MOTIF_POS"
## [61] "HIGH_INF_POS" "MOTIF_SCORE_CHANGE" "CADD_PHRED"
## [64] "CADD_RAW"
The value of a particular field can be used to define expressions that represent simple filter rules based on that value alone. Multiple rules may be stored in any one FilterRules
objects. Ideally, VCF filter rules should be named to facilitate their use, but also as a reminder of the purpose of each particular rule. For instance, in the chunk of code below, two filter rules are defined using fields of the fixed
slot:
"pass"
identifies variants for which the value in the FILTER
field is "PASS"
"qual20"
identifies variants where the value in the QUAL
field is greater than or equal to 20
fixedRules <- VcfFixedRules(exprs = list(
pass = expression(FILTER == "PASS"),
qual20 = expression(QUAL >= 20)
))
active(fixedRules)["qual20"] <- FALSE
summary(evalSeparately(fixedRules, vcf))
## pass qual20
## Mode:logical Mode:logical
## TRUE:479 TRUE:479
In the example above, all variants pass the active "pass"
filter, while the deactivated rules "qual20"
) automatically returns TRUE
for all variants.
It is also possible for VCF filter rules to use multiple fields (of the same VCF
slot) in a single expression. In the chunk of code below, the VCF filter rule identifies variants for which both the "REF"
and "ALT"
values (in the INFO slot) are one of the four nucleotides (i.e. a simple definition of single nucleotide polymorphisms; SNPs):
nucleotides <- c("A", "T", "G", "C")
SNPrule <- VcfFixedRules(exprs = list(
SNP = expression(
as.character(REF) %in% nucleotides &
as.character(ALT) %in% nucleotides)
))
summary(evalSeparately(SNPrule, evcf, enclos = .GlobalEnv))
## SNP
## Mode :logical
## FALSE:14
## TRUE :467
Some considerations regarding the above filter rule:
nucleotides
character vector, the global environment must be supplied as the enclosing environment to successfully evaluate the expression"REF"
and "ALT"
are stored as DNAStringSet
in CollapsedVCF
objects and must be converted to character
in order to successfully apply the method %in%
.Expressions that define filter rules may also include calculations. In the chunk of code below, two simple VCF filter rules are defined using fields of the info
slot:
"samples"
identifies variants where at least 90% of samples have data (i.e. the NS
value is greater than or equal to 0.9
times the total number of samples)"avgSuperPopAF"
calculates the average of the allele frequencies calculated in each the five super-populations (available in several INFO fields), and subsequently identifies variants with an average value greater than 0.05
.infoRules <- VcfInfoRules(exprs = list(
samples = expression(NS > (0.9 * ncol(evcf))),
avgSuperPopAF = expression(
(EAS_AF + EUR_AF + AFR_AF + AMR_AF + SAS_AF) / 5 > 0.05
)
))
summary(evalSeparately(infoRules, evcf, enclos = .GlobalEnv))
## samples avgSuperPopAF
## Mode:logical Mode :logical
## TRUE:481 FALSE:452
## TRUE :29
It may be more convenient to define filters as function
objects. For instance, the chunk of code below:
info
slot of a VCF
object as inputAFcutoff <- 0.05
popCutoff <- 2/3
filterFUN <- function(envir){
# info(vcf) returns a DataFrame; rowSums below requires a data.frame
df <- as.data.frame(envir)
# Identify fields storing allele frequency in super-populations
popFreqCols <- grep("[[:alpha:]]{3}_AF", colnames(df))
# Count how many super-population have an allele freq above the cutoff
popCount <- rowSums(df[,popFreqCols] > AFcutoff)
# Convert the cutoff ratio to a integer count
popCutOff <- popCutoff * length(popFreqCols)
# Identifies variants where enough super-population pass the cutoff
testRes <- (popCount > popCutOff)
# Return a boolean vector, required by the eval method
return(testRes)
}
funFilter <- VcfInfoRules(exprs = list(
commonSuperPops = filterFUN
))
summary(evalSeparately(funFilter, evcf))
## commonSuperPops
## Mode :logical
## FALSE:464
## TRUE :17
Notably, the filterFUN
function may also be applied separately to the info
slot of VCF
objects:
summary(filterFUN(info(evcf)))
## Mode FALSE TRUE
## logical 464 17
The grepl
function is particularly suited for the purpose of FilterRules
as they return a logical
vector:
missenseFilter <- VcfVepRules(
exprs = list(
exact = expression(Consequence == "missense_variant"),
grepl = expression(grepl("missense", Consequence))
),
vep = "CSQ")
summary(evalSeparately(missenseFilter, evcf))
## exact grepl
## Mode :logical Mode :logical
## FALSE:454 FALSE:452
## TRUE :27 TRUE :29
In the above chunk of code:
"exact"
matches only the given value, associated with 27
variants,"grepl"
also matches an extra two variants associated with the value "missense_variant&splice_region_variant"
matched by the given pattern. By deduction, the two rules indicate together that those two variants were not assigned the "missense_variant"
prediction.fixed
slot of VCF
objectsAs detailed in an earlier section introducing the demonstration data, and more thoroughly in the documentation of the VariantAnnotation package, CollapsedVCF
and ExpandedVCF
classes differ in the class of data stored in the "ALT"
field of their respective fixed
slot. As as result, VCF filter rules using data from this field must take into account the VCF
class in order to handle the data appropriately:
ExpandedVCF
objectsA key aspect of ExpandedVCF
objects is that the "ALT"
field of their fixed
slot may store only a single allele per record as a DNAStringSet
object.
For instance, in an earlier section that demonstrated Filter rules using multiple raw fields, ALT data of the fixed
slot in an ExpandedVCF
object had to be re-typed from DNAStringSet
to character
before the %in%
function could be applied.
Nevertheless, VCF filter rules may also make use of methods associated with the DNAStringSet
class. For instance, genetic insertions may be identified using the fields "REF"
and "ALT"
fields of the fixed
slot:
fixedInsertionFilter <- VcfFixedRules(exprs = list(
isInsertion = expression(
Biostrings::width(ALT) > Biostrings::width(REF)
)
))
evcf_fixedIns <- subsetByFilter(evcf, fixedInsertionFilter)
as.data.frame(fixed(evcf_fixedIns)[,c("REF", "ALT")])
## REF ALT
## 1 A AC
## 2 A AT
## 3 C CA
## 4 T TA
Here, the above VcfFixedRules
is synonym to a distinct VcfVepRules
using the Ensembl VEP prediction "VARIANT_CLASS"
:
vepInsertionFilter <- VcfVepRules(exprs = list(
isInsertion = expression(VARIANT_CLASS == "insertion")
))
evcf_vepIns <- subsetByFilter(evcf, vepInsertionFilter)
as.data.frame(fixed(evcf_vepIns)[,c("REF", "ALT")])
## REF ALT
## 1 A AC
## 2 A AT
## 3 C CA
## 4 T TA
CollapsedVCF
objectsIn contrast to ExpandedVCF
, CollapsedVCF
may contain more than one allele per record in their "ALT"
field (fixed
slot), represented by a DNAStringSetList
object.
As a result, VCF filter rules using the "ALT"
field of the info
slot in CollapsedVCF
objects may use methods dedicated to DNAStringSetList
to handle the data. For instance, multi-allelic variants may be identified by the following VcfFixedRules
:
multiallelicFilter <- VcfFixedRules(exprs = list(
multiallelic = expression(lengths(ALT) > 1)
))
summary(eval(multiallelicFilter, vcf))
## Mode FALSE TRUE
## logical 477 2
Any number of VcfFixedRules
, VcfInfoRules
, and VcfVepRules
—or even VcfFilterRules
themselves—may be combined into a larger object of class VcfFilterRules
. Notably, the active state of each filter rule is transferred to the combined object. Even though the VcfFilterRules
class acts as a container for multiple types of VCF filter rules, the resulting VcfFilterRules
object also extends the FilterRules
class, and as a result can be evaluated and used to subset VCF
objects identically to any of the specialised more specialised classes.
During the creation of VcfFixedRules
objects, each VCF filter rule being combined is marked with a type value, indicating the VCF slot in which the filter rule must be evaluated. This information is stored in the new type
slot of VcfFixedRules
objects. For instance, it is possible to combine two VcfFixedRules
(containing two and one filter rules, respectively), one VcfInfoRules
, and one VcfVepRules
defined earlier in this vignette:
vignetteRules <- VcfFilterRules(
fixedRules,
SNPrule,
infoRules,
vepInsertionFilter
)
vignetteRules
## VcfFilterRules of length 6
## names(6): pass qual20 SNP samples avgSuperPopAF isInsertion
active(vignetteRules)
## pass qual20 SNP samples avgSuperPopAF
## TRUE FALSE TRUE TRUE TRUE
## isInsertion
## TRUE
type(vignetteRules)
## pass qual20 SNP samples avgSuperPopAF
## "fixed" "fixed" "fixed" "info" "info"
## isInsertion
## "vep"
summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv))
## pass qual20 SNP samples
## Mode:logical Mode:logical Mode :logical Mode:logical
## TRUE:481 TRUE:481 FALSE:14 TRUE:481
## TRUE :467
## avgSuperPopAF isInsertion
## Mode :logical Mode :logical
## FALSE:452 FALSE:477
## TRUE :29 TRUE :4
Clearly1 This statement below would be more evident if the summary
method was displaying the result of evalSeparately
in this vignette as it does it in an R session., the VCF filter rules SNP
and isInsertion
are mutually exclusive, which explains the final 0
variants left after filtering. Conveniently, either of these rules may be deactivated before evaluating the remaining active filter rules:
active(vignetteRules)["SNP"] <- FALSE
summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv))
## pass qual20 SNP samples avgSuperPopAF
## Mode:logical Mode:logical Mode:logical Mode:logical Mode :logical
## TRUE:481 TRUE:481 TRUE:481 TRUE:481 FALSE:452
## TRUE :29
## isInsertion
## Mode :logical
## FALSE:477
## TRUE :4
As a result, the deactivated filter rule ("SNP"
) now returns TRUE
for all variants, leaving a final 2
variants2 Again, this statement would benefit from the result of evalSeparately
being displayed identically to an R session. pass the remaining active filter rules:
"PASS"
0.05
VARIANT_CLASS
equal to "insertion"
Finally, the following chunk of code demonstrates how VcfFilterRules
may also be created from the combination of VcfFilterRules
, either with themselves or with any of the classes that define more specific VCF filter rules. Notably, when VcfFilterRules
objects are combined, the type
and active
value of each filter rule is transferred to the combined object:
Combine VcfFilterRules
with VcfVepRules
combinedFilters <- VcfFilterRules(
vignetteRules, # VcfFilterRules
missenseFilter # VcfVepRules
)
type(vignetteRules)
## pass qual20 SNP samples avgSuperPopAF
## "fixed" "fixed" "fixed" "info" "info"
## isInsertion
## "vep"
type(combinedFilters)
## pass qual20 SNP samples avgSuperPopAF
## "fixed" "fixed" "fixed" "info" "info"
## isInsertion exact grepl
## "vep" "vep" "vep"
active(vignetteRules)
## pass qual20 SNP samples avgSuperPopAF
## TRUE FALSE FALSE TRUE TRUE
## isInsertion
## TRUE
active(missenseFilter)
## exact grepl
## TRUE TRUE
active(combinedFilters)
## pass qual20 SNP samples avgSuperPopAF
## TRUE FALSE FALSE TRUE TRUE
## isInsertion exact grepl
## TRUE TRUE TRUE
Combine multiple VcfFilterRules
with VcfFilterRules
(and more)
To demonstrate this action, another VcfFilterRules
must first be created. This can be achieve by simply re-typing a VcfVepRules
defined earlier:
secondVcfFilter <- VcfFilterRules(missenseFilter)
secondVcfFilter
## VcfFilterRules of length 2
## names(2): exact grepl
It is now possible to combine the two VcfFilterRules
. Let us even combine another VcfInfoRules
object in the same step:
manyRules <- VcfFilterRules(
vignetteRules, # VcfFilterRules
secondVcfFilter, # VcfFilterRules
funFilter # VcfInfoRules
)
manyRules
## VcfFilterRules of length 9
## names(9): pass qual20 SNP samples ... exact grepl commonSuperPops
active(manyRules)
## pass qual20 SNP samples
## TRUE FALSE FALSE TRUE
## avgSuperPopAF isInsertion exact grepl
## TRUE TRUE TRUE TRUE
## commonSuperPops
## TRUE
type(manyRules)
## pass qual20 SNP samples
## "fixed" "fixed" "fixed" "info"
## avgSuperPopAF isInsertion exact grepl
## "info" "vep" "vep" "vep"
## commonSuperPops
## "info"
summary(evalSeparately(manyRules, evcf, enclos = .GlobalEnv))
## pass qual20 SNP samples avgSuperPopAF
## Mode:logical Mode:logical Mode:logical Mode:logical Mode :logical
## TRUE:481 TRUE:481 TRUE:481 TRUE:481 FALSE:452
## TRUE :29
## isInsertion exact grepl commonSuperPops
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:477 FALSE:454 FALSE:452 FALSE:464
## TRUE :4 TRUE :27 TRUE :29 TRUE :17
Critically, users must be careful to combine rules all compatible with the class of VCF
object in which it will be evaluated (i.e. CollapsedVCF
or ExpandedVCF
).
Here is the output of sessionInfo()
on the system on which this document was compiled:
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] TVTB_1.2.0 knitr_1.15.1 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.2.1 Biobase_2.36.0
## [3] AnnotationHub_2.8.0 splines_3.4.0
## [5] shiny_1.0.2 Formula_1.2-1
## [7] interactiveDisplayBase_1.14.0 stats4_3.4.0
## [9] latticeExtra_0.6-28 pander_0.6.0
## [11] BSgenome_1.44.0 GenomeInfoDbData_0.99.0
## [13] Rsamtools_1.28.0 yaml_2.1.14
## [15] RSQLite_1.1-2 backports_1.0.5
## [17] lattice_0.20-35 biovizBase_1.24.0
## [19] limma_3.32.0 digest_0.6.12
## [21] GenomicRanges_1.28.0 RColorBrewer_1.1-2
## [23] XVector_0.16.0 checkmate_1.8.2
## [25] colorspace_1.3-2 httpuv_1.3.3
## [27] htmltools_0.3.5 Matrix_1.2-9
## [29] plyr_1.8.4 XML_3.98-1.6
## [31] ensemblVEP_1.16.0 biomaRt_2.32.0
## [33] bookdown_0.3 zlibbioc_1.22.0
## [35] xtable_1.8-2 scales_0.4.1
## [37] BiocParallel_1.10.0 EnsDb.Hsapiens.v75_2.1.0
## [39] tibble_1.3.0 htmlTable_1.9
## [41] AnnotationFilter_1.0.0 IRanges_2.10.0
## [43] ggplot2_2.2.1 SummarizedExperiment_1.6.0
## [45] GenomicFeatures_1.28.0 nnet_7.3-12
## [47] Gviz_1.20.0 BiocGenerics_0.22.0
## [49] lazyeval_0.2.0 mime_0.5
## [51] survival_2.41-3 magrittr_1.5
## [53] memoise_1.1.0 evaluate_0.10
## [55] GGally_1.3.0 foreign_0.8-67
## [57] BiocInstaller_1.26.0 data.table_1.10.4
## [59] tools_3.4.0 matrixStats_0.52.2
## [61] stringr_1.2.0 S4Vectors_0.14.0
## [63] munsell_0.4.3 cluster_2.0.6
## [65] DelayedArray_0.2.0 ensembldb_2.0.0
## [67] AnnotationDbi_1.38.0 Biostrings_2.44.0
## [69] compiler_3.4.0 GenomeInfoDb_1.12.0
## [71] grid_3.4.0 RCurl_1.95-4.8
## [73] dichromat_2.0-0 VariantAnnotation_1.22.0
## [75] htmlwidgets_0.8 labeling_0.3
## [77] bitops_1.0-6 base64enc_0.1-3
## [79] rmarkdown_1.4 gtable_0.2.0
## [81] DBI_0.6-1 reshape_0.8.6
## [83] reshape2_1.4.2 R6_2.2.0
## [85] GenomicAlignments_1.12.0 gridExtra_2.2.1
## [87] rtracklayer_1.36.0 Hmisc_4.0-2
## [89] rprojroot_1.2 ProtGenerics_1.8.0
## [91] stringi_1.1.5 parallel_3.4.0
## [93] Rcpp_0.12.10 rpart_4.1-11
## [95] acepack_1.4.1
McLaren, W., B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. Cunningham. 2010. “Deriving the Consequences of Genomic Variants with the Ensembl API and SNP Effect Predictor.” Journal Article. Bioinformatics 26 (16): 2069–70. doi:10.1093/bioinformatics/btq330.