Abstract
This vignette is a guide containing example code for performing real-life tasks. Importantly, it covers some functionality that were not covered in the Quick-Start vignette (because they are too computationally intensive to be reproducible in a vignette). Version 1.0.4
For instructions on installing and configuring SpliceWiz, please see the Quick-Start vignette.
library(SpliceWiz)
#> Loading required package: NxtIRFdata
#> SpliceWiz package loaded with 2 threads
#> Use setSWthreads() to set the number of SpliceWiz threads
First, define the path to the directory in which the reference should be stored. This directory will be made by SpliceWiz, but its parent directory must exist, otherwise an error will be returned.
Note that setting genome_path = "hg38"
will prompt SpliceWiz to use the default files for nonPolyA and Mappability exclusion references in the generation of its reference. Valid options for genome_path
are “hg38”, “hg19”, “mm10” and “mm9”.
buildRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = "hg38"
)
buildRef()
first fetches the genome and gene annotations and makes a compressed local copy in the resources
subdirectory of the given reference path. This can sometimes be a long process (especially when downloading the genome from the internet). Also, one may choose to generate the mappability exclusion regions manually (see Reference Generation
- Mappability exclusion generation using Rsubread
section). This needs to occur prior to generation of the SpliceWiz reference.
To separately prepare the annotation data and build the SpliceWiz reference, use the getResources()
function to specify the FASTA and GTF files. Once getResources()
has completed successfully, call buildRef()
and leave the fasta
and gtf
arguments blank, as in the example below:
getResources(
reference_path = ref_path,
fasta = "genome.fa",
gtf = "transcripts.gtf"
)
buildRef(
reference_path = ref_path,
genome_type = "hg38"
)
To re-build and overwrite an existing reference, using the same resource annotations:
# Assuming hg38 genome:
buildRef(
reference_path = ref_path,
genome_type = "hg38",
overwrite = TRUE
)
The following will first download the genome and gene annotation files from the online resource and store a local copy of it in a file cache, facilitated by BiocFileCache. Then, it uses the downloaded resource to create the SpliceWiz reference.
FTP <- "ftp://ftp.ensembl.org/pub/release-94/"
buildRef(
reference_path = ref_path,
fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
"Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"),
gtf = paste0(FTP, "gtf/homo_sapiens/",
"Homo_sapiens.GRCh38.94.chr.gtf.gz"),
genome_type = "hg38"
)
AnnotationHub contains Ensembl references for many genomes. To browse what is available:
require(AnnotationHub)
#> Loading required package: AnnotationHub
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
ah <- AnnotationHub()
#> snapshotDate(): 2022-10-31
query(ah, "Ensembl")
#> AnnotationHub with 35520 records
#> # snapshotDate(): 2022-10-31
#> # $dataprovider: Ensembl, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,E...
#> # $species: Mus musculus, Homo sapiens, Danio rerio, Rattus norvegicus, Sus ...
#> # $rdataclass: TwoBitFile, GRanges, EnsDb, SQLiteFile, data.frame, OrgDb, list
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH5046"]]'
#>
#>
#> AH5046 |
#> AH5160 |
#> AH5311 |
#> AH5434 |
#> AH5435 |
#> ...
#> AH111328 |
#> AH111329 |
#> AH111330 |
#> AH111331 |
#> AH111332 |
#> title
#> AH5046 Ensembl Genes
#> AH5160 Ensembl Genes
#> AH5311 Ensembl Genes
#> AH5434 Ensembl Genes
#> AH5435 Ensembl EST Genes
#> ... ...
#> AH111328 Zalophus_californianus.mZalCal1.pri.109.gtf
#> AH111329 Zonotrichia_albicollis.Zonotrichia_albicollis-1.0.1.109.abinitio...
#> AH111330 Zonotrichia_albicollis.Zonotrichia_albicollis-1.0.1.109.gtf
#> AH111331 Zosterops_lateralis_melanops.ASM128173v1.109.abinitio.gtf
#> AH111332 Zosterops_lateralis_melanops.ASM128173v1.109.gtf
For a more specific query:
query(ah, c("Homo Sapiens", "release-94"))
#> AnnotationHub with 9 records
#> # snapshotDate(): 2022-10-31
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: TwoBitFile, GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH64628"]]'
#>
#> title
#> AH64628 | Homo_sapiens.GRCh38.94.abinitio.gtf
#> AH64629 | Homo_sapiens.GRCh38.94.chr.gtf
#> AH64630 | Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
#> AH64631 | Homo_sapiens.GRCh38.94.gtf
#> AH65744 | Homo_sapiens.GRCh38.cdna.all.2bit
#> AH65745 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#> AH65746 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#> AH65747 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#> AH65748 | Homo_sapiens.GRCh38.ncrna.2bit
We wish to fetch “AH65745” and “AH64631” which contains the desired FASTA and GTF files, respectively. To build a reference using these resources:
Build-Reference-methods
will recognise the inputs of fasta
and gtf
as AnnotationHub resources as they begin with “AH”.
For human and mouse genomes, we highly recommend specifying genome_type
as the default mappability file is used to exclude intronic regions with repeat sequences from intron retention analysis. For other species, one could generate a SpliceWiz reference without this reference:
buildRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = ""
)
To use STAR
to align FASTQ files, one must be using a system with STAR
installed. This software is not available in Windows. To check if STAR
is available:
ref_path = "./Reference"
# Ensure genome resources are prepared from genome FASTA and GTF file:
if(!dir.exists(file.path(ref_path, "resource"))) {
getResources(
reference_path = ref_path,
fasta = "genome.fa",
gtf = "transcripts.gtf"
)
}
# Generate a STAR genome reference:
STAR_BuildRef(
reference_path = ref_path,
n_threads = 8
)
If STAR
is available on the same computer or server where R/RStudio is being run, we can use the one-line function buildFullRef
. This function will: * Prepare the resources from the given FASTA and GTF files * Generate a STAR genome * Use the STAR genome and the FASTA file to de-novo calculate and define low mappability regions * Build the SpliceWiz reference using the genome resources and mappability file
buildFullRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = "",
use_STAR_mappability = TRUE,
n_threads = 4
)
n_threads
specify how many threads should be used to build the STAR reference and to calculate the low mappability regions
Finally, if STAR
is not available, Rsubread
is available on Bioconductor to perform mappability calculations. The example code in the manual is displayed here for convenience, to demonstrate how this would be done:
# (1a) Creates genome resource files
ref_path <- file.path(tempdir(), "Reference")
getResources(
reference_path = ref_path,
fasta = chrZ_genome(),
gtf = chrZ_gtf()
)
# (1b) Systematically generate reads based on the SpliceWiz example genome:
generateSyntheticReads(
reference_path = ref_path
)
# (2) Align the generated reads using Rsubread:
# (2a) Build the Rsubread genome index:
setwd(ref_path)
Rsubread::buildindex(basename = "./reference_index",
reference = chrZ_genome())
# (2b) Align the synthetic reads using Rsubread::subjunc()
Rsubread::subjunc(
index = "./reference_index",
readfile1 = file.path(ref_path, "Mappability", "Reads.fa"),
output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"),
useAnnotation = TRUE,
annot.ext = chrZ_gtf(),
isGTF = TRUE
)
# (3) Analyse the aligned reads in the BAM file for low-mappability regions:
calculateMappability(
reference_path = ref_path,
aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam")
)
# (4) Build the SpliceWiz reference using the calculated Mappability Exclusions
buildRef(ref_path)
First, remember to check that STAR is available via command line:
STAR_alignReads(
STAR_ref_path = file.path(ref_path, "STAR"),
BAM_output_path = "./bams/sample1",
fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq",
n_threads = 8
)
Experiment <- data.frame(
sample = c("sample_A", "sample_B"),
forward = file.path("raw_data", c("sample_A", "sample_B"),
c("sample_A_1.fastq", "sample_B_1.fastq")),
reverse = file.path("raw_data", c("sample_A", "sample_B"),
c("sample_A_2.fastq", "sample_B_2.fastq"))
)
STAR_alignExperiment(
Experiment = Experiment,
STAR_ref_path = file.path("Reference_FTP", "STAR"),
BAM_output_path = "./bams",
n_threads = 8,
two_pass = FALSE
)
To use two-pass mapping, set two_pass = TRUE
. We recommend disabling this feature, as one-pass mapping is adequate in typical-use cases.
SpliceWiz can identify sequencing FASTQ files recursively from a given directory. It assumes that forward and reverse reads are suffixed as _1
and _2
, respectively. Users can choose to identify such files using a specified file extension. For example, to recursively identify FASTQ files of the format {sample}_1.fq.gz
and {sample}_2.fq.gz
, use the following:
# Assuming sequencing files are named by their respective sample names
fastq_files <- findFASTQ("./sequencing_files", paired = TRUE,
fastq_suffix = ".fq.gz", level = 0)
Then, these can be aligned as follows:
To conveniently find all BAM files recursively in a given path:
This convenience function returns the putative sample names, either from BAM file names themselves (level = 0
), or from the names of their parent directories (level = 1
).
To run processBAM()
using 4 OpenMP threads:
# assume SpliceWiz reference has been generated in `ref_path`
processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = "./pb_output",
n_threads = 4,
useOpenMP = TRUE
)
Sometimes one may wish to create a COV file from a BAM file without running processBAM()
. One reason might be because a SpliceWiz reference is not available.
To convert a list of BAM files, run BAM2COV()
. This is a function structurally similar to processBAM()
but without the need to give the path to the SpliceWiz reference:
BAM2COV(
bamfiles = bams$path,
sample_names = bams$sample,
output_path = "./pb_output",
n_threads = 4,
useOpenMP = TRUE
)
Assuming the SpliceWiz reference is in ref_path
, after running processBAM()
as shown in the previous section, use the convenience function findSpliceWizOutput()
to tabulate a list of samples and their corresponding processBAM()
outputs:
This data.frame can be directly used to run collateData
:
novelSplicing = TRUE
. See the SpliceWiz Quick-Start vignette for more details.Then, the collated data can be imported as a NxtSE
object, which is an object that inherits SummarizedExperiment
and has specialized containers to hold additional data required by SpliceWiz.
Please refer to SpliceWiz: Quick-Start vignette for worked examples using the example dataset.
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] AnnotationHub_3.6.0 BiocFileCache_2.6.1 dbplyr_2.3.2
#> [4] BiocGenerics_0.44.0 SpliceWiz_1.0.4 NxtIRFdata_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] lazyeval_0.2.2 shinydashboard_0.7.2
#> [3] splines_4.2.3 BiocParallel_1.32.6
#> [5] GenomeInfoDb_1.34.9 ggplot2_3.4.1
#> [7] digest_0.6.31 foreach_1.5.2
#> [9] ca_0.71.1 htmltools_0.5.5
#> [11] viridis_0.6.2 fansi_1.0.4
#> [13] magrittr_2.0.3 memoise_2.0.1
#> [15] BSgenome_1.66.3 shinyFiles_0.9.3
#> [17] Biostrings_2.66.0 annotate_1.76.0
#> [19] matrixStats_0.63.0 R.utils_2.12.2
#> [21] prettyunits_1.1.1 colorspace_2.1-0
#> [23] blob_1.2.4 rappdirs_0.3.3
#> [25] xfun_0.38 dplyr_1.1.1
#> [27] crayon_1.5.2 RCurl_1.98-1.12
#> [29] jsonlite_1.8.4 genefilter_1.80.3
#> [31] survival_3.5-5 iterators_1.0.14
#> [33] glue_1.6.2 registry_0.5-1
#> [35] gtable_0.3.3 zlibbioc_1.44.0
#> [37] XVector_0.38.0 webshot_0.5.4
#> [39] DelayedArray_0.24.0 Rhdf5lib_1.20.0
#> [41] HDF5Array_1.26.0 scales_1.2.1
#> [43] pheatmap_1.0.12 DBI_1.1.3
#> [45] Rcpp_1.0.10 progress_1.2.2
#> [47] viridisLite_0.4.1 xtable_1.8-4
#> [49] bit_4.0.5 stats4_4.2.3
#> [51] DT_0.27 htmlwidgets_1.6.2
#> [53] httr_1.4.5 fstcore_0.9.14
#> [55] RColorBrewer_1.1-3 ellipsis_0.3.2
#> [57] pkgconfig_2.0.3 XML_3.99-0.14
#> [59] R.methodsS3_1.8.2 sass_0.4.5
#> [61] utf8_1.2.3 tidyselect_1.2.0
#> [63] rlang_1.1.0 later_1.3.0
#> [65] AnnotationDbi_1.60.2 munsell_0.5.0
#> [67] BiocVersion_3.16.0 tools_4.2.3
#> [69] cachem_1.0.7 cli_3.6.1
#> [71] generics_0.1.3 RSQLite_2.3.0
#> [73] evaluate_0.20 fastmap_1.1.1
#> [75] heatmaply_1.4.2 yaml_2.3.7
#> [77] ompBAM_1.2.0 fs_1.6.1
#> [79] knitr_1.42 bit64_4.0.5
#> [81] purrr_1.0.1 KEGGREST_1.38.0
#> [83] dendextend_1.17.1 sparseMatrixStats_1.10.0
#> [85] mime_0.12 rhandsontable_0.3.8
#> [87] R.oo_1.25.0 compiler_4.2.3
#> [89] plotly_4.10.1 filelock_1.0.2
#> [91] curl_5.0.0 png_0.1-8
#> [93] interactiveDisplayBase_1.36.0 tibble_3.2.1
#> [95] bslib_0.4.2 lattice_0.20-45
#> [97] Matrix_1.5-3 vctrs_0.6.1
#> [99] pillar_1.9.0 lifecycle_1.0.3
#> [101] rhdf5filters_1.10.1 BiocManager_1.30.20
#> [103] jquerylib_0.1.4 data.table_1.14.8
#> [105] bitops_1.0-7 seriation_1.4.2
#> [107] httpuv_1.6.9 rtracklayer_1.58.0
#> [109] GenomicRanges_1.50.2 R6_2.5.1
#> [111] BiocIO_1.8.0 promises_1.2.0.1
#> [113] TSP_1.2-3 gridExtra_2.3
#> [115] IRanges_2.32.0 codetools_0.2-19
#> [117] assertthat_0.2.1 rhdf5_2.42.0
#> [119] SummarizedExperiment_1.28.0 rjson_0.2.21
#> [121] withr_2.5.0 shinyWidgets_0.7.6
#> [123] GenomicAlignments_1.34.1 Rsamtools_2.14.0
#> [125] S4Vectors_0.36.2 GenomeInfoDbData_1.2.9
#> [127] hms_1.1.3 parallel_4.2.3
#> [129] fst_0.9.8 grid_4.2.3
#> [131] tidyr_1.3.0 rmarkdown_2.21
#> [133] DelayedMatrixStats_1.20.0 MatrixGenerics_1.10.0
#> [135] Biobase_2.58.0 shiny_1.7.4
#> [137] restfulr_0.0.15