BgeeCall
is a collection of functions that uses Bgee expertise to create RNA-Seq gene expression present/absent calls.
The BgeeCall
package allows to:
If you find a bug or have any issues with BgeeCall
please write a bug report in our GitHub issues manager.
In Bgee present/absent gene expression calls for RNA-seq are generated using a threshold specific to each RNA-Seq library, calculated using reads mapped to reference intergenic regions. This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g log2(TPM) = 1).
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species and produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It notably integrates RNA-Seq libraries for 29 species.
Reference intergenic regions are defined in the Bgee RNA-Seq pipeline. Candidate intergenic regions are defined using gene annotation data. For each species, over all available libraries, reads are mapped to these intergenic regions with kallisto, as well as to genes. This “intergenic expression” is deconvoluted to distinguish reference intergenic from non annotated genes, which have higher expression. Reference intergenic regions are then defined as intergenic regions with low expression level over all RNA-Seq libraries, relative to genes. This step allows not to consider regions wrongly considered as intergenic because of potential gene annotation quality problem as intergenic. For more information please refer to the Bgee RNA-Seq pipeline.
BgeeCall pipeline allows to download reference intergenic regions resulting from the expertise of the Bgee team. Moreover BgeeCall allows to use these reference intergenic regions to automatically generate gene expression calls for your own RNA-Seq libraries as long as the species is integrated to Bgee The present/absent abundance threshold is calculated for each library using the formula :
\[ \frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05 \]
In R:
BgeeCall is highly tunable. Do not hesitate to have a look at the reference manual to have a precise descripton of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) or of all available functions. BgeeCall needs kallisto to run. If you do not have kallisto installed you will find more information how to install it here
With the BgeeCall package it is easy to generate present/absent gene expression calls. The most time comsuming task of this calls generation is the generation of the kallisto transcriptome index. As the time needed for this step depend on the size of the transcriptome, we choose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans). To generate these calls you will need :
For this vignette we created a toy fastq file example based on the SRX099901 library using the ShortRead R package
library("ShortRead")
# keep 48.000 reads
sampler <- FastqSampler(file.path("absolute_path","/SRX099901/SRR350955.fastq.gz"), 48000)
set.seed(1); SRR350955 <- yield(sampler)
writeFastq(object = SRR350955,
file =file.path( "absolute_path","SRX099901_subset", "SRR350955_subset.fastq.gz"),
mode = "w", full = FALSE, compress = TRUE)
In this example we used the Bioconductor AnnotationHub to load transcriptome and gene annotations but you can load them from wherever you want.
ah <- AnnotationHub::AnnotationHub()
ah_resources <- AnnotationHub::query(ah, c("Ensembl", "Caenorhabditis elegans", "84"))
annotation_object <- ah_resources[["AH50789"]]
transcriptome_object <- rtracklayer::import.2bit(ah_resources[["AH50453"]])
Once you have access to transcriptome, gene annotations and your RNA-Seq library, an object of class UserMetadata
has to be created.
# create an object of class UserMetadata and specify the species ID
user_BgeeCall <- new("UserMetadata", species_id = "6239")
# import annotation and transcriptome in the user_BgeeCall object
# it is possible to import them using an S4 object (GRanges, DNAStringSet) or a file (gtf, fasta)
user_BgeeCall <- setAnnotationFromObject(user_BgeeCall, annotation_object, "WBcel235_84")
user_BgeeCall <- setTranscriptomeFromObject(user_BgeeCall, transcriptome_object, "WBcel235")
# provide path to the directory of your RNA-Seq library
user_BgeeCall <- setRNASeqLibPath(user_BgeeCall,
system.file("extdata",
"SRX099901_subset",
package = "BgeeCall"))
And that’s it… You can run the generation of your present/absent gene expression calls
#> Querying Bgee to get intergenic release information...
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read_tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read_tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Generate present/absent expression calls.
#>
#> TPM cutoff for which 95% of the expressed genes are coding found at TPM = 4.64724e-06
Each analyze generates 5 files and return path to each one of them.
head.DataTable(x = read.table(calls_output$calls_tsv_path, header = TRUE), n = 5)
#> id abundance counts length biotype type call
#> 1 WBGene00000001 27.0993 2 1556.20 protein_coding genic present
#> 2 WBGene00000002 0.0000 0 1761.00 protein_coding genic absent
#> 3 WBGene00000003 0.0000 0 1549.00 protein_coding genic absent
#> 4 WBGene00000004 13.8730 1 1519.92 protein_coding genic present
#> 5 WBGene00000005 0.0000 0 1487.00 protein_coding genic absent
read.table(calls_output$cutoff_info_file_path)
#> V1 V2
#> 1 libraryId SRX099901_subset
#> 2 cutoffTPM 4.64724e-06
#> 3 proportionGenicPresent 25.3210509597495
#> 4 numberGenicPresent 5580
#> 5 numberGenic 22037
#> 6 proportionCodingPresent 27.1433462121583
#> 7 numberPresentCoding 5550
#> 8 numberCoding 20447
#> 9 proportionIntergenicPresent 0.490998363338789
#> 10 numberIntergenicPresent 21
#> 11 numberIntergenic 4277
#> 12 ratioIntergenicCodingPresent 0.05
head.DataTable(x = read.table(calls_output$abundance_tsv, header = TRUE), n = 5)
#> target_id length eff_length est_counts tpm
#> 1 Y110A7A.10 1787 1556.20 2 27.0993
#> 2 F27C8.1 1940 1761.00 0 0.0000
#> 3 F07C3.7 1728 1549.00 0 0.0000
#> 4 F52H2.2 1739 1519.92 1 13.8730
#> 5 T13A10.10a 1734 1555.00 0 0.0000
read.table(calls_output$S4_slots_summary, header = TRUE, sep = "\t")
#> Slot.name
#> 1 AbundanceMetadata@tool_name
#> 2 AbundanceMetadata@txOut
#> 3 AbundanceMetadata@ignoreTxVersion
#> 4 AbundanceMetadata@cutoff
#> 5 AbundanceMetadata@read_size_kmer_threshold
#> 6 BgeeMetadata@intergenic_release
#> 7 UserMetadata@species_id
#> 8 UserMetadata@reads_size
#> 9 UserMetadata@rnaseq_lib_path
#> 10 UserMetadata@transcriptome_name
#> 11 UserMetadata@annotation_name
#> 12 UserMetadata@simple_arborescence
#> 13 output_dir
#> Slot.value
#> 1 kallisto
#> 2 FALSE
#> 3 FALSE
#> 4 0.05
#> 5 50
#> 6 0.1
#> 7 6239
#> 8 51
#> 9 /tmp/RtmpZvBitd/Rinst39a92a0110b0/BgeeCall/extdata/SRX099901_subset
#> 10 WBcel235
#> 11 WBcel235_84
#> 12 TRUE
#> 13 /tmp/RtmpZvBitd/Rinst39a92a0110b0/BgeeCall/extdata/intergenic_0.1/all_results/SRX099901_subset
You will potentialy be also interested to generate present/absent calls on different RNA-Seq libraries, potentially on different species, or using The main function generate_presence_absence()
allows to generate present/absent calls from a UserMetadata object but also from a data frame or a tsv file depending on the arguments of the function you use. Please choose one of the three following arguments : - userMetadata : Allows to generate present/absent calls for one RNA-Seq library using one object of the class UserMetadata.
- userDataFrame : Provide a dataframe where each row correspond to one present/absent call generation. It allows to generate present/absent calls on different libraries, species, transcriptome, genome annotations, etc. - userFile : Similar to userDataFrame except that the information are stored in a tsv file. A template of this file called userMetadataTemplate.tsv
is available at the root of the package.
Columns of the dataframe or the tsv file are :
getwd()
function and correspond to the working directory of your R session. If not interested by this option, leave the column empty.working_path
column. However, this column allows you to define a different output_directory for RNA-Seq results. For instance it allows you to save calls information directly in the RNA-Seq directory. If not interested by this option, leave the column empty.Once the file has been fill in expression calls can be generated with :
Different releases of reference intergenic sequences are available. It is possible to list all these releases :
list_intergenic_release()
#> Downloading release information of reference intergenic sequences...
#> release releaseDate FTPURL
#> 1 0.1 2018-12-21 ftp://ftp.bgee.org/intergenic/0.1/
#> 2 0.2 2019-02-07 ftp://ftp.bgee.org/intergenic/0.2/
#> 3 community 2019-07-22
#> 4 custom 2019-07-22
#> referenceIntergenicFastaURL
#> 1 ftp://ftp.bgee.org/intergenic/0.1/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 2 ftp://ftp.bgee.org/intergenic/0.2/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 3
#> 4
#> minimumVersionBgeeCall
#> 1 0.9.9
#> 2 0.9.9
#> 3 1.1.0
#> 4 1.1.0
#> description
#> 1 intergenic regions used to generate Bgee 14.
#> 2 cleaned intergenic sequences based on release 0.1 (remove blocks of Ns longer than 100 and sequences containing more than 5% of Ns).
#> 3 Release allowing to access to all reference intergenic sequences generated by the community and not present in Bgee for the moment.
#> 4 Release allowing to use your own FASTA reference intergenic sequences. When this release is selected BgeeCall will use the sequences at UserMetadata@custom_intergenic_path to generate present/absent calls.
#> messageToUsers
#> 1
#> 2 be careful, this intergenic release has not been tested by Bgee
#> 3 These reference intergenic sequences have not been generated by Bgee. Use with caution.
#> 4 You decided to use your own reference intergenic sequences
It is then possible to choose one specific release to create a BgeeMetadata
object. Always use the setter method setIntergenicRelease()
when changing the release of an already existing BgeeMetadata object.
# create BgeeMetadata object and define one reference intergenic release
bgee <- new("BgeeMetadata", intergenic_release = "0.1")
#> Querying Bgee to get intergenic release information...
# change the reference intergenic release of your BgeeMetadata object
bgee <- setIntergenicRelease(bgee, "0.2")
#> IMPORTANT : be careful, this intergenic release has not been tested by Bgee
By default the reference intergenic release used when a BgeeMetadata
object is created is the last stable one created by the Bgee team.
Core reference intergenic
releases are created by the Bgee team when a lot of new RNA-Seq libraries have been manually curated for already existing species and/or for new species. These releases are the only ones with a release number (e.g “0.1”). Each of these releases contains reference intergenic sequences for a list of species. Bgee reference intergenic sequences have been generated using Bgee team expertise. The RNA-Seq libraries were manually curated as healthy and wild type. Quality Control have been done along all steps of generation of these sequences. Reference intergenic sequences have been selected from all potential intergenic regions (see Bgee pipeline). BgeeCall allows to generate gene expression call from Bgee reference intergenic sequences for any RNA-Seq libraries as long as these sequences have been generated by the Bgee team. A tsv file containing all species available for current release of reference intergenic is available here. This file also contains a column describing the number of RNA-Seq libraries used to generated the reference intergenic sequences of each species. It is also possible to list in R all species for which Bgee reference intergenic sequences have been created :
list_bgee_ref_intergenic_species(myBgeeMetadata = bgee)
#> speciesId speciesName numberOfLibraries genomeVersion
#> 1 9606 Homo sapiens 5026 GRCh38.p5
#> 2 10090 Mus musculus 133 GRCm38.p4
#> 3 9544 Macaca mulatta 90 MMUL1.0
#> 4 7955 Danio rerio 67 GRCz10
#> 5 8364 Xenopus tropicalis 66 JGI4.2
#> 6 6239 Caenorhabditis elegans 50 WBcel235
#> 7 9031 Gallus gallus 45 Galgal4
#> 8 10116 Rattus norvegicus 36 Rnor_6.0
#> 9 9913 Bos taurus 33 UMD3.1
#> 10 13616 Monodelphis domestica 19 monDom5
#> 11 9258 Ornithorhynchus anatinus 17 OANA5
#> 12 7240 Drosophila simulans 17 GCA_000259055.1
#> 13 9598 Pan troglodytes 15 CHIMP2.1.4
#> 14 7237 Drosophila pseudoobscura 14 GCA_000001765.2
#> 15 7227 Drosophila melanogaster 14 BDGP6
#> 16 9593 Gorilla gorilla 13 gorGor3.1
#> 17 9597 Pan paniscus 12 CHIMP2.1.4
#> 18 9823 Sus scrofa 10 Sscrofa10.2
#> 19 10141 Cavia porcellus 9 Felis_catus_6.2
#> 20 9685 Felis catus 9 cavPor3
#> 21 7230 Drosophila mojavensis 8 EquCab2
#> 22 9796 Equus caballus 8 GCA_000005175.1
#> 23 9986 Oryctolagus cuniculus 6 eriEur1
#> 24 9615 Canis lupus familiaris 6 CanFam3.1
#> 25 9365 Erinaceus europaeus 6 OryCun2.0
#> 26 7244 Drosophila virilis 4 GCA_000005245.1
#> 27 28377 Anolis carolinensis 4 AnoCar2.0
#> 28 7217 Drosophila ananassae 4 GCA_000005975.1
#> 29 7245 Drosophila yakuba 4 GCA_000005115.1
If you want to use BgeeCall on a species for which Bgee does not provide reference intergenic sequences you have the possibility to create them by yourself and share them with the Bgee community by following all steps of this tutorial. Do not forget that the number of RNA-Seq libraries is a key point to the generation of precise reference intergenic sequences. It is possible to list in R all species for which reference intergenic sequences have been created by the community using the following code
list_community_ref_intergenic_species()
#> speciesId numberOfLibraries annotationVersion genomeVersion kallistoVersion
#> 1 10036 15 MesAur1.0 MesAur1.0 0.46.0
#> 2 13686 243 Si_gnG Si_gnG 0.44.0
#> url
#> 1 https://zenodo.org/api/files/f46c7de0-d9a5-4ffd-a30e-4b08121ba446/ref_intergenic.fa.gz
#> 2 https://zenodo.org/api/files/5492ff2f-91a3-4101-8d67-78b8f8625cc6/ref_intergenic.fa.gz
If reference intergenic sequences of the species you are interested in are available only from the community release it is then possible to use this release to generate your present/absent calls
If you generated your own reference intergenic sequences follwowing this tuorial but did not share them for the moment (do not forget to do it…), it is also possible to use BgeeCall with a file containing the sequences. In this case you need to select the custom release and provide the path to the file containing reference intergenic sequences :
kallisto generates TPMs at the transcript level. In the Bgee pipeline we summarize this expression at the gene level to calculate our present/absent calls. In BgeeCall
it is now possible to generate present/absent calls at the transcript level. Be careful when using this feature as it has not been tested for the moment. To generate such calls you only have to create one object of the class KallistoMetadata
and edit the value of one attribute
By default BgeeCall
suppose that kallisto is installed. If kallisto is not installed on your computer you can either :
kallisto <- new("KallistoMetadata", download_kallisto = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:
It is possible to modify them and use your favourite kallisto parameters
By default 2 indexes with 2 different kmer sizes can be used by BgeeCall
The default kmer size of kallisto (31) is used for libraries with reads length equal or larger than 50 bp. A kmer size of 15 is used for libraries with reads length smaller than 50 bp. We decided not to allow to tune kmers size because the generation of the index is time consuming and index generation takes even more time with small kmers size (< 15bp). However it is possible to modify the threshold of read length allowing to choose between default and small kmer size.
By default gene expression calls are generated using all runs of the RNA-Seq library. It is possible to select only a subset of these runs.
# RNA-Seq run SRR350955_subsetof from the RNA-Seq library will be used to generate the calls
user_BgeeCall <- setRunIds(user_BgeeCall, c("SRR350955_subset"))
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)
When run IDs are selected, the name output directory combine the library ID and all selected run IDs. In our example the expression calls will be stored in the directory SRX099901_SRR350955_subset
.
By default the threshold of present/absent is calculated with the formula :
proportion of ref intergenic present / proportion of protein coding present = 0.05
This 0.05 corresponds to the ratio used in the Bgee pipeline. However it is possible to edit this value. Be careful when editing this value as it has a big impact on your present absent.
By default BgeeCall write output messages for all parts of the workflow. It is possible not to write any messages by changing the value of the slot verbose in the UserMetadata object. By default this value is set to true but it is possible to change it to false with this line :
Generation of present/absent expression calls is done in several steps. It is possible to force overwritting already existing intermediary files : - overwrite_index : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the index generation step is skiped if one index already exists. If TRUE, the kallisto index will be generated even if one index already exists. - overwrite_quant : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the kallisto quantification step is skiped if a quantification file already exists. If TRUE, the kallisto quantification step will be run even if a quantification file already exists. - overwrite_calls : slot of the object KallistoMetadata. The value has to be a logical. If FALSE, the generation of present/absent calls is skiped if an index already exists. If TRUE (default), the generation of present/absent calls will be run even if calls were already generated.
By default the arborescence of directories created by BgeeCall
is as simple as possible. the results will be created using the path working_path/intergenic_release/all_results/libraryId
. Generating present/absent gene expression calls for the same RNA-Seq library using different transcriptome or annotation versions using this arborescence will overwrite previous results. The UserMetadata
class has an attribute simple_arborescence
that is TRUE
by default. If FALSE
, a complexe arborescence of directories containing the name of the annotation and transcriptome files will be created. This complex arborescence will then allow to generate present/absent calls for the same library using different version of transcriptome or annotaiton.
By default directories used to save present/absent calls are subdirectories of UserMetadata@working_path
. However it is possible to select the directory where you want the calls to be generated.
This output directory will only contains results generated at the RNA-Seq library level. All data generated at species level are still stored using the UserMetadata@working_path
. They can then still be reused to generate calls from other libraries of the same species.
#Session Info
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] rtracklayer_1.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
#> [4] IRanges_2.22.1 S4Vectors_0.26.0 BiocGenerics_0.34.0
#> [7] BgeeCall_1.4.1
#>
#> loaded via a namespace (and not attached):
#> [1] Biobase_2.48.0 httr_1.4.1
#> [3] bit64_0.9-7 jsonlite_1.6.1
#> [5] AnnotationHub_2.20.0 shiny_1.4.0.2
#> [7] assertthat_0.2.1 interactiveDisplayBase_1.26.0
#> [9] askpass_1.1 BiocManager_1.30.10
#> [11] BiocFileCache_1.12.0 blob_1.2.1
#> [13] BSgenome_1.56.0 GenomeInfoDbData_1.2.3
#> [15] Rsamtools_2.4.0 yaml_2.2.1
#> [17] progress_1.2.2 BiocVersion_3.11.1
#> [19] pillar_1.4.4 RSQLite_2.2.0
#> [21] lattice_0.20-41 glue_1.4.0
#> [23] digest_0.6.25 promises_1.1.0
#> [25] XVector_0.28.0 htmltools_0.4.0
#> [27] httpuv_1.5.2 Matrix_1.2-18
#> [29] XML_3.99-0.3 pkgconfig_2.0.3
#> [31] biomaRt_2.44.0 zlibbioc_1.34.0
#> [33] xtable_1.8-4 purrr_0.3.4
#> [35] later_1.0.0 BiocParallel_1.22.0
#> [37] tibble_3.0.1 openssl_1.4.1
#> [39] ellipsis_0.3.0 SummarizedExperiment_1.18.1
#> [41] GenomicFeatures_1.40.0 mime_0.9
#> [43] magrittr_1.5 crayon_1.3.4
#> [45] memoise_1.1.0 evaluate_0.14
#> [47] tools_4.0.0 prettyunits_1.1.1
#> [49] hms_0.5.3 lifecycle_0.2.0
#> [51] matrixStats_0.56.0 stringr_1.4.0
#> [53] Rhdf5lib_1.10.0 DelayedArray_0.14.0
#> [55] AnnotationDbi_1.50.0 Biostrings_2.56.0
#> [57] compiler_4.0.0 rlang_0.4.6
#> [59] rhdf5_2.32.0 grid_4.0.0
#> [61] RCurl_1.98-1.2 tximport_1.16.0
#> [63] rappdirs_0.3.1 bitops_1.0-6
#> [65] rmarkdown_2.1 DBI_1.1.0
#> [67] curl_4.3 R6_2.4.1
#> [69] GenomicAlignments_1.24.0 knitr_1.28
#> [71] dplyr_0.8.5 fastmap_1.0.1
#> [73] bit_1.1-15.2 readr_1.3.1
#> [75] stringi_1.4.6 Rcpp_1.0.4.6
#> [77] vctrs_0.3.0 dbplyr_1.4.3
#> [79] tidyselect_1.1.0 xfun_0.13