Warning: Due to space limitations, we demonstrate the useful functions in this workflow without images. In order to see an HTML report generated based on the full genome annotations and complete datasets described in the vignette, please see here.
For the most up-to-date functionality, usage and installation instructions, and example outputs, see our github repository here.
The first release of RCAS was designed to produce quality controls and exploratory plots for the analysis of transcriptome-wide regions of interest with regard to their feature specific distributions, sequence motif signals, and biological function enrichments. However, this workflow didn’t include functions to directly compare multiple samples. However, it may be often desirable to see how the biological content of one sample relates to the content obtained from different samples. For instance, one can design a CLIP-seq based experiment with multiple conditions each containing multiple biological replicates and want to address the following common questions:
Since the release of RCAS 1.4.0, it is possible to process more than one input BED files and generate a self-containing HTML report with plots and tables that compare/contrast the input samples.
In this vignette, we will use example datasets from CLIP based sequencing experiments to show how to use the meta-analysis functionality of the RCAS package. However, the same workflow can be applied to any kind of high-throughput dataset that pertains to transripts and can be expressed as genomic ranges in a BED format file.
When starting from scratch, the two necessary inputs are:
In this vignette, we will analyse the peak regions discovered via CLIP-sequencing experiments of the RNA-binding protein FUS by Nakaya et al, 2013, Synaptic Functional Regulator FMR1 by Ascano et al. 2012, and Eukaryotic initiation factor 4A-III by Sauliere et al, 2012. We use two replicates from both experiments and constrain the analysis to a small portion of the genome (the first 1 million base pairs of chromosome 1). Both the genome annotations and the peaks detected within these regions are available as built-in data in the RCAS package. To obtain the full datasets, please refer to the doRiNa database.
First, we detect the paths to these datasets:
FUS_rep1_path <- system.file('extdata', 'FUS_Nakaya2013c_hg19.bed', package = 'RCAS')
FUS_rep2_path <- system.file('extdata', 'FUS_Nakaya2013d_hg19.bed', package = 'RCAS')
FMR1_rep1_path <- system.file('extdata', 'FMR1_Ascano2012a_hg19.bed', package = 'RCAS')
FMR1_rep2_path <- system.file('extdata', 'FMR1_Ascano2012b_hg19.bed', package = 'RCAS')
EIF4A3_rep1_path <- system.file('extdata', 'EIF4A3Sauliere20121a.bed', package = 'RCAS')
EIF4A3_rep2_path <- system.file('extdata', 'EIF4A3Sauliere20121b.bed', package = 'RCAS')
Then, we create a file containing the sample names and the locations of the BED files for each sample.
projData <- data.frame('sampleName' = c('FUS_1', 'FUS_2', 'FMR1_1', 'FMR1_2', 'EIF4A3_1', 'EIF4A3_2'),
'bedFilePath' = c(FUS_rep1_path, FUS_rep2_path,
FMR1_rep1_path, FMR1_rep2_path,
EIF4A3_rep1_path, EIF4A3_rep2_path),
stringsAsFactors = FALSE)
projDataFile <- file.path(getwd(), 'myProjDataFile.tsv')
write.table(projData, projDataFile, sep = '\t', quote =FALSE, row.names = FALSE)
The second input is the path to the GTF file containing genome annotations. Here we use the first 1 million bases of the chromosome 1 from the Ensembl Database (version 75 - corresponds to GRCh37 build - hg19 in UCSC).
In order to avoid redundant preprocessing of the same input files, we
have developed a function RCAS::createDB
to save all
preprocessed data into an SQLite database using the R package RSQLite.
With this function, all input files from the project data file are
processed, the resulting processed data are saved in the following SQL
tables:
gtfFilePath
argument. The fields of this table are:
seqnames, start, end, width, strand, source, type, score, gene_id,
gene_name, transcript_id, exon_id, and exon_number.To create a database with preprocessed data, use
RCAS::createDB()
function. When starting a fresh database,
we can provide a file name that does not exist at the corresponding
folder must be provided. If a database already exists at the given
location, an error will be returned.
databasePath <- file.path(getwd(), 'myProject.sqlite')
invisible(createDB(dbPath = databasePath, projDataFile = projDataFile, gtfFilePath = gtfFilePath, genomeVersion = 'hg19'))
However, it is also possible to update an existing database by
setting the update
argument to TRUE
. This is
useful when you want to add new datasets to an existing project. This
will only add processed data for samples that don’t already exist in the
database. This will not attempt to overwrite existing data for existing
samples.
createDB(dbPath = databasePath, projDataFile = projDataFile, gtfFilePath = gtfFilePath, genomeVersion = 'hg19', update = TRUE)
It is also possible to turn off certain analysis modules by setting
annotationSummary
, coverageProfiles
, or
motifAnalysis
modules to FALSE
. For instance,
the following command will create the same database except for the
discoveredMotifs
table.
createDB(dbPath = databasePath, projDataFile = projDataFile, gtfFilePath = gtfFilePath, genomeVersion = 'hg19', motifAnalysis = FALSE)
If the user wishes to overwrite datasets for existing samples, all relevant data for the given samples must be first purged from the database. It is possible to do so via:
Here we erase all entries relevant to samples ‘FMR1_1’ and ‘FMR1_2’.
To have a quick look into the contents of any given database, we can
use the RCAS::summarizeDatabaseContent()
function. This
returns a table of number of row entries for each sample in any given
table that exists in the database.
Each table of the database can be read into memory using the
RSQLite::dbReadTable()
function. To read a specific table
from a database, firstly a connection to the sqlite dump needs to be
created.
List the tables that exist in the connection:
A table can be read into an R object:
annotationSummaries <- RSQLite::dbReadTable(mydb, 'annotationSummaries')
knitr::kable(annotationSummaries)
Warning: It is important to consider that the tables
read into an R object from an sqlite database are of the
data.frame
class. For most tables it is the desired format,
however some of the tables need converting from data.frame
to the desired class (e.g. GenomicRanges
for
gtfData
table) for downstream processing.
Once an sqlite database is obtained,
RCAS::runReportMetaAnalysis()
function can be utilized to
quickly generate comparative analysis reports between whichever sample
groups are desired to be compared. To generate this report, only two
inputs are required:
sampleName
column. For example, replicates
of the same biological condition can be assigned the same
sampleGroup
value.Here is how to generate a stand-alone HTML report with interactive
figures and tables from a pre-calculated RCAS database that compares
CLIP datasets from two replicates of FUS
and two replicates
of EIF4A3
.
Let’s first create a sample data file. The sample names found in this file should be a subset of the original project data file used to generate the sqlite database.
sampleData <- data.frame('sampleName' = c('FUS_1', 'FUS_2', 'EIF4A3_1', 'EIF4A3_2'),
'sampleGroup' = c('FUS', 'FUS', 'EIF4A3', 'EIF4A3'),
stringsAsFactors = FALSE)
sampleDataFile <- file.path(getwd(), 'mySampleDataTable.tsv')
write.table(sampleData, sampleDataFile, sep = '\t', quote =FALSE, row.names = FALSE)
Now, we are ready to get an HTML report:
RCAS is developed in the group of Altuna Akalin (head of the Scientific Bioinformatics Platform) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.
RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).
To cite RCAS in publications use:
Bora Uyar, Dilmurat Yusuf, Ricardo Wurmus, Nikolaus Rajewsky, Uwe Ohler, Altuna Akalin; RCAS: an RNA centric annotation system for transcriptome-wide regions of interest. Nucleic Acids Res 2017 gkx120. doi: 10.1093/nar/gkx120