1 Introduction
- 1.1 Motivation
- 1.2 What is an ORFik experiment?
2 creating an ORFik experiment
- 2.1 A minimal experiment
- 2.2 Fixing or updating an experiment
3 The experiment object
- 3.1 Loading data from experiment
- 3.2 Plotting with ORFik experiments
4 P-site shifting experiment
5 Converting bam files to faster formats
6 ORFik QC report
- 6.1 General report
- 6.2 Ribo-seq specific pshifting and QC
7 Using the ORFik experiment system in your script
- 7.1 Looping over all libraries in experiment
  - 7.1.1 Reformat output to data.table (merge column-wise)
  - 7.1.2 Reformat output to data.table (merge row-wise)
8 Conclusion

1 Introduction

Welcome to the introduction of data management with ORFik experiment. This vignette will walk you through how to work with large amounts of sequencing data effectively in ORFik. ORFik is an R package containing various functions for analysis of RiboSeq, RNASeq, RCP-seq, TCP-seq, Chip-seq and Cage data, we advice you to read ORFikOverview vignette, before starting this one.

1.1 Motivation

NGS libraries are becoming more and more numerous. As a bioinformatician / biologist you often work on multi-library experiments, like 6 libraries of RNA-seq and 6 Ribo-seq libraries, split on 3 conditions with 2 replicates each. Then make some plots or statistics. A lot of things can go wrong when you scale up from just 1 library to many, or even to multiple experiments.

Another problem is also that annotations like gff and fasta files combined with the NGS data, must be separately loaded. Making it possible to use wrong annotation for the NGS data.

So to summarize, the ORFik experiment API abstracts what could be done with 1 NGS library and a corresponding organism annotation to the level of multiple libraries and the comparison between them, standardizing ploting, comparisons, loading libraries and many much more.

1.2 What is an ORFik experiment?

It is an object that simplify and error correct your NGS workflow, creating a single R object that stores and controls all results relevant to a specific experiment. It contains following important parts:

filepaths and info for each library in the experiment (for multiple files formats: bam, bed, wig, ofst, ..)
genome annotation files of the experiment (fasta genome, index, gtf, txdb)
organism name (for automatic GO, sequence analysis..)
description and author information (list.experiments(), show all experiments you have made with ORFik, easy to find and load them later)
API: ORFik supports a rich API for using the experiment,like outputLibs(experiment, type = “wig”) will load all libraries converted to wig format into R, loadTxdb(experiment) will load the txdb (gtf) of experiment, transcriptWindow() will automatically plot metacoverage of all libraries in the experiment, countTable(experiment) will load count tables, etc..)
Safety: It is also a safety in that it verifies your experiments contain no duplicate, empty or non-accessible files. Making it almost impossible to load the wrong data. In addition it has other safety checks; comparing chromosome naming of libraries and annotation, making sure there is no mixing of chr1 vs 1 as name for chromosome 1 etc.

2 creating an ORFik experiment

Let’s say we have a human experiment, containing annotation files (.gtf and .fasta genome) + Next generation sequencing libraries (NGS-data); RNA-seq, ribo-seq and CAGE.

An example of how to make the experiment will now be shown:

First load ORFik

library(ORFik)

In a normal experiment, you would usually have only bam files from alignment of your experiment to start with (and split this into 3 experiments, 1 for RNA-seq, 1 for Ribo-seq and 1 for CAGE), but to simplify this for you to replicate we use the ORFik example data.

2.1 A minimal experiment

The minimal amount of information you need to make an ORFik experiment is:

1. A folder with NGS data
1. A path to transcriptome annotation (gtf->slow or txdb->faster)
1. A path to genome (fasta)
1. A name for the experiment

# 1. Pick directory (normally a folder with your aligned bam files)
NGS.dir <- system.file("extdata", "", package = "ORFik")
# 2. .gff/.gtf location
txdb <- system.file("extdata", "annotations.gtf", package = "ORFik")
# 3. fasta genome location
fasta <- system.file("extdata", "genome.fasta", package = "ORFik")
# 4. Pick an experiment name
exper.name <- "ORFik_example_human"


list.files(NGS.dir)

##  [1] "QC_STATS"               "annotations.gtf"        "cage-seq-heart.bed.bgz"
##  [4] "features.rdata"         "genome.fasta"           "genome.fasta.fai"      
##  [7] "ofst"                   "pshifted"               "ribo-seq-heart.bed.bgz"
## [10] "ribo-seq.bam"           "ribo-seq.bam.bai"       "rna-seq-heart.bed.bgz"

Experiments are created by all accepted files from a folder (file extension given by type argument, default: bam, bed, wig), so remember to keep your experiment folder clean of other NGS libraries not related to the experiment.

# This experiment is intentionally malformed, so we first make only a template:
template <- create.experiment(dir = NGS.dir,  # directory of the NGS files for the experiment
                              exper.name,     # Experiment name
                              txdb = txdb,    # gtf / gff / gff.db annotation
                              fa = fasta,     # Fasta genome
                              organism = "Homo sapiens", # Scientific naming
                              saveDir = NULL, # Create template instead of ready experiment
                              )
# The experiment contains 3 main parts:
# 1. Annotation, organism, general info:
data.frame(template)[1:3, ]

##      X1                                                               X2 X3 X4
## 1  name                                              ORFik_example_human      
## 2   gff /tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/annotations.gtf      
## 3 fasta    /tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/genome.fasta      
##         X5           X6
## 1                      
## 2 organism Homo sapiens
## 3

# 2. NGS data set-up info:
data.frame(template)[4:8, 1:5]

##        X1    X2  X3        X4       X5
## 4 libtype stage rep condition fraction
## 5    CAGE heart                       
## 6     RFP heart                       
## 7     RFP                             
## 8     RNA heart

# 3. NGS File paths:
data.frame(template)[4:8, 6]

## [1] "filepath"                                                               
## [2] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/cage-seq-heart.bed.bgz"
## [3] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/ribo-seq-heart.bed.bgz"
## [4] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/ribo-seq.bam"          
## [5] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/rna-seq-heart.bed.bgz"

You see from the template, it excludes files with .bai or .fai, .rdata etc, and only using data of NGS libraries, defined by argument (type).

You can also see it tries to guess library types, stages, replicates, condition etc. It will also try to auto-detect paired end bam files.

2.2 Fixing or updating an experiment

Since every NGS file in a experiment must be a unique set of information columns ( there can not be 2 RNA-seq libraries from wildtype that are replicate1 etc), the create.experiment function will intentionally abort if it can not distinguish all the libraries in some way. (Example: It might find 2 files that are categorized as RNA-seq replicate 1, but the condtion: Wild type vs crispr mutant was not auto-detected), so the files would be non-unique.

To fix the things it did not find (a condition not specified, etc), there are 3 ways:

Specify the values manually in create.experiment. Example: condition = rep(c(“WT”, “Mutant”), each = 3)
Open the file in Excel / Libre office and edit
Edit the template (only applies if object was not directly saved, like in this vignette)

Let’s update the template to have correct tissue-fraction in one of the samples.

template$X5[6] <- "heart_valve" # <- fix non unique row (tissue fraction is heart valve)

df <- read.experiment(template)# read experiment from template

Normally you read experiments saved to disc, if you made only a template, save it by doing:

save.experiment(df, file = "path/to/save/experiment.csv")

You can then load the experiment whenever you need it.

3 The experiment object

To see the object, just show it like this:

df

## experiment: ORFik_example_human with 3 library types and 4 runs 
##    libtype stage    fraction
## 1:    CAGE heart            
## 2:     RFP heart heart_valve
## 3:     RFP                  
## 4:     RNA heart

You see here that file paths are hidden, you can access them like this:

filepath(df, type = "default")

## [1] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/cage-seq-heart.bed.bgz"
## [2] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/ribo-seq-heart.bed.bgz"
## [3] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/ribo-seq.bam"          
## [4] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/rna-seq-heart.bed.bgz"

If you have varying version of libraries, like p-shifted, bam, simplified wig files, you can get file paths to different version with this function, like this:

filepath(df[df$libtype == "RFP", ], type = "pshifted")[2] # RFP = Ribo-seq, Default location for pshifted reads

## [1] "/tmp/RtmpEIV12e/Rinst28f9d7b10a4eb/ORFik/extdata/pshifted/ribo-seq_pshifted.ofst"

3.1 Loading data from experiment

3.1.1 Loading NGS data to environment

There are 3 ways to load NGS data, the first one is to load data into an environment. By default all libraries are loaded into .GlobalEnv (global environment) with names decided by columns in experiment, to see what the names will be, do:

bamVarName(df) #This will be the names:

## [1] "CAGE_heart"             "RFP_heart_fheart_valve" "RFP"                   
## [4] "RNA_heart"

Now let’s auto-load the libraries to the global environment

outputLibs(df) # With default output.mode = "envir".

## Outputting libraries from: ORFik_example_human

To remove the outputted libraries:

# remove.experiments(df)

3.1.2 Loading NGS data as list

The second way gives you a list, where the elements are the NGS libraries. There are also two ways of loading the list:

outputLibs(df, output.mode = "envirlist") # Make sure NGS exist in envir, then return as list

## $CAGE_heart
## GRanges object with 500 ranges and 1 metadata column:
##         seqnames    ranges strand |     score
##            <Rle> <IRanges>  <Rle> | <integer>
##     [1]     chr1   2556380      + |         1
##     [2]     chr1  12428832      + |         1
##     [3]     chr1  17755857      + |         1
##     [4]     chr1  21280476      - |         1
##     [5]     chr1  31644889      - |         1
##     ...      ...       ...    ... .       ...
##   [496]     chrX  81329014      + |         1
##   [497]     chrX 103356533      + |        11
##   [498]     chrX 132217287      - |         1
##   [499]     chrX 140564830      + |         1
##   [500]     chrX 153954723      - |         1
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
## 
## $RFP_heart_fheart_valve
## GRanges object with 500 ranges and 6 metadata columns:
##           seqnames              ranges strand |        V7        V8        V9
##              <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##     [1] GL000220.1       109184-109205      + |    109183    109205         0
##     [2] GL000220.1       112148-112179      + |    112147    112179         0
##     [3] GL000220.1       112148-112179      + |    112147    112179         0
##     [4] GL000220.1       112148-112179      + |    112147    112179         0
##     [5] GL000220.1       112148-112179      + |    112147    112179         0
##     ...        ...                 ...    ... .       ...       ...       ...
##   [496]       chr8   81522077-81522106      - |  81522076  81522106         0
##   [497]       chr8 135773772-135773799      - | 135773771 135773799         0
##   [498]       chr9   76571752-76571781      + |  76571751  76571781         0
##   [499]       chr9   76571752-76571781      + |  76571751  76571781         0
##   [500]       chr9   93543250-93543273      + |  93543249  93543273         0
##               V10         V11         V12
##         <integer> <character> <character>
##     [1]         1          22           0
##     [2]         1          32           0
##     [3]         1          32           0
##     [4]         1          32           0
##     [5]         1          32           0
##     ...       ...         ...         ...
##   [496]         1          30           0
##   [497]         1          28           0
##   [498]         1          30           0
##   [499]         1          30           0
##   [500]         1          24           0
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths
## 
## $RFP
## GAlignments object with 16649 alignments and 0 metadata columns:
##           seqnames strand       cigar    qwidth     start       end     width
##              <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##       [1]    chr23      -         28M        28  17599129  17599156        28
##       [2]    chr23      -         28M        28  17599129  17599156        28
##       [3]    chr23      -         28M        28  17599129  17599156        28
##       [4]    chr23      -         28M        28  17599129  17599156        28
##       [5]    chr23      -         28M        28  17599129  17599156        28
##       ...      ...    ...         ...       ...       ...       ...       ...
##   [16645]     chr8      +         29M        29  24068894  24068922        29
##   [16646]     chr8      +         28M        28  24068907  24068934        28
##   [16647]     chr8      +         30M        30  24068919  24068948        30
##   [16648]     chr8      +         30M        30  24068919  24068948        30
##   [16649]     chr8      +         30M        30  24068939  24068968        30
##               njunc
##           <integer>
##       [1]         0
##       [2]         0
##       [3]         0
##       [4]         0
##       [5]         0
##       ...       ...
##   [16645]         0
##   [16646]         0
##   [16647]         0
##   [16648]         0
##   [16649]         0
##   -------
##   seqinfo: 1133 sequences from an unspecified genome
## 
## $RNA_heart
## GRanges object with 500 ranges and 6 metadata columns:
##           seqnames              ranges strand |        V7        V8        V9
##              <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##     [1] GL000220.1       117772-117872      - |    117771    117872         0
##     [2] GL000220.1       118010-118110      - |    118009    118110         0
##     [3] KI270733.1       134864-134964      - |    134863    134964         0
##     [4] KI270733.1       175346-175446      + |    175345    175446         0
##     [5] KI270733.1       177165-177265      - |    177164    177265         0
##     ...        ...                 ...    ... .       ...       ...       ...
##   [496]       chr9 130236114-130236214      - | 130236113 130236214         0
##   [497]       chr9 137008790-137008972      + | 137008789 137008972         0
##   [498]       chr9 137040403-137040580      - | 137040402 137040580         0
##   [499]       chr9 137220666-137220766      - | 137220665 137220766         0
##   [500]       chr9 137616179-137617934      + | 137616178 137617934         0
##               V10         V11         V12
##         <integer> <character> <character>
##     [1]         1         101           0
##     [2]         1         101           0
##     [3]         1         101           0
##     [4]         1         101           0
##     [5]         1         101           0
##     ...       ...         ...         ...
##   [496]         1         101           0
##   [497]         2       79,22       0,161
##   [498]         2       34,67       0,111
##   [499]         1         101           0
##   [500]         2        21,1      0,1755
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths

# Check envir, if it exist, list them and return, if not, only return list
outputLibs(df, output.mode = "list")

## $CAGE_heart
## GRanges object with 500 ranges and 1 metadata column:
##         seqnames    ranges strand |     score
##            <Rle> <IRanges>  <Rle> | <integer>
##     [1]     chr1   2556380      + |         1
##     [2]     chr1  12428832      + |         1
##     [3]     chr1  17755857      + |         1
##     [4]     chr1  21280476      - |         1
##     [5]     chr1  31644889      - |         1
##     ...      ...       ...    ... .       ...
##   [496]     chrX  81329014      + |         1
##   [497]     chrX 103356533      + |        11
##   [498]     chrX 132217287      - |         1
##   [499]     chrX 140564830      + |         1
##   [500]     chrX 153954723      - |         1
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
## 
## $RFP_heart_fheart_valve
## GRanges object with 500 ranges and 6 metadata columns:
##           seqnames              ranges strand |        V7        V8        V9
##              <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##     [1] GL000220.1       109184-109205      + |    109183    109205         0
##     [2] GL000220.1       112148-112179      + |    112147    112179         0
##     [3] GL000220.1       112148-112179      + |    112147    112179         0
##     [4] GL000220.1       112148-112179      + |    112147    112179         0
##     [5] GL000220.1       112148-112179      + |    112147    112179         0
##     ...        ...                 ...    ... .       ...       ...       ...
##   [496]       chr8   81522077-81522106      - |  81522076  81522106         0
##   [497]       chr8 135773772-135773799      - | 135773771 135773799         0
##   [498]       chr9   76571752-76571781      + |  76571751  76571781         0
##   [499]       chr9   76571752-76571781      + |  76571751  76571781         0
##   [500]       chr9   93543250-93543273      + |  93543249  93543273         0
##               V10         V11         V12
##         <integer> <character> <character>
##     [1]         1          22           0
##     [2]         1          32           0
##     [3]         1          32           0
##     [4]         1          32           0
##     [5]         1          32           0
##     ...       ...         ...         ...
##   [496]         1          30           0
##   [497]         1          28           0
##   [498]         1          30           0
##   [499]         1          30           0
##   [500]         1          24           0
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths
## 
## $RFP
## GAlignments object with 16649 alignments and 0 metadata columns:
##           seqnames strand       cigar    qwidth     start       end     width
##              <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##       [1]    chr23      -         28M        28  17599129  17599156        28
##       [2]    chr23      -         28M        28  17599129  17599156        28
##       [3]    chr23      -         28M        28  17599129  17599156        28
##       [4]    chr23      -         28M        28  17599129  17599156        28
##       [5]    chr23      -         28M        28  17599129  17599156        28
##       ...      ...    ...         ...       ...       ...       ...       ...
##   [16645]     chr8      +         29M        29  24068894  24068922        29
##   [16646]     chr8      +         28M        28  24068907  24068934        28
##   [16647]     chr8      +         30M        30  24068919  24068948        30
##   [16648]     chr8      +         30M        30  24068919  24068948        30
##   [16649]     chr8      +         30M        30  24068939  24068968        30
##               njunc
##           <integer>
##       [1]         0
##       [2]         0
##       [3]         0
##       [4]         0
##       [5]         0
##       ...       ...
##   [16645]         0
##   [16646]         0
##   [16647]         0
##   [16648]         0
##   [16649]         0
##   -------
##   seqinfo: 1133 sequences from an unspecified genome
## 
## $RNA_heart
## GRanges object with 500 ranges and 6 metadata columns:
##           seqnames              ranges strand |        V7        V8        V9
##              <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##     [1] GL000220.1       117772-117872      - |    117771    117872         0
##     [2] GL000220.1       118010-118110      - |    118009    118110         0
##     [3] KI270733.1       134864-134964      - |    134863    134964         0
##     [4] KI270733.1       175346-175446      + |    175345    175446         0
##     [5] KI270733.1       177165-177265      - |    177164    177265         0
##     ...        ...                 ...    ... .       ...       ...       ...
##   [496]       chr9 130236114-130236214      - | 130236113 130236214         0
##   [497]       chr9 137008790-137008972      + | 137008789 137008972         0
##   [498]       chr9 137040403-137040580      - | 137040402 137040580         0
##   [499]       chr9 137220666-137220766      - | 137220665 137220766         0
##   [500]       chr9 137616179-137617934      + | 137616178 137617934         0
##               V10         V11         V12
##         <integer> <character> <character>
##     [1]         1         101           0
##     [2]         1         101           0
##     [3]         1         101           0
##     [4]         1         101           0
##     [5]         1         101           0
##     ...       ...         ...         ...
##   [496]         1         101           0
##   [497]         2       79,22       0,161
##   [498]         2       34,67       0,111
##   [499]         1         101           0
##   [500]         2        21,1      0,1755
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths

3.1.3 Loading NGS data by fimport

The third way is to load manually, more secure, but also more cumbersome.

files <- filepath(df, type = "default")
CAGE_loaded_manually <- fimport(files[1])

If you use the auto-loading to environment and you have multiple experiments, it might be a chance of non-unique naming, 2 experiments might have a library called cage. To be sure names are unique, we add the experiment name in the variable name:

df@expInVarName <- TRUE
bamVarName(df) #This will be the names:

## [1] "ORFik_example_human_CAGE_heart"            
## [2] "ORFik_example_human_RFP_heart_fheart_valve"
## [3] "ORFik_example_human_RFP"                   
## [4] "ORFik_example_human_RNA_heart"

You see here that the experiment name, “ORFik” is in the variable name If you are only working on one experiment, you do not need to include the name, since there is no possibility of duplicate naming (the experiment class validates all names are unique).

Since we want NGS data names without “ORFik”, let’s remove the loaded libraries and load them again.

df@expInVarName <- FALSE
remove.experiments(df)

## Removed loaded libraries from experiment:ORFik_example_human

outputLibs(df)

## Outputting libraries from: ORFik_example_human

3.1.4 Loading Annotation and specific regions

There is also many function to load specific parts of the annotation:

txdb <- loadTxdb(df) # transcript annotation

## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK

Let’s say we want to load all leaders, cds and 3’ UTRs that are longer than 30. With ORFik experiment this is easy:

txNames <- filterTranscripts(txdb, minFiveUTR = 30, minCDS = 30, minThreeUTR = 30)
loadRegions(txdb, parts = c("leaders", "cds", "trailers"), names.keep = txNames)

The regions are now loaded into .GlobalEnv, only keeping transcripts from txNames.

3.2 Plotting with ORFik experiments

ORFik supports a myriad of plots for experiments. Lets make a plot with coverage over mrna, seperated by 5’ UTR, CDS and 3’ UTR in one of the ribo-seq libraries from the experiment

transcriptWindow(leaders, cds, trailers, df[3,])

## RFP

## [[1]]

## 
## [[2]]

4 P-site shifting experiment

If your experiment consists of Ribo-seq, you want to do p-site shifting.

shiftFootprintsByExperiment(df[df$libtype == "RFP",])

P-shifted ribo-seq will automaticly be stored as .bigWwig (track files for IGV/UCSC) and .ofst (ORFik serialized for R) files in a ./pshifted folder, relative to original libraries.

To validate p-shifting, use shiftPlots. Here is an example from Bazzini et al. 2014 I made.

df.baz <- read.experiment("zf_bazzini14_RFP")
shiftPlots(df.baz, title = "Ribo-seq, zebrafish, Bazzini et al. 2014", type = "heatmaps")

p-site analysis

To see the shifts per library do:

shifts.load(df)

To see the location of pshifted files:

filepath(df[df$libtype == "RFP",], type = "pshifted")

To load p-shifted libraries, you can do:

outputLibs(df[df$libtype == "RFP",], type = "pshifted")

5 Converting bam files to faster formats

Bam files are slow to load, and usually you don’t need all the information contained in a bam file.

Usually you convert to bed or bigWig files, but ORFik also support 3 formats for much faster loading and use of data.

5.1 ofst: ORFik serialized format

From the bam file store these columns as a serialized file (using the insane facebook zstandard compression): seqname, start, cigar, strand, score (number of identical replicates for that read).

This is the fastest format to use, loading time of 10GB Ribo-seq bam file reduced from ~ 5 minutes to ~ 1 second and ~ 15MB size.

convertLibs(df, type = "ofst") # Collapsed

5.2 bedo: bed ORFik file

From the bam file store these columns as text file: seqname, start, end (if not all widths are 1), strand, score (number of identical replicates for that read), size (size of cigar Ms according to reference)

The R object loaded from these files are GRanges, since cigar is not needed.

Loading time of 10GB Ribo-seq bam file reduced to ~ 10 seconds and ~ 100MB size.

5.3 bedoc: bed ORFik file with cigar

From the bam file store these columns as text file: seqname, cigar, start, strand, score (number of identical replicates for that read)

The R object loaded from these files are GAlignments or GAlignmentPairs, since cigar is needed.

Loading time of 10GB Ribo-seq bam file reduced to ~ 15 seconds and ~ 200MB size.

6 ORFik QC report

ORFik also support a full QC report for post alignment statistics, correlation plots, simplified libraries for plotting, meta coverage, ++.

6.1 General report

To optimize the experiment for use in ORFik, always run QCreport, you will then get:

1. ofst collapsed reads of the bam files for faster loading (saved in /ofst folder relative to the bam files)
1. count tables over mrna, 5’ UTRs, CDS and 3’ UTRs for all transcripts
1. QC plots from experiment (correlation, metacoverage, alignment stats, coverage of (5’ UTRs, CDS and 3’ UTRs), PCA outlier analysis, ++)
1. Statistics summarized in csv files.

6.1.1 How to run QC:

The default QC report:

QCreport(df)

6.1.2 How to load optimized output from QC:

Load Count tables for cds (FPKM normalized):

countTable(df, region = "cds", type = "fpkm")

Load Count tables for all mRNAs (DESeq object):

countTable(df, region = "mrna", type = "deseq")

6.1.3 How to load the QC statistics:

The statistics are saved in /QC_STATS/ folder relative to the bam files as csv files. To see the statistics, you can do:

QCstats(df)

6.1.4 How to see the QC plots:

The plots are saved in /QC_STATS/ folder relative to the bam files, this folder will contain all plots from the QC, either as pdf or png files dependent on what you specify in the QC.

6.2 Ribo-seq specific pshifting and QC

To pshift all ribo-seq files in an experiment, do:

shiftFootprintsByExperiment(df)

In addition there is a QC report for Ribo-seq, with some addition analysis of read lengths and frames. This should only be run on when you have pshifted the reads.

RiboQC.plot(df)

7 Using the ORFik experiment system in your script

Usually you want to do some operation on multiple data-sets. If ORFik does not include a premade function for what you want, you can make it yourself. If your data is in the format of an ORFik experiment, this operation is simple.

7.1 Looping over all libraries in experiment

Not all functions in ORFik supports abstraction from single library to the experiment syntax. Here 3 ways to run loops for the data for these cases are shown:

if you know you have enough memory to load all data at once

outputLibs(df, type = "pshifted") # Output all libraries, fastest way
libs <- bamVarName(df) # <- here are names of the libs that were outputed
cds <- loadRegion(df, "cds")
# parallel loop
bplapply(libs, FUN = function(lib, cds) { 
    return(entropy(cds, get(lib)))
}, cds = cds)

Not loading data to global environment

files <- filepath(df, type = "pshifted")
cds <- loadRegion(df, "cds")
# parallel loop
res <- bplapply(files, FUN = function(file, cds) { 
    return(entropy(cds, fimport(file)))
}, cds = cds)

No parallel evaluation (If you do not have a lot of memory)

files <- filepath(df, type = "pshifted")
cds <- loadRegion(df, "cds")
# Single thread loop
lapply(files, FUN = function(file, cds) { 
    return(entropy(cds, fimport(file)))
}, cds = cds)

7.1.1 Reformat output to data.table (merge column-wise)

Since the output from the above loops will output lists, a very fast conversion to data.table can be done with:

library(data.table)

outputLibs(df, type = "pshifted")
libs <- bamVarName(df) # <- here are names of the libs that were outputed
cds <- loadRegion(df, "cds")
# parallel loop
res <- bplapply(libs, FUN = function(lib, cds) { 
        return(entropy(cds, get(lib)))
    }, cds = cds)

res.by.columns <- copy(res) # data.table copies default by reference
# Add some names and convert
names(res.by.columns) <- libs
data.table::setDT(res.by.columns) # Will give 1 column per library
res.by.columns # Now by columns

7.1.2 Reformat output to data.table (merge row-wise)

To merge row-wise do:

res.by.rows <- copy(res)
# Add some names and convert
names(res.by.rows) <- libs
res.by.rows <- rbindlist(res.by.rows) # Will bind rows per library
res.by.columns # now melted row-wise

8 Conclusion

ORFik contains a whole API for using the ORFik.experiment S4 class to simplify coding over experiments. More examples of use shown in documentation and in the Annotation_Alignment and Ribo-seq pipeline vignettes.

Data management

30 September 2021

Package

Contents