Note: the most recent version of this vignette can be found here and a short overview slide show here.

Note: if you use systemPipeR and systemPipeRdata in published research, please cite:

Backman, T.W.H. and Girke, T. (2016). systemPipeR: NGS Workflow and Report Generation Environment. BMC Bioinformatics, 17: 388. 10.1186/s12859-016-1241-0.

1 Introduction

systemPipeRdata is a helper package that generates, with a single command, NGS workflow templates intended for use with its parent package systemPipeR (Backman and Girke 2016). The latter is an environment for building end-to-end analysis pipelines with automated report generation for next-generation sequencing (NGS) applications such as RNA-Seq, Ribo-Seq, ChIP-Seq, VAR-Seq and many others. The directory structure of the workflow templates and the sample data used by systemPipeRdata are described here.

2 Getting Started

2.1 Installation

The R software for using systemPipeRdata can be downloaded from CRAN. The systemPipeRdata package can be installed from within R as follows:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeRdata")  # Installs from Bioconductor once available there
BiocManager::install("tgirke/systemPipeR", build_vignettes = TRUE,
    dependencies = TRUE)  # Installs from GitHub

2.2 Loading package and documentation

library("systemPipeRdata")  # Loads the package
library(help = "systemPipeRdata")  # Lists package info
vignette("systemPipeRdata")  # Opens vignette

2.3 Generate workflow template

Load one of the available NGS workflows into your current working directory. The following does this for the varseq template. The name of the resulting workflow directory can be specified with the mydirname argument; the default NULL uses the name of the chosen workflow. An error is issued if a directory of the same name and path already exists. It is also possible to choose a different version of the workflow template. Please check the available options here, or provide the download URL of your template. The URL can be specified with the url argument and the file name with the urlname argument. The default NULL copies the current version available in systemPipeRdata.

genWorkenvir(workflow = "varseq", mydirname = NULL, url = NULL, 
    urlname = NULL)
setwd("varseq")

On Linux and OS X systems, the same can be achieved from the command line of a terminal with the following command.

$ Rscript -e "systemPipeRdata::genWorkenvir(workflow='varseq', mydirname=NULL, url=NULL, urlname=NULL)"
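Because genWorkenvir stops with an error when the target directory already exists, it can help to pick an unused name first. The following is a minimal sketch under that assumption; free_dirname and gen_workenvir_fresh are hypothetical helpers written for this illustration and are not part of systemPipeRdata.

```r
# Hypothetical helper (not part of systemPipeRdata): returns `base` if no
# directory of that name exists under `path`; otherwise returns the first
# of base_1, base_2, ... that is still unused.
free_dirname <- function(base, path = ".") {
  candidate <- base
  i <- 0
  while (dir.exists(file.path(path, candidate))) {
    i <- i + 1
    candidate <- paste0(base, "_", i)
  }
  candidate
}

# Hypothetical wrapper: generates the template in the current working
# directory under a name that does not clash with an existing directory.
gen_workenvir_fresh <- function(workflow = "varseq") {
  stopifnot(requireNamespace("systemPipeRdata", quietly = TRUE))
  target <- free_dirname(workflow)
  systemPipeRdata::genWorkenvir(workflow = workflow, mydirname = target)
  invisible(target)  # name of the directory that was created
}
```

Calling gen_workenvir_fresh("varseq") would then create varseq, or varseq_1 if varseq is already present, and return the chosen name so it can be passed to setwd().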

2.3.1 Directory Structure

The workflow templates generated by genWorkenvir contain the following preconfigured directory structure:

  • workflow/ (e.g. rnaseq/)
    • This is the root directory of the R session running the workflow.
    • Run script (*.Rmd) and sample annotation (targets.txt) files are located here.
    • Note, this directory can have any name (e.g. rnaseq, varseq). Changing its name does not require any modifications in the run script(s).
    • Important subdirectories:
      • param/
        • Stores non-CWL parameter files such as: *.param, *.tmpl and *.run.sh. These files are only required for backwards compatibility to run old workflows using the previous custom command-line interface.
        • param/cwl/: This subdirectory stores all the CWL parameter files. To organize workflows, each can have its own subdirectory, where all CWL param and input.yml files need to be in the same subdirectory.
      • data/
        • FASTQ files
        • FASTA file of reference (e.g. reference genome)
        • Annotation files
        • etc.
      • results/
        • Analysis results are usually written to this directory, including: alignment, variant and peak files (BAM, VCF, BED); tabular result files; and image/plot files
        • Note, the user has the option to organize results files for a given sample and analysis step in a separate subdirectory.

Note: Directory names are indicated in green. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.

Figure 1: systemPipeR’s preconfigured directory structure.
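The layout listed above can be verified from within R before a workflow is started. The sketch below uses base R only; check_workenvir is a hypothetical helper introduced for illustration, and the set of expected subdirectories is taken from the list above.

```r
# Subdirectories named in the directory structure described above.
expected_dirs <- c("param", file.path("param", "cwl"), "data", "results")

# Hypothetical helper (not part of systemPipeR or systemPipeRdata): reports,
# as a named logical vector, which expected subdirectories exist in `wf_dir`.
check_workenvir <- function(wf_dir = ".") {
  present <- dir.exists(file.path(wf_dir, expected_dirs))
  names(present) <- expected_dirs
  present
}
```

Running check_workenvir("varseq") from the parent directory of a generated template returns one TRUE or FALSE per subdirectory, which makes a renamed or missing directory easy to spot after the structure has been customized.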

2.4 Run workflows

Next, run the chosen sample workflow from within R by executing the code provided in the corresponding *.Rmd template file. If preferred, the corresponding *.Rnw or *.R versions can be used instead. Alternatively, one can run an entire workflow from start to finish with a single command by executing 'make -B' from the command line within the workflow directory (here 'varseq'). Much more detailed information on running and customizing systemPipeR workflows is available in its overview vignette here. This vignette can also be opened from R with the following command.

library("systemPipeR")  # Loads systemPipeR which needs to be installed via BiocManager::install() from Bioconductor
vignette("systemPipeR", package = "systemPipeR")
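As an alternative to stepping through the template by hand, the run script can be located and rendered from R. This is only a sketch: find_run_scripts and run_workflow are hypothetical helpers, it assumes that the rmarkdown package is installed and that the workflow directory holds exactly one *.Rmd file, and chunks that the template marks as not evaluated are skipped by the renderer.

```r
# Hypothetical helper: lists the *.Rmd run scripts in a workflow directory.
find_run_scripts <- function(wf_dir = ".") {
  list.files(wf_dir, pattern = "\\.Rmd$", full.names = TRUE)
}

# Hypothetical helper: renders the single run script found in `wf_dir`.
# Assumes rmarkdown is installed; which steps actually run depends on the
# chunk options set in the template.
run_workflow <- function(wf_dir = ".") {
  scripts <- find_run_scripts(wf_dir)
  if (length(scripts) != 1) {
    stop("Expected exactly one *.Rmd file, found ", length(scripts))
  }
  stopifnot(requireNamespace("rmarkdown", quietly = TRUE))
  rmarkdown::render(scripts)
}
```

With the working directory set to the generated template, run_workflow() renders the report next to the run script.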

2.5 Return paths to sample data

The location of the sample data provided by systemPipeRdata can be returned as a list.

pathList()
## $targets
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/param/targets.txt"
## 
## $targetsPE
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/param/targetsPE.txt"
## 
## $annotationdir
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/annotation/"
## 
## $fastqdir
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/fastq/"
## 
## $bamdir
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/bam/"
## 
## $paramdir
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/param/"
## 
## $workflows
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/"
## 
## $chipseq
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/chipseq/"
## 
## $rnaseq
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/rnaseq/"
## 
## $riboseq
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/riboseq/"
## 
## $varseq
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/varseq/"
## 
## $new
## [1] "/tmp/RtmpVL0PBd/Rinst10d72a8c7ff2/systemPipeRdata/extdata/workflows/new/"
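The returned list can be used directly to inspect or load the sample data. The sketch below assumes that the targets file is tab-delimited with comment lines starting with "#"; paths_exist and read_targets are hypothetical helpers written for this illustration.

```r
# Hypothetical helper: for a named list of paths (such as the one returned
# by pathList()), reports which entries exist on disk.
paths_exist <- function(paths) {
  vapply(paths, file.exists, logical(1))
}

# Hypothetical helper: reads the targets file referenced by the list.
# Assumes a tab-delimited file whose comment lines start with "#".
read_targets <- function(paths) {
  read.delim(paths$targets, comment.char = "#")
}

# Only runs when systemPipeRdata is installed.
if (requireNamespace("systemPipeRdata", quietly = TRUE)) {
  paths <- systemPipeRdata::pathList()
  print(paths_exist(paths))
  print(head(read_targets(paths)))
}
```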

3 Version information

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices
## [6] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] systemPipeRdata_1.16.2      batchtools_0.9.13          
##  [3] ape_5.4-1                   ggplot2_3.3.2              
##  [5] systemPipeR_1.22.0          ShortRead_1.46.0           
##  [7] GenomicAlignments_1.24.0    SummarizedExperiment_1.18.2
##  [9] DelayedArray_0.14.1         matrixStats_0.57.0         
## [11] Biobase_2.48.0              BiocParallel_1.22.0        
## [13] Rsamtools_2.4.0             Biostrings_2.56.0          
## [15] XVector_0.28.0              GenomicRanges_1.40.0       
## [17] GenomeInfoDb_1.24.2         IRanges_2.22.2             
## [19] S4Vectors_0.26.1            BiocGenerics_0.34.0        
## [21] BiocStyle_2.16.1           
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.10         GOstats_2.54.0          
##   [3] BiocFileCache_1.12.1     GSEABase_1.50.1         
##   [5] splines_4.0.2            usethis_1.6.3           
##   [7] digest_0.6.25            htmltools_0.5.0         
##   [9] GO.db_3.11.4             fansi_0.4.1             
##  [11] magrittr_1.5             checkmate_2.0.0         
##  [13] memoise_1.1.0            BSgenome_1.56.0         
##  [15] base64url_1.4            limma_3.44.3            
##  [17] remotes_2.2.0            annotate_1.66.0         
##  [19] askpass_1.1              prettyunits_1.1.1       
##  [21] jpeg_0.1-8.1             colorspace_1.4-1        
##  [23] blob_1.2.1               rappdirs_0.3.1          
##  [25] xfun_0.17                dplyr_1.0.2             
##  [27] callr_3.4.4              crayon_1.3.4            
##  [29] RCurl_1.98-1.2           jsonlite_1.7.1          
##  [31] graph_1.66.0             genefilter_1.70.0       
##  [33] brew_1.0-6               survival_3.2-3          
##  [35] VariantAnnotation_1.34.0 glue_1.4.2              
##  [37] gtable_0.3.0             zlibbioc_1.34.0         
##  [39] V8_3.2.0                 pkgbuild_1.1.0          
##  [41] Rgraphviz_2.32.0         scales_1.1.1            
##  [43] pheatmap_1.0.12          DBI_1.1.0               
##  [45] edgeR_3.30.3             Rcpp_1.0.5              
##  [47] xtable_1.8-4             progress_1.2.2          
##  [49] bit_4.0.4                rsvg_2.1                
##  [51] AnnotationForge_1.30.1   httr_1.4.2              
##  [53] RColorBrewer_1.1-2       ellipsis_0.3.1          
##  [55] pkgconfig_2.0.3          XML_3.99-0.5            
##  [57] dbplyr_1.4.4             locfit_1.5-9.4          
##  [59] tidyselect_1.1.0         rlang_0.4.7             
##  [61] AnnotationDbi_1.50.3     munsell_0.5.0           
##  [63] tools_4.0.2              cli_2.0.2               
##  [65] generics_0.0.2           RSQLite_2.2.0           
##  [67] devtools_2.3.2           evaluate_0.14           
##  [69] stringr_1.4.0            yaml_2.2.1              
##  [71] fs_1.5.0                 processx_3.4.4          
##  [73] knitr_1.30               bit64_4.0.5             
##  [75] purrr_0.3.4              RBGL_1.64.0             
##  [77] nlme_3.1-149             formatR_1.7             
##  [79] biomaRt_2.44.1           debugme_1.1.0           
##  [81] compiler_4.0.2           curl_4.3                
##  [83] png_0.1-7                testthat_2.3.2          
##  [85] tibble_3.0.3             stringi_1.5.3           
##  [87] ps_1.3.4                 GenomicFeatures_1.40.1  
##  [89] desc_1.2.0               lattice_0.20-41         
##  [91] Matrix_1.2-18            vctrs_0.3.4             
##  [93] pillar_1.4.6             lifecycle_0.2.0         
##  [95] BiocManager_1.30.10      data.table_1.13.0       
##  [97] bitops_1.0-6             rtracklayer_1.48.0      
##  [99] R6_2.4.1                 latticeExtra_0.6-29     
## [101] hwriter_1.3.2            bookdown_0.20           
## [103] sessioninfo_1.1.1        codetools_0.2-16        
## [105] assertthat_0.2.1         pkgload_1.1.0           
## [107] openssl_1.4.3            Category_2.54.0         
## [109] rprojroot_1.3-2          rjson_0.2.20            
## [111] withr_2.3.0              GenomeInfoDbData_1.2.3  
## [113] hms_0.5.3                grid_4.0.2              
## [115] DOT_0.1                  rmarkdown_2.3

4 Funding

This project was supported by funds from the National Institutes of Health (NIH) and the National Science Foundation (NSF).

References

Backman, Tyler W. H., and Thomas Girke. 2016. “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics 17 (1):388. https://doi.org/10.1186/s12859-016-1241-0.