1 Introduction

cell-free DNA (cfDNA) enters human blood circulation via various biological processes, such as necrosis(i.e. abnormal death) and apoptosis (i.e. normal death) of cell, and direct secretion etc. They could be sheded by both healthy and cancer cells. Thus, cfDNA molecules extracted from liquid biopsies will be a mixture of DNA fragments from both normal and abnormal cells. It contains critical information for cancer diagnosis, therapeutic decision-making and treatment response surveillance.

One of those information is the length of cfDNA fragments in base pair. A facinating phenomenon is its 10bp periodic oscillation in the fragmentation pattern plot. Many studies have proved that cfDNA from cancer cells (also called circulating-tumor DNA, ctDNA) in plasma are shorter than healthy cfDNA.This key feature have been used for cancer signal enrichment and sample classification by subsetting cfDNA fragments between 90 and 150bp from the sequencing data.

Although ploting the fragmentation patterns is an important step for cfDNA sequencing data analysis, there isn’t an R package designed for automated and standardized analysis in this study area. Here, I wrote this tutorial showing how to use cfDNAPro functions to perform characterisation and visualisation of cfDNA whole genome sequencing data. The functions of cfDNAPro help reach a high degree of clarity, transparency and reproducibility of analyses.

cfDNAPro now includes two sets of functions for data characterisation and visualisation respectively. Data characterisation functions consist of examplePath, callMode, callSize, callMetrics, callPeakDistance, and callValleyDistance;

Data visualisation functions are: plotAllToOne, plotMetrics, plotMode, plotModeSummary, plotPeakDistance, plotValleyDistance, and plotSingleGroup.

Details about each function please refer to their built-in documentations in R. For example, typing ?callMode in R console will show documentation of callMode.

2 Installation

For steady release, install via bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cfDNAPro")

For develop version, install via github:

if(!require(devtools)) {
  install.packages("devtools", repos="https://cloud.r-project.org")}

devtools::install_github("hw538/cfDNAPro")

3 Fragment Size Distribution of Each Group

In the cfDNA study area, fragment size metrics data is usually generated using Picard tool.

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. These file formats are defined in the Hts-specs repository. See especially the SAM specification and the VCF specification.

Thus, cfDNAPro accepts as input the insert sizes metrics files (i.e. txt files ) generated by Picard tools from Illumina NGS sequencing data using this function:CollectInsertSizeMetrics.

In addition, cfDNAPro could also directly read fragment size information from bam files (Of note, the bam should be NGS data generated from Illumina sequencing platforms) and extract the insert size data using these criteria: (a) reads have the “proper pair” flag, (b) mapping quality >= 30, and (c) CIGAR string doesn’t include “I” or “D”.

To use cfDNAPro package, gathering all txt files generated by Picard or bam files into sub-folders named by cohort name is required, even if when you have only one cohort. Example txt files are installed together with this package. Here is how to retrieve the example data:

library(cfDNAPro)
# Get the path to example data in cfDNAPro package library path.
data_path <- examplePath("groups_picard")

The example data has four groups separated into four sub-folders: “cohort_1”, “cohort_2”, “cohort_3”, and “cohort_4”. Each group contains their own samples (i.e. txt files). list.files(data_path, full.names = TRUE,recursive = TRUE) to check.

Note: the cohort 1, 2, 3 and 4 here are not biologically meaningful (i.e. these are simulated data, and without a cancer type related to them). In the sessions below, you would see several plots with the typical 10bp period oscillations in cfDNA fragmentation profile. Please be aware that these plots/data are for cfDNAPro function demonstration only. I aimed to show how cfDNAPro offers a convenient way to do data analysis and visualization.

In your own study, you could utilize these functions to analyze your cancer cfDNA NGS sequencing data, and to identify the fragmentation features useful for answering research questions.

3.1 Generate a plot for a single group/cohort

There are data from four groups, we want to simply plot one cohort (e.g. cohort_1) How to plot those three txt files in “cohort_1”?

cohort1_plot <- cfDNAPro::callSize(path = data_path) %>%
  dplyr::filter(group == as.character("cohort_1")) %>%
  cfDNAPro::plotSingleGroup()
#> setting default outfmt to df.
#> setting default input_type to picard.

In cohort1_plot (actually it is a list in R!) we could see three ggplot2 objects: prop_plot, cdf_plot, and one_minus_cdf_plot. This means you could modify those plots using ggplot2 syntax.

3.2 Matipulate your plots

In this session I will show you could manipulate the ggplot2 objects and plot multiple panels in a figure.

Explanation on how the metrics are calculated:

Upper panels: x-axis is the fragment length (from 30bp, 31bp, …, 500bp), these are discrete integers. y-axis is the proportion of reads with a specific read length.The line here is NOT a smoothed curve, it is a line connecting the different data points.

Lower panels (i.e. the cumulative plot): x-axis is the fragment length (from 30bp, 31bp, …, 500bp), these are discrete integers. The calculation of y is (1) count the number of reads <= 30bp (e.g. N); (2) normalize N to proportion Repeat step (1) and (2) for each fragment size (i.e. 30bp, 31bp, …, 500bp). Then plot these data points as a line graph. It is NOT a smoothed curve, it is a line connecting the different data points.

For the x-axis, we only plot 30bp-500bp, as the proportion of fragments < 30 are noisy. And the proportion of fragments > 500 are usually too low, thus not useful at all. If users want to expand the x-axis, there are parameters could be set in the callSizefunctions.

library(scales)
library(ggpubr)
library(ggplot2)
library(dplyr)


# Define a list for the groups/cohorts.
grp_list<-list("cohort_1"="cohort_1",
               "cohort_2"="cohort_2",
               "cohort_3"="cohort_3",
               "cohort_4"="cohort_4")

# Generating the plots and store them in a list.
result<-sapply(grp_list, function(x){
  result <-callSize(path = data_path) %>% 
    dplyr::filter(group==as.character(x)) %>% 
    plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.

# Multiplexing the plots in one figure
suppressWarnings(
  multiplex <-
    ggarrange(result$cohort_1$prop_plot + 
              theme(axis.title.x = element_blank()),
            result$cohort_4$prop_plot + 
              theme(axis.title = element_blank()),
            result$cohort_1$cdf_plot,
            result$cohort_4$cdf_plot + 
              theme(axis.title.y = element_blank()),
            labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
            label.x = 0.2,
            ncol = 2,
            nrow = 2))

multiplex


4 Median Size Metrics

In last session we plotted all samples in each group. We will calculate the median size fragment distribution of each group and then plot those median distribution together.

# Set an order for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate plots.
compare_grps<-callMetrics(data_path) %>% plotMetrics(order=order)
#> setting default input_type to picard.

# Modify plots.
p1<-compare_grps$median_prop_plot +
  ylim(c(0, 0.028)) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())

p2<-compare_grps$median_cdf_plot +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme(axis.title=element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())

# Finalize plots.
suppressWarnings(
  median_grps<-ggpubr::ggarrange(p1,
                       p2,
                       label.x = 0.3,
                       ncol = 1,
                       nrow = 2
                       ))


median_grps


6 Inter-peak/valley Distance

10 bp periodical oscillations were observed in cell-free DNA fragmentation patterns, to quantify this feature, we could calculate the inter-peak and inter-valley distances.

6.1 Inter-peak distance

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-peak distances.

  inter_peak_dist<-callPeakDistance(path = data_path,  limit = c(50, 135)) %>%
  plotPeakDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#>  setting the xlim to c(7,13). 
#>  setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.


# Show the plot.
suppressWarnings(print(inter_peak_dist))

6.2 Inter-valley distance

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_valley_dist<-callValleyDistance(path = data_path,  
                                      limit = c(50, 135)) %>%
  plotValleyDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0. 
#>  setting the xlim to c(7,13). 
#>  setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_valley_dist))


7 Others

Further modification of the plots generated by cfDNAPro package could be done by using ggplot2 syntax. For example, we could highlightthe peaks and valleys in the fragmentation patterns

library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_vline(xintercept=peaks$insert_size, colour="red",linetype="dashed") +
  geom_vline(xintercept = valleys$insert_size,colour="blue")

# Show the plot.
suppressWarnings(print(exam_plot_prop))


# Label peaks and valleys with dots.
exam_plot_prop_dot<- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_point(data= peaks, 
             mapping = aes(x= insert_size, y= prop),
             color="blue",alpha=0.5,size=3) +
  geom_point(data= valleys, 
             mapping = aes(x= insert_size, y= prop),
             color="red",alpha=0.5,size=3) 
# Show the plot.
suppressWarnings(print(exam_plot_prop_dot))

7.1 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.7      ggpubr_0.4.0     ggplot2_3.3.5    scales_1.1.1    
#> [5] cfDNAPro_1.0.0   magrittr_2.0.1   BiocStyle_2.22.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.0             tidyr_1.1.4            jsonlite_1.7.2        
#>  [4] carData_3.0-4          bslib_0.3.1            assertthat_0.2.1      
#>  [7] TTR_0.24.2             highr_0.9              BiocManager_1.30.16   
#> [10] stats4_4.1.1           cellranger_1.1.0       GenomeInfoDbData_1.2.7
#> [13] Rsamtools_2.10.0       yaml_2.2.1             pillar_1.6.4          
#> [16] backports_1.2.1        lattice_0.20-45        glue_1.4.2            
#> [19] digest_0.6.28          GenomicRanges_1.46.0   ggsignif_0.6.3        
#> [22] XVector_0.34.0         colorspace_2.0-2       cowplot_1.1.1         
#> [25] htmltools_0.5.2        pkgconfig_2.0.3        broom_0.7.9           
#> [28] magick_2.7.3           haven_2.4.3            bookdown_0.24         
#> [31] zlibbioc_1.40.0        purrr_0.3.4            openxlsx_4.2.4        
#> [34] rio_0.5.27             BiocParallel_1.28.0    tibble_3.1.5          
#> [37] farver_2.1.0           generics_0.1.1         IRanges_2.28.0        
#> [40] car_3.0-11             ellipsis_0.3.2         withr_2.4.2           
#> [43] BiocGenerics_0.40.0    quantmod_0.4.18        readxl_1.3.1          
#> [46] crayon_1.4.1           evaluate_0.14          fansi_0.5.0           
#> [49] rstatix_0.7.0          forcats_0.5.1          xts_0.12.1            
#> [52] foreign_0.8-81         tools_4.1.1            data.table_1.14.2     
#> [55] hms_1.1.1              lifecycle_1.0.1        stringr_1.4.0         
#> [58] S4Vectors_0.32.0       munsell_0.5.0          zip_2.2.0             
#> [61] Biostrings_2.62.0      compiler_4.1.1         jquerylib_0.1.4       
#> [64] GenomeInfoDb_1.30.0    rlang_0.4.12           grid_4.1.1            
#> [67] RCurl_1.98-1.5         labeling_0.4.2         bitops_1.0-7          
#> [70] rmarkdown_2.11         gtable_0.3.0           abind_1.4-5           
#> [73] DBI_1.1.1              curl_4.3.2             R6_2.5.1              
#> [76] zoo_1.8-9              knitr_1.36             fastmap_1.1.0         
#> [79] utf8_1.2.2             stringi_1.7.5          parallel_4.1.1        
#> [82] Rcpp_1.0.7             vctrs_0.3.8            tidyselect_1.1.1      
#> [85] xfun_0.27