Using the target package

Mahmoud Ahmed

2019-10-29

Overview

In this document, we describe the BETA algorithm for predicting associated peaks from binding ChIP data and integrating binding data and expression data to predict direct binding target regions. In addition, we describe the implementation of the algorithm in an R package, target. Finally, we provide an example for using target to predict associated peaks and direct gene targets of androgen receptors in the LNCap cell line.

The theory

The BETA algorithm in its simplest form, minus, is composed of three steps:

  1. Selecting the peaks (\(p\)) within a certain range from the regions of interest (\(g\)).
  2. Calculate the distance (\(\Delta\)) between the center of the peak and each of the regions expressed relative to a distance of 100 kb.
  3. Calculate a the peak scores (\(S_p\)) as the transformed exponential of the distance, \(\Delta\), as follows;

\[ S_p = e^{-(0.5+4\Delta)} \]

  1. Calculate the region/gene regulatory potential (\(S_g\)) as the sum of the scores, \(S_p\), as follows:

\[ S_g = \sum_{i=1}^k S_{pi} \] where \(p\) is \(\{1,...,k\}\) peaks near the region of interest.

In addition, in BETA basic another step is added to predict real region/gene targets

  1. Rank all regions based on their regulatory potential, \(S_g\), to give their binding potential (\(R_{gb}\)) and based on their differential expression (\(R_{ge}\)). The product of the two ranks predicts real region/gene targets.

\[ RP_g = \frac{R_{gb}\times R_{ge}}{n^2} \]

where \(n\) is the number of regions \(g\).

Python implementation

The original paper on this work presented an implementation of the algorithm in python which can be invoked form the command-line interface (CLI).

R implementation

The target package implement the BETA algorithm in several low-level functions that correspond to the previously described steps.

  1. merge_ranges: select the peaks in the genomic regions of interest, e.g. genes.
  2. find_distance: calculate the distance between the peaks and the regions of interest, e.g. transcription start sites (TSS).
  3. score_peaks: calculate a regulatory score for each peak in relation to a region of interest.
  4. score_regions: Calculate a regulatory score for regions of interest/genes
  5. rank_product: rank the regions of interest/genes based on the regulatory potential and another statistics, e.g. differential expression.

In addition, two high-level functions can be used to apply these functions sequentially and obtain only the final output. These are:

  1. associated_peaks: select and calculate a regulatory potential for peaks within a defined distance from regions of interest/genes.
  2. direct_targets: predict direct target regions among regions of interest/genes based on the regulatory potential of the peaks in the region and one more statistics such as differential expression.

Finally, two additional functions plot_predictions and test_predictions were added to visually and statistically examine the predictions made by target.

Example

The example below was presented in this paper. The dataset used in the example is from another published study. The study used the LNCap cell line to determine the androgen receptor (AR) binding sites using ChIP-on-chip and the gene expression in the cell line after treatment with physiological androgen 5α-dihydrotestosterone (DHT) for 16 hours using microarrays. The binding sites of AR are recorded in a bed file, 3656_peaks.bed. The differential gene expression results are recorded in AR_diff_expr.xls. The reference genome hg19 was used to define the gene coordinates and identifiers, hg19.refseq.

# load reguired libraries
library(target)
library(GenomicRanges)

Each of the three following chuncks is reading one of the required input data and transforming it into the appropriate format. The test data of the python package is shipped with the R target package for testing purposes. Two datasets real_peaks and real_transcripts are the two GRanges object that holds the identified peaks and the differential expression results respectively.

# load peaks and transcripts data
data("real_peaks")
data("real_transcripts")

The two high-level functions mentioned above can be called directly into the objects. associated_peaks takes as arguments two GRanges objects; peaks and regions. In this case, the two inputs are the peaks and transcripts which we prepared earlier. The output of this function is a GRanges, the same as the input peaks, with three additional metadata columns: assigned_region, distance and peak_score.

# get associated peaks
ap <- associated_peaks(real_peaks, real_transcripts, 'ID')
ap
#> GRanges object with 18877 ranges and 5 metadata columns:
#>           seqnames              ranges strand |     peak_name      pval
#>              <Rle>           <IRanges>  <Rle> |   <character> <numeric>
#>       [1]     chr1     1208689-1209509      * |    AR_LNCaP_2     51.58
#>       [2]     chr1     1208689-1209509      * |    AR_LNCaP_2     51.58
#>       [3]     chr1     1208689-1209509      * |    AR_LNCaP_2     51.58
#>       [4]     chr1     1208689-1209509      * |    AR_LNCaP_2     51.58
#>       [5]     chr1     1208689-1209509      * |    AR_LNCaP_2     51.58
#>       ...      ...                 ...    ... .           ...       ...
#>   [18873]     chrX 153362757-153363593      * | AR_LNCaP_7151     51.71
#>   [18874]     chrY   21706080-21707252      * | AR_LNCaP_7174     74.36
#>   [18875]     chrY   21706080-21707252      * | AR_LNCaP_7174     74.36
#>   [18876]     chrY   21706080-21707252      * | AR_LNCaP_7174     74.36
#>   [18877]     chrY   21706080-21707252      * | AR_LNCaP_7174     74.36
#>           assigned_region  distance         peak_score
#>               <character> <numeric>          <numeric>
#>       [1]       NM_030649    -34171  0.154611462906276
#>       [2]       NM_080605     41471  0.115458975799219
#>       [3]       NM_032348    -85075 0.0201812767090482
#>       [4]    NM_001130413     -6715  0.463661740571496
#>       [5]       NR_037668     -6715  0.463661740571496
#>       ...             ...       ...                ...
#>   [18873]    NM_001025243     77833 0.0269621836580386
#>   [18874]       NR_045128    -22576  0.245848447492684
#>   [18875]       NR_045129    -22576  0.245848447492684
#>   [18876]       NR_002923     41626  0.114745344691694
#>   [18877]       NR_033732     41626  0.114745344691694
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

direct_targets also takes as arguments two GRanges objects; peaks and regions. Two other arguments are required when the user desires to rank the target genes based on both the regulatory potential and the differential expression statistics. The arguments are regions_col and stats_col, these should be strings for the columns names in the metadata of the regions object for the gene names/symbols and the chosen statistics to rank the genes. The output of this function is a GRanges, the same as the input regions, with four additional metadata columns: score, stat, score_rank, stat_rank and rank. These correspond to the regulatory potential gene score, the chosen statistics, the rank of each and the final rank product. The values in the rank column are the product of the two ranks, the less the value the more likely a region/gene is a direct target. The direction of the regulation can be infered from the sign of the stat column.

# get direct targets
dt <- direct_targets(real_peaks, real_transcripts, 'ID', 't')
dt
#> GRanges object with 12955 ranges and 13 metadata columns:
#>         seqnames              ranges strand |          ID        logFC
#>            <Rle>           <IRanges>  <Rle> | <character>    <numeric>
#>       1    chr17     7023149-7223148      + |   NM_000018   0.04098785
#>       2    chr14   73503142-73703141      + |   NM_000021   0.11997125
#>       3    chr17   48143365-48343364      + |   NM_000023  0.137154141
#>       4     chr5 148106155-148306154      + |   NM_000024  0.599489816
#>       5    chr22   40642503-40842502      + |   NM_000026 -0.141894236
#>     ...      ...                 ...    ... .         ...          ...
#>   18869    chr19   39781836-39981835      - |   NR_046384 -0.140862532
#>   18871     chr9     6313150-6513149      + |   NR_046386  0.406689448
#>   18872     chr3 174733033-174933032      - |   NR_046390 -0.017411643
#>   18875    chr13 106259178-106459177      + |   NR_046391  0.076066568
#>   18877     chr6   37221747-37421746      + |   NR_046399 -0.068042186
#>             AveExpr            t     P.Value   adj.P.Val            B
#>           <numeric>    <numeric>   <numeric>   <numeric>    <numeric>
#>       1 11.01543577   0.74662859 0.474686287 0.739166575 -6.811489352
#>       2 8.391354991  2.343827872 0.044248549 0.207403547 -4.752522233
#>       3  5.12358028  1.795674187 0.106737952 0.341644528  -5.59209081
#>       4 8.445583801  9.238179467    7.83e-06 0.000798244  4.225684745
#>       5 9.815674229  -1.80045682 0.105937441 0.340460418 -5.585144983
#>     ...         ...          ...         ...         ...          ...
#>   18869 7.613652479 -1.838306396 0.099796932 0.328910692 -5.529881611
#>   18871 9.484553434   7.44883057    4.31e-05 0.002473107  2.453780658
#>   18872 3.483938143 -0.276204256 0.788743317 0.911230149 -7.068339788
#>   18875 3.502435237  1.573865232 0.150598988 0.412617316 -5.903996646
#>   18877 6.665619911 -1.211998373 0.256928474 0.545349185 -6.359345219
#>                name2              score score_rank         stat stat_rank
#>          <character>          <numeric>  <integer>    <numeric> <integer>
#>       1       ACADVL  0.148900050520613       6250   0.74662859      8615
#>       2        PSEN1  0.583249631433125       1396  2.343827872      3203
#>       3         SGCA  0.324542104282234       3646  1.795674187      4438
#>       4        ADRB2  0.323984371650432       3655  9.238179467       222
#>       5         ADSL  0.320544801207791       3697  -1.80045682      4426
#>     ...          ...                ...        ...          ...       ...
#>   18869         PAF1  0.643346335599329        745 -1.838306396      4307
#>   18871        UHRF2  0.598074470605471       1182   7.44883057       364
#>   18872 NAALADL2-AS3   0.61149419171844        896 -0.276204256     11329
#>   18875    LINC00343  0.618628434101177        863  1.573865232      5154
#>   18877         RNF8 0.0158377333409101      12077 -1.211998373      6507
#>                        rank
#>                   <numeric>
#>       1   0.320819283447244
#>       2  0.0266420428401552
#>       3  0.0964115638835913
#>       4 0.00483465536449316
#>       5   0.097495826556344
#>     ...                 ...
#>   18869  0.0191186098124002
#>   18871 0.00256356318169908
#>   18872  0.0604818061392038
#>   18875  0.0265021053043959
#>   18877   0.468236255863564
#>   -------
#>   seqinfo: 51 sequences from an unspecified genome; no seqlengths

The following code shows the relation between the peak distance and the peak score (left), the genes t-statitics and the gene regulatory potentials (middle), and the emperical cumlative distribution function (ECDF) of the regulatory potential ranks of the up, down and non-regulated genes (right) .

par(mfrow = c(1, 3))

# show peak distance vs score
plot(ap$distance, ap$peak_score,
     pch = 19, cex = .5, 
     xlab = 'Peak Distance', ylab = 'Peak Score')
abline(v = 0, lty = 2, col = 'gray')

# show gene stat vs score
plot(dt$stat, dt$score,
     pch = 19, cex = .5, 
     xlim = c(-35, 35),
     xlab = 'Gene t-stats', ylab = 'Gene Score')
abline(v = 0, lty = 2, col = 'gray')

# show gene regulatory potential ecdf
groups <- c('Down', 'None', 'Up')
colors <- c('darkgreen', 'gray', 'darkred')

fold_change <- cut(dt$logFC,
                   breaks = c(min(dt$logFC), -.5, .5, max(dt$logFC)),
                   labels = groups)

plot_predictions(dt$score_rank,
                 fold_change,
                 colors,
                 groups,
                 xlab = 'Gene Regulatory Potential Rank',
                 ylab = 'ECDF')

The graph shows that more of the up-regulated transcripts are ranking higher than the down- and none-regulated genes. We can test whether the distribution function of the two regulated group are drawn from the same distribution of the none-regulated transcripts.

# test up-regulated transcripts are not random
test_predictions(dt$score_rank,
                 group = fold_change,
                 compare = c('Up', 'None'),
                 alternative = 'greater')
#> 
#>  Two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D^+ = 0.39159, p-value < 2.2e-16
#> alternative hypothesis: the CDF of x lies above that of y

# test down-regulated transcripts are not random
test_predictions(dt$score_rank,
                 group = fold_change,
                 compare = c('Down', 'None'),
                 alternative = 'greater')
#> 
#>  Two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D^+ = 0.12398, p-value = 0.01296
#> alternative hypothesis: the CDF of x lies above that of y

The names of the top regulated transcript by rank, gene name and its associated peaks.

# show the top regulated transcript, gene name and its associated peaks
top_trans <- unique(dt$ID[dt$rank == min(dt$rank)])
top_trans
#> [1] "NR_045762"

unique(dt$name2[dt$ID == top_trans])
#> [1] "KLK2"
unique(ap$peak_name[ap$assigned_region == top_trans])
#> [1] "AR_LNCaP_4914" "AR_LNCaP_4915" "AR_LNCaP_4916"

Advantages of the R implementation

The target package implements the BETA algorithm for detecting the associated peaks of DNA-binding proteins or histone markers from ChIP data. In addition, when genetic or chemical perturbation data is provided the algorithm can predict direct target regions of the protein or the marker by integrating the binding and the expression data. The implementation of the algorithm in R provide a few advantages:

References

Wang S, Sun H, Ma J, et al. Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nat Protoc. 2013;8(12):2502–2515. doi:10.1038/nprot.2013.150

Wang Q, Li W, Liu XS, et al. A hierarchical network of transcription factors governs androgen receptor-dependent prostate cancer growth. Mol Cell. 2007;27(3):380–392. doi:10.1016/j.molcel.2007.05.041

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0  IRanges_2.20.0      
#> [4] S4Vectors_0.24.0     BiocGenerics_0.32.0  target_1.0.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2             knitr_1.25             XVector_0.26.0        
#>  [4] magrittr_1.5           zlibbioc_1.32.0        xtable_1.8-4          
#>  [7] R6_2.4.0               rlang_0.4.1            fastmap_1.0.1         
#> [10] stringr_1.4.0          tools_3.6.1            xfun_0.10             
#> [13] htmltools_0.4.0        matrixStats_0.55.0     yaml_2.2.0            
#> [16] digest_0.6.22          GenomeInfoDbData_1.2.2 shiny_1.4.0           
#> [19] later_1.0.0            promises_1.1.0         bitops_1.0-6          
#> [22] RCurl_1.95-4.12        mime_0.7               evaluate_0.14         
#> [25] rmarkdown_1.16         stringi_1.4.3          compiler_3.6.1        
#> [28] httpuv_1.5.2