APA sites can be detected using the PAT-Seq protocol. This protocol produces 3’-end focussed reads. We examine GSE83162. This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with \(\alpha\)-factor, which causes them to stop dividing in antici… pation of a chance to mate. When the \(\alpha\)-factor is washed away, they resume cycling.
Each gene has several APA sites, ordered from furthest upstrand to furthest downstrand. For each sample, we have a read count at each site.
For each gene:
We define the “shift” of a particular sample relative to all reads from all samples.
A “shift” score is first assigned to each site, being the proportion of all reads upstrand of the site minus the proportion of all reads downstrand of it (i.e. an average over all reads where upstrand reads are 1, downstrand reads are -1 and reads from the site itself are 0). The measurement for each sample is then an average over the site score for each read. The weight is the number of reads.
Shifts scores range from -1 to 1, and summarize whether upstrand (more negative) or downstrand (more positive) sites are being favoured. The weighted average score is zero.
The weights are the number of reads, however for a randomly chosen read we can estimate its variance based on the number of reads at each site and the site scores. (This estimate of variance includes any biological signal, so it’s not exactly a residual variance.) This is stored in the rowData()
of the weitrix, and can be used to further calibrate weights. We prefer to defer this sort of calibration until after we’ve discoverd components of variation, as it tends to give high weight to genes with little APA. There are clearly some alternative choices to how weighting could be performed, and we hope the weitrix package gives you basic building blocks with which you can experiment!
library(tidyverse) # ggplot2, dplyr, etc
library(patchwork) # side-by-side ggplots
library(reshape2) # melt()
library(weitrix) # Matrices with precision weights
# Produce consistent results
set.seed(12345)
# BiocParallel supports multiple backends.
# If the default hangs or errors, try others.
# The most reliable backed is to use serial processing
BiocParallel::register( BiocParallel::SerialParam() )
peaks <- system.file("GSE83162", "peaks.csv.gz", package="weitrix") %>%
read.csv(check.names=FALSE)
counts <- system.file("GSE83162", "peak_count.csv.gz", package="weitrix") %>%
read.csv(check.names=FALSE) %>%
column_to_rownames("name") %>%
as.matrix()
genes <- system.file("GSE83162", "genes.csv.gz", package="weitrix") %>%
read.csv(check.names=FALSE) %>%
column_to_rownames("name")
samples <- data.frame(sample=I(colnames(counts))) %>%
extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
mutate(
strain=factor(strain,unique(strain)),
time=factor(time,unique(time)))
rownames(samples) <- samples$sample
groups <- dplyr::select(peaks, group=gene_name, name=name)
# Note the order of this data frame is important
samples
## sample strain time
## WT-tpre WT-tpre WT tpre
## WT-t0m WT-t0m WT t0m
## WT-t15m WT-t15m WT t15m
## WT-t30m WT-t30m WT t30m
## WT-t45m WT-t45m WT t45m
## WT-t60m WT-t60m WT t60m
## WT-t75m WT-t75m WT t75m
## WT-t90m WT-t90m WT t90m
## WT-t105m WT-t105m WT t105m
## WT-t120m WT-t120m WT t120m
## DeltaSet1-tpre DeltaSet1-tpre DeltaSet1 tpre
## DeltaSet1-t0m DeltaSet1-t0m DeltaSet1 t0m
## DeltaSet1-t15m DeltaSet1-t15m DeltaSet1 t15m
## DeltaSet1-t30m DeltaSet1-t30m DeltaSet1 t30m
## DeltaSet1-t45m DeltaSet1-t45m DeltaSet1 t45m
## DeltaSet1-t60m DeltaSet1-t60m DeltaSet1 t60m
## DeltaSet1-t75m DeltaSet1-t75m DeltaSet1 t75m
## DeltaSet1-t90m DeltaSet1-t90m DeltaSet1 t90m
## DeltaSet1-t105m DeltaSet1-t105m DeltaSet1 t105m
## DeltaSet1-t120m DeltaSet1-t120m DeltaSet1 t120m
head(groups, 10)
## group name
## 1 YDR533C peak1653
## 2 YDR533C peak1652
## 3 YDR529C peak1648
## 4 YDR529C peak1647
## 5 snR84 peak1646
## 6 snR84 peak1645
## 7 snR84 peak1644
## 8 YDR524C-B peak1643
## 9 YDR524C-B peak1642
## 10 YDR500C peak1629
counts[1:10,1:5]
## WT-tpre WT-t0m WT-t15m WT-t30m WT-t45m
## peak3910 58 24 46 54 41
## peak3911 25 9 24 28 18
## peak3912 7 9 10 13 18
## peak4139 30 4 7 12 13
## peak4140 8 1 1 8 2
## peak4141 55 7 13 19 20
## peak8533 37 35 42 64 47
## peak8534 38 50 44 53 36
## peak5235 16 4 2 7 4
## peak5236 14 5 2 5 4
A “shift” weitrix is constructed based on a matrix of site-level counts, plus a data frame grouping sites into genes. The order of this data frame is important, earlier sites are considered upstrand of later sites.
wei <- counts_shift(counts, groups)
## Calculating shifts in 1 blocks
colData(wei) <- cbind(colData(wei), samples)
rowData(wei)$symbol <- genes$symbol[match(rownames(wei), rownames(genes))]
Having obtained a weitrix, everthing discussed for the poly(A) tail length example is applicable here as well. We will only perform a brief exploratory analysis here.
We can look for components of variation.
comp_seq <- weitrix_components_seq(wei, p=6, design=~0)
components_seq_screeplot(comp_seq)
Pushing a point somewhat, we examine four components.
comp <- comp_seq[[4]]
matrix_long(comp$col, row_info=samples, varnames=c("sample","component")) %>%
ggplot(aes(x=time, y=value, color=strain, group=strain)) +
geom_hline(yintercept=0) +
geom_line() +
geom_point(alpha=0.75, size=3) +
facet_grid(component ~ .) +
labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")
Weights can be calibrated. Ideally the weights should be 1 over the residual variance. We will fit a gamma GLM with log link function to the squared residuals, and use the predictions from this to produce better weights. This model uses the per_read_var
information present in the rowData, as well as applying a non-linear transformation to the existing weights.
cal <- weitrix_calibrate_all(
wei,
design = comp,
trend_formula = ~log(per_read_var)+well_knotted_spline(log2(weight),3))
## log2(weight) range 0.00000 19.70383 knots 4.863982 7.206426 10.386132
metadata(cal)$weitrix$all_coef
## (Intercept) log(per_read_var)
## 0.2096519 1.1853604
## well_knotted_spline(log2(weight), 3)1 well_knotted_spline(log2(weight), 3)2
## -4.6927533 -5.2339368
## well_knotted_spline(log2(weight), 3)3 well_knotted_spline(log2(weight), 3)4
## -9.2618654 -6.3998552
A mean-variance trend is visible in the uncalibrated weighted residuals. Calibration has removed this, by making use of the information in per_read_var
.
(weitrix_calplot(wei, comp, covar=mu) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=mu) + labs(title="After"))
We can look at the calibration to per_read_var
directly.
(weitrix_calplot(wei, comp, covar=per_read_var) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=per_read_var) + labs(title="After"))
Weights were too large for large read counts. Calibration has applied a non-linear transformation to the weights that fixes this.
(weitrix_calplot(wei, comp, covar=log(weitrix_weights(wei))) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=log(weitrix_weights(wei))) + labs(title="After"))
Another way of looking at this is with a “funnel” plot of 1/sqrt(weight) vs residuals. This should make a cone shape. Note how before calibration the pointy end of the cone is too wide.
(weitrix_calplot(wei, comp, funnel=TRUE, guides=FALSE) + labs(title="Before")) +
(weitrix_calplot(cal, comp, funnel=TRUE, guides=FALSE) + labs(title="After"))
We will estimate confident loadings for the different components using weitrix_confects
.
Treat these results with caution. Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.
weitrix_confects(cal, comp$col, "C1", fdr=0.05)
## $table
## confect effect se df fdr_zero row_mean typical_obs_err name
## 1 2.577 4.085 0.3277 34.28 7.053e-12 -0.0045737 0.03725 YLR058C
## 2 2.260 3.279 0.2335 34.28 3.964e-13 0.0005286 0.02559 YPR080W
## 3 1.953 2.756 0.1898 34.28 1.979e-13 -0.0102193 0.02056 YKL096W-A
## 4 1.894 2.613 0.1741 34.28 1.120e-13 -0.0038155 0.01828 YBL072C
## 5 1.842 3.216 0.3390 34.28 6.698e-09 -0.0050493 0.03715 YBR127C
## 6 1.819 2.649 0.2080 34.28 4.504e-12 -0.0064119 0.02183 YML026C
## 7 1.511 3.844 0.5930 34.28 1.527e-05 -0.0035558 0.06223 YDR342C
## 8 1.440 2.345 0.2328 34.28 1.765e-09 -0.0127900 0.02361 YKL180W
## 9 1.408 1.979 0.1482 34.28 1.403e-12 -0.0062268 0.01556 YDR524C-B
## 10 1.364 2.306 0.2473 34.28 9.010e-09 -0.0077192 0.02540 YBR249C
## per_read_var total_reads symbol
## 1 0.18682 6958 SHM2
## 2 0.20630 72240 TEF1
## 3 0.08003 374176 CWP2
## 4 0.12048 47370 RPS8A
## 5 0.19376 21804 VMA2
## 6 0.11684 42444 RPS18B
## 7 0.16038 2866 HXT7
## 8 0.17678 57320 RPL17A
## 9 0.12464 164069 <NA>
## 10 0.16609 15246 ARO4
## ...
## 122 of 1805 non-zero contrast at FDR 0.05
## Prior df 18.3
weitrix_confects(cal, comp$col, "C2", fdr=0.05)
## $table
## confect effect se df fdr_zero row_mean typical_obs_err name
## 1 5.358 6.729 0.2979 34.28 3.288e-19 0.033443 0.02498 YLL024C
## 2 4.849 6.310 0.3345 34.28 6.726e-17 0.025714 0.02959 YLR333C
## 3 4.207 4.856 0.1536 34.28 1.090e-23 0.026569 0.01405 YLR110C
## 4 3.357 5.008 0.3999 34.28 8.647e-12 0.020402 0.03612 YKL081W
## 5 3.314 5.471 0.5326 34.28 1.366e-09 0.011001 0.04921 YBR135W
## 6 2.580 4.558 0.4958 34.28 1.461e-08 0.030212 0.04313 YDR363W-A
## 7 2.339 4.600 0.5748 34.28 2.509e-07 -0.001391 0.05246 YNL098C
## 8 2.330 3.769 0.3700 34.28 1.494e-09 0.008160 0.03435 YNL067W
## 9 2.328 5.567 0.8422 34.28 7.606e-06 0.008116 0.07560 YBR269C
## 10 2.312 4.006 0.4447 34.28 2.028e-08 0.014319 0.03827 YJL111W
## per_read_var total_reads symbol
## 1 0.1960 133748 SSA2
## 2 0.2382 46413 RPS25B
## 3 0.2070 712909 CCW12
## 4 0.2490 104942 TEF4
## 5 0.2461 3619 CKS1
## 6 0.1454 21193 SEM1
## 7 0.2405 7185 RAS2
## 8 0.2809 30030 RPL9B
## 9 0.2497 2289 SDH8
## 10 0.2001 14445 CCT7
## ...
## 165 of 1805 non-zero contrast at FDR 0.05
## Prior df 18.3
weitrix_confects(cal, comp$col, "C3", fdr=0.05)
## $table
## confect effect se df fdr_zero row_mean typical_obs_err name
## 1 1.294 2.684 0.3139 34.28 9.246e-07 0.0135820 0.03458 YBR149W
## 2 1.294 3.147 0.4244 34.28 1.150e-05 -0.0180639 0.04727 YDR077W
## 3 1.238 3.985 0.6589 34.28 2.581e-04 0.0018156 0.07307 YFR052C-A
## 4 1.238 4.218 0.7219 34.28 2.676e-04 0.0083721 0.07333 snR40
## 5 -1.199 -3.766 0.6382 34.28 2.581e-04 -0.0042842 0.06997 YMR194C-B
## 6 1.199 3.130 0.4842 34.28 1.253e-04 -0.0013910 0.05246 YNL098C
## 7 -0.602 -3.637 0.7712 34.28 3.732e-03 -0.0156248 0.08880 YKL163W
## 8 -0.602 -2.536 0.5011 34.28 1.955e-03 0.0004136 0.05325 YNL130C
## 9 0.602 3.623 0.7848 34.28 4.332e-03 -0.0063790 0.08472 YGR243W
## 10 -0.597 -1.685 0.2859 34.28 2.581e-04 -0.0085628 0.03087 YLR372W
## per_read_var total_reads symbol
## 1 0.19814 9498 ARA1
## 2 0.07946 66406 SED1
## 3 0.14798 4967 <NA>
## 4 0.26488 3100 <NA>
## 5 0.23748 2270 CMC4
## 6 0.24047 7185 RAS2
## 7 0.13916 808 PIR3
## 8 0.23029 4144 CPT1
## 9 0.23448 1172 MPC3
## 10 0.20219 20547 ELO3
## ...
## 112 of 1805 non-zero contrast at FDR 0.05
## Prior df 18.3
weitrix_confects(cal, comp$col, "C4", fdr=0.05)
## $table
## confect effect se df fdr_zero row_mean typical_obs_err name
## 1 3.903 6.544 0.5791 34.28 7.525e-10 0.019197 0.06689 RDN5-1
## 2 -3.903 -6.838 0.6722 34.28 4.128e-09 0.022994 0.07180 YIL015W
## 3 1.392 4.512 0.7379 34.28 1.337e-04 0.011480 0.08226 tK(CUU)M
## 4 1.290 4.389 0.7507 34.28 2.170e-04 -0.001030 0.08276 YLR286C
## 5 1.162 3.315 0.5314 34.28 1.054e-04 -0.006542 0.06114 tK(CUU)J
## 6 1.099 4.711 0.9164 34.28 7.672e-04 0.004438 0.09942 tP(UGG)O3
## 7 1.099 2.914 0.4612 34.28 1.054e-04 0.002452 0.05017 YJL054W
## 8 -1.099 -2.335 0.3187 34.28 7.446e-06 0.001259 0.03542 YLR027C
## 9 1.099 3.116 0.5243 34.28 1.987e-04 0.018257 0.05455 YCR009C
## 10 -0.978 -2.787 0.4749 34.28 2.170e-04 -0.001391 0.05246 YNL098C
## per_read_var total_reads symbol
## 1 0.2395 4029 <NA>
## 2 0.2090 4313 BAR1
## 3 0.2882 2843 <NA>
## 4 0.2424 1548 CTS1
## 5 0.0900 1200 <NA>
## 6 0.2500 892 <NA>
## 7 0.1756 1918 TIM54
## 8 0.2498 12186 AAT2
## 9 0.2292 5560 RVS161
## 10 0.2405 7185 RAS2
## ...
## 172 of 1805 non-zero contrast at FDR 0.05
## Prior df 18.3
Instead of looking for genes following some particular pattern, we can look for genes that simply have surprisingly high variability with weitrix_sd_confects
.
weitrix_sd_confects(cal, step=0.01)
## $table
## confect effect row_mean typical_obs_err dispersion n_present df fdr_zero
## 1 0.18 0.2051 0.0204016 0.02667 60.117 20 19 0.00e+00
## 2 0.15 0.1793 0.0005286 0.02552 50.382 20 19 0.00e+00
## 3 0.15 0.1828 0.0257140 0.03040 37.148 20 19 0.00e+00
## 4 0.15 0.1976 0.0229944 0.05194 15.470 20 19 0.00e+00
## 5 0.15 0.1874 -0.0045737 0.03962 23.387 20 19 0.00e+00
## 6 0.14 0.1892 0.0191967 0.05425 13.164 20 19 0.00e+00
## 7 0.14 0.1700 -0.0050493 0.03110 30.879 20 19 0.00e+00
## 8 0.13 0.2445 0.0129602 0.12069 5.105 19 18 1.07e-10
## 9 0.13 0.1901 0.0081160 0.06682 9.095 20 19 0.00e+00
## 10 0.13 0.1488 0.0334434 0.02250 44.720 20 19 0.00e+00
## name per_read_var total_reads symbol
## 1 YKL081W 0.2490 104942 TEF4
## 2 YPR080W 0.2063 72240 TEF1
## 3 YLR333C 0.2382 46413 RPS25B
## 4 YIL015W 0.2090 4313 BAR1
## 5 YLR058C 0.1868 6958 SHM2
## 6 RDN5-1 0.2395 4029 <NA>
## 7 YBR127C 0.1938 21804 VMA2
## 8 tP(UGG)N1 0.2328 507 <NA>
## 9 YBR269C 0.2497 2289 SDH8
## 10 YLL024C 0.1960 133748 SSA2
## ...
## 470 of 1805 non-zero excess standard deviation at FDR 0.05
Let’s examine peak-level read counts for some genes we’ve identified.
examine <- function(gene_wanted, title) {
peak_names <- filter(peaks, gene_name==gene_wanted)$name
counts[peak_names,] %>% melt(value.name="reads", varnames=c("peak","sample")) %>%
left_join(samples, by="sample") %>%
ggplot(aes(x=factor(as.integer(peak)), y=reads)) +
facet_grid(strain ~ time) + geom_col() +
labs(x="Peak",y="Reads",title=title)
}
examine("YLR058C", "SHM2, from C1")
examine("YLR333C", "RPS25B, from C2")
examine("YDR077W", "SED1, from C3")
examine("YIL015W", "BAR1, from C4")
examine("tK(CUU)M", "tK(CUU)M, from C4")
examine("YKL081W", "TEF4, from weitrix_sd_confects")
examine("YPR080W", "TEF1, from weitrix_sd_confects")
For larger datasets, weitrix_calibrate_all
may use a lot of memory. The weitrix packages also has an older version of calibration, weitrix_calibrate_trend
. Rather than operating on weights individually, it applies a scaling to the weights of each row.
cal_trend <- weitrix_calibrate_trend(
wei,
design = comp,
trend_formula = ~log(per_read_var)+well_knotted_spline(log(total_reads),3))
## log(total_reads) range 5.214936 16.049079 knots 6.749379 8.203755 10.367033
A mean-variance trend is visible in the uncalibrated weighted residuals. Calibration removes this, by making use of the information in per_read_var
.
(weitrix_calplot(wei, comp, covar=mu) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=mu) + labs(title="After"))
We can look at the calibration to per_read_var
directly.
(weitrix_calplot(wei, comp, covar=per_read_var) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=per_read_var) + labs(title="After"))
Weights were too large for large read counts. Calibration has applied a row-level adjustment for this based on the total number of reads for each row.
(weitrix_calplot(wei, comp, covar=log(weitrix_weights(wei))) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=log(weitrix_weights(wei))) + labs(title="After"))