2. poly(A) tail length example

poly(A) tail length of transcripts can be measured using the PAT-Seq protocol. This protocol produces 3’-end focussed reads that include the poly(A) tail. We examine GSE83162. This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with \(\alpha\)-factor, which causes them to stop dividing in antici… pation of a chance to mate. When the \(\alpha\)-factor is washed away, they resume cycling.

Read files, extract experimental design from sample names

library(tidyverse)
library(reshape2)
library(SummarizedExperiment)
library(limma)
library(topconfects)
library(org.Sc.sgd.db)
library(weitrix)

# Produce consistent results
set.seed(12345)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
BiocParallel::register( BiocParallel::SnowParam() )

# The most reliable backed is to use serial processing
#BiocParallel::register( BiocParallel::SerialParam() )

tail <- system.file("GSE83162", "tail.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()

tail_count <- system.file("GSE83162", "tail_count.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()
    
samples <- data.frame(sample=I(colnames(tail))) %>%
    extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
    mutate(
        strain=factor(strain,unique(strain)), 
        time=factor(time,unique(time)))
rownames(samples) <- colnames(tail)

samples
##                          sample    strain  time
## WT-tpre                 WT-tpre        WT  tpre
## WT-t0m                   WT-t0m        WT   t0m
## WT-t15m                 WT-t15m        WT  t15m
## WT-t30m                 WT-t30m        WT  t30m
## WT-t45m                 WT-t45m        WT  t45m
## WT-t60m                 WT-t60m        WT  t60m
## WT-t75m                 WT-t75m        WT  t75m
## WT-t90m                 WT-t90m        WT  t90m
## WT-t105m               WT-t105m        WT t105m
## WT-t120m               WT-t120m        WT t120m
## DeltaSet1-tpre   DeltaSet1-tpre DeltaSet1  tpre
## DeltaSet1-t0m     DeltaSet1-t0m DeltaSet1   t0m
## DeltaSet1-t15m   DeltaSet1-t15m DeltaSet1  t15m
## DeltaSet1-t30m   DeltaSet1-t30m DeltaSet1  t30m
## DeltaSet1-t45m   DeltaSet1-t45m DeltaSet1  t45m
## DeltaSet1-t60m   DeltaSet1-t60m DeltaSet1  t60m
## DeltaSet1-t75m   DeltaSet1-t75m DeltaSet1  t75m
## DeltaSet1-t90m   DeltaSet1-t90m DeltaSet1  t90m
## DeltaSet1-t105m DeltaSet1-t105m DeltaSet1 t105m
## DeltaSet1-t120m DeltaSet1-t120m DeltaSet1 t120m

“tpre” is the cells in an unsynchronized state, other times are minutes after release into cycling.

The two strains are a wildtype and a strain with a mutated set1 gene.

Create weitrix object

From experience, noise scales with the length of the tail. Therefore to stabilize the variance we will be examing log2 of the tail length.

These tail lengths are each the average over many reads. We therefore weight each tail length by the number of reads. This is somewhat overoptimistic as there is biological noise that doesn’t go away with more reads, which we will correct for in the next step.

log2_tail <- log2(tail)

good <- rowMeans(tail_count) >= 10
table(good)
## good
## FALSE  TRUE 
##  1657  4875
wei <- as_weitrix(
    log2_tail[good,,drop=FALSE], 
    weights=tail_count[good,,drop=FALSE])

rowData(wei)$gene <- AnnotationDbi::select(
    org.Sc.sgd.db, keys=rownames(wei), columns=c("GENENAME"))$GENENAME
rowData(wei)$total_reads <- rowSums(weitrix_weights(wei))
colData(wei) <- cbind(colData(wei), samples)

Calibration

Our first step is to calibrate our weights. Our weights are overoptimistic for large numbers of reads, as there is a biological components of noise that does not go away with more reads.

Calibration requires a model explaining non-random effects. We provide a design matrix and a weighted linear model fit is found for each row. The lack of replicates makes life difficult, for simplicity here we will assume time and strain are independent.

design <- model.matrix(~ strain + time, data=colData(wei))

The dispersion is then calculated for each row. A spline curve is fitted to the log dispersion. Weights in each row are then scaled so as to flatten this curve.

cal_design <- weitrix_calibrate_trend(wei, design, ~splines::ns(log2(total_reads),3))

Examining the effect of calibration

Let’s unpack what happens in that weitrix_calibrate_trend step.

Consider first the dispersion if weights were uniform. Missing data is weighted 0, all other weights are 1.

There is information from the number of reads that we can remove.

Here are the dispersions using read-counts as weights.

In general it is improved, but now we have the opposite problem. There is a component of noise that does not go away with more and more reads. The red line is a trend fitted to this, which weitrix_calibrate_trend divides out of the weights.

Finally, here are the dispersion from the calibrated weitrix.

This is reasonably close to uniform, with no trend from the number of reads. limma can now estimate the variability of dispersions between genes, and apply its Emprical-Bayes squeezed dispersion based testing.

Testing

We are now ready to test things. We feed our calibrated weitrix to limma.

fit_cal_design <- cal_design %>%
    weitrix_elist() %>%
    lmFit(design)

ebayes_fit <- eBayes(fit_cal_design)
result_signif <- topTable(ebayes_fit, "strainDeltaSet1", n=Inf)

result_signif %>%
    dplyr::select(gene,diff_log2_tail=logFC,ave_log2_tail=AveExpr,
        adj.P.Val,total_reads) %>%
    head(20)
##                     gene diff_log2_tail ave_log2_tail    adj.P.Val total_reads
## YDR170W-A           <NA>     -0.3317487      4.679897 3.440790e-06        4832
## YJR027W/YJR026W     <NA>     -0.2564543      4.271790 2.287980e-05        6577
## YAR009C             <NA>     -0.4673765      3.931000 2.287980e-05        1866
## YIL015W             BAR1     -0.3653600      4.893617 6.233632e-05        3898
## YPL257W-B/YPL257W-A <NA>     -0.2558543      4.209088 1.151177e-04        5729
## YML045W/YML045W-A   <NA>     -0.2406881      4.354082 1.953731e-04        4553
## YLR157C-B/YLR157C-A <NA>     -0.2224134      4.200638 3.860716e-04        5519
## YGR027W-B/YGR027W-A <NA>     -0.2093877      4.206404 4.750243e-04        5685
## YLR035C-A           <NA>     -0.3804618      3.918629 5.945463e-04        1374
## YPR137C-B/YPR137C-A <NA>     -0.2105818      4.257820 6.441298e-04        7287
## YER138C/YER137C-A   <NA>     -0.2053578      4.187405 7.537499e-04        5652
## YML039W/YML040W     <NA>     -0.2343804      4.213771 7.928725e-04        5851
## YMR045C/YMR046C     <NA>     -0.2087868      4.266254 8.387704e-04        4579
## YJR029W/YJR028W     <NA>     -0.2062738      4.254686 1.128685e-03        5979
## YNL284C-B/YNL284C-A <NA>     -0.1899551      4.370570 1.157811e-03        4830
## YPL217C             BMS1     -0.3051718      4.086526 1.254310e-03        2071
## YPR158C-D/YPR158C-C <NA>     -0.1975358      4.227524 1.254310e-03        5381
## YIL053W             GPP1     -0.1463690      4.760486 1.480738e-03       25996
## YOR192C-B/YOR192C-A <NA>     -0.4356656      4.397721 1.480738e-03         864
## YER133W             GLC7     -0.7326781      4.084170 7.537499e-04         264

My package topconfects can be used to find top confident differential tail length. Rather than picking “most significant” genes, it will highlight genes with a large effect size.

result_confects <- limma_confects(
    fit_cal_design, "strainDeltaSet1", full=TRUE, fdr=0.05)

result_confects$table %>% 
    dplyr::select(gene,diff_log2_tail=effect,confect,total_reads) %>% 
    head(20)
##                     gene diff_log2_tail confect total_reads
## YAR009C             <NA>     -0.4673765  -0.157        1866
## YDR170W-A           <NA>     -0.3317487  -0.151        4832
## YER133W             GLC7     -0.7326781  -0.151         264
## YIL015W             BAR1     -0.3653600  -0.131        3898
## YJR027W/YJR026W     <NA>     -0.2564543  -0.107        6577
## YLR035C-A           <NA>     -0.3804618  -0.102        1374
## YCR014C             POL4     -0.4968279  -0.098         606
## YPL257W-B/YPL257W-A <NA>     -0.2558543  -0.094        5729
## YOR192C-B/YOR192C-A <NA>     -0.4356656  -0.094         864
## YML045W/YML045W-A   <NA>     -0.2406881  -0.085        4553
## YPL217C             BMS1     -0.3051718  -0.074        2071
## YLR157C-B/YLR157C-A <NA>     -0.2224134  -0.074        5519
## YGR027W-B/YGR027W-A <NA>     -0.2093877  -0.068        5685
## YML039W/YML040W     <NA>     -0.2343804  -0.068        5851
## YDR476C             <NA>     -0.3794150  -0.067        4196
## YPR137C-B/YPR137C-A <NA>     -0.2105818  -0.067        7287
## YER138C/YER137C-A   <NA>     -0.2053578  -0.063        5652
## YMR045C/YMR046C     <NA>     -0.2087868  -0.063        4579
## YJR029W/YJR028W     <NA>     -0.2062738  -0.060        5979
## YMR238W             DFG5     -0.2478286  -0.060        2502
cat(sum(!is.na(result_confects$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")
## 80 genes significantly non-zero at FDR 0.05

This lists the largest confident log2 fold changes in poly(A) tail length. The confect column is an inner confidence bound on the difference in log2 tail length, adjusted for multiple testing.

We discover some genes with less total reads, but large change in tail length.

Note that due to PCR amplification slippage and limited read length, the observed log2 poly(A) tail lengths may be an underestimate. However as all samples have been prepared in the same way, observed differences should indicate the existence of true differences.

Examine individual genes

Having discovered genes with differential tail length, let’s look at some genes in detail.

Exploratory analysis

The test we’ve performed was somewhat unsatisfactory. Due to the design of the experiment it’s difficul to specify differential tests that fully interrogate this dataset: the lack of replicates, and the difficult specifying apriori how tail length will change over time.

Perhaps we should let the data speak for itself.

Perhaps this is what we should have done first!

The weitrix package allows us to look for components of variation. We’ll try to explain the data with different numbers of components (from 1 to 10 components).

comp_seq <- weitrix_components_seq(wei, p=10)

weitrix_seq_screeplot shows how much additional variation in the data is explained as each further component is allowed. However the ultimate decision of how many components to examine is a matter of judgement.

components_seq_screeplot(comp_seq)

Looking at three components shows some of the major trends in this data-set.

comp <- comp_seq[[3]]

matrix_long(comp$col[,-1], row_info=samples, varnames=c("sample","component")) %>%
    ggplot(aes(x=time, y=value, color=strain, group=strain)) + 
    geom_hline(yintercept=0) + 
    geom_line() + 
    geom_point(alpha=0.75, size=3) + 
    facet_grid(component ~ .) +
    labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")

We observe:

The log2 tail lengths are approximated by comp$row %*% t(comp$col) where comp$col is an \(n_\text{sample} \times (p+1)\) matrix of scores (shown above), and comp$row is an \(n_\text{gene} \times (p+1)\) matrix of gene loadings, which we will now examine. (The \(+1\) is the intercept “component”, allowing each gene to have a different baseline tail length.)

cal_comp <- weitrix_calibrate_trend(wei, comp, ~splines::ns(log2(total_reads),3))
fit_comp <- cal_comp %>%
    weitrix_elist() %>%
    lmFit(comp$col)

Treat these results with caution. Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.

Gene loadings for C1: gradual lengthing over time

##           gene  loading confect total_reads
## YBL016W   FUS3 3.863779   2.210        6287
## YOR096W  RPS7A 3.284584   2.210       72660
## YDR092W  UBC13 3.294252   1.863        6171
## YLR118C   <NA> 2.670593   1.759       10599
## YDL185W   VMA1 2.498601   1.658       22258
## YJL136C RPS21B 2.217832   1.544      179361
## YLR314C   CDC3 2.167261   1.531       38826
## YGR090W  UTP22 2.550122   1.505        6535
## YJL041W   NSP1 2.445628   1.467        5463
## YMR142C RPL13B 2.121719   1.462       62128
## 1137 genes significantly non-zero at FDR 0.05

FUS3 is involved in yeast mating. We see here a poly(A) tail signature of yeast realizing there are not actually any \(\alpha\) cells around to mate with.

Gene loadings for C2: longer tails in set1 mutant

##             gene  loading confect total_reads
## YDR476C     <NA> 5.962351   2.794        4196
## YPL131W     RPL5 2.093424   1.218       81051
## YCR014C     POL4 5.844536   1.096         606
## YGR251W    NOP19 6.503427   1.041         359
## YNL178W     RPS3 1.922230   1.041       46065
## YIL015W     BAR1 2.759544   0.996        3898
## YLR340W     RPP0 1.967901   0.988       25545
## YLR035C-A   <NA> 4.144392   0.943        1374
## YDL130W    RPP1B 1.853441   0.883       68331
## YDR450W   RPS18A 1.824185   0.878       30283
## 409 genes significantly non-zero at FDR 0.05

Gene loadings for C3: cell-cycle associated changes

Given the mixture of signs for effects in C3, different genes are longer in different stages of the cell cycle. We see many genes to do with replication, and also Mating Factor A.

##             gene   loading confect total_reads
## YFL014W    HSP12 -5.624068  -2.633        9509
## YNL036W   NCE103  4.452372   2.633        5712
## YJL173C     RFA3 -3.780392  -1.908        3963
## YBL003C     HTA2 -4.415526  -1.908       13473
## YDR097C     MSH6 -5.051116  -1.806         996
## YBR009C     HHF1 -4.460064  -1.795       33533
## YEL011W     GLC3  4.417161   1.795         710
## YDR461W     MFA1  3.694856   1.604       83042
## YLR342W-A   <NA> -3.323401  -1.604       12356
## YBR092C     PHO3  3.759809   1.497        8441
## 157 genes significantly non-zero at FDR 0.05

Discussion

Looking back to our initial differential testing in light of these results, a reasonable refinement would be to omit “tpre” and “t0m”, considering only the samples that have settled into cell cycling.