Function | Description |
---|---|
aggregate_duplicates |
Aggregate robustly abundance and annotation of duplicated transcripts |
scale_abundance |
Scale (normalise) abundance for RNA requencing depth |
reduce_dimensions |
Perform dimensionality reduction (PCA, MDS, tSNE) |
cluster_elements |
Labels elements with cluster identity (kmeans, SNN) |
remove_redundancy |
Filter out elements with highly correlated features |
adjust_abundance |
Remove known unwanted variation (Combat) |
test_differential_abundance |
Differenatial abundance testing (DE) |
deconvolve_cellularity |
Estimated tissue composition (Cibersort) |
keep_variable |
Filter top variable features |
keep_abundant |
Filter out lowly abundant transcripts |
test_gene_enrichment |
Gene enrichment analyses (EGSEA) |
test_gene_overrepresentation |
Gene enrichment on list of transcript names (no rank) |
impute_abundance |
Impute abundance for missing data points using sample groupings |
Utilities | Description |
---|---|
tidybulk |
add tidybulk attributes to a tibble object |
tidybulk_SAM_BAM |
Convert SAM BAM files into tidybulk tibble |
pivot_sample |
Select sample-wise columns/information |
pivot_transcript |
Select transcript-wise columns/information |
rotate_dimensions |
Rotate two dimensions of a degree |
ensembl_to_symbol |
Add gene symbol from ensembl IDs |
symbol_to_entrez |
Add entrez ID from gene symbol |
sample | transcript | abundance | annotation |
---|---|---|---|
chr or fctr |
chr or fctr |
integer |
… |
sample | transcript | abundance | annotation | new information |
---|---|---|---|---|
chr or fctr |
chr or fctr |
integer |
… | … |
tidybulk
tibble. It memorises key column namestt = counts_mini %>% tidybulk(sample, transcript, count)
All tidybulk methods are directly compatible with SummarizedExperiment as well.
transcripts
tidybulk provide the aggregate_duplicates
function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. aggregate_duplicates
takes a tibble and column names (as symbols; for sample
, transcript
and count
) as arguments and returns a tibble with aggregate transcript with the same name. All the rest of the column are appended, and factors and boolean are appended as characters.
tt.aggr = tt %>% aggregate_duplicates( aggregation_function = sum )
tt.aggr
## # A tibble: 2,635 x 7
## sample transcript `Cell type` count time condition `merged transcripts`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 SRR1740034 TNFRSF4 b_cell 6 0 d TRUE 1
## 2 SRR1740034 PLCH2 b_cell 926 0 d TRUE 1
## 3 SRR1740034 PADI4 b_cell 21 0 d TRUE 1
## 4 SRR1740034 PAX7 b_cell 0 0 d TRUE 1
## 5 SRR1740034 CDA b_cell 1 0 d TRUE 1
## 6 SRR1740034 RCAN3 b_cell 905 0 d TRUE 1
## 7 SRR1740034 SMPDL3B b_cell 3 0 d TRUE 1
## 8 SRR1740034 EPB41 b_cell 4667 0 d TRUE 1
## 9 SRR1740034 LCK b_cell 436 0 d TRUE 1
## 10 SRR1740034 COL8A2 b_cell 1 0 d TRUE 1
## # … with 2,625 more rows
All functions are also directly compatible with SummarizedExperiment
.
se.aggr = se_mini %>% aggregate_duplicates( aggregation_function = sum )
se.aggr
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(1): count
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(0):
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(4): Cell.type time condition merged.transcripts
counts
We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). scale_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a method as arguments and returns a tibble with additional columns with scaled data as <NAME OF COUNT COLUMN>_scaled
.
tt.norm = tt.aggr %>% scale_abundance(method="TMM")
tt.norm %>% select(`count`, count_scaled, lowly_abundant, everything())
## # A tibble: 2,635 x 11
## count count_scaled lowly_abundant sample transcript `Cell type` time
## <dbl> <dbl> <lgl> <chr> <chr> <chr> <chr>
## 1 1035 1900. TRUE SRR17… ABCB4 b_cell 0 d
## 2 45 82.6 TRUE SRR17… ABCB9 b_cell 0 d
## 3 7151 13129. FALSE SRR17… ACAP1 b_cell 0 d
## 4 2 3.67 TRUE SRR17… ACHE b_cell 0 d
## 5 2278 4182. FALSE SRR17… ACP5 b_cell 0 d
## 6 11156 20482. TRUE SRR17… ADAM28 b_cell 0 d
## 7 72 132. TRUE SRR17… ADAMDEC1 b_cell 0 d
## 8 0 0 TRUE SRR17… ADAMTS3 b_cell 0 d
## 9 298 547. FALSE SRR17… ADRB2 b_cell 0 d
## 10 8 14.7 TRUE SRR17… AIF1 b_cell 0 d
## # … with 2,625 more rows, and 4 more variables: condition <chr>, `merged
## # transcripts` <dbl>, TMM <dbl>, multiplier <dbl>
We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.
tt.norm %>%
ggplot(aes(count_scaled + 1, group=sample, color=`Cell type`)) +
geom_density() +
scale_x_log10() +
my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm = se.aggr %>% scale_abundance(method="TMM")
se.norm
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(6): Cell.type time ... TMM multiplier
variable transcripts
We may want to identify and filter variable transcripts.
tt.norm.variable = tt.norm %>% keep_variable()
## Getting the 500 most variable genes
dimensions
We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. reduce_dimensions
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.
MDS (Robinson et al., 10.1093/bioinformatics/btp616)
tt.norm.MDS = tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3)
tt.norm.MDS %>% select(sample, contains("Dim"), `Cell type`, time ) %>% distinct()
## # A tibble: 5 x 6
## sample Dim1 Dim2 Dim3 `Cell type` time
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 SRR1740034 -0.882 -0.708 -0.0291 b_cell 0 d
## 2 SRR1740035 -0.870 -0.715 0.0000434 b_cell 1 d
## 3 SRR1740043 1.40 0.0479 -0.471 monocyte 1 d
## 4 SRR1740058 -0.935 1.39 0.00575 t_cell 0 d
## 5 SRR1740067 1.29 -0.0121 0.494 dendritic_myeloid 1 d
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
tt.norm.MDS %>%
select(contains("Dim"), sample, `Cell type`) %>%
distinct() %>%
GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`))
All functions are also directly compatible with SummarizedExperiment
.
se.norm.MDS = se.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3)
se.norm.MDS
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(9): Cell.type time ... Dim2 Dim3
PCA
tt.norm.PCA = tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" , .dims = 3)
## Getting the 500 most variable genes
## Fraction of variance explained by the selected principal components
## # A tibble: 3 x 2
## `Fraction of variance` PC
## <dbl> <int>
## 1 0.586 1
## 2 0.250 2
## 3 0.137 3
tt.norm.PCA %>% select(sample, contains("PC"), `Cell type`, time ) %>% distinct()
## # A tibble: 5 x 6
## sample PC1 PC2 PC3 `Cell type` time
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 SRR1740034 -0.505 -0.373 0.324 b_cell 0 d
## 2 SRR1740035 -0.506 -0.369 0.322 b_cell 1 d
## 3 SRR1740043 -0.409 0.600 -0.0473 monocyte 1 d
## 4 SRR1740058 -0.343 -0.306 -0.888 t_cell 0 d
## 5 SRR1740067 -0.451 0.521 -0.00632 dendritic_myeloid 1 d
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
tt.norm.PCA %>%
select(contains("PC"), sample, `Cell type`) %>%
distinct() %>%
GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`))
All functions are also directly compatible with SummarizedExperiment
.
se.norm.PCA = se.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" , .dims = 3)
## Getting the 500 most variable genes
## Fraction of variance explained by the selected principal components
## # A tibble: 3 x 2
## `Fraction of variance` PC
## <dbl> <int>
## 1 0.586 1
## 2 0.250 2
## 3 0.137 3
se.norm.PCA
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(9): Cell.type time ... PC2 PC3
tSNE
tt.norm.tSNE =
tt_tcga_breast %>%
reduce_dimensions(
.abundance = count_scaled,
method = "tSNE",
top = 500,
perplexity=10,
pca_scale =TRUE
)
## Getting the 500 most variable genes
## Performing PCA
## Read the 251 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.183394)!
## Learning embedding...
## Iteration 50: error is 67.798772 (50 iterations in 0.03 seconds)
## Iteration 100: error is 66.523430 (50 iterations in 0.04 seconds)
## Iteration 150: error is 69.019506 (50 iterations in 0.04 seconds)
## Iteration 200: error is 69.615528 (50 iterations in 0.04 seconds)
## Iteration 250: error is 66.768841 (50 iterations in 0.04 seconds)
## Iteration 300: error is 1.590393 (50 iterations in 0.03 seconds)
## Iteration 350: error is 1.277936 (50 iterations in 0.03 seconds)
## Iteration 400: error is 1.207074 (50 iterations in 0.03 seconds)
## Iteration 450: error is 1.176613 (50 iterations in 0.03 seconds)
## Iteration 500: error is 1.165104 (50 iterations in 0.03 seconds)
## Iteration 550: error is 1.146544 (50 iterations in 0.03 seconds)
## Iteration 600: error is 1.126579 (50 iterations in 0.03 seconds)
## Iteration 650: error is 1.116458 (50 iterations in 0.03 seconds)
## Iteration 700: error is 1.110235 (50 iterations in 0.03 seconds)
## Iteration 750: error is 1.108496 (50 iterations in 0.03 seconds)
## Iteration 800: error is 1.106832 (50 iterations in 0.03 seconds)
## Iteration 850: error is 1.104988 (50 iterations in 0.03 seconds)
## Iteration 900: error is 1.103719 (50 iterations in 0.03 seconds)
## Iteration 950: error is 1.101457 (50 iterations in 0.03 seconds)
## Iteration 1000: error is 1.101111 (50 iterations in 0.02 seconds)
## Fitting performed in 0.59 seconds.
tt.norm.tSNE %>%
select(contains("tSNE", ignore.case = FALSE), sample, Call) %>%
distinct()
## # A tibble: 251 x 4
## tSNE1 tSNE2 sample Call
## <dbl> <dbl> <chr> <fct>
## 1 -9.80 -2.13 TCGA-A1-A0SD-01A-11R-A115-07 LumA
## 2 4.20 -3.44 TCGA-A1-A0SF-01A-11R-A144-07 LumA
## 3 -16.3 -5.86 TCGA-A1-A0SG-01A-11R-A144-07 LumA
## 4 -2.27 7.68 TCGA-A1-A0SH-01A-11R-A084-07 LumA
## 5 -3.52 11.1 TCGA-A1-A0SI-01A-11R-A144-07 LumB
## 6 9.11 -0.424 TCGA-A1-A0SJ-01A-11R-A084-07 LumA
## 7 31.4 -10.4 TCGA-A1-A0SK-01A-12R-A084-07 Basal
## 8 0.545 6.96 TCGA-A1-A0SM-01A-11R-A084-07 LumA
## 9 2.26 7.14 TCGA-A1-A0SN-01A-11R-A144-07 LumB
## 10 -23.6 -3.69 TCGA-A1-A0SQ-01A-21R-A144-07 LumA
## # … with 241 more rows
tt.norm.tSNE %>%
pivot_sample() %>%
ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm.tSNE =
se_breast_tcga_mini %>%
reduce_dimensions(
.abundance = count_scaled,
method = "tSNE",
top = 500,
perplexity=10,
pca_scale =TRUE
)
## Getting the 500 most variable genes
## Performing PCA
## Read the 251 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.183394)!
## Learning embedding...
## Iteration 50: error is 64.458491 (50 iterations in 0.03 seconds)
## Iteration 100: error is 68.796641 (50 iterations in 0.03 seconds)
## Iteration 150: error is 68.483854 (50 iterations in 0.03 seconds)
## Iteration 200: error is 67.829609 (50 iterations in 0.04 seconds)
## Iteration 250: error is 68.495338 (50 iterations in 0.04 seconds)
## Iteration 300: error is 1.686966 (50 iterations in 0.04 seconds)
## Iteration 350: error is 1.418726 (50 iterations in 0.03 seconds)
## Iteration 400: error is 1.261585 (50 iterations in 0.03 seconds)
## Iteration 450: error is 1.221525 (50 iterations in 0.03 seconds)
## Iteration 500: error is 1.198642 (50 iterations in 0.03 seconds)
## Iteration 550: error is 1.180088 (50 iterations in 0.03 seconds)
## Iteration 600: error is 1.175056 (50 iterations in 0.02 seconds)
## Iteration 650: error is 1.170758 (50 iterations in 0.03 seconds)
## Iteration 700: error is 1.163392 (50 iterations in 0.02 seconds)
## Iteration 750: error is 1.150758 (50 iterations in 0.02 seconds)
## Iteration 800: error is 1.138254 (50 iterations in 0.02 seconds)
## Iteration 850: error is 1.133613 (50 iterations in 0.02 seconds)
## Iteration 900: error is 1.132582 (50 iterations in 0.02 seconds)
## Iteration 950: error is 1.133681 (50 iterations in 0.02 seconds)
## Iteration 1000: error is 1.133172 (50 iterations in 0.02 seconds)
## Fitting performed in 0.56 seconds.
se.norm.tSNE
## class: SummarizedExperiment
## dim: 500 251
## metadata(0):
## assays(2): count count_scaled
## rownames(500): ENSG00000002834 ENSG00000003989 ... ENSG00000265972
## ENSG00000272398
## rowData names(0):
## colnames(251): TCGA-A1-A0SD-01A-11R-A115-07
## TCGA-A1-A0SF-01A-11R-A144-07 ... TCGA-GM-A2DM-01A-11R-A180-07
## TCGA-GM-A2DN-01A-11R-A180-07
## colData names(3): Call tSNE1 tSNE2
dimensions
We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. rotate_dimensions
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as <NAME OF DIMENSION> rotated <ANGLE>
by default, or as specified in the input arguments.
tt.norm.MDS.rotated =
tt.norm.MDS %>%
rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample)
Original On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.
tt.norm.MDS.rotated %>%
pivot_sample() %>%
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type` )) +
geom_point() +
my_theme
Rotated On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.
tt.norm.MDS.rotated %>%
pivot_sample() %>%
ggplot(aes(x=`Dim1 rotated 45`, y=`Dim2 rotated 45`, color=`Cell type` )) +
geom_point() +
my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm.MDS %>%
rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample)
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(11): Cell.type time ... Dim1.rotated.45 Dim2.rotated.45
differential abundance
We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). test_differential_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).
tt %>% test_differential_abundance( ~ condition, action="only")
## # A tibble: 527 x 8
## transcript logFC logCPM F PValue FDR significant lowly_abundant
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 CLEC7A -11.6 12.7 109. 0.000142 0.0170 TRUE FALSE
## 2 HK3 -12.2 13.5 90.1 0.000225 0.0170 TRUE FALSE
## 3 APOBEC3A -8.93 11.1 77.7 0.000319 0.0170 TRUE FALSE
## 4 IGSF6 -7.78 10.5 71.8 0.000385 0.0170 TRUE FALSE
## 5 RASSF4 -8.38 12.4 69.0 0.000422 0.0170 TRUE FALSE
## 6 IL2RA 8.37 9.18 65.5 0.000477 0.0170 TRUE FALSE
## 7 TLR8 -9.94 11.6 64.3 0.000497 0.0170 TRUE FALSE
## 8 C5AR1 -9.97 12.2 56.7 0.000667 0.0170 TRUE FALSE
## 9 FCN1 -12.6 15.4 56.6 0.000670 0.0170 TRUE FALSE
## 10 CCR7 8.35 11.9 56.0 0.000687 0.0170 TRUE FALSE
## # … with 517 more rows
All functions are also directly compatible with SummarizedExperiment
.
se_mini %>% test_differential_abundance( ~ condition)
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(1): count
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(7): logFC logCPM ... significant lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(3): Cell.type time condition
counts
We may want to adjust counts
for (known) unwanted variation. adjust_abundance
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as <COUNT COLUMN>_adjusted
. At the moment just an unwanted covariated is allowed at a time.
tt.norm.adj =
tt.norm.batch %>%
adjust_abundance(
~ factor_of_interest + batch,
.abundance = count_scaled,
action = "only"
)
tt.norm.adj
## # A tibble: 910 x 3
## transcript sample count_scaled_adjusted
## <chr> <chr> <int>
## 1 ACAP1 SRR1740034 14644
## 2 ACP5 SRR1740034 3903
## 3 ADRB2 SRR1740034 719
## 4 AIM2 SRR1740034 3707
## 5 ALOX5 SRR1740034 7358
## 6 APOBEC3G SRR1740034 3104
## 7 APOL3 SRR1740034 2417
## 8 APOL6 SRR1740034 1110
## 9 ARHGAP22 SRR1740034 69
## 10 BACH2 SRR1740034 29826
## # … with 900 more rows
All functions are also directly compatible with SummarizedExperiment
.
se.norm.batch %>%
adjust_abundance(
~ factor_of_interest + batch,
.abundance = count_scaled
)
## class: SummarizedExperiment
## dim: 19544 48
## metadata(0):
## assays(3): count count_scaled count_scaled_adjusted
## rownames(19544): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(1): lowly_abundant
## colnames(48): SRR1740034 SRR1740035 ... SRR1740088 SRR1740089
## colData names(7): Cell.type time ... TMM multiplier
Cell type composition
We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). deconvolve_cellularity
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the adjusted cell type proportions.
columns truncated
tt.cibersort =
tt %>%
deconvolve_cellularity(action="get", cores=2)
tt.cibersort %>% select(sample, contains("cibersort:"))
## # A tibble: 5 x 23
## sample `cibersort: B cells naive` `cibersort: B cells memory`
## <chr> <dbl> <dbl>
## 1 SRR17… 0.622 0.238
## 2 SRR17… 0.611 0.257
## 3 SRR17… 0 0
## 4 SRR17… 0.00232 0
## 5 SRR17… 0 0
## # … with 20 more variables: `cibersort: Plasma cells` <dbl>, `cibersort: T
## # cells CD8` <dbl>, `cibersort: T cells CD4 naive` <dbl>, `cibersort: T cells
## # CD4 memory resting` <dbl>, `cibersort: T cells CD4 memory activated` <dbl>,
## # `cibersort: T cells follicular helper` <dbl>, `cibersort: T cells
## # regulatory (Tregs)` <dbl>, `cibersort: T cells gamma delta` <dbl>,
## # `cibersort: NK cells resting` <dbl>, `cibersort: NK cells activated` <dbl>,
## # `cibersort: Monocytes` <dbl>, `cibersort: Macrophages M0` <dbl>,
## # `cibersort: Macrophages M1` <dbl>, `cibersort: Macrophages M2` <dbl>,
## # `cibersort: Dendritic cells resting` <dbl>, `cibersort: Dendritic cells
## # activated` <dbl>, `cibersort: Mast cells resting` <dbl>, `cibersort: Mast
## # cells activated` <dbl>, `cibersort: Eosinophils` <dbl>, `cibersort:
## # Neutrophils` <dbl>
With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting).
tt.cibersort %>%
gather(`Cell type inferred`, `proportion`, 5:26) %>%
distinct(sample, `Cell type`, `Cell type inferred`, proportion) %>%
ggplot(aes(x=`Cell type inferred`, y=proportion, fill=`Cell type`)) +
geom_boxplot() +
facet_wrap(~`Cell type`) +
my_theme +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5)
All functions are also directly compatible with SummarizedExperiment
.
se.cibersort %>% deconvolve_cellularity(cores=2)
## class: SummarizedExperiment
## dim: 19544 48
## metadata(0):
## assays(1): count
## rownames(19544): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(48): SRR1740034 SRR1740035 ... SRR1740088 SRR1740089
## colData names(27): Cell.type time ... cibersort..Eosinophils
## cibersort..Neutrophils
samples
We may want to cluster our data (e.g., using k-means sample-wise). cluster_elements
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.
k-means
tt.norm.cluster = tt.norm %>%
cluster_elements(.abundance = count_scaled, method="kmeans", centers = 2 )
tt.norm.cluster
## # A tibble: 2,635 x 12
## sample transcript `Cell type` count time condition `merged transcripts`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 SRR17… ABCB4 b_cell 1035 0 d TRUE 1
## 2 SRR17… ABCB9 b_cell 45 0 d TRUE 1
## 3 SRR17… ACAP1 b_cell 7151 0 d TRUE 1
## 4 SRR17… ACHE b_cell 2 0 d TRUE 1
## 5 SRR17… ACP5 b_cell 2278 0 d TRUE 1
## 6 SRR17… ADAM28 b_cell 11156 0 d TRUE 1
## 7 SRR17… ADAMDEC1 b_cell 72 0 d TRUE 1
## 8 SRR17… ADAMTS3 b_cell 0 0 d TRUE 1
## 9 SRR17… ADRB2 b_cell 298 0 d TRUE 1
## 10 SRR17… AIF1 b_cell 8 0 d TRUE 1
## # … with 2,625 more rows, and 5 more variables: count_scaled <dbl>, TMM <dbl>,
## # multiplier <dbl>, lowly_abundant <lgl>, `cluster kmeans` <fct>
We can add cluster annotation to the MDS dimesion reduced data set and plot.
tt.norm.MDS %>%
cluster_elements(
.abundance = count_scaled,
method="kmeans",
centers = 2,
action="get"
) %>%
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster kmeans`)) +
geom_point() +
my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm %>%
cluster_elements(.abundance = count_scaled, method="kmeans", centers = 2 )
## class: SummarizedExperiment
## dim: 527 5
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(7): Cell.type time ... multiplier cluster.kmeans
SNN
tt.norm.SNN = tt.norm.tSNE %>% cluster_elements(.abundance= count_scaled, method = "SNN")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 251
## Number of edges: 8484
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.5329
## Number of communities: 4
## Elapsed time: 0 seconds
tt.norm.SNN %>%
pivot_sample()
## # A tibble: 251 x 5
## sample Call tSNE1 tSNE2 `cluster SNN`
## <chr> <fct> <dbl> <dbl> <fct>
## 1 TCGA-A1-A0SD-01A-11R-A115-07 LumA -9.80 -2.13 0
## 2 TCGA-A1-A0SF-01A-11R-A144-07 LumA 4.20 -3.44 2
## 3 TCGA-A1-A0SG-01A-11R-A144-07 LumA -16.3 -5.86 1
## 4 TCGA-A1-A0SH-01A-11R-A084-07 LumA -2.27 7.68 0
## 5 TCGA-A1-A0SI-01A-11R-A144-07 LumB -3.52 11.1 0
## 6 TCGA-A1-A0SJ-01A-11R-A084-07 LumA 9.11 -0.424 1
## 7 TCGA-A1-A0SK-01A-12R-A084-07 Basal 31.4 -10.4 3
## 8 TCGA-A1-A0SM-01A-11R-A084-07 LumA 0.545 6.96 2
## 9 TCGA-A1-A0SN-01A-11R-A144-07 LumB 2.26 7.14 2
## 10 TCGA-A1-A0SQ-01A-21R-A144-07 LumA -23.6 -3.69 1
## # … with 241 more rows
tt.norm.SNN %>%
select(contains("tSNE", ignore.case = FALSE), `cluster SNN`, sample, Call) %>%
gather(source, Call, c("cluster SNN", "Call")) %>%
distinct() %>%
ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + facet_grid(~source) + my_theme
# Do differential transcription between clusters
tt.norm.SNN %>%
mutate(factor_of_interest = `cluster SNN` == 3) %>%
test_differential_abundance(
~ factor_of_interest,
action="only"
)
## # A tibble: 500 x 8
## ens logFC logCPM F PValue FDR significant lowly_abundant
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 ENSG00000065… 1.59 10.2 395. 1.31e-53 6.56e-51 TRUE FALSE
## 2 ENSG00000111… 2.94 9.64 389. 4.51e-53 1.13e-50 TRUE FALSE
## 3 ENSG00000140… 2.69 9.54 345. 3.67e-49 6.12e-47 TRUE FALSE
## 4 ENSG00000186… 6.17 7.99 343. 5.74e-49 7.18e-47 TRUE FALSE
## 5 ENSG00000181… 7.93 9.12 299. 8.93e-45 8.93e-43 TRUE FALSE
## 6 ENSG00000137… 3.88 8.28 267. 1.85e-41 1.54e-39 TRUE FALSE
## 7 ENSG00000083… 1.37 9.39 247. 2.79e-39 1.99e-37 TRUE FALSE
## 8 ENSG00000143… 1.02 10.7 246. 3.56e-39 2.22e-37 TRUE FALSE
## 9 ENSG00000124… 4.66 8.62 238. 2.37e-38 1.32e-36 TRUE FALSE
## 10 ENSG00000092… 2.91 8.40 232. 1.36e-37 6.82e-36 TRUE FALSE
## # … with 490 more rows
All functions are also directly compatible with SummarizedExperiment
.
se.norm.tSNE %>% cluster_elements(.abundance= count_scaled, method = "SNN")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 251
## Number of edges: 8484
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.5329
## Number of communities: 4
## Elapsed time: 0 seconds
## class: SummarizedExperiment
## dim: 500 251
## metadata(0):
## assays(2): count count_scaled
## rownames(500): ENSG00000002834 ENSG00000003989 ... ENSG00000265972
## ENSG00000272398
## rowData names(0):
## colnames(251): TCGA-A1-A0SD-01A-11R-A115-07
## TCGA-A1-A0SF-01A-11R-A144-07 ... TCGA-GM-A2DM-01A-11R-A180-07
## TCGA-GM-A2DN-01A-11R-A180-07
## colData names(4): Call tSNE1 tSNE2 cluster.SNN
redundant
We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. remove_redundancy
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble dropped recundant elements (e.g., samples). Two redundancy estimation approaches are supported:
Approach 1
tt.norm.non_redundant = tt.norm.MDS %>% remove_redundancy( method = "correlation" )
## Getting the 527 most variable genes
We can visualise how the reduced redundancy with the reduced dimentions look like
tt.norm.non_redundant %>%
pivot_sample() %>%
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) +
geom_point() +
my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm.MDS %>% remove_redundancy( method = "correlation" )
## Getting the 527 most variable genes
## class: SummarizedExperiment
## dim: 527 4
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(4): SRR1740035 SRR1740043 SRR1740058 SRR1740067
## colData names(9): Cell.type time ... Dim2 Dim3
Approach 2
tt.norm.non_redundant =
tt.norm.MDS %>%
remove_redundancy(
method = "reduced_dimensions",
.element = sample,
.feature = transcript,
Dim_a_column = `Dim1`,
Dim_b_column = `Dim2`
)
We can visualise MDS reduced dimensions of the samples with the closest pair removed.
tt.norm.non_redundant %>%
pivot_sample() %>%
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) +
geom_point() +
my_theme
All functions are also directly compatible with SummarizedExperiment
.
se.norm.MDS %>%
remove_redundancy(
method = "reduced_dimensions",
.element = sample,
.feature = transcript,
Dim_a_column = `Dim1`,
Dim_b_column = `Dim2`
)
## class: SummarizedExperiment
## dim: 527 3
## metadata(0):
## assays(2): count count_scaled
## rownames(527): ABCB4 ABCB9 ... ZNF324 ZNF442
## rowData names(1): lowly_abundant
## colnames(3): SRR1740035 SRR1740058 SRR1740067
## colData names(9): Cell.type time ... Dim2 Dim3
The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above.
We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts).
counts = tidybulk_SAM_BAM(
file_names,
genome = "hg38",
isPairedEnd = TRUE,
requireBothEndsMapped = TRUE,
checkFragLength = FALSE,
useMetaFeatures = TRUE
)
We can add gene symbols from ensembl identifiers. This is useful since different resources use ensembl IDs while others use gene symbol IDs.
counts_ensembl %>% ensembl_to_symbol(ens)
## # A tibble: 119 x 8
## ens iso `read count` sample cases_0_project_disease_type
## <chr> <chr> <dbl> <chr> <chr>
## 1 ENSG… 13 144 TARGE… Acute Myeloid Leukemia
## 2 ENSG… 13 72 TARGE… Acute Myeloid Leukemia
## 3 ENSG… 13 0 TARGE… Acute Myeloid Leukemia
## 4 ENSG… 13 1099 TARGE… Acute Myeloid Leukemia
## 5 ENSG… 13 11 TARGE… Acute Myeloid Leukemia
## 6 ENSG… 13 2 TARGE… Acute Myeloid Leukemia
## 7 ENSG… 13 3 TARGE… Acute Myeloid Leukemia
## 8 ENSG… 13 2678 TARGE… Acute Myeloid Leukemia
## 9 ENSG… 13 751 TARGE… Acute Myeloid Leukemia
## 10 ENSG… 13 1 TARGE… Acute Myeloid Leukemia
## # … with 109 more rows, and 3 more variables:
## # cases_0_samples_0_sample_type <chr>, transcript <chr>, ref_genome <chr>
Every function takes a tidytranscriptomics structured data as input, and (i) with action=“add” outputs the new information joint to the original input data frame (default), (ii) with action=“get” the new information with the sample or transcript relative informatin depending on what the analysis is about, or (iii) with action=“only” just the new information. For example, from this data set
tt.norm
## # A tibble: 2,635 x 11
## sample transcript `Cell type` count time condition `merged transcripts`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 SRR17… ABCB4 b_cell 1035 0 d TRUE 1
## 2 SRR17… ABCB9 b_cell 45 0 d TRUE 1
## 3 SRR17… ACAP1 b_cell 7151 0 d TRUE 1
## 4 SRR17… ACHE b_cell 2 0 d TRUE 1
## 5 SRR17… ACP5 b_cell 2278 0 d TRUE 1
## 6 SRR17… ADAM28 b_cell 11156 0 d TRUE 1
## 7 SRR17… ADAMDEC1 b_cell 72 0 d TRUE 1
## 8 SRR17… ADAMTS3 b_cell 0 0 d TRUE 1
## 9 SRR17… ADRB2 b_cell 298 0 d TRUE 1
## 10 SRR17… AIF1 b_cell 8 0 d TRUE 1
## # … with 2,625 more rows, and 4 more variables: count_scaled <dbl>, TMM <dbl>,
## # multiplier <dbl>, lowly_abundant <lgl>
action=“add” (Default) We can add the MDS dimensions to the original data set
tt.norm %>%
reduce_dimensions(
.abundance = count_scaled,
method="MDS" ,
.element = sample,
.feature = transcript,
.dims = 3,
action="add"
)
## # A tibble: 2,635 x 14
## sample transcript `Cell type` count time condition `merged transcripts`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 SRR17… ABCB4 b_cell 1035 0 d TRUE 1
## 2 SRR17… ABCB9 b_cell 45 0 d TRUE 1
## 3 SRR17… ACAP1 b_cell 7151 0 d TRUE 1
## 4 SRR17… ACHE b_cell 2 0 d TRUE 1
## 5 SRR17… ACP5 b_cell 2278 0 d TRUE 1
## 6 SRR17… ADAM28 b_cell 11156 0 d TRUE 1
## 7 SRR17… ADAMDEC1 b_cell 72 0 d TRUE 1
## 8 SRR17… ADAMTS3 b_cell 0 0 d TRUE 1
## 9 SRR17… ADRB2 b_cell 298 0 d TRUE 1
## 10 SRR17… AIF1 b_cell 8 0 d TRUE 1
## # … with 2,625 more rows, and 7 more variables: count_scaled <dbl>, TMM <dbl>,
## # multiplier <dbl>, lowly_abundant <lgl>, Dim1 <dbl>, Dim2 <dbl>, Dim3 <dbl>
action=“get” We can add the MDS dimensions to the original data set selecting just the sample-wise column
tt.norm %>%
reduce_dimensions(
.abundance = count_scaled,
method="MDS" ,
.element = sample,
.feature = transcript,
.dims = 3,
action="get"
)
## # A tibble: 5 x 10
## sample `Cell type` time condition `merged transcripts` TMM multiplier
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 SRR17… b_cell 0 d TRUE 1 0.791 1.84
## 2 SRR17… b_cell 1 d TRUE 1 0.799 1.61
## 3 SRR17… monocyte 1 d FALSE 1 1.14 2.55
## 4 SRR17… t_cell 0 d TRUE 1 1.06 1.32
## 5 SRR17… dendritic_… 1 d FALSE 1 1.30 2.56
## # … with 3 more variables: Dim1 <dbl>, Dim2 <dbl>, Dim3 <dbl>
action=“only” We can get just the MDS dimensions relative to each sample
tt.norm %>%
reduce_dimensions(
.abundance = count_scaled,
method="MDS" ,
.element = sample,
.feature = transcript,
.dims = 3,
action="only"
)
## # A tibble: 5 x 4
## sample Dim1 Dim2 Dim3
## <chr> <dbl> <dbl> <dbl>
## 1 SRR1740034 -0.882 -0.708 -0.0291
## 2 SRR1740035 -0.870 -0.715 0.0000434
## 3 SRR1740043 1.40 0.0479 -0.471
## 4 SRR1740058 -0.935 1.39 0.00575
## 5 SRR1740067 1.29 -0.0121 0.494
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidybulk_1.0.2 purrr_0.3.4 rlang_0.4.6 ggplot2_3.3.1 tidyr_1.1.0
## [6] magrittr_1.5 dplyr_1.0.0 tibble_3.0.1 knitr_1.28
##
## loaded via a namespace (and not attached):
## [1] Seurat_3.1.5 Rtsne_0.15
## [3] colorspace_1.4-1 ellipsis_0.3.1
## [5] ggridges_0.5.2 XVector_0.28.0
## [7] GenomicRanges_1.40.0 tidytext_0.2.4
## [9] leiden_0.3.3 listenv_0.8.0
## [11] farver_2.0.3 SnowballC_0.7.0
## [13] ggrepel_0.8.2 bit64_0.9-7
## [15] AnnotationDbi_1.50.0 fansi_0.4.1
## [17] codetools_0.2-16 splines_4.0.0
## [19] jsonlite_1.6.1 broom_0.5.6
## [21] annotate_1.66.0 ica_1.0-2
## [23] cluster_2.1.0 png_0.1-7
## [25] uwot_0.1.8 sctransform_0.2.1
## [27] readr_1.3.1 compiler_4.0.0
## [29] httr_1.4.1 backports_1.1.7
## [31] lazyeval_0.2.2 assertthat_0.2.1
## [33] Matrix_1.2-18 limma_3.44.1
## [35] cli_2.0.2 htmltools_0.4.0
## [37] tools_4.0.0 rsvd_1.0.3
## [39] igraph_1.2.5 gtable_0.3.0
## [41] glue_1.4.1 GenomeInfoDbData_1.2.3
## [43] reshape2_1.4.4 RANN_2.6.1
## [45] rappdirs_0.3.1 Rcpp_1.0.4.6
## [47] Biobase_2.48.0 vctrs_0.3.1
## [49] ape_5.4 preprocessCore_1.50.0
## [51] nlme_3.1-148 lmtest_0.9-37
## [53] xfun_0.14 stringr_1.4.0
## [55] globals_0.12.5 lifecycle_0.2.0
## [57] irlba_2.3.3 XML_3.99-0.3
## [59] future_1.17.0 edgeR_3.30.3
## [61] zlibbioc_1.34.0 MASS_7.3-51.6
## [63] zoo_1.8-8 scales_1.1.1
## [65] hms_0.5.3 parallel_4.0.0
## [67] SummarizedExperiment_1.18.1 RColorBrewer_1.1-2
## [69] gridExtra_2.3 memoise_1.1.0
## [71] reticulate_1.16 pbapply_1.4-2
## [73] stringi_1.4.6 RSQLite_2.2.0
## [75] highr_0.8 genefilter_1.70.0
## [77] tokenizers_0.2.1 S4Vectors_0.26.1
## [79] BiocGenerics_0.34.0 BiocParallel_1.22.0
## [81] GenomeInfoDb_1.24.0 pkgconfig_2.0.3
## [83] matrixStats_0.56.0 bitops_1.0-6
## [85] evaluate_0.14 lattice_0.20-41
## [87] ROCR_1.0-11 htmlwidgets_1.5.1
## [89] patchwork_1.0.0 labeling_0.3
## [91] cowplot_1.0.0 bit_1.1-15.2
## [93] tidyselect_1.1.0 RcppAnnoy_0.0.16
## [95] plyr_1.8.6 R6_2.4.1
## [97] IRanges_2.22.2 generics_0.0.2
## [99] DelayedArray_0.14.0 DBI_1.1.0
## [101] pillar_1.4.4 withr_2.2.0
## [103] mgcv_1.8-31 fitdistrplus_1.1-1
## [105] survival_3.1-12 RCurl_1.98-1.2
## [107] tsne_0.1-3 future.apply_1.5.0
## [109] janeaustenr_0.1.5 widyr_0.1.3
## [111] crayon_1.3.4 KernSmooth_2.23-17
## [113] utf8_1.1.4 plotly_4.9.2.1
## [115] locfit_1.5-9.4 grid_4.0.0
## [117] sva_3.36.0 data.table_1.12.8
## [119] blob_1.2.1 digest_0.6.25
## [121] xtable_1.8-4 stats4_4.0.0
## [123] munsell_0.5.0 viridisLite_0.3.0