This vignette illustrates use cases and visualizations of the data found in the depmap package. See the depmap vignette for details about the datasets.
The depmap package aims to provide a reproducible research framework for the cancer dependency data described by Tsherniak, Aviad, et al. “Defining a cancer dependency map.” Cell 170.3 (2017): 564-576. The data found in the depmap package have been formatted to facilitate the use of common R packages such as dplyr and ggplot2. We hope that this package will allow researchers to more easily mine, explore and visually illustrate dependency data taken from the Depmap cancer genomic dependency study.
Perhaps the most interesting datasets found within the depmap package are those that relate to the cancer gene dependency score, such as rnai and crispr. These datasets contain a score expressing how vital a particular gene is to a target cell line, measured by how lethal the knockout or knockdown of that gene is. For example, a highly negative dependency score implies that a cell line is highly dependent on that gene.
Load the necessary libraries.
library("dplyr")
library("ggplot2")
library("viridis")
library("tibble")
library("gridExtra")
library("stringr")
library("depmap")
library("ExperimentHub")
Load the rnai, crispr and copyNumber datasets for visualization.
## create ExperimentHub query object
eh <- ExperimentHub()
query(eh, "depmap")
## ExperimentHub with 22 records
## # snapshotDate(): 2019-10-22
## # $dataprovider: Broad Institute
## # $species: Homo sapiens
## # $rdataclass: tibble
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH2260"]]'
##
## title
## EH2260 | rnai_19Q1
## EH2261 | crispr_19Q1
## EH2262 | copyNumber_19Q1
## EH2263 | RPPA_19Q1
## EH2264 | TPM_19Q1
## ... ...
## EH3083 | RPPA_19Q3
## EH3084 | TPM_19Q3
## EH3085 | mutationCalls_19Q3
## EH3086 | metadata_19Q3
## EH3087 | drug_sensitivity_19Q3
rnai <- eh[["EH2260"]]
crispr <- eh[["EH2261"]]
copyNumber <- eh[["EH2262"]]
## note: the datasets loaded above are from the 19Q1 release; newer releases,
## such as 19Q2 and 19Q3, are also available.
We will demonstrate how to obtain individual dependency scores corresponding to a specific gene and cell line. For example, shown below is the dependency that a breast cancer cell line, such as 184A1_BREAST, has on a human tumor suppressor gene, such as BRCA1, when that gene is knocked down via RNAi, using data from the rnai dataset. The score is slightly positive, indicating that knockdown of this gene is slightly beneficial to the vitality of this cell line. However, it may be insightful to put this single dependency score in context.
dep_score_BRCA1_184A1Breast <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    filter(cell_line == "184A1_BREAST",
           gene_name == "BRCA1")
dep_score_BRCA1_184A1Breast
## # A tibble: 1 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 184A1_BREAST BRCA1 0.0144
Shown below is the average dependency score for BRCA1 across all cancer cell lines in the rnai dataset.
brca1_dep_score_avg_rnai <- rnai %>%
    select(gene_name, dependency) %>%
    filter(gene_name == "BRCA1") %>%
    summarise(mean_dependency_brca1 =
                  mean(dependency, na.rm = TRUE))
brca1_dep_score_avg_rnai
## # A tibble: 1 x 1
## mean_dependency_brca1
## <dbl>
## 1 -0.158
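To put one gene's mean in context, the same summary can be computed for every gene at once. The following is a minimal sketch on a small made-up tibble (`toy_rnai` is hypothetical and its values are invented; only the column names mirror the rnai dataset). With the real data, `toy_rnai` would be replaced by `rnai`.

```r
library(dplyr)

## toy stand-in for the rnai dataset (invented values, for illustration only)
toy_rnai <- tibble(
    cell_line  = c("A_BREAST", "B_LUNG", "A_BREAST", "B_LUNG"),
    gene_name  = c("BRCA1", "BRCA1", "TP53", "TP53"),
    dependency = c(-0.2, -0.4, 0.5, NA)
)

## mean dependency per gene, ignoring missing scores,
## sorted so the most negative mean comes first
mean_dep_by_gene <- toy_rnai %>%
    group_by(gene_name) %>%
    summarise(mean_dependency = mean(dependency, na.rm = TRUE)) %>%
    arrange(mean_dependency)
mean_dep_by_gene
```

Grouping by `gene_name` before summarising yields one row per gene, so a single gene's mean can be read against all the others.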
We can also compute the average dependency across all genes in the entire rnai dataset. As shown below, the mean dependency over all genes in the rnai dataset is slightly negative but close to zero.
all_gene_dep_score_avg_rnai <- rnai %>%
    select(gene_name, dependency) %>%
    summarise(mean_dependency_all_genes_rnai =
                  mean(dependency, na.rm = TRUE))
all_gene_dep_score_avg_rnai
## # A tibble: 1 x 1
## mean_dependency_all_genes_rnai
## <dbl>
## 1 -0.0659
Suppose we are interested in researching soft tissue sarcomas and want to find the cell lines within the rnai dataset that have “soft tissue” in their CCLE name, sorted from the most negative (strongest) dependency score upward. The results of such a search are shown below. Note: CCLE names are in ALL CAPS, with words separated by underscores.
soft_tissue_dependency_rnai <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    filter(stringr::str_detect(cell_line,
                               "SOFT_TISSUE")) %>%
    arrange(dependency)
soft_tissue_dependency_rnai
## # A tibble: 432,725 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 FUJI_SOFT_TISSUE RPL14 -3.60
## 2 SJRH30_SOFT_TISSUE RAN -3.41
## 3 SJRH30_SOFT_TISSUE RPL14 -3.36
## 4 SJRH30_SOFT_TISSUE RBX1 -3.31
## 5 HS729_SOFT_TISSUE PSMA3 -3.22
## 6 SJRH30_SOFT_TISSUE RUVBL2 -3.13
## 7 KYM1_SOFT_TISSUE RPL14 -3.03
## 8 RH41_SOFT_TISSUE RBX1 -3.01
## 9 HS729_SOFT_TISSUE NUTF2 -2.90
## 10 SJRH30_SOFT_TISSUE NUTF2 -2.85
## # … with 432,715 more rows
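A result with several hundred thousand rows does not show which cell lines matched. The sketch below, on a hypothetical `toy_rnai` tibble with invented values, lists the matching cell lines and the number of scores for each. It uses base R `grepl()` in place of `stringr::str_detect()` so that only dplyr is needed; this is a substitution made for the example, not the vignette's own approach.

```r
library(dplyr)

## toy stand-in for the rnai dataset (invented values, for illustration only)
toy_rnai <- tibble(
    cell_line  = c("FUJI_SOFT_TISSUE", "FUJI_SOFT_TISSUE",
                   "KYM1_SOFT_TISSUE", "143B_BONE"),
    gene_name  = c("GENE_A", "GENE_B", "GENE_A", "GENE_B"),
    dependency = c(-1.5, -0.2, -0.9, 0.1)
)

## keep rows whose CCLE name contains the literal string "SOFT_TISSUE",
## then count the number of scores per matching cell line
soft_tissue_lines <- toy_rnai %>%
    filter(grepl("SOFT_TISSUE", cell_line, fixed = TRUE)) %>%
    count(cell_line, name = "n_scores")
soft_tissue_lines
```

Setting `fixed = TRUE` makes `grepl()` treat the pattern as a literal string and not as a regular expression.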
Sometimes it is difficult to find the subset with the exact gene name one wishes to find. In this case, it is better to search by entrez_id. For example, a recent paper describes how knockdown of the NRF2 gene increases chemosensitivity in certain types of cancer. It might be interesting to see what effect knockdown of this gene has on other cancer cell lines. However, searching with filter(gene_name == "NRF2") will not yield any results. We know from NCBI that the Entrez ID for this gene is "4780", and it is possible to search the dataset by that criterion. The result shows that the gene name for NRF2 in the rnai dataset is NFE2L2.
entrez_id_NRF2 <- rnai %>%
    select(entrez_id, cell_line, gene_name, dependency) %>%
    filter(entrez_id == "4780")
entrez_id_NRF2
## # A tibble: 712 x 4
## entrez_id cell_line gene_name dependency
## <chr> <chr> <chr> <dbl>
## 1 4780 127399_SOFT_TISSUE NFE2L2 0.0788
## 2 4780 1321N1_CENTRAL_NERVOUS_SYSTEM NFE2L2 -0.105
## 3 4780 143B_BONE NFE2L2 0.0617
## 4 4780 184A1_BREAST NFE2L2 -0.0333
## 5 4780 184B5_BREAST NFE2L2 -0.0360
## 6 4780 22RV1_PROSTATE NFE2L2 0.116
## 7 4780 2313287_STOMACH NFE2L2 -0.0752
## 8 4780 600MPE_BREAST NFE2L2 -0.195
## 9 4780 697_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE NFE2L2 0.0128
## 10 4780 769P_KIDNEY NFE2L2 -0.0951
## # … with 702 more rows
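If only the gene symbol for an Entrez ID is needed, the repeated rows can be collapsed into one. The following is a minimal sketch on a hypothetical `toy_rnai` tibble (dependency values are invented); as in the output above, `entrez_id` is assumed to be stored as a character column.

```r
library(dplyr)

## toy stand-in for the rnai dataset (invented dependency values)
toy_rnai <- tibble(
    entrez_id  = c("4780", "4780", "672"),
    cell_line  = c("143B_BONE", "184A1_BREAST", "184A1_BREAST"),
    gene_name  = c("NFE2L2", "NFE2L2", "BRCA1"),
    dependency = c(0.06, -0.03, 0.01)
)

## one row per (entrez_id, gene_name) pair for the requested ID
symbol_for_id <- toy_rnai %>%
    filter(entrez_id == "4780") %>%
    distinct(entrez_id, gene_name)
symbol_for_id
```

`distinct()` keeps a single row for each unique combination of the listed columns, so the ID maps to a one-row lookup instead of one row per cell line.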
Below, the strongest (most negative) dependency scores for RNAi knockdown of a specific gene, NFE2L2, are obtained, and the cancer cell lines associated with those values are listed. It appears that knockdown of this gene is strongly associated with cell death in lung and kidney cancer cell lines.
top_dep_score_NFE2L2_rnai <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    filter(gene_name == "NFE2L2") %>%
    arrange(dependency)
top_dep_score_NFE2L2_rnai
## # A tibble: 712 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 NCIH2066_LUNG NFE2L2 -1.29
## 2 NCIH2122_LUNG NFE2L2 -0.886
## 3 CAKI2_KIDNEY NFE2L2 -0.865
## 4 NCIH1792_LUNG NFE2L2 -0.860
## 5 NCIH28_PLEURA NFE2L2 -0.802
## 6 A498_KIDNEY NFE2L2 -0.779
## 7 LC1SQSF_LUNG NFE2L2 -0.743
## 8 LK2_LUNG NFE2L2 -0.689
## 9 NCIH1437_LUNG NFE2L2 -0.603
## 10 AU565_BREAST NFE2L2 -0.601
## # … with 702 more rows
If we would like to obtain the 10 lowest dependency scores for a particular cell line (for example NCIH2066_LUNG) along with the genes associated with those values, we can filter by cell line and sort by dependency. The first 10 rows of the printed tibble are the result:
top_dep_score_NCIH2066_LUNG_rnai <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    filter(cell_line == "NCIH2066_LUNG") %>%
    arrange(dependency)
top_dep_score_NCIH2066_LUNG_rnai
## # A tibble: 17,309 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 NCIH2066_LUNG KIF11 -3.46
## 2 NCIH2066_LUNG ATP6V0C -3.02
## 3 NCIH2066_LUNG CKAP5 -3.02
## 4 NCIH2066_LUNG CASP8AP2 -2.87
## 5 NCIH2066_LUNG RAN -2.81
## 6 NCIH2066_LUNG SF3B2 -2.76
## 7 NCIH2066_LUNG USP39 -2.72
## 8 NCIH2066_LUNG SNRNP200 -2.65
## 9 NCIH2066_LUNG TACC3 -2.61
## 10 NCIH2066_LUNG MAD2L1 -2.59
## # … with 17,299 more rows
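The query above returns every gene for the cell line and relies on tibble printing to show the first 10 rows. To keep only the top rows as an object, the sorted result can be truncated with `head()`. The sketch below uses a hypothetical `toy_rnai` tibble with invented gene names and scores, and keeps the top 3 only because the toy data is small.

```r
library(dplyr)

## toy stand-in for the rnai dataset (invented genes and values)
toy_rnai <- tibble(
    cell_line  = rep("NCIH2066_LUNG", 5),
    gene_name  = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
    dependency = c(-0.1, -2.5, 0.3, -1.7, -3.0)
)

## number of rows to keep (use 10 with the real data)
n_top <- 3

## sort ascending so the most negative scores come first, then truncate
top_n_genes <- toy_rnai %>%
    filter(cell_line == "NCIH2066_LUNG") %>%
    arrange(dependency) %>%
    head(n_top)
top_n_genes
```

Newer dplyr releases (1.0.0 and later) also provide `slice_min()`, which combines the sort and the truncation in one step. `head()` is used here because it also works with the older dplyr version listed in the session information.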
Shown below are the genes whose knockdown most strongly depletes cancer cell lines, along with their dependency scores, across the entire rnai dataset.
greatest_dep_score_gene_rnai <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    arrange(dependency)
greatest_dep_score_gene_rnai
## # A tibble: 12,324,008 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 SW1088_CENTRAL_NERVOUS_SYSTEM UBC -5.93
## 2 COV318_OVARY UBC -5.42
## 3 MEL285_UVEA PSMB5 -4.97
## 4 MEL285_UVEA PSMA3 -4.79
## 5 COLO678_LARGE_INTESTINE NXF1 -4.74
## 6 CW2_LARGE_INTESTINE CTNNB1 -4.71
## 7 COV318_OVARY PUF60 -4.65
## 8 CCK81_LARGE_INTESTINE MCL1 -4.60
## 9 CW2_LARGE_INTESTINE USP39 -4.60
## 10 MEL285_UVEA VARS -4.59
## # … with 12,323,998 more rows
Shown below are the genes whose knockdown most strongly promotes cancer cell line vitality (the highest positive dependency scores), across the entire rnai dataset. Unsurprisingly, we see a high incidence of TP53, a well-known cancer driver.
lowest_dep_score_gene_rnai <- rnai %>%
    select(cell_line, gene_name, dependency) %>%
    arrange(desc(dependency))
lowest_dep_score_gene_rnai
## # A tibble: 12,324,008 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 SKNSH_AUTONOMIC_GANGLIA TP53 2.77
## 2 OVTOKO_OVARY TP53 2.38
## 3 NB1_AUTONOMIC_GANGLIA UBBP4 2.07
## 4 RVH421_SKIN TP53 2.01
## 5 SNU738_CENTRAL_NERVOUS_SYSTEM COPB2 1.96
## 6 SNU1079_BILIARY_TRACT TP53 1.95
## 7 C32_SKIN TP53 1.93
## 8 JHUEM2_ENDOMETRIUM CDKN2A 1.92
## 9 MEL285_UVEA TP53 1.91
## 10 NCIH28_PLEURA MED12 1.89
## # … with 12,323,998 more rows
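The incidence of a gene among the highest scores can be quantified by counting gene names in the top rows. This is a minimal sketch on a hypothetical `toy_rnai` tibble (cell line names and values are invented), using the top 5 rows where a real analysis might use the top 50 or more.

```r
library(dplyr)

## toy stand-in for the rnai dataset (invented cell lines and values)
toy_rnai <- tibble(
    cell_line  = c("L1", "L2", "L3", "L4", "L5", "L6"),
    gene_name  = c("TP53", "TP53", "CDKN2A", "TP53", "MED12", "RAN"),
    dependency = c(2.7, 2.3, 1.9, 2.0, 1.8, -3.0)
)

## take the 5 highest scores, then count how often each gene appears
top_gene_counts <- toy_rnai %>%
    arrange(desc(dependency)) %>%
    head(5) %>%
    count(gene_name, sort = TRUE)
top_gene_counts
```

With `sort = TRUE`, `count()` places the most frequent gene first, which makes a recurring gene such as TP53 easy to spot.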
Below we apply some of the same selections shown in the above examples to the crispr gene knockout dataset and observe the differences between that dataset and rnai. First we look at the most negative dependency scores in the crispr dataset. As can be seen below, a different set of genes has the strongest dependency scores.
greatest_dep_score_gene_crispr <- crispr %>%
    select(cell_line, gene_name, dependency) %>%
    arrange(dependency)
greatest_dep_score_gene_crispr
## # A tibble: 9,839,772 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 NCIH446_LUNG HIST2H3A -3.18
## 2 KE97_STOMACH BUB3 -3.09
## 3 HT1376_URINARY_TRACT RAN -3.04
## 4 EN_ENDOMETRIUM HIST2H3A -2.91
## 5 KE97_STOMACH CCT3 -2.84
## 6 HSB2_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE RAN -2.82
## 7 KE97_STOMACH SNRPD1 -2.80
## 8 EN_ENDOMETRIUM RAN -2.78
## 9 SR786_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE TBC1D3C -2.75
## 10 NCIH2887_LUNG RAN -2.75
## # … with 9,839,762 more rows
Next we will look at the highest positive (most cancer-promoting) dependency scores in the crispr dataset.
lowest_dep_score_gene_crispr <- crispr %>%
    select(cell_line, gene_name, dependency) %>%
    arrange(desc(dependency))
lowest_dep_score_gene_crispr
## # A tibble: 9,839,772 x 3
## cell_line gene_name dependency
## <chr> <chr> <dbl>
## 1 TC32_BONE PTEN 5.44
## 2 KE97_STOMACH UBA52 4.97
## 3 KE97_STOMACH HNRNPA1 4.58
## 4 KS1_CENTRAL_NERVOUS_SYSTEM TP53 4.07
## 5 KE97_STOMACH RPS27 4.03
## 6 DKMG_CENTRAL_NERVOUS_SYSTEM TP53 3.88
## 7 SR786_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE TBC1D3 3.71
## 8 KE97_STOMACH HNRNPA3 3.64
## 9 KE97_STOMACH PSME1 3.61
## 10 KE97_STOMACH RPL34 3.31
## # … with 9,839,762 more rows
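One way to compare the two screens directly is to join them on cell line and gene, so that each row carries both scores. The sketch below uses two hypothetical tibbles (`toy_rnai` and `toy_crispr`, with invented values); it assumes, as in the outputs above, that both datasets share the `cell_line`, `gene_name` and `dependency` columns.

```r
library(dplyr)

## toy stand-ins for the rnai and crispr datasets (invented values)
toy_rnai <- tibble(
    cell_line  = c("L1", "L1", "L2"),
    gene_name  = c("G1", "G2", "G1"),
    dependency = c(-1.0, 0.2, -0.5)
)
toy_crispr <- tibble(
    cell_line  = c("L1", "L2", "L2"),
    gene_name  = c("G1", "G1", "G3"),
    dependency = c(-1.5, -0.25, 0.1)
)

## keep only cell line / gene pairs present in both screens;
## the suffixes label which screen each score came from
rnai_vs_crispr <- inner_join(toy_rnai, toy_crispr,
                             by = c("cell_line", "gene_name"),
                             suffix = c("_rnai", "_crispr")) %>%
    mutate(difference = dependency_crispr - dependency_rnai)
rnai_vs_crispr
```

An inner join drops pairs that were screened by only one method, so the `difference` column is defined for every remaining row.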
Here we will plot the difference in expression between the most significant genes found in the crispr and rnai datasets.
Compare the count of top 50 unique genes for the crispr and rnai datasets for the most cancer-vitality inducing genes.
Mean log copy number (total dataset) and mean log copy number for each gene
Find genes with greatest mean log copy number
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] stringr_1.4.0 gridExtra_2.3 tibble_2.1.3
## [4] viridis_0.5.1 viridisLite_0.3.0 ggplot2_3.2.1
## [7] ExperimentHub_1.12.0 AnnotationHub_2.18.0 BiocFileCache_1.10.0
## [10] dbplyr_1.4.2 BiocGenerics_0.32.0 depmap_1.0.0
## [13] dplyr_0.8.3 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 assertthat_0.2.1
## [3] zeallot_0.1.0 digest_0.6.22
## [5] utf8_1.1.4 mime_0.7
## [7] R6_2.4.0 backports_1.1.5
## [9] stats4_3.6.1 RSQLite_2.1.2
## [11] evaluate_0.14 httr_1.4.1
## [13] pillar_1.4.2 rlang_0.4.1
## [15] lazyeval_0.2.2 curl_4.2
## [17] blob_1.2.0 S4Vectors_0.24.0
## [19] rmarkdown_1.16 labeling_0.3
## [21] bit_1.1-14 munsell_0.5.0
## [23] shiny_1.4.0 compiler_3.6.1
## [25] httpuv_1.5.2 xfun_0.10
## [27] pkgconfig_2.0.3 htmltools_0.4.0
## [29] tidyselect_0.2.5 interactiveDisplayBase_1.24.0
## [31] bookdown_0.14 IRanges_2.20.0
## [33] fansi_0.4.0 crayon_1.3.4
## [35] withr_2.1.2 later_1.0.0
## [37] rappdirs_0.3.1 grid_3.6.1
## [39] xtable_1.8-4 gtable_0.3.0
## [41] DBI_1.0.0 magrittr_1.5
## [43] scales_1.0.0 cli_1.1.0
## [45] stringi_1.4.3 promises_1.1.0
## [47] vctrs_0.2.0 tools_3.6.1
## [49] bit64_0.9-7 Biobase_2.46.0
## [51] glue_1.3.1 purrr_0.3.3
## [53] BiocVersion_3.10.1 fastmap_1.0.1
## [55] yaml_2.2.0 AnnotationDbi_1.48.0
## [57] colorspace_1.4-1 BiocManager_1.30.9
## [59] memoise_1.1.0 knitr_1.25