Progenetix is an open data resource that provides curated individual cancer copy number aberration (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette is a guide to accessing genomic variant data in the Progenetix database. If your focus is on cancer cell lines, you can access data from cancercelllines.org by setting the dataset parameter to “cancercelllines”. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.
library(pgxRpi)
pgxLoader function
This function loads various data from the Progenetix database.
The parameters of this function used in this tutorial:
type: A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
output: A string specifying the output file format. When the parameter type is “variant”, available options are NULL, “pgxseg”, “pgxmatrix”, “coverage” or “seg”.
filters: Identifiers for cancer type, literature, cohorts, and age, such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
individual_id: Identifiers used in the Progenetix database for identifying individuals.
biosample_id: Identifiers used in the Progenetix database for identifying biosamples.
codematches: A logical value determining whether to exclude samples from child concepts of specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples exactly encoded by the specified filters are kept. Do not use this parameter when filters include ontology-irrelevant filters such as PMID and cohort identifiers. Default is FALSE.
limit: Integer specifying the number of returned CNV coverage profiles for each filter. Default is 0 (return all).
skip: Integer specifying the number of skipped CNV coverage profiles for each filter. E.g. if skip = 2, limit=500, the first 2*500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
save_file: A logical value determining whether to save the segment variant data as a file instead of returning it directly. Only used when the parameter type is “variant” and output is “pgxseg” or “seg”. Default is FALSE.
filename: A string specifying the path and name of the file to be saved. Only used if the parameter save_file is TRUE. Default is “variants.seg/pgxseg” in the current working directory.
num_cores: Integer specifying the number of cores on your local computer to be used for the variant query. The default is 1.
dataset: A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.
Relevant parameters: type, output, filters, individual_id, biosample_id, codematches, skip, limit, dataset
cnv_matrix <- pgxLoader(type="variant", output="pgxmatrix", filters = "NCIT:C2948")
The data looks like this
print(dim(cnv_matrix))
#> [1] 47 6215
cnv_matrix[c(1:3), c(1:5,6213:6215)]
#> analysis_id biosample_id group_id chr1.000000000.000400000.DUP
#> 1 pgxcs-kftvs0ri pgxbs-kftvh262 NCIT:C2948 0
#> 2 pgxcs-kftvu9w2 pgxbs-kftvh9fp NCIT:C2948 0
#> 3 pgxcs-kftvw1kw pgxbs-kftvhf4h NCIT:C8893 0
#> chr1.000400000.001400000.DUP chrY.054400000.055400000.DEL
#> 1 0 0
#> 2 0 1
#> 3 0 0
#> chrY.055400000.056400000.DEL chrY.056400000.057227415.DEL
#> 1 0 0
#> 2 1 1
#> 3 0 0
In this data frame, analysis_id is the identifier of an individual analysis and biosample_id is the identifier of an individual biosample. Note that the number of analysis profiles does not necessarily equal the number of samples: one biosample_id may correspond to multiple analysis_id values. group_id has the same meaning as filters. These columns are followed by all “gain status” columns (3106 intervals) and then all “loss status” columns (3106 intervals). The status is given as a coverage value, i.e. the fraction of the binned interval that overlaps with one or more CNVs of the given type (DUP/DEL). For example, if the column chr1.000400000.001400000.DUP is 0.200 in one row, one or more duplication events overlapped with 20% of the genomic bin at chromosome 1: 400000-1400000 in the corresponding analysis.
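To make the coverage value concrete, here is a minimal base R sketch of how such a fraction could be computed for one bin. It is only an illustration under stated assumptions: bin_coverage is a hypothetical helper (not part of pgxRpi), the duplication segments are invented, and the way Progenetix actually computes the matrix may differ.

```r
# Hypothetical helper: fraction of a genomic bin covered by one or more
# segments on the same chromosome. Not a pgxRpi function.
bin_coverage <- function(bin_start, bin_end, seg_start, seg_end) {
  # Clip every segment to the bin and drop segments that do not overlap it
  s <- pmax(seg_start, bin_start)
  e <- pmin(seg_end, bin_end)
  keep <- e > s
  s <- s[keep]
  e <- e[keep]
  if (length(s) == 0) return(0)
  # Merge overlapping clipped segments so shared bases are counted once
  ord <- order(s)
  s <- s[ord]
  e <- e[ord]
  covered <- 0
  cur_s <- s[1]
  cur_e <- e[1]
  for (i in seq_along(s)[-1]) {
    if (s[i] <= cur_e) {
      cur_e <- max(cur_e, e[i])
    } else {
      covered <- covered + (cur_e - cur_s)
      cur_s <- s[i]
      cur_e <- e[i]
    }
  }
  covered <- covered + (cur_e - cur_s)
  covered / (bin_end - bin_start)
}

# Invented duplication segments overlapping the bin chr1:400000-1400000
dup_start <- c(300000, 450000)
dup_end   <- c(500000, 600000)
bin_coverage(400000, 1400000, dup_start, dup_end)
```

The two invented segments overlap each other, so they are merged before the covered length is divided by the bin width.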
cnv_coverage <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948")
It includes CNV coverage across chromosome arms, whole chromosomes, or whole genome.
names(cnv_coverage)
#> [1] "chrom_arm_coverage" "whole_chrom_coverage" "whole_genome_coverage"
The data of CNV coverage across chromosomal arms looks like this
head(cnv_coverage$chrom_arm_coverage)[,c(1:4, 49:52)]
#> chr1p.dup chr1q.dup chr2p.dup chr2q.dup chr1p.del chr1q.del
#> pgxbs-kftvh262 0 0.000 0.000 0.000 0.000 0
#> pgxbs-kftvh9fp 0 0.000 0.000 0.000 0.000 0
#> pgxbs-kftvhf4h 0 0.000 0.000 0.000 0.000 0
#> pgxbs-kftvhf4i 0 0.000 0.000 0.000 0.000 0
#> pgxbs-kftvhf4k 0 0.000 0.000 0.000 0.225 0
#> pgxbs-kftvhf4m 0 0.979 0.989 0.003 0.000 0
#> chr2p.del chr2q.del
#> pgxbs-kftvh262 0 0
#> pgxbs-kftvh9fp 0 0
#> pgxbs-kftvhf4h 0 0
#> pgxbs-kftvhf4i 0 0
#> pgxbs-kftvhf4k 0 0
#> pgxbs-kftvhf4m 0 0
The row names are the ids of biosamples from the group NCIT:C2948. There are 96 columns: the first 48 columns are duplication coverage across chromosomal arms, followed by 48 columns of deletion coverage. The data of CNV coverage across whole chromosomes is similar, differing only in its columns.
The data of CNV coverage across genome (hg38) looks like this
head(cnv_coverage$whole_genome_coverage)
#> cnvfraction dupfraction delfraction
#> pgxbs-kftvh262 0.080 0.036 0.044
#> pgxbs-kftvh9fp 0.058 0.010 0.048
#> pgxbs-kftvhf4h 0.000 0.000 0.000
#> pgxbs-kftvhf4i 0.000 0.000 0.000
#> pgxbs-kftvhf4k 0.027 0.000 0.027
#> pgxbs-kftvhf4m 0.176 0.159 0.017
The first column is the total called coverage, followed by duplication coverage and deletion coverage.
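As a small offline sketch of working with this table, the six rows printed above can be rebuilt as a data frame and filtered. The 5% cutoff is an arbitrary value chosen for illustration, and the check that the total equals duplication plus deletion coverage is only verified for these six rows, not claimed as a general property.

```r
# Values copied from the whole_genome_coverage output shown above
wg <- data.frame(
  cnvfraction = c(0.080, 0.058, 0.000, 0.000, 0.027, 0.176),
  dupfraction = c(0.036, 0.010, 0.000, 0.000, 0.000, 0.159),
  delfraction = c(0.044, 0.048, 0.000, 0.000, 0.027, 0.017),
  row.names = c("pgxbs-kftvh262", "pgxbs-kftvh9fp", "pgxbs-kftvhf4h",
                "pgxbs-kftvhf4i", "pgxbs-kftvhf4k", "pgxbs-kftvhf4m")
)

# In these six rows the total equals duplication plus deletion coverage
sums_match <- all(abs(wg$cnvfraction - (wg$dupfraction + wg$delfraction)) < 1e-9)
sums_match

# Biosamples with more than 5% of the genome covered by CNVs (arbitrary cutoff)
high_cnv <- rownames(wg)[wg$cnvfraction > 0.05]
high_cnv
```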
codematches use
Setting codematches = TRUE excludes profiles whose group_id belongs to child terms of the input filters. In this case, 21 of the original 47 samples are excluded.
cnv_coverage_2 <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948",
codematches = TRUE)
print(dim(cnv_coverage$chrom_arm_coverage))
#> [1] 47 96
print(dim(cnv_coverage_2$chrom_arm_coverage))
#> [1] 26 96
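The effect of codematches can be pictured as an exact match on group_id. The sketch below emulates this on the client side with the first three rows of the pgxmatrix output shown earlier; it assumes that NCIT:C8893 is a child term of NCIT:C2948 (which is consistent with that row being returned by the NCIT:C2948 query) and it is not a description of how the server applies the parameter.

```r
# First three rows of the pgxmatrix output shown earlier (ids and group_id only)
profiles <- data.frame(
  biosample_id = c("pgxbs-kftvh262", "pgxbs-kftvh9fp", "pgxbs-kftvhf4h"),
  group_id     = c("NCIT:C2948", "NCIT:C2948", "NCIT:C8893"),
  stringsAsFactors = FALSE
)

query_filter <- "NCIT:C2948"

# Keep only profiles encoded exactly by the queried term
exact <- profiles[profiles$group_id == query_filter, ]
nrow(exact)
```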
By default, pgxLoader returns all available profiles (limit=0), so the query may take a while when the number of retrieved samples is large. You can use the parameters limit and skip to access a subset of samples.
cnv_matrix_2 <- pgxLoader(type="variant", output="pgxmatrix",
filters = "NCIT:C2948",
skip = 0, limit=10)
# the dimension of subset
print(dim(cnv_matrix_2))
#> [1] 10 6215
# the dimension of original set
print(dim(cnv_matrix))
#> [1] 47 6215
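Since skip counts pages of size limit rather than single profiles, it can help to spell out the arithmetic. The following sketch uses a hypothetical helper, page_indices (not part of pgxRpi), that returns which profile positions a given skip/limit pair would select, following the rule stated in the parameter description (skip = 2, limit = 500 skips the first 1000 profiles).

```r
# Hypothetical helper: positions of the profiles selected by skip/limit.
# skip defaults to 0 here for simplicity (pgxLoader documents NULL as "no skip").
page_indices <- function(n_total, skip = 0, limit = 0) {
  if (limit == 0) return(seq_len(n_total))  # limit = 0: return everything
  first <- skip * limit + 1                 # skip counts pages of size `limit`
  if (first > n_total) return(integer(0))   # page lies beyond the last profile
  first:min(n_total, first + limit - 1)
}

page_indices(47, skip = 0, limit = 10)      # first page of the 47 profiles
page_indices(47, skip = 4, limit = 10)      # last, incomplete page
range(page_indices(2000, skip = 2, limit = 500))
```

Looping skip from 0 upwards until an empty page is returned is one way to retrieve a large cohort in chunks.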
cnv_ind_matrix <- pgxLoader(type="variant", output="pgxmatrix",
biosample_id = "pgxbs-kftva604",
individual_id = "pgxind-kftx5g4t")
cnv_ind_cov <- pgxLoader(type="variant", output="coverage",
biosample_id = "pgxbs-kftva604",
individual_id = "pgxind-kftx5g4t")
Because of a time-out issue, segment variant data can only be accessed by biosample id, not by filters. To speed up this process, you can set the num_cores parameter for parallel processing.
Relevant parameters: type, output, biosample_id, save_file, filename, num_cores, dataset
The biosample information is also obtained with pgxLoader; for the vignette about metadata queries, see Introduction_1_loadmetadata.
biosamples <- pgxLoader(type="biosample", filters = "PMID:20229506", limit=2)
biosample_id <- biosamples$biosample_id
There are three output formats.
The default output format extracts variant data from the Beacon v2 response, containing variant id and associated analysis id, biosample id and individual id. The CNV data is represented as copy number change class following the GA4GH Variation Representation Specification (VRS).
variant_1 <- pgxLoader(type="variant", biosample_id = biosample_id)
head(variant_1)
#> variant_id analysis_id biosample_id individual_id
#> 1 pgxvar-66577cf9b44ee5c2598e5148 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 2 pgxvar-66577cf9b44ee5c2598e5149 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 3 pgxvar-66577cf9b44ee5c2598e514a pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 4 pgxvar-66577cf9b44ee5c2598e514b pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 5 pgxvar-66577cf9b44ee5c2598e514c pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 6 pgxvar-66577cf9b44ee5c2598e514d pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> variant variant_copychange
#> 1 1:1731500-12832655:EFO_0030068 EFO:0030068
#> 2 1:12849386-57712606:EFO_0030068 EFO:0030068
#> 3 1:57713043-64335282:EFO_0030068 EFO:0030068
#> 4 1:64338058-68715276:EFO_0030068 EFO:0030068
#> 5 1:68716685-72284670:EFO_0030068 EFO:0030068
#> 6 1:72303254-77320849:EFO_0030068 EFO:0030068
output = “pgxseg”
This is the ‘.pgxseg’ file format. It contains segment mean values (in the log2 column), which are equal to log2(copy number of measured sample/copy number of control sample (usually 2)). A few variants are point mutations, represented by the columns reference_bases and alternate_bases.
variant_2 <- pgxLoader(type="variant", biosample_id = biosample_id,output = "pgxseg")
head(variant_2)
#> biosample_id reference_name start end log2 variant_type
#> 1 pgxbs-kftviq25 1 1731500 12832655 -0.4922 DEL
#> 2 pgxbs-kftviq25 1 12849386 57712606 -0.4888 DEL
#> 3 pgxbs-kftviq25 1 57713043 64335282 -0.4254 DEL
#> 4 pgxbs-kftviq25 1 64338058 68715276 -0.4098 DEL
#> 5 pgxbs-kftviq25 1 68716685 72284670 -0.3219 DEL
#> 6 pgxbs-kftviq25 1 72303254 77320849 -0.3330 DEL
#> reference_bases alternate_bases variant_state_id variant_state_label
#> 1 . . EFO:0030068 low-level copy number loss
#> 2 . . EFO:0030068 low-level copy number loss
#> 3 . . EFO:0030068 low-level copy number loss
#> 4 . . EFO:0030068 low-level copy number loss
#> 5 . . EFO:0030068 low-level copy number loss
#> 6 . . EFO:0030068 low-level copy number loss
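The log2 definition above can be inverted to get a rough copy number estimate per segment. This is a minimal sketch that assumes a control copy number of 2 and ignores effects such as tumor purity; the log2 values are copied from the output above.

```r
# log2 values of the six segments printed above
log2_values <- c(-0.4922, -0.4888, -0.4254, -0.4098, -0.3219, -0.3330)

# Assumption: the control sample is diploid (copy number 2)
control_cn <- 2

# Invert log2(sample_cn / control_cn) to estimate the sample copy number
estimated_cn <- control_cn * 2^log2_values
round(estimated_cn, 2)
```

All six estimates fall below 2, in line with the DEL variant type reported for these segments.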
output = “seg”
This format is similar to the general ‘.seg’ file format and is compatible with the IGV tool for visualization. The only difference from the general ‘.seg’ file format is the fifth column: here it represents the variant type, whereas in the general ‘.seg’ file format it represents the number of probes or bins covered by the segment. In addition, point mutation variants are excluded from this file format.
variant_3 <- pgxLoader(type="variant", biosample_id = biosample_id,output = "seg")
head(variant_3)
#> biosample_id reference_name start end variant_type log2
#> 1 pgxbs-kftviq25 1 1731500 12832655 DEL -0.4922
#> 2 pgxbs-kftviq25 1 12849386 57712606 DEL -0.4888
#> 3 pgxbs-kftviq25 1 57713043 64335282 DEL -0.4254
#> 4 pgxbs-kftviq25 1 64338058 68715276 DEL -0.4098
#> 5 pgxbs-kftviq25 1 68716685 72284670 DEL -0.3219
#> 6 pgxbs-kftviq25 1 72303254 77320849 DEL -0.3330
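The relationship between the two segment formats can be sketched offline by reshaping a pgxseg-style data frame into the seg column order. This is an illustration rather than what pgxLoader does internally: the data frame holds the first two segments from the output above, and treating rows with "." in reference_bases as CNV segments is an assumption based on the printed output.

```r
# First two segments of the pgxseg output shown above (subset of columns)
pgxseg <- data.frame(
  biosample_id    = "pgxbs-kftviq25",
  reference_name  = "1",
  start           = c(1731500, 12849386),
  end             = c(12832655, 57712606),
  log2            = c(-0.4922, -0.4888),
  variant_type    = "DEL",
  reference_bases = ".",
  alternate_bases = ".",
  stringsAsFactors = FALSE
)

# Assumption: CNV segments carry "." in reference_bases, point mutations do not
cnv_only <- pgxseg[pgxseg$reference_bases == ".", ]

# Reorder to the seg layout: variant type in the fifth column, log2 in the sixth
seg <- cnv_only[, c("biosample_id", "reference_name", "start", "end",
                    "variant_type", "log2")]
names(seg)
```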
Setting save_file to TRUE makes pgxLoader save the retrieved variant data to a file instead of returning it directly. The file is written to the current working directory by default, or to another path specified by filename. The export is only available for variant data (type=‘variant’).
The following command creates a ‘.pgxseg’ file named “variants.pgxseg” in the “~/Downloads/” folder.
pgxLoader(type="variant", output="pgxseg", biosample_id=biosample_id, save_file=TRUE,
filename="~/Downloads/variants.pgxseg")
To visualize the ‘.pgxseg’ file, you can either upload it to this link or use the byconaut package for local visualization when dealing with a large number of samples.
The following command creates a special ‘.seg’ file named “variants.seg” in the “~/Downloads/” folder.
pgxLoader(type="variant", output="seg", biosample_id=biosample_id, save_file=TRUE,
filename="~/Downloads/variants.seg")
You can upload this ‘.seg’ file to the IGV tool for visualization.
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pgxRpi_1.0.5 BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.5 xfun_0.48 bslib_0.8.0
#> [4] ggplot2_3.5.1 rstatix_0.7.2 lattice_0.22-6
#> [7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
#> [10] parallel_4.4.1 curl_5.2.3 tibble_3.2.1
#> [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
#> [16] Matrix_1.7-0 data.table_1.16.2 lifecycle_1.0.4
#> [19] compiler_4.4.1 farver_2.1.2 munsell_0.5.1
#> [22] tinytex_0.53 carData_3.0-5 htmltools_0.5.8.1
#> [25] sass_0.4.9 yaml_2.3.10 Formula_1.2-5
#> [28] pillar_1.9.0 car_3.1-3 ggpubr_0.6.0
#> [31] jquerylib_0.1.4 tidyr_1.3.1 cachem_1.1.0
#> [34] survminer_0.4.9 magick_2.8.5 abind_1.4-8
#> [37] parallelly_1.38.0 km.ci_0.5-6 tidyselect_1.2.1
#> [40] digest_0.6.37 dplyr_1.1.4 purrr_1.0.2
#> [43] bookdown_0.40 labeling_0.4.3 splines_4.4.1
#> [46] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
#> [49] cli_3.6.3 magrittr_2.0.3 survival_3.7-0
#> [52] utf8_1.2.4 broom_1.0.7 withr_3.0.1
#> [55] scales_1.3.0 backports_1.5.0 lubridate_1.9.3
#> [58] timechange_0.3.0 rmarkdown_2.28 httr_1.4.7
#> [61] gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-12
#> [64] evaluate_1.0.1 knitr_1.48 KMsurv_0.1-5
#> [67] survMisc_0.5.6 rlang_1.1.4 Rcpp_1.0.13
#> [70] xtable_1.8-4 glue_1.8.0 BiocManager_1.30.25
#> [73] attempt_0.3.1 jsonlite_1.8.9 R6_2.5.1
#> [76] plyr_1.8.9