Progenetix is an open data resource that provides curated individual cancer copy number aberration (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette describes how to access genomic variant data in the Progenetix database. If your focus is cancer cell lines, you can access data from cancercelllines.org by setting the dataset parameter to “cancercelllines”. That repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and now also includes additional types of genomic mutations.

1 Load library

library(pgxRpi)

1.1 pgxLoader function

This function loads various data from the Progenetix database.

This tutorial uses the following parameters of the function:

  • type A string specifying output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
  • output A string specifying output file format. When the parameter type is “variant”, available options are NULL, “pgxseg”, “pgxmatrix”, “coverage” or “seg”.
  • filters Identifiers for cancer type, literature, cohorts, and age such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
  • individual_id Identifiers used in the Progenetix database for identifying individuals.
  • biosample_id Identifiers used in the Progenetix database for identifying biosamples.
  • codematches A logical value determining whether to exclude samples from child concepts of the specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples encoded exactly by the specified filters are kept. Do not use this parameter when the filters include identifiers unrelated to an ontology, such as PMID and cohort identifiers. Default is FALSE.
  • limit Integer to specify the number of returned CNV coverage profiles for each filter. Default is 0 (return all).
  • skip Integer to specify the number of skipped pages of CNV coverage profiles for each filter, where the page size is given by limit. E.g. if skip = 2 and limit = 500, the first 2*500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
  • save_file A logical value determining whether to save the segment variant data to a file instead of returning it directly. Only used when the parameter type is “variant” and output is “pgxseg” or “seg”. Default is FALSE.
  • filename A string specifying the path and name of the file to be saved. Only used if the parameter save_file is TRUE. Default is “variants.seg/pgxseg” in the current working directory.
  • num_cores Integer to specify the number of cores on your local computer to be used for the variant query. The default is 1.
  • dataset A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.

2 Retrieve CNV coverage of biosamples

2.1 Relevant parameters

type, output, filters, individual_id, biosample_id, codematches, skip, limit, dataset

2.2 Across genomic bins

cnv_matrix <- pgxLoader(type="variant", output="pgxmatrix", filters = "NCIT:C2948")

The data looks like this

print(dim(cnv_matrix))
#> [1]   47 6215
cnv_matrix[c(1:3), c(1:5,6213:6215)]
#>      analysis_id   biosample_id   group_id chr1.000000000.000400000.DUP
#> 1 pgxcs-kftvs0ri pgxbs-kftvh262 NCIT:C2948                            0
#> 2 pgxcs-kftvu9w2 pgxbs-kftvh9fp NCIT:C2948                            0
#> 3 pgxcs-kftvw1kw pgxbs-kftvhf4h NCIT:C8893                            0
#>   chr1.000400000.001400000.DUP chrY.054400000.055400000.DEL
#> 1                            0                            0
#> 2                            0                            1
#> 3                            0                            0
#>   chrY.055400000.056400000.DEL chrY.056400000.057227415.DEL
#> 1                            0                            0
#> 2                            1                            1
#> 3                            0                            0

In this data frame, analysis_id is the identifier of an individual analysis and biosample_id is the identifier of an individual biosample. Note that the number of analysis profiles does not necessarily equal the number of samples, because one biosample_id may correspond to multiple analysis_id values. group_id has the same meaning as the filters parameter. These columns are followed by all “gain status” columns (3106 intervals) and then all “loss status” columns (3106 intervals). The status is given as a coverage value, i.e. the fraction of the binned interval that overlaps with one or more CNVs of the given type (DUP/DEL). For example, if the column chr1.000400000.001400000.DUP is 0.200 in one row, one or more duplication events overlap with 20% of the genomic bin at chromosome 1: 400000-1400000 in the corresponding analysis.
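The coverage value described above can be reproduced offline with a small sketch. The helper bin_coverage is hypothetical and not part of pgxRpi; it assumes simple start/end coordinates and merges overlapping segments so that shared bases are counted only once. It is meant to illustrate the definition, not how Progenetix computes the matrix.

```r
# Hypothetical helper (not part of pgxRpi): fraction of a genomic bin
# covered by one or more CNV segments of a single type (DUP or DEL).
bin_coverage <- function(bin_start, bin_end, seg_start, seg_end) {
  # Clip every segment to the bin boundaries
  s <- pmax(seg_start, bin_start)
  e <- pmin(seg_end, bin_end)
  keep <- e > s
  s <- s[keep]
  e <- e[keep]
  if (length(s) == 0) return(0)
  # Merge overlapping clipped segments so shared bases are counted once
  ord <- order(s)
  s <- s[ord]
  e <- e[ord]
  covered <- 0
  cur_s <- s[1]
  cur_e <- e[1]
  for (i in seq_along(s)[-1]) {
    if (s[i] <= cur_e) {
      cur_e <- max(cur_e, e[i])
    } else {
      covered <- covered + (cur_e - cur_s)
      cur_s <- s[i]
      cur_e <- e[i]
    }
  }
  covered <- covered + (cur_e - cur_s)
  covered / (bin_end - bin_start)
}

# A made-up duplication that overlaps 200 kb of the 1 Mb bin chr1:400000-1400000
cov_dup <- bin_coverage(400000, 1400000, seg_start = 300000, seg_end = 600000)
print(cov_dup)
```

With these made-up coordinates the duplication covers one fifth of the bin, which corresponds to the 0.200 example in the text.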

2.3 Across chromosomes or the whole genome

cnv_coverage <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948")

The result includes CNV coverage across chromosome arms, whole chromosomes, and the whole genome.

names(cnv_coverage)
#> [1] "chrom_arm_coverage"    "whole_chrom_coverage"  "whole_genome_coverage"

The data of CNV coverage across chromosomal arms looks like this

head(cnv_coverage$chrom_arm_coverage)[,c(1:4, 49:52)]
#>                chr1p.dup chr1q.dup chr2p.dup chr2q.dup chr1p.del chr1q.del
#> pgxbs-kftvh262         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvh9fp         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4h         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4i         0     0.000     0.000     0.000     0.000         0
#> pgxbs-kftvhf4k         0     0.000     0.000     0.000     0.225         0
#> pgxbs-kftvhf4m         0     0.979     0.989     0.003     0.000         0
#>                chr2p.del chr2q.del
#> pgxbs-kftvh262         0         0
#> pgxbs-kftvh9fp         0         0
#> pgxbs-kftvhf4h         0         0
#> pgxbs-kftvhf4i         0         0
#> pgxbs-kftvhf4k         0         0
#> pgxbs-kftvhf4m         0         0

The row names are the ids of biosamples from the group NCIT:C2948. There are 96 columns: the first 48 columns give duplication coverage across chromosomal arms, followed by 48 columns of deletion coverage. The CNV coverage data across whole chromosomes is similar and differs only in its columns.

The data of CNV coverage across the whole genome (hg38) looks like this
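Because duplication and deletion columns share one table, it is often convenient to separate them by their column suffix. The sketch below uses a toy data frame with made-up values and sample names that only mimics the column naming shown above; it does not query Progenetix.

```r
# Toy table mimicking the layout of chrom_arm_coverage (values are made up)
arm_cov <- data.frame(
  chr1p.dup = c(0, 0.5), chr1q.dup = c(0.979, 0),
  chr1p.del = c(0.225, 0), chr1q.del = c(0, 0),
  row.names = c("sample-A", "sample-B")
)

# Select duplication and deletion columns by their suffix
dup_cols <- grep("\\.dup$", colnames(arm_cov), value = TRUE)
del_cols <- grep("\\.del$", colnames(arm_cov), value = TRUE)
dup_cov <- arm_cov[, dup_cols, drop = FALSE]
del_cov <- arm_cov[, del_cols, drop = FALSE]

# Number of chromosome arms with any duplication coverage per sample
arms_with_dup <- rowSums(dup_cov > 0)
print(arms_with_dup)
```

The same suffix-based selection should carry over to the real table as long as its columns end in .dup and .del, as in the output above.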

head(cnv_coverage$whole_genome_coverage)
#>                cnvfraction dupfraction delfraction
#> pgxbs-kftvh262       0.080       0.036       0.044
#> pgxbs-kftvh9fp       0.058       0.010       0.048
#> pgxbs-kftvhf4h       0.000       0.000       0.000
#> pgxbs-kftvhf4i       0.000       0.000       0.000
#> pgxbs-kftvhf4k       0.027       0.000       0.027
#> pgxbs-kftvhf4m       0.176       0.159       0.017

The first column is the total called coverage, followed by duplication coverage and deletion coverage.
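As a quick sanity check, the six rows printed above can be rebuilt by hand and inspected. In these rows the total fraction equals the sum of the duplication and deletion fractions; this is an observation about the displayed values only, not a documented guarantee of the resource.

```r
# Values copied from the whole_genome_coverage output shown above
wg <- data.frame(
  cnvfraction = c(0.080, 0.058, 0.000, 0.000, 0.027, 0.176),
  dupfraction = c(0.036, 0.010, 0.000, 0.000, 0.000, 0.159),
  delfraction = c(0.044, 0.048, 0.000, 0.000, 0.027, 0.017),
  row.names = c("pgxbs-kftvh262", "pgxbs-kftvh9fp", "pgxbs-kftvhf4h",
                "pgxbs-kftvhf4i", "pgxbs-kftvhf4k", "pgxbs-kftvhf4m")
)

# Compare the total with the sum of both components (tolerance for rounding)
sums_match <- abs(wg$cnvfraction - (wg$dupfraction + wg$delfraction)) < 1e-3
print(sums_match)

# Biosamples without any called CNV
no_cnv <- rownames(wg)[wg$cnvfraction == 0]
print(no_cnv)
```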

2.4 Use of the codematches parameter

Setting codematches = TRUE excludes profiles whose group_id belongs to child terms of the input filters.

In this case, 21 of the original 47 samples are excluded.

cnv_coverage_2 <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948",
                            codematches = TRUE)

print(dim(cnv_coverage$chrom_arm_coverage))
#> [1] 47 96
print(dim(cnv_coverage_2$chrom_arm_coverage))
#> [1] 26 96
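The effect of codematches can be pictured as an exact match on group_id. The following sketch works on a toy data frame with placeholder biosample ids; it only illustrates the idea and is not how pgxLoader implements the parameter.

```r
# Toy metadata; the biosample ids are placeholders, not real query results
profiles <- data.frame(
  biosample_id = c("bs-1", "bs-2", "bs-3", "bs-4"),
  group_id     = c("NCIT:C2948", "NCIT:C2948", "NCIT:C8893", "NCIT:C2948"),
  stringsAsFactors = FALSE
)
query_filter <- "NCIT:C2948"

# codematches = FALSE: keep everything returned for the filter, child terms included
all_profiles <- profiles

# codematches = TRUE: keep only profiles encoded exactly by the filter
exact_profiles <- profiles[profiles$group_id == query_filter, ]

n_excluded <- nrow(all_profiles) - nrow(exact_profiles)
print(n_excluded)
```

Here the profile with group_id NCIT:C8893 plays the role of a sample encoded by a child term, like the third row of the matrix output in section 2.2.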

2.5 Access a subset of samples

By default, pgxLoader returns all available profiles (limit=0), so the query may take a while when the number of retrieved samples is large. You can use the parameters limit and skip to access a subset of samples.

cnv_matrix_2 <- pgxLoader(type="variant", output="pgxmatrix", 
                          filters = "NCIT:C2948",
                          skip = 0, limit=10)
# the dimension of the subset
print(dim(cnv_matrix_2))
#> [1]   10 6215
# the dimension of the original set
print(dim(cnv_matrix))
#> [1]   47 6215
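The interplay of skip and limit follows the rule given in the parameter list: skip counts pages of size limit. The helper page_positions below is hypothetical and not part of pgxRpi; it only does the arithmetic for the positions of the returned profiles.

```r
# Hypothetical helper (not part of pgxRpi): 1-based positions of the profiles
# returned for a given skip/limit pair, where skip counts pages of size limit.
page_positions <- function(skip, limit) {
  first <- skip * limit + 1
  last  <- (skip + 1) * limit
  c(first = first, last = last)
}

# skip = 2, limit = 500: the first 1000 profiles are skipped
pos <- page_positions(skip = 2, limit = 500)
print(pos)

# Number of pages needed to walk through 47 profiles with limit = 10
n_pages <- ceiling(47 / 10)
print(n_pages)
```

Under this reading, looping skip from 0 to n_pages - 1 with a fixed limit would visit every profile once; the last page may hold fewer than limit profiles.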

2.6 Access by biosample id and individual id

cnv_ind_matrix <- pgxLoader(type="variant", output="pgxmatrix", 
                            biosample_id = "pgxbs-kftva604",
                            individual_id = "pgxind-kftx5g4t")

cnv_ind_cov <- pgxLoader(type="variant", output="coverage", 
                         biosample_id = "pgxbs-kftva604",
                         individual_id = "pgxind-kftx5g4t")

3 Retrieve segment variants

Because of a time-out issue, segment variant data can only be accessed by biosample id, not by filters. To speed up the query, you can set the num_cores parameter for parallel processing.

3.1 Relevant parameters

type, output, biosample_id, save_file, filename, num_cores, dataset

3.2 Get biosample id

The biosample information is also obtained with pgxLoader. For the vignette about metadata queries, see Introduction_1_loadmetadata.

biosamples <- pgxLoader(type="biosample", filters = "PMID:20229506", limit=2)

biosample_id <- biosamples$biosample_id

There are three output formats.

3.3 The first output format (by default)

The default output format extracts variant data from the Beacon v2 response and contains the variant id together with the associated analysis id, biosample id and individual id. The CNV data is represented as a copy number change class following the GA4GH Variation Representation Specification (VRS).

variant_1 <- pgxLoader(type="variant", biosample_id = biosample_id)
head(variant_1)
#>                        variant_id    analysis_id   biosample_id   individual_id
#> 1 pgxvar-66577cf9b44ee5c2598e5148 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 2 pgxvar-66577cf9b44ee5c2598e5149 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 3 pgxvar-66577cf9b44ee5c2598e514a pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 4 pgxvar-66577cf9b44ee5c2598e514b pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 5 pgxvar-66577cf9b44ee5c2598e514c pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 6 pgxvar-66577cf9b44ee5c2598e514d pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#>                           variant variant_copychange
#> 1  1:1731500-12832655:EFO_0030068        EFO:0030068
#> 2 1:12849386-57712606:EFO_0030068        EFO:0030068
#> 3 1:57713043-64335282:EFO_0030068        EFO:0030068
#> 4 1:64338058-68715276:EFO_0030068        EFO:0030068
#> 5 1:68716685-72284670:EFO_0030068        EFO:0030068
#> 6 1:72303254-77320849:EFO_0030068        EFO:0030068
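The variant column packs the position and state into one string. Judging from the output above, the layout appears to be chromosome:start-end:state; the sketch below splits two of the displayed strings on that assumption and would need adjusting if other variant types use a different layout.

```r
# Variant strings copied from the output above
variant <- c("1:1731500-12832655:EFO_0030068",
             "1:12849386-57712606:EFO_0030068")

# Split "chromosome:start-end:state" (assumed layout) into its components
parts <- strsplit(variant, "[:-]")
parsed <- data.frame(
  chr   = vapply(parts, `[`, character(1), 1),
  start = as.integer(vapply(parts, `[`, character(1), 2)),
  end   = as.integer(vapply(parts, `[`, character(1), 3)),
  state = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

# Segment length in bases
parsed$length <- parsed$end - parsed$start
print(parsed)
```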

3.4 The second output format (output = “pgxseg”)

This is the ‘.pgxseg’ file format. It contains segment mean values (in the log2 column), which are equal to log2(copy number of measured sample / copy number of control sample (usually 2)). A few variants are point mutations, represented by the columns reference_bases and alternate_bases.

variant_2 <- pgxLoader(type="variant", biosample_id = biosample_id, output = "pgxseg")
head(variant_2)
#>     biosample_id reference_name    start      end    log2 variant_type
#> 1 pgxbs-kftviq25              1  1731500 12832655 -0.4922          DEL
#> 2 pgxbs-kftviq25              1 12849386 57712606 -0.4888          DEL
#> 3 pgxbs-kftviq25              1 57713043 64335282 -0.4254          DEL
#> 4 pgxbs-kftviq25              1 64338058 68715276 -0.4098          DEL
#> 5 pgxbs-kftviq25              1 68716685 72284670 -0.3219          DEL
#> 6 pgxbs-kftviq25              1 72303254 77320849 -0.3330          DEL
#>   reference_bases alternate_bases variant_state_id        variant_state_label
#> 1               .               .      EFO:0030068 low-level copy number loss
#> 2               .               .      EFO:0030068 low-level copy number loss
#> 3               .               .      EFO:0030068 low-level copy number loss
#> 4               .               .      EFO:0030068 low-level copy number loss
#> 5               .               .      EFO:0030068 low-level copy number loss
#> 6               .               .      EFO:0030068 low-level copy number loss
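The log2 definition above can be inverted to obtain an approximate copy number for each segment. This sketch uses three log2 values from the output and assumes a diploid control (copy number 2), as the text suggests is the usual case.

```r
# log2 values copied from the output above
log2_ratio <- c(-0.4922, -0.4888, -0.4254)

# Assumed diploid control, the usual case according to the text
control_cn <- 2

# Invert log2(sample / control) to estimate the copy number of the sample
sample_cn <- control_cn * 2^log2_ratio
print(round(sample_cn, 2))
```

All three estimates fall below 2, in line with the low-level copy number loss label of these segments.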

3.5 The third output format (output = “seg”)

This format is similar to the general ‘.seg’ file format and is compatible with the IGV tool for visualization. The only difference from the general ‘.seg’ file format is the fifth column: here it gives the variant type, whereas in the general ‘.seg’ file format it gives the number of probes or bins covered by the segment. In addition, point mutation variants are excluded from this file format.

variant_3 <- pgxLoader(type="variant", biosample_id = biosample_id, output = "seg")
head(variant_3)
#>     biosample_id reference_name    start      end variant_type    log2
#> 1 pgxbs-kftviq25              1  1731500 12832655          DEL -0.4922
#> 2 pgxbs-kftviq25              1 12849386 57712606          DEL -0.4888
#> 3 pgxbs-kftviq25              1 57713043 64335282          DEL -0.4254
#> 4 pgxbs-kftviq25              1 64338058 68715276          DEL -0.4098
#> 5 pgxbs-kftviq25              1 68716685 72284670          DEL -0.3219
#> 6 pgxbs-kftviq25              1 72303254 77320849          DEL -0.3330
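The relation between the two segment formats can be sketched on a toy table. The rows, the "SNV" label and the rule that CNV rows carry "." in both base columns are assumptions made for illustration, based on the pgxseg output in section 3.4; this is not the conversion performed by pgxLoader.

```r
# Toy table in the pgxseg column layout; the last row is a made-up point mutation
pgxseg_like <- data.frame(
  biosample_id    = c("bs-1", "bs-1", "bs-1"),
  reference_name  = c("1", "1", "17"),
  start           = c(1731500L, 12849386L, 7670000L),
  end             = c(12832655L, 57712606L, 7670001L),
  log2            = c(-0.4922, -0.4888, NA),
  variant_type    = c("DEL", "DEL", "SNV"),
  reference_bases = c(".", ".", "G"),
  alternate_bases = c(".", ".", "A"),
  stringsAsFactors = FALSE
)

# Assumption: CNV rows carry "." in both base columns, point mutations do not
is_cnv <- pgxseg_like$reference_bases == "." & pgxseg_like$alternate_bases == "."

# Column order of the seg output shown above, with variant type in fifth place
seg_cols <- c("biosample_id", "reference_name", "start", "end",
              "variant_type", "log2")
seg_like <- pgxseg_like[is_cnv, seg_cols]
print(seg_like)
```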

4 Export variants data for visualization

When save_file is set to TRUE, pgxLoader does not return the variant data directly. Instead, it saves the retrieved data to the current working directory by default, or to another path specified by filename. The export is only available for variant data (type=‘variant’).

4.1 Upload ‘.pgxseg’ file to Progenetix website

The following command creates a ‘.pgxseg’ file with the name “variants.pgxseg” in “~/Downloads/” folder.

pgxLoader(type="variant", output="pgxseg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.pgxseg")

To visualize the ‘.pgxseg’ file, you can either upload it to this link or use the byconaut package for local visualization when dealing with a large number of samples.

4.2 Upload ‘.seg’ file to IGV

The following command creates a special ‘.seg’ file with the name “variants.seg” in “~/Downloads/” folder.

pgxLoader(type="variant", output="seg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.seg")

You can upload this ‘.seg’ file to the IGV tool for visualization.

5 Session Info

#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgxRpi_1.0.5     BiocStyle_2.32.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5        xfun_0.48           bslib_0.8.0        
#>  [4] ggplot2_3.5.1       rstatix_0.7.2       lattice_0.22-6     
#>  [7] vctrs_0.6.5         tools_4.4.1         generics_0.1.3     
#> [10] parallel_4.4.1      curl_5.2.3          tibble_3.2.1       
#> [13] fansi_1.0.6         highr_0.11          pkgconfig_2.0.3    
#> [16] Matrix_1.7-0        data.table_1.16.2   lifecycle_1.0.4    
#> [19] compiler_4.4.1      farver_2.1.2        munsell_0.5.1      
#> [22] tinytex_0.53        carData_3.0-5       htmltools_0.5.8.1  
#> [25] sass_0.4.9          yaml_2.3.10         Formula_1.2-5      
#> [28] pillar_1.9.0        car_3.1-3           ggpubr_0.6.0       
#> [31] jquerylib_0.1.4     tidyr_1.3.1         cachem_1.1.0       
#> [34] survminer_0.4.9     magick_2.8.5        abind_1.4-8        
#> [37] parallelly_1.38.0   km.ci_0.5-6         tidyselect_1.2.1   
#> [40] digest_0.6.37       dplyr_1.1.4         purrr_1.0.2        
#> [43] bookdown_0.40       labeling_0.4.3      splines_4.4.1      
#> [46] fastmap_1.2.0       grid_4.4.1          colorspace_2.1-1   
#> [49] cli_3.6.3           magrittr_2.0.3      survival_3.7-0     
#> [52] utf8_1.2.4          broom_1.0.7         withr_3.0.1        
#> [55] scales_1.3.0        backports_1.5.0     lubridate_1.9.3    
#> [58] timechange_0.3.0    rmarkdown_2.28      httr_1.4.7         
#> [61] gridExtra_2.3       ggsignif_0.6.4      zoo_1.8-12         
#> [64] evaluate_1.0.1      knitr_1.48          KMsurv_0.1-5       
#> [67] survMisc_0.5.6      rlang_1.1.4         Rcpp_1.0.13        
#> [70] xtable_1.8-4        glue_1.8.0          BiocManager_1.30.25
#> [73] attempt_0.3.1       jsonlite_1.8.9      R6_2.5.1           
#> [76] plyr_1.8.9