Contents

Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. Progenetix uses the “.pgxseg” data format to store variant data, which encompasses CNV (Copy Number Variation) and SNV (Single Nucleotide Variant), as well as the metadata of associated samples. This vignette describes how to work with local “.pgxseg” files using this package. For more details about the “.pgxseg” file format, please refer to the the documentation.

1 Load library

library(pgxRpi)
library(GenomicRanges) # for pgxfreq object

1.1 pgxSegprocess function

This function extracts segment variants, CNV frequency, and metadata from local “pgxseg” files. Additionally, it supports survival data visualization if survival data is available within the file.

The parameters of this function used in this tutorial:

  • file: A string specifying the path and name of the “pgxseg” file where the data is to be read.
  • group_id: A string specifying which id is used for grouping in KM plot or CNV frequency calculation. Default is “group_id”.
  • show_KM_plot: A logical value determining whether to return the Kaplan-Meier plot based on metadata. Default is FALSE.
  • return_metadata: A logical value determining whether to return metadata. Default is FALSE.
  • return_seg: A logical value determining whether to return segment data. Default is FALSE.
  • return_frequency: A logical value determining whether to return CNV frequency data. The frequency calculation is based on segments in segment data and specified group id in metadata. Default is FALSE.
  • cnv_column_idx: Index of the column specifying the CNV state used for calculating CNV frequency. The index must be at least 6, with the default set to 6. The CNV states should either contain “DUP” for duplications and “DEL” for deletions, or level-specific CNV states represented using Experimental Factor Ontology (EFO) codes.
  • assembly: A string specifying the genome assembly version to apply to CNV frequency calculation and plotting. Allowed options are “hg19” and “hg38”. Default is “hg38”.
  • ... Other parameters relevant to KM plot. These include pval, pval.coord, pval.method, conf.int, linetype, and palette (see ggsurvplot function from survminer package)

2 Extract segment data

# specify the location of the example file
file_name <- system.file("extdata", "example.pgxseg",package = 'pgxRpi')

# extract segment data
seg <- pgxSegprocess(file=file_name,return_seg = TRUE)

The segment data looks like this

head(seg)
#>   biosample_id reference_name    start       end  log2 variant_type
#> 1 GIST-gun-079             14 17200000 107043718 -1.00          DEL
#> 2 GIST-gun-080              3        0  90900000 -1.00          DEL
#> 3 GIST-gun-080              4 50000000 190214555 -1.00          DEL
#> 4 GIST-gun-080              5        0  48800000  1.32          DUP
#> 5 GIST-gun-080              5 48800000 181538259 -1.00          DEL
#> 6 GIST-gun-080              8        0  45200000 -1.00          DEL
#>   reference_bases alternate_bases variant_state_id variant_state_label
#> 1               .               .      EFO:0030067    copy number loss
#> 2               .               .      EFO:0030067    copy number loss
#> 3               .               .      EFO:0030067    copy number loss
#> 4               .               .      EFO:0030070    copy number gain
#> 5               .               .      EFO:0030067    copy number loss
#> 6               .               .      EFO:0030067    copy number loss

3 Extract metadata

meta <- pgxSegprocess(file=file_name,return_metadata = TRUE)

The metadata looks like this

head(meta)
#>   biosample_id histological_diagnosis_id histological_diagnosis_label
#> 1 GIST-gun-079               NCIT:C27243                  NCIT:C27243
#> 2 GIST-gun-080               NCIT:C27243                  NCIT:C27243
#> 3 GIST-gun-081               NCIT:C27243                  NCIT:C27243
#> 4 GIST-gun-082               NCIT:C27243                  NCIT:C27243
#> 5 GIST-gun-083               NCIT:C27243                  NCIT:C27243
#> 6 GIST-gun-084               NCIT:C27243                  NCIT:C27243
#>   icdo_morphology_id                     icdo_morphology_label        group_id
#> 1    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 2    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 3    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 4    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 5    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 6    pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#>   icdo_topography_id group_label icdo_topography_label sampled_tissue_id
#> 1    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#> 2    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#> 3    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#> 4    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#> 5    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#> 6    pgx:icdot-C16.9     stomach               stomach    UBERON:0000945
#>   sampled_tissue_label stage age_iso followup_state_id followup_state_label
#> 1       UBERON:0000945  High    P68Y       EFO:0030041                alive
#> 2       UBERON:0000945  High    P66Y       EFO:0030041                alive
#> 3       UBERON:0000945  High    P74Y       EFO:0030041                alive
#> 4       UBERON:0000945  High    P39Y       EFO:0030041                alive
#> 5       UBERON:0000945  High    P82Y       EFO:0030041                alive
#> 6       UBERON:0000945  High    P72Y       EFO:0030041                alive
#>   followup_time
#> 1          P11M
#> 2          P59M
#> 3          P13M
#> 4         P109M
#> 5          P47M
#> 6          P62M

4 Visualize survival data in metadata

The KM plot is plotted from samples with available followup state and followup time. The default grouping is “group_id” column in metadata.

pgxSegprocess(file=file_name,show_KM_plot = TRUE)

You can try different grouping by group_id parameter

pgxSegprocess(file=file_name,show_KM_plot = TRUE,group_id = 'histological_diagnosis_id')

You can specify more parameters to modify this plot (see parameter ... in documentation)

pgxSegprocess(file=file_name,show_KM_plot = TRUE,pval=TRUE,palette='npg')

5 Calculate CNV frequency

The CNV frequency is calculated from segments of samples with the same group id. The group id is specified in group_id parameter. More details about CNV frequency see the vignette Introduction_3_loadfrequency.

# Default is "group_id" in metadata
frequency <- pgxSegprocess(file=file_name,return_frequency = TRUE) 
# Use different ids for grouping
frequency_2 <- pgxSegprocess(file=file_name,return_frequency = TRUE, 
                             group_id ='icdo_morphology_id')
frequency
#> GRangesList object of length 4:
#> $`pgx:icdot-C16.9`
#> GRanges object with 3106 ranges and 2 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |              0        28.5714
#>      [2]        1    400000-1400000      * |              0        28.5714
#>      [3]        1   1400000-2400000      * |              0        28.5714
#>      [4]        1   2400000-3400000      * |              0        28.5714
#>      [5]        1   3400000-4400000      * |              0        28.5714
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |              0              0
#>   [3103]        Y 53400000-54400000      * |              0              0
#>   [3104]        Y 54400000-55400000      * |              0              0
#>   [3105]        Y 55400000-56400000      * |              0              0
#>   [3106]        Y 56400000-57227415      * |              0              0
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
#> 
#> ...
#> <3 more elements>

The returned object is same as the CNV frequency object with “pgxfreq” format returned by pgxLoader function. The CNV frequency is calculated from groups which exist in both metadata and segment data. It is noted that not all groups in metadata must exist in segment data (e.g. some samples don’t have CNV calls).

head(frequency[["pgx:icdot-C16.9"]])
#> GRanges object with 6 ranges and 2 metadata columns:
#>       seqnames          ranges strand | gain_frequency loss_frequency
#>          <Rle>       <IRanges>  <Rle> |      <numeric>      <numeric>
#>   [1]        1        0-400000      * |              0        28.5714
#>   [2]        1  400000-1400000      * |              0        28.5714
#>   [3]        1 1400000-2400000      * |              0        28.5714
#>   [4]        1 2400000-3400000      * |              0        28.5714
#>   [5]        1 3400000-4400000      * |              0        28.5714
#>   [6]        1 4400000-5400000      * |              0        28.5714
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

The associated metadata in CNV frequency objects looks like this

mcols(frequency)
#> DataFrame with 4 rows and 3 columns
#>                        group_id            group_label sample_count
#>                     <character>            <character>    <integer>
#> pgx:icdot-C16.9 pgx:icdot-C16.9                stomach           28
#> pgx:icdot-C73.9 pgx:icdot-C73.9          thyroid gland            6
#> pgx:icdot-C50.9 pgx:icdot-C50.9                 breast            2
#> pgx:icdot-C49.9 pgx:icdot-C49.9 connective and soft ..            7
mcols(frequency_2)
#> DataFrame with 3 rows and 3 columns
#>                        group_id            group_label sample_count
#>                     <character>            <character>    <integer>
#> pgx:icdom-89363 pgx:icdom-89363 Gastrointestinal str..           35
#> pgx:icdom-83353 pgx:icdom-83353   Follicular carcinoma            6
#> pgx:icdom-80753 pgx:icdom-80753 Squamous cell carcin..            2

You can visualize the CNV frequency of the interesting group using pgxFreqplot function. For more details on this function, see the vignette Introduction_3_loadfrequency.

pgxFreqplot(frequency, filters="pgx:icdot-C16.9")

pgxFreqplot(frequency, filters="pgx:icdot-C16.9",chrom = c(1,8,14), layout = c(3,1))

pgxFreqplot(frequency, filters=c("pgx:icdot-C16.9","pgx:icdot-C73.9"),circos = TRUE)

6 Extract all data

If you want to extract different types of data, such as segment variants, metadata, CNV frequency and visualize the survival data simultaneously, you can set the corresponding parameters to TRUE. The returned data will be an object that includes all specified data. Note that in this case, the CNV frequency and KM plot will use the same group_id.

info <- pgxSegprocess(file=file_name,show_KM_plot = TRUE, return_seg = TRUE, 
                      return_metadata = TRUE, return_frequency = TRUE)

names(info)
#> [1] "seg"       "meta"      "frequency"

7 Session Info

#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [3] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
#>  [5] IRanges_2.41.0              S4Vectors_0.45.0           
#>  [7] BiocGenerics_0.53.0         MatrixGenerics_1.19.0      
#>  [9] matrixStats_1.4.1           pgxRpi_1.3.0               
#> [11] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] gridExtra_2.3           attempt_0.3.1           rlang_1.1.4            
#>  [4] magrittr_2.0.3          compiler_4.5.0          vctrs_0.6.5            
#>  [7] pkgconfig_2.0.3         shape_1.4.6.1           crayon_1.5.3           
#> [10] fastmap_1.2.0           backports_1.5.0         magick_2.8.5           
#> [13] XVector_0.47.0          labeling_0.4.3          KMsurv_0.1-5           
#> [16] utf8_1.2.4              rmarkdown_2.28          UCSC.utils_1.3.0       
#> [19] tinytex_0.53            purrr_1.0.2             xfun_0.48              
#> [22] zlibbioc_1.53.0         cachem_1.1.0            jsonlite_1.8.9         
#> [25] highr_0.11              DelayedArray_0.33.1     broom_1.0.7            
#> [28] parallel_4.5.0          R6_2.5.1                bslib_0.8.0            
#> [31] parallelly_1.38.0       car_3.1-3               lubridate_1.9.3        
#> [34] jquerylib_0.1.4         Rcpp_1.0.13             bookdown_0.41          
#> [37] knitr_1.48              future.apply_1.11.3     zoo_1.8-12             
#> [40] Matrix_1.7-1            splines_4.5.0           timechange_0.3.0       
#> [43] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
#> [46] codetools_0.2-20        curl_5.2.3              listenv_0.9.1          
#> [49] lattice_0.22-6          tibble_3.2.1            withr_3.0.2            
#> [52] evaluate_1.0.1          future_1.34.0           survival_3.7-0         
#> [55] circlize_0.4.16         survMisc_0.5.6          pillar_1.9.0           
#> [58] BiocManager_1.30.25     ggpubr_0.6.0            carData_3.0-5          
#> [61] generics_0.1.3          ggplot2_3.5.1           munsell_0.5.1          
#> [64] scales_1.3.0            globals_0.16.3          xtable_1.8-4           
#> [67] glue_1.8.0              survminer_0.5.0         tools_4.5.0            
#> [70] data.table_1.16.2       ggsignif_0.6.4          grid_4.5.0             
#> [73] tidyr_1.3.1             colorspace_2.1-1        GenomeInfoDbData_1.2.13
#> [76] Formula_1.2-5           cli_3.6.3               km.ci_0.5-6            
#> [79] fansi_1.0.6             S4Arrays_1.7.1          dplyr_1.1.4            
#> [82] gtable_0.3.6            ggsci_3.2.0             rstatix_0.7.2          
#> [85] sass_0.4.9              digest_0.6.37           SparseArray_1.7.0      
#> [88] farver_2.1.2            htmltools_0.5.8.1       lifecycle_1.0.4        
#> [91] httr_1.4.7              GlobalOptions_0.1.2