Progenetix is an open data resource that provides curated individual cancer copy number aberrations (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. Progenetix uses the “.pgxseg” data format to store variant data, which encompasses CNV (Copy Number Variation) and SNV (Single Nucleotide Variant), as well as the metadata of associated samples. This vignette describes how to work with local “.pgxseg” files using this package. For more details about the “.pgxseg” file format, please refer to the the documentation.
library(pgxRpi)
library(GenomicRanges)
pgxSegprocess
functionThis function extracts segments, CNV frequency, and metadata from local “pgxseg” files and supports survival data visualization
file
A string specifying the path and name of the “pgxseg” file where the data is to be read.group_id
A string specifying which id is used for grouping in KM plot or CNV frequency calculation. Default is “group_id”.show_KM_plot
A logical value determining whether to return the Kaplan-Meier plot based on metadata. Default is FALSE.return_metadata
A logical value determining whether to return metadata. Default is FALSE.return_seg
A logical value determining whether to return segment data. Default is FALSE.return_frequency
A logical value determining whether to return CNV frequency data. The frequency calculation is based on segments in segment data and specified group id in metadata. Default is FALSE.assembly
A string specifying which genome assembly version should be applied to CNV frequency calculation and plotting. Allowed options are “hg19”, “hg38”. Default is “hg38”....
Other parameters relevant to KM plot. These include pval
, pval.coord
, pval.method
, conf.int
, linetype
, and palette
(see ggsurvplot function from survminer package)# specify the location of file
file_name <- system.file("extdata", "example.pgxseg",package = 'pgxRpi')
# extract segment data
seg <- pgxSegprocess(file=file_name,return_seg = TRUE)
The segment data looks like this
head(seg)
#> biosample_id reference_name start end log2 variant_type
#> 1 GIST-gun-079 14 17200000 107043718 -1.00 DEL
#> 2 GIST-gun-080 3 0 90900000 -1.00 DEL
#> 3 GIST-gun-080 4 50000000 190214555 -1.00 DEL
#> 4 GIST-gun-080 5 0 48800000 1.32 DUP
#> 5 GIST-gun-080 5 48800000 181538259 -1.00 DEL
#> 6 GIST-gun-080 8 0 45200000 -1.00 DEL
#> reference_bases alternate_bases
#> 1 None None
#> 2 None None
#> 3 None None
#> 4 None None
#> 5 None None
#> 6 None None
meta <- pgxSegprocess(file=file_name,return_metadata = TRUE)
The metadata looks like this
head(meta)
#> biosample_id histological_diagnosis_id histological_diagnosis_label
#> 1 GIST-gun-079 NCIT:C27243 NCIT:C27243
#> 2 GIST-gun-080 NCIT:C27243 NCIT:C27243
#> 3 GIST-gun-081 NCIT:C27243 NCIT:C27243
#> 4 GIST-gun-082 NCIT:C27243 NCIT:C27243
#> 5 GIST-gun-083 NCIT:C27243 NCIT:C27243
#> 6 GIST-gun-084 NCIT:C27243 NCIT:C27243
#> icdo_morphology_id icdo_morphology_label group_id
#> 1 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 2 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 3 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 4 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 5 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> 6 pgx:icdom-89363 Gastrointestinal stromal tumor, malignant pgx:icdot-C16.9
#> icdo_topography_id group_label icdo_topography_label sampled_tissue_id
#> 1 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> 2 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> 3 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> 4 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> 5 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> 6 pgx:icdot-C16.9 stomach stomach UBERON:0000945
#> sampled_tissue_label stage age_iso followup_state_id followup_state_label
#> 1 UBERON:0000945 High P68Y EFO:0030041 alive
#> 2 UBERON:0000945 High P66Y EFO:0030041 alive
#> 3 UBERON:0000945 High P74Y EFO:0030041 alive
#> 4 UBERON:0000945 High P39Y EFO:0030041 alive
#> 5 UBERON:0000945 High P82Y EFO:0030041 alive
#> 6 UBERON:0000945 High P72Y EFO:0030041 alive
#> followup_time(days)
#> 1 335
#> 2 335
#> 3 335
#> 4 335
#> 5 335
#> 6 335
The KM plot is plotted from samples with available followup state and followup time. The default grouping is “group_id” column in metadata.
pgxSegprocess(file=file_name,show_KM_plot = TRUE)
You can try different grouping by group_id
parameter
pgxSegprocess(file=file_name,show_KM_plot = TRUE,group_id = 'histological_diagnosis_id')
You can specify more parameters to modify this plot (see parameter ...
in documentation)
pgxSegprocess(file=file_name,show_KM_plot = TRUE,pval=TRUE,palette='npg')
The CNV frequency is calculated from segments of samples with the same group_id. The group_id
is specified in group_id
parameter.
# Default is "group_id" in metadata
frequency <- pgxSegprocess(file=file_name,return_frequency = TRUE)
# Use different ids for grouping
frequency_2 <- pgxSegprocess(file=file_name,return_frequency = TRUE,
group_id ='icdo_morphology_id')
frequency
#> GRangesList object of length 4:
#> $`pgx:icdot-C16.9`
#> GRanges object with 3106 ranges and 2 metadata columns:
#> seqnames ranges strand | gain_frequency loss_frequency
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] 1 0-400000 * | 0 28.5714
#> [2] 1 400000-1400000 * | 0 28.5714
#> [3] 1 1400000-2400000 * | 0 28.5714
#> [4] 1 2400000-3400000 * | 0 28.5714
#> [5] 1 3400000-4400000 * | 0 28.5714
#> ... ... ... ... . ... ...
#> [3102] Y 52400000-53400000 * | 0 0
#> [3103] Y 53400000-54400000 * | 0 0
#> [3104] Y 54400000-55400000 * | 0 0
#> [3105] Y 55400000-56400000 * | 0 0
#> [3106] Y 56400000-57227415 * | 0 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
#>
#> ...
#> <3 more elements>
The returned object is same as the CNV frequency object with “pgxfreq” format returned by pgxLoader
function (format details see the vignette Introduction_3_loadfrequency). The CNV frequency is calculated from groups which exist in both metadata and segment data. It is noted that not all groups in metadata must exist in segment data (e.g. some samples don’t have CNV calls).
head(frequency[["pgx:icdot-C16.9"]])
#> GRanges object with 6 ranges and 2 metadata columns:
#> seqnames ranges strand | gain_frequency loss_frequency
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] 1 0-400000 * | 0 28.5714
#> [2] 1 400000-1400000 * | 0 28.5714
#> [3] 1 1400000-2400000 * | 0 28.5714
#> [4] 1 2400000-3400000 * | 0 28.5714
#> [5] 1 3400000-4400000 * | 0 28.5714
#> [6] 1 4400000-5400000 * | 0 28.5714
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
The associated metadata in CNV frequency objects looks like this
mcols(frequency)
#> DataFrame with 4 rows and 3 columns
#> group_id group_label sample_count
#> <character> <character> <integer>
#> pgx:icdot-C16.9 pgx:icdot-C16.9 stomach 28
#> pgx:icdot-C49.9 pgx:icdot-C73.9 thyroid gland 6
#> pgx:icdot-C50.9 pgx:icdot-C50.9 breast 2
#> pgx:icdot-C73.9 pgx:icdot-C49.9 connective and soft .. 7
mcols(frequency_2)
#> DataFrame with 3 rows and 3 columns
#> group_id group_label sample_count
#> <character> <character> <integer>
#> pgx:icdom-80753 pgx:icdom-89363 Gastrointestinal str.. 35
#> pgx:icdom-83353 pgx:icdom-83353 Follicular carcinoma 6
#> pgx:icdom-89363 pgx:icdom-80753 Squamous cell carcin.. 2
You can visualize the CNV frequency of the interesting group.
pgxFreqplot(frequency, filters="pgx:icdot-C16.9")
pgxFreqplot(frequency, filters="pgx:icdot-C16.9",chrom = c(1,8,14), layout = c(3,1))
Circos plot supports multiple group visualization
pgxFreqplot(frequency, filters=c("pgx:icdot-C16.9","pgx:icdot-C73.9"),circos = TRUE)
The details of pgxFreqplot
function see the vignette Introduction_3_loadfrequency.
If you want different types of data such as segment data and metadata, and visualize the survival data at the same time, you can just set the corresponding parameters as TRUE. The returned data is an object including all specified data. It is noted that in this case the CNV frequency and KM plot use the same group_id.
info <- pgxSegprocess(file=file_name,show_KM_plot = TRUE, return_seg = TRUE,
return_metadata = TRUE, return_frequency = TRUE)
names(info)
#> [1] "seg" "meta" "frequency"
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [3] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
#> [5] IRanges_2.38.1 S4Vectors_0.42.1
#> [7] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
#> [9] matrixStats_1.4.1 pgxRpi_1.0.5
#> [11] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
#> [4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
#> [7] lifecycle_1.0.4 survival_3.7-0 magrittr_2.0.3
#> [10] compiler_4.4.1 rlang_1.1.4 sass_0.4.9
#> [13] tools_4.4.1 utf8_1.2.4 yaml_2.3.10
#> [16] data.table_1.16.2 knitr_1.48 ggsignif_0.6.4
#> [19] S4Arrays_1.4.1 labeling_0.4.3 curl_5.2.3
#> [22] DelayedArray_0.30.1 plyr_1.8.9 abind_1.4-8
#> [25] withr_3.0.1 purrr_1.0.2 grid_4.4.1
#> [28] fansi_1.0.6 ggpubr_0.6.0 xtable_1.8-4
#> [31] colorspace_2.1-1 ggplot2_3.5.1 scales_1.3.0
#> [34] tinytex_0.53 cli_3.6.3 crayon_1.5.3
#> [37] rmarkdown_2.28 generics_0.1.3 km.ci_0.5-6
#> [40] httr_1.4.7 survminer_0.4.9 cachem_1.1.0
#> [43] zlibbioc_1.50.0 splines_4.4.1 parallel_4.4.1
#> [46] BiocManager_1.30.25 XVector_0.44.0 survMisc_0.5.6
#> [49] vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.9
#> [52] carData_3.0-5 bookdown_0.40 car_3.1-3
#> [55] rstatix_0.7.2 Formula_1.2-5 magick_2.8.5
#> [58] attempt_0.3.1 tidyr_1.3.1 jquerylib_0.1.4
#> [61] glue_1.8.0 parallelly_1.38.0 shape_1.4.6.1
#> [64] lubridate_1.9.3 gtable_0.3.5 UCSC.utils_1.0.0
#> [67] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
#> [70] htmltools_0.5.8.1 circlize_0.4.16 GenomeInfoDbData_1.2.12
#> [73] R6_2.5.1 KMsurv_0.1-5 evaluate_1.0.1
#> [76] lattice_0.22-6 highr_0.11 backports_1.5.0
#> [79] ggsci_3.2.0 broom_1.0.7 bslib_0.8.0
#> [82] Rcpp_1.0.13 SparseArray_1.4.8 gridExtra_2.3
#> [85] xfun_0.48 GlobalOptions_0.1.2 zoo_1.8-12
#> [88] pkgconfig_2.0.3