psichomics is an interactive R package for integrative analyses of alternative splicing using data from The Cancer Genome Atlas (TCGA) (containing molecular data associated with 34 tumour types) and from the Genotype-Tissue Expression (GTEx) project (containing data for multiple normal human tissues). The data leveraged from these projects includes clinical information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons.
Install psichomics by typing the following in an R console (the R environment is required):
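## biocLite() has been retired; current Bioconductor releases install packages through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("psichomics")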
After the installation, start the visual interface of the program in your default web browser by typing:
library(psichomics)
psichomics()
The following case study can be read in psichomics’ original article: Saraiva-Agostinho N and Barbosa-Morais NL (2018) “psichomics: graphical application for alternative splicing quantification and analysis”. bioRxiv.
Breast cancer is the cancer type with the highest incidence and mortality in women (Torre et al., 2015) and multiple studies have suggested that transcriptome-wide analyses of alternative splicing changes in breast tumours are able to uncover tumour-specific biomarkers (Tsai et al., 2015; Danan-Gotthold et al., 2015; Anczuków et al., 2015). Given the relevance of early detection of breast cancer to patient survival, we can use psichomics to identify novel tumour stage-I-specific molecular signatures based on differentially spliced events.
The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI (Wang et al., 2008).
To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from multiple sources.
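As a rough illustration of the PSI definition above, the following base R sketch computes PSI from junction read counts. The read counts, the `compute_psi` helper and the `min_reads` threshold are hypothetical and only meant to show the arithmetic; they do not reproduce how psichomics combines the junctions of each event type.

```r
# Hypothetical junction read counts for one exon-skipping event in four samples.
inclusion <- c(s1 = 30, s2 = 5, s3 = 0, s4 = 12)  # reads supporting exon inclusion
exclusion <- c(s1 = 10, s2 = 15, s3 = 3, s4 = 0)  # reads supporting exon skipping

# PSI = inclusion / (inclusion + exclusion).
# Samples with too few reads are set to NA, as their estimate would be unreliable.
# min_reads is an illustrative threshold (assumption), not a documented default.
compute_psi <- function(inclusion, exclusion, min_reads = 10) {
  total <- inclusion + exclusion
  psi <- inclusion / total
  psi[total < min_reads] <- NA
  psi
}

psi <- compute_psi(inclusion, exclusion)
print(psi)
```

Sample `s3` has only three reads in total, so its PSI is reported as missing rather than as 0.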
Start by loading breast cancer data by following these instructions:
Note that there is also an option for Gene expression (normalised by RSEM). However, we recommend loading the raw gene expression data instead, followed by filtering and normalisation as demonstrated afterwards.
Downloading multiple files: Note that multiple files will be requested for download at once. Some web browsers (such as Google Chrome) will ask for your confirmation before allowing such behaviour. In order to proceed, please allow multiple downloads.
Windows limitations: If you are using Windows, note that the downloaded files have very long names that may exceed the Windows maximum path length. A workaround is to manually rename the downloaded files to have shorter names, move all downloaded files to a single folder and load that folder by going to Load user files > Folder input and selecting the newly-created folder in Folder where data is stored.
After the data finish loading (keep an eye on the progress bar at the top-right corner), the on-screen instructions at the right will be replaced by the loaded data, including options to view and save such data.
To filter and normalise gene expression, click the green panel Gene expression filtering and normalisation. Within this panel, click the different grey sections (Gene filtering, Normalisation and Compute CPM and log-transform) to check the settings available for processing gene expression. When you are ready to proceed, click Filter and normalise gene expression.
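To make the three processing steps more concrete, here is a minimal base R sketch of gene filtering, counts-per-million (CPM) scaling and log-transformation. The count matrix, the mean-count threshold of 10 and the pseudocount of 1 are assumptions for illustration; the sketch omits any between-sample normalisation factors and is not the exact procedure run by the panel.

```r
# Hypothetical raw counts: 4 genes (rows) x 3 samples (columns).
counts <- matrix(c(500, 0, 1500, 8000,
                   300, 2, 1200, 8498,
                   900, 1,  100, 8999),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Library size per sample, computed before filtering.
lib_size <- colSums(counts)

# Gene filtering: keep genes with a mean count of at least 10 (illustrative threshold).
keep <- rowMeans(counts) >= 10
filtered <- counts[keep, , drop = FALSE]

# CPM: scale each sample (column) by its library size.
cpm <- t(t(filtered) / lib_size) * 1e6

# Log-transform with a pseudocount of 1 to avoid log2(0) (assumed value).
log_cpm <- log2(cpm + 1)
print(round(log_cpm, 2))
```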
After loading the clinical and alternative splicing junction quantification data from TCGA, quantify alternative splicing by clicking the green panel Alternative splicing quantification.
Custom splicing annotation: Additional alternative splicing annotations can be prepared for psichomics by parsing the annotation from programs like VAST-TOOLS, MISO, SUPPA and rMATS. Note that SUPPA and rMATS are able to create their splicing annotation based on transcript annotation. For more information, read this tutorial.
In order to group data for downstream analyses, look for the navigation bar on the top and click Groups. In the displayed table, confirm that three groups are automatically present based on the available sample types: Metastatic, Primary solid Tumor and Solid Tissue Normal.
Next, to create groups by tumour stage, click in the field Select attribute and type tumor stage. Select the first hit (it should be patient.stage_event.pathologic_stage_tumor_stage) and click Create group.
The table on the right will be updated with the created groups per tumour stage. Next, we will merge tumour stages so as to have only Stage I, II, III and IV. To do so:
Hint: You can shift-click to select multiple groups at once.
Do the same for tumour stages II, III and IV with their respective subgroups (ignore stage X samples as they are uncharacterised tumour samples). We also recommend removing groups that are of no interest by selecting them and clicking Remove. In the end, you should have a table similar to the one below.
Changing group colours: The colours defined for each group will be used to represent those same groups in the plots throughout psichomics. To change the colour of a given group, select that group and, next to the rename field, change its associated colour (by clicking on the colour field and picking a new colour or by inputting a HEX code) and click Set colour.
The created groups can then be saved in a text file and loaded in a future session. To do so, in the toolbar below the table click the folder icon (right next to More) and select Save elements from all groups.
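The merging of tumour substages into Stage I, II, III and IV described above can also be pictured in plain R. The patient identifiers and stage labels below are hypothetical (modelled on TCGA-style labels such as "stage iia"), and the sketch only illustrates the grouping logic, not the functions used by the interface.

```r
# Hypothetical stage labels, one per patient.
stage <- c(p1 = "stage i", p2 = "stage ia", p3 = "stage iib", p4 = "stage iii",
           p5 = "stage iiic", p6 = "stage iv", p7 = "stage x", p8 = "stage ib")

# Strip the substage letter (a, b or c), so that "stage ia" becomes "stage i".
main_stage <- sub("^(stage (?:iv|i{1,3}))[abc]?$", "\\1", stage, perl = TRUE)

# Stage X samples are uncharacterised, so they are left out of every group.
main_stage[main_stage == "stage x"] <- NA

# One group of patient identifiers per merged stage (NA values are dropped by split).
groups <- split(names(main_stage), main_stage)
print(groups)
```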
PCA is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data (Ringnér, 2008). To analyse principal components, click on the Analyses tab located in the navigation menu at the top and select Principal component analysis (PCA).
To perform PCA on alternative splicing data using all samples:
As PCA cannot be performed on data with missing values, missing values need to be either removed (thus discarding data from whole splicing events or genes) or imputed (i.e. replacing missing values with the median of the non-missing ones). This input allows you to select the number of missing values that are tolerable per event (i.e. if a splicing event or gene has fewer than N missing values, those missing values will be imputed; otherwise, the event is discarded from PCA).
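The handling of missing values described above can be sketched with base R and `prcomp`. The PSI matrix is simulated, `max_missing` stands in for N, and the centring and scaling settings are assumptions chosen for illustration rather than the application's defaults.

```r
set.seed(1)
# Hypothetical PSI matrix: 6 events (rows) x 8 samples (columns).
psi <- matrix(runif(48), nrow = 6,
              dimnames = list(paste0("event", 1:6), paste0("sample", 1:8)))
psi["event1", 1:2] <- NA  # 2 missing values: will be imputed
psi["event2", 1:5] <- NA  # 5 missing values: will be discarded

max_missing <- 3  # illustrative value of N

# Keep events with fewer than N missing values.
n_missing <- rowSums(is.na(psi))
kept <- psi[n_missing < max_missing, , drop = FALSE]

# Replace each remaining missing value with the median of that event's observed values.
imputed <- t(apply(kept, 1, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}))

# prcomp expects observations (samples) in rows, hence the transpose.
pca <- prcomp(t(imputed), center = TRUE, scale. = FALSE)

# Percentage of variance explained by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2) * 100
print(round(explained, 1))
```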
After PCA is performed, the Plot PCA panel will automatically open. Note that the explained variance of each principal component (PC) is shown next to the respective component and that PC1 explains most of the data variance, followed by PC2, then PC3, then PC4, etc. The variance plot is also available to compare the explained variance across principal components (by clicking Show variance plot). Now:
For performance reasons, only the Top 100 variables that most contribute to the selected principal components are plotted by default.
Two PCA plots are then rendered. The plot above is a score plot that shows the clinical samples, while the loadings plot below displays the variables (in this case, alternative splicing events). The table below the loadings plot depicts the contribution of each variable to each PC.
Hint: Like most plots in psichomics, PCA plots can be zoomed in by clicking and dragging within the plot (click Reset zoom to zoom out). To toggle the visibility of a data series represented in the plot, click its name in the plot legend.
To perform PCA on alternative splicing data using only Tumour Stage I and Normal samples, click on the blue Perform PCA panel and select to Perform PCA on… Samples from selected groups. In the field that appears, select Tumour Stage 1 and Solid Tissue Normal. Afterwards, click Perform PCA and follow the same steps as before to plot it.
Now, click on one of the events that most contribute to the separation between tumour stage I and normal samples (one of the events with extreme values for PC1, i.e. the X axis). The differential splicing analysis for that splicing event across the selected groups is then shown.
To perform PCA using gene expression (both using all samples and only Tumour Stage I and Normal samples), go back to principal component analysis, click in the Perform PCA panel, change Data to perform PCA on to Gene expression (normalised) and follow previous instructions.
One of the splicing events that contribute most to the separation between tumour stage I and normal samples is NUMB exon 12 inclusion. The NUMB protein is crucial for cell differentiation as a key regulator of the Notch pathway. The RNA-binding protein QKI has been shown to repress NUMB exon 12 inclusion in lung cancer cells by competing with core splicing factor SF1 for binding to the branch-point sequence, thereby repressing the Notch signalling pathway, which results in decreased cancer cell proliferation (Zong et al., 2014).
The identifier for NUMB exon 12 inclusion in psichomics is SE 14 - 73749067 73746132 73745989 73744001 NUMB. On the top right corner, the selected alternative splicing event can be altered by clicking Change…. Click Change… and input the given identifier.
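The structure of such an identifier can be inspected with a few lines of base R. The field names below are an interpretation (event type, chromosome, strand, four genomic coordinates, gene symbol), and reading the second and third coordinates as the boundaries of the alternative exon is an assumption made for this sketch.

```r
event_id <- "SE 14 - 73749067 73746132 73745989 73744001 NUMB"

# Split the identifier on spaces.
fields <- strsplit(event_id, " ", fixed = TRUE)[[1]]

# Field names are illustrative labels, not official psichomics terminology.
event <- list(
  type        = fields[1],                 # SE: skipped exon
  chromosome  = fields[2],
  strand      = fields[3],
  coordinates = as.integer(fields[4:7]),
  gene        = fields[8]
)

# Assumption: coordinates 2 and 3 delimit the alternative exon.
exon_length <- abs(diff(event$coordinates[2:3])) + 1
print(event)
print(exon_length)
```

Note that the coordinates decrease along the identifier because the event lies on the minus strand.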
To check whether there is a significant difference in NUMB exon 12 inclusion between tumour and normal TCGA breast samples, go to Analyses > Individual alternative splicing event and click Samples by selected groups. In the group input element, insert the groups Solid Tissue Normal and Primary solid Tumor. Finally, click Perform analyses.
Consistent with the cited article, NUMB exon 12 inclusion is significantly increased in cancer.
Also of interest:
To check if NUMB exon 12 inclusion is correlated with QKI expression, go to Analyses > Correlation of gene expression and alternative splicing and perform the following:
According to the obtained results and also consistent with the previous article, the inclusion of the exon is negatively correlated with QKI expression.
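For readers who want to see what such a correlation test involves, here is a minimal sketch using `cor.test` from base R. The data are simulated (they are not TCGA values) and the choice of Spearman's rank correlation is an assumption; the variables `gene_expr` and `psi` are hypothetical stand-ins for QKI expression and NUMB exon 12 inclusion.

```r
set.seed(42)
# Simulated data for 30 samples: PSI decreases as expression of the regulator increases.
gene_expr <- rnorm(30, mean = 8, sd = 1)                  # log2-normalised expression
psi <- plogis(4 - 0.5 * gene_expr + rnorm(30, sd = 0.3))  # PSI values between 0 and 1

# Rank-based correlation between gene expression and exon inclusion.
res <- cor.test(gene_expr, psi, method = "spearman")
print(res$estimate)  # negative rho: higher expression, lower inclusion
print(res$p.value)
```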
To analyse differential splicing, click on the Analyses tab located in the navigation menu at the top and select Exploratory (multiple splicing events). Next:
When the analyses complete, the results are shown in a plot and in a filterable and sortable table.
Filter events in both the plot and the table by a considerable difference in median between the selected groups (|Δ Median PSI| > 0.1):
Next, filter statistically significant splicing events (Wilcoxon q-value ≤ 0.01):
The table below is filtered according to the highlighted events shown in the plot. If you zoom in on the plot (by clicking and dragging), the table will be filtered according to the highlighted events in the zoomed area only (reset the zoom to show all highlighted events again). If no events are highlighted, the table presents all events currently shown in the plot.
The table itself is also filterable and sortable. For instance, to sort the table by the difference in variance, click once on Δ Variance. Note that horizontal scrolling is required to visualise all available columns.
The table also provides a column with a density plot of the distribution of the alternative splicing quantification for each event. By clicking on the density plot (or its respective event identifier), a page dedicated to that alternative splicing event’s statistics and exhibiting the density plot in greater detail will show up.
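The two filters applied above (|Δ Median PSI| > 0.1 and Wilcoxon q-value ≤ 0.01) can be reproduced on simulated data with base R. In this sketch the PSI matrix is hypothetical, five events are deliberately shifted in the tumour group, and the Benjamini-Hochberg method is assumed for computing q-values.

```r
set.seed(7)
# Simulated PSI values: 50 events (rows) x 40 samples (20 per group).
groups <- rep(c("Normal", "Tumour Stage 1"), each = 20)
is_tumour <- groups == "Tumour Stage 1"
psi <- matrix(runif(50 * 40, 0.45, 0.55), nrow = 50,
              dimnames = list(paste0("event", 1:50), NULL))

# Events 1 to 5 are made differentially spliced on purpose.
psi[1:5, is_tumour] <- psi[1:5, is_tumour] + 0.3

# Difference in median PSI and Wilcoxon rank-sum test per event.
delta_median <- apply(psi, 1, function(x) median(x[is_tumour]) - median(x[!is_tumour]))
p_value <- apply(psi, 1, function(x) wilcox.test(x[is_tumour], x[!is_tumour])$p.value)

# Multiple testing correction (Benjamini-Hochberg assumed).
q_value <- p.adjust(p_value, method = "BH")

results <- data.frame(delta_median, p_value, q_value, row.names = rownames(psi))
hits <- results[abs(results$delta_median) > 0.1 & results$q_value <= 0.01, ]
print(hits)
```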
To study the impact of alternative splicing events on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by the optimal PSI cutoff for a given alternative splicing event that maximises the significance of group differences in survival analysis (i.e. minimises the p-value of the log-rank tests of difference in survival between individuals whose samples have their PSI below and above that threshold).
Given the slow process of calculating the optimal splicing quantification cutoff for multiple events, it is recommended to perform this after filtering the table for differentially spliced events supported by statistical significance.
Kaplan-Meier plots will appear in the table. Click on the plotted curves to automatically go to the Survival analyses tab, where you can manually adjust the alternative splicing quantification cutoff.
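The idea of an optimal PSI cutoff can be illustrated with the `survival` package. The sketch below uses simulated patients and a simple grid search over candidate cutoffs, which is a stand-in chosen for clarity and not necessarily the optimisation strategy used by psichomics; the minimum group size of 10 is also an assumption.

```r
library(survival)

set.seed(3)
n <- 120
psi <- runif(n)  # hypothetical PSI value per patient

# Simulated follow-up: patients with PSI >= 0.4 tend to survive longer.
time <- rexp(n, rate = ifelse(psi >= 0.4, 0.05, 0.25))
event <- rbinom(n, 1, 0.8)  # 1 = death observed, 0 = censored

# Log-rank p-value for patients split at a given PSI cutoff.
logrank_p <- function(cutoff) {
  group <- psi >= cutoff
  # Skip cutoffs that leave too few patients in either group.
  if (sum(group) < 10 || sum(!group) < 10) return(NA_real_)
  fit <- survdiff(Surv(time, event) ~ group)
  pchisq(fit$chisq, df = 1, lower.tail = FALSE)
}

# Grid search: the optimal cutoff is the one with the smallest p-value.
cutoffs <- seq(0.1, 0.9, by = 0.05)
p_values <- vapply(cutoffs, logrank_p, numeric(1))
best <- cutoffs[which.min(p_values)]
print(best)
print(min(p_values, na.rm = TRUE))
```

Because the cutoff is chosen to minimise the p-value, that p-value is optimistic and is best treated as exploratory.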
Detected alterations in alternative splicing may simply be a reflection of changes in gene expression levels. Therefore, to disentangle these two effects, differential expression analysis between tumour stage I and normal samples should also be performed. To do so, click on the Analyses tab located in the navigation menu at the top and select Exploratory (multiple genes). Next:
You can further filter the analyses as previously mentioned for differential splicing analyses.
One splicing event with prognostic value is the alternative splicing of UHRF2 exon 10. Cell-cycle regulator UHRF2 promotes cell proliferation and inhibits the expression of tumour suppressors in breast cancer (Wu et al., 2012).
The identifier for UHRF2 exon 10 inclusion in psichomics is SE 9 + 6486925 6492303 6492401 6493826 UHRF2. On the top right corner, the selected alternative splicing event can be altered by clicking Change…. Click Change… and input the given identifier.
In order to test for a significant difference in UHRF2 exon 10 inclusion between tumour stage I and normal samples, go to Analyses > Individual alternative splicing event and click Samples by selected groups. In the group input element, insert the groups Solid Tissue Normal and Primary solid Tumor. Finally, click Perform analyses.
Higher inclusion of UHRF2 exon 10 is associated with normal samples.
To study the impact of alternative splicing events on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by a given PSI cutoff for a given alternative splicing event. The optimal PSI cutoff maximises the significance of group differences in survival analysis (i.e. minimises the p-value of the log-rank tests of difference in survival between individuals whose samples have a PSI below and above that threshold).
To perform survival analysis on a specific event, go to Analyses > Survival analysis.
As per the results, higher inclusion of UHRF2 exon 10 (PSI ≥ 0.09) is associated with better prognosis.
To check whether alternative splicing changes are related to gene expression alterations, let us perform differential expression analysis on UHRF2. Go to Analyses > Individual gene.
It seems UHRF2 is differentially expressed between Tumour Stage I and Solid Tissue Normal. However, going back to exploratory differential gene expression (Analyses > Exploratory (multiple genes)) and looking for UHRF2 (use the Search field above the table), UHRF2 has a |log2(fold-change)| ≤ 1, i.e. its expression changes less than twofold. Following this criterion, the difference in gene expression between these conditions may not be considered biologically relevant.
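The distinction between statistical significance and effect size can be shown with a small worked example. The expression values below are hypothetical log2-normalised values for a single gene, and the t-test is used only for illustration.

```r
# Hypothetical log2-normalised expression of one gene in two groups.
normal <- c(7.9, 8.1, 8.0, 8.2, 7.8)
tumour <- c(8.5, 8.7, 8.4, 8.6, 8.8)

# On the log2 scale, the log2 fold-change is a difference of group means.
log2_fc <- mean(tumour) - mean(normal)
fold_change <- 2^log2_fc

# The difference is statistically significant...
p_value <- t.test(tumour, normal)$p.value

# ...but the change is less than twofold, i.e. |log2(fold-change)| <= 1.
print(c(log2_fc = log2_fc, fold_change = fold_change, p_value = p_value))
print(abs(log2_fc) <= 1)
```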
To confirm if gene expression has an overall prognostic value, go to Analyses > Survival analysis and perform the following:
There seems to be no significant difference in survival between patient groups stratified by UHRF2’s optimal gene expression cutoff in tumour samples (log-rank p-value = 0.279).
If an event is differentially spliced and has an impact on patient survival, its association with the studied disease might already be described in the literature. To check this, go to Analyses > Gene, transcript and protein information, where information regarding the associated gene (such as description and genomic position), its transcripts and protein domain annotation is available.
Higher inclusion of UHRF2 exon 10 is associated with normal samples and better prognosis, and potentially disrupts UHRF2’s SRA-YDG protein domain, related to the binding affinity to epigenetic marks. Hence, exon 10 inclusion may suppress UHRF2’s oncogenic role in breast cancer by impairing its activity through the induction of a truncated protein or a non-coding isoform. Moreover, this hypothesis is independent from gene expression changes, as UHRF2 is not differentially expressed between tumour stage I and normal samples (|log2(fold-change)| < 1) and there is no significant difference in survival between patient groups stratified by its expression in tumour samples (log-rank p-value = 0.279).
All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to:
Nuno Saraiva-Agostinho (nunoagostinho@medicina.ulisboa.pt)
Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)
Anczuków,O. et al. (2015) SRSF1-Regulated Alternative Splicing in Breast Cancer. Molecular Cell, 60, 105–117.
Danan-Gotthold,M. et al. (2015) Identification of recurrent regulated alternative splicing events across human solid tumors. Nucleic Acids Research, 43, 5130–5144.
Ringnér,M. (2008) What is principal component analysis? Nature biotechnology, 26, 303–304.
Torre,L.A. et al. (2015) Global cancer statistics, 2012. CA: a cancer journal for clinicians, 65, 87–108.
Tsai,Y.S. et al. (2015) Transcriptome-wide identification and study of cancer-specific splicing events across multiple tumors. Oncotarget, 6, 6825–6839.
Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470–476.
Wu,J. et al. (2012) Identification and functional analysis of 9p24 amplified genes in human breast cancer. Oncogene, 31, 333–341.
Zong,F.-Y. et al. (2014) The RNA-binding protein QKI suppresses cancer-associated aberrant splicing. PLoS genetics, 10, e1004289.