CancerSubtypes is a package for cancer subtype analysis that includes various functions from dataset processing to result validation. In CancerSubtypes package, we provide a unified framework for analysing cancer subtpes from raw data to result visualisation. The main functions include genomic data pre-processing, cancer subtypes identification, results validation, visualization and comparison. CancerSubtypes provides the common data imputation and normalization methods for the genomic data pre-processing. Meanwhile, there are four feature selection methods to screen the key features in genomic dataset. The common cancer subtypes identification methods are integrated in this package such as Consensus clustering (CC) [From R package ConsensusClusterPlus], Consensus Nonnegative matrix factorization (CNMF) [From R package NMF], Integrative clustering (iCluster)[From R package iCluster], Similarity Network Fusion (SNF) [From R package SNFtool], Combined SNF and CC (SNF.CC) and Weighted Similarity Network Fusion (WSNF). We implement these cancer subtypes identification methods in a unified input and output data format. The process of analysing cancer subtypes can be easily conducted in a standard workflow. CancerSubtypes provides the most useful feature selection methods and subtypes validation method and helps users to focus on their cancer genomic data and the results from different methods can be compared and evaluation in visualization way easily.
For the basic data processing, CancerSubtypes provides the methods for data distribution check, imputation and normalization and feature selection. There are four feature selection methods (Variance-Var, Median Absolute Deviation-MAD, COX model, Principal Component Analysis-PCA) in CancerSubtypes package. All the data processing methods possess the same input and output data format.
### Prepare a TCGA gene expression dataset for analysis.
library(CancerSubtypes)
library("RTCGA.mRNA")
rm(list = ls())
data(BRCA.mRNA)
mRNA=t(as.matrix(BRCA.mRNA[,-1]))
colnames(mRNA)=BRCA.mRNA[,1]
###To observe the mean, variance and Median Absolute Deviation distribution of the dataset, it helps users to get the distribution characteristics of the data, e.g. To evaluate whether the dataset fits a normal distribution or not.
data.checkDistribution(mRNA)
The raw genomic dataset always contains missing observations, especially in microarray gene expression data. It is not wise to remove all the features with missing observations in very few samples because the useful information will be discarded. The common method is to impute the proper value for the missing observations. CancerSubtypes integrates three common imputation methods for genomic datasets.
index=which(is.na(mRNA))
res1=data.imputation(mRNA,fun="median")
res2=data.imputation(mRNA,fun="mean")
res3=data.imputation(mRNA,fun="microarray")
result1=data.normalization(mRNA,type="feature_Median",log2=FALSE)
result2=data.normalization(mRNA,type="feature_zscore",log2=FALSE)
###The top 1000 most variance features will be selected.
data1=FSbyVar(mRNA, cut.type="topk",value=1000)
###The features with (variance>0.5) are selected.
data2=FSbyVar(mRNA, cut.type="cutoff",value=0.5)
data1=FSbyMAD(mRNA, cut.type="topk",value=1000)
data2=FSbyMAD(mRNA, cut.type="cutoff",value=0.5)
mRNA1=data.imputation(mRNA,fun="microarray")
data1=FSbyPCA(mRNA1, PC_percent=0.9,scale = TRUE)
data(GeneExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
Consensus clustering (CC, 2003) as an unsupervised subtypes discovery method, was a frequently used and valuable approach in many genomic studies and have lots of successful applications.
### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCC(clusterNum=3,d=GeneExp,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")
### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCC(clusterNum=3,d=GBM,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")
Non-negative matrix factorization (CNMF, 2004), as an effective dimension reduction method, was used in distinguishing molecular patterns for high-dimensional genomic data and provided a powerful method for class discovery. We apply the NMF package to execute the non-negative matrix factorization for the cancer genomic dataset. So this method allows users to input the number of core-CPUs for parallel processing.
### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCNMF(GeneExp,clusterNum=3,nrun=30)
### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCNMF(GBM,clusterNum=3,nrun=30)
Integrative clustering (iCluster, 2009) used a joint latent variable model for integrative clustering for multiple types of omics data.
data(GeneExp)
data(miRNAExp)
data1=FSbyVar(GeneExp, cut.type="topk",value=1000)
data2=FSbyVar(miRNAExp, cut.type="topk",value=300)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteiCluster(datasets=GBM, k=3, lambda=list(0.44,0.33,0.28))
Similarity network fusion (SNF, 2014) is a computational method on fusion similarity network for aggregating multi-omics data.
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20)
We propose to combine the SNF and CC together to generate a new cancer subtypes identification method.
data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,maxK = 10, pItem = 0.8,reps=500,
title = "GBM", plot = "png", finalLinkage ="average")
WSNF is a caner subtype identificaton method with the assistance of the gene regulatory network information. It makes use of the miRNA-TF-mRNA regulatory network to take the importance of the features into consideration.
data(GeneExp)
data(miRNAExp)
data(Ranking)
####Retrieve there feature ranking for genes
gene_Name=rownames(GeneExp)
index1=match(gene_Name,Ranking$mRNA_TF_miRNA.v21_SYMBOL)
gene_ranking=data.frame(gene_Name,Ranking[index1,],stringsAsFactors=FALSE)
index2=which(is.na(gene_ranking$ranking_default))
gene_ranking$ranking_default[index2]=min(gene_ranking$ranking_default,na.rm =TRUE)
####Retrieve there feature ranking for genes
miRNA_ID=rownames(miRNAExp)
index3=match(miRNA_ID,Ranking$mRNA_TF_miRNA_ID)
miRNA_ranking=data.frame(miRNA_ID,Ranking[index3,],stringsAsFactors=FALSE)
index4=which(is.na(miRNA_ranking$ranking_default))
miRNA_ranking$ranking_default[index4]=min(miRNA_ranking$ranking_default,na.rm =TRUE)
###Clustering
ranking1=list(gene_ranking$ranking_default ,miRNA_ranking$ranking_default)
GBM=list(GeneExp,miRNAExp)
result=ExecuteWSNF(datasets=GBM, feature_ranking=ranking1, beta = 0.8, clusterNum=3,
K = 20,alpha = 0.5, t = 20, plot = TRUE)
The identified cancer subtypes by the computational methods should be in accordance with biological meanings and reveal the distinct molecular patterns.
Silhouette width is used to measure how similar a sample is matched to its identified subtype compared to other subtypes, a high value indicates that the sample is well matched. Each horizontal line represents a sample in the Silhouette plot. The length of the line is the silhouette width the sample has.
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
## Warning in SNFtool::SNF(W_temp, K = K, t = t): Dim names not consistent across all matrices in Wall.
## Returned matrix will have no dim names.
###Similarity smaple matrix
sil=silhouette_SimilarityMatrix(result$group, result$distanceMatrix)
plot(sil)
Note: If the input matrix is a dissimilarity matrix between samples, please use the silhouette() in cluster package to compute the silhouette width, otherwise a wrong result will be generated.
sil1=silhouette(result$group, result$distanceMatrix)
plot(sil1) ##wrong result
All the samples have the negative silhouette width.
Survival analysis is used to judge the different survival patterns between subtypes.
data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)
#### 1.ExecuteSNF
result1=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
## Warning in SNFtool::SNF(W_temp, K = K, t = t): Dim names not consistent across all matrices in Wall.
## Returned matrix will have no dim names.
group1=result1$group
distanceMatrix1=result1$distanceMatrix
p_value=survAnalysis(mainTitle="GBM1",time,status,group1,
distanceMatrix1,similarity=TRUE)
##
## *****************************************************
## GBM1 Cluster= 3 Call:
## survdiff(formula = Surv(time, status) ~ group)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 46 44 41.3 0.173 0.311
## group=2 40 39 27.3 4.990 7.387
## group=3 14 13 27.4 7.529 11.661
##
## Chisq= 14.2 on 2 degrees of freedom, p= 8e-04
This is a combination figure with three parts: Survival curves, heatmap of the sample similarity matrix and Silhouette width plots for the identified cancer subtypes. The samples in all of the plots have be reorganized by the identified caner subtypes. This kind of figure provides the visible results that can be easily evaluated.
#### 2.ExecuteSNF.CC
result2=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,
maxK = 5, pItem = 0.8,reps=500,
title = "GBM2", plot = "png",
finalLinkage ="average")
## Warning in SNFtool::SNF(W_temp, K = K, t = t): Dim names not consistent across all matrices in Wall.
## Returned matrix will have no dim names.
group2=result2$group
distanceMatrix2=result2$distanceMatrix
p_value=survAnalysis(mainTitle="GBM2",time,status,group2,
distanceMatrix2,similarity=TRUE)
##
## *****************************************************
## GBM2 Cluster= 3 Call:
## survdiff(formula = Surv(time, status) ~ group)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 40 39 27.3 4.990 7.387
## group=2 46 44 41.3 0.173 0.311
## group=3 14 13 27.4 7.529 11.661
##
## Chisq= 14.2 on 2 degrees of freedom, p= 8e-04
Statistical significance of clustering is a pure statistical approach to test the significance difference data distribution between subtypes. Different expression is to test the expression difference between each subtypes and a reference group (always a set of normal samples).
data(GeneExp)
data(miRNAExp)
data(time)
data(status)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
group=result$group
sigclust=sigclustTest(miRNAExp,group, nsim=1000, nrep=1, icovest=1)
sigclust
## Subtype 1 Subtype 2 Subtype 3
## Subtype 1 1.000 0.018 0.000
## Subtype 2 0.018 1.000 0.221
## Subtype 3 0.000 0.221 1.000
The SigClust summary plots show the distributions of cluster index (CI). The blue points, representing the simulated CIs, are plotted with random vertical jitter for better visualization. The solid and dotted lines correspond to the estimated nonparametric density and Gaussian density fit to the simulated CIs. The p-value show the significant level between the two subtypes. Please refer to [5] to get the more information about sigclust.
Differential expression analysis is to test the expression difference between each subtypes and a reference group (always a set of normal samples). Here we apply limma package to conduct the different expression analysis between each subtypes and normal samples.
library("RTCGA.mRNA")
#require(TCGAbiolinks)
rm(list = ls())
data(BRCA.mRNA)
mRNA=t(as.matrix(BRCA.mRNA[,-1]))
colnames(mRNA)=BRCA.mRNA[,1]
mRNA1=data.imputation(mRNA,fun="microarray")
mRNA1=FSbyMAD(mRNA1, cut.type="topk",value=5000)
###Split the normal and tumor samples
index=which(as.numeric(substr(colnames(mRNA1),14,15))>9)
mRNA_normal=mRNA1[,index]
mRNA_tumor=mRNA1[,-index]
### Remove the duplicate samples
index1=which(as.numeric(substr(colnames(mRNA_tumor),14,15))>1)
mRNA_tumor=mRNA_tumor[,-index1]
##### Identify cancer subtypes
result=ExecuteCC(clusterNum=5,d=mRNA_tumor,maxK=5,clusterAlg="hc",distance="pearson",title="BRCA")
group=result$group
res=DiffExp.limma(Tumor_Data=mRNA_tumor,Normal_Data=mRNA_normal,group=group,topk=NULL,RNAseq=FALSE)
## Differently expression genes in subtype 1
head(res[[1]])
## ID logFC AveExpr t P.Value adj.P.Val
## 2439 COL10A1 4.712350 5.0043839 38.57960 3.234769e-151 1.617385e-147
## 3725 MMP11 3.528733 2.6563911 34.61589 3.206291e-134 8.015728e-131
## 3797 CA4 -4.357770 -1.0831256 -27.32458 7.230942e-101 1.205157e-97
## 3235 CAV1 -3.006418 -1.5612746 -26.90504 7.051163e-99 8.813954e-96
## 4586 CXCL3 -2.499337 -2.9294699 -26.24577 9.675861e-96 9.675861e-93
## 4133 RYR3 -2.575299 -0.4590268 -25.94906 2.524404e-94 2.103670e-91
## B
## 2439 334.8796
## 3725 295.9708
## 3797 219.5287
## 3235 214.9674
## 4586 207.7722
## 4133 204.5236
## Differently expression genes in subtype 2
head(res[[2]])
## ID logFC AveExpr t P.Value adj.P.Val B
## 637 ITM2A -6.173926 0.8088468 -12.356546 1.091109e-18 5.455546e-15 31.11131
## 4942 CKAP2 4.229372 -2.1539566 9.959475 1.093416e-14 2.267182e-11 22.64778
## 4596 RCBTB2 -4.492770 -0.2484435 -9.904508 1.360309e-14 2.267182e-11 22.44505
## 3297 GPR19 4.272492 -2.2195806 9.791601 2.132269e-14 2.665337e-11 22.02758
## 4829 CASP4 -3.767096 0.7556694 -9.613507 4.342293e-14 4.342293e-11 21.36631
## 2065 MCM4 5.079992 -3.2200565 9.518389 6.355510e-14 5.296258e-11 21.01180
The CancerSubtypes R package provides a suite of cancer subtypes analysis tools and embeds the analysis in a standardized workflow. It provides a powerful way to analyze cancer subtype on genome-wide scale.
[1] Monti, Stefano, et al. “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.” Machine learning 52.1-2 (2003): 91-118.
[2] Brunet, Jean-Philippe, et al. “Metagenes and molecular pattern discovery using matrix factorization.” Proceedings of the national academy of sciences 101.12 (2004): 4164-4169.
[3] Shen, Ronglai, Adam B. Olshen, and Marc Ladanyi. “Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis.” Bioinformatics 25.22 (2009): 2906-2912.
[4] Wang, Bo, et al. “Similarity network fusion for aggregating data types on a genomic scale.” Nature methods 11.3 (2014): 333-337.
[5] Liu, Yufeng, et al. “Statistical significance of clustering for high-dimension, low?Csample size data.” Journal of the American Statistical Association (2012).
[6] Rousseeuw, Peter J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of computational and applied mathematics 20 (1987): 53-65.
[7] Xu T, Le T D, Liu L, et al. Identifying cancer subtypes from mirna-tf-mrna regulatory networks and expression data[J]. PloS one, 2016, 11(4): e0152792.