In this tutorial we are going to analyze the dataset of metabolic activities published in Terunuma et al (2017). In this paper the Authors characterized the transcriptomic and metabolomic profile of several types of human breast tumors to uncovered metabolite signatures using an untargeted discovery approach. They found that the oncometabolite 2-hydroxyglutarate (2HG) accumulates at high levels in a subset of tumors and discovered an association between increased 2HG levels and MYC pathway activation in breast cancer. As an example of metabolites topological analyses here we are going to use only the dataset of metabolite abundances as reported in the supplementary files selecting only tumour samples and comparing ER+ and ER- breast cancers. We will use pathway information from the KEGG and Reactome databases to identify a set of metabolites whose activity changes significantly between the two sample classes. This result will hopefully hint at some specific biological activities that are pathologically altered in tumoral samples.
We start by loading the experimental data into R. The measurements are stored in a simple text file, in tabular form. Thanks to the utility functions provided by R, we can combine the download and the actual reading of the table in a single operation. We also specify the details of the file format: the first row of input represents the header of the table; the tabulation character \t
separates values; we don’t want R to change strings into factors. TODO: change the URL
addr <- "http://romualdi.bio.unipd.it/wp-uploads/2017/10/metabolite_matrix_final.txt"
expr <- read.table(url(addr), header = TRUE, sep = "\t", stringsAsFactors = FALSE)
The loaded table contains 536 rows and 71 columns, but not all of them actually correspond to measurements. We ask R to list the names of the columns:
colnames(expr)
## [1] "METABOLON_ID" "KEGG_ID" "CAS" "HMDB_ID"
## [5] "NEG.TUMOR" "NEG.TUMOR.1" "NEG.TUMOR.2" "POS.TUMOR"
## [9] "POS.TUMOR.1" "NEG.TUMOR.3" "POS.TUMOR.2" "POS.TUMOR.3"
## [13] "NEG.TUMOR.4" "NEG.TUMOR.5" "NEG.TUMOR.6" "POS.TUMOR.4"
## [17] "NEG.TUMOR.7" "NEG.TUMOR.8" "POS.TUMOR.5" "NEG.TUMOR.9"
## [21] "NEG.TUMOR.10" "NEG.TUMOR.11" "NEG.TUMOR.12" "NEG.TUMOR.13"
## [25] "POS.TUMOR.6" "POS.TUMOR.7" "POS.TUMOR.8" "POS.TUMOR.9"
## [29] "POS.TUMOR.10" "POS.TUMOR.11" "POS.TUMOR.12" "POS.TUMOR.13"
## [33] "NEG.TUMOR.14" "NEG.TUMOR.15" "POS.TUMOR.14" "POS.TUMOR.15"
## [37] "NEG.TUMOR.16" "NEG.TUMOR.17" "NEG.TUMOR.18" "NEG.TUMOR.19"
## [41] "POS.TUMOR.16" "POS.TUMOR.17" "POS.TUMOR.18" "NEG.TUMOR.20"
## [45] "NEG.TUMOR.21" "NEG.TUMOR.22" "NEG.TUMOR.23" "NEG.TUMOR.24"
## [49] "POS.TUMOR.19" "POS.TUMOR.20" "POS.TUMOR.21" "POS.TUMOR.22"
## [53] "POS.TUMOR.23" "NEG.TUMOR.25" "POS.TUMOR.24" "POS.TUMOR.25"
## [57] "NEG.TUMOR.26" "NEG.TUMOR.27" "NEG.TUMOR.28" "NEG.TUMOR.29"
## [61] "POS.TUMOR.26" "NEG.TUMOR.30" "POS.TUMOR.27" "POS.TUMOR.28"
## [65] "NEG.TUMOR.31" "POS.TUMOR.29" "POS.TUMOR.30" "POS.TUMOR.31"
## [69] "NEG.TUMOR.32" "NEG.TUMOR.33" "POS.TUMOR.32"
As you can see the first four columns actually contain metabolite identifiers. We store their indices in a variable for future reference.
idcols <- 1:4
We display the first few rows of the table to get a feeling for its content. For the moment, we concentrate on the metabolite measurements.
head(expr[,5:10])
## NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1 NEG.TUMOR.3
## 1 143976.09 241472.30 NA 28847.14 NA 91703.21
## 2 32786.00 433319.00 21002.72 47510.59 NA 119432.82
## 3 NA 163501.61 NA NA NA 133341.57
## 4 20556.03 NA NA NA 94989.22 NA
## 5 72971.21 124561.37 50246.18 106322.34 89398.17 43596.23
## 6 NA 55323.71 NA NA NA 361387.05
It looks like there is a significant number of missing values. Let’s plot a distribution of the fraction of NAs for each row.
nas_by_row <- rowSums(is.na(data.matrix(expr[,-idcols]))) / (ncol(expr) - length(idcols))
hist(nas_by_row, xlab = "Fraction of NA values", ylab = "Number of rows", main = NULL)
We cannot make good use of rows with too many missing values. We decide to drop them from our dataset. Specifically, we are going to remove anything with more than 50% of NAs.
dexpr <- expr[nas_by_row < 0.5,]
This operation leaves us with 315 of the original 536 rows.
We are not yet satisfied, though. The downstream analysis won’t be able to cope with any NA. What would happen if we were to apply a more stringent procedure, removing any NA?
sum(rowSums(is.na(dexpr[,-idcols])) == 0)
## [1] 88
The above command is a little bit dense, so let’s go through it in small steps. First we select all columns, except the identifiers. Then we check each value, testing if it’s an NA. We tally how many NAs we have in each row and we consider only those in which the count is zero (meaning, no NA at all). The outer sum tells us how many rows would survive our filter.
The final verdict is quite grim. Only 88 out of 315 measurements would be used. We could do much better using a strategy that has become quite common in cases like this: imputation.
Instead of implementing it ourselves, we are going to use the excellent BioConductor package impute.
library(impute)
iexpr <- cbind(dexpr[,idcols],
impute.knn(data.matrix(dexpr[,-idcols]))$data,
stringsAsFactors = FALSE)
head(iexpr[, 1:6])
## METABOLON_ID KEGG_ID CAS HMDB_ID NEG.TUMOR NEG.TUMOR.1
## 1 35186 143976.09 241472.3
## 2 34214 C03819 32786.00 433319.0
## 5 27665 C02918 1005-24-9; HMDB00699 72971.21 124561.4
## 7 21184 C01885 111-03-5; 102595.37 940195.4
## 8 33960 C03916 19420-56-5; HMDB02815 66618.92 351724.6
## 10 21127 C01885 542-44-9; 17852.30 367552.9
We can explicitly check there are not NAs left:
sum(is.na(iexpr[,-idcols]))
## [1] 0
We now concentrate on the metabolite identifiers. We start again from the first few rows.
head(iexpr[,idcols])
## METABOLON_ID KEGG_ID CAS HMDB_ID
## 1 35186
## 2 34214 C03819
## 5 27665 C02918 1005-24-9; HMDB00699
## 7 21184 C01885 111-03-5;
## 8 33960 C03916 19420-56-5; HMDB02815
## 10 21127 C01885 542-44-9;
The CAS identifiers have a trailing ;
we don’t need and there are a number of empty strings, which really represent missing values. We fix both these issues.
iexpr$CAS <- sub("\\s*;.*$", "", iexpr$CAS)
iexpr[,idcols][iexpr[,idcols] == ""] <- NA
head(iexpr[,idcols])
## METABOLON_ID KEGG_ID CAS HMDB_ID
## 1 35186 <NA> <NA> <NA>
## 2 34214 C03819 <NA> <NA>
## 5 27665 C02918 1005-24-9 HMDB00699
## 7 21184 C01885 111-03-5 <NA>
## 8 33960 C03916 19420-56-5 HMDB02815
## 10 21127 C01885 542-44-9 <NA>
We get a measure of how many identifiers are present or missing:
summary(is.na(iexpr[,idcols]))
## METABOLON_ID KEGG_ID CAS HMDB_ID
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:315 FALSE:176 FALSE:199 FALSE:175
## TRUE :139 TRUE :116 TRUE :140
We have a Metabolon ID for each metabolite in the matrix, while in the case of CAS identifiers we find only 199 usable rows.
Unfortunately, at this point graphite
does not support the Metabolon IDs. We do rely on CAS even if that means loosing a significant fraction of the rows.
valid_cas <- !is.na(iexpr$CAS)
cas_col <- which(names(iexpr) == "CAS")
cexpr <- iexpr[valid_cas, c(cas_col, seq.int(5, ncol(iexpr)))]
head(cexpr[,1:6])
## CAS NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1
## 5 1005-24-9 72971.21 124561.4 50246.180 106322.34 89398.17
## 7 111-03-5 102595.37 940195.4 8155.901 64641.62 450857.68
## 8 19420-56-5 66618.92 351724.6 96895.167 43963.93 89116.99
## 10 542-44-9 17852.30 367552.9 168209.378 31647.50 361469.21
## 11 17364-16-8 73652.66 2933573.1 118921.644 13491.35 201283.33
## 12 123-94-4 25390.48 255992.6 164295.490 53549.10 161703.45
There is no reason at this point to keep the identifiers as a column inside the dataset. We move such information to the row names and transform the data.frame
into a numeric matrix
.
mexpr <- data.matrix(cexpr[,-1])
rownames(mexpr) <- paste("CAS", cexpr$CAS, sep = ":")
head(mexpr[,1:6])
## NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1
## CAS:1005-24-9 72971.21 124561.4 50246.180 106322.34 89398.17
## CAS:111-03-5 102595.37 940195.4 8155.901 64641.62 450857.68
## CAS:19420-56-5 66618.92 351724.6 96895.167 43963.93 89116.99
## CAS:542-44-9 17852.30 367552.9 168209.378 31647.50 361469.21
## CAS:17364-16-8 73652.66 2933573.1 118921.644 13491.35 201283.33
## CAS:123-94-4 25390.48 255992.6 164295.490 53549.10 161703.45
## NEG.TUMOR.3
## CAS:1005-24-9 43596.23
## CAS:111-03-5 115254.33
## CAS:19420-56-5 1438440.74
## CAS:542-44-9 809467.15
## CAS:17364-16-8 8597685.11
## CAS:123-94-4 476471.58
To make sense of the changes in metabolic activity recorded in the data we have just loaded, we are going to use pathway information from KEGG (via the graphite package) and a statistical analysis capable of exploiting the topology of such pathways (here we’ll rely on clipper).
We import the required packages:
library(graphite)
library(clipper)
clipper
will need three pieces of information:
mexpr
in our case)Getting pathways is quite easy thanks to graphite
.
kpaths <- pathways("hsapiens", "kegg")
The above command retrieves all KEGG pathways for Homo sapiens (294 in total). We take a peek at the first entry:
kpaths[[1]]
## "Glycolysis / Gluconeogenesis" pathway
## Native ID = hsa:00010
## Database = KEGG
## Species = hsapiens
## Number of nodes = 94
## Number of edges = 1191
## Retrieved on = 12-10-2017
This summary doesn’t tell us which identifiers are used for the nodes in the pathway. We can get that from the list of edges.
head(edges(kpaths[[1]], "metabolites"))
## src_type src dest_type dest direction type
## 1 KEGGCOMP C00022 KEGGCOMP C00074 directed Process(indirect)
## 2 KEGGCOMP C00022 KEGGCOMP C00186 directed Process(indirect)
## 3 KEGGCOMP C00022 KEGGCOMP C05125 directed Process(indirect)
## 4 KEGGCOMP C00022 KEGGCOMP C15972 directed Process(indirect)
## 5 KEGGCOMP C00022 KEGGCOMP C16255 directed Process(indirect)
## 6 KEGGCOMP C00024 KEGGCOMP C15973 directed Process(indirect)
As we might expect, KEGG pathways are using KEGG compounds IDs. Since we’re relying on CAS in our data, we should convert them. convertIdentifiers
to the rescue!
rpaths <- convertIdentifiers(kpaths, "CAS")
The edges of the first pathway have been converted into:
head(edges(rpaths[[1]], "metabolites"))
## src_type src dest_type dest direction type
## 1 CAS 57-60-3 CAS 73-89-2 directed Process(indirect)
## 2 CAS 57-60-3 CAS 79-33-4 directed Process(indirect)
## 3 CAS 64-19-7 CAS 72-89-9 directed Process(indirect)
## 4 CAS 71-50-1 CAS 72-89-9 directed Process(indirect)
## 5 CAS 64-19-7 CAS 75-07-0 directed Process(indirect)
## 6 CAS 71-50-1 CAS 75-07-0 directed Process(indirect)
To make the rest of the analysis more robust, we are going to filter pathways requiring they have at least 10 edges whose metabolites appear in our data.
filter_pathways <- function(pathways, expr, min_edge_num) {
node_names <- sub("^CAS:", "", rownames(expr))
pred <- function(p) {
es <- edges(p, "metabolites")
mask <- es$src %in% node_names & es$dest %in% node_names
sum(mask) > min_edge_num
}
Filter(pred, pathways)
}
fpaths <- filter_pathways(rpaths, mexpr, 10)
Now about sample classes. We are going to split our dataset into two: POS.TUMOR
samples in the first group and NEG.TUMOUR
samples in the other. clipper
wants from us a vector with as many entries as there are samples. An entry should be 1
if the corresponding samples belongs to the first class, or 2
in the other case.
pos_indices <- grep("^POS", colnames(mexpr))
neg_indices <- grep("^NEG", colnames(mexpr))
classes <- numeric(length = ncol(mexpr))
classes[neg_indices] <- 1
classes[pos_indices] <- 2
classes
## [1] 1 1 1 2 2 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1
## [36] 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 2 2 1 1 2
We use the helper function runClipper
to start our analysis. Note that we explicitly require metabolites from the pathways.
out <- runClipper(fpaths, mexpr, classes, "mean", "metabolites", maxNodes = 150, seed = 42)
We check that there are no errors and the we extract the list of pathways which appears to be significantly altered between the two conditions according to clipper
.
stopifnot(length(out$errors) == 0)
names(out$results)
## [1] "Alanine, aspartate and glutamate metabolism"
plot_altered_path <- function(result, pathways, node_scale = 2) {
title <- names(result)
pathway <- pathways[[title]]
graph <- pathwayGraph(pathway, which = "metabolites")
labels <- sub("^CAS:", "", nodes(graph))
names(labels) <- nodes(graph)
altered <- unlist(strsplit(result[[1]][1, "pathGenes"], "[,;]"))
selected <- nodes(graph) %in% altered
node_colors <- ifelse(selected, "red", "black")
names(node_colors) <- nodes(graph)
base <- 0.75
heights <- ifelse(selected, base * node_scale, base)
names(heights) <- nodes(graph)
widths <- ifelse(selected, base * node_scale, base)
names(widths) <- nodes(graph)
base <- 14
fontsizes <- ifelse(selected, base * node_scale, base)
names(fontsizes) <- nodes(graph)
between_altered <- function(edge_names, altered) {
sapply(edge_names, function(edge_name) {
nodes <- unlist(strsplit(edge_name, "~", fixed = TRUE))
all(nodes %in% altered)
})
}
edge_colors <- ifelse(between_altered(edgeNames(graph), altered), "red", "black")
plot(graph,
attrs = list(edge = list(arrowsize = 0.5)),
nodeAttrs = list(label = labels, color = node_colors, width = widths,
height = heights, fontsize = fontsizes),
edgeAttrs = list(color = edge_colors),
recipEdges = "combined", main = title)
}
plot_altered_path(out$results[1], fpaths)
As expected, within the significant path (highlighted in red) we found N-Acetyl-L-aspartic acid (NAA) (CAS:997-55-7) and L-Glutamic acid (CAS:56-85-9) identified by the Authors as two differential metabolites between ER+ and ER- patients. Previous studies demonstrate that a deleterious aspartoacylase (ASPA) gene mutations inhibit the hydrolysis of NAA suggesting a possible cause for increased tumor-associated NAA. The Authors found NAA significantly elevated in ER- tumors and a reduced expression of aspartoacylase (ASPA) gene in ER- tumours.
In this section we repeat the previous analysis, but we use Reactome as a source of pathway information.
fpaths <- filter_pathways(convertIdentifiers(pathways("hsapiens", "reactome"), "CAS"),
mexpr, 10)
out <- runClipper(fpaths, mexpr, classes, "mean", "metabolites", maxNodes = 150, seed = 42)
stopifnot(length(out$errors) == 0)
plot_altered_path(out$results[1], fpaths, 12)
## Warning in arrows(head_from[1], head_from[2], head_to[1], head_to[2], col =
## edgeColor, : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(head_from[1], head_from[2], head_to[1], head_to[2], col =
## edgeColor, : zero-length arrow is of indeterminate angle and so skipped
Here we found a general pathway (nucleotide metabolism) within which the best significant path is composed by inosine, guanine, guanosine, adenosine and xanthine. It is worth to note that Terunuma et al (2017) found guanine one of the highest significant metabolite between ER+ and ER- tumours.