MultiRNAflow
on a RNA-seq raw counts with different time points and several biological conditionsThis document is a quick workflow describing how to use our R package MultiRNAflow on one dataset (see Dataset used as example). For a more complete description of our package and complete outputs with graphs, the user must read our pdf file entitled ’MultiRNAflow_vignette-knitr.pdf”.
The Mouse dataset 2 (Weger et al. 2021) is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE135898.
This dataset contains the temporal transcription profile of 16 mice with Bmal1 and Cry1/2 knocked-down under an ad libitum (AL) or night restricted feeding (RF) regimen. Data were collected at 0, 4h, 8h, 16, 20h and 24h. Therefore, there are six time points and eight biological conditions. As there are only two mice per biological condition, we decided not to take into account the effect of the regimen. The dataset contains temporal expression data of 40327 genes.
To illustrate the use of our package, we take 500 genes, over the global 40327 genes in the original dataset. This sub dataset is saved in the file RawCounts_Weger2021_MOUSEsub500.
library(MultiRNAflow)
data("RawCounts_Weger2021_MOUSEsub500")
The dataset must be a data.frame containing raw counts data.
If it is not the case, the functions DATAnormalization()
and
DEanalysisGlobal()
will stop and print an error.
Each line should correspond to a gene, each column to a sample,
except a particular column that may contain strings of characters
describing the names of the genes.
The first line of the data.frame should contain the names of the columns
(strings of characters) that must have the following structure.
head(colnames(RawCounts_Weger2021_MOUSEsub500), n=5)
#> [1] "Gene" "BmKo_t1_r1" "BmKo_t2_r1" "BmKo_t0_r1" "BmKo_t3_r1"
In this example, “Gene” indicates the column which contains the names of the different genes. The other column names contain all kind of information about the sample, including the biological condition, the time of measurement and the name of the individual (e.g patient, replicate, mouse, yeasts culture…). Other kinds of information can be stored in the column names (such as patient information), but they will not be used by the package. The various information in the column names must be separated by underscores. The order of these information is arbitrary but must be the same for all columns. For instance, the sample “BmKo_t0_r1” corresponds to the first replicate (r1) of the biological condition BmKo at time t0 (reference time).
The information located to the left of the first underscore will be considered
to be in position 1, the information located between the first underscore and
the second one will be considered to be in position 2, and so on.
In the previous example, the biological condition is in position 1,
the time is in position 2 and the replicate is in position 3.
In most of the functions of our package, the order of the previous information
in the column names will be indicated with the inputs Group.position
,
Time.position
and Individual.position
.
Similarly the input Column.gene
will indicate the number of
the column containing gene names.
DATAprepSE()
resDATAprepSE <- DATAprepSE(RawCounts=RawCounts_Weger2021_MOUSEsub500,
Column.gene=1,
Group.position=1,
Time.position=2,
Individual.position=3)
DEanalysisGlobal
for the statistical (supervised) analysis.RawCounts
)Column.gene=1
means that the first column of the dataset
contain genes name, Time.position=2
means that the time of measurements
is between the first and the second underscores in the columns names,
Individual.position=3
means that the name of the individual is between
the second and the third underscores in the columns names and
Group.position=NULL
means that there is only one biological condition
in the dataset. Similarly, Time.position=NULL
would mean that
there is only one time of measurement for each individual and
Column.gene=NULL
would mean that there is no column containing gene names.?DATAprepSE
in your console for more information about the function.DATAnormalization()
resNorm <- DATAnormalization(SEres=resDATAprepSE,
Normalization="vst",
Blind.rlog.vst=FALSE,
Plot.Boxplot=FALSE,
Colored.By.Factors=TRUE,
Color.Group=NULL,
path.result=NULL)
resDATAprepSE
but with
the normalized dataPlot.Boxplot=TRUE
) showing the distribution of
the normalized expression (Normalization="vst"
means that the vst method
is used) of genes for each sample.DATAprepSE()
.Plot.Boxplot=TRUE
.Colored.By.Factors=TRUE
, the color of the boxplots would be different
for different biological conditions.path.result
.?DATAnormalization
in your console for more information
about the function.PCAanalysis()
resPCA <- PCAanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
gene.deletion=NULL,
sample.deletion=NULL,
Plot.PCA=FALSE,
Mean.Accross.Time=FALSE,
Color.Group=NULL,
Cex.label=0.6,
Cex.point=0.7, epsilon=0.2,
Phi=25,Theta=140,
motion3D=FALSE,
path.result=NULL)
resNorm
but with
the results of the function PCA()
of the package FactoMineR
.motion3D=FALSE
) where samples are colored with different colors
for different biological conditions. Furthermore, lines are drawn
between each pair of consecutive points for each sample
(if Mean.Accross.Time=FALSE
, otherwise it will be only between means).motion3D=FALSE
) for each biological condition,
where samples are colored with different colors for different time points.
Furthermore, lines are drawn between each pair of consecutive points
for each sample (if Mean.Accross.Time=FALSE
, otherwise it will be only
between means).DATAnormalization()
.DATAnorm=TRUE
) for the PCA
analysis.Color.Group=NULL
), a color will be automatically applied
for each biological condition. If you want to change the colors,
read our pdf file entitled ’MultiRNAflow_vignette-knitr.pdf”.gene.deletion=c("ENSMUSG00000025921", "ENSMUSG00000026113")
and/or
sample.deletion=c("BmKo_t2_r1", "BmKo_t5_r2")
gene.deletion=c(2,6)
and/or sample.deletion=c(3,13)
.
The integers in gene.deletion
and sample.deletion
represent
respectively the row numbers and the column numbers of ExprData
where the selected genes and samples are located.Plot.PCA=TRUE
.path.result
.?PCAanalysis
in your console for more information
about the function.HCPCanalysis()
resHCPC <- HCPCanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
gene.deletion=NULL,
sample.deletion=NULL,
Plot.HCPC=FALSE,
Phi=25,Theta=140,
Cex.point=0.6,
epsilon=0.2,
Cex.label=0.6,
motion3D=FALSE,
path.result=NULL)
resNorm
but with
the results of the function HCPCanalysis()
of the package FactoMineR
motion3D=FALSE
). These PCA graphs are identical to the outputs
of PCAanalysis()
with all samples but samples are colored with different
colors for different clusters.DATAnormalization()
.DATAnorm=TRUE
)
for the HCPC analysis.Plot.HCPC=TRUE
.path.result
.HCPC()
.?HCPCanalysis
in your console for more information
about the function.MFUZZanalysis()
.resMFUZZ <- MFUZZanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
DataNumberCluster=NULL,
Method="hcpc",
Membership=0.5,
Min.std=0.1,
Plot.Mfuzz=FALSE,
path.result=NULL)
resNorm
but with
mfuzz()
DataNumberCluster=NULL
). If Method="hcpc"
, the function plots
the scaled within-cluster inertia, but if Method="kmeans"
,
the function plots the scaled within-cluster inertia.
As the number of genes can be very high, we recommend to select
Method="hcpc"
which is by default.mfuzz()
which represents the most
common temporal behavior among all genes for the biological condition ‘BmKo’.DATAnormalization()
.DATAnorm=TRUE
)
for the clustering analysis.Membership
(numeric value between 0 and 1) will not be displayed.
The membership values correspond to the probability of gene to belong
to each cluster.Min.std
(numeric positive value) will be excluded.Plot.Mfuzz=TRUE
.path.result
.?MFUZZanalysis
in your console for more information
about the function.DATAplotExpressionGenes()
resEXPR <- DATAplotExpressionGenes(SEresNorm=resNorm,
DATAnorm=TRUE,
Vector.row.gene=c(17),
Color.Group=NULL,
Plot.Expression=FALSE,
path.result=NULL)
DATAnormalization()
.DATAnorm=TRUE
).Vector.row.gene=c(97,192,194,494)
.Color.Group=NULL
), a color will be automatically applied
for each biological condition. If you want to change the colors,
read our pdf file entitled ’MultiRNAflow_vignette-knitr.pdf”.Plot.Expression=TRUE
.path.result
.?DATAplotExpressionGenes
in your console for more information
about the function.DEanalysisGlobal()
For the speed of the algorithm, we will take only three biological conditions and three times
Sub3bc3T <- RawCounts_Weger2021_MOUSEsub500
Sub3bc3T <- Sub3bc3T[, seq_len(73)]
SelectTime <- grep("_t0_", colnames(Sub3bc3T))
SelectTime <- c(SelectTime, grep("_t1_", colnames(Sub3bc3T)))
SelectTime <- c(SelectTime, grep("_t2_", colnames(Sub3bc3T)))
Sub3bc3T <- Sub3bc3T[, c(1, SelectTime)]
resSEsub <- DATAprepSE(RawCounts=Sub3bc3T,
Column.gene=1,
Group.position=1,
Time.position=2,
Individual.position=3)
resDE <- DEanalysisGlobal(SEres=resSEsub,
pval.min=0.05,
log.FC.min=1,
Plot.DE.graph=FALSE,
path.result=NULL)
#> [1] "Preprocessing"
#> [1] "Differential expression step with DESeq2::DESeq()"
#> [1] "Case 3 analysis : Biological conditions and Times."
#> [1] "DE time analysis for each biological condition."
#> [1] "DE group analysis for each time measurement."
#> [1] "Combined time and group results."
DESeq()
Results
) which contains
DEplotVennBarplotTime()
).DEplotBarplotFacetGrid()
).DEplotVennBarplotGroup()
).DEplotAlluvial()
).DEplotBarplotFacetGrid()
).DEplotAlluvial()
).pval.min
(numeric value between 0 and 1) and if the absolute
log fold change associated to the gene is superior to log.FC.min
(numeric positive value).Plot.DE.graph=TRUE
.path.result
.RawCounts=RawCounts_Weger2021_MOUSEsub500
in order to use
the complete dataset.?DEanalysisGlobal
in your console for more information
about the function.DEplotVolcanoMA()
resMAvolcano <- DEplotVolcanoMA(SEresDE=resDE,
NbGene.plotted=2,
SizeLabel=3,
Display.plots=FALSE,
Save.plots=FALSE)
DEanalysisGlobal()
.Display.plots=TRUE
.Save.plots=TRUE
and and a folder path in the input path.result
in the function DEanalysisGlobal()
.?DEplotVolcanoMA
in your console for more information
about the function.DEplotHeatmaps()
resHEAT <- DEplotHeatmaps(SEresDE=resDE,
ColumnsCriteria=2,
Set.Operation="union",
NbGene.analysis=20,
SizeLabelRows=5,
SizeLabelCols=5,
Display.plots=FALSE,
Save.plots=TRUE)
Set.Operation="union"
then the rows extracted from the different
datasets included in SEresDE
are those such that the sum of the selected
columns of SummarizedExperiment::rowData(SEresDE)
by ColumnsCriteria
is >0. For example, the selected genes can be those DE at least at t1 or t2
(versus the reference time t0).Display.plots=TRUE
.Save.plots=TRUE
and and a folder path in the input path.result
in the function DEanalysisGlobal()
.?DEplotHeatmaps
in your console for more information
about the function.GSEAQuickAnalysis()
and GSEApreprocessing()
resRgprofiler2 <- GSEAQuickAnalysis(Internet.Connection=FALSE,
SEresDE=resDE,
ColumnsCriteria=2,
ColumnsLog2ordered=NULL,
Set.Operation="union",
Organism="mmusculus",
MaxNumberGO=20,
Background=FALSE,
Display.plots=FALSE,
Save.plots=TRUE)
DEanalysisGlobal()
,
an enrichment analysis (GSEA) of a subset of genes with the R package
gprofiler2
.
gprofiler2::gost()
MaxNumberGO
most important GO.Set.Operation="union"
then the rows extracted from the different
datasets included in SEresDE
are those such that the sum of the selected
columns of SummarizedExperiment::rowData(SEresDE)
by ColumnsCriteria
is >0. For example, the selected genes can be those DE at least at t1 or t2
(versus the reference time t0).Display.plots=TRUE
.Save.plots=TRUE
and a folder path in the input path.result
in the function DEanalysisGlobal()
.?GSEAQuickAnalysis
in your console for more information
about the function.resGSEApreprocess <- GSEApreprocessing(SEresDE=resDE,
ColumnsCriteria=2,
Set.Operation="union",
Rnk.files=FALSE,
Save.files=FALSE)
ColumnsCriteria
and Set.Operation
.Save.files=TRUE
and the path.result of
DEanalysisGlobal()
is not NULL, specific files designed to be used as input
for the following online tools and software : GSEA, DAVID, WebGestalt,
gProfiler, Panther, ShinyGO, Enrichr, GOrillaSet.Operation="union"
then the rows extracted from the
different datasets included in SEresDE
are those such that the sum of
the selected columns of SummarizedExperiment::rowData(SEresDE)
by ColumnsCriteria
is >0.
For example, the selected genes can be those DE at least at t1 or t2
(versus the reference time t0).?GSEApreprocessing
in your console for more information
about the function.Here is the output of sessionInfo()
on the system on which this document
was compiled.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] tcltk stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MultiRNAflow_1.5.0 Mfuzz_2.67.0 DynDoc_1.85.0
#> [4] widgetTools_1.85.0 e1071_1.7-16 Biobase_2.67.0
#> [7] BiocGenerics_0.53.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] shape_1.4.6.1 magrittr_2.0.3
#> [5] magick_2.8.5 TH.data_1.1-2
#> [7] estimability_1.5.1 farver_2.1.2
#> [9] rmarkdown_2.28 GlobalOptions_0.1.2
#> [11] fs_1.6.4 zlibbioc_1.53.0
#> [13] vctrs_0.6.5 Cairo_1.6-2
#> [15] base64enc_0.1-3 rstatix_0.7.2
#> [17] htmltools_0.5.8.1 S4Arrays_1.7.0
#> [19] broom_1.0.7 Formula_1.2-5
#> [21] SparseArray_1.7.0 gridGraphics_0.5-1
#> [23] sass_0.4.9 bslib_0.8.0
#> [25] htmlwidgets_1.6.4 plyr_1.8.9
#> [27] sandwich_3.1-1 emmeans_1.10.5
#> [29] plotly_4.10.4 zoo_1.8-12
#> [31] cachem_1.1.0 misc3d_0.9-1
#> [33] lifecycle_1.0.4 iterators_1.0.14
#> [35] pkgconfig_2.0.3 Matrix_1.7-1
#> [37] R6_2.5.1 fastmap_1.2.0
#> [39] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
#> [41] clue_0.3-65 digest_0.6.37
#> [43] colorspace_2.1-1 S4Vectors_0.45.0
#> [45] DESeq2_1.47.0 GenomicRanges_1.59.0
#> [47] ggpubr_0.6.0 labeling_0.4.3
#> [49] fansi_1.0.6 httr_1.4.7
#> [51] abind_1.4-8 compiler_4.5.0
#> [53] proxy_0.4-27 withr_3.0.2
#> [55] doParallel_1.0.17 backports_1.5.0
#> [57] BiocParallel_1.41.0 viridis_0.6.5
#> [59] carData_3.0-5 UpSetR_1.4.0
#> [61] dendextend_1.18.1 Rttf2pt1_1.3.12
#> [63] ggsignif_0.6.4 MASS_7.3-61
#> [65] tkWidgets_1.85.0 DelayedArray_0.33.0
#> [67] rjson_0.2.23 scatterplot3d_0.3-44
#> [69] flashClust_1.01-2 tools_4.5.0
#> [71] extrafontdb_1.0 FactoMineR_2.11
#> [73] glue_1.8.0 grid_4.5.0
#> [75] reshape2_1.4.4 cluster_2.1.6
#> [77] generics_0.1.3 gtable_0.3.6
#> [79] class_7.3-22 tidyr_1.3.1
#> [81] data.table_1.16.2 car_3.1-3
#> [83] utf8_1.2.4 XVector_0.47.0
#> [85] stringr_1.5.1 ggrepel_0.9.6
#> [87] foreach_1.5.2 pillar_1.9.0
#> [89] yulab.utils_0.1.7 circlize_0.4.16
#> [91] splines_4.5.0 dplyr_1.1.4
#> [93] lattice_0.22-6 survival_3.7-0
#> [95] tidyselect_1.2.1 ComplexHeatmap_2.23.0
#> [97] locfit_1.5-9.10 knitr_1.48
#> [99] gridExtra_2.3 bookdown_0.41
#> [101] IRanges_2.41.0 SummarizedExperiment_1.37.0
#> [103] stats4_4.5.0 xfun_0.48
#> [105] plot3Drgl_1.0.4 factoextra_1.0.7
#> [107] matrixStats_1.4.1 DT_0.33
#> [109] stringi_1.8.4 UCSC.utils_1.3.0
#> [111] lazyeval_0.2.2 yaml_2.3.10
#> [113] evaluate_1.0.1 codetools_0.2-20
#> [115] extrafont_0.19 tibble_3.2.1
#> [117] BiocManager_1.30.25 multcompView_0.1-10
#> [119] ggplotify_0.1.2 cli_3.6.3
#> [121] xtable_1.8-4 munsell_0.5.1
#> [123] jquerylib_0.1.4 Rcpp_1.0.13
#> [125] GenomeInfoDb_1.43.0 gprofiler2_0.2.3
#> [127] coda_0.19-4.1 png_0.1-8
#> [129] parallel_4.5.0 leaps_3.2
#> [131] rgl_1.3.12 ggplot2_3.5.1
#> [133] ggalluvial_0.12.5 viridisLite_0.4.2
#> [135] mvtnorm_1.3-1 plot3D_1.4.1
#> [137] scales_1.3.0 purrr_1.0.2
#> [139] crayon_1.5.3 GetoptLong_1.0.5
#> [141] rlang_1.1.4 multcomp_1.4-26