Data object

0.1 About data object

All functions in the bigPint package require an input parameter called data, which should be a data frame that contains the full dataset of interest. If a researcher is using the package to visualize RNA-seq data, then this data object should be a count table that contains the read counts for all genes of interest.

0.2 Example: two treatments

All bigPint functions require the data object to follow the same data frame format. There should be \(n\) rows in the data frame, where \(n\) is the number of genes. There should be \(p + 1\) columns in the data frame, where \(p\) is the number of samples. The first column contains the gene names, and the remaining columns contain the read counts for all samples of interest. An example of this format is shown below:

library(bigPint)
data("soybean_ir_sub")
head(soybean_ir_sub)

##                               ID N.1 N.2 N.3 P.1 P.2 P.3
## 14881 Glyma.06G158700.Wm82.a2.v1  48  28  15  29  16   8
## 20855 Glyma.08G156000.Wm82.a2.v1   0   0   0   0   0   0
## 32104 Glyma.12G070600.Wm82.a2.v1   8  11  11   5   8   7
## 50897 Glyma.19G045200.Wm82.a2.v1 200 192 187 186 193 183
## 11303 Glyma.05G050000.Wm82.a2.v1   0   0   0   0   0   0
## 50345 Glyma.18G292300.Wm82.a2.v1 583 669 497 419 467 426

We can also examine the structure of an example data object as follows:

str(soybean_ir_sub, strict.width = "wrap")

## 'data.frame':    5604 obs. of  7 variables:
## $ ID : chr "Glyma.06G158700.Wm82.a2.v1" "Glyma.08G156000.Wm82.a2.v1"
##    "Glyma.12G070600.Wm82.a2.v1" "Glyma.19G045200.Wm82.a2.v1" ...
## $ N.1: int 48 0 8 200 0 583 22 52 1 73 ...
## $ N.2: int 28 0 11 192 0 669 34 42 0 120 ...
## $ N.3: int 15 0 11 187 0 497 19 44 3 59 ...
## $ P.1: int 29 0 5 186 0 419 11 46 0 98 ...
## $ P.2: int 16 0 8 193 0 467 21 54 0 106 ...
## $ P.3: int 8 0 7 183 0 426 11 42 2 86 ...

This example dataset contains 5,604 genes and six samples (Lauter and Graham 2016). There are two treatment groups, N and P. Each treatment group contains three replicates.
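The treatment groups and the number of replicates per group can be read directly off the column names. The following is a minimal sketch in base R on a small made-up data frame that follows the same format; the object `toy` and its values are illustrative assumptions and are not shipped with bigPint.

```r
# Toy data frame in the bigPint format (values are made up for illustration)
toy <- data.frame(
  ID  = c("gene1", "gene2", "gene3"),
  N.1 = c(48L, 0L, 8L), N.2 = c(28L, 0L, 11L), N.3 = c(15L, 0L, 11L),
  P.1 = c(29L, 0L, 5L), P.2 = c(16L, 0L, 8L),  P.3 = c(8L, 0L, 7L),
  stringsAsFactors = FALSE
)

# Drop the ID column, then strip the ".replicate" suffix to get group names
sample_cols <- colnames(toy)[-1]
groups <- sub("\\.[0-9]+$", "", sample_cols)

# Number of replicates per treatment group
reps_per_group <- table(groups)
print(reps_per_group)
```

With the package installed, replacing `toy` with `soybean_ir_sub` should report the same two groups, N and P, with three replicates each.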

0.3 Data object rules

As demonstrated above, the data object must meet the following conditions:

Be of type data.frame
Contain at least two treatment groups and at least two replicates per treatment group
Its first column must
- Be called “ID”
- Be of class character
- Contain the names of the genes (or a unique set of names in general)
Each of the rest of its columns must
- Contain the read counts for a given sample (or quantitative values in general)
- Be of class integer or numeric
- Be named in a three-part format (such as “A.3” or “S4.1”) that matches the regular expression ^[a-zA-Z0-9]+\.[0-9]+ (written as ^[a-zA-Z0-9]+\\.[0-9]+ inside an R string), where
  - The first part indicates the treatment group name and must consist of alphanumeric characters. Examples include “A”, “AR”, and “A9”
  - The second part consists of a dot “.” to serve as a delimiter
  - The third part indicates the replicate number and must consist of digits

It is important that the names of all columns except the first follow the three-part format delineated above. All functions in the bigPint package require this format to successfully produce plots. If your data object does not fit this format, bigPint will likely throw an informative error about why your format was not recognized.
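These rules can be checked before calling any plotting function. The sketch below is a hypothetical helper written in base R; `check_data_object()` is not part of bigPint, and the checks bigPint itself performs may differ. It also appends `$` to the column-name pattern so that the whole name must match, which is an assumption that is stricter than the expression quoted above.

```r
# Hypothetical helper (NOT part of bigPint): returns a character vector of
# rule violations, or character(0) if the data object passes every check.
check_data_object <- function(data) {
  if (!is.data.frame(data)) {
    return("data must be a data.frame")
  }
  if (ncol(data) < 2 || colnames(data)[1] != "ID") {
    return("first column must be called 'ID' and be followed by sample columns")
  }
  problems <- character(0)
  if (!is.character(data$ID)) {
    problems <- c(problems, "ID column must be of class character")
  }
  if (anyDuplicated(data$ID) > 0) {
    problems <- c(problems, "ID column must contain unique names")
  }
  sample_cols <- colnames(data)[-1]
  # Pattern from the text, with "$" added (assumption) to match the full name
  ok_name <- grepl("^[a-zA-Z0-9]+\\.[0-9]+$", sample_cols)
  if (!all(ok_name)) {
    problems <- c(problems, paste("badly named columns:",
                                  paste(sample_cols[!ok_name], collapse = ", ")))
  }
  is_num <- vapply(data[-1], is.numeric, logical(1))
  if (!all(is_num)) {
    problems <- c(problems, "all sample columns must be integer or numeric")
  }
  # Count replicates per treatment group among the well-named columns
  reps <- table(sub("\\.[0-9]+$", "", sample_cols[ok_name]))
  if (length(reps) < 2 || any(reps < 2)) {
    problems <- c(problems,
                  "need at least two groups with at least two replicates each")
  }
  problems
}

# Made-up examples: one valid object and one with a misnamed column
good <- data.frame(ID = c("g1", "g2"), A.1 = c(1, 2), A.2 = c(3, 4),
                   B.1 = c(5, 6), B.2 = c(7, 8), stringsAsFactors = FALSE)
bad <- good
colnames(bad)[2] <- "A_1"

check_data_object(good)
check_data_object(bad)
```

Returning a vector of messages rather than stopping at the first failure lets all naming and type problems be fixed in one pass.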

0.4 Example: three treatments

The data object can contain more than two treatment groups. In this case, the bigPint software will automatically create plots for all pairs of treatment groups. An example of this type of dataset is provided in the bigPint package and can be accessed as follows:

data(soybean_cn_sub)

This example dataset contains 7,332 genes and nine samples (Brown and Hudson 2015). There are three treatment groups, S1, S2, and S3. Each treatment group contains three replicates. In such cases where the data object contains more than two treatment groups, all functions in the bigPint package (except plotSMApp()) will automatically produce a plot for each pairwise combination of treatment groups.

For example, bigPint functions will produce plots for S1 versus S2, S1 versus S3, and S2 versus S3 in this case. The same could be accomplished (although less efficiently) by separating the dataset into three separate datasets and running a bigPint function of interest on each of them individually.

library(dplyr)
soybean_cn_sub_S1S2 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S2"))
soybean_cn_sub_S1S3 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S3"))
soybean_cn_sub_S2S3 <- soybean_cn_sub %>% select("ID", contains("S2"), contains("S3"))

head(soybean_cn_sub_S1S2, 3)

##                     ID      S1.1     S1.2     S1.3     S2.1     S2.2     S2.3
## 19468  Glyma06g12670.1 0.8024444 2.708884 1.763407 7.716099 6.581990 7.003538
## 27284  Glyma08g12390.2 4.7687202 5.235777 5.166631 3.823472 3.566863 3.295619
## 42001 Glyma12g02076.11 3.1899340 2.902131 2.906502 3.100206 3.284326 3.295619

head(soybean_cn_sub_S1S3, 3)

##                     ID      S1.1     S1.2     S1.3     S3.1     S3.2     S3.3
## 19468  Glyma06g12670.1 0.8024444 2.708884 1.763407 8.556732 8.367593 8.389347
## 27284  Glyma08g12390.2 4.7687202 5.235777 5.166631 3.669489 4.031427 4.269312
## 42001 Glyma12g02076.11 3.1899340 2.902131 2.906502 3.364437 2.731105 3.255649

head(soybean_cn_sub_S2S3, 3)

##                     ID     S2.1     S2.2     S2.3     S3.1     S3.2     S3.3
## 19468  Glyma06g12670.1 7.716099 6.581990 7.003538 8.556732 8.367593 8.389347
## 27284  Glyma08g12390.2 3.823472 3.566863 3.295619 3.669489 4.031427 4.269312
## 42001 Glyma12g02076.11 3.100206 3.284326 3.295619 3.364437 2.731105 3.255649
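The same pairwise subsets can also be generated programmatically, which scales to any number of treatment groups. The following is a minimal sketch in base R using `combn()`; the data frame `toy3` is made up for illustration, and with the package installed `soybean_cn_sub` could be used in its place.

```r
# Toy three-group data frame in the bigPint format (made-up values)
toy3 <- data.frame(
  ID   = c("g1", "g2"),
  S1.1 = c(0.8, 4.8), S1.2 = c(2.7, 5.2),
  S2.1 = c(7.7, 3.8), S2.2 = c(6.6, 3.6),
  S3.1 = c(8.6, 3.7), S3.2 = c(8.4, 4.0),
  stringsAsFactors = FALSE
)

# Group name of each sample column (everything before the ".replicate" suffix)
groups <- sub("\\.[0-9]+$", "", colnames(toy3)[-1])

# All unordered pairs of groups, one pair per column of the result
pairs <- combn(unique(groups), 2)

# One data frame per pair: the ID column plus the samples of the two groups
pair_list <- lapply(seq_len(ncol(pairs)), function(i) {
  keep <- groups %in% pairs[, i]
  toy3[, c(TRUE, keep), drop = FALSE]
})
names(pair_list) <- apply(pairs, 2, paste, collapse = "")

names(pair_list)
```

Matching on the exact group name avoids a pitfall of `contains("S1")`, which would also select columns from a group named “S10”.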

0.5 Preprocessing of data object

Some popular RNA-seq analysis packages (such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015)) advise researchers to perform certain preprocessing steps on their data before visualization, such as filtering the genes, normalizing their read counts, and standardizing their read counts. The data object in the bigPint package can be a dataset that has or has not been filtered, normalized, and standardized. If they wish, researchers can use bigPint plots to investigate how their dataset changes after each of these steps.
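As one illustration of producing a "before" and "after" data object for such a comparison, the sketch below applies a simple filter, a log transformation, and a per-gene standardization in base R. The specific steps and thresholds are assumptions chosen for brevity; they are not the procedures recommended by edgeR, DESeq2, or limma, and `toy` is a made-up dataset.

```r
# Toy count table in the bigPint format (made-up values)
toy <- data.frame(
  ID  = c("g1", "g2", "g3", "g4"),
  N.1 = c(48L, 0L, 8L, 200L), N.2 = c(28L, 0L, 11L, 192L),
  P.1 = c(29L, 0L, 5L, 186L), P.2 = c(16L, 0L, 8L, 193L),
  stringsAsFactors = FALSE
)

counts <- as.matrix(toy[, -1])

# Filter: drop genes with zero counts in every sample
keep <- rowSums(counts) > 0

# Transform: log2 with a pseudocount of 1 to avoid log(0)
logged <- log2(counts[keep, , drop = FALSE] + 1)

# Standardize: center and scale each gene (row) across samples.
# Genes with identical values in all samples would yield NaN here.
standardized <- t(scale(t(logged)))

# Reassemble into the ID-first data frame format expected by bigPint
toy_std <- data.frame(ID = toy$ID[keep], standardized,
                      stringsAsFactors = FALSE)
head(toy_std)
```

Both `toy` and `toy_std` keep the required column layout, so either could be passed as the data object and the resulting plots compared.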

References

Brown, Anne V., and Karen A. Hudson. 2015. “Developmental Profiling of Gene Expression in Soybean Trifoliate Leaves and Cotyledons.” BMC Plant Biology 15 (1). BioMed Central:169.

Lauter, A. N. Moran, and M. A. Graham. 2016. “NCBI SRA BioProject Accession: PRJNA318409.”

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12). BioMed Central:550.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7). Oxford University Press:e47.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1). Oxford University Press:139–40.