We present a detailed, guided tutorial on how to use the Logolas R package. A pdf version of this vignette can be found here.
Compared to the existing packages for plotting sequence logos (seqLogo, seq2Logo, motifStack etc), Logolas offers several new features that make logo visualization a more generic tool with potential applications to a much wider range of problems.
Enrichment Depletion Logo (EDLogo) : General logo plotting software highlights only the enrichment of certain symbols, but Logolas allows the user to highlight both enrichment and depletion of symbols at any position, leading to a more parsimonious and visually appealing representation.
String symbols : General logo building software has a limited library of symbols, usually restricted to the letters of the English alphabet. Logolas allows the user to plot symbols for any alphanumeric string, composed of English letters, numbers, punctuation marks, arrows etc. It also provides an easy interface for users to create their own symbols and add them to the library of symbols that can be plotted.
Dirichlet Adaptive Shrinkage : Logolas provides a statistical approach to adaptively scale the heights of the logos based on the number of aligned sequences.
Better customizations : Logolas offers several new color palettes, fill and border styles, several options for determining heights of the logos etc. Also, they can be plotted in multiple panels and combined with ggplot2 graphics.
Logolas loads the following R packages as dependencies: grid, gridExtra, SQUAREM, LaplacesDemon, Matrix, RColorBrewer. To run this vignette, the user also needs to install the ggseqlogo package.
The Bioc version of Logolas can be installed as follows
To install the development version of Logolas from GitHub, the user needs the devtools package and then runs the following command.
Load the Logolas package.
Logolas accepts two data formats as input
a vector of aligned character sequences (may be DNA, RNA or amino acid sequences), each of same length (see Example 1 below)
a positional frequency (weight) matrix, termed PFM (PWM), with the symbols to be plotted along the rows and the positions of aligned sequences, from which the matrix is generated, along the columns. (see Example 2)
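To make the relation between the two input formats concrete, here is a minimal base R sketch that turns a vector of aligned sequences into a PFM (counts) and then into a PWM (column proportions). The helper seqs_to_pfm() is hypothetical and is not part of Logolas; logomaker() performs its own conversion internally, which may differ in detail.

```r
# Hypothetical helper (not a Logolas function): count symbols at each position.
seqs_to_pfm <- function(seqs, symbols = c("A", "C", "G", "T")) {
  stopifnot(length(unique(nchar(seqs))) == 1)   # sequences must be aligned
  chars <- do.call(rbind, strsplit(seqs, ""))   # one row per sequence
  pfm <- apply(chars, 2, function(col) table(factor(col, levels = symbols)))
  dimnames(pfm) <- list(symbols, seq_len(ncol(chars)))
  pfm                                           # symbols in rows, positions in columns
}

seqs <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT")
pfm <- seqs_to_pfm(seqs)
pwm <- sweep(pfm, 2, colSums(pfm), "/")         # normalize each column to sum to 1
print(pfm)
```

Each column of pfm sums to the number of sequences, and each column of pwm sums to 1, matching the row/column layout described above.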
Consider aligned strings of characters
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTTTAT", "CTATAGT",
"CTATTTT", "CTTATAT", "CTATATT", "CTCATTT", "CTTATTT", "CAATAGT",
"CATTTGA", "CTCTTAT", "CTATTAT", "CTTTTAT", "CTATAAT", "CTTAGGT",
"CTATTGT", "CTCATGT", "CTATAGT", "CTCGTTA", "CTAGAAT", "CAATGGT")
The logo plots (both the standard logo and the Enrichment Depletion Logo) can be plotted using the logomaker() function.
Instead of DNA/RNA sequences as above, one can also use amino acid character sequences.
We now look at an example with a positional weight matrix (PWM) as input to logomaker().
##     1   2   3   4   5   6   7   8
## A 0.0 0.0 0.0 0.3 0.2 0.0 0.0 0.0
## C 0.8 0.2 0.8 0.3 0.4 0.2 0.8 0.2
## G 0.2 0.8 0.2 0.4 0.3 0.8 0.2 0.8
## T 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0
We draw the logo plots for this PWM.
## [1] 1.2751513 1.2751513 1.2751513 0.4277256 0.1534290 1.2751513 1.2751513
## [8] 1.2751513
## $pos_ic
##         1         2         3         4         5         6         7
## 6.3105144 6.3105144 6.3105144 0.4031197 0.9645379 6.3105144 6.3105144
##         8
## 6.3105144
##
## $neg_ic
##        1        2        3        4        5        6        7        8
## 4.364040 4.364040 4.364040 4.925458 1.493007 4.364040 4.364040 4.364040
##
## $table_mat_pos_norm
##           1         2         3 4         5         6         7         8
## A 0.0000000 0.0000000 0.0000000 0 0.0000000 0.0000000 0.0000000 0.0000000
## C 0.6542247 0.3457753 0.6542247 0 0.7089704 0.3457753 0.6542247 0.3457753
## G 0.3457753 0.6542247 0.3457753 1 0.2910296 0.6542247 0.3457753 0.6542247
## T 0.0000000 0.0000000 0.0000000 0 0.0000000 0.0000000 0.0000000 0.0000000
##
## $table_mat_neg_norm
##     1   2   3 4        5   6   7   8
## A 0.5 0.5 0.5 0 0.188016 0.5 0.5 0.5
## C 0.0 0.0 0.0 0 0.000000 0.0 0.0 0.0
## G 0.0 0.0 0.0 0 0.000000 0.0 0.0 0.0
## T 0.5 0.5 0.5 1 0.811984 0.5 0.5 0.5
The output above gives the information content at each position for the standard logo plot (type = "Logo"). For the EDLogo plot (type = "EDLogo"), it gives the heights of the stacks along the positive and negative Y axes, together with the breakdown of those heights by character.
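As a rough illustration of what "information content at each position" means, the following base R sketch computes the classical Shannon information content (log2 of the number of symbols minus the column entropy) for the PWM printed earlier. The function info_content() is a hypothetical helper, not a Logolas function. Its values are close to, but not identical to, the printed ones (about 1.278 versus 1.275 at position 1), so logomaker() evidently applies some further adjustment that this sketch does not reproduce.

```r
# PWM from the example above: symbols in rows, positions in columns.
pwm <- rbind(A = c(0.0, 0.0, 0.0, 0.3, 0.2, 0.0, 0.0, 0.0),
             C = c(0.8, 0.2, 0.8, 0.3, 0.4, 0.2, 0.8, 0.2),
             G = c(0.2, 0.8, 0.2, 0.4, 0.3, 0.8, 0.2, 0.8),
             T = c(0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0))
colnames(pwm) <- 1:8

# Hypothetical helper: classical information content per position, in bits.
info_content <- function(pwm) {
  # Treat 0 * log2(0) as 0, the usual convention for entropy.
  plogp <- ifelse(pwm > 0, pwm * log2(pwm), 0)
  log2(nrow(pwm)) + colSums(plogp)
}

ic <- info_content(pwm)
print(round(ic, 4))
```

Positions dominated by one symbol (1, 2, 3, 6, 7, 8) get tall stacks, while the more evenly mixed positions 4 and 5 get short ones, which is the pattern seen in the printed heights.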
The logomaker() function provides three arguments to set the colors for the logos: color_type, which specifies the coloring scheme; colors, which gives the cohort of colors to draw from; and color_seed, which determines how colors are sampled from this cohort.
The color_type argument can take three values: per_row, per_column and per_symbol. The colors argument is a cohort of colors (chosen suitably large) from which distinct colors are drawn according to color_type. The number of colors drawn equals the number of rows in the table for per_row (one color per string), the number of columns in the table for per_column (one color per column), or the number of distinct symbols for per_symbol (one color per distinct symbol). The length of colors should be at least the number of colors to be drawn in each scenario. The default color_type is per_row, and the default colors comprises a large cohort of nearly 70 distinct colors from which colors are sampled using the color_seed argument.
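A hedged sketch of how the three color arguments might be combined is given below. It assumes that seqlogo_example is the PWM printed earlier (rebuilt here so the block is self-contained), and the palette and seed value are arbitrary choices for illustration. The logomaker() call only runs if Logolas is installed.

```r
# Rebuild the example PWM so the block is self-contained.
seqlogo_example <- rbind(A = c(0.0, 0.0, 0.0, 0.3, 0.2, 0.0, 0.0, 0.0),
                         C = c(0.8, 0.2, 0.8, 0.3, 0.4, 0.2, 0.8, 0.2),
                         G = c(0.2, 0.8, 0.2, 0.4, 0.3, 0.8, 0.2, 0.8),
                         T = c(0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0))
colnames(seqlogo_example) <- 1:8

# per_column needs at least one color per column; rainbow() is an arbitrary choice.
cols <- grDevices::rainbow(ncol(seqlogo_example))

if (requireNamespace("Logolas", quietly = TRUE)) {
  # Argument names are taken from the text above; the seed value is arbitrary.
  Logolas::logomaker(seqlogo_example, type = "Logo",
                     color_type = "per_column",
                     colors = cols,
                     color_seed = 2030)
}
```

Changing color_seed should give a different assignment of the same colors, and switching color_type to per_row would only need four colors here, one per nucleotide.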
Besides the default style with filled symbols for each character, one can also use characters with border styles. For the standard logo plot, this is accomplished by the tofill control argument.
logomaker(seqlogo_example, type = "Logo",
logo_control = list(control = list(tofill= FALSE)), color_seed = 4000)
For an EDLogo plot, the arguments tofill_pos and tofill_neg set the fill style for the positive and the negative axes, respectively.
Logolas allows the user to scale the data based on specified background information, which is passed through the bg argument. The default value is NULL, in which case equal probability is assigned to each symbol. The user can instead supply a vector (equal in length to the number of symbols) giving the background probability of each symbol, which is then assumed to be the same across all columns (sites), or a matrix in which each cell gives the background probability of a symbol at a given position.
First example with bg as a vector.
bg <- c(0.05, 0.90, 0.03, 0.05)
names(bg) <- c("A", "C", "G", "T")
logomaker(seqlogo_example, bg=bg, type = "EDLogo")
Second example with bg as a matrix.
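The matrix form of bg can be sketched as follows. The background values here are made up for illustration (a GC-rich background at the first four positions and a uniform one at the rest), and seqlogo_example is assumed to be the PWM printed earlier. The matrix has the same shape and dimnames as the data, with each column summing to 1. The logomaker() call only runs if Logolas is installed.

```r
symbols <- c("A", "C", "G", "T")
seqlogo_example <- rbind(A = c(0.0, 0.0, 0.0, 0.3, 0.2, 0.0, 0.0, 0.0),
                         C = c(0.8, 0.2, 0.8, 0.3, 0.4, 0.2, 0.8, 0.2),
                         G = c(0.2, 0.8, 0.2, 0.4, 0.3, 0.8, 0.2, 0.8),
                         T = c(0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0))
colnames(seqlogo_example) <- 1:8

# Hypothetical position-specific background (illustrative values only).
gc_rich <- c(0.1, 0.4, 0.4, 0.1)
uniform <- rep(0.25, 4)
bg_mat <- cbind(matrix(gc_rich, nrow = 4, ncol = 4),
                matrix(uniform, nrow = 4, ncol = 4))
dimnames(bg_mat) <- dimnames(seqlogo_example)

# Each column is a probability distribution over the symbols.
stopifnot(all(abs(colSums(bg_mat) - 1) < 1e-12))

if (requireNamespace("Logolas", quietly = TRUE)) {
  Logolas::logomaker(seqlogo_example, bg = bg_mat, type = "EDLogo")
}
```

With this background, enrichment of C or G at positions 1 to 4 is judged against an already GC-rich expectation, so it should appear less pronounced there than at positions 5 to 8.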
Logolas allows the user to perform adaptive scaling of the stack heights in a logo plot based on the number of aligned sequences, using the use_dash argument. This scaling is performed only when the data passed to the logomaker() function is a vector of sequences or a position frequency matrix (PFM). We show an example with and without the use_dash argument.
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT")
logomaker(sequence, use_dash = FALSE, type = "Logo",
logo_control = list(pop_name = "use_dash = FALSE"))
The adaptive scaling is performed by the Dirichlet Adaptive Shrinkage method, the details of which can be viewed at our dashr package.
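The intuition behind scaling by the number of sequences can be shown with a much simpler stand-in. The sketch below uses a fixed symmetric Dirichlet prior and returns the posterior mean of the symbol proportions. This is an assumption made purely for illustration: it is NOT the Dirichlet Adaptive Shrinkage method, which chooses the amount of shrinkage adaptively, and shrink_counts() is a hypothetical helper.

```r
# Simplified illustration only: fixed-prior Dirichlet posterior mean,
# not the adaptive method implemented in dashr.
shrink_counts <- function(counts, prior_strength = 4) {
  k <- length(counts)
  alpha <- rep(prior_strength / k, k)             # symmetric prior centered on uniform
  (counts + alpha) / (sum(counts) + sum(alpha))   # posterior mean proportions
}

few  <- shrink_counts(c(A = 0, C = 8,   G = 2,   T = 0))  # 10 sequences
many <- shrink_counts(c(A = 0, C = 800, G = 200, T = 0))  # 1000 sequences

print(round(few, 3))
print(round(many, 3))
```

Both inputs have the same raw proportions (80% C, 20% G), but the estimate from 10 sequences is pulled noticeably toward uniform while the estimate from 1000 sequences barely moves. A logo built from few sequences therefore gets shorter stacks, reflecting the weaker evidence.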
Logolas allows the user to plot symbols not just for single characters, as in the previous examples, but for any alphanumeric string. We present two examples: one representing a mutation signature and another representing the composition of histone marks.
Histone marks string symbols example
Mutation signature string and character mix example.
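For string symbols, the input is still a matrix with symbols along the rows and positions along the columns; the only difference is that the row names are multi-character strings. The sketch below builds such a matrix with made-up counts. The row and column names are hypothetical and are not the data used in this vignette. The logomaker() call only runs if Logolas is installed.

```r
# Hypothetical count matrix: string symbols in rows, positions/categories in columns.
# All values are made up for illustration.
string_counts <- rbind("H3K4ME1" = c(30,  5, 10),
                       "H3K4ME3" = c( 5, 40, 10),
                       "H3K27AC" = c(10, 10, 35),
                       "H3K9ME3" = c( 5,  5,  5))
colnames(string_counts) <- c("Site1", "Site2", "Site3")
print(string_counts)

if (requireNamespace("Logolas", quietly = TRUE)) {
  # Each row name is drawn as one symbol, as described in the text above.
  Logolas::logomaker(string_counts, type = "Logo")
}
```

Because each row name is treated as a single symbol, a string such as "H3K4ME3" is stretched as one unit to the height of its stack rather than being split into letters.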
The user may want to have distinct colors for distinct symbols. This is where we use the per_symbol option for color_type.
The corresponding EDLogo
Logolas provides a new nomenclature to generate a consensus sequence from a positional frequency (weight) matrix or from a vector of aligned sequences. This is performed by the GetConsensusSeq() function.
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT")
GetConsensusSeq(sequence)
## [1] "C T (Ag) T T (Ac) (TA)"
In the sequence, a position represented by (Ag) would mean enrichment in A and depletion in G at that position. One can input a PWM or PFM matrix with A, C, G and T as row names in the GetConsensusSeq() function as well.
Logolas plots can be plotted in multiple panels, as depicted below.
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT")
Logolas::get_viewport_logo(1, 2, heights_1 = 20)
library(grid)
seekViewport(paste0("plotlogo", 1))
logomaker(sequence, type = "Logo", logo_control = list(newpage = FALSE))
seekViewport(paste0("plotlogo", 2))
logomaker(sequence, type = "EDLogo", logo_control = list(newpage = FALSE))
In the same way, ggplot2 graphics can also be combined with Logolas plots.
While logomaker() takes a PFM, a PWM or a set of aligned sequences as input, sometimes only position specific scores are available to the user. In this case, one can use the logo_pssm() function in Logolas to plot the scoring matrix.
The round_off control argument specifies the number of decimal places shown on the axes of the plot.
The authors would like to acknowledge Oliver Bembom, the author of seqLogo, for acting as an inspiration and providing the foundation on which this package is built. We also thank Peter Carbonetto, Edward Wallace and John Blischak for helpful feedback and discussions.
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggseqlogo_0.1 Logolas_1.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 XVector_0.20.0 knitr_1.20
## [4] magrittr_1.5 zlibbioc_1.26.0 IRanges_2.14.0
## [7] BiocGenerics_0.26.0 munsell_0.4.3 gridBase_0.4-7
## [10] SQUAREM_2017.10-1 colorspace_1.3-2 rlang_0.2.0
## [13] stringr_1.3.0 plyr_1.8.4 tools_3.5.0
## [16] parallel_3.5.0 gtable_0.2.0 htmltools_0.3.6
## [19] yaml_2.1.18 lazyeval_0.2.1 rprojroot_1.3-2
## [22] digest_0.6.15 tibble_1.4.2 RColorBrewer_1.1-2
## [25] ggplot2_2.2.1 S4Vectors_0.18.0 evaluate_0.10.1
## [28] LaplacesDemon_16.1.0 rmarkdown_1.9 stringi_1.1.7
## [31] pillar_1.2.2 compiler_3.5.0 Biostrings_2.48.0
## [34] scales_0.5.0 backports_1.1.2 stats4_3.5.0
Thank you for using Logolas !
If you have any questions, you can either open an issue on our GitHub page or write to Kushal K Dey (kkdey@uchicago.edu). Please also feel free to contribute to the package, either by submitting a pull request or by contacting Kushal directly.