ISAnalytics 1.4.3
In this vignette we’re going to explain in detail how to use functions of the aggregate family, namely:
aggregate_metadata()
aggregate_values_by_key()
ISAnalytics can be installed quickly in different ways: from Bioconductor, or from GitHub using devtools.
There are always 2 versions of the package active:
RELEASE is the latest stable version.
DEVEL is the development version: it is the most up-to-date version, where all new features are introduced.
From Bioconductor, RELEASE version:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("ISAnalytics")
From Bioconductor, DEVEL version:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")
BiocManager::install("ISAnalytics")
From GitHub, RELEASE:
if (!require(devtools)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_14",
    dependencies = TRUE,
    build_vignettes = TRUE
)
From GitHub, DEVEL:
if (!require(devtools)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master",
    dependencies = TRUE,
    build_vignettes = TRUE
)
ISAnalytics has a verbose option that allows some functions to print
additional information to the console while they’re executing.
To disable or enable this feature:
# DISABLE
options("ISAnalytics.verbose" = FALSE)
# ENABLE
options("ISAnalytics.verbose" = TRUE)
Some functions also produce reports in a user-friendly HTML format; to disable or enable this feature:
# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)
# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)
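When these options should be changed for a single call only, a small wrapper can set them and restore the previous values afterwards. The sketch below uses only base R; with_isa_quiet() is a hypothetical helper name chosen for this example and is not part of ISAnalytics.

```r
# Hypothetical helper (not part of ISAnalytics): evaluates an expression with
# verbose output and HTML reports switched off, then restores the old options.
with_isa_quiet <- function(expr) {
    # options() returns the previous values of the options it changes
    old <- options(
        "ISAnalytics.verbose" = FALSE,
        "ISAnalytics.reports" = FALSE
    )
    # Restore the previous values when the function exits, even on error
    on.exit(options(old), add = TRUE)
    # 'expr' is evaluated lazily, so it runs only here, with the new options
    force(expr)
}

# Usage: inside the wrapper the verbose option reads FALSE
inside <- with_isa_quiet(getOption("ISAnalytics.verbose"))
```

Outside the wrapper the options keep whatever value they had before the call, so the global session settings are left untouched.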
We refer to the information contained in the association file as “metadata”.
Sometimes it’s useful to obtain collective information based on a certain
group of variables we’re interested in. The function aggregate_metadata()
does just that: according to the grouping variables, meaning the names of
the columns in the association file to perform a group_by operation with, it
creates a summary. You can fully customize the summary by providing a
“function table” that tells the function which operation should be
applied to which column and what name to give to the output column.
A default is already supplied by default_meta_agg():
#> Loading required package: magrittr
#> # A tibble: 15 × 4
#> Column Function Args Output_colname
#> <chr> <list> <lgl> <chr>
#> 1 FusionPrimerPCRDate <formula> NA {.col}_min
#> 2 LinearPCRDate <formula> NA {.col}_min
#> 3 VCN <formula> NA {.col}_avg
#> 4 ng DNA corrected <formula> NA {.col}_avg
#> 5 Kapa <formula> NA {.col}_avg
#> 6 ng DNA corrected <formula> NA {.col}_sum
#> 7 ulForPool <formula> NA {.col}_sum
#> 8 BARCODE_MUX <formula> NA {.col}_sum
#> 9 TRIMMING_FINAL_LTRLC <formula> NA {.col}_sum
#> 10 LV_MAPPED <formula> NA {.col}_sum
#> 11 BWA_MAPPED_OVERALL <formula> NA {.col}_sum
#> 12 ISS_MAPPED_PP <formula> NA {.col}_sum
#> 13 PCRMethod <formula> NA {.col}
#> 14 NGSTechnology <formula> NA {.col}
#> 15 DNAnumber <formula> NA {.col}
You can either provide purrr-style lambdas (as given in the example above),
or simply specify the name of the function and additional parameters as a
list in a separate column. If you choose to provide your own table, you
should maintain the column names for the function to work properly.
For more details on this, take a look at the function documentation:
?default_meta_agg.
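As an illustration, a custom function table with the same four columns as the default could be assembled as follows. This is a sketch in base R: the two rows and the name custom_table are made up for the example, and whether a plain data frame is accepted in place of a tibble is an assumption to check against ?default_meta_agg.

```r
# Illustrative custom function table; column names mirror the default table.
custom_table <- data.frame(
    Column = c("VCN", "Kapa"),
    Args = NA,
    Output_colname = c("{.col}_avg", "{.col}_max"),
    stringsAsFactors = FALSE
)
# List column holding one purrr-style lambda (a formula) per row
custom_table$Function <- list(
    ~ mean(.x, na.rm = TRUE),
    ~ max(.x, na.rm = TRUE)
)
# Same column order as the default table
custom_table <- custom_table[, c("Column", "Function", "Args", "Output_colname")]

# The {.col} placeholder stands for the input column name, as the default
# output shows (e.g. VCN_avg); here it is expanded by hand for illustration
expanded <- mapply(
    function(col, pattern) sub("{.col}", col, pattern, fixed = TRUE),
    custom_table$Column, custom_table$Output_colname,
    USE.NAMES = FALSE
)
```

Keeping the column names identical to the default table is the part the text above requires; everything else in the sketch is free to change.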
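An association file is normally imported via import_association_file().
If you need more information on the import functions, please view the vignette
“How to use import functions”:
vignette("how_to_import_functions", package="ISAnalytics").
The example below uses the association file bundled with the package:
data("association_file", package = "ISAnalytics")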
aggregated_meta <- aggregate_metadata(association_file = association_file)
#> # A tibble: 20 × 19
#> SubjectID CellMarker Tissue TimePoint FusionPrimerPCRDate_m… LinearPCRDate_m…
#> <chr> <chr> <chr> <chr> <date> <date>
#> 1 PT001 MNC BM 0030 2016-11-03 NA
#> 2 PT001 MNC BM 0060 2016-11-03 NA
#> 3 PT001 MNC BM 0090 2016-11-03 NA
#> 4 PT001 MNC BM 0180 2016-11-03 NA
#> 5 PT001 MNC BM 0360 2017-04-21 NA
#> 6 PT001 MNC PB 0030 2016-11-03 NA
#> 7 PT001 MNC PB 0060 2016-11-03 NA
#> 8 PT001 MNC PB 0090 2016-11-03 NA
#> 9 PT001 MNC PB 0180 2016-11-03 NA
#> 10 PT001 MNC PB 0360 2017-04-21 NA
#> 11 PT002 MNC BM 0030 2017-04-21 NA
#> 12 PT002 MNC BM 0060 2017-05-05 NA
#> 13 PT002 MNC BM 0090 2017-05-05 NA
#> 14 PT002 MNC BM 0180 2017-05-16 NA
#> 15 PT002 MNC BM 0360 2018-03-12 NA
#> 16 PT002 MNC PB 0030 2017-04-21 NA
#> 17 PT002 MNC PB 0060 2017-05-05 NA
#> 18 PT002 MNC PB 0090 2017-05-05 NA
#> 19 PT002 MNC PB 0180 2017-05-05 NA
#> 20 PT002 MNC PB 0360 2018-03-12 NA
#> # … with 13 more variables: VCN_avg <dbl>, ng DNA corrected_avg <dbl>,
#> # Kapa_avg <dbl>, ng DNA corrected_sum <dbl>, ulForPool_sum <dbl>,
#> # BARCODE_MUX_sum <int>, TRIMMING_FINAL_LTRLC_sum <int>, LV_MAPPED_sum <int>,
#> # BWA_MAPPED_OVERALL_sum <int>, ISS_MAPPED_PP_sum <int>, PCRMethod <chr>,
#> # NGSTechnology <chr>, DNAnumber <chr>
ISAnalytics contains useful functions to aggregate the values contained in
your imported matrices based on a key, i.e. a single column or a combination of
columns contained in the association file that are related to the samples.
Integration matrices are normally imported via import_parallel_Vispa2Matrices();
the example below uses the data bundled with the package.
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 1,074 × 11
#> chr integration_locus strand GeneName GeneStrand SubjectID CellMarker
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 1 8464757 - RERE - PT001 MNC
#> 2 1 8464757 - RERE - PT001 MNC
#> 3 1 8607357 + RERE - PT001 MNC
#> 4 1 8607357 + RERE - PT001 MNC
#> 5 1 8607357 + RERE - PT001 MNC
#> 6 1 8607362 - RERE - PT001 MNC
#> 7 1 8850362 + RERE - PT002 MNC
#> 8 1 11339120 + UBIAD1 + PT001 MNC
#> 9 1 11339120 + UBIAD1 + PT001 MNC
#> 10 1 11339120 + UBIAD1 + PT001 MNC
#> Tissue TimePoint seqCount_sum fragmentEstimate_sum
#> <chr> <chr> <dbl> <dbl>
#> 1 BM 0030 542 3.01
#> 2 BM 0060 1 1.00
#> 3 BM 0060 1 1.00
#> 4 BM 0180 1096 5.01
#> 5 BM 0360 330 34.1
#> 6 BM 0180 1702 4.01
#> 7 BM 0360 562 3.01
#> 8 BM 0060 1605 8.03
#> 9 PB 0060 1 1.00
#> 10 PB 0180 1 1.00
#> # … with 1,064 more rows
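As a rough mental model of aggregation by key, the matrix can be joined with the association file on a sample identifier and the values then summed within each group. The sketch below reproduces this idea in base R on toy data; it is not the package's implementation, the toy values are invented, and the sample identifier column name CompleteAmplificationID is an assumption made for the example.

```r
# Toy integration matrix: one row per integration site and sample
toy_matrix <- data.frame(
    chr = c("1", "1", "1"),
    integration_locus = c(100, 100, 200),
    strand = c("+", "+", "-"),
    CompleteAmplificationID = c("s1", "s2", "s3"), # assumed sample id column
    seqCount = c(10, 5, 7),
    stringsAsFactors = FALSE
)
# Toy association file: maps each sample to the key column
toy_af <- data.frame(
    CompleteAmplificationID = c("s1", "s2", "s3"),
    SubjectID = c("PT001", "PT001", "PT002"),
    stringsAsFactors = FALSE
)

# Step 1: attach the key column to every row of the matrix
joined <- merge(toy_matrix, toy_af, by = "CompleteAmplificationID")

# Step 2: sum the values within each (integration site, key) group
toy_agg <- aggregate(
    seqCount ~ chr + integration_locus + strand + SubjectID,
    data = joined,
    FUN = sum
)
# Name the result as "<column>_<function>", like the output above
names(toy_agg)[names(toy_agg) == "seqCount"] <- "seqCount_sum"
```

Samples s1 and s2 share the same site and subject, so their counts collapse into a single row, while s3 stays on its own.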
The function aggregate_values_by_key can perform the aggregation on either a
list of matrices or a single matrix.
The function has several parameters with default values that can be changed according to user preference.
key value
The default key is c("SubjectID", "CellMarker", "Tissue", "TimePoint")
(same default key as the aggregate_metadata function).
agg1 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = c("SubjectID", "ProjectID"),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 9
#> chr integration_locus strand GeneName GeneStrand SubjectID ProjectID
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 1 8464757 - RERE - PT001 PJ01
#> 2 1 8607357 + RERE - PT001 PJ01
#> 3 1 8607362 - RERE - PT001 PJ01
#> 4 1 8850362 + RERE - PT002 PJ01
#> 5 1 11339120 + UBIAD1 + PT001 PJ01
#> 6 1 12341466 - VPS13D + PT002 PJ01
#> 7 1 14034054 - PRDM2 + PT002 PJ01
#> 8 1 16186297 - SPEN + PT001 PJ01
#> 9 1 16602483 + FBXO42 - PT001 PJ01
#> 10 1 16602483 + FBXO42 - PT002 PJ01
#> seqCount_sum fragmentEstimate_sum
#> <dbl> <dbl>
#> 1 543 4.01
#> 2 1427 40.1
#> 3 1702 4.01
#> 4 562 3.01
#> 5 1607 10.0
#> 6 1843 8.05
#> 7 1938 3.01
#> 8 3494 16.1
#> 9 2947 9.04
#> 10 30 2.00
#> # … with 567 more rows
lambda value
The lambda parameter indicates the function(s) to be applied to the
values for aggregation.
lambda must be a named list of either functions or purrr-style lambdas:
if you would like to specify additional parameters to the function,
the second option is recommended.
The only important note on functions is that they should perform some kind of
aggregation on numeric values: in practical terms, they need
to accept a vector of numeric/integer values as input and produce a
SINGLE value as output. Valid options for this purpose might be: sum, mean,
median, min, max and so on.
agg2 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(mean = ~ mean(.x, na.rm = TRUE)),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 8
#> chr integration_locus strand GeneName GeneStrand SubjectID seqCount_mean
#> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 8464757 - RERE - PT001 272.
#> 2 1 8607357 + RERE - PT001 285.
#> 3 1 8607362 - RERE - PT001 851
#> 4 1 8850362 + RERE - PT002 562
#> 5 1 11339120 + UBIAD1 + PT001 321.
#> 6 1 12341466 - VPS13D + PT002 1843
#> 7 1 14034054 - PRDM2 + PT002 1938
#> 8 1 16186297 - SPEN + PT001 699.
#> 9 1 16602483 + FBXO42 - PT001 982.
#> 10 1 16602483 + FBXO42 - PT002 30
#> fragmentEstimate_mean
#> <dbl>
#> 1 2.01
#> 2 8.02
#> 3 2.01
#> 4 3.01
#> 5 2.01
#> 6 8.05
#> 7 3.01
#> 8 3.22
#> 9 3.01
#> 10 2.00
#> # … with 567 more rows
Note that, when specifying purrr-style lambdas (formulas), the first
parameter needs to be set to .x; other parameters can be set as usual.
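To make the .x convention concrete, the lambda ~ mean(.x, na.rm = TRUE) behaves like the ordinary one-argument function in the sketch below. The example uses base R on an invented vector, and the final comparison against purrr::as_mapper() runs only if purrr is installed.

```r
values <- c(542, 1, NA)

# Ordinary function equivalent to the lambda ~ mean(.x, na.rm = TRUE)
mean_no_na <- function(.x) mean(.x, na.rm = TRUE)

plain <- mean(values)         # NA, because of the missing value
robust <- mean_no_na(values)  # missing value dropped before averaging

# An aggregating function must return a single value for a numeric vector
mean_is_single <- length(mean_no_na(values)) == 1
range_is_single <- length(range(values, na.rm = TRUE)) == 1

# If purrr is installed, the formula can be converted and compared directly
if (requireNamespace("purrr", quietly = TRUE)) {
    lam <- purrr::as_mapper(~ mean(.x, na.rm = TRUE))
    stopifnot(identical(lam(values), robust))
}
```

The length check mirrors the single-value requirement stated above: mean passes it, while range returns two values and would not be a suitable aggregating function.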
You can also use functions that produce data frames or lists in lambda.
In this case all variables from the produced data frame will be included
in the final data frame. For example:
agg3 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(describe = ~ list(psych::describe(.x))),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 8
#> chr integration_locus strand GeneName GeneStrand SubjectID
#> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 8464757 - RERE - PT001
#> 2 1 8607357 + RERE - PT001
#> 3 1 8607362 - RERE - PT001
#> 4 1 8850362 + RERE - PT002
#> 5 1 11339120 + UBIAD1 + PT001
#> 6 1 12341466 - VPS13D + PT002
#> 7 1 14034054 - PRDM2 + PT002
#> 8 1 16186297 - SPEN + PT001
#> 9 1 16602483 + FBXO42 - PT001
#> 10 1 16602483 + FBXO42 - PT002
#> seqCount_describe fragmentEstimate_describe
#> <list> <list>
#> 1 <psych [1 × 13]> <psych [1 × 13]>
#> 2 <psych [1 × 13]> <psych [1 × 13]>
#> 3 <psych [1 × 13]> <psych [1 × 13]>
#> 4 <psych [1 × 13]> <psych [1 × 13]>
#> 5 <psych [1 × 13]> <psych [1 × 13]>
#> 6 <psych [1 × 13]> <psych [1 × 13]>
#> 7 <psych [1 × 13]> <psych [1 × 13]>
#> 8 <psych [1 × 13]> <psych [1 × 13]>
#> 9 <psych [1 × 13]> <psych [1 × 13]>
#> 10 <psych [1 × 13]> <psych [1 × 13]>
#> # … with 567 more rows
value_cols value
The value_cols parameter tells the function on which numeric columns
of x the functions should be applied.
Note that every function contained in lambda will be applied to every
column in value_cols: resulting columns will be named as
“original name_function applied”.
agg4 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 10
#> chr integration_locus strand GeneName GeneStrand SubjectID seqCount_sum
#> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 8464757 - RERE - PT001 543
#> 2 1 8607357 + RERE - PT001 1427
#> 3 1 8607362 - RERE - PT001 1702
#> 4 1 8850362 + RERE - PT002 562
#> 5 1 11339120 + UBIAD1 + PT001 1607
#> 6 1 12341466 - VPS13D + PT002 1843
#> 7 1 14034054 - PRDM2 + PT002 1938
#> 8 1 16186297 - SPEN + PT001 3494
#> 9 1 16602483 + FBXO42 - PT001 2947
#> 10 1 16602483 + FBXO42 - PT002 30
#> seqCount_mean fragmentEstimate_sum fragmentEstimate_mean
#> <dbl> <dbl> <dbl>
#> 1 272. 4.01 2.01
#> 2 285. 40.1 8.02
#> 3 851 4.01 2.01
#> 4 562 3.01 3.01
#> 5 321. 10.0 2.01
#> 6 1843 8.05 8.05
#> 7 1938 3.01 3.01
#> 8 699. 16.1 3.22
#> 9 982. 9.04 3.01
#> 10 30 2.00 2.00
#> # … with 567 more rows
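The naming rule can be checked by hand: every name in lambda is appended to every column in value_cols. The short base R sketch below only builds the expected names and does not call the package.

```r
value_cols <- c("seqCount", "fragmentEstimate")
lambda <- list(sum = sum, mean = mean)

# For each value column, append each function name with an underscore
output_names <- unlist(lapply(
    value_cols,
    function(col) paste(col, names(lambda), sep = "_")
))
# output_names is:
# "seqCount_sum" "seqCount_mean" "fragmentEstimate_sum" "fragmentEstimate_mean"
```

With two value columns and two functions this gives the four aggregated columns shown in the output above.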
group value
The group parameter should contain all other variables to include in the
grouping besides key. By default this contains
c("chr", "integration_locus", "strand", "GeneName", "GeneStrand").
You can change this grouping as you see fit; if you don’t want to add any
other variable to the key, just set it to NULL.
agg5 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    group = c(mandatory_IS_vars()),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 8
#> chr integration_locus strand SubjectID seqCount_sum seqCount_mean
#> <chr> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 8464757 - PT001 543 272.
#> 2 1 8607357 + PT001 1427 285.
#> 3 1 8607362 - PT001 1702 851
#> 4 1 8850362 + PT002 562 562
#> 5 1 11339120 + PT001 1607 321.
#> 6 1 12341466 - PT002 1843 1843
#> 7 1 14034054 - PT002 1938 1938
#> 8 1 16186297 - PT001 3494 699.
#> 9 1 16602483 + PT001 2947 982.
#> 10 1 16602483 + PT002 30 30
#> fragmentEstimate_sum fragmentEstimate_mean
#> <dbl> <dbl>
#> 1 4.01 2.01
#> 2 40.1 8.02
#> 3 4.01 2.01
#> 4 3.01 3.01
#> 5 10.0 2.01
#> 6 8.05 8.05
#> 7 3.01 3.01
#> 8 16.1 3.22
#> 9 9.04 3.01
#> 10 2.00 2.00
#> # … with 567 more rows
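The effect of group can be pictured as extending the set of grouping columns: rows are grouped by the group variables together with the key, and with group set to NULL only the key remains. The base R sketch below illustrates this on invented data; sum_by() is a hypothetical helper written for this example and is not how the package is implemented.

```r
toy <- data.frame(
    chr = c("1", "1", "2"),
    integration_locus = c(100, 100, 300),
    strand = c("+", "+", "-"),
    SubjectID = c("PT001", "PT001", "PT001"),
    seqCount = c(10, 5, 7),
    stringsAsFactors = FALSE
)

# Hypothetical helper: sums seqCount over the columns in c(group, key)
sum_by <- function(data, key, group) {
    grouping <- c(group, key) # c(NULL, key) is just key
    aggregate(data["seqCount"], by = data[grouping], FUN = sum)
}

# Grouping by integration site and subject: one row per site
with_group <- sum_by(toy, "SubjectID", c("chr", "integration_locus", "strand"))

# group = NULL: grouping by subject only, all sites collapse into one row
key_only <- sum_by(toy, "SubjectID", NULL)
```

In the toy data the two rows at locus 100 are merged in both cases, while the row at locus 300 is kept separate only when the site columns are part of the grouping.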
R session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.2 (2021-11-01)
#> os Ubuntu 20.04.3 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2022-01-16
#> pandoc 2.5 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.1.2)
#> BiocManager 1.30.16 2021-06-15 [2] CRAN (R 4.1.2)
#> BiocParallel 1.28.3 2022-01-16 [2] Bioconductor
#> BiocStyle * 2.22.0 2022-01-16 [2] Bioconductor
#> bookdown 0.24 2021-09-02 [2] CRAN (R 4.1.2)
#> bslib 0.3.1 2021-10-06 [2] CRAN (R 4.1.2)
#> cli 3.1.0 2021-10-27 [2] CRAN (R 4.1.2)
#> colorspace 2.0-2 2021-06-24 [2] CRAN (R 4.1.2)
#> crayon 1.4.2 2021-10-29 [2] CRAN (R 4.1.2)
#> data.table 1.14.2 2021-09-27 [2] CRAN (R 4.1.2)
#> DBI 1.1.2 2021-12-20 [2] CRAN (R 4.1.2)
#> digest 0.6.29 2021-12-01 [2] CRAN (R 4.1.2)
#> dplyr 1.0.7 2021-06-18 [2] CRAN (R 4.1.2)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.1.2)
#> evaluate 0.14 2019-05-28 [2] CRAN (R 4.1.2)
#> fansi 1.0.2 2022-01-14 [2] CRAN (R 4.1.2)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.1.2)
#> fs 1.5.2 2021-12-08 [2] CRAN (R 4.1.2)
#> generics 0.1.1 2021-10-25 [2] CRAN (R 4.1.2)
#> ggplot2 3.3.5 2021-06-25 [2] CRAN (R 4.1.2)
#> ggrepel 0.9.1 2021-01-15 [2] CRAN (R 4.1.2)
#> glue 1.6.0 2021-12-17 [2] CRAN (R 4.1.2)
#> gtable 0.3.0 2019-03-25 [2] CRAN (R 4.1.2)
#> hms 1.1.1 2021-09-26 [2] CRAN (R 4.1.2)
#> htmltools 0.5.2 2021-08-25 [2] CRAN (R 4.1.2)
#> httr 1.4.2 2020-07-20 [2] CRAN (R 4.1.2)
#> ISAnalytics * 1.4.3 2022-01-16 [1] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.1.2)
#> jsonlite 1.7.2 2020-12-09 [2] CRAN (R 4.1.2)
#> knitr 1.37 2021-12-16 [2] CRAN (R 4.1.2)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.1.2)
#> lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.1.2)
#> magrittr * 2.0.1 2020-11-17 [2] CRAN (R 4.1.2)
#> mnormt 2.0.2 2020-09-01 [2] CRAN (R 4.1.2)
#> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.1.2)
#> nlme 3.1-155 2022-01-13 [2] CRAN (R 4.1.2)
#> pillar 1.6.4 2021-10-18 [2] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.1.2)
#> plyr 1.8.6 2020-03-03 [2] CRAN (R 4.1.2)
#> psych 2.1.9 2021-09-22 [2] CRAN (R 4.1.2)
#> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.1.2)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.1.2)
#> Rcapture 1.4-3 2019-12-16 [2] CRAN (R 4.1.2)
#> Rcpp 1.0.8 2022-01-13 [2] CRAN (R 4.1.2)
#> readr 2.1.1 2021-11-30 [2] CRAN (R 4.1.2)
#> RefManageR * 1.3.0 2020-11-13 [2] CRAN (R 4.1.2)
#> rlang 0.4.12 2021-10-18 [2] CRAN (R 4.1.2)
#> rmarkdown 2.11 2021-09-14 [2] CRAN (R 4.1.2)
#> sass 0.4.0 2021-05-12 [2] CRAN (R 4.1.2)
#> scales 1.1.1 2020-05-11 [2] CRAN (R 4.1.2)
#> sessioninfo * 1.2.2 2021-12-06 [2] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [2] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [2] CRAN (R 4.1.2)
#> tibble 3.1.6 2021-11-07 [2] CRAN (R 4.1.2)
#> tidyr 1.1.4 2021-09-27 [2] CRAN (R 4.1.2)
#> tidyselect 1.1.1 2021-04-30 [2] CRAN (R 4.1.2)
#> tmvnsim 1.0-2 2016-12-15 [2] CRAN (R 4.1.2)
#> tzdb 0.2.0 2021-10-27 [2] CRAN (R 4.1.2)
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.1.2)
#> vctrs 0.3.8 2021-04-29 [2] CRAN (R 4.1.2)
#> xfun 0.29 2021-12-14 [2] CRAN (R 4.1.2)
#> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.1.2)
#> yaml 2.2.1 2020-02-01 [2] CRAN (R 4.1.2)
#> zip 2.2.0 2021-05-31 [2] CRAN (R 4.1.2)
#>
#> [1] /tmp/RtmplowxUB/Rinst3081361f546c34
#> [2] /home/biocbuild/bbs-3.14-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
This vignette was generated using BiocStyle (Oleś, 2022) with knitr (Xie, 2021) and rmarkdown (Allaire, Xie, McPherson, Luraschi, Ushey, Atkins, Wickham, Cheng, Chang, and Iannone, 2021) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.11. 2021. URL: https://github.com/rstudio/rmarkdown.
[2] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[3] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.22.0. 2022. URL: https://github.com/Bioconductor/BiocStyle.
[4] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.37. 2021. URL: https://yihui.org/knitr/.