This vignette describes the procedure to contribute new datasets to the
imcdatasets package and contains guidelines for dataset formatting.

Contributions or suggestions for new imaging mass cytometry (IMC) datasets to
add to the imcdatasets package are always welcome. New datasets can be
suggested by opening an issue at the imcdatasets GitHub page. The only
requirements are that the new dataset (i) is publicly available and (ii)
has been described in a published scientific article.
Details about creating Bioconductor’s ExperimentHub packages are available
here.

The first step is to create a new branch at the imcdatasets GitHub page.
Then, create an R markdown (.Rmd) script in ./inst/scripts/ to generate the
data objects that will be made available through ExperimentHub. The .Rmd
script must be formatted in the same way as pre-existing scripts; the scripts
already present in that directory can serve as examples. Each step should be
clearly and comprehensively documented.
For usability of the package and consistency across datasets, the data
objects must be formatted as described in the Dataset format section below.
Other files in the imcdatasets package should be updated to include the new
dataset:
- Add a ./R/Lastname_Year_Type.R file with a function to load the new
  dataset and extensive documentation. Existing files in ./R/ can serve as
  examples.
- Regenerate the man documentation files (go to the imcdatasets directory and
  run roxygen2::roxygenize(".")).
- Update the ./inst/scripts/make-metadata.R R script and run it. This will
  update the ./inst/extdata/metadata.csv file that is used by ExperimentHub
  to provide metadata information about datasets.
- Update the ./inst/scripts/ref.bib file.
- Update the ./inst/extdata/alldatasets.csv file.
- Update ./tests/testthat/test_loading.R.

After these steps have been completed, open a pull request at the imcdatasets
GitHub page.
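As a rough illustration of the metadata step above, the sketch below builds one metadata row of the kind that make-metadata.R could add before writing ./inst/extdata/metadata.csv. The column names are those generally expected in ExperimentHub metadata files. Every value, including the dataset name Lastname_2024_Type, the URL, and the paths, is a hypothetical placeholder, and the existing rows of metadata.csv remain the reference for the actual layout.

```r
# Hypothetical metadata row for one data object of a new dataset.
# All values are placeholders; follow the conventions of existing rows.
new_row <- data.frame(
  Title = "Lastname_2024_Type - sce",
  Description = "Single cell data for the Lastname_2024_Type IMC dataset",
  BiocVersion = "3.19",
  Genome = NA_character_,
  SourceType = "Zip",
  SourceUrl = "https://example.org/placeholder",
  SourceVersion = NA_character_,
  Species = "Homo sapiens",
  TaxonomyId = 9606L,
  Coordinate_1_based = TRUE,
  DataProvider = "Placeholder institution",
  Maintainer = "First Last <first.last@example.org>",
  RDataClass = "SingleCellExperiment",
  DispatchClass = "Rds",
  RDataPath = "imcdatasets/Lastname_2024_Type/sce.rds",
  stringsAsFactors = FALSE
)

# Written to a temporary file here; the real script writes to
# ./inst/extdata/metadata.csv.
out_file <- tempfile(fileext = ".csv")
write.csv(new_row, file = out_file, row.names = FALSE)

# Read the file back to confirm that it parses as one row per data object.
metadata <- read.csv(out_file, stringsAsFactors = FALSE)
```

A dataset contributes one such row per data object, so a complete entry would typically have rows for the single cell data, the images, and the masks.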
The package maintainers will do the following:
- Upload the data objects to AWS S3 and announce the upload to Bioconductor
  Hubs.
- Retrieve the data objects from ExperimentHub and check the format again.
- Update the NEWS, DESCRIPTION (add new contributor, version bump) and
  README.md (if needed) files.
- […] imcdatasets package can be installed from there.

Contributors will be recognized by having their names added to the
DESCRIPTION file of the imcdatasets package.
The imcdatasets package is meant to provide quick and easy access to
published and curated IMC datasets. Each dataset consists of three data
objects that can be retrieved individually:
- Single cell data, as a SingleCellExperiment object (sce).
- Multichannel images, as a CytoImageList object (images).
- Cell segmentation masks, as a CytoImageList object (masks).

The three data objects can be mapped using unique image_name values contained
in the metadata of each object.

For consistency across datasets, the guidelines below must be followed when
creating a new dataset.
Single cell data should be formatted into a SingleCellExperiment object named
sce that contains the following slots:
- colData: observations metadata.
- rowData: marker metadata.
- assays: marker expression levels per cell.
- colPairs (optional): neighborhood information.

The colData entry of the SingleCellExperiment object is a DataFrame that
contains observation metadata (e.g., for cells, slides, tissues, or
patients). It is recommended that all column names have a prefix that
indicates the level of observation (e.g., cell_, slide_, tissue_, patient_,
tumor_).
The following columns are required:
- image_name and/or image_number: unique image (ablated ROI) name or number,
  respectively. Should map to the image_name/image_number column(s) in the
  metadata of the images and masks objects.
- cell_number: integer representing cell numbers. Should map to the values of
  cell segmentation masks.
- cell_id: a unique cell identifier defined as {image_number_cell_number}
  (e.g., 7_138).
- cell_x and cell_y: position of the cell centroid on the image. These
  columns are used as SpatialCoords when converting to a SpatialExperiment
  object.

In addition, colnames(sce) should be set as colData(sce)$cell_id.
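The required colData columns can be sketched in base R as follows. This is a minimal illustration, not the package's own code: a plain data.frame stands in for colData(sce), and the image names and coordinates are made up.

```r
# Toy observation metadata; in a real dataset this would be colData(sce).
cell_metadata <- data.frame(
  image_name = c("ROI_A", "ROI_A", "ROI_B"),  # hypothetical image names
  image_number = c(7L, 7L, 8L),
  cell_number = c(137L, 138L, 1L),
  cell_x = c(10.5, 42.0, 3.2),
  cell_y = c(88.1, 15.7, 60.4)
)

# cell_id combines image_number and cell_number, e.g. "7_138".
cell_metadata$cell_id <- paste(
  cell_metadata$image_number, cell_metadata$cell_number, sep = "_"
)

# Identifiers must be unique because they become the column names of sce.
stopifnot(!anyDuplicated(cell_metadata$cell_id))

# With a data.frame the identifiers become row names; with a real object the
# equivalent step is colnames(sce) <- colData(sce)$cell_id.
rownames(cell_metadata) <- cell_metadata$cell_id

# Check that the required columns are present.
required <- c("image_name", "cell_number", "cell_id", "cell_x", "cell_y")
missing_cols <- setdiff(required, colnames(cell_metadata))
```

An empty missing_cols vector indicates that all required columns were found.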
The rowData entry of the SingleCellExperiment is a DataFrame that contains
marker (protein, RNA, probe) information.

The following columns are required in the rowData entry:
- channel: a unique integer that maps to the channels of the associated
  multichannel images.
- metal: the metal isotope used for detection, formatted as
  {ChemicalSymbolIsotopeMass} (e.g., Y89, In115, Yb176, Bi209).
- name: marker name used in the publication that describes the dataset.
- full_name: full marker name.
- short_name: abbreviated marker name, preferably following the official
  UniProt nomenclature.

For the full_name and short_name columns, the following guidelines apply:
- In short_name, all dashes, dots and spaces should be removed or replaced
  with underscores.
- For modified targets, prefix full_name with the modification type (e.g.,
  phospho-, methyl-) and suffix it with the modified amino acids (e.g.,
  [S28]).
- Prefix short_name with an abbreviation of the modification type (e.g., p_,
  me_) and do not indicate the modified amino acids, unless there is a
  possible confusion with another target in the dataset.

full_name | short_name |
---|---|
Carbonic anhydrase IX | CA9 |
CD3 epsilon | CD3e |
CD8 alpha | CD8a |
E-Cadherin | CDH1 |
cleaved-Caspase3 + cleaved-PARP | cCASP3_cPARP |
Cytokeratin 5 | KRT5 |
Forkhead box P3 | FOXP3 |
Glucose transporter 1 | SLC2A1 |
Histone H3 | H3 |
phospho-Histone H3 [S28] | p_H3 |
Ki-67 | Ki67 |
Myeloperoxidase | MPO |
Programmed cell death protein 1 | PD_1 |
Programmed death-ligand 1 | PD_L1 |
phospho-Rb [S807/S811] | p_Rb |
Smooth muscle actin | SMA |
Vimentin | VIM |
Iridium 191 | DNA1 |
Iridium 193 | DNA2 |
In addition, rownames(sce) should be set as rowData(sce)$short_name.
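The rowData conventions above can be checked with a few lines of base R. The sketch below is illustrative only: clean_short_name() is a hypothetical helper, a plain data.frame stands in for rowData(sce), and the channel-to-metal assignments are invented, apart from the Iridium 191 and DNA1 pairing taken from the table.

```r
# Hypothetical helper: strip dashes, dots and spaces from a marker label,
# or replace them with a chosen string such as "_".
clean_short_name <- function(x, replacement = "") {
  gsub("[-. ]+", replacement, x)
}

ki67 <- clean_short_name("Ki-67")                      # "Ki67"
pd_l1 <- clean_short_name("PD-L1", replacement = "_")  # "PD_L1"

# Toy marker metadata; in a real dataset this would be rowData(sce).
marker_metadata <- data.frame(
  channel = 1:3,
  metal = c("In115", "Yb176", "Ir191"),
  name = c("Histone H3", "Ki-67", "DNA1"),
  full_name = c("Histone H3", "Ki-67", "Iridium 191"),
  short_name = c("H3", "Ki67", "DNA1")
)

# metal: chemical symbol followed by the isotope mass, e.g. "Yb176".
metal_ok <- grepl("^[A-Z][a-z]?[0-9]+$", marker_metadata$metal)

# channel: unique integers.
channel_ok <- is.integer(marker_metadata$channel) &&
  !anyDuplicated(marker_metadata$channel)

# short_name: no dashes, dots or spaces, and unique because the values
# become row names.
short_name_ok <- !grepl("[-. ]", marker_metadata$short_name) &
  !duplicated(marker_metadata$short_name)

# With a real object the equivalent step is
# rownames(sce) <- rowData(sce)$short_name.
rownames(marker_metadata) <- marker_metadata$short_name
```

The helper only handles punctuation. Choosing the abbreviation itself, such as a UniProt-style name or a p_ prefix, remains a manual curation step.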
The assays slot of the SingleCellExperiment contains counts matrices
representing marker expression levels per cell and channel.
It should at least contain a counts matrix with raw ion counts. The assays
slot can also contain additional matrices with commonly used counts
transformations, or counts transformations that were used in the publication
that describes the dataset. All counts transformations must be documented in
the .R function used to load the dataset. Common examples include:
- exprs: asinh-transformed counts. For IMC, a cofactor of 1 is typically
  used.
- quant_norm: counts censored (e.g., at the 99th percentile) and scaled from
  0 to 1.

Neighborhood information, such as a list of cells that are localized next to
each other, can be stored as a SelfHits object in the colPairs slot of the
SingleCellExperiment object.
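Returning to the counts transformations described for the assays slot, the sketch below shows one way the exprs and quant_norm matrices could be derived from raw counts in base R. It rests on two assumptions that the guidelines do not spell out: censoring and scaling are applied per marker (per row), and scaling is a min-max rescaling. The helper name censor_and_scale() and the simulated counts are illustrative.

```r
set.seed(1)

# Toy counts matrix: markers in rows, cells in columns.
counts <- matrix(
  rpois(20, lambda = 5),
  nrow = 2,
  dimnames = list(c("CD3e", "H3"), paste0("1_", 1:10))
)

# exprs: asinh transformation with a cofactor of 1.
cofactor <- 1
exprs <- asinh(counts / cofactor)

# quant_norm: cap values at a percentile, then rescale to the 0-1 range.
censor_and_scale <- function(x, prob = 0.99) {
  upper <- quantile(x, probs = prob, names = FALSE)
  x <- pmin(x, upper)
  rng <- range(x)
  if (rng[2] == rng[1]) {
    return(rep(0, length(x)))  # constant marker: nothing to scale
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# apply() returns one column per marker, so transpose back to
# markers x cells and restore the dimension names.
quant_norm <- t(apply(counts, 1, censor_and_scale))
dimnames(quant_norm) <- dimnames(counts)
```

With a real object, the resulting matrices would be stored next to the raw counts, for example with assay(sce, "exprs") <- exprs.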
Multichannel images are stored in a CytoImageList object named images.

Channel names of the images object (channelNames(images)) must map to
rownames(sce) (marker short names).

The metadata slot (mcols(images)) must contain an image_name column that
maps to the image_name column of colData(sce), and to the image_name column
of mcols(masks). This information is used by cytomapper to associate
multichannel images, cell segmentation masks, and single cell data.
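The mapping requirements above amount to a few set comparisons. The sketch below uses plain character vectors as stand-ins for the values that rownames(sce), colData(sce)$image_name, channelNames(images), mcols(images)$image_name and mcols(masks)$image_name would return; all names are made up.

```r
# Stand-ins for values extracted from the three data objects.
sce_marker_names <- c("H3", "Ki67", "DNA1")      # rownames(sce)
sce_image_names <- c("ROI_A", "ROI_A", "ROI_B")  # one entry per cell
image_channel_names <- c("H3", "Ki67", "DNA1")   # channelNames(images)
images_image_names <- c("ROI_A", "ROI_B")        # mcols(images)$image_name
masks_image_names <- c("ROI_A", "ROI_B")         # mcols(masks)$image_name

# Channel names must match the marker short names, in the same order.
channels_match <- identical(image_channel_names, sce_marker_names)

# Image names must be unique within the images and masks metadata.
image_names_unique <- !anyDuplicated(images_image_names) &&
  !anyDuplicated(masks_image_names)

# Every cell must belong to a known image, and images and masks must pair up.
cells_have_image <- all(sce_image_names %in% images_image_names)
images_have_mask <- setequal(images_image_names, masks_image_names)
```

All four flags should be TRUE before a dataset is submitted.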
Cell segmentation masks are stored in a CytoImageList object named masks.

The values of the masks should be integers mapping to the cell_number column
of colData(sce). This information is used by cytomapper to associate single
cell data and cell segmentation masks.

The metadata slot (mcols(masks)) must contain an image_name column that
maps to the image_name column of colData(sce), and to the image_name column
of mcols(images). This information is used by cytomapper to associate
multichannel images, cell segmentation masks, and single cell data.
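The correspondence between mask values and cell_number can be verified per image. The sketch below is a toy example in base R: a small integer matrix stands in for one mask, and the value 0 is assumed to denote background, which is a common convention but is not stated in these guidelines.

```r
# Toy segmentation mask for one image: 0 is assumed to be background,
# positive integers label individual cells.
mask <- matrix(
  c(0L, 1L, 1L,
    0L, 2L, 2L,
    3L, 3L, 0L),
  nrow = 3, byrow = TRUE
)

# Stand-in for the cell_number values of that image in colData(sce).
cell_numbers <- c(1L, 2L, 3L)

# Labels present in the mask, excluding the assumed background value.
mask_labels <- setdiff(sort(unique(as.vector(mask))), 0L)

# Both differences should be empty if the mask and the metadata agree.
labels_without_cells <- setdiff(mask_labels, cell_numbers)
cells_without_labels <- setdiff(cell_numbers, mask_labels)
```

Some datasets filter cells after segmentation, in which case labels_without_cells may legitimately be non-empty; such filtering should be documented with the dataset.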
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.35 R6_2.5.1 bookdown_0.39
## [4] fastmap_1.1.1 xfun_0.43 cachem_1.0.8
## [7] knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
## [10] lifecycle_1.0.4 cli_3.6.2 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.0 tools_4.4.0
## [16] evaluate_0.23 bslib_0.7.0 yaml_2.3.8
## [19] BiocManager_1.30.22 jsonlite_1.8.8 rlang_1.1.3