In the era of personalized medicine and personalized health there is a tremendous effort to bring NGS based technology in the clinical routine. Nevertheless, the information coming from research studies is in most cases useless for the clinical practice and in addition too expensive. To bridge the gap between cancer research and clinical genomics, we propose PrecisionTrialDrawer, a suite of simple tools to explore the ability of less expensive cancer gene panels to be used in clinical trials. With this package, you can use information coming from different sources like mutations, expression, copynumber and fusion data and apply them to your cancer panel to see for example how many patients are covered by a specific mutations or better, how many patients would be covered by a drug that acts on multiple targets. Given a panel and tumor types to analyze, PrecisionTrialDrawer can be a terrific tool to create a pilot project for a basket or umbrella design and also to design the library to be ready for sequencing.
The only input data the user has to provide is the panel itself. We create an easy to read standard format, suitable for every kind of alteration supported by PrecisionTrialDrawer. For example, while in the research practice genomic regions are a standard way to represent mutations (e.g. VCF style, 1:10000:A,C) this is not the most used form of representing druggable or pathogenic variants in clinical routine. The use of amino acid notation (BRAF V600E is sensitive to Vemurafenib) or dbsnp rs numbers (for example in OMIM database) is way more common but these are formats incompatible with NGS genomic format. With our tool, every kind of notation is accepted so that you can easily integrate any database at your disposal without any particular effort.
First of all, let’s see what is the format of a cancer panel:
library(PrecisionTrialDrawer)
library(knitr)
data(panelexample)
knitr::kable(panelexample)
drug | gene_symbol | alteration | exact_alteration | mutation_specification | group |
---|---|---|---|---|---|
Idelalisib | PIK3CA | SNV | Actionable | ||
Idelalisib | PIK3CA | CNA | amplification | Actionable | |
Trastuzumab | ERBB2 | CNA | amplification | Actionable | |
Trastuzumab | ERBB2 | expression | up | Actionable | |
Vemurafenib | BRAF | SNV | amino_acid_variant | V600E | Actionable |
Vemurafenib | BRAF | SNV | amino_acid_variant | V600K | Actionable |
Olaparib | BRCA1 | CNA | deletion | Actionable | |
Olaparib | BRCA1 | SNV | Actionable | ||
Vandetanib | ERBB2 | SNV | Actionable | ||
Vandetanib | ABL1 | SNV | Actionable | ||
Vandetanib | KIT | SNV | Actionable | ||
Vandetanib | RET | fusion | Actionable | ||
Erlotinib | EGFR | SNV | mutation_type | missense | Actionable |
Crizotinib | EML4__ALK | fusion | Actionable | ||
no_drug | KRAS | SNV | amino_acid_position | 10-14 | Driver |
no_drug | KRAS | SNV | genomic_position | 12:25380273-25380280 | Driver |
no_drug | KRAS | SNV | dbSNP_rs | rs121913528 | Driver |
no_drug | KRAS | SNV | genomic_variant | 12:25378562:C,T | Driver |
no_drug | TP53 | SNV | mutation_type | truncating | Driver |
no_drug | TP53 | CNA | deletion | Driver |
A cancer panel is a data.frame of character columns and every row represents an alteration and its properties. We put together a panel using known specific drug targets (like Vemurafenib for BRAF V600E/K) or pathway targeting drugs (like kinases inhibitor Vandetanib). We also add known driver/prognostic genes with no associated drug (like KRAS or TP53). You have to bear in mind that the panel is both what you target and what you analyze. If you ask for a BRAF V600E alteration, you will sequence 3 bases but only V600E will have the characteristic of being druggable and will be considered in simulation analysis. In other words, by putting V600E you occupies 3 bases in your panel but in simulations only a change from valine to glutamic acid will be considered druggable.
We now present a list of possible alteration formats that must be respected. Possible values for a cancer panel in columns alteration, exact_alteration and mutation_specification are:
|alteration |exact.alteration |mutation_specification | |:———-|:——————-|:———————-| |CNA |amplification | | |CNA |deletion | | |expression |up | | |expression |down | | |fusion | | | |SNV | | | |SNV |mutation_type |missense | |SNV |mutation_type |truncating | |SNV |amino_acid_position |300-300 | |SNV |amino_acid_position |300-350 | |SNV |amino_acid_variant |V600E | |SNV |genomic_position |13:20000-40000 | |SNV |genomic_position |13:20000-20000 | |SNV |genomic_variant |13:20000:A,C | |SNV |dbSNP_rs |rs1234567 |
Copynumber and expression are basically self explained. Mutations can be
expressed in a plethora of ways but remember to keep the 1-base convention and
in case of single position (amino acid or genomic) to replicate start and end
(e.g. 300-300). If you don’t want to specify a particular mutation, you can
either leave exact_alteration and mutation_specification empty or specify a
subtype of alteration, like only missense or only truncating variants. In the
example panel, BRCA1 is sensitive to a PARP inhibitor only if it is either
deleted or truncated, because generally missense mutations are not considered
pathogenic. Fusions have no particular format, but are expressed in the
genesymbol column. If you put a single gene, r Rpackage("PrecisionTrialDrawer")
will interpret it as ‘every fusion involving
that gene’ (see RET), while if you put a specific fusion (see EML4__ALK),
it will look only for that specific fusion. The format for specific fusions
must be always gene1__gene2.
Now that we have our panel, we need to build an object of class CancerPanel:
mypanel <- newCancerPanel(panelexample)
## Checking panel construction...
## Calculating panel size...
## Connecting to ensembl biomart...
The class constructor performs two main tasks. The first one is to check for inconsistency in panel data.frame. The second is to calculate a genomic length for every row of the panel. CNA and expression alteration will have a length equal to the CDS or CDS plus UTR of the requested gene. Fusions follow the same rule, but when a specific fusion is requested, the length is equal to the mean of two genes involved. Finally, mutations will take a length equal to the exact alteration requested plus a padding_length that can be decided by the user. Note that this calculation does not take into account overlapping regions as this is just a rough estimation for simulations in which every row counts for itself. For a precise calculation, refer to the last section PanelDesigner.
Gene length calculations are based on biomaRt and ensembl database which is located in the UK. For your convenience, we add the possibility of changing the host and go to a faster mirror located closer to your location.
# Connect to US east coast server
# The default is www.ensembl.org
mypanel <- newCancerPanel(panelexample , myhost="uswest.ensembl.org")
A convenient list of mirror servers can be found here: ensembl mirrors. The same option is also available for panelDesigner method.
There are specific situations where not every tumor in the panel can be associated with a drug. Such cases include known drug resistance or simply drugs that are approved only for specific tumor types. We add a parameter called rules that define a set of modifiers of associations between drug and tumor types. For example we can create a data.frame like the following:
rules <- data.frame(
drug=c("Erlotinib" , "Erlotinib", "Erlotinib","Erlotinib","Olaparib")
, gene_symbol=c("EGFR" , "KRAS", "" , "", "")
, alteration=c("SNV" , "SNV", "" , "", "")
, exact_alteration=c("amino_acid_variant" , "", "" , "", "")
, mutation_specification=c("T790M" , "" , "" , "", "")
, group=c("Driver" , "Driver", "Driver" , "Driver", "Driver")
, tumor_type=c("luad" , "luad" , "luad" , "coadread","brca")
, in_out=c("exclude" , "exclude" , "include" , "include" , "include")
, stringsAsFactors = FALSE
)
knitr::kable(rules)
drug | gene_symbol | alteration | exact_alteration | mutation_specification | group | tumor_type | in_out |
---|---|---|---|---|---|---|---|
Erlotinib | EGFR | SNV | amino_acid_variant | T790M | Driver | luad | exclude |
Erlotinib | KRAS | SNV | Driver | luad | exclude | ||
Erlotinib | Driver | luad | include | ||||
Erlotinib | Driver | coadread | include | ||||
Olaparib | Driver | brca | include |
It can be added directly when the object is constructed or later on during subsetAlteration.
mypanel_exclude <- newCancerPanel(panelexample
, myhost="uswest.ensembl.org"
, rules = rules)
We developed two type of possible rules described below.
knitr::kable(rules[1:2 , ])
drug | gene_symbol | alteration | exact_alteration | mutation_specification | group | tumor_type | in_out |
---|---|---|---|---|---|---|---|
Erlotinib | EGFR | SNV | amino_acid_variant | T790M | Driver | luad | exclude |
Erlotinib | KRAS | SNV | Driver | luad | exclude |
The first two rows implement a drug resistance rule. Both alterations (EGFR T790M and KRAS mutations) can cause resistance to EGFR inhibitors and all the patients harboring these alterations cannot be associated with that treatment.
The data.frame columns are exactly the same as the panel but in this case the drug is a drug that cannot be assigned to all the patients with the specified alteration. The group column is instead what the new couple sample-drug should be considered. In our case, Actionable becomes Driver.
An extra column specifying the tumor_type in which the rule is applied is added. If the rule is valid for any tumor type, simply put "" (empty string).
in_out define the type of rule. In this case is an exclusion (in_out=exclude) rule so any patient harboring that mutation cannot be associated with Erlotinib. Also inclusion only rules can be implemented using in_out=include.
knitr::kable(rules[3:5 , ])
drug | gene_symbol | alteration | exact_alteration | mutation_specification | group | tumor_type | in_out | |
---|---|---|---|---|---|---|---|---|
3 | Erlotinib | Driver | luad | include | ||||
4 | Erlotinib | Driver | coadread | include | ||||
5 | Olaparib | Driver | brca | include |
Not every drug can be associated with all tumor types. Certain treatments (e.g. FDA approved) are only available for specific tumor types. We implemented a second set of rules in the rows 3,4 and 5.
The interpretation is: drugA can be used only in (include) or not used in (exclude) the specific tumor_type and it will be associated with group Driver.
Multiple include on the same drug are cumulative and every tumor type require a new row (Erlotinib can be used only in coadread and luad).
These rules are universal and do not depend on the data or the panel. In the example we are going to develop, coadread is not included but the rule can stay.
Now that we have our panel, we want to test it against tumor samples. We will choose two different cancer types among the available ones:
# Check available tumor types
knitr::kable( head(showTumorType()) , row.names=FALSE)
# Fetch data from breast cancer and lung adenocarcinoma samples
mypanel <- getAlterations(mypanel , tumor_type=c("brca" , "luad"))
# See the updated slot dataFull for mutations
str( cpData(mypanel)$mutations , vec.len = 1)
tumor_type | name |
---|---|
acbc | Adenoid Cystic Carcinoma of the Breast |
acc | Adenoid Cystic Carcinoma Project|Adrenocortical Carcinoma |
acyc | Adenoid Cystic Carcinoma |
all | Acute Lymphoblastic Leukemia|Hypodiploid Acute Lymphoid Leukemia|Pediatric Acute Lymphoid Leukemia - Phase II |
aml | Acute Myeloid Leukemia|Pediatric Acute Myeloid Leukemia |
ampca | Ampullary Carcinoma |
## List of 2
## $ data :'data.frame': 7095 obs. of 9 variables:
## ..$ entrez_gene_id : int [1:7095] 673 673 ...
## ..$ gene_symbol : chr [1:7095] "BRAF" ...
## ..$ case_id : chr [1:7095] "MB-0599" ...
## ..$ mutation_type : chr [1:7095] "Missense_Mutation" ...
## ..$ amino_acid_change : chr [1:7095] "V600E" ...
## ..$ genetic_profile_id: chr [1:7095] "brca_metabric" ...
## ..$ tumor_type : chr [1:7095] "brca" ...
## ..$ amino_position : num [1:7095] 600 469 ...
## ..$ genomic_position : chr [1:7095] "7:140453136:A,T" ...
## $ Samples:List of 2
## ..$ brca: chr [1:4221] "MB-0002" ...
## ..$ luad: chr [1:965] "LU-A08-43" ...
This method performs 2 tasks. Download data from cBioportal and MD Anderson Fusion Portal for every alteration type for the requested genes and retrieve the total number of samples for each available data type.
In case you are interested into requesting data for every tumor type,
it is possible to assign “all_tumors” to the parameter
tumor_type getAlterations(mypanel , tumor_type="all_tumors")
Additionally, instead of specifying the tumor type, it is also possible to display the individual study associated with each tumor type and select them one by one. This for example allows the user more control over what dataset will be used to run the simulation.
# Check available tumor types
knitr::kable( showCancerStudy(c("brca","luad")) , row.names = FALSE )
# Fetch data from breast cancer and lung adenocarcinoma samples
mypanel_alternative <- getAlterations(mypanel
, tumor_type=c("luad_tcga_pan_can_atlas_2018"
, "brca_tcga_pan_can_atlas_2018"))
At this point, the data are downloaded by gene, without looking for the exact alteration requested. If you want, you can always put your own data, as long as you respect the format for each alteration type.
# Extract the data from the panel as a toy example.
repos <- lapply(cpData(mypanel) , '[' , c('data' , 'Samples'))
# How custom data should look like
str(repos , vec.len=1)
## List of 4
## $ fusions :List of 2
## ..$ data :'data.frame': 6 obs. of 7 variables:
## .. ..$ tumor_type: chr [1:6] "brca" ...
## .. ..$ case_id : chr [1:6] "TCGA-C8-A1HJ" ...
## .. ..$ Gene_A : chr [1:6] "ERC1" ...
## .. ..$ Gene_B : chr [1:6] "RET" ...
## .. ..$ FusionPair: chr [1:6] "ERC1__RET" ...
## .. ..$ tier : chr [1:6] "tier1" ...
## .. ..$ frame : chr [1:6] "In-frame" ...
## ..$ Samples:List of 2
## .. ..$ brca: chr [1:1019] "TCGA-A1-A0SB" ...
## .. ..$ luad: chr [1:487] "TCGA-05-4244" ...
## $ mutations :List of 2
## ..$ data :'data.frame': 7095 obs. of 9 variables:
## .. ..$ entrez_gene_id : int [1:7095] 673 673 ...
## .. ..$ gene_symbol : chr [1:7095] "BRAF" ...
## .. ..$ case_id : chr [1:7095] "MB-0599" ...
## .. ..$ mutation_type : chr [1:7095] "Missense_Mutation" ...
## .. ..$ amino_acid_change : chr [1:7095] "V600E" ...
## .. ..$ genetic_profile_id: chr [1:7095] "brca_metabric" ...
## .. ..$ tumor_type : chr [1:7095] "brca" ...
## .. ..$ amino_position : num [1:7095] 600 469 ...
## .. ..$ genomic_position : chr [1:7095] "7:140453136:A,T" ...
## ..$ Samples:List of 2
## .. ..$ brca: chr [1:4221] "MB-0002" ...
## .. ..$ luad: chr [1:965] "LU-A08-43" ...
## $ copynumber:List of 2
## ..$ data :'data.frame': 16568 obs. of 5 variables:
## .. ..$ gene_symbol: chr [1:16568] "PIK3CA" ...
## .. ..$ CNA : chr [1:16568] "normal" ...
## .. ..$ case_id : chr [1:16568] "MB-0000" ...
## .. ..$ tumor_type : chr [1:16568] "brca" ...
## .. ..$ CNAvalue : chr [1:16568] "0" ...
## ..$ Samples:List of 2
## .. ..$ brca: chr [1:3626] "MB-0002" ...
## .. ..$ luad: chr [1:516] "TCGA-05-4249" ...
## $ expression:List of 2
## ..$ data :'data.frame': 1608 obs. of 5 variables:
## .. ..$ gene_symbol : chr [1:1608] "ERBB2" ...
## .. ..$ expression : chr [1:1608] "normal" ...
## .. ..$ case_id : chr [1:1608] "TCGA-05-4244" ...
## .. ..$ tumor_type : chr [1:1608] "luad" ...
## .. ..$ expressionValue: num [1:1608] 1.79 ...
## ..$ Samples:List of 2
## .. ..$ brca: chr [1:1093] "TCGA-LQ-A4E4" ...
## .. ..$ luad: chr [1:515] "TCGA-05-4249" ...
# Use custom data in a new CancerPanel object
mypanel_toy <- newCancerPanel(panelexample)
mypanel_toy <- getAlterations(mypanel_toy , repos=repos)
In this toy example, we extract the data from our CancerPanel object and transfer them to a new object. This operation can be useful when you want to test subsets of your original panel using a freeze of your data. Remember that sample names are taken from the element ‘Sample’ of the list for each alteration type because our tool needs to know the actual reference samples for which a genomic screening were made, even if they have no alterations. If ‘Sample’ is not present, the sample names are retrieved directly from the data using column case_id. Adding more data than necessary is not a problem at this point, because every statistics and calculation are made after the call to subsetAlterations.
In case you want to add extra data from a different source, and you dispose of the data already according to the @dataFull slot format, you can append this dataset to an existing panel. This allows to increase the sample size for those rare tumor_types that are not present in cBioPortal.
# Add two mutations from two different samples for breast cancer
newmutations <- data.frame(
entrez_gene_id=c(7157 , 7157)
,gene_symbol=c("TP53" , "TP53")
,case_id=c("new_samp1" , "new_samp2")
,mutation_type=c("Nonsense_Mutation" , "Nonsense_Mutation")
,amino_acid_change=c("R306*" , "Y126*")
,genetic_profile_id=c("mynewbreaststudy" , "mynewbreaststudy")
,tumor_type=c("brca" , "brca")
,amino_position=c(306 , 126)
,genomic_position=c("17:7577022:G,A" , "17:7578552:G,C")
,stringsAsFactors=FALSE
)
newsamples <- c("new_samp1" , "new_samp2")
# A dataFull slot style list should look like this
toBeAdded <- list(fusions=list(data=NULL , Samples=NULL)
, mutations=list(data=newmutations
, Samples=list(brca=newsamples))
, copynumber=list(data=NULL , Samples=NULL)
, expression=list(data=NULL , Samples=NULL)
)
new_panel <- appendRepo(mypanel_toy , toBeAdded)
Furthermore, if you want to filter certain mutations
or fusions from your dataset,
you can use the methods filterMutations
and
filterFusions
. The first method, can be used
with specific mutations or using a bed file.
# Create a bedfile
bed <- data.frame(chr = paste0("chr" , c(7 , 17))
, start = c(140534632 , 41244326)
, end = c(140534732 , 41244426)
, stringsAsFactors=FALSE)
knitr::kable(bed , row.names=FALSE)
# Apply the filter
# You can decide to exclude or keep only the mutations in the bed file
new_panel <- filterMutations(mypanel_toy , bed = bed , mode = "exclude")
# Filtering can be also tumor-wise
new_panel <- filterMutations(mypanel_toy , bed = bed
, mode = "exclude" , tumor_type="brca")
After getting alterations, we have to subset alterations for the exact specifications of our panel.
# Subset the alterations using panel information
mypanel <- subsetAlterations(mypanel)
# See the updated slot dataSubset
str( cpDataSubset(mypanel) , vec.len = 1)
As we can see, the format of the new slot is now the same for every alteration type and the information contained in the panel was merged. This format is necessary to perform our simulations since we can easily bind together these datasets.
If the user provides an exclude data.frame (either at the object construction or here), all the patients excluded from certain treatment opportunities will be stored in the excluded slot. See Negation Rules section above.
The druggability parameter can be used in this context too.
For example we can use the data already downloaded and use them in a panel with an exclude and druggability parameter:
mypanel_exclude <- mypanel
mypanel_exclude <- subsetAlterations(mypanel_exclude
, rules=rules)
PrecisionTrialDrawer has three different plot capabilities/statistics. Every plot can be turned off by using noPlot=TRUE option and statistics are reported instead. The first plot is the coveragePlot, which returns the absolute number of samples covered by at least one or more alterations. A more compact but less informative version of the same plot is the coverageStackPlot, which returns a breakdown of variable (e.g. drug) by its subcomponents (e.g. genes). The second one is called saturationPlot, and expresses the cumulative number of samples covered or the mean number of alteration per sample by adding one piece of the panel at a time. In particular, this plot is able to estimate the trade-off between the number of covered samples and the genomic space occupied by the panel (as a sort of proxy of cost). The third plot is called coocMutexPlot and is able to explore the relationship between pairs of features as mutual exclusive or cooccurent. For example, in a typical umbrella design, we seek for drugs or genes that are possibly mutual exclusive between each other, in order to maximize the number of covered patients and avoid confusing multiple targeting.
All plots accept a CancerPanel object and a grouping
variable to see the plot stratified
by gene or tumor type. Furthermore, you can decide the kind
of alteration to plot, between mutations, copynumber, fusions,
expression or any combination of these.
Let’s draw the simplest of the coverage plot. We want to see how many samples are covered by at least one of the alteration of our panel, divided by tumor type:
coveragePlot(mypanel
, alterationType=c("mutations" , "expression" , "copynumber" , "fusions")
, grouping="tumor_type")
If we do not want the plot to be displayed but just the statistics, we can add the parameter noPlot set to TRUE.
# Same command, but we ask for no plot, just the statistics
covStats <- coveragePlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, noPlot=TRUE)
# Proportion of covered samples by at least one alteration
atLeastOne <- covStats$plottedTable[ , 1]/covStats$Samples
The results appears to be extremely good, because we reached 59% of breast samples and 67% for lung adenocarcinoma samples.
How many of these samples are actually actionable by at least one alteration? Since we use ‘group’ in our panel to divide druggable from non-druggable alterations, it is pretty easy to answer this question.
# We add a new grouping variable, 'group'
coveragePlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping=c("tumor_type" , "group")
)
Only the 53% of breast samples are actionable and 37% of lung adenocarcinomas.
Some of the drugs, like Vandetanib, are multitarget that acts on several kinases. So now we are interested in knowing what are the most altered genes for each drug. We could add one more grouping variable to the coveragePlot but we will end up with a very hard to read plot (one plot per gene). To overcome this difficulty, we implemented a more compact version of the coverage plot that only returns the number of samples with at least one alteration stratified for a ‘grouping’ variable.
# the stacked coverage plot requires:
# 'var' parameter (that selects the bars)
# 'grouping' parameter (that selects the stratification variable, optional)
# 'tumor_type' parameter (to select one or more tumor type, optional)
par(mfrow=c(2,1))
coverageStackPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, var="drug"
, grouping="gene_symbol"
, tumor_type="brca"
)
coverageStackPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, var="drug"
, grouping="gene_symbol"
, tumor_type="luad"
)
The stacked version of coverage plot reports a breakdown of each drug by gene and then the total samples covered by each drug. When the total is lower than the breakdown, it means that some of the samples are altered in more than one gene targeted by the same drug.
From this plot we can show for example that olaparib covers, in both tumors the 10% of the samples (4% for breast and 4% for lung). The contribute of its target BRCA1 and BRCA2 is different. While in breast cancer the number of altered samples is splitted between the two genes, in lung the contribution of BRCA2 is predominant ( BRCA1 covers the 4% while BRCA1 the NA% ).
Furthermore, a multitarget drug like Vandetanib could be a valuable therapeutic option, especially in lung adenocarcinoma. It covers the 5% of the samples with a large contribution of KDR ( NA%) and ABL2 (NA%) that is not present in breast cancer.
If the plot result is not clear, we can create an interactive version of the same graph.
# Add parameter html=TRUE
myHTMLPlot <- coverageStackPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, var="drug"
, grouping="gene_symbol"
, html=TRUE)
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
## Warning in type.convert.default(gsub("px", "", options[[x]])): 'as.is' should be
## specified by the caller; using TRUE
# The plot is not displayed immediately to allow the user
# to embed it in markdown document or web page
# Now run plot
plot(myHTMLPlot)
Implementing negation rules can significantly alter the number of subject that can be potentially treated by a drug. In this example, we will see the effect of the rules parameter on Erlotinib drug in the lung dataset.
# First plot with exclude activated, second without
par(mfrow=c(2,1))
coverageStackPlot(mypanel_exclude
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, var="drug"
, grouping=NA
, tumor_type="luad"
)
coverageStackPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, var="drug"
, grouping=NA
, tumor_type="luad"
)
As you can see, ~ 1% of the patients can no longer be potentially treated with Erlotinib because of specific resistant mutations. Furthermore, Olaparib association was completely eliminated by the druggability parameter.
Our panel is performing very well on the absolute number of covered samples. What about its construction? With the next analysis, we will try to answer to questions related to each element of our panel and try to optimize it. In particular we want to ask:
To answer these questions, let’s draw a saturation plot.
# Saturation plot by adding one gene at a time divided by tumor type
saturationPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, adding="gene_symbol"
)
The plot is quite simple to interpret. On the X axis we have genomic space, on the Y axis the mean number of alteration per sample. The plot is build by adding one gene at a time (adding parameter), using all alteration types (alterationType parameter) and drawing one curve per tumor type (grouping parameter). The genes are added by their frequency of alteration, starting from the most altered. As can be seen from the bottom right annotation, some of the genes are never altered in these tumor types. To find out who they are, we can run this simple lines of code:
# As always, we can ask for the statistics with the option noPlot
satStats <- saturationPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, adding="gene_symbol"
, noPlot=TRUE
)
# Retrieve all the genes of the panel
panelGenes <- unique( cpArguments(mypanel)$panel$gene_symbol )
# Look for the ones that are not present in the plot with a simple setdiff
missingGenes <- sapply( c("brca" , "luad") , function(x) {
setdiff( panelGenes
, satStats[ satStats$grouping==x , "gene_symbol2" ] )
})
# Print out
missingGenes
## $brca
## [1] "BRAF" "EML4__ALK"
##
## $luad
## character(0)
As we can see, HRAS is never altered in the positions requested in the panel, both in breast and lung cancer samples. We can think about remove it from our panel and save some genomic space. This plot can also be seen from the drug point of view, by changing the variable adding:
# Saturation plot by adding one drug at a time divided by tumor type
saturationPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, adding="drug"
, y_measure="absolute"
)
In this case, we plot the absolute number of cumulative covered patient (like in the coverage plot) by using the parameter y_measure set to “absolute”. We can notice that by using just Idelasib (target of PIK3CA) and Trastuzumab (target of ERBB2), we can cover almost the 50% of breast cancer patients. Moreover, two drugs have no available target and could be excluded from a potential trial. On the other hand, lung samples are predominantly mutated in KRAS and TP53 and since these two genes are not targeted by any drug, the first “drug” category is indeed no_drug. In addition, we can appreciate a very high coverage for LUAD samples, but the saturation is reached at the point of Idelalisib/Olaparib and all other drugs increase the coverage of 1-2%. Olaparib itself, occupies almost 25 Kbases alone with its targets BRCA1/2, which are very long genes.
The information coming from the saturationPlot is useful but not complete. In fact, the order of the genes/drugs can be misleading about the true coverage of a single feature. For example, let’s plot the coverage by drug and tumor type:
# One tumor at a time, we draw a stack plot of drug without stratification
par(mfrow=c(2,1))
coverageStackPlot(mypanel , var="drug"
, grouping=NA , tumor_type="brca" , collapseByGene=TRUE)
coverageStackPlot(mypanel , var="drug"
, grouping=NA , tumor_type="luad" , collapseByGene=TRUE)
In this case we add the parameter collapseByGene. This option is useful to prevent the package from counting two different types of alteration on the same gene and same sample to be counted twice. For example, if ERBB2 is both amplified and overexpressed (two things that are extremely correlated) in the same sample, it will count for 1 alteration only. This parameter has an effect on the bars from 2 to 10, but obviously not on the ‘at least one’ alteration bar.
From this plot we can see for example that Olaparib (a target of BRCA1/2) accounts for almost the 11% of breast cancer samples, while in the saturationPlot its cumulative contribute appeared to be between 5-6%. This is where the coocMutexPlot comes handy.
# coocMutexPlot of drug divided by tumor type
coocMutexPlot(mypanel
, alterationType=c("mutations" , "expression" , "copynumber" , "fusions")
, grouping="tumor_type"
, var="drug")
This plot represents significant mutual exclusivity and cooccurence between couples of drugs. In particular, it tests whether all the alterations targeted by a drug tends to appear in different patients or together in the same set of patients. For example, we can appreciate that Olaparib is mutual exclusive with Idelalisib, so the drugs are active on different set of patients. This is particular useful to know for balancing a basket trial design. The more two drugs are mutual exclusive, the larger is the set of druggable patients.
The very same plot, using genes instead of drugs, shows a mutual exclusivity on both BRCA1 and BRCA2 against PIK3CA.
# coocMutexPlot of genes divided by tumor type
# To avoid plotting gene pairs with no significance, we set parameter plotrandom
coocMutexPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, var="gene_symbol"
, plotrandom=FALSE
)
Cooccurence and mutual exclusivity can also be seen from the prospective of similarity, rather than pair-wise mutex. For this purpose, we developed an alternative style for the plot that exploits binary distance between occurrences of alterations between genes or drugs.
# coocMutexPlot of drugs divided by tumor type
# Style set to dendro
coocMutexPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, var="drug"
, style="dendro"
)
Again, if you want just the statistics and no plot, set the parameter noPlot to TRUE.
# coocMutexPlot of genes divided by tumor type
coocMutexStats <- coocMutexPlot(mypanel
, alterationType=c("mutations" , "expression"
, "copynumber" , "fusions")
, grouping="tumor_type"
, var="drug"
, noPlot=TRUE
)
knitr::kable(head(coocMutexStats) , digits = 3)
sp1_name | sp2_name | pVal.MutEx | pVal.Cooc | OR | grouping |
---|---|---|---|---|---|
Erlotinib | Idelalisib | 0.730 | 0.473 | 1.215 | brca |
Erlotinib | Olaparib | 0.529 | 1.000 | 0.728 | brca |
Erlotinib | Trastuzumab | 0.870 | 0.328 | 1.815 | brca |
Erlotinib | Vandetanib | 0.978 | 0.129 | 4.321 | brca |
Erlotinib | no_drug | 0.898 | 0.280 | 2.006 | brca |
Idelalisib | Olaparib | 0.110 | 0.940 | 0.646 | brca |
Idelalisib and Olaparib are mutual exclusive with a p-value of 0.11.
PrecisionTrialDrawer is designed to allow the user to perform
a lot of tasks for designing a genomic based trial.
Nevertheless, not all the possible plots are available inside the package.
We implemented a function to extract the relevant information
for any kind of simulation, called dataExtractor
.
#Extract mutations from BRCA tumor type
myData <- dataExtractor(mypanel , alterationType = "mutations"
, tumor_type = "brca")
# Check the format
str(myData)
## List of 3
## $ data :'data.frame': 2355 obs. of 6 variables:
## ..$ drug : chr [1:2355] "no_drug" "no_drug" "no_drug" "no_drug" ...
## ..$ group : chr [1:2355] "Driver" "Driver" "Driver" "Driver" ...
## ..$ gene_symbol : chr [1:2355] "KRAS" "KRAS" "KRAS" "KRAS" ...
## ..$ tumor_type : chr [1:2355] "brca" "brca" "brca" "brca" ...
## ..$ case_id : chr [1:2355] "TCGA-AR-A1AS" "PD4135a" "TCGA-AR-A1AL" "TCGA-E2-A1IF" ...
## ..$ alteration_id: chr [1:2355] "mut" "mut" "mut" "mut" ...
## $ Samples :List of 2
## ..$ brca : chr [1:4221] "MB-0002" "MB-0005" "MB-0006" "MB-0010" ...
## ..$ all_tumors: chr [1:4221] "MB-0002" "MB-0005" "MB-0006" "MB-0010" ...
## $ tumor_not_present: chr(0)
A data extraction is therefore a list composed by data, Samples divided by tumor type and a character vector of tumors that are not present under the requested parameters. For example, we could ask how many samples are targeted by 0, 1, 2, 3 drugs. This function is not directly implemented in our package and so we can build it up here.
#Extract drug samples couples
dataDrug <- unique(myData$data[ , c("case_id" , "drug")])
# Retrieve all screened samples
samps <- myData$Samples$all_tumors
# Calculate number of drugs per patient (we consider also non altered samples)
drugsPerPatient <- table( factor(dataDrug$case_id , levels = samps) )
# Now in terms of frequency
drugsPerPatientFreq <- table(drugsPerPatient)/length(samps)
# Ready to plot
barplot(drugsPerPatientFreq , ylim = c(0,1)
, main="Patients with 0, 1, 2... molecular targets"
, col="darkcyan")
While copynumber, expression and fusions require reading the whole gene, mutations can be targeted with high sensitivity. For particularly large genes, according to the tumor type under examination, it is often better to target specific regions rather than sequencing the whole gene. To design such regions, a “ratiometric” approach could be the best solution. It is better to evaluate the trade-off between genomic space occupied by the panel and number of mutations captured.
A shiny-based interactive application was created to allow
the users to design their custom regions. For a better
explanation of all the features of panelOptimizer,
we suggest you to follow the manual page ?panelOptimizer
and reading carefully the instruction inside the mini app itself.
panOpt <- panelOptimizer(mypanel)
In a classical epidemiological study design, one of the most important task is power and sample size estimation. In PrecisionTrialDrawer is currently possible to evaluate two of the most common oncological trial designs: time-to-event (survival) and proportion equality. The methods survPowerSampleSize and propPowerSampleSize are currently implemented inside the package. In this vignette, we will evaluate the first of the two methods. The latter is very similar, but requires a different set of initial parameter (pCase and pControl). A single-sample time-to-event calculator is also available as survPowerSampleSize1Arm. It is similar to survPowerSampleSize but instead of Hazard Ratio it requires MED1 and MED0, median survival for the case group and historical median survival.
First, we calculate the sample size required to obtain 4 levels of power (from 0.6 to 0.9) for 4 different postulated hazard ratios (from 0.625 to 0.3). Type I error was set to 5% (‘alpha’) and baseline event rate per unit of time was set to 0.9 (‘ber’, which correspond to 10% 1-year progression-free survival). Average follow-up time is 3 years (‘fu’). Allocation is equal between treatment and control groups (‘case.fraction’).
survPowerSampleSize(mypanel
, HR = c(0.625 , 0.5 , 0.4 , 0.3)
, power = seq(0.6 , 0.9 , 0.1)
, alpha = 0.05
, ber = 0.9
, fu = 3
, case.fraction = 0.5
)
Our function estimates the real number of samples required to start the study under the hypothesis that only a fraction of them will have at least one alteration included in the panel. The number of samples that will actually enter in the study can be retrieved using the option noPlot, in the following way.
# Add option noPlot to retrieve results as dataframe
sampSize <- survPowerSampleSize(mypanel
, HR = c(0.625 , 0.5 , 0.4 , 0.3)
, power = seq(0.6 , 0.9 , 0.1)
, alpha = 0.05
, ber = 0.9
, fu = 3
, case.fraction = 0.5
, noPlot=TRUE
)
knitr::kable(sampSize[ sampSize$HazardRatio==0.5 , ] , row.names = FALSE)
Var | ScreeningSampleSize | EligibleSampleSize | Beta | Power | HazardRatio |
---|---|---|---|---|---|
Full Design | 80 | 49 | 0.4 | 0.6 | 0.5 |
Full Design | 101 | 62 | 0.3 | 0.7 | 0.5 |
Full Design | 129 | 79 | 0.2 | 0.8 | 0.5 |
Full Design | 171 | 105 | 0.1 | 0.9 | 0.5 |
Under the column EligibleSampleSize we can see the actual number of patients in treatment and control groups combined under the postulated hypothesis ( 79 ).
Our sample size estimation function can also be used to estimate power from a postulated screening sample size. We first bring the sample size from screening to eligibility (multiplying by the frequency of alterations) and then perform the calculation. As a toy example, we use the same screening sample sizes calculated before and show the calculated power.
# Calculate power from sample size at screening
survPowerSampleSize(mypanel
, HR = 0.5
, sample.size = sampSize[ sampSize$HazardRatio==0.5 , "ScreeningSampleSize"]
, alpha = 0.05
, ber = 0.9
, fu = 3
, case.fraction = 0.5
)
So with 129 patients we obtain an estimated power of .
In a typical basket or umbrella design, we can also be interested in what is the power or sample size of the design of each single drug or tumor type instead of the whole panel. By adding the parameter var, this task is easy to achieve.
# Calcualte sample size by tumor type
survPowerSampleSize(mypanel
, var = "tumor_type"
, HR = c(0.625 , 0.5 , 0.4 , 0.3)
, power = seq(0.6 , 0.9 , 0.1)
, alpha = 0.05
, ber = 0.9
, fu = 3
, case.fraction = 0.5
)
Similarly, we can also ask a design divided by drug. Given the frequency of alterations targetable by each drug, we can estimate how many patients should be screened to obtain the minimum number of samples to reach a certain level of power. In this case, we don’t ask for all the drug type, but only Olaparib and Idelalisib.
# Calcualte sample size by drug
# Visualize only "Olaparib" and "Idelalisib"
survPowerSampleSize(mypanel
, var = "drug"
, stratum = c("Olaparib" , "Idelalisib")
, HR = c(0.625 , 0.5 , 0.4 , 0.3)
, power = seq(0.6 , 0.9 , 0.1)
, alpha = 0.05
, ber = 0.9
, fu = 3
, case.fraction = 0.5
)
It is noteworthy that the calculations performed arm-wise are based on the assumption that each arm is completely independent from the others. In other terms, patients with two targetable alterations enter in the frequency calculations twice as they could be treated with both drugs. In a typical umbrella design (many drugs, one tumor type) the patient is generally not treated with multiple drugs and can enter in one arm only.
The two sample size estimations above are useful in two particular occasions:
In case of a multi-drug design, the calculation of the number of patients to screen can be heavily biased. The first estimates the power of the whole study correctly, but each drug can be underpowered. The second estimates the power of each single arm correctly, but overestimates the total number of patients to screen. This is because the patients that are not eligible to enter in a specific arm cannot be assigned to a different drug and are simply discarded. To give an example, it is like performing 3 different clinical trials in 3 different hospitals with no possibility that a patient discarded from the first trial can be eligible in the other 2.
In a real umbrella trial scenario, the sample size should be calculated according to 3 assumptions:
survPowerSampleSize has a specific parameter to fulfill these prerequisites called priority.trial, a character vector listings drugs or group levels to build an umbrella trial based on a priority order. Sample size calculation becomes a multistep process reproduced by a cascade algorithm of screening and assignment.
The algorithm is composed by steps that implement a series of rules
to screen and allocate patients to each arm of the trial. By default,
the screening starts from the drug whose linked alteration are the
rarest in the object up to the most common. This default rule guarantees
the minimal sample size to screen. Alternatively, the user can decide
the order of “importance” of each drug by changing the default
priority.trial.order="optimal"
to "as.is"
. In this case, the order
of importance is dictated by the order of the priority.trial vector.
In order to calculate the fraction of expected patients with an alteration that make them eligible for a drug, we need to calculate a matrix as shown below. Under a example design with 4 drugs (A, B, C, D), the fraction (probability) of obtaining an eligible patient is calculated as:
\[\mathbf{P_{A,B,C,D}} = \left[\begin{array} {rrrr} P(A) & P(B|\bar{A}) & P(C|\bar{A}\cup\bar{B}) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\ P(A|\bar{B}) & P(B) & P(C|\bar{A}\cup\bar{B}) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\ P(A|\bar{C}) & P(B|\bar{A}\cup\bar{C}) & P(C) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\ P(A|\bar{D}) & P(B|\bar{A}\cup\bar{D}) & P(C|\bar{A}\cup\bar{B}\cup\bar{C}) & P(D) \end{array}\right] \]
The dependence scheme derives from the following order of priority of the drugs taken into consideration (\(A \rightarrow B \rightarrow C \rightarrow D\)). The main diagonal represents the screening phase, that starts with new samples and so is not influenced by any other drugs. Moving to 2, the second drug, the probability is calculated as \(P(2|\bar{1})\) that in the table above is \(P(B|\bar{A})\). Moving to 3 means taking dependencies from both 1 and 2, so \(P(2|\bar{1}\cup\bar{2})\) ( \(P(C|\bar{A}\cup\bar{B})\) ).
\[\mathbf{Order} = \left[\begin{array} {rrrr} 1 & 2 & 3 & 4 \\ 2 & 1 & 3 & 4 \\ 2 & 3 & 1 & 4 \\ 2 & 3 & 4 & 1 \end{array}\right] \]
The main diagonal (new screening line) uses the original frequencies of each drug, estimated as the number of eligible patients over the total number of samples in our CancerPanel object. Every advancement on the columns must take in consideration a depletion of all the samples that were already allocated. For example, \(P(B|\bar{A})\), ( first row , second column), calculates the number of eligible patients for B in a dataset epurated from all the samples eligible for A. If the drugs cover different pathways, the dependent frequencies are very similar to the original ones. Conversely, if you imagine two drugs with very similar molecular targets, once the first has removed all its eligible patients, very few patients can be allocated to the second treatment that will drop dramatically in frequency.
Now, let’s see an example. We take into consideration a design with four drugs, and we ask for the optimal order of priority of the drugs (basically starting from the rarest).
drugs4 <- c("Idelalisib" , "Olaparib" , "Trastuzumab" , "Vandetanib")
prior4drugs <- survPowerSampleSize(mypanel
, var = "drug"
, priority.trial = drugs4
, priority.trial.order="optimal"
, HR = 0.5
, power = 0.8
, noPlot = TRUE
)
##
## Minimum Eligible Sample Size to reach 80% of power with an HR equal to 0.5 and a HR0 equal to 1 for each drug is equal to: 95
prior4drugs
## $`HR:0.5 | HR0:1 | Power:0.8`
## $`HR:0.5 | HR0:1 | Power:0.8`$Summary
## Olaparib Vandetanib Trastuzumab Idelalisib Total
## Screened 2270 0 0 0 2270
## Eligible 95 97 270 530 991
## Not.Eligible 1280 0 0 0 1280
##
## $`HR:0.5 | HR0:1 | Power:0.8`$Screening.scheme
## Olaparib Vandetanib Trastuzumab Idelalisib
## Olaparib 2270 2175 2079 1809
## Vandetanib 0 0 0 0
## Trastuzumab 0 0 0 0
## Idelalisib 0 0 0 0
##
## $`HR:0.5 | HR0:1 | Power:0.8`$Allocation.scheme
## Olaparib Vandetanib Trastuzumab Idelalisib
## Olaparib 95 97 270 530
## Vandetanib 0 0 0 0
## Trastuzumab 0 0 0 0
## Idelalisib 0 0 0 0
## Total 95 97 270 530
##
## $`HR:0.5 | HR0:1 | Power:0.8`$Probability.scheme
## Olaparib Vandetanib Trastuzumab Idelalisib
## Olaparib 0.04186047 0.0443828 0.1298984 0.292744
## Vandetanib 0.00000000 0.0000000 0.0000000 0.000000
## Trastuzumab 0.00000000 0.0000000 0.0000000 0.000000
## Idelalisib 0.00000000 0.0000000 0.0000000 0.000000
##
## $`HR:0.5 | HR0:1 | Power:0.8`$Base.probabilities
## Olaparib Vandetanib Trastuzumab Idelalisib
## 0.04186047 0.04584718 0.13488372 0.28704319
The output is completely different from the regular one of this function. It is a list composed by 5 elements: Summary, Screening.scheme, Allocation.scheme, Probability.scheme, Base.probabilities.
A message is telling us that each drug must reach a number of
eligible patients equal to 66, under a default case/control fraction
of 0.5. We are designing a trial where
each drug needs 33 treated and 33 controls.
Reading from the Summary, our algorithm rules that
patients must
be screened to obtain
eligible subjects and discarding
patients that cannot be inserted in any arm of the trial.
Let’s now compare the result to the ones seen in the two sections above.
# We use stratum to create a design with only four drugs
fullDesign <- survPowerSampleSize(mypanel
, var = "drug"
, stratum = drugs4
, HR = 0.5
, power = 0.8
, noPlot = TRUE)
knitr::kable(fullDesign)
Var | ScreeningSampleSize | EligibleSampleSize | Beta | Power | HazardRatio |
---|---|---|---|---|---|
Idelalisib | 331 | 95 | 0.2 | 0.8 | 0.5 |
Olaparib | 2270 | 95 | 0.2 | 0.8 | 0.5 |
Trastuzumab | 705 | 95 | 0.2 | 0.8 | 0.5 |
Vandetanib | 2073 | 95 | 0.2 | 0.8 | 0.5 |
Full Design | 218 | 95 | 0.2 | 0.8 | 0.5 |
As we can see, the Eligible Sample Size is the same
as above, equal to 95.
The Full Design calculates a number of patients to
screen equal to 218, way
lower than
calculated by our priority trial.
This result guarantees at least 80% of power on the whole trial,
but it doesn’t assure sufficient power for all the arms of the study.
Conversely, the sum of all the patients to screen
assuming 4 independent trials is way higher, because a patient
discarded from an arm of the study cannot be eligible for a different arm.
sum(fullDesign[fullDesign$Var %in% drugs4 , "ScreeningSampleSize"])
## [1] 5379
If we want to calculate the post-hoc power for the whole study, using a sample size equal to the one calculated with priority trial, we can do:
# Extract the number of samples calculated with priority.trial
sizePrior4Drugs <- prior4drugs[[1]]$Summary["Screened" , "Total"]
# Calculate post-hoc power, given sample.size
postHocPower <- survPowerSampleSize(mypanel
, var = "drug"
, stratum = drugs4
, HR = 0.5
, sample.size = sizePrior4Drugs
, noPlot = TRUE)
postHocPower[ postHocPower$Var=="Full Design" , "Power"]
## [1] 1
The power is basically 100%. In conclusion, priority.trial guarantees an high level of power for the whole trial and at least the same level of power for each drug, minimizing the samples to screen.
The problem of using public data is that the information available from each tumor type depends on how many samples per tumor type were sequenced. In our example, we can see that mutation data are not the same between tumor types with 4221 sequenced samples for breast cancer and 965 for lung adenocarcinoma. Therefore, when calculating compound frequencies across tumor types, breast cancer weights more than lung cancer.
This behaviour was purposely set by default given the nature of the original idea of a panel-wise analysis that is the base of Umbrella Designs (one tumor type, multiple targets). In other words, PrecisionTrialDrawer is best suited for tumor-wise analysis where homogeneity of samples is automatically granted.
Nevertheless, Basket Designs (one target, multiple tumor types) can also be explored using this package by adding two parameters common to many of our functions.
Function | tumor.freqs | tumor.weights |
---|---|---|
coveragePlot | present | present |
coverageStackPlot | present | present |
cpFreq | present | present |
saturationPlot | present | |
survPowerSampleSize | present | present |
coocMutExPlot | present |
Let’s now see some examples that make use of these two parameters.
As mentioned in the introduction, alteration frequencies are by default calculated on all available samples inside the object. If we want to establish the frequency of samples that can be targeted by each drug, we can run a simple coverageStackPlot like this:
coverageStackPlot(mypanel , var="drug"
, grouping=NA , collapseByGene=TRUE)
This plot derives from a calculation made with 1019 breast samples and 486 lung samples, that are the ones with all available data from copynumber, mutations, fusions and expression. As you can see, this is not a balanced sample because breast tumors are the large majority. What if we had the same number of samples for each tumor type?
coverageStackPlot(mypanel , var="drug"
, grouping=NA , collapseByGene=TRUE
, tumor.freqs=c(brca=0.5 , luad=0.5))
While some of the drugs do not change dramatically, some other drop down or increase significantly (check red numbers on top of the bars). For example, Idelalisib drops down in frequency by a 10% (before 0.2870432, now 0.2324214). The reason is that is linked to PIK3CA, that is a gene that is rarely mutated in lung cancer, while it represents around the 30% breast cancer case. When the design is balanced, the frequency is dragged down towards the low LUAD frequency. Note that the plot is also slightly different from the previous one: now Y-axis reports the relative frequency (the same as the red text on the top of the bars), instead of the absolute number of samples that is usually reported. The reason is that tumor.freqs simply weights the tumor-wise frequencies by a 0.5 - 0.5 factor per tumor type and the original number of samples is therefore lost.
If we want to check our theory about PIK3CA, we can simply use tumor.freqs in cpFreq function.
# Run cpFreq with and withouth tumor.freqs
flatfreq <- cpFreq(mypanel , alterationType = "mutations")
weightedfreq <- cpFreq(mypanel , alterationType = "mutations"
, tumor.freqs=c(brca=0.5 , luad=0.5) )
# Frequency with no weights
knitr::kable(flatfreq[ flatfreq$gene_symbol=="PIK3CA" , ] , row.names=FALSE)
gene_symbol | freq |
---|---|
PIK3CA | 0.3033166 |
# Frequency with balanced weights
knitr::kable(weightedfreq[ weightedfreq$gene_symbol=="PIK3CA" , ]
, row.names=FALSE)
gene_symbol | freq |
---|---|
PIK3CA | 0.2019177 |
In this section, we analyze how a basket trial simulation can be carried on. Imagine a clinical trial on Idelalisib, a drug that acts on PIK3CA mutations. The study is designed so that each tumor type arm is closed when we reach 50 breast cancer cases and 40 lung cancer samples. Now we ask three questions:
samplefreqs <- cpFreq(mypanel , alterationType = "mutations"
, tumor.weights=c(brca=50 , luad=40))
knitr::kable(samplefreqs[ samplefreqs$gene_symbol=="PIK3CA" , ]
, row.names=FALSE)
gene_symbol | freq |
---|---|
PIK3CA | 0.2444444 |
tumor.weights allows a random extraction of 50 and 40 samples from each tumor type and calculate the frequency. This is just 1 extraction, so the result could be completely biased. We need to run a resampling simulation to answer our questions.
# Run cpFreq 20 times
samplefreqsboot <- replicate( 20
, cpFreq(mypanel , alterationType = "mutations"
, tumor.weights=c(brca=50 , luad=40))
, simplify=FALSE)
# Extract PIK3CA frequency in the 20 runs
pik3caboot <- sapply(samplefreqsboot , function(x) {
x[x$gene_symbol=="PIK3CA" , "freq"]})
Now to answer the first two questions, let’s draw the distribution of our resampling:
# calculate mean and sd of PIK3CA frequency distribution
avgpik3ca <- mean(pik3caboot)
sdpik3ca <- sd(pik3caboot)
# Plot the distribution
title <- paste("PIK3CA frequency distribution with 50 BRCA and 40 LUAD samples"
, paste("Mean:" , round(avgpik3ca , 3))
, paste("SD:" , round(sdpik3ca , 3))
, sep="\n")
# draw a simple histogram. the red line shows the mean value
hist(pik3caboot , col="cadetblue" , breaks=30
, main=title , xlab="Frequencies of resampling")
abline(v = avgpik3ca , col="red" , lwd=3 , xpd=FALSE)
So, with our expected sample composition, we have an average of 0.232 samples mutated on PIK3CA with a standard deviation of 0.031. Also a confidence interval for our measure can be calculated, using percentile bootstrap confidence interval
\[(\theta_{\alpha/2}^{*};\theta_{1-\alpha/2}^{*})\]
To summarize what we should expect from this study, including answering to question 3:
# Write a simple function to calculate confidence interval of a proportion
CI <- function(x){
left <- quantile(x , 0.025)
right <- quantile(x , 0.975)
return(c(left , right))
}
# Create a small data.frame to summarize everything
pik3caSummary <- data.frame(Gene = "PIK3CA"
, AverageMutationRate = round(avgpik3ca , 3)
, SDMutationRate = round(sdpik3ca , 3)
, MaxMutationRate = round(max(pik3caboot)[1] , 3)
, MinMutationRate = round(min(pik3caboot)[1] , 3)
, CI = paste( round( CI(pik3caboot) , 3) , collapse=" - "))
knitr::kable(pik3caSummary , row.names=FALSE)
Gene | AverageMutationRate | SDMutationRate | MaxMutationRate | MinMutationRate | CI |
---|---|---|---|---|---|
PIK3CA | 0.232 | 0.031 | 0.311 | 0.189 | 0.189 - 0.295 |
What we known now is how PIK3CA would behave if we randomly select 90 individuals divided in 50 breast cancer and 40 lung cancer cases.
How is this affecting sample size estimation? As mentioned in the previous section, sample size at screening is influenced by the number of affected samples we expect to find. If you need to treat with Idelalisib a total of 100 samples, you would need at least 300 patients sequenced to obtain 100 PIK3CA mutated samples (considering a mutation rate of 30%). This means that while sample size at treatment is fixed and depends only from hazard ratio, power and other parameters chosen beforehand, sample size at screening depends on how lucky/unlucky we are in finding enough patients with the alteration we are targeting.
Also survPowerSampleSize has a tumor.weights parameter that can be resampled as we did for cpFreq. Let’s see how. To simplify things, we hypothesize an hazard ratio of 2 and we want to reach a power of 80%.
# We want to know how many samples we will need under
# HR = 2
# power = 0.8
survboot <- replicate( 10
, survPowerSampleSize(mypanel
, var="gene_symbol"
, alterationType = "mutations"
, HR=2
, power=0.8
, tumor.weights=c(brca=50 , luad=40)
, noPlot=TRUE)
, simplify=FALSE)
# Extract PIK3CA results
survbootpik3ca <- sapply(survboot , function(x) {
x[ x$Var=="PIK3CA" , "ScreeningSampleSize"]} )
Again, we create a little table that would tell us the average scenario, the best, the worst etc. Calculation of confidence interval follows the same formula as above
# Create a small data.frame to summarize
# everything (we round up to integer values)
survSummary <- data.frame(Gene = "PIK3CA"
, Mean.Screen.Samp.Size = round(mean(survbootpik3ca))
, SD.Screen.Samp.Size = round(sd(survbootpik3ca))
, Max.Screen.Samp.Size = round(max(survbootpik3ca)[1])
, Min.Screen.Samp.Size = round(min(survbootpik3ca)[1])
, CI = paste( round( CI(survbootpik3ca)) , collapse=" - "))
knitr::kable(survSummary , row.names=FALSE)
Gene | Mean.Screen.Samp.Size | SD.Screen.Samp.Size | Max.Screen.Samp.Size | Min.Screen.Samp.Size | CI |
---|---|---|---|---|---|
PIK3CA | 312 | 51 | 392 | 257 | 259 - 387 |
What we learnt is that in the average scenario, we would need 312 samples to start our trial, with a minimum of 257 and, if we are very unlucky, a maximum of 392 patients at screening.
Once we are satisfied with our panel (and we suppose we are), the next step is to submit it to our preferred NGS company. The most accepted format to submit a panel is the bed format. As mentioned in the introduction, it is sometimes difficult to reconcile all the possible formats of alteration in a genomic position and that is what panelDesigner is designed for.
panelDesigner takes our CancerPanel object and 4 possible arguments.
# Design our panel in full with no padding
# length and merge window equal to 100 bp
mydesign <- panelDesigner(mypanel
, padding_length=0
, merge_window=100)
## List of 4
## $ GeneIntervals :'data.frame': 240 obs. of 9 variables:
## ..$ ensembl_transcript_id: chr [1:240] "ENST00000372348" ...
## ..$ ensembl_gene_id : chr [1:240] "ENSG00000097007" ...
## ..$ hgnc_symbol : chr [1:240] "ABL1" ...
## ..$ cds_length : int [1:240] 3450 3450 ...
## ..$ chromosome_name : chr [1:240] "9" ...
## ..$ genomic_coding_start : int [1:240] 133759356 133755887 ...
## ..$ genomic_coding_end : int [1:240] 133761070 133756051 ...
## ..$ exon_chrom_start : int [1:240] 133759356 133755887 ...
## ..$ exon_chrom_end : int [1:240] 133761070 133756051 ...
## $ TargetIntervals:'data.frame': 5 obs. of 5 variables:
## ..$ target_chr : chr [1:5] "7" ...
## ..$ target_start: int [1:5] 140453135 25398277 ...
## ..$ target_end : int [1:5] 140453137 25398291 ...
## ..$ target_width: int [1:5] 3 15 ...
## ..$ gene_symbol : chr [1:5] "BRAF" ...
## $ FullGenes : chr [1:7] "PIK3CA" ...
## $ BedStylePanel :'data.frame': 135 obs. of 4 variables:
## ..$ chr : chr [1:135] "chr3" ...
## ..$ start : int [1:135] 178916613 37881579 ...
## ..$ end : int [1:135] 178916965 37881655 ...
## ..$ annotation: chr [1:135] "PIK3CA" ...
The design is a list composed by four elements:
So if you want to submit your panel right away, just write your bed file:
# Retrieve the panel in bed format
bedPanel <- mydesign$BedStylePanel
head(bedPanel)
## chr start end annotation
## 1 chr3 178916613 178916965 PIK3CA
## 2 chr17 37881579 37881655 ERBB2
## 3 chr9 133589706 133589842 ABL1
## 4 chr9 133729450 133729624 ABL1
## 5 chr9 133730187 133730483 ABL1
## 6 chr3 178927973 178928353 PIK3CA
To know the total length of my panel, just sum up all the regions in the bed file:
# Calculate genomic space in kilo bases
sum( bedPanel$end - bedPanel$start )/100
## [1] 246.41
How panelDesigner decide the regions, according to the alteration specification?
As mentioned before, fusion, copynumber and expression data are probably collected through different technologies (maybe no NGS at all). Considering a specific design for every alteration type is therefore desirable.
# Design the panel for mutations
myMutationDesign <- panelDesigner(mypanel
, alterationType="mutations"
, padding_length=0
, merge_window=100)
By setting parameter alterationType, we require a design only for mutations.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.33 PrecisionTrialDrawer_1.8.0
## [3] BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] matrixStats_0.58.0 bitops_1.0-7 bit64_4.0.5
## [4] filelock_1.0.2 RColorBrewer_1.1-2 progress_1.2.2
## [7] httr_1.4.2 GenomeInfoDb_1.28.0 tools_4.1.0
## [10] bslib_0.2.5.1 utf8_1.2.1 R6_2.5.0
## [13] DT_0.18 DBI_1.1.1 BiocGenerics_0.38.0
## [16] colorspace_2.0-1 withr_2.4.2 tidyselect_1.1.1
## [19] prettyunits_1.1.1 bit_4.0.4 curl_4.3.1
## [22] compiler_4.1.0 Biobase_2.52.0 LowMACAAnnotation_0.99.3
## [25] profileModel_0.6.1 xml2_1.3.2 labeling_0.4.2
## [28] bookdown_0.22 sass_0.4.0 scales_1.1.1
## [31] brglm_0.7.2 rappdirs_0.3.3 stringr_1.4.0
## [34] digest_0.6.27 shinyBS_0.61 rmarkdown_2.8
## [37] XVector_0.32.0 pkgconfig_2.0.3 htmltools_0.5.1.1
## [40] highr_0.9 dbplyr_2.1.1 fastmap_1.1.0
## [43] htmlwidgets_1.5.3 rlang_0.4.11 RSQLite_2.2.7
## [46] shiny_1.6.0 farver_2.1.0 jquerylib_0.1.4
## [49] generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.26.0
## [52] dplyr_1.0.6 R.oo_1.24.0 RCurl_1.98-1.3
## [55] magrittr_2.0.1 GenomeInfoDbData_1.2.6 Rcpp_1.0.6
## [58] munsell_0.5.0 S4Vectors_0.30.0 fansi_0.4.2
## [61] lifecycle_1.0.0 R.methodsS3_1.8.1 stringi_1.6.2
## [64] yaml_2.2.1 zlibbioc_1.38.0 plyr_1.8.6
## [67] BiocFileCache_2.0.0 grid_4.1.0 blob_1.2.1
## [70] promises_1.2.0.1 parallel_4.1.0 ggrepel_0.9.1
## [73] crayon_1.4.1 Biostrings_2.60.0 hms_1.1.0
## [76] KEGGREST_1.32.0 magick_2.7.2 pillar_1.6.1
## [79] GenomicRanges_1.44.0 cgdsr_1.3.0 reshape2_1.4.4
## [82] codetools_0.2-18 biomaRt_2.48.0 stats4_4.1.0
## [85] googleVis_0.6.10 XML_3.99-0.6 glue_1.4.2
## [88] evaluate_0.14 data.table_1.14.0 BiocManager_1.30.15
## [91] vctrs_0.3.8 png_0.1-7 httpuv_1.6.1
## [94] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [97] cachem_1.0.5 ggplot2_3.3.3 xfun_0.23
## [100] mime_0.10 xtable_1.8-4 later_1.2.0
## [103] tibble_3.1.2 AnnotationDbi_1.54.0 memoise_2.0.0
## [106] IRanges_2.26.0 ellipsis_0.3.2