• Importing data
  • Subsetting data
  • Extracting productive sequences
  • Create a table of summary statistics
  • Calculate clonal relatedness
  • Draw a phylogenetic tree
  • Multiple sequence alignment
  • Searching for sequences
  • Searching for published sequences
  • Visualizing repertoire diversity
  • Comparing samples
  • Differential abundance
  • Finding recurring sequences
  • Tracking sequences across samples
  • Comparing V(D)J gene usage
  • Removing sequences
  • Merging samples
  • Conclusion
  • Session info

Application of high-throughput sequencing of T and B lymphocyte antigen receptors has great potential for improving the monitoring of lymphoid malignancies, assessing immune reconstitution after hematopoietic stem cell transplantation, and characterizing the composition of lymphocyte repertoires (Warren, E. H. et al. Blood 2013;122:19–22). LymhoSeq is an R package designed to import, analyze, and visualize antigen receptor sequencing from Adaptive Biotechnologies’ ImmunoSEQ assay. The package is also adaptable to the analysis of T and B cell receptor sequencing processed using other platforms such as MiXCR or IMGT/HighV-QUEST. This vignette has been written to highlight some of the features of LymphoSeq and guide the user through a typical workflow.

Importing data

The LymphoSeq function readImmunoSeq imports tab-separated value (.tsv) files exported by Adaptive Biotechnologies ImmunoSEQ analyzer v2 where each row represents a unique sequence and each column is a variable with information about that sequence such as read count, frequency, or variable gene name. Note that the file format for ImmunoSEQ analyzer v3 is not yet supported and users must choose to export the v2 format from the analyzer software. Only files with the extension .tsv are imported while all other are disregarded. It is possible to import files processed using other platforms as long as the files are tab-delimited, are given the extension .tsv and have identical column names as the ImmunoSEQs files (see readImmunoSeq manual for a list of column names used by this file type). Refer to the LymphoSeq manual regarding the required column names used by each function.

To explore the features of LymphoSeq, this package includes 2 example data sets. The first is a data set of T cell receptor beta (TCRB) sequencing from 10 blood samples acquired serially from a single patient who underwent a bone marrow transplant (Kanakry, C.G., et al. JCI Insight 2016;1(5):pii: e86252). The second, is a data set of B cell receptor immunoglobulin heavy (IGH) chain sequencing from Burkitt lymphoma tumor biopsies acquired from 10 different individuals (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). To improve performance, both data sets contain only the top 1,000 most frequent sequences. The complete data sets are publicly available through Adapatives’ immuneACCESS portal. As shown in the example below, you can specify the path to the example data sets using the command system.file("extdata", "TCRB_sequencing", package = "LymphoSeq") for the TCRB files and system.file("extdata", "IGH_sequencing", package = "LymphoSeq") for the IGH files.

readImmunoSeq imports each file in the specified directory as a list object where each file becomes a data frame. You can import all columns from each file by setting the columns parameter to "all" or list just those columns of interest. Be aware that Adaptive Biotechnologies has changed the column names of their files over time and if the headings of your files are not all the same, you will need to specify "all" or provide all variations of the column header. By default, the columns parameter is set to import only those columns used by LymphoSeq.

library(LymphoSeq)
## Loading required package: LymphoSeqDB
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape
TCRB.path <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq")

TCRB.list <- readImmunoSeq(path = TCRB.path)

Notice that each data frame listed in the TCRB.list object is named according the ImmunoSEQ file names. If different names are desired, you may rename the original .tsv files or assign names(TCRB.list) to a new character vector of desired names in the same order as the list.

names(TCRB.list)
 [1] "TRB_CD4_949"       "TRB_CD8_949"       "TRB_CD8_CMV_369"  
 [4] "TRB_Unsorted_0"    "TRB_Unsorted_1320" "TRB_Unsorted_1496"
 [7] "TRB_Unsorted_32"   "TRB_Unsorted_369"  "TRB_Unsorted_83"  
[10] "TRB_Unsorted_949" 

Having the data in the form of a list makes it easy to apply a function over that list using the base function lapply. For example, you may use the function dim to report the dimensions of each data frame as shown below. Noticed that each data frame in the example below has less than 1,000 rows and 11 columns.

lapply(TCRB.list, dim)
$TRB_CD4_949
[1] 1000   11

$TRB_CD8_949
[1] 1000   11

$TRB_CD8_CMV_369
[1] 414  11

$TRB_Unsorted_0
[1] 1000   11

$TRB_Unsorted_1320
[1] 1000   11

$TRB_Unsorted_1496
[1] 1000   11

$TRB_Unsorted_32
[1] 920  11

$TRB_Unsorted_369
[1] 1000   11

$TRB_Unsorted_83
[1] 1000   11

$TRB_Unsorted_949
[1] 1000   11

In place of dim, you may also use colnames, nrow, ncol, or other more complex functions that perform operations on subsetted columns.

Subsetting data

If you imported all of the files from your project but just want to perform an analysis on a subset, use standard R methods to subset the list. Remember that a single bracket [ returns a list and a double bracket [[ returns a single data frame.

CMV <- TCRB.list[grep("CMV", names(TCRB.list))]
names(CMV)
[1] "TRB_CD8_CMV_369"
TRB_Unsorted_0 <- TCRB.list[["TRB_Unsorted_0"]]
head(TRB_Unsorted_0)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
1
2CASSPVSNEQFF
3CASSQEVPPYQAFF
4
5CASSQEASGRQTQYF
6CASSLEHTGATNEKLFF

For more complex subsetting, you can use a metadata file where one column contains the file names and the other columns have additional information about the sample files. You can then subset the metadata file using criteria from the other columns to give you just a character vector of file names that you can use to subset TCRB.list. In the example below, a metadata file is imported for the example TCRB data set which contains information on the number of days post bone marrow transplant the sample was collected and the cellular phenopyte the blood sample was sorted for prior to sequencing.

TCRB.metadata <- read.csv(system.file("extdata", "TCRB_metadata.csv", package = "LymphoSeq"))
TCRB.metadata
ABCDEFGHIJ0123456789
samples
<fctr>
day
<int>
phenotype
<fctr>
TRB_Unsorted_00Unsorted
TRB_Unsorted_3232Unsorted
TRB_Unsorted_8382Unsorted
TRB_CD8_CMV_369369CD8+CMV+
TRB_Unsorted_369369Unsorted
TRB_CD4_949949CD4+
TRB_CD8_949949CD8+
TRB_Unsorted_949949Unsorted
TRB_Unsorted_13201320Unsorted
TRB_Unsorted_14961496Unsorted
selected <- as.character(TCRB.metadata[TCRB.metadata$phenotype == "Unsorted" & 
                                 TCRB.metadata$day > 300, "samples"])
TCRB.list.selected <- TCRB.list[selected]
names(TCRB.list.selected)
[1] "TRB_Unsorted_369"  "TRB_Unsorted_949"  "TRB_Unsorted_1320"
[4] "TRB_Unsorted_1496"

Extracting productive sequences

A productive sequence is defined as a sequences that is in frame and does not have an early stop codon. If you sequenced genomic DNA as opposed to complimentary DNA made from RNA, then you will have unproductive and productive sequences in your data files. Use the function productiveSeq to remove unproductive sequences and recompute the frequencyCount for each of your samples.

If you are interested in just the complementarity determining region 3 (CDR3) amino acid sequences, then set aggregate to "aminoAcid" and the count and estimated number of genomes for duplicate amino acid sequences will be summed. Note that the resulting list of data frames will have columns corresponding to “aminoAcid”, “count”, “frequencyCount”, and “estimatedNumberGenomes” (if this column is available) only. All other columns, such as those corresponding to the V, D, and J gene names, will be removed if they were included in your original file list. The reason for this is to avoid confusion since a single amino acid CDR3 sequence may be encoded by multiple different nucleotide sequences with differing V, D, and J genes.

productive.TRB.aa <- productiveSeq(file.list = TCRB.list, aggregate = "aminoAcid", 
                               prevalence = FALSE)

Alternatively, you may set aggregate to "nucleotide" and the resulting list of data frames will all have the same columns as your original file list. Take note that some LymphoSeq functions require a productive sequence list aggregated by amino acid or nucleotide.

productive.TRB.nt <- productiveSeq(file.list = TCRB.list, aggregate = "nucleotide", 
                               prevalence = FALSE)

If the parameter prevalence is set to TRUE, then a new column is added to each of the data frames giving the prevalence (%) of each TCR beta CDR3 amino acid sequence in 55 healthy donor peripheral blood samples. Values range from 0 to 100% where 100% means the sequence appeared in the blood of all 55 individuals. The data for this operation resides in a separate package that is automatically loaded called LymphoSeqDB. Please refer to that package manual for more details.

Notice in the example below that there are no amino acid sequences given in the first and fourth row of the TCRB.list data frame for sample “TRB_Unsorted_949”. This is because the nucleotide sequence is out of frame and does not produce a productively transcribed amino acid sequence. If an asterisk (*) appears in the amino acid sequences, this would indicate an early stop codon.

head(TCRB.list[["TRB_Unsorted_0"]])
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
1
2CASSPVSNEQFF
3CASSQEVPPYQAFF
4
5CASSQEASGRQTQYF
6CASSLEHTGATNEKLFF

After productiveSeq is run, the unproductive sequences are removed and the frequencyCount is recalculated for each sequence. If there were two identical amino acid sequences that differed in their nucleotide sequence, they would be combined and their counts added together.

head(productive.TRB.aa[["TRB_Unsorted_0"]])
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
count
<int>
frequencyCount
<dbl>
estimatedNumberGenomes
<dbl>
1CASSPVSNEQFF8225.773283822
2CASSQEVPPYQAFF7975.597696797
3CASSQEASGRQTQYF7044.944515704
4CASSLEHTGATNEKLFF6534.586318653
5CASSPGDEQYF6194.347521619
6CSARSPSTGTLAEAFF4293.013064429

Finally, notice that the productive.TRB.nt data frame for sample “TRB_Unsorted_949” below has additional columns not present in productive.TRB.aa but are in TCRB.list. This is because the data frame was aggregated by nucleotide sequence and all of the original columns from TCRB.list were carried over.

head(productive.TRB.nt[["TRB_Unsorted_0"]])
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
1CASSPVSNEQFF
2CASSQEVPPYQAFF
3CASSQEASGRQTQYF
4CASSLEHTGATNEKLFF
5CASSPGDEQYF
6CSARSPSTGTLAEAFF

Create a table of summary statistics

To create a table summarizing the total number of sequences, number of unique productive sequences, number of genomes, entropy, clonality, Gini coefficient, and the frequency (%) of the top productive sequence in each imported file, use the function clonality.

clonality(file.list = TCRB.list)
ABCDEFGHIJ0123456789
samples
<fctr>
totalSequences
<int>
uniqueProductiveSequences
<int>
totalCount
<int>
clonality
<dbl>
giniCoefficient
<dbl>
TRB_CD4_9491000845257690.4427190.8665242
TRB_Unsorted_36910008303394130.4259650.8447387
TRB_Unsorted_8310008232367320.3381140.7766277
TRB_CD8_9491000794262390.4306150.9026124
TRB_CD8_CMV_36941428117940.3315700.7606261
TRB_Unsorted_132010008381781900.4216300.9016617
TRB_Unsorted_14961000832336690.3893180.8812733
TRB_Unsorted_949100083165490.3057840.7654438
TRB_Unsorted_01000838181610.2809230.8184686
TRB_Unsorted_32920767310780.1342420.6007820

The clonality score is derived from the Shannon entropy, which is calculated from the frequencies of all productive sequences divided by the logarithm of the total number of unique productive sequences. This normalized entropy value is then inverted (1 - normalized entropy) to produce the clonality metric.

The Gini coefficient is an alternative metric used to calculate repertoire diversity and is derived from the Lorenz curve. The Lorenz curve is drawn such that x-axis represents the cumulative percentage of unique sequences and the y-axis represents the cumulative percentage of reads. A line passing through the origin with a slope of 1 reflects equal frequencies of all clones. The Gini coefficient is the ratio of the area between the line of equality and the observed Lorenz curve over the total area under the line of equality.

Both Gini coefficient and clonality are reported on a scale from 0 to 1 where 0 indicates all sequences have the same frequency and 1 indicates the repertoire is dominated by a single sequence.

Calculate clonal relatedness

One of the drawbacks of the clonality metric is that it does not take into account sequence similarity. This is particularly important when studying affinity maturation or B cell malignancies(Lombardo, K.A., et al. Blood Advances 2017 1:535-544). Clonal relatedness is a useful metric that takes into account sequence similarity without regard for clonal frequency. It is defined as the proportion of nucleotide sequences that are related by a defined edit distance threshold. The value ranges from 0 to 1 where 0 indicates no sequences are related and 1 indicates all sequences are related. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences different by a single amino acid or nucleotide.

IGH.path <- system.file("extdata", "IGH_sequencing", package = "LymphoSeq")

IGH.list <- readImmunoSeq(path = IGH.path)
clonalRelatedness(list = IGH.list, editDistance = 10)
ABCDEFGHIJ0123456789
samples
<fctr>
clonalRelatedness
<dbl>
IGH_MVQ108911A_BL0.623919308
IGH_MVQ194745A_BL0.845000000
IGH_MVQ81231A_BL0.623919308
IGH_MVQ89037A_BL0.288184438
IGH_MVQ90143A_BL0.006097561
IGH_MVQ92552A_BL0.272727273
IGH_MVQ93505A_BL0.392201835
IGH_MVQ93631A_BL0.757000000
IGH_MVQ94865A_BL0.007000000
IGH_MVQ95413A_BL0.003636364

Draw a phylogenetic tree

A phylogenetic tree is a useful way to visualize the similarity between sequences. The phyloTree function create a phylogenetic tree of a single sample using neighbor joining tree estimation for amino acid or nucleotide CDR3 sequences. Each leaf in the tree represents a sequence color coded by the V, D, and J gene usage. The number next to each leaf refers to the sequence count. A triangle shaped leaf indicates the most frequent sequence. The distance between leaves on the horizontal axis corresponds to the sequence similarity (i.e. the further apart the leaves are horizontally, the less similar the sequences are to one another).

productive.IGH.nt <- productiveSeq(file.list = IGH.list, aggregate = "nucleotide")
phyloTree(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "nucleotide", 
         layout = "rectangular")

Multiple sequence alignment

In LymphoSeq, you can perform a multiple sequence alignment using one of three methods provided by the Bioconductor msa package (ClustalW, ClustalOmega, or Muscle) and output results to the console or as a pdf file. One may perform the alignment of all amino acid or nucleotide sequences in a single sample. Alternatively, one may search for a given sequence within a list of samples using an edit distance threshold.

alignSeq(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "aminoAcid", 
         method = "ClustalW", output = "consule")
use default substitution matrix

Searching for sequences

To search for one or more amino acid or nucleotide CDR3 sequences in a list of data frames, use the function searchSeq. You may specify to search in either a list of productive or unproductive data frames.

searchSeq(list = productive.TRB.aa, sequence = "CASSPVSNEQFF", type = "aminoAcid", 
          match = "global", editDistance = 0)
ABCDEFGHIJ0123456789
sample
<fctr>
aminoAcid
<chr>
count
<int>
frequencyCount
<dbl>
estimatedNumberGenomes
<dbl>
TRB_Unsorted_0CASSPVSNEQFF8225.773282764822
TRB_CD8_949CASSPVSNEQFF20.0089237912

If you have only a partial sequence, set the parameter match to "partial". If you are looking for related sequences that differ by one or more nucleotides or amino acids, then increase the editDistance value. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide.

Searching for published sequences

To search your entire list of data frames for a published amino acid CDR3 TCRB sequence with known antigen specificity, use the function searchPublished.

published <- searchPublished(list = productive.TRB.aa)
head(published)
ABCDEFGHIJ0123456789
 
 
sample
<fctr>
aminoAcid
<chr>
count
<int>
frequencyCount
<dbl>
estimatedNumberGenomes
<dbl>
PMID
<fctr>
1TRB_Unsorted_32CASASSGTDTQYF330.131594688020647322
2TRB_Unsorted_1496CASSETGGTEAFF20.007267706219786555
3TRB_Unsorted_949CASSFSTDTQYF20.038277512223267020
4TRB_CD8_949CASSIRSAYEQYF70.031233268721048112
5TRB_Unsorted_1320CASSIRSAYEQYF150.0103138861521048112
6TRB_Unsorted_1496CASSIRSSYEQYF40.014535412421048112

For each found sequence, a table is provides listing the antigen, epitope, HLA type, PubMed ID (PMID), and prevalence (%) of the sequence among 55 healthy donor blood samples. The data for this function resides in the separate LymphoSeqDB package that is automatically loaded when the function is called. Please refer to that package manual for more details.

Visualizing repertoire diversity

Antigen receptor repertoire diversity can be characterized by a number such as clonality or Gini coefficient calculated by the clonality function. Alternatively, you can visualize the repertoire diversity by plotting the Lorenz curve for each sample as defined above. In this plot, the more diverse samples will appear near the dotted diagonal line (the line of equality) whereas the more clonal samples will appear to have a more bowed shape.

lorenzCurve(samples = names(productive.TRB.aa), list = productive.TRB.aa)

Alternatively, you can get a feel for the repertoire diversity by plotting the cumulative frequency of a selected number of the top most frequent clones using the function topSeqsPlot. In this case, each of the top sequences are represented by a different color and all less frequent clones will be assigned a single color (violet).

topSeqsPlot(list = productive.TRB.aa, top = 10)

Both of these functions are built using the ggplot2 package. You can reformat the plot using ggplot2 functions. Please refer to the lorenzCurve and topSeqsPlot manual for specific examples.

Comparing samples

To compare the T or B cell repertoires of all samples in a pairwise fashion, use the bhattacharyyaMatrix or similarityMatrix functions. Both the Bhattacharyya coefficient and similarity score are measures of the amount of overlap between two samples. The value for each ranges from 0 to 1 where 1 indicates the sequence frequencies are identical in the two samples and 0 indicates no shared frequencies. The Bhattacharyya coefficient differs from the similarity score in that it involves weighting each shared sequence in the two distributions by the arithmetic mean of the frequency of each sequence, while calculating the similarity scores involves weighting each shared sequence in the two distributions by the geometric mean of the frequency of each sequence in the two distributions.

bhattacharyya.matrix <- bhattacharyyaMatrix(productive.seqs = productive.TRB.aa)
bhattacharyya.matrix
ABCDEFGHIJ0123456789
 
 
TRB_Unsorted_1496
<dbl>
TRB_Unsorted_1320
<dbl>
TRB_Unsorted_949
<dbl>
TRB_Unsorted_369
<dbl>
TRB_Unsorted_14961.000000000.952918650.855909460.524018989
TRB_Unsorted_13200.952918651.000000000.876742970.532764399
TRB_Unsorted_9490.855909460.876742971.000000000.517496433
TRB_Unsorted_3690.524018990.532764400.517496431.000000000
TRB_Unsorted_830.303751150.331417300.375188620.469787569
TRB_Unsorted_320.287820250.312926110.280187090.192342747
TRB_CD8_CMV_3690.755288170.778866100.715033760.512020867
TRB_Unsorted_00.015429650.016273380.013887040.008900138
TRB_CD8_9490.788713020.817943470.814148320.532143928
TRB_CD4_9490.443612070.437433090.425876910.176016728
similarity.matrix <- similarityMatrix(productive.seqs = productive.TRB.aa)
similarity.matrix
ABCDEFGHIJ0123456789
 
 
TRB_Unsorted_1496
<dbl>
TRB_Unsorted_1320
<dbl>
TRB_Unsorted_949
<dbl>
TRB_Unsorted_369
<dbl>
TRB_Unsorted_14961.000000000.95818540.899065480.74206009
TRB_Unsorted_13200.958185411.00000000.938636670.65143812
TRB_Unsorted_9490.899065480.93863671.000000000.77154129
TRB_Unsorted_3690.742060090.65143810.771541291.00000000
TRB_Unsorted_830.545898520.43255480.597235890.47127662
TRB_Unsorted_320.321051030.40355520.261599890.23387758
TRB_CD8_CMV_3690.671883080.70631670.652225700.70312873
TRB_Unsorted_00.027300810.03718850.013872480.01588281
TRB_CD8_9490.789729830.76485130.907949490.69642713
TRB_CD4_9490.472238560.29001810.729410150.11852468

The results of either function can be visualized by the pairwisePlot function.

pairwisePlot(matrix = bhattacharyya.matrix)

To view sequences shared between two or more samples, use the function commonSeqs. This function requires that a productive amino acid list be specified.

common <- commonSeqs(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32"), 
                    productive.aa = productive.TRB.aa)
head(common)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
TRB_Unsorted_0
<dbl>
TRB_Unsorted_32
<dbl>
1CASSQDRTGQYGYTF0.470571710.80551900
2CAWTGGTTEAFF0.105351880.15153328
3CAISEGNYGYTF0.035117290.18742274
4CASSFGIQETQYF0.014046920.09570523

To visualize the number of overlapping sequences between two or three samples in the form of a Venn diagram, use the function commonSeqVenn.

commonSeqsVenn(samples = c("TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

commonSeqsVenn(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

To compare the frequency of sequences between two samples as a scatter plot, use the function commonSeqsPlot.

commonSeqsPlot("TRB_Unsorted_32", "TRB_Unsorted_83", 
               productive.aa = productive.TRB.aa, show = "common")

If you have more than 3 samples to compare, use the commonSeqBar function. You can chose to color a single sample with the color.sample argument or a desired intersection with the color.intersection argument.

commonSeqsBar(productive.aa = productive.TRB.aa, 
              samples = c("TRB_CD4_949", "TRB_CD8_949", 
                          "TRB_Unsorted_949", "TRB_Unsorted_1320"), 
              color.sample = "TRB_CD8_949",
              labels = "no")

Differential abundance

When comparing a sample from two different time points, it is useful to identify sequences that are significantly more or less abundant in one versus the other time point (DeWitt, W.S., et al. Journal of Virology 2015 89(8):4517-4526). The differentialAbundance function uses a Fisher exact test to calculate differential abundance of each sequence in two time points and reports the log2 transformed fold change, P value and adjusted P value.

differentialAbundance(list = productive.TRB.aa, 
                      sample1 = "TRB_Unsorted_949", 
                      sample2 = "TRB_Unsorted_1320", 
                      type = "aminoAcid", q = 0.01)
ABCDEFGHIJ0123456789
aminoAcid
<fctr>
TRB_Unsorted_949
<dbl>
TRB_Unsorted_1320
<dbl>
p
<dbl>
q
<dbl>
l2fc
<dbl>
CASSPPTGERDTQYF5.6267942612.935675732.406577e-663.205561e-63-1.200970
CASSQDLMTVDSLFAGANVLTF8.057416273.118919102.038483e-632.713221e-601.369271
CASSLAGDSQETQYF2.028708136.578884046.843544e-529.101914e-49-1.697282
CASSSIKTGATEAFF0.612440190.018564993.382517e-314.495366e-285.043912
CASSQDTGNEQFF0.325358850.000000001.481077e-251.966870e-228.345888
CASSQGGSYNSPLHF0.287081340.000000001.238489e-221.643475e-198.165316
CASSFHRDAAYGYTF0.248803830.000000001.034867e-191.372234e-167.958865
CASSQEGGRDNEQFF0.363636360.010313891.991455e-192.638678e-165.139837
CASSPWSNEKLFF0.229665070.000000002.990614e-183.959573e-157.843388
CASSYFRDGGELGTF0.574162680.070134422.034201e-162.691248e-133.033265

Finding recurring sequences

To create a data frame of unique, productive amino acid sequences as rows and sample names as headers use the seqMatrix function. Each value in the data frame represents the frequency that each sequence appears in the sample. You can specify your own list of sequences or all unique sequences in the list using the output of the function uniqueSeqs. The uniqueSeqs function creates a data frame of all unique, productive sequences and reports the total count in all samples.

unique.seqs <- uniqueSeqs(productive.aa = productive.TRB.aa)
head(unique.seqs)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
count
<int>
3143CASSQDWERLGEQFF99480
2178CASSLQGREKLFF90567
3039CASSQDLMTVDSLFAGANVLTF68682
2506CASSPAGAYYNEQFF30454
2744CASSPPTGERDTQYF24703
1642CASSLAGDSQETQYF22147
sequence.matrix <- seqMatrix(productive.aa = productive.TRB.aa, sequences = unique.seqs$aminoAcid)
head(sequence.matrix)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
numberSamples
<int>
TRB_CD4_949
<dbl>
TRB_CD8_949
<dbl>
TRB_CD8_CMV_369
<dbl>
TRB_Unsorted_0
<dbl>
1CASSQDRTGQYGYTF900.995002680.60321720.47057171
2CASSLQGREKLFF808.731929325.16085790.00000000
3CASSQDLMTVDSLFAGANVLTF8010.650544358.84718500.00000000
4CASSYSGNTEAFF800.169552030.13404830.04214075
5CASSFMDWTGGNSPLHF800.044618950.46916890.00000000
6CASSREGDQPQHF800.298946990.13404830.00000000

If just the top clones with a frequency greater than a specified amount are of interest to you, then use the topFreq function. This creates a data frame of the top productive amino acid sequences having a minimum specified frequency and reports the minimum, maximum, and mean frequency that the sequence appears in a list of samples. For TCRB sequences, the prevalence (%) and the published antigen specificity of that sequence are also provided.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0.1)
head(top.freq)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
minFrequency
<dbl>
maxFrequency
<dbl>
meanFrequency
<dbl>
numberSamples
<int>
prevalence
<dbl>
387CASSQDRTGQYGYTF0.432495621.86250840.833497393.6
276CASSLQGREKLFF4.8620952823.64584348.7311794830.9
382CASSQDLMTVDSLFAGANVLTF1.5512222412.19223717.394429781.8
436CASSREGDQPQHF0.134048262.84723050.5767910818.2
158CASSFMDWTGGNSPLHF0.035889460.66959540.231494281.8
536CASSYSGNTEAFF0.042140750.83856330.2246444869.1

One very useful thing to do is merge the output of seqMatrix and topFreq.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0)
top.freq.matrix <- merge(top.freq, sequence.matrix)
head(top.freq.matrix)
ABCDEFGHIJ0123456789
 
 
aminoAcid
<chr>
numberSamples
<int>
minFrequency
<dbl>
maxFrequency
<dbl>
meanFrequency
<dbl>
prevalence
<dbl>
antigen
<chr>
1CAAGDTTLYEQYF10.0140469170.0140469170.0140469170.0
2CAAGTSGDTQYF10.0191387560.0191387560.0191387561.8
3CAARGGGESYEQYF10.0280938330.0280938330.0280938330.0
4CAATRRQGDVMNTEAFF10.0835413160.0835413160.0835413160.0
5CAAWGTGPLGSSGANVLTF10.0299314470.0299314470.0299314470.0
6CACALGDGYTF10.0013751850.0013751850.0013751850.0

Tracking sequences across samples

To visually track the frequency of sequences across multiple samples, use the function cloneTrack. This function takes the output from the seqMatrix function. You can specify a character vector of amino acid sequences using the parameter track to highlight those sequences with a different color. Alternatively, you can highlight all of the sequences from a given sample using the parameter map. If the mapping feature is use, then you must specify a productive amino acid list and a character vector of labels to title the mapped samples. To hide sequences that are not being tracked or mapped, set unassigned to FALSE.

cloneTrack(sequence.matrix = sequence.matrix, 
           productive.aa = productive.TRB.aa, 
           map = c("TRB_CD4_949", "TRB_CD8_949"), 
           label = c("CD4", "CD8"), 
           track = "CASSPPTGERDTQYF", 
           unassigned = FALSE)

Refer to the cloneTrack manual for examples on how to reformat the chart using ggplot2 function.

Comparing V(D)J gene usage

To compare the V, D, and J gene usage across samples, start by creating a data frame of V, D, and J gene counts and frequencies using the function geneFreq. You can specify if you are interested in the “VDJ”, “DJ”, “VJ”, “DJ”, “V”, “D”, or “J” loci using the locus parameter. Set family to TRUE if you prefer the family names instead of the gene names as reported by ImmunoSeq.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
head(vGenes)
ABCDEFGHIJ0123456789
 
 
samples
<fctr>
familyName
<fctr>
count
<int>
frequencyGene
<dbl>
1TRB_Unsorted_949TCRBV02721.377990
2TRB_Unsorted_949TCRBV031653.157895
3TRB_Unsorted_949TCRBV0478014.928230
4TRB_Unsorted_949TCRBV053987.617225
5TRB_Unsorted_949TCRBV064137.904306
6TRB_Unsorted_949TCRBV074759.090909

To create a chord diagram showing VJ or DJ gene associations from one or more more samples, combine the output of geneFreq with the function chordDiagramVDJ. This function works well the topSeqs function that creates a data frame of a selected number of top productive sequences. In the example below, a chord diagram is made showing the association between V and J genes of just the single dominant clones in each sample. The size of the ribbons connecting VJ genes correspond to the number of samples that have that recombination event. The thicker the ribbon, the higher the frequency of the recombination.

top.seqs <- topSeqs(productive.seqs = productive.TRB.nt, top = 1)
chordDiagramVDJ(sample = top.seqs, 
                association = "VJ", 
                colors = c("darkred", "navyblue"))

You can also visualize the results of geneFreq as a heat map, word cloud, our cumulative frequency bar plot with the support of additional R packages as shown below.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(RColorBrewer)
library(grDevices)
RedBlue <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(11, "RdBu")))(256)
library(wordcloud)
wordcloud::wordcloud(words = vGenes[vGenes$samples == "TRB_Unsorted_83", "familyName"], 
                     freq = vGenes[vGenes$samples == "TRB_Unsorted_83", "frequencyGene"], 
                     colors = RedBlue)

library(reshape)
vGenes <- reshape::cast(vGenes, familyName ~ samples, value = "frequencyGene", sum)
rownames(vGenes) = as.character(vGenes$familyName)
vGenes$familyName = NULL
library(pheatmap)
pheatmap::pheatmap(vGenes, color = RedBlue, scale = "row")

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(ggplot2)
multicolors <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Set1")))(28)
ggplot2::ggplot(vGenes, aes(x = samples, y = frequencyGene, fill = familyName)) +
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_y_continuous(expand = c(0, 0)) + 
  guides(fill = guide_legend(ncol = 2)) +
  scale_fill_manual(values = multicolors) + 
  labs(y = "Frequency (%)", x = "", fill = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Removing sequences

Occasionally you may identify one or more sequences in your data set that appear to be contamination. You can remove an amino acid sequence from all data frames using the function removeSeq and recompute frequencyCount for all remaining sequences.

searchSeq(list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")
ABCDEFGHIJ0123456789
sample
<fctr>
aminoAcid
<chr>
count
<int>
frequencyCount
<dbl>
estimatedNumberGenomes
<dbl>
TRB_CD4_949CASSESAGSTGELFF501930.0917325019
TRB_Unsorted_1320CASSESAGSTGELFF103267.10007910326
TRB_Unsorted_949CASSESAGSTGELFF3386.468900338
TRB_Unsorted_1496CASSESAGSTGELFF17556.3774121755
cleansed <- removeSeq(file.list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")
searchSeq(list = cleansed, sequence = "CASSESAGSTGELFF")
No sequences found.

Merging samples

If you need to combine multiple samples into one, use the mergeFiles function. It merges two or more sample data frames into a single data frame and aggregates count, frequencyCount, and estimatedNumberGenomes.

TRB_949_Merged <- mergeFiles(samples = c("TRB_CD4_949", "TRB_CD8_949"), 
                                file.list = TCRB.list)

Conclusion

Advances in high-throughput sequencing have enabled characterizing T and B lymphocyte repertoires with unprecedented depth. LymphoSeq was developed as a tool to assist in the analysis of targeted next generation sequencing of the hypervariable CDR3 region of T and B cell receptors. The three key features of this R package are to characterize lymphocyte repertoire diversity, compare two or more lymphocyte repertoires, and track the frequency of CDR3 sequences across multiple samples. LymphoSeq also provides the unique ability to search for sequences in a curated database of published TCRB sequences with known antigen specificity. Finally, LymphoSeq can assign the percent prevalence that any given TCRB sequence appears in a the peripheral blood in healthy population of donors.

Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.1      pheatmap_1.0.12    reshape_0.8.8     
## [4] wordcloud_2.6      RColorBrewer_1.1-2 LymphoSeq_1.12.0  
## [7] LymphoSeqDB_0.99.2
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1   Rcpp_1.0.1           ape_5.3             
##  [4] lattice_0.20-38      circlize_0.4.6       tidyr_0.8.3         
##  [7] Biostrings_2.52.0    assertthat_0.2.1     digest_0.6.18       
## [10] R6_2.4.0             plyr_1.8.4           futile.options_1.0.1
## [13] stats4_3.6.0         evaluate_0.13        pillar_1.3.1        
## [16] GlobalOptions_0.1.0  zlibbioc_1.30.0      rlang_0.3.4         
## [19] lazyeval_0.2.2       VennDiagram_1.6.20   data.table_1.12.2   
## [22] phangorn_2.5.3       S4Vectors_0.22.0     Matrix_1.2-17       
## [25] ineq_0.2-13          rmarkdown_1.12       labeling_0.3        
## [28] stringr_1.4.0        igraph_1.2.4.1       munsell_0.5.0       
## [31] compiler_3.6.0       xfun_0.6             pkgconfig_2.0.2     
## [34] BiocGenerics_0.30.0  shape_1.4.4          htmltools_0.3.6     
## [37] tidyselect_0.2.5     tibble_2.1.1         gridExtra_2.3       
## [40] quadprog_1.5-6       IRanges_2.18.0       withr_2.1.2         
## [43] crayon_1.3.4         dplyr_0.8.0.1        grid_3.6.0          
## [46] nlme_3.1-139         jsonlite_1.6         gtable_0.3.0        
## [49] magrittr_1.5         formatR_1.6          scales_1.0.0        
## [52] tidytree_0.2.4       stringi_1.4.3        XVector_0.24.0      
## [55] ggtree_1.16.0        futile.logger_1.4.3  rvcheck_0.1.3       
## [58] fastmatch_1.1-0      lambda.r_1.2.3       tools_3.6.0         
## [61] treeio_1.8.0         msa_1.16.0           glue_1.3.1          
## [64] purrr_0.3.2          parallel_3.6.0       yaml_2.2.0          
## [67] colorspace_1.4-1     UpSetR_1.3.3         knitr_1.22