Analysis of high-throughput sequencing of T and B cell receptors with LymphoSeq

Importing data

Subsetting data

Extracting productive sequences

Create a table of summary statistics

Calculate clonal relatedness

Draw a phylogenetic tree

Multiple sequence alignment

Searching for sequences

Searching for published sequences

Visualizing repertoire diversity

Comparing samples

Differential abundance

Finding recurring sequences

Tracking sequences across samples

Comparing V(D)J gene usage

Removing sequences

Merging samples

Conclusion

Session info

Application of high-throughput sequencing of T and B lymphocyte antigen receptors has great potential for improving the monitoring of lymphoid malignancies, assessing immune reconstitution after hematopoietic stem cell transplantation, and characterizing the composition of lymphocyte repertoires (Warren, E. H. et al. Blood 2013;122:19–22). LymhoSeq is an R package designed to import, analyze, and visualize antigen receptor sequencing from Adaptive Biotechnologies’ ImmunoSEQ assay. The package is also adaptable to the analysis of T and B cell receptor sequencing processed using other platforms such as MiXCR or IMGT/HighV-QUEST. This vignette has been written to highlight some of the features of LymphoSeq and guide the user through a typical workflow.

Importing data

The LymphoSeq function readImmunoSeq imports tab-separated value (.tsv) files exported by Adaptive Biotechnologies ImmunoSEQ analyzer v2 where each row represents a unique sequence and each column is a variable with information about that sequence such as read count, frequency, or variable gene name. Note that the file format for ImmunoSEQ analyzer v3 is not yet supported and users must choose to export the v2 format from the analyzer software. Only files with the extension .tsv are imported while all other are disregarded. It is possible to import files processed using other platforms as long as the files are tab-delimited, are given the extension .tsv and have identical column names as the ImmunoSEQs files (see readImmunoSeq manual for a list of column names used by this file type). Refer to the LymphoSeq manual regarding the required column names used by each function.

To explore the features of LymphoSeq, this package includes 2 example data sets. The first is a data set of T cell receptor beta (TCRB) sequencing from 10 blood samples acquired serially from a single patient who underwent a bone marrow transplant (Kanakry, C.G., et al. JCI Insight 2016;1(5):pii: e86252). The second, is a data set of B cell receptor immunoglobulin heavy (IGH) chain sequencing from Burkitt lymphoma tumor biopsies acquired from 10 different individuals (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). To improve performance, both data sets contain only the top 1,000 most frequent sequences. The complete data sets are publicly available through Adapatives’ immuneACCESS portal. As shown in the example below, you can specify the path to the example data sets using the command system.file("extdata", "TCRB_sequencing", package = "LymphoSeq") for the TCRB files and system.file("extdata", "IGH_sequencing", package = "LymphoSeq") for the IGH files.

readImmunoSeq imports each file in the specified directory as a list object where each file becomes a data frame. You can import all columns from each file by setting the columns parameter to "all" or list just those columns of interest. Be aware that Adaptive Biotechnologies has changed the column names of their files over time and if the headings of your files are not all the same, you will need to specify "all" or provide all variations of the column header. By default, the columns parameter is set to import only those columns used by LymphoSeq.

library(LymphoSeq)

## Loading required package: LymphoSeqDB

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape

TCRB.path <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq")

TCRB.list <- readImmunoSeq(path = TCRB.path)

Notice that each data frame listed in the TCRB.list object is named according the ImmunoSEQ file names. If different names are desired, you may rename the original .tsv files or assign names(TCRB.list) to a new character vector of desired names in the same order as the list.

names(TCRB.list)

 [1] "TRB_CD4_949"       "TRB_CD8_949"       "TRB_CD8_CMV_369"  
 [4] "TRB_Unsorted_0"    "TRB_Unsorted_1320" "TRB_Unsorted_1496"
 [7] "TRB_Unsorted_32"   "TRB_Unsorted_369"  "TRB_Unsorted_83"  
[10] "TRB_Unsorted_949"

Having the data in the form of a list makes it easy to apply a function over that list using the base function lapply. For example, you may use the function dim to report the dimensions of each data frame as shown below. Noticed that each data frame in the example below has less than 1,000 rows and 11 columns.

lapply(TCRB.list, dim)

$TRB_CD4_949
[1] 1000   11

$TRB_CD8_949
[1] 1000   11

$TRB_CD8_CMV_369
[1] 414  11

$TRB_Unsorted_0
[1] 1000   11

$TRB_Unsorted_1320
[1] 1000   11

$TRB_Unsorted_1496
[1] 1000   11

$TRB_Unsorted_32
[1] 920  11

$TRB_Unsorted_369
[1] 1000   11

$TRB_Unsorted_83
[1] 1000   11

$TRB_Unsorted_949
[1] 1000   11

In place of dim, you may also use colnames, nrow, ncol, or other more complex functions that perform operations on subsetted columns.

Subsetting data

If you imported all of the files from your project but just want to perform an analysis on a subset, use standard R methods to subset the list. Remember that a single bracket [ returns a list and a double bracket [[ returns a single data frame.

CMV <- TCRB.list[grep("CMV", names(TCRB.list))]
names(CMV)

[1] "TRB_CD8_CMV_369"

TRB_Unsorted_0 <- TCRB.list[["TRB_Unsorted_0"]]
head(TRB_Unsorted_0)

ABCDEFGHIJ0123456789

	aminoAcid <chr>
1
2	CASSPVSNEQFF
3	CASSQEVPPYQAFF
4
5	CASSQEASGRQTQYF
6	CASSLEHTGATNEKLFF

For more complex subsetting, you can use a metadata file where one column contains the file names and the other columns have additional information about the sample files. You can then subset the metadata file using criteria from the other columns to give you just a character vector of file names that you can use to subset TCRB.list. In the example below, a metadata file is imported for the example TCRB data set which contains information on the number of days post bone marrow transplant the sample was collected and the cellular phenopyte the blood sample was sorted for prior to sequencing.

TCRB.metadata <- read.csv(system.file("extdata", "TCRB_metadata.csv", package = "LymphoSeq"))
TCRB.metadata

ABCDEFGHIJ0123456789

samples <fctr>	day <int>	phenotype <fctr>
TRB_Unsorted_0	0	Unsorted
TRB_Unsorted_32	32	Unsorted
TRB_Unsorted_83	82	Unsorted
TRB_CD8_CMV_369	369	CD8+CMV+
TRB_Unsorted_369	369	Unsorted
TRB_CD4_949	949	CD4+
TRB_CD8_949	949	CD8+
TRB_Unsorted_949	949	Unsorted
TRB_Unsorted_1320	1320	Unsorted
TRB_Unsorted_1496	1496	Unsorted

selected <- as.character(TCRB.metadata[TCRB.metadata$phenotype == "Unsorted" & 
                                 TCRB.metadata$day > 300, "samples"])
TCRB.list.selected <- TCRB.list[selected]
names(TCRB.list.selected)

[1] "TRB_Unsorted_369"  "TRB_Unsorted_949"  "TRB_Unsorted_1320"
[4] "TRB_Unsorted_1496"

Extracting productive sequences

A productive sequence is defined as a sequences that is in frame and does not have an early stop codon. If you sequenced genomic DNA as opposed to complimentary DNA made from RNA, then you will have unproductive and productive sequences in your data files. Use the function productiveSeq to remove unproductive sequences and recompute the frequencyCount for each of your samples.

If you are interested in just the complementarity determining region 3 (CDR3) amino acid sequences, then set aggregate to "aminoAcid" and the count and estimated number of genomes for duplicate amino acid sequences will be summed. Note that the resulting list of data frames will have columns corresponding to “aminoAcid”, “count”, “frequencyCount”, and “estimatedNumberGenomes” (if this column is available) only. All other columns, such as those corresponding to the V, D, and J gene names, will be removed if they were included in your original file list. The reason for this is to avoid confusion since a single amino acid CDR3 sequence may be encoded by multiple different nucleotide sequences with differing V, D, and J genes.

productive.TRB.aa <- productiveSeq(file.list = TCRB.list, aggregate = "aminoAcid", 
                               prevalence = FALSE)

Alternatively, you may set aggregate to "nucleotide" and the resulting list of data frames will all have the same columns as your original file list. Take note that some LymphoSeq functions require a productive sequence list aggregated by amino acid or nucleotide.

productive.TRB.nt <- productiveSeq(file.list = TCRB.list, aggregate = "nucleotide", 
                               prevalence = FALSE)

If the parameter prevalence is set to TRUE, then a new column is added to each of the data frames giving the prevalence (%) of each TCR beta CDR3 amino acid sequence in 55 healthy donor peripheral blood samples. Values range from 0 to 100% where 100% means the sequence appeared in the blood of all 55 individuals. The data for this operation resides in a separate package that is automatically loaded called LymphoSeqDB. Please refer to that package manual for more details.

Notice in the example below that there are no amino acid sequences given in the first and fourth row of the TCRB.list data frame for sample “TRB_Unsorted_949”. This is because the nucleotide sequence is out of frame and does not produce a productively transcribed amino acid sequence. If an asterisk (*) appears in the amino acid sequences, this would indicate an early stop codon.

head(TCRB.list[["TRB_Unsorted_0"]])

ABCDEFGHIJ0123456789

	aminoAcid <chr>
1
2	CASSPVSNEQFF
3	CASSQEVPPYQAFF
4
5	CASSQEASGRQTQYF
6	CASSLEHTGATNEKLFF

After productiveSeq is run, the unproductive sequences are removed and the frequencyCount is recalculated for each sequence. If there were two identical amino acid sequences that differed in their nucleotide sequence, they would be combined and their counts added together.

head(productive.TRB.aa[["TRB_Unsorted_0"]])

ABCDEFGHIJ0123456789

	aminoAcid <chr>	count <int>	frequencyCount <dbl>	estimatedNumberGenomes <dbl>
1	CASSPVSNEQFF	822	5.773283	822
2	CASSQEVPPYQAFF	797	5.597696	797
3	CASSQEASGRQTQYF	704	4.944515	704
4	CASSLEHTGATNEKLFF	653	4.586318	653
5	CASSPGDEQYF	619	4.347521	619
6	CSARSPSTGTLAEAFF	429	3.013064	429

Finally, notice that the productive.TRB.nt data frame for sample “TRB_Unsorted_949” below has additional columns not present in productive.TRB.aa but are in TCRB.list. This is because the data frame was aggregated by nucleotide sequence and all of the original columns from TCRB.list were carried over.

head(productive.TRB.nt[["TRB_Unsorted_0"]])

ABCDEFGHIJ0123456789

	aminoAcid <chr>
1	CASSPVSNEQFF
2	CASSQEVPPYQAFF
3	CASSQEASGRQTQYF
4	CASSLEHTGATNEKLFF
5	CASSPGDEQYF
6	CSARSPSTGTLAEAFF

Create a table of summary statistics

To create a table summarizing the total number of sequences, number of unique productive sequences, number of genomes, entropy, clonality, Gini coefficient, and the frequency (%) of the top productive sequence in each imported file, use the function clonality.

clonality(file.list = TCRB.list)

ABCDEFGHIJ0123456789

samples <fctr>	totalSequences <int>	uniqueProductiveSequences <int>	totalCount <int>	clonality <dbl>	giniCoefficient <dbl>
TRB_CD4_949	1000	845	25769	0.442719	0.8665242
TRB_Unsorted_369	1000	830	339413	0.425965	0.8447387
TRB_Unsorted_83	1000	823	236732	0.338114	0.7766277
TRB_CD8_949	1000	794	26239	0.430615	0.9026124
TRB_CD8_CMV_369	414	281	1794	0.331570	0.7606261
TRB_Unsorted_1320	1000	838	178190	0.421630	0.9016617
TRB_Unsorted_1496	1000	832	33669	0.389318	0.8812733
TRB_Unsorted_949	1000	831	6549	0.305784	0.7654438
TRB_Unsorted_0	1000	838	18161	0.280923	0.8184686
TRB_Unsorted_32	920	767	31078	0.134242	0.6007820

The clonality score is derived from the Shannon entropy, which is calculated from the frequencies of all productive sequences divided by the logarithm of the total number of unique productive sequences. This normalized entropy value is then inverted (1 - normalized entropy) to produce the clonality metric.

The Gini coefficient is an alternative metric used to calculate repertoire diversity and is derived from the Lorenz curve. The Lorenz curve is drawn such that x-axis represents the cumulative percentage of unique sequences and the y-axis represents the cumulative percentage of reads. A line passing through the origin with a slope of 1 reflects equal frequencies of all clones. The Gini coefficient is the ratio of the area between the line of equality and the observed Lorenz curve over the total area under the line of equality.

Both Gini coefficient and clonality are reported on a scale from 0 to 1 where 0 indicates all sequences have the same frequency and 1 indicates the repertoire is dominated by a single sequence.

Calculate clonal relatedness

One of the drawbacks of the clonality metric is that it does not take into account sequence similarity. This is particularly important when studying affinity maturation or B cell malignancies(Lombardo, K.A., et al. Blood Advances 2017 1:535-544). Clonal relatedness is a useful metric that takes into account sequence similarity without regard for clonal frequency. It is defined as the proportion of nucleotide sequences that are related by a defined edit distance threshold. The value ranges from 0 to 1 where 0 indicates no sequences are related and 1 indicates all sequences are related. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences different by a single amino acid or nucleotide.

IGH.path <- system.file("extdata", "IGH_sequencing", package = "LymphoSeq")

IGH.list <- readImmunoSeq(path = IGH.path)

clonalRelatedness(list = IGH.list, editDistance = 10)

ABCDEFGHIJ0123456789

samples <fctr>	clonalRelatedness <dbl>
IGH_MVQ108911A_BL	0.623919308
IGH_MVQ194745A_BL	0.845000000
IGH_MVQ81231A_BL	0.623919308
IGH_MVQ89037A_BL	0.288184438
IGH_MVQ90143A_BL	0.006097561
IGH_MVQ92552A_BL	0.272727273
IGH_MVQ93505A_BL	0.392201835
IGH_MVQ93631A_BL	0.757000000
IGH_MVQ94865A_BL	0.007000000
IGH_MVQ95413A_BL	0.003636364

Draw a phylogenetic tree

A phylogenetic tree is a useful way to visualize the similarity between sequences. The phyloTree function create a phylogenetic tree of a single sample using neighbor joining tree estimation for amino acid or nucleotide CDR3 sequences. Each leaf in the tree represents a sequence color coded by the V, D, and J gene usage. The number next to each leaf refers to the sequence count. A triangle shaped leaf indicates the most frequent sequence. The distance between leaves on the horizontal axis corresponds to the sequence similarity (i.e. the further apart the leaves are horizontally, the less similar the sequences are to one another).

productive.IGH.nt <- productiveSeq(file.list = IGH.list, aggregate = "nucleotide")

phyloTree(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "nucleotide", 
         layout = "rectangular")

Multiple sequence alignment

In LymphoSeq, you can perform a multiple sequence alignment using one of three methods provided by the Bioconductor msa package (ClustalW, ClustalOmega, or Muscle) and output results to the console or as a pdf file. One may perform the alignment of all amino acid or nucleotide sequences in a single sample. Alternatively, one may search for a given sequence within a list of samples using an edit distance threshold.

alignSeq(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "aminoAcid", 
         method = "ClustalW", output = "consule")

use default substitution matrix

Searching for sequences

To search for one or more amino acid or nucleotide CDR3 sequences in a list of data frames, use the function searchSeq. You may specify to search in either a list of productive or unproductive data frames.

searchSeq(list = productive.TRB.aa, sequence = "CASSPVSNEQFF", type = "aminoAcid", 
          match = "global", editDistance = 0)

ABCDEFGHIJ0123456789

sample <fctr>	aminoAcid <chr>	count <int>	frequencyCount <dbl>	estimatedNumberGenomes <dbl>
TRB_Unsorted_0	CASSPVSNEQFF	822	5.773282764	822
TRB_CD8_949	CASSPVSNEQFF	2	0.008923791	2

If you have only a partial sequence, set the parameter match to "partial". If you are looking for related sequences that differ by one or more nucleotides or amino acids, then increase the editDistance value. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide.

Searching for published sequences

To search your entire list of data frames for a published amino acid CDR3 TCRB sequence with known antigen specificity, use the function searchPublished.

published <- searchPublished(list = productive.TRB.aa)
head(published)

ABCDEFGHIJ0123456789

	sample <fctr>	aminoAcid <chr>	count <int>	frequencyCount <dbl>	estimatedNumberGenomes <dbl>	PMID <fctr>
1	TRB_Unsorted_32	CASASSGTDTQYF	33	0.131594688	0	20647322
2	TRB_Unsorted_1496	CASSETGGTEAFF	2	0.007267706	2	19786555
3	TRB_Unsorted_949	CASSFSTDTQYF	2	0.038277512	2	23267020
4	TRB_CD8_949	CASSIRSAYEQYF	7	0.031233268	7	21048112
5	TRB_Unsorted_1320	CASSIRSAYEQYF	15	0.010313886	15	21048112
6	TRB_Unsorted_1496	CASSIRSSYEQYF	4	0.014535412	4	21048112

For each found sequence, a table is provides listing the antigen, epitope, HLA type, PubMed ID (PMID), and prevalence (%) of the sequence among 55 healthy donor blood samples. The data for this function resides in the separate LymphoSeqDB package that is automatically loaded when the function is called. Please refer to that package manual for more details.

Visualizing repertoire diversity

Antigen receptor repertoire diversity can be characterized by a number such as clonality or Gini coefficient calculated by the clonality function. Alternatively, you can visualize the repertoire diversity by plotting the Lorenz curve for each sample as defined above. In this plot, the more diverse samples will appear near the dotted diagonal line (the line of equality) whereas the more clonal samples will appear to have a more bowed shape.

lorenzCurve(samples = names(productive.TRB.aa), list = productive.TRB.aa)

Alternatively, you can get a feel for the repertoire diversity by plotting the cumulative frequency of a selected number of the top most frequent clones using the function topSeqsPlot. In this case, each of the top sequences are represented by a different color and all less frequent clones will be assigned a single color (violet).

topSeqsPlot(list = productive.TRB.aa, top = 10)

Both of these functions are built using the ggplot2 package. You can reformat the plot using ggplot2 functions. Please refer to the lorenzCurve and topSeqsPlot manual for specific examples.

Comparing samples

To compare the T or B cell repertoires of all samples in a pairwise fashion, use the bhattacharyyaMatrix or similarityMatrix functions. Both the Bhattacharyya coefficient and similarity score are measures of the amount of overlap between two samples. The value for each ranges from 0 to 1 where 1 indicates the sequence frequencies are identical in the two samples and 0 indicates no shared frequencies. The Bhattacharyya coefficient differs from the similarity score in that it involves weighting each shared sequence in the two distributions by the arithmetic mean of the frequency of each sequence, while calculating the similarity scores involves weighting each shared sequence in the two distributions by the geometric mean of the frequency of each sequence in the two distributions.

bhattacharyya.matrix <- bhattacharyyaMatrix(productive.seqs = productive.TRB.aa)
bhattacharyya.matrix

ABCDEFGHIJ0123456789

	TRB_Unsorted_1496 <dbl>	TRB_Unsorted_1320 <dbl>	TRB_Unsorted_949 <dbl>	TRB_Unsorted_369 <dbl>
TRB_Unsorted_1496	1.00000000	0.95291865	0.85590946	0.524018989
TRB_Unsorted_1320	0.95291865	1.00000000	0.87674297	0.532764399
TRB_Unsorted_949	0.85590946	0.87674297	1.00000000	0.517496433
TRB_Unsorted_369	0.52401899	0.53276440	0.51749643	1.000000000
TRB_Unsorted_83	0.30375115	0.33141730	0.37518862	0.469787569
TRB_Unsorted_32	0.28782025	0.31292611	0.28018709	0.192342747
TRB_CD8_CMV_369	0.75528817	0.77886610	0.71503376	0.512020867
TRB_Unsorted_0	0.01542965	0.01627338	0.01388704	0.008900138
TRB_CD8_949	0.78871302	0.81794347	0.81414832	0.532143928
TRB_CD4_949	0.44361207	0.43743309	0.42587691	0.176016728

similarity.matrix <- similarityMatrix(productive.seqs = productive.TRB.aa)
similarity.matrix

ABCDEFGHIJ0123456789

	TRB_Unsorted_1496 <dbl>	TRB_Unsorted_1320 <dbl>	TRB_Unsorted_949 <dbl>	TRB_Unsorted_369 <dbl>
TRB_Unsorted_1496	1.00000000	0.9581854	0.89906548	0.74206009
TRB_Unsorted_1320	0.95818541	1.0000000	0.93863667	0.65143812
TRB_Unsorted_949	0.89906548	0.9386367	1.00000000	0.77154129
TRB_Unsorted_369	0.74206009	0.6514381	0.77154129	1.00000000
TRB_Unsorted_83	0.54589852	0.4325548	0.59723589	0.47127662
TRB_Unsorted_32	0.32105103	0.4035552	0.26159989	0.23387758
TRB_CD8_CMV_369	0.67188308	0.7063167	0.65222570	0.70312873
TRB_Unsorted_0	0.02730081	0.0371885	0.01387248	0.01588281
TRB_CD8_949	0.78972983	0.7648513	0.90794949	0.69642713
TRB_CD4_949	0.47223856	0.2900181	0.72941015	0.11852468

The results of either function can be visualized by the pairwisePlot function.

pairwisePlot(matrix = bhattacharyya.matrix)

To view sequences shared between two or more samples, use the function commonSeqs. This function requires that a productive amino acid list be specified.

common <- commonSeqs(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32"), 
                    productive.aa = productive.TRB.aa)
head(common)

ABCDEFGHIJ0123456789

	aminoAcid <chr>	TRB_Unsorted_0 <dbl>	TRB_Unsorted_32 <dbl>
1	CASSQDRTGQYGYTF	0.47057171	0.80551900
2	CAWTGGTTEAFF	0.10535188	0.15153328
3	CAISEGNYGYTF	0.03511729	0.18742274
4	CASSFGIQETQYF	0.01404692	0.09570523

To visualize the number of overlapping sequences between two or three samples in the form of a Venn diagram, use the function commonSeqVenn.

commonSeqsVenn(samples = c("TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

commonSeqsVenn(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

To compare the frequency of sequences between two samples as a scatter plot, use the function commonSeqsPlot.

commonSeqsPlot("TRB_Unsorted_32", "TRB_Unsorted_83", 
               productive.aa = productive.TRB.aa, show = "common")

If you have more than 3 samples to compare, use the commonSeqBar function. You can chose to color a single sample with the color.sample argument or a desired intersection with the color.intersection argument.

commonSeqsBar(productive.aa = productive.TRB.aa, 
              samples = c("TRB_CD4_949", "TRB_CD8_949", 
                          "TRB_Unsorted_949", "TRB_Unsorted_1320"), 
              color.sample = "TRB_CD8_949",
              labels = "no")

Differential abundance

When comparing a sample from two different time points, it is useful to identify sequences that are significantly more or less abundant in one versus the other time point (DeWitt, W.S., et al. Journal of Virology 2015 89(8):4517-4526). The differentialAbundance function uses a Fisher exact test to calculate differential abundance of each sequence in two time points and reports the log2 transformed fold change, P value and adjusted P value.

differentialAbundance(list = productive.TRB.aa, 
                      sample1 = "TRB_Unsorted_949", 
                      sample2 = "TRB_Unsorted_1320", 
                      type = "aminoAcid", q = 0.01)

ABCDEFGHIJ0123456789

aminoAcid <fctr>	TRB_Unsorted_949 <dbl>	TRB_Unsorted_1320 <dbl>	p <dbl>	q <dbl>	l2fc <dbl>
CASSPPTGERDTQYF	5.62679426	12.93567573	2.406577e-66	3.205561e-63	-1.200970
CASSQDLMTVDSLFAGANVLTF	8.05741627	3.11891910	2.038483e-63	2.713221e-60	1.369271
CASSLAGDSQETQYF	2.02870813	6.57888404	6.843544e-52	9.101914e-49	-1.697282
CASSSIKTGATEAFF	0.61244019	0.01856499	3.382517e-31	4.495366e-28	5.043912
CASSQDTGNEQFF	0.32535885	0.00000000	1.481077e-25	1.966870e-22	8.345888
CASSQGGSYNSPLHF	0.28708134	0.00000000	1.238489e-22	1.643475e-19	8.165316
CASSFHRDAAYGYTF	0.24880383	0.00000000	1.034867e-19	1.372234e-16	7.958865
CASSQEGGRDNEQFF	0.36363636	0.01031389	1.991455e-19	2.638678e-16	5.139837
CASSPWSNEKLFF	0.22966507	0.00000000	2.990614e-18	3.959573e-15	7.843388
CASSYFRDGGELGTF	0.57416268	0.07013442	2.034201e-16	2.691248e-13	3.033265

Finding recurring sequences

To create a data frame of unique, productive amino acid sequences as rows and sample names as headers use the seqMatrix function. Each value in the data frame represents the frequency that each sequence appears in the sample. You can specify your own list of sequences or all unique sequences in the list using the output of the function uniqueSeqs. The uniqueSeqs function creates a data frame of all unique, productive sequences and reports the total count in all samples.

unique.seqs <- uniqueSeqs(productive.aa = productive.TRB.aa)
head(unique.seqs)

ABCDEFGHIJ0123456789

	aminoAcid <chr>	count <int>
3143	CASSQDWERLGEQFF	99480
2178	CASSLQGREKLFF	90567
3039	CASSQDLMTVDSLFAGANVLTF	68682
2506	CASSPAGAYYNEQFF	30454
2744	CASSPPTGERDTQYF	24703
1642	CASSLAGDSQETQYF	22147

sequence.matrix <- seqMatrix(productive.aa = productive.TRB.aa, sequences = unique.seqs$aminoAcid)
head(sequence.matrix)

ABCDEFGHIJ0123456789

	aminoAcid <chr>	numberSamples <int>	TRB_CD8_949 <dbl>	TRB_CD8_CMV_369 <dbl>	TRB_Unsorted_0 <dbl>
1	CASSQDRTGQYGYTF	9	0.99500268	0.6032172	0.47057171
2	CASSLQGREKLFF	8	8.73192932	5.1608579	0.00000000
3	CASSQDLMTVDSLFAGANVLTF	8	10.65054435	8.8471850	0.00000000
4	CASSYSGNTEAFF	8	0.16955203	0.1340483	0.04214075
5	CASSFMDWTGGNSPLHF	8	0.04461895	0.4691689	0.00000000
6	CASSREGDQPQHF	8	0.29894699	0.1340483	0.00000000

If just the top clones with a frequency greater than a specified amount are of interest to you, then use the topFreq function. This creates a data frame of the top productive amino acid sequences having a minimum specified frequency and reports the minimum, maximum, and mean frequency that the sequence appears in a list of samples. For TCRB sequences, the prevalence (%) and the published antigen specificity of that sequence are also provided.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0.1)
head(top.freq)

ABCDEFGHIJ0123456789

	aminoAcid <chr>	minFrequency <dbl>	maxFrequency <dbl>	meanFrequency <dbl>	numberSamples <int>	prevalence <dbl>
387	CASSQDRTGQYGYTF	0.43249562	1.8625084	0.8334973	9	3.6
276	CASSLQGREKLFF	4.86209528	23.6458434	8.7311794	8	30.9
382	CASSQDLMTVDSLFAGANVLTF	1.55122224	12.1922371	7.3944297	8	1.8
436	CASSREGDQPQHF	0.13404826	2.8472305	0.5767910	8	18.2
158	CASSFMDWTGGNSPLHF	0.03588946	0.6695954	0.2314942	8	1.8
536	CASSYSGNTEAFF	0.04214075	0.8385633	0.2246444	8	69.1

One very useful thing to do is merge the output of seqMatrix and topFreq.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0)
top.freq.matrix <- merge(top.freq, sequence.matrix)
head(top.freq.matrix)

ABCDEFGHIJ0123456789

	aminoAcid <chr>	numberSamples <int>	minFrequency <dbl>	maxFrequency <dbl>	meanFrequency <dbl>	prevalence <dbl>
1	CAAGDTTLYEQYF	1	0.014046917	0.014046917	0.014046917	0.0
2	CAAGTSGDTQYF	1	0.019138756	0.019138756	0.019138756	1.8
3	CAARGGGESYEQYF	1	0.028093833	0.028093833	0.028093833	0.0
4	CAATRRQGDVMNTEAFF	1	0.083541316	0.083541316	0.083541316	0.0
5	CAAWGTGPLGSSGANVLTF	1	0.029931447	0.029931447	0.029931447	0.0
6	CACALGDGYTF	1	0.001375185	0.001375185	0.001375185	0.0

Tracking sequences across samples

To visually track the frequency of sequences across multiple samples, use the function cloneTrack. This function takes the output from the seqMatrix function. You can specify a character vector of amino acid sequences using the parameter track to highlight those sequences with a different color. Alternatively, you can highlight all of the sequences from a given sample using the parameter map. If the mapping feature is use, then you must specify a productive amino acid list and a character vector of labels to title the mapped samples. To hide sequences that are not being tracked or mapped, set unassigned to FALSE.

cloneTrack(sequence.matrix = sequence.matrix, 
           productive.aa = productive.TRB.aa, 
           map = c("TRB_CD4_949", "TRB_CD8_949"), 
           label = c("CD4", "CD8"), 
           track = "CASSPPTGERDTQYF", 
           unassigned = FALSE)

Refer to the cloneTrack manual for examples on how to reformat the chart using ggplot2 function.

Comparing V(D)J gene usage

To compare the V, D, and J gene usage across samples, start by creating a data frame of V, D, and J gene counts and frequencies using the function geneFreq. You can specify if you are interested in the “VDJ”, “DJ”, “VJ”, “DJ”, “V”, “D”, or “J” loci using the locus parameter. Set family to TRUE if you prefer the family names instead of the gene names as reported by ImmunoSeq.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
head(vGenes)

ABCDEFGHIJ0123456789

	samples <fctr>	familyName <fctr>	count <int>	frequencyGene <dbl>
1	TRB_Unsorted_949	TCRBV02	72	1.377990
2	TRB_Unsorted_949	TCRBV03	165	3.157895
3	TRB_Unsorted_949	TCRBV04	780	14.928230
4	TRB_Unsorted_949	TCRBV05	398	7.617225
5	TRB_Unsorted_949	TCRBV06	413	7.904306
6	TRB_Unsorted_949	TCRBV07	475	9.090909

To create a chord diagram showing VJ or DJ gene associations from one or more more samples, combine the output of geneFreq with the function chordDiagramVDJ. This function works well the topSeqs function that creates a data frame of a selected number of top productive sequences. In the example below, a chord diagram is made showing the association between V and J genes of just the single dominant clones in each sample. The size of the ribbons connecting VJ genes correspond to the number of samples that have that recombination event. The thicker the ribbon, the higher the frequency of the recombination.

top.seqs <- topSeqs(productive.seqs = productive.TRB.nt, top = 1)
chordDiagramVDJ(sample = top.seqs, 
                association = "VJ", 
                colors = c("darkred", "navyblue"))

You can also visualize the results of geneFreq as a heat map, word cloud, our cumulative frequency bar plot with the support of additional R packages as shown below.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(RColorBrewer)
library(grDevices)
RedBlue <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(11, "RdBu")))(256)
library(wordcloud)
wordcloud::wordcloud(words = vGenes[vGenes$samples == "TRB_Unsorted_83", "familyName"], 
                     freq = vGenes[vGenes$samples == "TRB_Unsorted_83", "frequencyGene"], 
                     colors = RedBlue)

library(reshape)
vGenes <- reshape::cast(vGenes, familyName ~ samples, value = "frequencyGene", sum)
rownames(vGenes) = as.character(vGenes$familyName)
vGenes$familyName = NULL
library(pheatmap)
pheatmap::pheatmap(vGenes, color = RedBlue, scale = "row")

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(ggplot2)
multicolors <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Set1")))(28)
ggplot2::ggplot(vGenes, aes(x = samples, y = frequencyGene, fill = familyName)) +
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_y_continuous(expand = c(0, 0)) + 
  guides(fill = guide_legend(ncol = 2)) +
  scale_fill_manual(values = multicolors) + 
  labs(y = "Frequency (%)", x = "", fill = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Removing sequences

Occasionally you may identify one or more sequences in your data set that appear to be contamination. You can remove an amino acid sequence from all data frames using the function removeSeq and recompute frequencyCount for all remaining sequences.

searchSeq(list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")

ABCDEFGHIJ0123456789

sample <fctr>	aminoAcid <chr>	count <int>	frequencyCount <dbl>	estimatedNumberGenomes <dbl>
TRB_CD4_949	CASSESAGSTGELFF	5019	30.091732	5019
TRB_Unsorted_1320	CASSESAGSTGELFF	10326	7.100079	10326
TRB_Unsorted_949	CASSESAGSTGELFF	338	6.468900	338
TRB_Unsorted_1496	CASSESAGSTGELFF	1755	6.377412	1755

cleansed <- removeSeq(file.list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")
searchSeq(list = cleansed, sequence = "CASSESAGSTGELFF")

No sequences found.

Merging samples

If you need to combine multiple samples into one, use the mergeFiles function. It merges two or more sample data frames into a single data frame and aggregates count, frequencyCount, and estimatedNumberGenomes.

TRB_949_Merged <- mergeFiles(samples = c("TRB_CD4_949", "TRB_CD8_949"), 
                                file.list = TCRB.list)

Conclusion

Advances in high-throughput sequencing have enabled characterizing T and B lymphocyte repertoires with unprecedented depth. LymphoSeq was developed as a tool to assist in the analysis of targeted next generation sequencing of the hypervariable CDR3 region of T and B cell receptors. The three key features of this R package are to characterize lymphocyte repertoire diversity, compare two or more lymphocyte repertoires, and track the frequency of CDR3 sequences across multiple samples. LymphoSeq also provides the unique ability to search for sequences in a curated database of published TCRB sequences with known antigen specificity. Finally, LymphoSeq can assign the percent prevalence that any given TCRB sequence appears in a the peripheral blood in healthy population of donors.

Session info

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.1      pheatmap_1.0.12    reshape_0.8.8     
## [4] wordcloud_2.6      RColorBrewer_1.1-2 LymphoSeq_1.12.0  
## [7] LymphoSeqDB_0.99.2
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1   Rcpp_1.0.1           ape_5.3             
##  [4] lattice_0.20-38      circlize_0.4.6       tidyr_0.8.3         
##  [7] Biostrings_2.52.0    assertthat_0.2.1     digest_0.6.18       
## [10] R6_2.4.0             plyr_1.8.4           futile.options_1.0.1
## [13] stats4_3.6.0         evaluate_0.13        pillar_1.3.1        
## [16] GlobalOptions_0.1.0  zlibbioc_1.30.0      rlang_0.3.4         
## [19] lazyeval_0.2.2       VennDiagram_1.6.20   data.table_1.12.2   
## [22] phangorn_2.5.3       S4Vectors_0.22.0     Matrix_1.2-17       
## [25] ineq_0.2-13          rmarkdown_1.12       labeling_0.3        
## [28] stringr_1.4.0        igraph_1.2.4.1       munsell_0.5.0       
## [31] compiler_3.6.0       xfun_0.6             pkgconfig_2.0.2     
## [34] BiocGenerics_0.30.0  shape_1.4.4          htmltools_0.3.6     
## [37] tidyselect_0.2.5     tibble_2.1.1         gridExtra_2.3       
## [40] quadprog_1.5-6       IRanges_2.18.0       withr_2.1.2         
## [43] crayon_1.3.4         dplyr_0.8.0.1        grid_3.6.0          
## [46] nlme_3.1-139         jsonlite_1.6         gtable_0.3.0        
## [49] magrittr_1.5         formatR_1.6          scales_1.0.0        
## [52] tidytree_0.2.4       stringi_1.4.3        XVector_0.24.0      
## [55] ggtree_1.16.0        futile.logger_1.4.3  rvcheck_0.1.3       
## [58] fastmatch_1.1-0      lambda.r_1.2.3       tools_3.6.0         
## [61] treeio_1.8.0         msa_1.16.0           glue_1.3.1          
## [64] purrr_0.3.2          parallel_3.6.0       yaml_2.2.0          
## [67] colorspace_1.4-1     UpSetR_1.3.3         knitr_1.22

Analysis of high-throughput sequencing of T and B cell receptors with LymphoSeq

David G. Coffey, MD

April 14, 2017

Importing data

Subsetting data

Extracting productive sequences

Create a table of summary statistics

Calculate clonal relatedness

Draw a phylogenetic tree

Multiple sequence alignment

Searching for sequences

Searching for published sequences

Visualizing repertoire diversity

Comparing samples

Differential abundance

Finding recurring sequences

Tracking sequences across samples

Comparing V(D)J gene usage

Removing sequences

Merging samples

Conclusion

Session info