BiocNeighbors 1.6.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3521 6940 7197 2671 590 6508 649 891 1436 2265
## [2,] 5076 1010 6926 3469 122 2042 6395 778 7022 1902
## [3,] 6755 1777 5789 9447 9730 1215 2609 793 7163 1266
## [4,] 1553 8994 8187 3844 1962 3457 109 8824 7417 939
## [5,] 2426 91 9785 9127 2655 5888 5433 3348 9616 8991
## [6,] 8786 6180 3485 3826 4068 8907 4252 3467 6429 7282
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8939867 0.9363595 1.0008545 1.0161842 1.0631888 1.064085 1.077162
## [2,] 0.9608915 0.9682344 1.0185841 1.0358274 1.0508249 1.051615 1.089248
## [3,] 0.8045722 0.9334469 0.9345300 0.9352991 0.9947063 1.039884 1.044366
## [4,] 0.8362793 0.8582160 0.8825148 0.9497098 0.9522839 1.007295 1.007637
## [5,] 0.9503789 1.0751200 1.1055510 1.1151500 1.1241959 1.133050 1.135079
## [6,] 1.0123729 1.0160055 1.0189724 1.0288382 1.0357256 1.050178 1.080513
## [,8] [,9] [,10]
## [1,] 1.094295 1.102976 1.121646
## [2,] 1.094440 1.136807 1.139871
## [3,] 1.053339 1.083925 1.087731
## [4,] 1.007907 1.016277 1.019069
## [5,] 1.136122 1.137514 1.143163
## [6,] 1.101067 1.112083 1.123004
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2527 1139 2203 9993 6640
## [2,] 3129 4395 6689 4217 6701
## [3,] 9677 385 10 3993 7034
## [4,] 7533 8356 1771 48 5844
## [5,] 1566 6995 2749 9165 3525
## [6,] 4126 7452 6597 6468 9665
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8259433 0.9379281 0.9428999 0.9689491 0.9751185
## [2,] 0.8221393 0.9631063 1.1422907 1.1520110 1.1591171
## [3,] 0.8956584 0.9790214 1.0144969 1.0156194 1.0178609
## [4,] 0.8217030 1.0031134 1.0049297 1.0262222 1.0467658
## [5,] 0.9914263 1.0444784 1.0719147 1.0735639 1.0821632
## [6,] 0.9146192 1.0163833 1.0326532 1.0444288 1.0478787
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "C:\\Users\\biocbuild\\bbs-3.11-bioc\\tmpdir\\Rtmp6hDGS5\\file2e806c4a418a.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.6.0 knitr_1.28 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 bookdown_0.18 lattice_0.20-41
## [4] digest_0.6.25 grid_4.0.0 stats4_4.0.0
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.5
## [10] stringi_1.4.6 S4Vectors_0.26.0 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.22.0 tools_4.0.0
## [16] stringr_1.4.0 parallel_4.0.0 xfun_0.13
## [19] yaml_2.2.1 compiler_4.0.0 BiocGenerics_0.34.0
## [22] BiocManager_1.30.10 htmltools_0.4.0