BiocNeighbors 1.12.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9132 486 3702 9226 3669 6981 3108 9122 6643 9331
## [2,] 4647 3732 2324 6478 3408 5033 4518 4443 3730 2800
## [3,] 5338 6931 1751 7620 7500 8486 9228 7420 9835 9790
## [4,] 3731 3099 9586 6848 1331 6740 9061 4712 7945 1727
## [5,] 5027 353 701 5631 7507 9070 9111 8457 3522 2309
## [6,] 8563 4388 9485 5990 9660 1023 6027 2571 5531 6911
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.7955538 0.7993070 0.8670275 0.8893975 0.8990859 0.9185116 0.9308243
## [2,] 0.8612506 0.8810785 0.8846026 0.9016428 0.9525601 0.9794540 0.9894384
## [3,] 0.8985270 0.9190195 0.9773331 0.9811532 0.9923183 0.9931732 1.0417367
## [4,] 0.8401244 0.8699952 0.8739700 0.8985938 0.8994962 0.9097665 0.9111904
## [5,] 1.0512825 1.0699428 1.0717589 1.0932229 1.1015829 1.1048776 1.1068014
## [6,] 0.7201223 0.8758298 0.9073330 0.9589700 0.9773981 0.9850024 0.9959804
## [,8] [,9] [,10]
## [1,] 0.9489902 0.9665209 0.9728610
## [2,] 0.9967210 1.0027425 1.0114421
## [3,] 1.0571125 1.0683599 1.0922165
## [4,] 0.9245242 0.9503780 0.9547779
## [5,] 1.1175274 1.1209606 1.1239760
## [6,] 1.0304445 1.0322986 1.0459485
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1346 4530 4406 8088 5497
## [2,] 8659 3643 8990 3902 1001
## [3,] 7100 2624 2153 4897 6453
## [4,] 7749 6135 2230 265 8620
## [5,] 2995 5749 6548 921 1780
## [6,] 7579 8470 2683 8108 7873
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9977517 1.0536718 1.068333 1.073585 1.077637
## [2,] 1.0567617 1.0594093 1.060906 1.062710 1.075028
## [3,] 0.9056990 0.9921265 1.035131 1.040361 1.051483
## [4,] 1.0141206 1.0184829 1.037667 1.063885 1.068576
## [5,] 1.0487195 1.0592717 1.082760 1.119992 1.138012
## [6,] 0.9695914 0.9891419 1.058603 1.082217 1.083325
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "D:\\biocbuild\\bbs-3.14-bioc\\tmpdir\\RtmpKWPENZ\\file3988b7419e.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.12.0 knitr_1.36 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 magrittr_2.0.1 BiocGenerics_0.40.0
## [4] BiocParallel_1.28.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_0.4.12 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.1.1 parallel_4.1.1 grid_4.1.1
## [13] xfun_0.27 jquerylib_0.1.4 htmltools_0.5.2
## [16] yaml_2.2.1 digest_0.6.28 bookdown_0.24
## [19] Matrix_1.3-4 BiocManager_1.30.16 S4Vectors_0.32.0
## [22] sass_0.4.0 evaluate_0.14 rmarkdown_2.11
## [25] stringi_1.7.5 compiler_4.1.1 bslib_0.3.1
## [28] stats4_4.1.1 jsonlite_1.7.2