BiocNeighbors 1.22.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods used below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7807 4048 1690 7496 4275 9032 8503 1309 3223 2511
## [2,] 4247 4150 5612 306 7384 2442 6105 9878 741 2088
## [3,] 6359 1742 5790 5423 5590 773 5162 6554 5966 5886
## [4,] 168 7178 5324 9424 163 9508 4452 8181 2042 982
## [5,] 3179 8476 9428 9454 8266 7063 2614 6180 8623 3659
## [6,] 4691 1603 7267 5705 3601 6422 4417 1084 1174 9386
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8743414 0.9185905 0.9246314 0.9619106 0.9977632 0.9991344 1.027064
## [2,] 0.9375042 0.9496952 1.0273458 1.0308537 1.0516742 1.0647935 1.064933
## [3,] 0.9058248 0.9150775 0.9485200 0.9524078 0.9763735 1.0003943 1.001401
## [4,] 0.8935077 0.9116217 0.9428035 0.9848570 0.9997736 1.0091287 1.023886
## [5,] 0.8322350 0.8699463 0.8776898 0.9177148 0.9702912 1.0075313 1.009186
## [6,] 0.9632362 1.0675830 1.0817145 1.0837052 1.0850976 1.0946070 1.100409
## [,8] [,9] [,10]
## [1,] 1.032685 1.050188 1.075408
## [2,] 1.079661 1.088457 1.090731
## [3,] 1.032016 1.050325 1.054712
## [4,] 1.031963 1.036094 1.037548
## [5,] 1.040292 1.043038 1.067523
## [6,] 1.114256 1.115492 1.117128
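Because the search is approximate, it can be useful to check how many of the true neighbors are recovered. The following is a minimal sketch, assuming the BiocNeighbors 1.22.0 interface and using the exact KmknnParam() search as the reference; the seed, the smaller data size and the variable names are choices made here for illustration only.

```r
library(BiocNeighbors)

set.seed(1000)  # arbitrary seed so that the simulated data are reproducible
nobs <- 2000    # smaller than the example above to keep the exact search quick
ndim <- 20
k <- 10
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Exact neighbors serve as the reference.
exact <- findKNN(data, k=k, BNPARAM=KmknnParam())

# Approximate neighbors from Annoy with default settings.
annoy <- findKNN(data, k=k, BNPARAM=AnnoyParam())

# Per-point fraction of the exact neighbors that Annoy also reports.
recall <- vapply(seq_len(nobs), function(i) {
    length(intersect(exact$index[i,], annoy$index[i,])) / k
}, numeric(1))
summary(recall)
```

The recall observed will depend on the data and on the Annoy settings, so no particular value should be expected from this sketch.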
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8820 8386 4250 8215 3245
## [2,] 3022 5510 4203 5314 7900
## [3,] 5878 92 893 3203 9905
## [4,] 2297 3135 7971 2122 8963
## [5,] 4079 6059 5278 6550 635
## [6,] 2170 564 4309 3086 8310
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0121223 1.0327640 1.0490565 1.0862927 1.0990795
## [2,] 0.8792346 0.8977629 0.9626414 0.9746867 0.9945005
## [3,] 0.8142797 0.8937566 1.0337393 1.0369730 1.0720905
## [4,] 0.8253521 0.8921404 0.9003357 0.9294668 0.9371983
## [5,] 0.8711024 1.0315286 1.1635962 1.1840237 1.2081349
## [6,] 0.8470211 0.9116422 0.9375792 1.0012208 1.0519012
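To make the meaning of the two output matrices concrete, the sketch below recomputes one reported distance by hand. It assumes that each row of the output corresponds to a row of query and that the indices refer to rows of data; the seed and data sizes are arbitrary. A small tolerance is used in the comparison because the reported distances may be held at lower precision than R's doubles.

```r
library(BiocNeighbors)

set.seed(1001)  # arbitrary seed for reproducible simulated data
ndim <- 20
data <- matrix(runif(5000*ndim), ncol=ndim)
query <- matrix(runif(100*ndim), ncol=ndim)

qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())

# Closest reported neighbor (in 'data') of the first query point.
i <- 1
nn <- qout$index[i, 1]

# Euclidean distance recomputed directly from the two rows.
manual <- sqrt(sum((query[i,] - data[nn,])^2))
c(reported=qout$distance[i, 1], manual=manual)

# Should be TRUE up to the chosen tolerance.
abs(manual - qout$distance[i, 1]) < 1e-4
```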
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
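As an illustration of switching algorithms, the sketch below runs the same search once per parameter object. It assumes the BiocNeighbors 1.22.0 interface with default settings for both AnnoyParam() and HnswParam(); the list names, seed and data size are arbitrary.

```r
library(BiocNeighbors)

set.seed(1002)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(2000*20), ncol=20)

# One parameter object per approximate algorithm.
params <- list(Annoy=AnnoyParam(), HNSW=HnswParam())

# Only the BNPARAM argument changes between the two calls.
results <- lapply(params, function(p) findKNN(data, k=10, BNPARAM=p))

# Both results have one row per point and one column per neighbor.
lapply(results, function(x) dim(x$index))
```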
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
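The other options can be sketched in the same way. The example below assumes the BiocNeighbors 1.22.0 interface and uses SerialParam() from BiocParallel purely as a placeholder backend; a multi-worker backend such as MulticoreParam() or SnowParam() would be substituted to actually parallelize. The seed, data size and variable names are arbitrary.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(1003)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(2000*20), ncol=20)

# subset: only report neighbors for the first 10 points.
sub <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub$index)

# get.distance: skip the distance matrix when only indices are needed.
idx.only <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(idx.only)

# BPPARAM: a BiocParallel backend; SerialParam() is a single-worker placeholder.
par.out <- findKNN(data, k=5, BNPARAM=AnnoyParam(), BPPARAM=SerialParam())
dim(par.out$index)
```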
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
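A minimal sketch of a Manhattan search with Annoy follows, assuming the BiocNeighbors 1.22.0 interface. The hand computation at the end is added here only to show which distance is being reported; the seed, data size and tolerance are arbitrary choices.

```r
library(BiocNeighbors)

set.seed(1004)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(2000*20), ncol=20)

# Same call as before, with the distance set in the parameter object.
mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance to the closest reported neighbor of the first point.
nn <- mout$index[1, 1]
manual <- sum(abs(data[1,] - data[nn,]))
c(reported=mout$distance[1, 1], manual=manual)

# Should be TRUE up to the chosen tolerance.
abs(manual - mout$distance[1, 1]) < 1e-3
```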
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/var/folders/db/4tvgx8jx4z3fm1gzlnlzw9rc0000gq/T//Rtmp74d8OV/file10835ec41fcd.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
In a later session, this path can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
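The sketch below shows one way to control where the index file is written and to clean it up afterwards. It assumes the BiocNeighbors 1.22.0 interface and uses the directory argument of AnnoyParam() rather than an explicit file path; the directory name annoy-indices is hypothetical and is placed under tempdir() here only so that the example leaves nothing behind.

```r
library(BiocNeighbors)

set.seed(1005)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(2000*20), ncol=20)

# Hypothetical index directory; a real analysis would point this at a
# persistent location instead of a subdirectory of tempdir().
idx.dir <- file.path(tempdir(), "annoy-indices")
dir.create(idx.dir, showWarnings=FALSE)

# The index file is written inside 'idx.dir' rather than directly in tempdir().
pre <- buildIndex(data, BNPARAM=AnnoyParam(directory=idx.dir))
idx.file <- AnnoyIndex_path(pre)
file.exists(idx.file)

# The pre-built index is used as before.
out <- findKNN(BNINDEX=pre, k=5)

# Clean up the index file once all calculations are complete.
unlink(idx.file)
file.exists(idx.file)
```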
sessionInfo()
## R version 4.4.0 beta (2024-04-14 r86421)
## Platform: x86_64-apple-darwin20
## Running under: macOS Monterey 12.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.22.0 knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.42.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.38.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.50.0 cachem_1.0.8