BiocNeighbors 1.10.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6839 3564 8970 8494 2514 3375 8041 4673 2809 6557
## [2,] 3762 8510 1930 1709 7576 7814 9139 3511 1450 2967
## [3,] 3446 3654 4728 1638 2751 2345 3632 1000 7965 4392
## [4,] 8817 6584 8180 59 3050 2645 1465 7738 9324 6834
## [5,] 9889 1561 3571 9 4544 2686 9911 6703 2909 9965
## [6,] 2855 5428 5531 5760 771 7536 3368 6385 670 2652
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9921864 0.9957834 0.9994216 1.0113053 1.0138078 1.0223548 1.0240149
## [2,] 0.8645985 0.9447311 0.9579430 0.9840193 0.9866839 0.9876871 0.9897641
## [3,] 0.9259275 0.9518557 0.9632121 0.9735848 0.9742126 0.9901797 0.9999163
## [4,] 0.8075849 0.8929428 0.9125759 0.9290096 0.9432879 0.9578164 0.9866946
## [5,] 0.9659729 0.9993480 1.0383309 1.1012558 1.1051797 1.1090480 1.1115953
## [6,] 0.7669217 0.9034132 0.9614732 0.9849571 0.9958620 0.9966433 1.0280401
## [,8] [,9] [,10]
## [1,] 1.0260382 1.0628729 1.072670
## [2,] 0.9969519 0.9970498 1.003801
## [3,] 1.0107327 1.0114325 1.023189
## [4,] 0.9878797 1.0132945 1.048254
## [5,] 1.1497359 1.1527627 1.160382
## [6,] 1.0282890 1.0388981 1.039620
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4718 2150 4276 4908 1207
## [2,] 4228 2541 8287 1392 6196
## [3,] 3533 7801 6299 1937 7967
## [4,] 9597 9572 5819 6116 5156
## [5,] 8190 367 8620 223 5560
## [6,] 4112 6999 9241 1377 9643
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8218788 0.8370792 0.8468705 0.9359561 0.9615508
## [2,] 0.8191825 0.8814184 0.9263154 0.9368848 0.9414537
## [3,] 0.7210571 0.8295944 0.8740436 0.9274628 0.9452164
## [4,] 0.8753332 0.8821529 0.8910749 0.9699543 1.0150330
## [5,] 0.8536909 0.8577128 0.8752092 0.8840870 0.8868778
## [6,] 1.0393667 1.0482323 1.1029489 1.1369380 1.1387308
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpfiYBOk/file2a908cf0048a2.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.10.0 knitr_1.33 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.38.0
## [4] BiocParallel_1.26.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.30.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2