BiocNeighbors 1.23.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2946 8513 5088 8334 230 3847 7478 1100 476 5916
## [2,] 5594 1764 3340 4430 8609 6034 3743 513 4963 9756
## [3,] 6028 2128 3395 9548 4812 4027 7436 51 3285 7576
## [4,] 9854 5862 9243 2470 5976 6570 2920 5169 6440 8100
## [5,] 6714 9853 8177 4265 6562 4747 4128 8761 7740 5936
## [6,] 1222 8082 4233 4291 1877 2284 6607 2689 999 3007
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9235007 0.9530529 0.9890140 1.0011725 1.0154260 1.0261490 1.0292401
## [2,] 0.9257770 0.9334330 0.9399729 0.9481642 0.9917586 1.0002267 1.0243708
## [3,] 0.8136322 0.8520054 0.9238144 0.9251338 0.9374168 0.9412203 0.9571537
## [4,] 1.0394421 1.0892286 1.0902098 1.1228871 1.1352572 1.1468672 1.1534649
## [5,] 0.9842608 1.0259783 1.0298620 1.0343409 1.0476205 1.0844207 1.0887936
## [6,] 0.8660069 0.8760877 1.0065460 1.0092150 1.0407312 1.0518422 1.0564597
## [,8] [,9] [,10]
## [1,] 1.0390409 1.0397460 1.050268
## [2,] 1.0445685 1.0509053 1.055647
## [3,] 0.9660875 0.9688493 0.982708
## [4,] 1.1632903 1.1635214 1.167719
## [5,] 1.0903933 1.1147499 1.129012
## [6,] 1.0669339 1.0831796 1.088232
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 187 9832 8085 7868 8267
## [2,] 4652 652 2841 8680 3832
## [3,] 8697 247 4633 6875 8524
## [4,] 952 5814 9058 8477 3749
## [5,] 2296 591 6945 1780 5490
## [6,] 9022 7920 9120 8311 6118
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8515963 0.9193652 0.9327104 0.9757583 0.9771211
## [2,] 0.9044446 0.9417438 1.0381624 1.0646532 1.0714722
## [3,] 0.9152412 0.9439426 0.9461997 0.9637957 0.9833848
## [4,] 0.8363032 0.9108430 1.0056951 1.0078031 1.0273498
## [5,] 0.8867046 0.9226076 0.9278751 0.9409248 0.9547678
## [6,] 0.8343502 0.9201649 0.9257336 0.9330530 0.9600113
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpgYLa71/file3a520785bb907.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.23.0 knitr_1.46 BiocStyle_2.33.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.43.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.39.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.51.0 cachem_1.0.8