BiocNeighbors 1.4.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3330 453 5202 1713 84 1954 6101 5370 6996 255
## [2,] 4316 4108 8339 2349 8027 4012 7406 4294 9085 2981
## [3,] 1488 2739 4203 7265 2415 9604 1684 4807 1929 368
## [4,] 503 3776 3251 223 8246 154 7273 4974 8659 4734
## [5,] 263 2254 4146 7821 8135 6604 5996 1620 8644 2938
## [6,] 9232 5844 9273 2854 29 6970 6750 3720 8387 6651
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9549388 0.9730841 0.9882953 0.9905818 0.9938405 0.9974384 1.000726
## [2,] 1.0250026 1.0309801 1.0598124 1.1316205 1.1389508 1.1660076 1.177922
## [3,] 0.8048801 0.9330851 0.9579151 0.9606285 0.9664934 1.0006901 1.012451
## [4,] 0.8739503 0.9477754 0.9998853 1.0192013 1.0375460 1.0391052 1.041700
## [5,] 0.9000127 0.9015812 0.9615081 0.9739660 0.9995403 1.0007290 1.025438
## [6,] 0.9290770 1.0146314 1.0224247 1.0559702 1.0629684 1.0785404 1.093382
## [,8] [,9] [,10]
## [1,] 1.002664 1.007617 1.019291
## [2,] 1.178585 1.190359 1.196825
## [3,] 1.012911 1.023324 1.029479
## [4,] 1.052582 1.058496 1.059432
## [5,] 1.058356 1.063869 1.067919
## [6,] 1.100609 1.100866 1.105177
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9728 4016 4777 1961 3795
## [2,] 8444 2423 9186 7961 4366
## [3,] 5585 8332 6774 3664 6207
## [4,] 1247 2838 748 5210 8646
## [5,] 25 8704 419 589 6180
## [6,] 2465 864 5663 6221 4024
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9035632 0.9583917 0.9695784 0.9708687 0.9732593
## [2,] 0.7779893 0.9002860 0.9738870 0.9785135 0.9811455
## [3,] 0.7432720 0.9567128 0.9840043 1.0007811 1.0292883
## [4,] 0.9533122 0.9736031 0.9905141 1.0444406 1.0453193
## [5,] 0.8218890 0.8650465 0.9208409 0.9362080 0.9643939
## [6,] 0.9163575 0.9436435 1.0057408 1.0197005 1.0507250
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpFLnDHO/file7bd9127d33df.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.4.2 knitr_1.28 BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.17 lattice_0.20-40
## [4] digest_0.6.25 grid_3.6.2 stats4_3.6.2
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.4
## [10] stringi_1.4.6 S4Vectors_0.24.3 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.20.1 tools_3.6.2
## [16] stringr_1.4.0 parallel_3.6.2 xfun_0.12
## [19] yaml_2.2.1 compiler_3.6.2 BiocGenerics_0.32.0
## [22] BiocManager_1.30.10 htmltools_0.4.0