BiocNeighbors 1.6.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2472 8785 4480 2287 7154 5087 306 9725 355 3394
## [2,] 5679 9999 7159 1622 8671 1447 3588 9629 1197 297
## [3,] 9862 8733 1684 7877 4205 1388 3570 1146 3504 10
## [4,] 8520 1375 7533 5937 9217 1923 6576 3922 6534 3078
## [5,] 7475 7014 2667 1272 2384 2656 6914 7399 5816 3233
## [6,] 1076 1939 6881 6542 7726 8623 8700 120 6332 2766
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0160896 1.0305198 1.0539002 1.0731078 1.0773553 1.079474 1.0834223
## [2,] 0.9352278 0.9359603 0.9668155 0.9747400 0.9870442 1.019455 1.0362035
## [3,] 0.8528845 0.9201739 0.9223238 0.9334182 0.9353036 0.959167 0.9967918
## [4,] 0.8827361 0.9544235 0.9728699 1.0183616 1.0454656 1.067823 1.0709544
## [5,] 0.8144282 0.8833866 0.9234329 0.9283065 0.9300396 1.002293 1.0083969
## [6,] 0.9313099 0.9518151 0.9607627 0.9678324 1.0065429 1.022031 1.0428864
## [,8] [,9] [,10]
## [1,] 1.084278 1.114447 1.116229
## [2,] 1.055618 1.071051 1.079966
## [3,] 1.001344 1.013241 1.029634
## [4,] 1.079332 1.102296 1.107838
## [5,] 1.029544 1.035694 1.036367
## [6,] 1.058502 1.072493 1.084400
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9284 9570 1140 6935 7038
## [2,] 9412 1858 5012 9139 2021
## [3,] 4960 657 9308 8547 798
## [4,] 9404 4924 1916 7117 5605
## [5,] 163 807 2996 2809 9403
## [6,] 5270 2339 7438 4378 9005
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9304292 0.9459063 1.0207646 1.0232471 1.025561
## [2,] 0.8957567 0.9605144 1.0573766 1.0687405 1.092705
## [3,] 0.9340398 0.9353478 0.9844260 1.0137274 1.057676
## [4,] 0.9156544 0.9845881 1.0024112 1.0515140 1.055231
## [5,] 0.9132901 0.9353450 0.9453691 1.0402807 1.040792
## [6,] 0.8896392 0.8957989 0.9428604 0.9937068 1.045079
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpmRlYuW/file5485745454c1.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.6.0 knitr_1.28 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 bookdown_0.18 lattice_0.20-41
## [4] digest_0.6.25 grid_4.0.0 stats4_4.0.0
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.5
## [10] stringi_1.4.6 S4Vectors_0.26.0 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.22.0 tools_4.0.0
## [16] stringr_1.4.0 parallel_4.0.0 xfun_0.13
## [19] yaml_2.2.1 compiler_4.0.0 BiocGenerics_0.34.0
## [22] BiocManager_1.30.10 htmltools_0.4.0