BiocNeighbors 1.4.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8771 5059 56 4070 7404 2231 8376 8101 3327 5661
## [2,] 9923 2309 9300 6519 4514 2897 1081 5905 1752 6343
## [3,] 780 6729 5792 8257 2741 367 6741 767 788 963
## [4,] 4786 5471 1459 5984 187 7896 9861 9790 3155 5987
## [5,] 6601 1948 3652 1731 1278 7916 2014 2056 6725 142
## [6,] 1586 2850 6887 149 7567 7155 3651 964 5058 8658
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8715133 0.9678817 0.9833923 0.9893116 1.0315957 1.0330536 1.0331842
## [2,] 0.9367853 1.0039316 1.0237066 1.0342082 1.0561885 1.0673904 1.0785345
## [3,] 0.9123796 0.9180802 0.9481109 0.9552479 0.9676833 0.9927770 0.9976734
## [4,] 0.7858602 0.8895646 0.8904961 0.9430057 0.9458280 0.9496274 0.9524800
## [5,] 0.9369186 0.9492112 0.9567255 0.9793873 0.9822933 0.9842297 0.9861553
## [6,] 0.8238962 0.9035017 0.9434627 1.0584267 1.1050088 1.1274583 1.1323702
## [,8] [,9] [,10]
## [1,] 1.0371625 1.0666491 1.082250
## [2,] 1.0863938 1.0982471 1.099542
## [3,] 1.0026765 1.0033996 1.019348
## [4,] 0.9617489 0.9933506 1.000173
## [5,] 0.9936376 1.0054382 1.021920
## [6,] 1.1386094 1.1392430 1.141825
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2016 973 4386 8287 1124
## [2,] 499 9522 9925 73 5016
## [3,] 5965 6174 4464 260 6317
## [4,] 5192 3421 9886 3100 9923
## [5,] 31 2993 3642 3055 346
## [6,] 3702 7485 5121 8620 9426
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8418320 0.9973949 1.0008547 1.0193620 1.0248270
## [2,] 0.8564282 0.9597957 1.0455211 1.0805972 1.0828418
## [3,] 0.7865698 0.8234340 0.9240150 0.9528975 0.9762501
## [4,] 0.9860422 1.0171522 1.0203296 1.0233701 1.0313755
## [5,] 0.8818297 0.9705501 0.9707427 1.0054229 1.0084831
## [6,] 0.8956466 0.8967537 0.9626351 0.9702065 0.9805890
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "C:\\Users\\biocbuild\\bbs-3.10-bioc\\tmpdir\\Rtmp4UPzyy\\file12b859ed4484.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.4.2 knitr_1.28 BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.17 lattice_0.20-40
## [4] digest_0.6.25 grid_3.6.2 stats4_3.6.2
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.4
## [10] stringi_1.4.6 S4Vectors_0.24.3 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.20.1 tools_3.6.2
## [16] stringr_1.4.0 parallel_3.6.2 xfun_0.12
## [19] yaml_2.2.1 compiler_3.6.2 BiocGenerics_0.32.0
## [22] BiocManager_1.30.10 htmltools_0.4.0