BiocNeighbors 1.8.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2654 2795 2372 3507 1828 5941 7404 2050 3322 5761
## [2,] 4471 528 1194 2271 4080 7943 497 1395 3512 4952
## [3,] 9730 8952 4478 7756 9266 3040 8816 1609 245 7683
## [4,] 3598 6189 2173 4587 7112 1415 3829 5403 2849 7702
## [5,] 7061 965 2834 7312 242 9055 3232 5683 4227 3314
## [6,] 1421 2326 8207 8878 5156 8131 721 642 8684 3586
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8012792 0.8576069 0.9046848 0.9490177 0.9916587 1.0011322 1.0193053
## [2,] 0.7700940 0.8116404 0.9214732 0.9281875 0.9308216 0.9314354 0.9315361
## [3,] 0.9155341 0.9325950 1.0045272 1.0122676 1.0171826 1.0219508 1.0568852
## [4,] 0.9674190 1.0256823 1.0293052 1.0404011 1.0442752 1.0458566 1.0477467
## [5,] 0.7588240 0.9408486 0.9848236 0.9884602 0.9973440 0.9977134 1.0008332
## [6,] 1.1229731 1.1668361 1.1895242 1.1918395 1.2040933 1.2074112 1.2119087
## [,8] [,9] [,10]
## [1,] 1.0411603 1.0498405 1.076182
## [2,] 0.9545985 0.9920003 1.013403
## [3,] 1.0721554 1.0794981 1.085272
## [4,] 1.0519229 1.0652701 1.093188
## [5,] 1.0118767 1.0177152 1.024742
## [6,] 1.2119669 1.2142785 1.224844
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2854 9175 5590 9116 3424
## [2,] 1427 8457 9117 5034 4801
## [3,] 7847 3334 9282 8705 7149
## [4,] 7792 5069 7402 3808 7300
## [5,] 5718 7055 741 1945 468
## [6,] 7550 8877 8009 5407 5133
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7150112 0.9551283 0.9579080 0.9610922 1.0176965
## [2,] 0.8281745 0.9279171 0.9591281 0.9603502 1.0136175
## [3,] 0.9463708 0.9597221 0.9659172 0.9703273 0.9726027
## [4,] 0.7495880 0.8711673 0.9315405 0.9349653 0.9385618
## [5,] 0.9025577 1.0188659 1.0495381 1.0741827 1.0832633
## [6,] 0.7759522 1.0424271 1.0444030 1.0837094 1.0912106
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "C:\\Users\\biocbuild\\bbs-3.12-bioc\\tmpdir\\RtmpKexI9l\\file24d85f35434c.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.8.2 knitr_1.30 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 bookdown_0.21 lattice_0.20-41
## [4] digest_0.6.27 grid_4.0.3 stats4_4.0.3
## [7] magrittr_2.0.1 evaluate_0.14 rlang_0.4.9
## [10] stringi_1.5.3 S4Vectors_0.28.0 Matrix_1.2-18
## [13] rmarkdown_2.5 BiocParallel_1.24.1 tools_4.0.3
## [16] stringr_1.4.0 parallel_4.0.3 xfun_0.19
## [19] yaml_2.2.1 compiler_4.0.3 BiocGenerics_0.36.0
## [22] BiocManager_1.30.10 htmltools_0.5.0