BiocNeighbors 1.10.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2058 8583 2729 4354 5508 6397 7357 3262 4010 8446
## [2,] 4318 2397 9828 5949 9518 5283 8595 2716 6275 4289
## [3,] 7427 112 7725 188 6884 5177 6281 7469 3075 4396
## [4,] 9918 4779 4346 4730 2564 6460 8494 5580 110 7868
## [5,] 8850 6098 6622 9433 8338 7286 3181 5043 6061 2445
## [6,] 2896 824 1144 7969 2335 552 4366 156 177 7693
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8414909 0.8698932 0.9355368 0.9541349 0.9717639 0.9991492 1.0364299
## [2,] 0.8837594 0.9832281 1.0007690 1.0060079 1.0102490 1.0112789 1.0447117
## [3,] 0.9199062 0.9470513 1.0197800 1.0204711 1.0504456 1.0727901 1.0770069
## [4,] 0.7379470 0.8063428 0.8113312 0.8561726 0.8616189 0.8751896 0.8967044
## [5,] 0.9775219 1.0697929 1.0967655 1.1031532 1.1066418 1.1163386 1.1320385
## [6,] 0.9128596 0.9900898 0.9991747 1.0269202 1.0584425 1.0603285 1.0850970
## [,8] [,9] [,10]
## [1,] 1.0367655 1.0437986 1.0546287
## [2,] 1.0605900 1.0618711 1.0711055
## [3,] 1.1224986 1.1257536 1.1304235
## [4,] 0.9294451 0.9345718 0.9384831
## [5,] 1.1427239 1.1543937 1.1700542
## [6,] 1.0867983 1.1067376 1.1250880
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9215 4436 6180 9154 2566
## [2,] 1552 1741 1767 3431 1836
## [3,] 3886 1074 5290 3572 7506
## [4,] 188 6120 10 8273 6874
## [5,] 4147 7623 2084 5674 8640
## [6,] 1523 3040 250 9290 6846
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9506123 1.0106703 1.0128031 1.0335578 1.0427752
## [2,] 0.8683564 0.8903276 0.9439428 1.0131069 1.0594602
## [3,] 0.9215837 0.9470725 1.0127544 1.0229198 1.0231688
## [4,] 0.8339254 0.8486065 0.9568959 0.9866970 1.0023173
## [5,] 0.8708979 0.9105967 0.9178559 0.9603201 0.9961723
## [6,] 0.9792755 0.9906029 1.0066026 1.0443240 1.0502740
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "D:\\biocbuild\\bbs-3.13-bioc\\tmpdir\\RtmpawNBJG\\file3a4865a17989.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.1.0 RC (2021-05-10 r80283)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.10.0 knitr_1.33 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.38.0
## [4] BiocParallel_1.26.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.30.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2