BiocNeighbors 1.16.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9914 4887 1905 5420 886 7911 9385 6633 4443 7307
## [2,] 472 7043 3864 22 4758 1742 7150 2154 648 9269
## [3,] 2983 4217 6705 3479 9993 9038 2177 6777 6126 8163
## [4,] 8717 9615 4092 4330 7798 1944 5011 1028 8256 5285
## [5,] 8544 3641 348 6107 9659 2349 914 8990 3003 9151
## [6,] 5487 5502 605 8506 110 3400 1709 9257 2876 7274
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8746893 0.9142388 0.9416809 0.9836461 1.0059603 1.0093969 1.016923
## [2,] 0.9342504 0.9876963 1.0723996 1.0802686 1.0895022 1.1108315 1.123464
## [3,] 0.7903494 0.9019915 0.9205718 0.9427502 0.9465615 0.9586173 0.985653
## [4,] 0.8401635 0.8422725 0.9640290 1.0012240 1.0118080 1.0222057 1.028501
## [5,] 0.9436089 0.9973082 1.0294535 1.0579594 1.0846288 1.1207398 1.122880
## [6,] 0.8672464 0.9721619 0.9863429 0.9867064 1.0011046 1.0268304 1.042644
## [,8] [,9] [,10]
## [1,] 1.018853 1.038106 1.039073
## [2,] 1.130734 1.143806 1.153106
## [3,] 1.018336 1.030459 1.037817
## [4,] 1.031136 1.040623 1.041168
## [5,] 1.139928 1.155217 1.159286
## [6,] 1.049732 1.050953 1.060267
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9828 5920 6255 17 4432
## [2,] 2569 135 3802 3705 3990
## [3,] 8150 6144 8632 4597 2945
## [4,] 1558 3188 712 1813 4915
## [5,] 464 4117 9385 9844 2955
## [6,] 4372 7359 6682 1870 418
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9509446 0.9660091 0.9772828 0.9929542 0.9930505
## [2,] 1.0248313 1.0255874 1.0346212 1.0599446 1.0846179
## [3,] 1.1336325 1.1553030 1.1573037 1.1610074 1.1670579
## [4,] 0.8689280 0.9612259 0.9707019 0.9786192 1.0004725
## [5,] 0.8250182 0.8269905 0.8936530 0.9698160 0.9773988
## [6,] 0.9199631 0.9365416 1.0380791 1.0584692 1.0680044
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.16-bioc\\tmpdir\\RtmpoX64Pd\\file3c04bb5dfd.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 20348)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.16.0 knitr_1.40 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] BiocParallel_1.32.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.6 fastmap_1.1.0 stringr_1.4.1
## [10] tools_4.2.1 parallel_4.2.1 grid_4.2.1
## [13] xfun_0.34 cli_3.4.1 jquerylib_0.1.4
## [16] htmltools_0.5.3 yaml_2.3.6 digest_0.6.30
## [19] bookdown_0.29 Matrix_1.5-1 BiocManager_1.30.19
## [22] S4Vectors_0.36.0 sass_0.4.2 codetools_0.2-18
## [25] cachem_1.0.6 evaluate_0.17 rmarkdown_2.17
## [28] stringi_1.7.8 compiler_4.2.1 bslib_0.4.0
## [31] stats4_4.2.1 jsonlite_1.8.3