SharedObject 1.0.0
The SharedObject package is designed for sharing data across multiple
R processes, so that all processes can read data located in the same
memory location. This sharing mechanism can reduce memory usage and
the overhead of data transmission in parallel computing. The need for
the package arises in many data-science fields, such as
high-throughput gene data analysis, where parallel computing is
desirable and the data is very large. Blindly calling an export
function such as clusterExport duplicates the data for each process,
which is unnecessary if the data is read-only during the parallel
computation. The SharedObject package can share the data without
duplication and thereby reduce the time cost. A new set of R APIs
called ALTREP is used to provide a seamless experience when sharing an
object.
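To make the duplication cost concrete, the sketch below estimates how many bytes an ordinary export would send for a plain matrix. It uses only base R. The matrix size n and the worker count n_workers are illustrative assumptions, and the serialized size is only a rough proxy for what each worker receives, not a measurement of what clusterExport does internally.

```r
## Illustrative sizes (assumptions, not taken from the package)
n <- 200
n_workers <- 4

## An ordinary numeric matrix: 8 bytes per element plus a small header
A <- matrix(runif(n^2), n, n)

## Size of the serialized payload that would be sent to one worker
payload_bytes <- length(serialize(A, NULL))

## Rough total traffic if every worker receives its own copy
total_bytes <- payload_bytes * n_workers

cat("per-worker payload:", payload_bytes, "bytes\n")
cat("total for", n_workers, "workers:", total_bytes, "bytes\n")
```

The total grows linearly with the number of workers, which is the overhead that a shared memory location avoids for read-only data.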
We first illustrate the package with an example. In the example, we
create a cluster with a single worker and share an n-by-n matrix A. We
use the function share to create the shared object A_shr and call the
function clusterExport to export it:
library(parallel)
library(SharedObject)
## Initiate the cluster
cl=makeCluster(1)
## create data
n=3
A=matrix(runif(n^2),n,n)
## create shared object
A_shr=share(A)
## export the shared object
clusterExport(cl,"A_shr")
stopCluster(cl)
As the code above shows, exporting a shared object follows the same
procedure as exporting an ordinary R object, except that we replace
the matrix A with the shared object A_shr. Notably, there is no
visible difference between the matrix A and the shared object A_shr.
The shared object A_shr is neither an S3 nor an S4 object and it
behaves exactly like the matrix A, so there is no need to change
existing code to work with the shared object. We can verify this as
follows:
## check the data
A
#> [,1] [,2] [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A_shr
#> [,1] [,2] [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
## check the properties
attributes(A)
#> $dim
#> [1] 3 3
attributes(A_shr)
#> $dim
#> [1] 3 3
## check the class
class(A)
#> [1] "matrix"
class(A_shr)
#> [1] "matrix"
Users can treat the shared object as a matrix and operate on it as usual.
Currently, the package supports atomic (aka vector), matrix and
data.frame data structures. List is not allowed for the sharedObject
function, but users can create a shared object for each child of the
list.
Please note that a data.frame is fundamentally a list of vectors.
Sharing a data.frame will share its vector elements, not the
data.frame itself. Therefore, adding or replacing a column in a shared
data.frame will not affect the shared memory. Users should avoid such
operations.
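The list-of-vectors structure can be seen with base R alone. The sketch below uses an ordinary data.frame (no shared memory involved, and the column names are made up for illustration) to show that replacing a column binds a new vector to the data.frame and leaves the old vector untouched, which is why such a replacement cannot reach the memory that was shared.

```r
df <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))

## A data.frame is a list whose elements are the column vectors
is.list(df)
#> [1] TRUE

## Each column has its own storage type
vapply(df, typeof, character(1))

## Keep a handle on the original column, then replace the column
old_col <- df$value
df$value <- c(9, 9, 9)

## The original vector is unchanged; the data.frame now holds a new one
old_col
#> [1] 0.5 1.5 2.5
```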
The types integer, numeric, logical and raw are available for sharing.
Character (string) data is not supported.
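A quick pre-check of the storage type can catch unsupported input before sharing. The helper below, is_shareable_type, is hypothetical and not part of SharedObject; it simply encodes the list of types given above using base R's typeof, where numeric vectors report "double".

```r
## Hypothetical helper (not part of SharedObject): TRUE when the
## storage type of x is one of the types listed above as shareable
is_shareable_type <- function(x) {
  typeof(x) %in% c("integer", "double", "logical", "raw")
}

is_shareable_type(1:3)             # integer
is_shareable_type(c(1.5, 2))       # numeric (stored as double)
is_shareable_type(c(TRUE, FALSE))  # logical
is_shareable_type(as.raw(1:3))     # raw
is_shareable_type(c("a", "b"))     # character: not supported
is_shareable_type(list(1, 2))      # list: not supported
```

The helper looks only at the storage type, so it says nothing about attributes or classes layered on top of a supported type.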
To distinguish a shared object from an ordinary one, the package provides several functions that examine the internal data structure:
## Check if an object is of an ALTREP class
is.altrep(A)
#> [1] FALSE
is.altrep(A_shr)
#> [1] TRUE
## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
#> [1] FALSE
is.shared(A_shr)
#> [1] TRUE
The function is.altrep only checks whether an object is an ALTREP
object. Since the shared object class inherits from the ALTREP class,
the function returns TRUE for a shared object. However, R itself also
creates ALTREP objects in some cases (e.g. after A=1:10, A is an
ALTREP object), so this function cannot reliably determine whether an
object is a shared object. is.shared is the most suitable way to check
for a shared object. For the data.frame type, it returns TRUE only
when all of its vector elements are shared objects.
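The claim that R creates ALTREP objects on its own can be checked without the package. The sketch below assumes R 3.5 or later and relies on the internal inspect routine, whose output format is not a stable interface. It only looks for the word "compact", which R prints for its compact integer sequences.

```r
## 1:10 is stored by R as a compact (ALTREP) integer sequence
seq_info <- capture.output(.Internal(inspect(1:10)))

## c(1L, 5L, 2L) is an ordinary integer vector
vec_info <- capture.output(.Internal(inspect(c(1L, 5L, 2L))))

## Look for the "compact" marker in the first line of each report
grepl("compact", seq_info[1])
grepl("compact", vec_info[1])
```

Because such sequences are ALTREP objects but not shared objects, an ALTREP check alone would misclassify them.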
A shared object has several properties, which can be checked via:
## get a summary report
getSharedProperties(A_shr)
#> Shared property object
#> dataId 1018810786119680
#> processId 5514
#> typeId 3
#> length 9
#> totalSize 72
#> copyOnWrite 1
#> sharedSubset 1
#> sharedCopy 0
## Internal function to check the properties
## All properties can be accessed in a similar way
.getProperty(A_shr,"dataId")
#> [1] 1.018811e+15
.getProperty(A_shr,"processId")
#> [1] 5514
.getProperty(A_shr,"typeId")
#> [1] 3
## Public function to check the properties
getCopyOnWrite(A_shr)
#> [1] TRUE
getSharedSubset(A_shr)
#> [1] TRUE
getSharedCopy(A_shr)
#> [1] FALSE
Please see the advanced topic in the next section for which properties are changeable and how to change them properly.
Because all cores use the shared object A_shr located in the same
memory location, a reckless change made to the matrix A_shr in one
process is immediately visible to the other processes. To prevent
users from changing the values of a shared object unknowingly, a
shared object duplicates itself when one of its values is changed.
Therefore, code such as
A_shr2=A_shr
A_shr[1,1]=10
A_shr
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A_shr2
#> [,1] [,2] [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
will result in a memory duplication. The matrix A_shr2 is not
affected. This default behavior can be overridden by passing the
argument copyOnWrite to the function share. For example:
A_shr=share(A,copyOnWrite=FALSE)
A_shr2=A_shr
A_shr[1,1]=10
A_shr
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A_shr2
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A change in the matrix A_shr causes a change in A_shr2. This feature
can be useful for returning the result from each R process without
additional memory allocation, so A_shr can hold both the initial data
and the final result. However, due to limitations of R, only the
copy-on-write behavior is fully supported, not the reverse. With
copy-on-write off, it is possible to change the value of a shared
object unexpectedly:
A_shr=share(A,copyOnWrite=FALSE)
-A_shr
#> [,1] [,2] [,3]
#> [1,] -0.08120278 -0.7908218 -0.4848458
#> [2,] -0.69772741 -0.7128030 -0.1455111
#> [3,] -0.09897697 -0.6218568 -0.2442200
A_shr
#> [,1] [,2] [,3]
#> [1,] -0.08120278 -0.7908218 -0.4848458
#> [2,] -0.69772741 -0.7128030 -0.1455111
#> [3,] -0.09897697 -0.6218568 -0.2442200
The above example shows an unexpected result when the copy-on-write
feature is off: simply calling a unary function can change the values
of a shared object. Therefore, for the safety of inexperienced users,
the copy-on-write feature is active by default. Experienced users can
alter the copy-on-write feature via the setCopyOnWrite function. The
function has no return value.
A_shr=share(A,copyOnWrite=FALSE)
#Assign A_shr to another object
A_shr2=A_shr
#change the value of A_shr
A_shr[1,1]=10
#Both A_shr and A_shr2 are affected
A_shr
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A_shr2
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
#Enable copy-on-write
setCopyOnWrite(A_shr,TRUE)
#The unary function does not affect the variable A_shr
-A_shr
#> [,1] [,2] [,3]
#> [1,] -10.00000000 -0.7908218 -0.4848458
#> [2,] -0.69772741 -0.7128030 -0.1455111
#> [3,] -0.09897697 -0.6218568 -0.2442200
A_shr
#> [,1] [,2] [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
getCopyOnWrite(A_shr)
#> [1] TRUE
This flexibility gives us a way to perform safe operations during the computation and still return the results without memory duplication.
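For comparison, ordinary R vectors already follow the same copy-on-modify semantics that a shared object has when copy-on-write is on. The base-R sketch below involves no shared memory and no package functions; it only illustrates the behavior that existing R code normally relies on.

```r
x <- c(1, 2, 3)
y <- x          # y refers to the same values as x
x[1] <- 10      # modifying x triggers a copy; y is untouched
x
#> [1] 10  2  3
y
#> [1] 1 2 3
```

Turning copy-on-write off departs from these semantics, which is why it is best limited to code that is written with shared memory in mind.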
If a higher-precision value is assigned to a lower-precision shared object, an implicit type conversion is triggered so that the change can be stored correctly. The resulting object is a regular R object, not a shared object. Therefore, the change will not be broadcast even if the copy-on-write feature is off. The most common scenario is assigning a numeric value to an integer shared object. Users should be careful about the data type that a shared object uses.
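The implicit conversion is standard R behavior and can be reproduced with an ordinary integer vector. The sketch below uses base R only. It shows the type change that would turn a shared integer object into a regular one, and how assigning an integer literal (note the L suffix) avoids it.

```r
## Assigning a numeric value promotes the whole vector to double
x <- 1:3
typeof(x)
#> [1] "integer"
x[1] <- 1.5
typeof(x)
#> [1] "double"

## Assigning an integer value keeps the integer type
y <- 1:3
y[1] <- 10L
typeof(y)
#> [1] "integer"
```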
There is a limit on how many shared objects a process can create on a Linux system. If you see the error message “Too many open files”, you have either explicitly created too many shared objects or implicitly generated too many shared subsets via the [ operator. You can turn sharedSubset off to reduce the number of open files, or adjust your system settings to increase the number of open files that one process may have.
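The current limit can be inspected from within R on Unix-like systems. The sketch below is an assumption-laden convenience rather than a package feature: it calls the shell builtin ulimit through system, and the value it reports is the soft limit inherited by the shell that R starts.

```r
if (.Platform$OS.type == "unix") {
  ## Soft limit on open file descriptors, as seen by a child shell of R
  open_file_limit <- system("ulimit -n", intern = TRUE)
  print(open_file_limit)
}
```

How to raise the limit depends on the system configuration and is outside the scope of the package.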
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SharedObject_1.0.0 BiocStyle_2.14.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 bookdown_0.14 digest_0.6.22
#> [4] magrittr_1.5 evaluate_0.14 rlang_0.4.1
#> [7] stringi_1.4.3 rmarkdown_1.16 tools_3.6.1
#> [10] stringr_1.4.0 xptr_1.1.1 xfun_0.10
#> [13] yaml_2.2.0 compiler_3.6.1 BiocGenerics_0.32.0
#> [16] BiocManager_1.30.9 htmltools_0.4.0 knitr_1.25