1 Introduction

2 Quick example

3 Supported data types

4 Check object class

5 Advanced topic: Copy-On-Write

5.1 Warning

6 Advanced topic: shared subset and shared copy

7 Last word on Linux system

8 Session Information

1 Introduction

The SharedObject package is designed for sharing data across multiple R processes, so that all processes read the data from the same memory location. This sharing mechanism can reduce memory usage and the overhead of data transmission in parallel computing. The need arises in many data-science fields, such as high-throughput gene data analysis, where parallel computing is desirable and the data is very large. Blindly calling an export function such as clusterExport duplicates the data for each process, which is unnecessary if the data is read-only during the parallel computation. The SharedObject package shares the data without duplication and can reduce the time cost. It uses ALTREP, a new set of R APIs, to provide a seamless experience when sharing an object.

2 Quick example

We first illustrate the package with an example. We create a cluster (a single worker here, to keep the example small; the procedure is the same with more workers) and share an n-by-n matrix A. We use the function share to create the shared object A_shr, then call the function clusterExport to export it to the cluster:

library(parallel)
library(SharedObject)
## Initiate the cluster
cl=makeCluster(1)
## create data
n=3
A=matrix(runif(n^2),n,n)
## create shared object
A_shr=share(A)
## export the shared object
clusterExport(cl,"A_shr")

stopCluster(cl)

As the code above shows, the procedure for exporting a shared object is similar to the procedure for exporting an ordinary R object, except that we replace the matrix A with the shared object A_shr. Notably, there is no visible difference between the matrix A and the shared object A_shr. The shared object A_shr is neither an S3 nor an S4 object and it behaves exactly like the matrix A, so there is no need to change existing code to work with the shared object. We can verify this as follows:

## check the data 
A 
#>            [,1]      [,2]      [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
A_shr 
#>            [,1]      [,2]      [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200
## check the properties
attributes(A) 
#> $dim
#> [1] 3 3
attributes(A_shr) 
#> $dim
#> [1] 3 3
## check the class 
class(A)
#> [1] "matrix"
class(A_shr) 
#> [1] "matrix"

Users can treat the shared object as a matrix and operate on it as usual.

3 Supported data types

Currently, the package supports atomic (aka vector), matrix and data.frame data structures. A list cannot be passed to the share function, but users can create a shared object for each child of the list.

Please note that a data.frame is fundamentally a list of vectors. Sharing a data.frame shares its vector elements, not the data.frame itself. Therefore, adding or replacing a column in a shared data.frame will not affect the shared memory. Users should avoid such operations.
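The list structure of a data.frame can be checked in plain R. The sketch below uses ordinary (non-shared) vectors as stand-ins and does not call any SharedObject function; it only illustrates why replacing a column leaves the vector it was built from untouched.

```r
## A data.frame is a list whose elements are the column vectors
x <- c(1, 2, 3)
df <- data.frame(x = x, y = c(4L, 5L, 6L))
is.list(df)
vapply(df, is.atomic, logical(1))

## Replacing a column stores a new vector under the name "x";
## the vector the column was built from is left as it was
df$x <- c(10, 20, 30)
x
```

By analogy, replacing a column of a shared data.frame swaps in a new vector rather than writing into the shared memory of the old one.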

The integer, numeric, logical and raw types are available for sharing. The character (string) type is not supported.
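A quick way to screen objects before sharing is to look at their storage type. The helper below, isShareableType, is hypothetical (it is not part of the package) and is only a sketch of the rule stated above; note that base R reports the storage type of numeric vectors as "double".

```r
## Hypothetical helper (not part of the package): TRUE when the storage
## type of x is one of the four types listed above
isShareableType <- function(x) {
    typeof(x) %in% c("integer", "double", "logical", "raw")
}

isShareableType(1:3)
isShareableType(c(0.5, 1.5))
isShareableType(c(TRUE, FALSE))
isShareableType(as.raw(1:3))
isShareableType("a")

## For a list, check each child separately
lst <- list(a = 1:3, b = letters, c = c(0.1, 0.2))
vapply(lst, isShareableType, logical(1))
```

The check looks only at the storage type, so it says nothing about attributes or classes layered on top of a vector.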

4 Check object class

To identify a shared object, the package provides several functions that examine the internal data structure:

## Check if an object is of an ALTREP class
is.altrep(A)
#> [1] FALSE
is.altrep(A_shr)
#> [1] TRUE

## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
#> [1] FALSE
is.shared(A_shr)
#> [1] TRUE

The function is.altrep only checks whether an object is an ALTREP object. Since the shared object class inherits from the ALTREP class, the function returns TRUE for a shared object. However, R itself also creates ALTREP objects in some cases (e.g. after A=1:10, A is an ALTREP object), so this function cannot reliably determine whether an object is a shared object. is.shared is the most suitable way to check for a shared object. For a data.frame, it returns TRUE only when all of its vector elements are shared objects.

A shared object has several properties, which can be checked via:

## get a summary report
getSharedProperties(A_shr)
#> Shared property object
#>   dataId      1018810786119680 
#>   processId   5514 
#>   typeId      3 
#>   length      9 
#>   totalSize   72 
#>   copyOnWrite     1 
#>   sharedSubset    1 
#>   sharedCopy      0

## Internal function to check the properties
## All properties can be accessed via the similar way
.getProperty(A_shr,"dataId")
#> [1] 1.018811e+15
.getProperty(A_shr,"processId")
#> [1] 5514
.getProperty(A_shr,"typeId")
#> [1] 3

## Public function to check the properties
getCopyOnWrite(A_shr)
#> [1] TRUE
getSharedSubset(A_shr)
#> [1] TRUE
getSharedCopy(A_shr)
#> [1] FALSE

Please see the advanced topics in the following sections for which properties are changeable and how to change them properly.

5 Advanced topic: Copy-On-Write

Because all cores use the shared object A_shr located in the same memory location, a reckless change made to the matrix A_shr in one process will immediately be broadcast to the other processes. To prevent users from changing the values of a shared object unknowingly, a shared object duplicates itself when one of its values is changed. Therefore, code like

A_shr2=A_shr
A_shr[1,1]=10

A_shr
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200
A_shr2
#>            [,1]      [,2]      [,3]
#> [1,] 0.08120278 0.7908218 0.4848458
#> [2,] 0.69772741 0.7128030 0.1455111
#> [3,] 0.09897697 0.6218568 0.2442200

will result in a memory duplication. The matrix A_shr2 is not affected. This default behavior can be overridden by passing the argument copyOnWrite to the function share. For example:

A_shr=share(A,copyOnWrite=FALSE)
A_shr2=A_shr
A_shr[1,1]=10

A_shr
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200
A_shr2
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200

A change in the matrix A_shr causes a change in A_shr2. This feature can be useful for returning the result from each R process without additional memory allocation, so A_shr can be both the initial data and the final result. However, due to a limitation of R, only the copy-on-write feature is fully supported, not the reverse. With copy-on-write off, it is possible to change the values of a shared object unexpectedly:

A_shr=share(A,copyOnWrite=FALSE)
-A_shr
#>             [,1]       [,2]       [,3]
#> [1,] -0.08120278 -0.7908218 -0.4848458
#> [2,] -0.69772741 -0.7128030 -0.1455111
#> [3,] -0.09897697 -0.6218568 -0.2442200
A_shr
#>             [,1]       [,2]       [,3]
#> [1,] -0.08120278 -0.7908218 -0.4848458
#> [2,] -0.69772741 -0.7128030 -0.1455111
#> [3,] -0.09897697 -0.6218568 -0.2442200

The above example shows an unexpected result when the copy-on-write feature is off: simply calling a unary function can change the values of a shared object. Therefore, for the safety of inexperienced users, the copy-on-write feature is active by default. Experienced users can alter the copy-on-write feature via the setCopyOnWrite function, which has no return value.

A_shr=share(A,copyOnWrite=FALSE)
#Assign A_shr to another object
A_shr2=A_shr
#change the value of A_shr
A_shr[1,1]=10
#Both A_shr and A_shr2 are affected
A_shr
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200
A_shr2
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200
#Enable copy-on-write
setCopyOnWrite(A_shr,TRUE)
#The unary function does not affect the variable A_shr
-A_shr
#>              [,1]       [,2]       [,3]
#> [1,] -10.00000000 -0.7908218 -0.4848458
#> [2,]  -0.69772741 -0.7128030 -0.1455111
#> [3,]  -0.09897697 -0.6218568 -0.2442200
A_shr
#>             [,1]      [,2]      [,3]
#> [1,] 10.00000000 0.7908218 0.4848458
#> [2,]  0.69772741 0.7128030 0.1455111
#> [3,]  0.09897697 0.6218568 0.2442200

getCopyOnWrite(A_shr)
#> [1] TRUE

This flexibility lets us perform safe operations during the computation and return the results without memory duplication.
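The two modes can be compared with constructs from base R. This is only an analogy and does not use the package: an ordinary vector follows R's copy-on-modify rule, which resembles copyOnWrite=TRUE, while an environment has reference semantics, which resembles copyOnWrite=FALSE. The variable names are illustrative.

```r
## Ordinary vectors: modifying x copies it, so y keeps the old value
## (comparable to copyOnWrite=TRUE)
x <- c(1, 2, 3)
y <- x
x[1] <- 10
y[1]

## Environments: e1 and e2 refer to the same object, so a change made
## through e1 is visible through e2 (comparable to copyOnWrite=FALSE)
e1 <- new.env()
e1$value <- c(1, 2, 3)
e2 <- e1
e1$value[1] <- 10
e2$value[1]
```

Here y[1] is still 1, whereas e2$value[1] is 10.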

5.1 Warning

If a high-precision value is assigned to a low-precision shared object, an implicit type conversion is triggered so that the change can be stored correctly. The resulting object is a regular R object, not a shared object. Therefore, the change will not be broadcast even if the copy-on-write feature is off. The most common scenario is assigning a numeric value to an integer shared object. Users should be cautious about the data type that a shared object is using.
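The conversion described here is the ordinary coercion rule of R. The sketch below shows it on a plain integer vector standing in for an integer shared object; it does not call the package, and the variable names are illustrative.

```r
x <- 1:3
typeBefore <- typeof(x)      # "integer"

## Assigning an integer value keeps the storage type
x[1] <- 2L
typeAfterInteger <- typeof(x)

## Assigning a double value coerces the whole vector to double,
## which creates a new object
x[1] <- 2.5
typeAfterDouble <- typeof(x)

c(typeBefore, typeAfterInteger, typeAfterDouble)
```

Writing integer literals with the L suffix, or converting with as.integer before assignment, keeps the storage type unchanged.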

6 Advanced topic: shared subset and shared copy

The option sharedSubset controls whether subsetting a shared object creates another shared object. The option sharedCopy determines whether a duplicate of a shared object is still a shared object. For performance reasons, the default settings are sharedSubset=TRUE and sharedCopy=FALSE, but they can be overridden via:

A_shr=share(A,sharedSubset=FALSE,sharedCopy=TRUE)
getSharedProperties(A_shr)
#> Shared property object
#>   dataId      4532519739326464 
#>   processId   5514 
#>   typeId      3 
#>   length      9 
#>   totalSize   72 
#>   copyOnWrite     1 
#>   sharedSubset    0 
#>   sharedCopy      1

Please note that sharedCopy is only effective when copyOnWrite=TRUE.

7 Last word on Linux system

There is a limit on how many shared objects a process can create on a Linux system. If you see the error message “Too many open files”, either you have explicitly created too many shared objects, or you have implicitly generated too many shared subsets via the [ operator. You can turn sharedSubset off to reduce the number of open files, or check your system settings to increase the number of open files that one process can have.
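One way to inspect the current limit from within R is to ask the shell. The helper openFileLimit below is hypothetical (not part of the package) and assumes a Unix-like system whose shell supports ulimit -n; on other platforms it returns NA.

```r
## Hypothetical helper: query the per-process limit on open files
## through the shell; returns NA on non-Unix platforms
openFileLimit <- function() {
    if (.Platform$OS.type != "unix") {
        return(NA_character_)
    }
    system("ulimit -n", intern = TRUE)
}

openFileLimit()
```

The value is returned as a character string because the shell may report either a number or the word unlimited.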

8 Session Information

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] SharedObject_1.0.0 BiocStyle_2.14.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2          bookdown_0.14       digest_0.6.22      
#>  [4] magrittr_1.5        evaluate_0.14       rlang_0.4.1        
#>  [7] stringi_1.4.3       rmarkdown_1.16      tools_3.6.1        
#> [10] stringr_1.4.0       xptr_1.1.1          xfun_0.10          
#> [13] yaml_2.2.0          compiler_3.6.1      BiocGenerics_0.32.0
#> [16] BiocManager_1.30.9  htmltools_0.4.0     knitr_1.25

Package Quick Start Guide

2019-10-30
