A cluster-computing platform for R scripts!
The cluster-computing platform for R scripts is based on the doRedis-package. So its only parallizes foreach. A overview over the functionality can be get from the following schema:
- When starting a new job by running the master.R script, as first step the worker.init function if it is available gets exported to redis
- Further the jobscript and data, over which the foreach loop iterates, gets also exported to redis
- Then the available worker on the cluster, that gets started with the worker.R script grabs the new job, run once before starting the job the worker.init script and runs the jobscript
- After a dataset was processed, the results gets written to redis
- The master takes the results and combines them by using the combine function
- Redis: 2.8.0
- R: 3.5.2
- R-Libraries:
- doRedis: 1.2.2
- parallel: 3.5.1
- optparse: 1.6.0
- redux: 1.1.0
The installation of the requirements depends on the underlying operation system. Its recommended that you have at least two machines, one for the worker and one for the master.
Redis is used to share the data with master and workers and for the job management for the workers. For more details see at the documentation of doRedis.
Download and install Redis on the master machine. For more details see on the official website.
Download and install R on the master and worker machines. For more details see on the official website.
The required libraries must be installed on the master machine as well as on the worker machines. It might be necessary to specify the library path if the default library path is not writeable for the current user. To run the platform the libraries can be installed with the following instructions in the interactive R shell:
install.packages("parallel")
install.packages("optparse")
install.packages("redux")
The latest version of the doRedis
package is only available on github, it is
necessary to install it with the following commands:
install.packages("devtools")
library(devtools)
install_github("bwlewis/doRedis")
The installation of the reqired libraries for the specific job should be implemented in the initialization script.
After installing and configuring all the requirements the platform can be used. Either download the source code or the latest release and place it to the folder of your desire.
To get a job done at least a worker must run on the cluster. This can be done by executing the following command:
Rscript worker.R [options]
The following options can be passed to the worker script:
Short | Long | Description |
---|---|---|
-m | --master | The hostname or ip address of the master where the redis process runs. |
-p | --master-port | The port of the redis process on the master. |
-w | --master-password | The password of the redis process on the master. |
-l | --logfile | The path to the workers log file. |
This script exits after all jobs for a queue where done or if there isn't any queue on the cluster available. So it may be necessary to run this script in a service. For windows the clients can be used.
To run a new job on the cluster you must execute the following line:
Rscript master.R [options]
The following options can be passed to the master script:
Short | Long | Description |
---|---|---|
-m | --master | The hostname or ip address of the master where the redis process runs. |
-p | --master-port | The port of the redis process on the master. |
-w | --master-password | The password of the redis process on the master. |
-c | --chunksize | Size of the chunks for the jobs that gets submitted to the worker. |
-f | --file | Path to the file which contains the data for the job. |
-i | --init | Path to the init script (e.g. installation of libs) that should be executed on each worker. This file should contain a function named woker.init without any parameters. |
-o | --outfile | Path to the file where the results of the job should be saved. |
-q | --queue | The queue the workes should run on. |
-s | --script | Path to the job script. This script should contain a run function taking only one argument, where the data for the job will be passed. |
-t | --time | Indicates whether the run time of the script should be measured and printed to the console. Warning: The real time of the algorithm may be shorter, you must take into account the waiting time for free workers. |
A job consists of job script and if necessary a initialization script for the workers. The job script gets executed on the master and the initialization script gets executed once at the begin of a job on each worker node.
The job script must contain a run
function, that gets called when starting the
job. It takes one parameter, which contains the path to the input file that
should be processed in the job. The processing algorithm can be written in this
function or be splitted up to other functions or scripts. To load necessary
scripts relative to the script path following construction can be used:
path <- dirname(parent.frame(2)$ofile)
source(paste(path, FILENAME, sep="/"))
The optionally initialization script must contain a worker.init
function,
where the initilization of each worker node can be done. Be sure that the user
starting the worker nodes and the master have write access to the lib path of R.
If necessary adjust this path by chainging the appropriate environment
variables.
- The job could run slow if there are a lot of iterations, it might be useful to
adjust the
chunksize
to get a faster job execution. Also depending on thecombine
function for the results can lead to high cpu consumption on the machine, where was the master script started and this can lead to slowing up the whole job.
There are clients for the scripts for windows available, so that it is not necessary to run the scripts from the console.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details