slurmworkflow
The goal of slurmworkflow is to provide a way to run complex multi-step computations on a slurm-equipped HPC.
Definitions:
- A workflow is a predefined set of computation steps.
- A computation step is a call to sbatch to be run on the HPC.
- A workflow directory is a directory containing all the code required to run a full workflow on the HPC.
By default the steps are run sequentially, but slurmworkflow provides tools for changing the execution order. This allows conditional execution of steps and loop-like behavior.
In this vignette we walk through the creation of a four-step workflow showcasing the main utilities provided by slurmworkflow.
Setup
Any HPC using slurm as its workload manager should work with slurmworkflow.
HPCs tested:
- Emory’s RSPH HPC (in this vignette)
- Washington’s HYAK HPC
We highly recommend using renv when working with an HPC.
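For reference, a minimal renv setup on the local project looks like this (these are standard renv calls; see the renv documentation for details):
# R - local, one-time renv setup for the project
install.packages("renv")
renv::init()     # create the project library and the ".Rprofile" file
renv::snapshot() # write the "renv.lock" file that will be restored on the HPC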
Creating a new workflow
library(slurmworkflow)
wf <- create_workflow(
  wf_name = "test_slurmworkflow",
  default_sbatch_opts = list(
    "partition" = "epimodel",
    "mail-type" = "FAIL",
    "mail-user" = "user@emory.edu"
  )
)
First we create a new workflow called "test_slurmworkflow" and store a summary of it in the wf object. The second argument specifies that by default each step should run on the "epimodel" partition and send a mail to "user@emory.edu" if the step fails.
Calling create_workflow() results in the creation of the workflow directory "workflows/test_slurmworkflow/", which contains the code to send to the HPC. A workflow summary is returned and stored in the wf variable. We'll use it to add elements to the workflow.
Adding a step to the workflow
The first step we use in most of our workflows ensures that the local project and the HPC are in sync.
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_bash_lines(c(
    "git pull",
    ". /projects/epimodel/spack/share/spack/setup-env.sh",
    "spack load r@4.2.1",
    "Rscript -e \"renv::init(bare = TRUE)\"",
    "Rscript -e \"renv::restore()\""
  )),
  sbatch_opts = list(
    "mem" = "16G",
    "cpus-per-task" = 4,
    "time" = 120
  )
)
The add_workflow_step() function takes three arguments:
- wf_summary: the object we made with create_workflow(), indicating which workflow to add a step to.
- step_tmpl: a step template, a helper function defining what to run on the HPC (more on this later).
- sbatch_opts: arguments to be passed to sbatch. Here we specify that we want 16GB of RAM, 4 CPUs and that the job should not take more than 120 minutes. The default options defined in create_workflow() will also be used.
The step template we are using here, step_tmpl_bash_lines(), takes a vector of bash lines to be run by sbatch on the HPC. Here we tell the step to:
1. run git pull
2. load our own version of spack and load the r@4.2.1 module
3. ensure that renv is initialized on the project (no effect if it's already the case)
4. update the packages to match the renv.lock file
Running R Code directly
As we usually want to run R code and not bash, slurmworkflow provides step templates simplifying this process.
On HPCs, R is usually not available directly. On the RSPH HPC we use spack to manage our modules. Therefore, we store the lines used to set up R on the HPC in a variable, as they will be used by all R step templates.
setup_lines <- c(
  ". /projects/epimodel/spack/share/spack/setup-env.sh",
  "spack load r@4.2.1"
)
Run code from an R script
Our next step will run the following script on the HPC.
# filename: R/01-test_do_call.R
cat(paste0("var1 = ", var1, ", var2 = ", var2))

if (!file.exists("did_run")) {
  file.create("did_run")
  current_step <- slurmworkflow::get_current_workflow_step()
  slurmworkflow::change_next_workflow_step(current_step)
} else {
  file.remove("did_run")
}
This very simple script prints the content of var1 and var2 to the standard output. Note that these variables are never declared in the script. We will pass them as arguments to the step template.
The second part checks for the existence of a file called “did_run”. If it does not exist yet, it’s created and we instruct slurmworkflow to change the next step to the current step. This is how you make a loop in slurmworkflow.
If the file exists, which means that it's the second time this step is run, the script removes it. In this case change_next_workflow_step() is not called and the workflow simply continues to the next step.
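The same mechanism also enables conditional execution: a script can jump forward instead of looping back. Here is a minimal sketch, assuming steps are identified by their sequential number (skip_next_step is a hypothetical flag that your own code would set):
# hypothetical branching logic inside a step's R script
if (skip_next_step) {
  current_step <- slurmworkflow::get_current_workflow_step()
  # jump over the step that would normally run next
  slurmworkflow::change_next_workflow_step(current_step + 2)
}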
Let’s now see how we add this script as a workflow step.
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call_script(
    r_script = "R/01-test_do_call.R",
    args = list(var1 = "ABC", var2 = "DEF"),
    setup_lines = setup_lines
  ),
  sbatch_opts = list(
    "cpus-per-task" = 1,
    "time" = "00:10:00",
    "mem" = "4G"
  )
)
As before we use the add_workflow_step() function, but we change step_tmpl to step_tmpl_do_call_script() with 3 arguments:
- r_script: the path to the script to be run, here "R/01-test_do_call.R". Note that this path must be valid on the HPC.
- args: a list of variables that will be available to the step. These are the var1 and var2 that were missing from the script.
- setup_lines: some bash code to be run before sourcing the script. These are the lines used to load the R module that we defined earlier.
For the sbatch options, we ask here for 1 CPU, 4GB of RAM and a maximum of 10 minutes.
Iterating over values in an R script
One common task on an HPC is to run the same code many times, varying only the value of some arguments.
In a typical R session, lapply(), Map() and mapply() are available for this purpose.
slurmworkflow provides step_tmpl_map_script() to run a script with a syntax similar to the Map() function. This creates an array job where the elements of the inputs are processed in parallel.
First let’s take a look at the script to be run.
# filename: R/02-test_map.R
library(future.apply)
plan(multicore, workers = ncores)

future_lapply(seq_len(ncores), function(i) {
  msg <- paste0(
    "On core: ", i, "\n",
    "iterator1: ", iterator1, "\n",
    "iterator2: ", iterator2, "\n",
    "var1 = ", var1, ", var2 = ", var2, "\n\n"
  )
  cat(msg)
})
This script needs five undeclared variables:
- iterator1 and iterator2: varying values
- ncores, var1 and var2: fixed values shared by all replications
As before, these values will be set by the step template. In this script we print the message in parallel over ncores cores.
Now for the addition of the step.
cores_to_use <- 2

wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_map_script(
    r_script = "R/02-test_map.R",
    # arguments passed to the script
    iterator1 = seq_len(5),
    iterator2 = seq_len(5) + 5,
    MoreArgs = list(
      ncores = cores_to_use,
      var1 = "IJK",
      var2 = "LMN"
    ),
    setup_lines = setup_lines,
    max_array_size = 2
  ),
  sbatch_opts = list(
    "cpus-per-task" = cores_to_use,
    "time" = "00:10:00",
    "mem-per-cpu" = "4G"
  )
)
The step_tmpl_map_script() template takes an r_script argument similar to step_tmpl_do_call_script(). The next two arguments, iterator1 and iterator2, will be iterated over using sbatch arrays. Each replication of the job will only receive one value of each (1-6, 2-7, 3-8, 4-9 and 5-10). Similar to Map(), the MoreArgs argument defines variables to be shared across replications.
A new argument, max_array_size, has been set to 2. This prevents the array jobs from being larger than 2. In our case, we get submissions of array sizes 2, 2 and 1. In a real analysis situation, the value would be around 500. This argument prevents slurm from refusing a job submission because of the size of the array (a limit of 1000 submissions is common). With EpiModel we have already had cases where 30000 array jobs were to be run. This template simply submits them in sequential chunks of 500.
In the sbatch_opts we specified mem-per-cpu = "4G". This means that if we change the cores_to_use value, the memory requested will scale as well. For example, with cores_to_use set to 2, each job requests 8GB in total.
To recap, this step will submit an array of 5 jobs, each receiving a different value for iterator1 and iterator2. Each of these jobs will run on cores_to_use cores. We use this approach with EpiModel, where we run huge arrays of jobs and each job is a set of around 30 parallel simulations. We therefore have two levels of parallelization: one in slurm and one in the script itself.
Running an R function directly
Sometimes we want to run a simple function directly without storing it in an R script. The step_tmpl_do_call() and step_tmpl_map() templates do exactly that for one-off function calls and Map()s.
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call(
    what = function(var1, var2) {
      cat(paste0("var1 = ", var1, ", var2 = ", var2))
    },
    args = list(var1 = "XYZ", var2 = "UVW"),
    setup_lines = setup_lines
  ),
  sbatch_opts = list(
    "cpus-per-task" = 1,
    "time" = "00:10:00",
    "mem" = "4G",
    "mail-type" = "END"
  )
)
The syntax of these two templates is almost identical to the previous two that we discussed. The main difference is the first argument, where we pass a function instead of a path to a script.
Note: the function will be run in a clean R session on the HPC. All the values used by the function must be either created by it, loaded by it or passed as arguments.
Finally, as this will be our last step, we override the mail-type option in sbatch_opts to receive a mail when this step finishes, whatever the outcome. This way we receive a mail telling us that the workflow is finished.
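For completeness, here is a sketch of a step_tmpl_map() template, assuming its interface mirrors Map() and step_tmpl_map_script() (iterated arguments in ..., shared values in MoreArgs). It is shown for illustration only and is not added to our four-step workflow:
# illustration only: one array job per value of iterator1
step_tmpl_map(
  FUN = function(iterator1, var1) {
    cat(paste0("iterator1: ", iterator1, ", var1 = ", var1, "\n"))
  },
  iterator1 = seq_len(3),
  MoreArgs = list(var1 = "XYZ"),
  setup_lines = setup_lines
)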
Using the workflow on an HPC
Now that our workflow is created, how do we actually run the code on the HPC?
We assume that we are working on a project called "test_proj", that this project was cloned on the HPC at "~/projects/test_proj" and that the "~/projects/test_proj/workflows/" directory exists.
Sending the workflow to the HPC
The following commands are to be run from your local computer.
macOS or GNU/Linux
# bash - local
scp -r workflows/test_slurmworkflow <user>@clogin01.sph.emory.edu:projects/test_proj/workflows/
Windows
# bash - local
set DISPLAY=
scp -r workflows\test_slurmworkflow <user>@clogin01.sph.emory.edu:projects/test_proj/workflows/
Forgetting set DISPLAY= will prevent scp from working correctly if you are using the RStudio terminal.
Note that the path on Windows is workflows\test_slurmworkflow. Windows uses back-slashes for directories and Unix OSes use forward-slashes.
Running the workflow from the HPC
For this step, you must be at the command line on the HPC. This means
that you have run: ssh <user>@clogin01.sph.emory.edu
from your local computer.
run set DISPLAY=
on Windows before if you get this
error:
ssh_askpass: posix_spawnp: No such file or directory
You also need to be at the root directory of the project (where the
“.git” folder is as well as the “renv.lock” file”. In this example you
would get there by running cd ~/projects/test_proj
. The
following steps will not work if you are not at the root of your
project.
Running the workflow is done by executing the file "workflows/test_slurmworkflow/start_workflow.sh" with the following command:
# bash - hpc
./workflows/test_slurmworkflow/start_workflow.sh
If you uploaded it from Windows, the script may not be executable. You can fix this with the following command:
# bash - hpc
chmod +x workflows/test_slurmworkflow/start_workflow.sh
The workflow will not work if you source the file (with source <script> or . <script>).
You can check the state of your running workflow as usual with squeue -u <user>.
The logs for the workflows are in “workflows/test_slurmworkflow/log/”.
The “start_workflow.sh” script
This start script additionally allows you to start a workflow at a specific step with the -s argument.
# bash - hpc
./workflows/test_slurmworkflow/start_workflow.sh -s 3
This will start the workflow at the 3rd step, skipping steps 1 and 2.
It is sometimes desirable to start the workflow from outside of the project it has to run on. The -d argument allows you to set a different working directory for the workflow.
# bash - hpc
cd /
~/projects/test_proj/workflows/test_slurmworkflow/start_workflow.sh -d ~/projects/test_proj
The previous block places us at the root of the file system with cd /. Then we call the "start_workflow.sh" script using its absolute path and we specify that the working directory for the workflow must be the root of the project.
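The two arguments can presumably be combined, for example to restart this workflow at its second step from outside of the project:
# bash - hpc
~/projects/test_proj/workflows/test_slurmworkflow/start_workflow.sh -d ~/projects/test_proj -s 2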
Remember that for renv to work, R must be called from the directory where the ".Rprofile" file is. It's the directory where you can also find the "renv.lock" file.
Conclusion
slurmworkflow is a very low-level package providing only basic building blocks for complex HPC computation.
This package is used for EpiModel applied projects through higher-level functions in EpiModelHPC and in swfcalib, an automated calibration system used for our models (WIP).