Running Veros on a cluster
This tutorial walks you through some of the most common challenges that are specific to large, shared architectures like clusters and supercomputers. If you still have trouble setting up or running Veros on a large architecture after reading it, you should first contact the administrator of your cluster. If that does not resolve the problem, feel free to open an issue.
Probably the easiest way to install Veros on a cluster is, once again, to use Anaconda. Since it is mostly platform independent and does not require elevated permissions, Anaconda lets you try out Veros without too much hassle.
If you are an administrator and want to make Veros accessible to multiple users on your cluster, we recommend that you do not install Veros system-wide, since this severely limits what users can do. First, they won’t be able to install additional Python modules they might want to use for post-processing or development. Second, the source code (and playing with it) is supposed to be a critical part of the Veros experience. Instead, you could e.g. use virtualenv to create a lightweight Python environment for every user that they can freely manage.
If you want to run Veros on a shared computing architecture, there are several issues that require special handling:
- Preventing timeouts: On shared clusters, scheduling constraints commonly limit the maximum execution time of a given process. Processes that exceed this time are killed. To avoid having to restart long-running processes manually after each timeout, one usually makes use of a resubmit mechanism: the long-running process is split into chunks that each finish before a timeout is triggered, with subsequent runs starting from the restart files that the previous process has written.
- Allocation of resources: Most applications use MPI to distribute work across processors; however, this is not supported by Bohrium. We therefore need to make sure that just one single process on a single node is started for our simulation (Bohrium will then divide the workload among different threads using OpenMP).
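The resubmit mechanism from the first point comes down to simple arithmetic: pick a chunk length that comfortably finishes within the scheduler's time limit, then work out how many chunks are needed to cover the whole simulation. The following is a minimal sketch in shell; the variable names and the 720-day total are illustrative assumptions, not Veros settings.

```shell
#!/bin/bash
# Illustrative values only (assumptions, not Veros defaults)
total_days=720        # total simulated time we want to cover
chunk_days=90         # simulated time per job, chosen to fit the time limit

seconds_per_day=86400
chunk_seconds=$(( chunk_days * seconds_per_day ))

# round up so that a final, possibly shorter, chunk is not dropped
num_chunks=$(( (total_days + chunk_days - 1) / chunk_days ))

echo "chunk length: ${chunk_seconds} s"
echo "number of chunks: ${num_chunks}"
```

With these numbers, each chunk covers 7776000 seconds and eight chunks are needed.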
To solve these issues, the scheduling manager needs to be told exactly how it should run our model. This is usually done by writing a batch script that prepares the environment and states which resources to request. The exact set-up of such a script will vary depending on the scheduling manager running on your cluster, and on how exactly you chose to install Veros and Bohrium. One possible way to write such a batch script for the scheduling manager SLURM is presented here:
    #!/bin/bash -l
    #
    #SBATCH -p mycluster
    #SBATCH -A myaccount
    #SBATCH --job-name=veros_mysetup
    #SBATCH --nodes=2
    #SBATCH --ntasks=16
    #SBATCH --cpus-per-task=4
    #SBATCH --exclusive
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=email@example.com

    # load module dependencies
    module load bohrium

    # only needed if not found automatically
    export BH_CONFIG=/path/to/bohrium/config.ini

    # if needed, you can modify the internal Bohrium compiler flags
    export BH_OPENMP_COMPILER_FLG="-x c -fPIC -shared -std=gnu99 -O3 -Werror -fopenmp"

    # set number of threads to cpus-per-task
    export OMP_NUM_THREADS=4

    # adapt srun command to your available scheduler / MPI implementation
    veros resubmit -i my_run -n 8 -l 7776000 \
        -c "srun --mpi=pmi2 -- python my_setup.py -b bohrium -v debug -n 4 4" \
        --callback "sbatch veros_batch.sh"
The script is saved as veros_batch.sh in the model setup folder and submitted using sbatch veros_batch.sh.
This script makes use of the veros resubmit command and its --callback option to create a script that automatically re-runs itself in a new process after each successful run (see also Command line tools). Upon execution, a job is created with the resources requested in the #SBATCH header (here two nodes and 16 tasks with 4 CPUs each) that runs the Veros setup located in my_setup.py a total of eight times for 90 days (7776000 seconds) each, with identifier my_run. Note that the --callback "sbatch veros_batch.sh" part of the command is needed to actually create a new job after every run, so that no single job runs long enough to be killed by a timeout.
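Hard-coding the thread count and the process numbers makes it easy for them to drift out of sync with the #SBATCH header. As a hedged sketch, checks like the following could be placed before the veros resubmit call in the batch script. SLURM_NTASKS and SLURM_CPUS_PER_TASK are environment variables that SLURM sets inside a job when the corresponding options are given. The names nproc_x and nproc_y, the fallback values, and the reading of -n 4 4 as a 4 x 4 process grid are assumptions made for this illustration.

```shell
#!/bin/bash
# Hypothetical process grid, assumed to correspond to the "-n 4 4" argument
nproc_x=4
nproc_y=4

# SLURM exports these inside a job when --ntasks / --cpus-per-task are given;
# the fallbacks (assumptions) let the snippet run outside of SLURM as well
ntasks="${SLURM_NTASKS:-16}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-4}"

# one OpenMP thread per CPU reserved for each task
export OMP_NUM_THREADS="${cpus_per_task}"

# abort early if the process grid does not match the requested task count
if [ $(( nproc_x * nproc_y )) -ne "${ntasks}" ]; then
    echo "process grid ${nproc_x}x${nproc_y} does not match ${ntasks} tasks" >&2
    exit 1
fi

echo "running ${ntasks} tasks with ${OMP_NUM_THREADS} threads each"
```

Deriving the values from the scheduler's environment means that only the #SBATCH header has to be edited when the requested resources change.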