OpenMP, MPI and SLURM Settings

MPI specific settings

In addition to --map-by node, some other MPI arguments can be useful while running the benchmark (a sample invocation is shown after the list). Some of them are:

  • --tag-output: This tags each line of stdout with the MPI rank that produced it.

  • --timestamp-output: This puts a timestamp on each line of stdout.
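
For example, both flags can be added directly to the mpirun invocation. This is a minimal sketch, assuming the same iotest executable used in the script later in this document and an arbitrary rank count:

mpirun -np 4 --map-by node --tag-output --timestamp-output ./iotest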

MCA parameters provide finer-grained tuning of the mpirun launch. For OpenMPI 3, InfiniBand (IB) support is provided via the openib libraries. To use only IB devices, we can use --mca btl openib,self,vader. To use a specific port, --mca btl_openib_if_include mlx5_0:1 must be used. This tells OpenMPI to use only the port mlx5_0:1 for MPI communications. It is important to know that this option must be passed on p3-alaska to get the best performance. Simply passing --mca btl openib,self,vader will give us warnings on p3-alaska, as one of the high-speed ports on the cluster network interface is broken.

Another MCA parameter we should pass here is --mca mpi_yield_when_idle 1. This puts MPI in degraded busy-wait mode. By default, OpenMPI uses aggressive busy waiting, meaning that the processes continually poll for messages to decrease latency. Sometimes this can result in very high CPU usage, and we can turn it off by passing this argument. When there are more processes than processors, i.e., when we overcommit the resources, OpenMPI automatically uses degraded busy-wait mode. This gives minor performance benefits when running on fewer nodes.
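
Putting these MCA parameters together, a minimal sketch of an mpirun line would look as follows. The rank count and the iotest arguments are illustrative placeholders taken from the script later in this document, not prescribed values:

mpirun -np 64 --map-by node \
  --mca btl openib,self,vader \
  --mca btl_openib_if_include mlx5_0:1 \
  --mca mpi_yield_when_idle 1 \
  ./iotest --rec-set=small --vis-set=lowbd2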

The MPI argument --map-by node assigns each process to a given node. Typically, the OS moves processes around the sockets based on its scheduling policies. Imagine a case where an MPI process starts on Socket 0. After some time, if Socket 0 has too much work to do, the kernel might move this MPI process to Socket 1 to balance the work on its resources. This is not desirable in certain cases, as we lose NUMA locality when migrating the processes. It is therefore important to define process bindings in MPI applications. OpenMPI binds processes to sockets by default, and for the Imaging IO benchmark this works well, as we will not lose NUMA locality.
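
To check how processes are actually mapped and bound, OpenMPI's --report-bindings option prints the binding of every rank at startup. A minimal sketch, again assuming the iotest executable; --bind-to socket simply makes the default socket binding explicit:

mpirun -np 8 --map-by node --bind-to socket --report-bindings ./iotest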

All these options are OpenMPI-specific. The equivalent options for Intel MPI will be documented in the future.

OpenMP specific settings

Similar to the process bindings, we need to bind the OpenMP threads too. As with the processes, we would like the OpenMP threads to bind to sockets as well. To do so, we should export a couple of environment variables: export OMP_PROC_BIND=true OMP_PLACES=sockets. The first variable turns on OpenMP thread binding and the second one specifies the places to which the threads should bind.
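
A minimal sketch of the exports before launching the benchmark; the thread count of 24 is simply the value used in the JUWELS script below, not a required setting:

export OMP_PROC_BIND=true
export OMP_PLACES=sockets
export OMP_NUM_THREADS=24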

SLURM specific settings

MPI startup is not standardised and thus mpirun is not the only way to run an MPI application. On most production machines, srun is used to run MPI applications, so some of the MPI-specific arguments discussed above will not work with srun. If srun is launching an application built with OpenMPI, MCA parameters can still be used by passing them as environment variables with the OMPI_MCA_ prefix. For example, the command line option --mca mpi_yield_when_idle 1 can also be passed as export OMPI_MCA_mpi_yield_when_idle=1. Process binding with srun is done via the --distribution and --cpu-bind options. For the Imaging IO benchmark, we can use --distribution=cyclic:cyclic:fcyclic --cpu-bind=verbose,sockets. More details on these options can be found in the SLURM documentation.

A sample SLURM script using srun to launch MPI, as used on the JUWELS (JSC) machine, is

#!/bin/bash

#SBATCH --time=01:00:00
#SBATCH --job-name=sdp-benchmarks
#SBATCH --account=training2022
#SBATCH --nodes=128
#SBATCH --ntasks=12288
#SBATCH --partition=batch
#SBATCH --output=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/out/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2.out
#SBATCH --error=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/out/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2.out
#SBATCH --mail-type=FAIL
#SBATCH --no-requeue
#SBATCH --exclusive

# GENERATED FILE

set -e

# Purge previously loaded modules
module purge
# Load GCC OpenMPI HDF5/1.10.6-serial FFTW/3.3.8 git, git-lfs modules
module load GCC/9.3.0 OpenMPI/4.1.0rc1 FFTW/3.3.8 git-lfs/2.12.0 git/2.28.0 zlib/1.2.11-GCCcore-9.3.0 HDF5/1.10.6-GCCcore-9.3.0-serial

# Give a name to the benchmark
BENCH_NAME=juwels-io-scalability

# Directory where executable is
WORK_DIR=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/ska-sdp-exec-iotest/src

# Any machine-specific environment variables that need to be set.
# Set the UCX log level to suppress warnings about transport on JUWELS
# (Need to check with JUWELS technical support about
# this warning)
export UCX_LOG_LEVEL=ERROR

# Any additional commands that might be specific to a machine


# Change to script directory
cd $WORK_DIR

echo "JobID: $SLURM_JOB_ID"
echo "Job start time: `date`"
echo "Job num nodes: $SLURM_JOB_NUM_NODES"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

echo "Executing the command:"
CMD="export OMPI_MCA_mpi_yield_when_idle=1 OMP_PROC_BIND=true OMP_PLACES=sockets OMP_NUM_THREADS=24 &&
  srun --distribution=cyclic:cyclic:fcyclic \
  --cpu-bind=verbose,sockets --overcommit --label --nodes=128 --ntasks=266 --cpus-per-task=24 ./iotest \
  --facet-workers=10 --rec-set=small --vis-set=lowbd2 --time=-460:460/1024/256 --freq=260e6:300e6/8192/256 --dec=-30 \
  --source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --fork-writer --wstep=64 \
  --margin=32 --writer-count=8 \
  /p/scratch/training2022/paipuri/sdp-benchmarks-scratch/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2/out%d.h5\
   && rm -rf \
  /p/scratch/training2022/paipuri/sdp-benchmarks-scratch/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2/out*.h5"

echo $CMD

eval $CMD

echo "Job finish time: `date`"

The equivalent of --tag-output with srun is --label. It is better to reserve all the nodes and cores at the #SBATCH level and then use these resources in the srun command. A standard compute node on JUWELS has 48 physical cores (96 with logical cores included), so if we are using 128 nodes, the total number of available cores is 128x96=12288. Looking at the SLURM script, we see that we are asking for 12288 tasks, which means SLURM will reserve all the cores for the job. When we launch the application using srun, we can then use these reserved resources in whatever way best suits our application.

In the case where we want to overcommit the resources, passing --ntasks and --cpus-per-task at the #SBATCH level will not work, as SLURM complains that it does not have enough resources to reserve. For example, taking the same JUWELS standard node, suppose we want 10 tasks with 24 OpenMP threads per task on one node. This needs 240 CPUs in total, while a standard JUWELS compute node only offers 96 CPUs. If we ask SLURM to reserve the resources in this configuration, it will return an error. However, the same configuration can be used with srun and the --overcommit flag without any issues. So, it is always safer to pass --ntasks and --cpus-per-task to srun. On JUWELS, we had to explicitly specify the --cpu-bind option as the default is --cpu-bind=cores. This is something we should keep in mind, and we should check the default settings before running the benchmark on production machines.
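
A minimal sketch of such an overcommitted job, trimmed to the relevant directives and assuming a single 96-CPU JUWELS-like node: the #SBATCH lines reserve only the node, while the task and CPU counts are passed to srun together with --overcommit.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=00:30:00

# Use degraded busy-wait mode and bind OpenMP threads to sockets
export OMPI_MCA_mpi_yield_when_idle=1
export OMP_PROC_BIND=true OMP_PLACES=sockets OMP_NUM_THREADS=24

# 10 tasks x 24 CPUs = 240 CPUs on a 96-CPU node, hence --overcommit
srun --overcommit --ntasks=10 --cpus-per-task=24 \
  --distribution=cyclic:cyclic:fcyclic --cpu-bind=verbose,sockets \
  ./iotest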