OpenMP, MPI and SLURM Settings
MPI-specific settings
In addition to --map-by node, some other MPI arguments can be useful while running the benchmark:

--tag-output: tags each line of stdout with the MPI rank.

--timestamp-output: puts a timestamp on each line of stdout.
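As a minimal sketch (the rank count and the absence of ./iotest arguments are placeholders), these flags can be combined as:

# 4 ranks, one per node in round-robin order; each stdout line is
# prefixed with the rank and a timestamp
mpirun -np 4 --map-by node --tag-output --timestamp-output ./iotest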
MCA parameters provide finer-grained tuning of mpirun. For OpenMPI 3, InfiniBand (IB) support is provided via the openib libraries. To use only IB devices, we can pass --mca btl openib,self,vader. To use a specific port, --mca btl_openib_if_include mlx5_0:1 must be used; this says that we should use only port mlx5_0:1 for MPI communications. It is important to know that this option must be passed on p3-alaska to get the best performance: simply passing --mca btl openib,self,vader will give warnings on p3-alaska, as one of the high-speed ports on the cluster network interface is broken.
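For example, on p3-alaska an invocation restricted to the working IB port might look like the following sketch (rank count and executable arguments are placeholders):

# restrict MPI to the openib, self and vader (shared-memory) transports
# and pin IB traffic to port mlx5_0:1
mpirun -np 4 --mca btl openib,self,vader \
       --mca btl_openib_if_include mlx5_0:1 ./iotest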
Another MCA parameter we should pass here is --mca mpi_yield_when_idle 1. This puts MPI in degraded busy-wait mode. By default, OpenMPI uses aggressive busy-wait, meaning that the processes continually poll for messages to decrease latency. Sometimes this can result in very high CPU usage, and we can turn it off by passing this argument. When there are more processes than processors, i.e., when we overcommit the resources, OpenMPI automatically uses degraded busy-wait mode. This gives minor performance benefits when running on fewer nodes.
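Combining this with the IB options above gives a command along these lines (again a sketch with placeholder arguments):

mpirun -np 4 --mca btl openib,self,vader \
       --mca btl_openib_if_include mlx5_0:1 \
       --mca mpi_yield_when_idle 1 ./iotest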
The MPI argument --map-by node assigns each process to a particular node. Typically, the OS moves these processes around the sockets based on its scheduling policies. Imagine a case where an MPI process starts on Socket 0. After some time, if Socket 0 has too much work to do, the kernel might move this MPI process to Socket 1 to balance the work on its resources. This is not desirable in certain cases, as we lose NUMA locality when migrating processes, so it is important to define the process bindings in MPI applications. OpenMPI binds processes to sockets by default, and for the Imaging IO benchmark this works well as we will not make use of NUMA locality.
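To verify where OpenMPI actually places each rank, its --report-bindings option prints the binding of every process at startup. A quick check might look like the following sketch, with --bind-to socket spelling out the default binding mentioned above:

mpirun -np 4 --map-by node --bind-to socket --report-bindings ./iotest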
All these options are OpenMPI-specific. The equivalent options for Intel MPI will be documented in the future.
OpenMP-specific settings
Similar to the process bindings, we need to bind the OpenMP threads too. As with the processes, we would like the OpenMP threads to bind to sockets as well. To do so, we should export a couple of environment variables: export OMP_PROC_BIND=true OMP_PLACES=sockets. The first variable turns on OpenMP thread binding and the second specifies the places to which the threads should bind.
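A quick way to check that these settings are picked up is the standard OMP_DISPLAY_ENV variable (OpenMP 4.0 and later, not mentioned above), which makes the OpenMP runtime print its internal control variables at startup:

export OMP_PROC_BIND=true OMP_PLACES=sockets
# make the OpenMP runtime print its control variables,
# including the binding settings, when the application starts
export OMP_DISPLAY_ENV=true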
SLURM-specific settings
MPI startup is not standardised and thus mpirun is not the only way to run an MPI application. On most production machines, srun is used to run MPI applications, so some of the MPI-specific arguments discussed above will not work with srun. If srun is using the OpenMPI implementation, MCA parameters can still be used by passing them via environment variables with the prefix OMPI_MCA_. For example, the command line option --mca mpi_yield_when_idle 1 can also be passed as export OMPI_MCA_mpi_yield_when_idle=1. Process binding with SLURM srun is done via the --distribution and --cpu-bind options. For the Imaging IO benchmark, we can use --distribution=cyclic:cyclic:fcyclic --cpu-bind=verbose,sockets. More details on these options can be found in the SLURM documentation.
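Putting these together, the essential srun pattern is sketched below (the task count and executable are placeholders; the full script that follows shows the real invocation):

# MCA parameter in its environment-variable form, then srun with
# the binding options discussed above
export OMPI_MCA_mpi_yield_when_idle=1
srun --ntasks=4 --distribution=cyclic:cyclic:fcyclic \
     --cpu-bind=verbose,sockets ./iotest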
A sample SLURM script using srun to launch MPI, as used on the JUWELS (JSC) machine, is:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=sdp-benchmarks
#SBATCH --account=training2022
#SBATCH --nodes=128
#SBATCH --ntasks=12288
#SBATCH --partition=batch
#SBATCH --output=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/out/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2.out
#SBATCH --error=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/out/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2.out
#SBATCH --mail-type=FAIL
#SBATCH --no-requeue
#SBATCH --exclusive
# GENERATED FILE
set -e
# Purge previously loaded modules
module purge
# Load GCC OpenMPI HDF5/1.10.6-serial FFTW/3.3.8 git, git-lfs modules
module load GCC/9.3.0 OpenMPI/4.1.0rc1 FFTW/3.3.8 git-lfs/2.12.0 git/2.28.0 zlib/1.2.11-GCCcore-9.3.0 HDF5/1.10.6-GCCcore-9.3.0-serial
# Give a name to the benchmark
BENCH_NAME=juwels-io-scalability
# Directory where executable is
WORK_DIR=/p/project/training2022/paipuri/sdp-benchmarks/io_bench/ska-sdp-exec-iotest/src
# Any machine specific environment variables that needed to be given.
# export UCX log to suppress warnings about transport on JUWELS
# (Need to check with JUWELS technical support about
# this warning)
export UCX_LOG_LEVEL=ERROR
# Any additional commands that might be specific to a machine
# Change to script directory
cd $WORK_DIR
echo "JobID: $SLURM_JOB_ID"
echo "Job start time: `date`"
echo "Job num nodes: $SLURM_JOB_NUM_NODES"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
echo "Executing the command:"
CMD="export OMPI_MCA_mpi_yield_when_idle=1 OMP_PROC_BIND=true OMP_PLACES=sockets OMP_NUM_THREADS=24 &&
srun --distribution=cyclic:cyclic:fcyclic \
--cpu-bind=verbose,sockets --overcommit --label --nodes=128 --ntasks=266 --cpus-per-task=24 ./iotest \
--facet-workers=10 --rec-set=small --vis-set=lowbd2 --time=-460:460/1024/256 --freq=260e6:300e6/8192/256 --dec=-30 \
--source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --fork-writer --wstep=64 \
--margin=32 --writer-count=8 \
/p/scratch/training2022/paipuri/sdp-benchmarks-scratch/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2/out%d.h5\
&& rm -rf \
/p/scratch/training2022/paipuri/sdp-benchmarks-scratch/juwels-io-scalability-bare-metal-128-small-lowbd2-256-256-2/out*.h5"
echo $CMD
eval $CMD
echo "Job finish time: `date`"
The equivalent of --tag-output is --label with srun. It is better to reserve all the nodes and cores at the #SBATCH level and use these resources in the srun command. A standard compute node on JUWELS has 48 physical cores (96 including logical cores), so if we are using 128 nodes, the total number of available cores is 128x96 = 12288. Looking at the SLURM script, we see that we are asking for 12288 tasks, which means SLURM will reserve all the cores for the job. When we launch the application using srun, we can then use these reserved resources in whatever way suits our application best.
In the case where we want to overcommit the resources, passing --ntasks and --cpus-per-task at the #SBATCH level will not work, as SLURM complains that it does not have enough resources to reserve. For example, using the same JUWELS standard node, suppose we want 10 tasks with 24 OpenMP threads per task on one node. This needs 240 CPUs in total, while a standard JUWELS compute node offers only 96 CPUs. If we ask SLURM to reserve the resources in this configuration, it will return an error. However, the same configuration can be used via srun with the --overcommit flag without any issues. So it is always safer to pass --ntasks and --cpus-per-task to srun. On JUWELS, we had to explicitly specify the --cpu-bind option, as the default is --cpu-bind=cores. This is something we should keep in mind: check the default settings before running the benchmark on production machines.
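As a minimal sketch of that overcommitted configuration (the time limit and executable are placeholders, and the usual account/partition directives are omitted), the job script might look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=00:30:00

export OMP_NUM_THREADS=24 OMP_PROC_BIND=true OMP_PLACES=sockets
# 10 tasks x 24 CPUs = 240 CPUs on a 96-CPU node, hence --overcommit
srun --ntasks=10 --cpus-per-task=24 --overcommit \
     --cpu-bind=verbose,sockets ./iotest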