Configuration Settings
The number of MPI processes should be chosen based on the recombination parameters (--rec-set) and the number of sockets (NUMA nodes) available on each node. Essentially, the number of MPI processes is the sum of the number of facet workers, the number of subgrid workers and the number of dynamic assignment workers. Currently, we use one worker for dynamic work assignment; this might change in the future. The number of facet workers should be equal to the number of facets in the recombination parameter set, which is presented in Recombination parameters.
The number of subgrid workers should be chosen based on the number of NUMA nodes on each compute node. If a compute node has two sockets, which means two NUMA nodes, we should ideally have two subgrid workers. If there are 16 physical cores on each socket with hyperthreading enabled, we can use all the available cores, both physical and logical, for the subgrid workers; in this example, the number of OpenMP threads would be 32. We have noticed that using all available cores gives a marginal performance benefit for dry runs.
num_mpi_processes = num_facets + num_nodes * num_numa_nodes + num_dynamic_work_assignment

where num_nodes is the number of nodes in the reservation, num_numa_nodes is the number of NUMA nodes in each compute node, num_facets is the number of facets and, finally, num_dynamic_work_assignment is the number of dynamic work assignment workers, which is currently 1.
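As a quick sketch, the formula above can be evaluated directly in the shell. The worker counts below (4 facets, 2 nodes, 2 NUMA nodes per node) are hypothetical values chosen purely for illustration:

```shell
# Hypothetical worker counts, for illustration only
num_facets=4                      # facets in the chosen recombination set
num_nodes=2                       # nodes in the reservation
num_numa_nodes=2                  # NUMA nodes (sockets) per compute node
num_dynamic_work_assignment=1     # currently always one such worker

num_mpi_processes=$(( num_facets + num_nodes * num_numa_nodes + num_dynamic_work_assignment ))
echo "num_mpi_processes = $num_mpi_processes"   # 4 + 2*2 + 1 = 9
```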
Some of the configurations for single and multiple nodes are:
Single Node
The program can be configured using the command line. Useful configurations:
$ mpirun -n 2 ./iotest --rec-set=T05
Runs a small test-case with two local processes to ensure that everything works correctly. It should show RMSE values of about 1e-8 for all recombined subgrids and 1e-7 for visibilities.
$ ./iotest --rec-set=small
$ ./iotest --rec-set=large
Does a dry run of the “producer” end in stand-alone mode. Primarily useful to check whether enough memory is available. The former will use about 10 GB of memory, the latter about 350 GB.
$ ./iotest --rec-set=small --vis-set=vlaa --facet-workers=0
Does a dry run of the “degrid” part in stand-alone mode. Good to check stability and ensure that we can degrid fast enough.
$ ./iotest --rec-set=small --vis-set=vlaa --facet-workers=0 /tmp/out%d.h5
Same as above, but actually writes data out to the given file. By default, each subgrid worker creates two writer threads, hence the %d placeholder is used. The data will be all zeroes, but this runs through the entire back-end without involving actual distribution. Typically quite a bit slower, as writing out data is generally the bottleneck.
$ mpirun -n 2 ./iotest --rec-set=small --vis-set=vlaa /tmp/out%d.h5
Runs the entire thing with one facet and one subgrid worker, this time producing actual-ish visibility data (for a random facet without grid correction).
$ mpirun -n 2 ./iotest --rec-set=small --vis-set=vlaa --time=-230:230/512/128 --freq=225e6:300e6/8192/128 /tmp/out.h5
The --vis-set and --rec-set parameters are just default parameter sets that can be overridden. The command line above increases time and frequency sampling to the point where it would roughly correspond to an SKA Low snapshot (7 minutes, 25% frequency range). The time and frequency specification is <start>:<end>/<steps>/<chunk>, so in this case 512 time steps with chunk size 128 and 8192 frequency channels with chunk size 128. This will write roughly 9 TB of data with a chunk granularity of 256 KB.
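The 256 KB chunk granularity can be checked with a little shell arithmetic. This assumes each visibility is stored as a double-precision complex number (16 bytes), which is an assumption about the storage format rather than something stated here:

```shell
# One chunk covers 128 time steps x 128 frequency channels
time_chunk=128
freq_chunk=128
bytes_per_vis=16   # assumed: one double-precision complex value

chunk_bytes=$(( time_chunk * freq_chunk * bytes_per_vis ))
echo "chunk size: $(( chunk_bytes / 1024 )) KB"   # 256 KB
```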
Distributed
As explained, the benchmark can also be run across a number of nodes. This will distribute both the facet working set and the visibility write rate fairly evenly across nodes. As noted, you want at minimum a producer and a streamer process per node, and OpenMP should be configured such that its threads take full advantage of the machine's available cores, at least for the subgrid workers. Something that would be worth testing systematically is whether facet workers might actually be faster with fewer threads, as they are likely waiting most of the time.
To distribute facet workers among all the nodes, the --map-by node argument should be used for OpenMPI. By default, OpenMPI assigns processes in blocks, and without --map-by node one or more nodes might get many facet workers. This is not what we want, as facet workers are memory bound. With the default OpenMPI mapping, we would end up with subgrid workers on all low ranks and facet workers on high ranks. As facet workers wait most of the time (so use little CPU), yet use a lot of memory, that would make the entire setup very unbalanced.
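To illustrate the difference (a simplified sketch, not OpenMPI's actual mapping logic), the following compares block assignment with round-robin assignment for 6 ranks over 3 nodes:

```shell
nodes=3
ranks=6
ppn=$(( ranks / nodes ))   # ranks per node under block mapping

for rank in $(seq 0 $(( ranks - 1 ))); do
  block_node=$(( rank / ppn ))   # default: consecutive ranks packed onto one node
  rr_node=$(( rank % nodes ))    # --map-by node: ranks dealt out round-robin
  echo "rank $rank: block -> node $block_node, map-by node -> node $rr_node"
done
```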
For example, if we use --rec-set=small across 8 nodes (2 NUMA nodes each and 16 cores on each socket), we want to run 10 producer processes (facet workers) and 16 streamer processes (subgrid workers), using 16 threads each:
export OMP_NUM_THREADS=16
mpirun --map-by node -np 26 ./iotest --facet-workers=10 --rec-set=small $options
This would allocate 2 streamer processes per node with 16 threads each, appropriate for a node with 32 (physical) cores available. Facet workers are typically heavily memory bound and do not interfere too much with co-existing processes outside of reserving large amounts of memory.
This configuration (mpirun --map-by node -np 26 ./iotest --facet-workers=10 --rec-set=small) will just do a full re-distribution of facet/subgrid data between all nodes. This serves as a network I/O test. Note that because we are operating the benchmark without a telescope configuration, the entire grid is going to get transferred, not just the pieces of it that have baselines.
Other options for distributed mode:
options="--vis-set=lowbd2"
Will only re-distribute data that overlaps with baselines, then do degridding.
options="--vis-set=lowbd2 /local/out%d.h5"
Also writes out visibilities to the given file. Note that the benchmark does not currently implement parallel HDF5, so different streamer processes have to write separate output files. The name can be made dependent on the streamer ID by putting a %d placeholder into it, so it won't cause conflicts on shared file systems.
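The substitution can be sketched with printf, assuming the benchmark fills %d with a small integer ID (the exact ID scheme, streamer versus writer thread, may differ):

```shell
template="/local/out%d.h5"

# Each streamer substitutes its own ID into the template,
# so no two processes open the same file.
for streamer_id in 0 1 2; do
  fname=$(printf "$template" "$streamer_id")
  echo "$fname"
done
```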
options="--vis-set=lowbd2 --fork-writer --writer-count=4 /local/out%d.h5"
This will create 4 writer processes for each subgrid worker and write the data to the file system. Remember that without the --fork-writer option, the benchmark will only create threads, and the HDF5 library currently does not support concurrent writing threads, so that alone will not increase the data throughput.
SKA1 LOW and MID settings
To run the benchmark with settings that correspond to SKA1 LOW and SKA1 MID, the following configurations can be used. These settings assume we are running on 16 compute nodes with 2 NUMA nodes and 32 cores on each compute node, and --rec-set=small.
SKA LOW:
export OMP_NUM_THREADS=16
mpirun --map-by node -np 42 ./iotest --rec-set=small --vis-set=lowbd2 --facet-workers=10 --time=-460:460/1024/64 --freq=260e6:300e6/8192/64 --dec=-30 --source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --target-err=1e-5 --margin=32
SKA MID:
export OMP_NUM_THREADS=16
mpirun --map-by node -np 42 ./iotest --rec-set=small --vis-set=midr5 --facet-workers=10 --time=-290:290/4096/64 --freq=0.35e9:0.4725e9/11264/64 --dec=-30 --source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --target-err=1e-5 --margin=32
The above configurations run the benchmark in dry mode without writing the visibility data to the file system. If we want to write the data, we have to add /some_scratch/out%d.h5 to the end, where /some_scratch is the scratch directory of the file system. SKA LOW has 131,328 baselines, and with the above configuration 1,101,659,111,424 (131,328 x 1024 x 8192) visibilities will be produced, which corresponds to roughly 14 TB of data. SKA MID has 19,306 baselines, so 890,727,563,264 visibilities, which is 17 TB of data. The amount of generated data can be affected by the chunk size used: a bigger chunk size involves more subgrids, which eventually requires some re-writes.
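The visibility counts above follow from baselines x time steps x frequency channels and can be reproduced with shell arithmetic (the byte totals depend on storage format and chunking, so only the counts are checked here):

```shell
# SKA LOW: 131,328 baselines, 1024 time steps, 8192 channels
low_vis=$(( 131328 * 1024 * 8192 ))
# SKA MID: 19,306 baselines, 4096 time steps, 11264 channels
mid_vis=$(( 19306 * 4096 * 11264 ))

echo "SKA LOW visibilities: $low_vis"   # 1101659111424
echo "SKA MID visibilities: $mid_vis"   # 890727563264
```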
Running with singularity image
The singularity image can be pulled from the GitLab registry as shown in Prerequisites and Installation. Currently, the singularity image supports three different entry points, which are defined using the apps feature from SCIF. The three entry points are as follows:

openmpi-3.1-ibverbs: OpenMPI 3.1.3 is built inside the container with IB verbs support for high-performance interconnects.
openmpi-4.1-ucx: OpenMPI 4.1.0 with UCX is built inside the container.
openmpi-3.1-ibverbs-haswell: This is similar to openmpi-3.1-ibverbs, albeit the imaging I/O test code is compiled with the haswell microarchitecture instruction set. This is the default entry point to the container unless otherwise specified.
To list all the apps installed in the container, we can use the singularity inspect command:
singularity inspect --list-apps iotest.sif
Typically, singularity can be run using MPI as follows:
mpirun -np 2 singularity run --env OMP_NUM_THREADS=8 --bind /scratch --app ompi-3.1-ibverbs iotest.sif ${ARGS}
The above command launches two MPI processes with 8 OpenMP threads each, using the entry point defined in ompi-3.1-ibverbs of the iotest.sif image; ${ARGS} corresponds to the typical Imaging I/O test arguments presented above. If visibility data is written to non-standard directories, it is necessary to bind the directory using the --bind option, as shown in the command.