Background

The prototype design follows recommendations from the earlier working set memo (Wortmann, 2017). We see (de)gridding as the main computational work to be distributed, as it will eventually involve heavy computation to deal with calibration and non-coplanarity. A scalable implementation requires distribution: even “fat” SDP nodes with multiple accelerators will likely not have enough compute power to shoulder all the required work on their own.

This leads to the proposed distribution by facets/subgrids: facet data stays stationary while the distributed program walks through grid regions. This involves loosely synchronised all-to-all network communication between all participating nodes. As argued above, this communication pattern is likely what will ultimately limit the performance of imaging/predict pipelines.
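
To make this communication pattern concrete, the following is a minimal sketch of a loosely synchronised all-to-all exchange using non-blocking MPI point-to-point messages. It is not the prototype's actual exchange code; payload size and contents are placeholders.

    /* Minimal sketch: every rank exchanges a placeholder chunk with every
     * other rank via non-blocking point-to-point messages, without a
     * global barrier. Not the prototype's actual data layout. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int CHUNK = 1024;  /* placeholder payload size per peer */
        double *send = malloc(sizeof(double) * CHUNK * size);
        double *recv = malloc(sizeof(double) * CHUNK * size);
        for (int i = 0; i < CHUNK * size; i++) send[i] = rank;

        MPI_Request *reqs = malloc(sizeof(MPI_Request) * 2 * size);
        int nreq = 0;
        /* Post all receives first, then all sends. */
        for (int p = 0; p < size; p++) {
            if (p == rank) continue;
            MPI_Irecv(recv + p * CHUNK, CHUNK, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }
        for (int p = 0; p < size; p++) {
            if (p == rank) continue;
            MPI_Isend(send + p * CHUNK, CHUNK, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }
        /* (De)gridding work could overlap with communication here. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d done\n", rank);
        free(send); free(recv); free(reqs);
        MPI_Finalize();
        return 0;
    }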

Finally, we have to consider raw visibility throughput. As we cannot possibly keep all visibilities in memory at all times, this data needs to be handled using mass storage technology. The achieved throughput of this storage system must be large enough to keep pace with the (accelerated) de/gridding work. While this only represents a comparatively predictable “base” load on the order of 1 byte of I/O per 1000 flops executed, we still need to pay attention due to the somewhat unusual amount of I/O required. This is especially significant because we will likely want to serve this data using flexible distributed storage technologies (Taylor, 2018), which introduce another set of scalability considerations.
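
As a back-of-envelope illustration of that ratio: assuming, purely for illustration, a node sustaining 10 Tflop/s on de/gridding (this is an assumed figure, not a specification), the 1 byte per 1000 flops rule of thumb implies roughly 10 GB/s of buffer throughput per node.

    /* Illustrative only: the per-node flop rate is an assumption, the
     * bytes-per-flop ratio is the order-of-magnitude figure quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double node_flops = 10e12;            /* assumed sustained flop/s per node */
        double bytes_per_flop = 1.0 / 1000.0; /* ~1 byte of I/O per 1000 flops */
        printf("Required buffer throughput: %.1f GB/s per node\n",
               node_flops * bytes_per_flop / 1e9);  /* -> 10.0 GB/s */
        return 0;
    }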

Goals:

  • Focused benchmarking of platform (especially buffer and network) hardware and software

  • Verify parametric model assumptions concerning distributed performance and scaling, especially the extended analysis concerning I/O and memory from SDP memo 038

Main aspects:

  • Distribution by facets/subgrids - involves all-to-all loosely synchronised communication and phase rotation work

  • (De)gridding - main work that needs to be distributed. Will involve heavy computational work to deal with calibration and non-coplanarity, which is the main motivation for requiring distribution (see the sketch after this list)

  • Buffer I/O - needs to deal with high throughput and possibly not-entirely-sequential access patterns due to the telescope layout
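
As referenced above, a minimal sketch of the prediction-direction inner loop: convolutional degridding of a single visibility from a subgrid. The kernel, the absence of oversampling, and the data layout are illustrative assumptions and do not reflect the prototype's actual (calibration- and w-aware) kernels.

    #include <complex.h>

    /* Sample one visibility from a size x size complex subgrid using a
     * ksize x ksize convolution kernel anchored at grid position (iu, iv).
     * Purely illustrative: no oversampling, w-term or calibration applied. */
    static double complex degrid_vis(const double complex *subgrid, int size,
                                     const double *kernel, int ksize,
                                     int iu, int iv)
    {
        double complex vis = 0.0;
        for (int dv = 0; dv < ksize; dv++)
            for (int du = 0; du < ksize; du++)
                vis += kernel[dv * ksize + du]
                     * subgrid[(iv + dv) * size + (iu + du)];
        return vis;
    }

The backward (gridding) step would be the transpose of this loop, scattering each visibility into the subgrid with the same kernel weights.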

Technology choices:

  • Written in plain C to minimise the language environment as a possible bottleneck

  • Use MPI for communication - same reason

  • OpenMP for parallelism - a bit limiting in terms of thread control, but seems good enough for a first pass

  • HDF5 for data storage - seems flexible enough for a start. Might port to other storage back-ends at some point (a skeleton combining these technology choices is sketched after this list)
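
As referenced above, a minimal skeleton showing how the technology choices could combine: plain C, MPI for the process grid, OpenMP for on-node parallelism, and serial HDF5 for a per-rank visibility dump. File names, dataset layout and sizes are placeholder assumptions, not the prototype's actual interfaces.

    #include <mpi.h>
    #include <omp.h>
    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NVIS (1024 * 1024)   /* placeholder visibility chunk size */

    int main(int argc, char *argv[])
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* "Predict" a chunk of visibilities in parallel (dummy work). */
        double *vis = malloc(sizeof(double) * 2 * NVIS);
        #pragma omp parallel for
        for (long i = 0; i < NVIS; i++) {
            vis[2*i]   = (double)i;   /* real part placeholder */
            vis[2*i+1] = 0.0;         /* imaginary part placeholder */
        }

        /* Write the chunk to a per-rank HDF5 file (serial HDF5 for now). */
        char name[64];
        snprintf(name, sizeof(name), "vis_rank%05d.h5", rank);
        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hsize_t dims[2] = { NVIS, 2 };
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "vis", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, vis);
        H5Dclose(dset); H5Sclose(space); H5Fclose(file);

        free(vis);
        MPI_Finalize();
        return 0;
    }

A different storage back-end (or parallel HDF5) could later be swapped in behind the same write call.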

Algorithmic choices:

  • We consider the continuum case, but with only one polarisation, Taylor term and snapshot (= no reprojection). Adding these would increase code complexity, but they are likely easy to scale up.

  • We start with prediction (= writing visibilities). This is clearly the wrong way around; the backwards step will be added in the next version.

  • Use the slightly experimental gridded phase rotation (“recombination”), allowing us to skip baseline-dependent averaging while still keeping network transfer low and separating benchmark stages more cleanly (the underlying visibility-space phase rotation is sketched below for reference).
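
For reference, the visibility-space phase rotation that recombination performs in gridded form looks roughly as follows. The sign convention and the (l0, m0) facet-centre parametrisation are assumptions, and the actual recombination algorithm, which operates on facet/subgrid data directly, is not shown.

    #include <complex.h>
    #include <math.h>

    /* Rotate a single visibility towards a facet centre at direction
     * cosines (l0, m0). Sign conventions vary between imaging packages;
     * this only illustrates the operation that recombination carries
     * out on gridded data instead. */
    static double complex rotate_vis(double complex vis,
                                     double u, double v, double w,
                                     double l0, double m0)
    {
        double n0 = sqrt(1.0 - l0 * l0 - m0 * m0);
        double phase = 2.0 * M_PI * (u * l0 + v * m0 + w * (n0 - 1.0));
        return vis * cexp(I * phase);
    }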