Publications / Conference Poster

Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI

Hassani, Amin; Skjellum, Anthony; Bangalore, Purushotham V.; Brightwell, Ronald B.

MPI is insufficient when confronting failures. FA-MPI (Fault-Aware MPI) provides extensions to the MPI standard designed to enable data-parallel applications to achieve resilience without sacrificing scalability. FA-MPI introduces transactions as a novel extension to the MPI message-passing model. Transactions support failure detection, isolation, mitigation, and recovery via application-driven policies. To achieve maximum achievable performance of modern machines, overlapping communication and I/O with computation through non-blocking operations is of growing importance. Therefore, we emphasize fault-tolerant, non-blocking communication operations plus a set of nestable lightweight transactional TryBlock API extensions able to exploit system and application hierarchy. This strategy enables applications to run to completion with higher probability than nominally. We modified two proxy applications (MiniFE and LULESH) by adding FA-MPI semantics to them. Finally, we present performance and overhead results for 1K MPI processes.