“Lazy Fault Recovery for Redundant MPI”
Date: Thursday, June 6th, 2019
Time: 2:00 PM – 4:00 PM
Location: Building 14 Rm 238B
Committee: Professors Pantoja, Lupo, Smith
Abstract:
Distributed systems (DS), in which multiple computers share a workload across a network, are used everywhere, from data-intensive computation to storage and machine learning. DS provide a relatively cheap and efficient way to achieve stability and improved performance for computationally intensive applications. In a DS, faults and failures are the norm, not the exception. DS can experience problems caused by application bugs, operating system bugs, and failures of disks, memory, connectors, networking, power supplies, and other components; therefore, constant monitoring and failure detection are fundamental, and automatic recovery must be integral to the system. One of the most commonly used programming models for DS is the Message Passing Interface (MPI). Unfortunately, MPI does not support fault detection or recovery. In this thesis, we build a replica-based recovery mechanism that works on top of the asynchronous fault detection implemented in previous work. Results show that our recovery implementation is successful and that the overhead in execution time is minimal.