Heterogeneous, dynamically provisioned distributed systems for high-performance computing are increasingly available to support a diverse range of compute- and storage-intensive tasks. For these systems, the authors propose a resiliency model based on logical encapsulation that offers lower overhead and higher performance than current reactive resiliency approaches.
The primary contribution of the work is a model built on two kinds of assets: simple assets, which consist of virtual machines and storage platforms, and complex assets. A complex asset is a recursively defined logical encapsulation of an arbitrary number of individually provisioned simple assets. This mechanism allows a dynamic distributed system to be managed as a single logical entity during state capture and restore operations. The model clearly defines resiliency mechanisms for simple assets in terms of state capture and rollback for virtual machines and for storage volumes. For complex assets, state capture and rollback are defined in a parallel and asynchronous manner, which eliminates blocking and minimizes the overhead incurred.
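To make the encapsulation idea concrete, here is a minimal Python sketch of how simple and complex assets might be composed. It is not the authors' implementation. The names `SimpleAsset`, `ComplexAsset`, `capture`, `rollback`, and `run_demo` are hypothetical, the integer `state` stands in for real virtual machine or volume state, and the sketch assumes that "recursively defined" means a complex asset may contain other complex assets. It uses `asyncio.gather`, which gives non-blocking, concurrent fan-out within one process rather than true parallel execution.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class SimpleAsset:
    """Hypothetical stand-in for one virtual machine or storage volume."""
    name: str
    kind: str                    # "vm" or "volume" (illustrative labels)
    state: int = 0               # placeholder for real machine/volume state
    saved: Optional[int] = None  # last captured state, if any

    async def capture(self) -> None:
        await asyncio.sleep(0)   # placeholder for a real snapshot request
        self.saved = self.state

    async def rollback(self) -> None:
        if self.saved is None:
            raise RuntimeError(f"{self.name}: no captured state")
        await asyncio.sleep(0)   # placeholder for a real restore request
        self.state = self.saved


@dataclass
class ComplexAsset:
    """Logical encapsulation of simple assets and (by assumption) other
    complex assets, managed as a single entity."""
    name: str
    members: List[Union[SimpleAsset, "ComplexAsset"]] = field(default_factory=list)

    async def capture(self) -> None:
        # Fan out to all members concurrently; no member waits on another.
        await asyncio.gather(*(m.capture() for m in self.members))

    async def rollback(self) -> None:
        await asyncio.gather(*(m.rollback() for m in self.members))


def run_demo() -> List[int]:
    vm0 = SimpleAsset("vm0", "vm", state=1)
    vol0 = SimpleAsset("vol0", "volume", state=2)
    vm1 = SimpleAsset("vm1", "vm", state=3)
    node = ComplexAsset("node", [vm0, vol0])
    cluster = ComplexAsset("cluster", [node, vm1])  # nested complex asset

    async def scenario() -> None:
        await cluster.capture()           # one call captures the whole tree
        for asset in (vm0, vol0, vm1):
            asset.state *= 10             # simulate work after the capture
        await cluster.rollback()          # one call restores the whole tree

    asyncio.run(scenario())
    return [vm0.state, vol0.state, vm1.state]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch a single `capture` or `rollback` call on the outermost asset reaches every member, which mirrors the review's point that the system can be handled as one logical entity. A real system would replace the placeholder awaits with calls to a hypervisor or storage back end.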
An evaluation of the proposed model on infrastructure commonly used in industry, across a variety of realistic workloads, adds validity to the work. Results show that the model consistently incurs less overhead than traditional hypervisor snapshots or Message Passing Interface (MPI) checkpoints. These results are encouraging, and anyone who provisions or uses a heterogeneous distributed system for high-performance workloads should find this paper valuable.