Computing Reviews
A resiliency model for high performance infrastructure based on logical encapsulation
Moore J., Kesselman C. HPDC 2012 (Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, Delft, the Netherlands, Jun 18-22, 2012), 283-294. 2012. Type: Proceedings
Date Reviewed: Jul 8 2013

Heterogeneous, dynamically provisioned distributed systems for high-performance computing are increasingly available to support a diverse range of compute- and storage-intensive tasks. For these systems, the authors of this paper propose a resiliency model based on logical encapsulation that offers lower overhead and higher performance than current reactive resiliency approaches.

The primary contribution of the work is a model built from two kinds of assets: simple assets, which consist of virtual machines and storage platforms, and complex assets. Complex assets are a logical encapsulation of an arbitrary number of individually provisioned simple assets, and are recursively defined. This mechanism allows a dynamic distributed system to be managed as a single logical entity during state capture and restore operations. The model clearly defines resiliency mechanisms for simple assets in terms of state capture and rollback for virtual machines and for storage volumes. State capture and rollback for complex assets are defined to run in parallel and asynchronously, which eliminates blocking and minimizes the overhead these operations incur.
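The asset hierarchy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `SimpleAsset` and `ComplexAsset`, the integer `state` field, and the in-memory snapshot list are hypothetical stand-ins, and nesting complex assets inside one another is only one reading of "recursively defined." The sketch shows a capture or rollback request on a complex asset fanning out to all members concurrently rather than one at a time; it ignores whatever coordination the paper uses to keep captured states mutually consistent.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class SimpleAsset:
    """Hypothetical stand-in for one virtual machine or storage volume."""
    name: str
    state: int = 0
    snapshots: List[int] = field(default_factory=list)

    async def capture(self) -> None:
        # Placeholder for a real hypervisor or storage snapshot call.
        await asyncio.sleep(0)
        self.snapshots.append(self.state)

    async def rollback(self) -> None:
        # Restore the most recent captured state, if one exists.
        await asyncio.sleep(0)
        if self.snapshots:
            self.state = self.snapshots[-1]


@dataclass
class ComplexAsset:
    """Hypothetical logical encapsulation of simple (or nested complex) assets."""
    name: str
    members: List[Union[SimpleAsset, "ComplexAsset"]] = field(default_factory=list)

    async def capture(self) -> None:
        # Start every member's capture concurrently and wait for all of them.
        await asyncio.gather(*(m.capture() for m in self.members))

    async def rollback(self) -> None:
        # Roll back every member concurrently in the same way.
        await asyncio.gather(*(m.rollback() for m in self.members))


def demo() -> List[int]:
    vm = SimpleAsset("vm0", state=1)
    vol = SimpleAsset("vol0", state=2)
    head = SimpleAsset("head", state=3)
    node = ComplexAsset("node", [vm, vol])
    cluster = ComplexAsset("cluster", [node, head])  # nested complex asset

    async def scenario() -> None:
        await cluster.capture()                         # one call covers the whole tree
        vm.state, vol.state, head.state = 10, 20, 30    # simulate work, then a fault
        await cluster.rollback()                        # restore every member

    asyncio.run(scenario())
    return [vm.state, vol.state, head.state]
```

Calling `demo()` captures the whole tree through the top-level asset, changes each member's state, and then rolls every member back to its captured value with a single call on `cluster`.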

An evaluation of the proposed model on infrastructure commonly used in industry, across a variety of realistic workloads, lends validity to the work. Results show that the model consistently reduces overhead compared to traditional hypervisor snapshots or message passing interface (MPI) checkpoints. These results are encouraging, and anyone who provisions or uses a heterogeneous distributed system for high-performance workloads should find this paper valuable.

Reviewer:  Chris Lupo Review #: CR141341 (1309-0802)
 
Distributed Architectures (C.1.4 ... )
Fault-Tolerance (D.4.5 ... )
Reliability, Availability, And Serviceability (C.4 ... )
Performance of Systems (C.4)
