Computing Reviews
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory
Sultan F., Nguyen T., Iftode L. IEEE Transactions on Parallel and Distributed Systems 13(7): 673-686, 2002. Type: Article
Date Reviewed: Jan 20 2003

The problem of garbage collection of recovery states for a fault-tolerant distributed shared memory (DSM) protocol is addressed in this paper. In particular, the authors build on their previous work [1,2], and prove the correctness of the algorithms they used for garbage collection of checkpoints and logs. The reader must have a solid understanding of the original system to get the most out of this paper.

Although a great deal of research has been conducted on DSMs, most of it has focused on performance. Only recently have there been attempts to design DSM systems that are also fault tolerant. In their previous work [2], the authors address the problem of integrating independent checkpointing and logging with a scalable software DSM protocol, to build a DSM system that tolerates a single failure and can be deployed on large-scale clusters. Independent checkpointing is used because coordinated checkpointing requires global coordination, which becomes a scalability bottleneck in large systems. Independent checkpointing, however, requires a careful scheme for garbage collecting obsolete checkpoints and logs without forcing global synchronization among processes.
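
To make the idea of trimming recovery state without global synchronization concrete, here is a minimal sketch. It is not the authors' LLT or CGC algorithm, which this review does not describe in detail. It assumes a simplified sender-side log, in which each process records what it sent to each peer and discards entries once that peer reports, at its own pace, that a local checkpoint covers them. The class `Process` and the methods `send`, `on_peer_checkpoint`, and `replay_for` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process that logs what it sends so a failed peer could replay it.

    Illustrative assumption only: this is a generic sender-side log,
    not the protocol from the reviewed paper.
    """
    name: str
    next_seq: int = 0
    # log[peer] holds (seq, payload) pairs sent to that peer, oldest first
    log: dict = field(default_factory=dict)

    def send(self, peer: str, payload: str) -> int:
        """Record an outgoing update and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.log.setdefault(peer, []).append((seq, payload))
        return seq

    def on_peer_checkpoint(self, peer: str, covered_seq: int) -> int:
        """Drop entries the peer's checkpoint already covers.

        The notice can arrive whenever the peer happens to send it;
        no other process is involved. Returns the number of entries dropped.
        """
        entries = self.log.get(peer, [])
        kept = [(s, p) for (s, p) in entries if s > covered_seq]
        self.log[peer] = kept
        return len(entries) - len(kept)

    def replay_for(self, peer: str) -> list:
        """Payloads a recovering peer would still need from this process."""
        return [p for (_, p) in self.log.get(peer, [])]


a = Process("A")
a.send("B", "w1")   # seq 0
a.send("B", "w2")   # seq 1
a.send("C", "w3")   # seq 2
a.send("B", "w4")   # seq 3
dropped = a.on_peer_checkpoint("B", covered_seq=1)
```

In this toy run, the notice from B trims only B's entries with sequence numbers 0 and 1; the entry logged for C is untouched, because C has not yet reported a checkpoint. The point of the sketch is only that each trimming decision uses information from a single peer.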

The main contributions of this paper are the theoretical results underlying the system described in the authors' previous work [2], namely lazy log trimming (LLT) and checkpoint garbage collection (CGC). The authors prove bounds on the minimal state that needs to be checkpointed and logged to support recovery from a single failure. Although the paper does not discuss it, it would be interesting to see how these ideas could be extended to a system with more than one fail-stop failure, or to a system with Byzantine failures.

Reviewer: Eno Thereska
Review #: CR126855 (0304-0365)
1) Sultan, F.; Nguyen, T.; Iftode, L. Limited-size logging for fault-tolerant distributed shared memory with independent checkpointing. Technical Report DCS-TR-409, 2000.
2) Sultan, F.; Nguyen, T.; Iftode, L. Scalable fault-tolerant distributed shared memory. In High Performance Networking and Computing Conference (Dallas, TX, Nov. 4-10, 2000), IEEE Computer Society, New York, 2000, 1–12.
Memory Management (Garbage Collection) (D.3.4 ... )
Checkpoint/Restart (D.4.5 ... )