Computing Reviews, the leading online review service for computing literature.

Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory
Sultan F., Nguyen T., Iftode L. IEEE Transactions on Parallel and Distributed Systems 13(7): 673-686, 2002. Type: Article

Date Reviewed: Jan 20 2003

This paper addresses the problem of garbage collecting recovery state in a fault-tolerant distributed shared memory (DSM) protocol. In particular, the authors build on their previous work [1,2] and prove the correctness of the algorithms they used for garbage collection of checkpoints and logs. The reader must have a solid understanding of the original system to get the most out of this paper.

Although a great deal of research has been conducted on DSMs, most of it has focused on performance; only recently have there been attempts to design fault-tolerant systems as well. In their previous work [2], the authors address the problem of integrating independent checkpointing and logging with a scalable software DSM protocol, in order to build a DSM system that tolerates single failures and can be deployed on large-scale clusters. Independent checkpointing is used because coordinated checkpointing requires global coordination, which becomes a scalability bottleneck on large systems. However, independent checkpointing requires a careful scheme for garbage collecting obsolete checkpoints and logs without forcing global synchronization among processes.

The main contributions of this paper are the theoretical results on which the system described in the previous work [2] is based, namely lazy log trimming (LLT) and checkpoint garbage collection (CGC). The authors prove bounds on the minimal state that needs to be checkpointed and logged to support recovery from single failures. Although the paper does not discuss it, it would be interesting to see how these ideas could be extended to a system that tolerates more than one fail-stop failure, or one that tolerates Byzantine failures.

Reviewer: Eno Thereska	Review #: CR126855 (0304-0365)

1)	Sultan, F.; Nguyen, T.; Iftode, L. Limited-size logging for fault-tolerant distributed shared memory with independent checkpointing. Technical Report DCS-TR-409, 2000.

2)	Sultan, F.; Nguyen, T.; Iftode, L. Scalable fault-tolerant distributed shared memory. In High Performance Networking and Computing Conference (Dallas, TX, Nov. 4-10, 2000), IEEE Computer Society, New York, 2000, 1–12.

Memory Management (Garbage Collection) (D.3.4 ... )

Checkpoint/Restart (D.4.5 ... )
