The problem of garbage collection of recovery states for a fault-tolerant distributed shared memory (DSM) protocol is addressed in this paper. In particular, the authors build on their previous work [1,2], and prove the correctness of the algorithms they used for garbage collection of checkpoints and logs. The reader must have a solid understanding of the original system to get the most out of this paper.
Whereas a great deal of research has been conducted on DSMs, most of it has focused on performance metrics. Hence, only recently has there been any attempt to also design fault-tolerant systems. In their previous work [2], the authors address the problem of integrating independent checkpointing and logging with a scalable software DSM protocol to build a single-failure fault-tolerant DSM system that can be deployed on large-scale clusters. Independent checkpointing is used, since coordinated checkpointing requires global coordination, and scalability becomes a bottleneck for large systems. However, independent checkpointing requires a careful scheme for garbage collection of obsolete checkpoints and logs, without forcing global synchronization among processes.
The main contributions of this paper are the theoretical results on which the system described in their previous work [2] is based, namely lazy log trimming (LLT) and checkpoint garbage collection (CGC). The authors prove bounds on the minimal state that needs to be checkpointed and logged to support recovery from single-fault failures. Although not mentioned in this paper, it would be interesting to see how these ideas could be extended to a system with more than one fail-stop failure, or a system with Byzantine failures.