One of the basic techniques for achieving fault tolerance is the cold standby approach. The system consists of an active primary site and a set of passive backup sites, which store the list of operations performed by the primary site and periodically store checkpoints of its state. This paper presents an analytic model of the behavior of certain types of cold standby systems; the goal is to study the effect of a fault-tolerance technique on the response time of the system. The method is based on the machine repairman model. The model is used to analyze how the average response time is affected by different repair techniques, by the frequency of checkpointing, and by the degree of replication. The results give some insight into the circumstances under which replication and checkpointing really improve the average response time (not too often, according to this model).
I have a minor remark concerning the performance metrics. A situation like this combines long intervals (days, weeks, or years) of short response times (less than a second) with short intervals (hours) of long response times (hours). In such a situation, the overall average response time carries little useful information; it would be far better to stress the effects of the various techniques on the length of the intervals between failures, on the repair times, and on the average response time during normal operation (which includes the checkpointing overhead). To be fair, these metrics also appear in the paper, but only as intermediate results.
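A small numerical sketch may make the objection about the overall average concrete. The figures below are purely hypothetical and are not taken from the paper: 30 days of normal operation with a 0.5 s response time, followed by a 2-hour outage during which a request waits one hour on average. The helper `mean_response_time` is likewise an illustrative name, and the calculation assumes a constant arrival rate, so that each regime is weighted by its share of the total time.

```python
def mean_response_time(up_hours, up_rt_s, down_hours, down_rt_s):
    """Overall mean response time in seconds.

    Assumes a constant request arrival rate, so each regime's weight
    equals its share of the total elapsed time (an illustrative
    simplification, not the paper's model).
    """
    total_hours = up_hours + down_hours
    return (up_hours * up_rt_s + down_hours * down_rt_s) / total_hours


# Hypothetical figures, chosen only to illustrate the argument.
normal_rt_s = 0.5        # response time during normal operation
outage_rt_s = 3600.0     # average wait for a request arriving during an outage
up_hours = 30 * 24       # 30 days between failures
down_hours = 2           # repair time

overall = mean_response_time(up_hours, normal_rt_s, down_hours, outage_rt_s)
print(f"normal operation: {normal_rt_s} s")
print(f"during outage:    {outage_rt_s} s")
print(f"overall average:  {overall:.2f} s")
```

Under these assumptions the overall average comes out at about 10.5 s, a value that describes neither regime: it is roughly twenty times the normal response time and a small fraction of the outage response time. Reporting the time between failures, the repair time, and the normal-operation response time separately keeps all three pieces of information visible.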
The paper is easy to read for anyone with some basic knowledge of stochastics. Some of the remarks and discussions are trivial, but they probably help readers who are not familiar with the topic. The validity of the results is treated rather superficially.