Computing Reviews, the leading online review service for computing literature.

Effect of Fault Tolerance on Response Time: Analysis of the Primary Site Approach
Huang Y., Jalote P. IEEE Transactions on Computers 41(4): 420-428, 1992. Type: Article

Date Reviewed: Dec 1 1993

One of the basic techniques for achieving fault tolerance is the cold standby approach. The system consists of an active primary site and a set of passive backup sites, which store the list of operations performed by the primary site and periodically store checkpoints of its state. This paper presents an analytic model that describes the behavior of certain types of cold standby systems; the goal is to study the effect of a fault-tolerance technique on the response time of the system. The method is based on the machine repairman model. The model is used to analyze how the average response time is affected by different repair techniques, by the frequency of checkpointing, and by the degree of replication. The results give some insight into the circumstances under which replication and checkpointing really improve the average response time (not too often, according to this model).

I have a minor remark concerning the performance metrics. In a situation like this, which combines long intervals (days, weeks, or years) of short response times (less than a second) with short intervals (hours) of long response times (hours), the average response time contains little information of any relevance. It would be far better to stress the effects of the various techniques on the length of the intervals between failures, on the repair times, and on the average response time during normal operation (which includes the checkpointing overhead). To be fair, these metrics are also contained in the paper, but only as intermediate results.

The paper is easy to read for anyone with some basic knowledge of stochastics. Some of the remarks and discussions are trivial, but they probably help readers who are not familiar with the topic. The validity of the results is treated rather superficially.
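
The remark about the averaged metric can be made concrete with a small calculation. The sketch below is only an illustration: the durations and response times are hypothetical figures chosen to match the orders of magnitude mentioned in the review, not values from the paper, and it assumes a constant request arrival rate so that each regime's share of requests equals its share of time.

```python
def overall_mean_response(normal_hours, normal_rt_s, outage_hours, outage_rt_s):
    """Request-weighted mean response time over one failure cycle.

    Assumes a constant arrival rate, so the fraction of requests that
    fall into each regime equals the fraction of time spent in it.
    """
    total_hours = normal_hours + outage_hours
    p_outage = outage_hours / total_hours
    return (1.0 - p_outage) * normal_rt_s + p_outage * outage_rt_s


# Hypothetical figures (assumptions for illustration, not from the paper).
normal_hours = 30 * 24   # about a month of normal operation between failures
normal_rt_s = 0.5        # sub-second response time while the primary is up
outage_hours = 2.0       # repair takes a couple of hours
outage_rt_s = 3600.0     # requests caught by the outage wait about an hour

mean_rt = overall_mean_response(normal_hours, normal_rt_s,
                                outage_hours, outage_rt_s)

print(f"normal operation : {normal_rt_s:.1f} s")
print(f"during outage    : {outage_rt_s:.0f} s")
print(f"overall average  : {mean_rt:.2f} s")
```

Under these assumptions the overall average comes out at roughly ten seconds, a value that almost no individual request experiences. Reporting the time between failures, the repair time, and the response time during normal operation separately, as the review suggests, keeps that information visible.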

Reviewer: T. Alanko	Review #: CR117040

Reliability, Availability, And Serviceability (C.4 ... )

Fault-Tolerance (D.4.5 ... )

Other reviews under "Reliability, Availability, And Serviceability":	Date

Implementing fault-tolerant services using the state machine approach: a tutorial Schneider F. ACM Computing Surveys 22(4): 299-319, 1990. Type: Article	Jul 1 1992

Network reliability and algebraic structures Shier D., Clarendon Press, New York, NY, 1991. Type: Book (9780198533863)	Sep 1 1992

On building systems that will fail Corbató F. Communications of the ACM 34(9): 72-81, 1991. Type: Article	Sep 1 1992
