Computing Reviews
Effect of Fault Tolerance on Response Time-Analysis of the Primary Site Approach
Huang Y., Jalote P. IEEE Transactions on Computers 41(4): 420-428, 1992. Type: Article
Date Reviewed: Dec 1 1993

One of the basic techniques to achieve fault tolerance is the cold standby approach. The system consists of an active primary site and a set of passive backup sites, which store the list of operations performed by the primary site and periodically store checkpoints of its state. This paper presents an analytic model to describe the behavior of certain types of cold standby systems; the goal is to study the effect of a fault-tolerance technique on the response time of the system. The method is based on the machine repairman model. The model is used to analyze how the average response time is affected by different repair techniques, by the frequency of checkpointing, and by the degree of replication. The results give some insight into the circumstances under which replication and checkpointing really improve the average response time (not too often, according to this model).

I have a minor remark concerning the performance metrics. The situation studied here combines long intervals (days, weeks, or years) of short response times (less than a second) with short intervals (hours) of long response times (hours). In such a situation, the average response time contains little information of any relevance. It would be far better to stress the effects of the various techniques on the length of the intervals between failures, on the repair times, and on the average response time during normal operation (which includes the checkpointing overhead). To be fair, these metrics are also contained in the paper, but only as intermediate results.
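A small numerical sketch may make this remark concrete. The workload below is entirely hypothetical (arrival rate, interval lengths, and response times are assumptions chosen for illustration, not figures from the paper); it only shows how a single overall mean can blend two very different regimes.

```python
# Hypothetical workload, chosen only to illustrate the point; none of these
# numbers come from the paper under review.
ARRIVAL_RATE = 1.0             # requests per second (assumed)
NORMAL_SECONDS = 30 * 86400    # 30 days of failure-free operation (assumed)
NORMAL_RT = 0.5                # mean response time in seconds, normal operation
OUTAGE_SECONDS = 2 * 3600      # a 2-hour failure and repair interval (assumed)
OUTAGE_RT = 3600.0             # mean response time in seconds during the outage


def mean_response_time(segments):
    """Request-weighted mean over (request_count, mean_response_time) pairs."""
    total_requests = sum(count for count, _ in segments)
    total_time = sum(count * rt for count, rt in segments)
    return total_time / total_requests


normal = (ARRIVAL_RATE * NORMAL_SECONDS, NORMAL_RT)
outage = (ARRIVAL_RATE * OUTAGE_SECONDS, OUTAGE_RT)
overall = mean_response_time([normal, outage])

print(f"normal operation: {NORMAL_RT} s")
print(f"during outage:    {OUTAGE_RT} s")
print(f"overall mean:     {overall:.2f} s")
```

Under these assumed numbers the overall mean comes out at roughly 10.5 seconds, about twenty times the normal-operation figure and far below the outage figure, so it describes neither regime. Reporting the time between failures, the repair time, and the normal-operation response time separately keeps that information visible.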

The paper is easy to read for anyone with some basic knowledge of stochastics. Some of the remarks and discussions are trivial, but they probably help readers who are not familiar with the topic. The validity of the results is treated rather superficially.

Reviewer:  T. Alanko Review #: CR117040
Reliability, Availability, And Serviceability (C.4 ... )
Fault-Tolerance (D.4.5 ... )