Computing Reviews
Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
Iyer R., Young L., Iyer P. IEEE Transactions on Computers 39(4): 525-537, 1990. Type: Article
Date Reviewed: May 1 1991

The authors propose a methodology for recognizing the symptoms of persistent failures in large systems. This methodology also allows one to locate the subsystem in which the error occurred. A log of abnormal system events is used to analyze the occurrences of errors and their types (such as software error or channel check).

This methodology was applied to an IBM 3081 system running under the MVS operating system and to two large CYBER mainframes at the University of Illinois. Both the IBM 3081 and the CYBER generate logs of normal and abnormal events. Super events are deduced from the set of basic events according to their relationships with system-level faults. These super events identify the intermittent manifestation of persistent failures.

Results produced using this methodology were evaluated against the CYBER system staff’s experience and their maintenance log of failures and repairs. In nearly 85 percent of the cases, the engineers were able to confirm directly that inferred super events corresponded to real system problems. For the remaining 15 percent of the cases, the engineers could confirm the existence of a real problem. Moreover, two of the detected failures were long-term, persistent problems that had previously gone undiagnosed.

This novel methodology gives unexpectedly good results for the CYBER example. Nevertheless, a more complete evaluation will require more experiments using other target computers, such as VAXes or Crays.

Reviewer: G. Saucier
Review #: CR123915
Diagnostics (B.1.3 ... )
Error-Checking (B.1.3 ... )
Applications And Expert Systems (I.2.1)
Control Structure Reliability, Testing, And Fault-Tolerance (B.1.3)
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®