This chapter is a survey of some aspects of reliable design and hardware fault tolerance. It stresses the important areas of self-testing and self-restoration of faulty computer systems. Self-restoration can be accomplished through system modification, by either bypassing the faulty module or introducing a spare unit.
The first section of the chapter briefly mentions some techniques for fault testing and system verification, including methods of adding hardware to enhance testability. The second section describes several procedures for graceful degradation and for partial system recovery, and discusses some of the concepts behind error-detecting and error-correcting codes. The last section discusses various design methods that facilitate system recovery. Recovery begins when an error is detected; the error is then diagnosed and, if possible, the faulty modules are identified. Finally, recovery procedures are carried out, spare units are installed, or both.
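The recovery sequence just described (detect, diagnose, then switch in a spare or bypass the faulty module) can be illustrated with a minimal sketch. It is not taken from the chapter: the names `Module`, `System`, `self_test`, and `recover` are hypothetical, faults are simulated with a boolean flag, and spares are assumed to be healthy.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    healthy: bool = True  # simulated fault state (assumption for illustration)

    def self_test(self) -> bool:
        # Hypothetical built-in self-test; real hardware would run test patterns.
        return self.healthy


@dataclass
class System:
    active: list
    spares: list = field(default_factory=list)

    def diagnose(self) -> list:
        """Detection and diagnosis: return the active modules that fail self-test."""
        return [m for m in self.active if not m.self_test()]

    def recover(self) -> list:
        """Replace each faulty module with a spare, or bypass it if no spare is left."""
        log = []
        survivors = []
        for module in self.active:
            if module.self_test():
                survivors.append(module)
            elif self.spares:
                # Assumes spares are healthy; a real system would test them first.
                spare = self.spares.pop(0)
                survivors.append(spare)
                log.append((module.name, "replaced by " + spare.name))
            else:
                # No spare available: drop the module (graceful degradation).
                log.append((module.name, "bypassed"))
        self.active = survivors
        return log


if __name__ == "__main__":
    system = System(
        active=[Module("cpu0"), Module("mem0", healthy=False), Module("io0", healthy=False)],
        spares=[Module("spare0")],
    )
    for name, action in system.recover():
        print(name, action)
```

With one spare and two faulty modules, the first fault is repaired by substitution and the second by bypass, which leaves the system running in a degraded configuration.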
The chapter cites over 200 papers, most of which it mentions only very briefly. It is a useful survey for those working in the area who are already familiar with many of the basic techniques, but it is too brief to serve as an introduction to the subject.