Computing Reviews, the leading online review service for computing literature.


The reliability of computer memories
McEliece R. Scientific American 252(1): 88-95, 1985. Type: Article

Date Reviewed: Feb 1 1986

The author presents the background of why error-correcting codes are successful in making computer memories reliable. After discussing the causes of soft errors in computer memories, McEliece presents the method of using Hamming codes to correct single-bit errors and to detect double-bit errors. He concludes with a refreshing paradox to justify the computations for reliability.

The introduction to computer memory (primary mass storage) includes a brief discussion of the architecture of a memory chip. The author presents the alpha particle as the likely cause of most soft errors, because helium nuclei are commonly ejected from heavy atomic nuclei during radioactive decay. Consequently, mathematicians determined that it is better to correct errors than to try to prevent them.

McEliece uses a 32-bit word and one megabyte of memory (8,388,608 cells) to demonstrate the reliability of the memory function and to determine the mean time between failures (MTBF). The MTBF for one memory cell in a 64K chip has been suggested as exceeding one million years. Therefore, the MTBF for a one-megabyte memory is about 43 days: one million years divided by 8,388,608.

Richard Hamming, a mathematician at the Bell Telephone Laboratories, was motivated by the needs of sophisticated telecommunications systems to develop a coding theory that handled errors. He discovered the encoding and decoding techniques of error correction and detection in 1948. The paper provides examples of the (7,4) and (8,4) Hamming codes, using Venn diagrams to illustrate the techniques. The techniques apply parity checks to subsets of bits within one or more bytes of memory.

McEliece then continues the discussion by validating the benefits of correcting single-bit soft errors with error-correcting codes. Adding seven error-correcting memory cells to every 32 bits increases the total number of cells, so errors occur more frequently.
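As a rough illustration of the single-error correction described above, the following Python sketch implements a (7,4) Hamming code. It uses the common positional layout, with parity bits at positions 1, 2, and 4, which is an assumption and may not match the labeling of the Venn diagrams in the paper. The function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7).

    Layout (an assumption, not taken from the paper):
    position: 1   2   3   4   5   6   7
    content:  p1  p2  d1  p3  d2  d3  d4
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # even parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # even parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # even parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(word):
    """Return (corrected word, syndrome); the syndrome is the 1-based
    position of a single flipped bit, or 0 if every parity check passes."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        w[syndrome - 1] ^= 1  # flip the offending bit back
    return w, syndrome


def decode(word):
    """Correct at most one flipped bit, then extract the 4 data bits."""
    w, _ = correct(word)
    return [w[2], w[4], w[5], w[6]]


if __name__ == "__main__":
    codeword = encode([1, 0, 1, 1])
    damaged = list(codeword)
    damaged[4] ^= 1  # simulate a soft error at position 5
    print(codeword, damaged, correct(damaged), decode(damaged))
```

With only three parity bits, this code corrects any single flipped bit but cannot tell a double error from a single one. The (8,4) code adds an overall parity bit for that purpose.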
Whereas the MTBF of one megabyte of unprotected memory is 43 days, the additional parity bits decrease the MTBF to only 35.7 days (before error correction takes place). How long can the computer memory remain reliable before a double-bit soft error occurs? The answer comes from the birthday-surprise paradox, which mathematicians devised in a totally different context: the real MTBF is about 63 years. The paper concludes with a simple application of the birthday theory to arrive at the 63-year MTBF.

Besides the Venn diagrams, the other illustrations (a memory cell in silicon, a memory chip as an array of memory cells, the sequence of events in a soft error, and the probability curves) all complement the topic of the paper. McEliece, who authored [1], presents his material in a very understandable fashion without getting lost in the mechanics of mathematics. As is typical of Scientific American, the paper presents enough of the subject area to explain the topic, allowing the reader to practice the examples on a calculator. This paper will satisfy any reader needing a simple explanation of how error-correcting codes work.
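The figures quoted in this review can be reproduced with a few lines of Python. The sketch below makes several assumptions that are not stated in the review: a year of 365.25 days, soft errors striking cells uniformly and independently, errors persisting until a second one lands in the same word, and the standard birthday approximation sqrt(pi * N / 2) for the expected number of hits before some word among N is hit twice. The paper's own derivation may differ in detail.

```python
import math

DAYS_PER_YEAR = 365.25          # assumption: average calendar year
CELL_MTBF_YEARS = 1_000_000     # per-cell MTBF quoted in the review
DATA_CELLS = 8 * 2**20          # one megabyte = 8,388,608 cells
WORD_BITS = 32
CHECK_BITS = 7                  # extra cells per word for error correction


def mtbf_days(cells):
    """MTBF of a memory in which any single cell failure counts as a failure."""
    return CELL_MTBF_YEARS * DAYS_PER_YEAR / cells


words = DATA_CELLS // WORD_BITS                  # 262,144 words
total_cells = words * (WORD_BITS + CHECK_BITS)   # data plus check cells

unprotected_days = mtbf_days(DATA_CELLS)   # no error correction
raw_days = mtbf_days(total_cells)          # mean time between single-bit errors

# Birthday estimate: expected number of single-bit errors before two of
# them fall in the same word, which a single-error-correcting code cannot fix.
errors_until_double = math.sqrt(math.pi * words / 2)
protected_years = errors_until_double * raw_days / DAYS_PER_YEAR

print(f"unprotected MTBF:    {unprotected_days:.1f} days")
print(f"with check cells:    {raw_days:.1f} days between single-bit errors")
print(f"errors until double: {errors_until_double:.0f}")
print(f"protected MTBF:      {protected_years:.0f} years")
```

Under these assumptions, the script gives roughly 43.5 days, 35.7 days, and 63 years, in line with the values reported above.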

Reviewer: R. A. Smith	Review #: CR110287

1)	McEliece, R. J. The theory of information and coding: a mathematical framework for communication. Addison-Wesley, Reading, MA, 1978. See <CR> 19, 4 (April 1978), Rev. 32,931.

Reliability, Testing, And Fault-Tolerance (B.3.4)

Miscellaneous (B.3.m)

Coding And Information Theory (E.4)


Other reviews under "Reliability, Testing, And Fault-Tolerance":	Date

Parallel Testing for Pattern-Sensitive Faults in Semiconductor Random-Access Memories Mazumder P., Patel J. IEEE Transactions on Computers 38(3): 394-407, 1989. Type: Article	Oct 1 1989

Partitioning techniques for partially protected caches in resource-constrained embedded systems Lee K., Shrivastava A., Dutt N., Venkatasubramanian N. ACM Transactions on Design Automation of Electronic Systems 15(4): 1-30, 2010. Type: Article	May 5 2011

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®