This paper studies soft errors in on-chip cache hierarchies and proposes solutions to the problem. Soft errors result from events such as cosmic ray strikes that flip bits held in on-die transistors. As feature sizes shrink, soft errors present a reliability problem in modern microprocessors. Since on-chip caches occupy most of the die area (60 percent, according to the authors), mitigating soft errors in them is an important problem.
The authors observe that soft errors are a concern only for dirty cache lines, since for a clean line an equivalent copy exists in main memory. Their proposed solution maintains multiple copies of each dirty line in different caches at the same level of the hierarchy. When a soft error strikes a dirty line, that line can therefore simply be dropped, since another copy remains available. This requires that all copies of a dirty line be kept synchronized, either by propagating updates to them or by invalidating them. The paper concludes that when high performance is required, propagating updates makes sense because it avoids invalidating a line that may be needed; when low energy is required, it makes better sense to invalidate lines and so avoid interconnect traffic.
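The mechanism summarized above can be illustrated with a small toy model. This is only a sketch based on the summary, not the authors' implementation: the names Line, Cache, write, and recover are hypothetical, error detection is assumed to happen elsewhere, and a peer cache is assumed to already hold a copy of the line, since the summary does not say how copies are created.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    value: int
    dirty: bool = False


@dataclass
class Cache:
    lines: dict = field(default_factory=dict)  # address -> Line


def write(caches, writer, addr, value, policy="update"):
    """Write `value` in caches[writer] and synchronize copies in peer caches.

    Returns the number of full-line transfers sent over the interconnect.
    """
    caches[writer].lines[addr] = Line(value, dirty=True)
    transfers = 0
    for i, peer in enumerate(caches):
        if i == writer or addr not in peer.lines:
            continue
        if policy == "update":
            peer.lines[addr] = Line(value, dirty=True)  # copy stays usable
            transfers += 1
        else:
            del peer.lines[addr]  # invalidate: no line data is transferred
    return transfers


def recover(caches, memory, faulty, addr):
    """Handle a detected soft error on `addr` in caches[faulty].

    Returns the recovered value, or None if no valid copy exists.
    """
    line = caches[faulty].lines.pop(addr)  # drop the corrupted line
    if not line.dirty:
        return memory[addr]  # clean line: main memory holds the same value
    for i, peer in enumerate(caches):
        if i != faulty and addr in peer.lines:
            return peer.lines[addr].value  # dirty line: use a peer's copy
    return None


# Hypothetical usage: two peer caches, with the peer already holding the line.
memory = {0x40: 1}
caches = [Cache(), Cache()]
caches[1].lines[0x40] = Line(1)
sent = write(caches, 0, 0x40, 7, policy="update")
value = recover(caches, memory, 0, 0x40)
```

In this toy model, the update policy costs one line transfer per write but keeps the peer's copy usable, while the invalidate policy sends no line data and leaves the peer without the line. That mirrors the performance versus energy trade-off described above. How the paper keeps a dirty line protected after an invalidation is not covered by this sketch.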
This paper makes interesting points about the vulnerability of cache lines.