Some computing fields, such as computer forensics and security analysis, require the replaying of interesting operational sequences. On uniprocessor systems, this is not a problem: critical events are logged and can then be replayed as needed. But what do you do on contemporary multiprocessor systems running distributed parallel applications? The standard logging approach requires single-threaded log entries, which turns multiprocessing back into uniprocessing and defeats its performance advantages. The few record-and-replay solutions that have been created for parallel processing systems carry prohibitive performance penalties as a result of this sequentialization.
The DoublePlay solution presented in this paper minimizes sequentialization and is now the most efficient deterministic replay solution for multiprocessors. Its overhead ranges from 20 to 100 percent, whereas other solutions run as high as 1100 percent. This innovation can make record-and-replay practical for many parallel applications.
Overall execution is subdivided into time slices called epochs, which are bounded by critical events. The 2D matrix formed by epochs and processors is then transposed, exchanging columns for rows. This transposition reorders the epoch/processor segments so that they can be sequenced in parallel, but deterministically, preserving the order of critical events. The application is rerun following this reordered execution sequence, putting the “double” in DoublePlay. The replay log is created by this reordered run, and a deterministic sequence is then available for future replays.
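To make the row/column exchange concrete, here is a minimal sketch in Python. It illustrates only the reordering as described in this review, not the paper's actual implementation. The segment labels, the function names (build_matrix, transpose, replay_order), and the choice of two threads and three epochs are all hypothetical.

```python
from typing import List

# Hypothetical label for one thread's work within one epoch, e.g. "T0/E1".
Segment = str


def build_matrix(num_threads: int, num_epochs: int) -> List[List[Segment]]:
    """Thread-major layout: one row per thread, one column per epoch."""
    return [
        [f"T{t}/E{e}" for e in range(num_epochs)]
        for t in range(num_threads)
    ]


def transpose(matrix: List[List[Segment]]) -> List[List[Segment]]:
    """Exchange rows and columns: one row per epoch, one column per thread."""
    return [list(column) for column in zip(*matrix)]


def replay_order(epoch_major: List[List[Segment]]) -> List[Segment]:
    """Flatten epoch by epoch, giving one fixed, repeatable ordering."""
    return [segment for epoch in epoch_major for segment in epoch]


# Assumed example: 2 threads, 3 epochs.
thread_major = build_matrix(num_threads=2, num_epochs=3)
epoch_major = transpose(thread_major)
order = replay_order(epoch_major)
```

In this sketch, each row of epoch_major groups every thread's segment for one epoch, so the rows can be handed to different processors while the order inside each row stays fixed. Flattening the rows in epoch order yields the same sequence on every run, which is the property a replay log needs.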
I found the subject matter to be complex, and the paper is relatively long at 24 pages. However, the writing, style, and presentation of the paper were excellent and made for easy reading. The paper itself did not get in the way of its subject, a problem that occurs all too frequently, even with simple subjects.
As an apparent breakthrough in its field, this paper should interest anyone researching or developing deterministic replay systems. It may have broader application, though, in operating system design, fault tolerance, or even computer forensics.