This research paper reports on performance improvements for pipelined scientific computers provided by two scalar compilation techniques: loop unrolling and software pipelining. Architectural features to support the techniques are also discussed. The authors chose a CRAY-1S architecture with an expanded scalar register set as a baseline and used the first 14 Livermore Loops as a benchmark. They studied performance improvements by modifying the code generated by the CRAY FORTRAN compiler (CFT) with the vectorizer turned off. Results were obtained using a simulator described elsewhere.
From the figures, the paper concludes that loop unrolling can produce significant performance improvements. Although software pipelining achieves lower speedups, the authors point out that it demands less hardware than loop unrolling. The CRAY-1S architecture with additional scalar registers, a larger instruction buffer, and loop unrolling reaches a level of performance comparable to that of the CRAY-1S with the vector unit and the CFT vectorizing compiler. The combination of loop unrolling and dynamic software pipelining (on a machine partitioned into address and execute processors) shows a speedup of 2.64 over the baseline.
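For readers less familiar with the two techniques, the following C sketch may help. It is not taken from the paper, which worked on CFT-generated CRAY code. The loop body, the function names, and the unroll factor of 4 are illustrative assumptions. The unrolled version repeats the body to cut loop overhead and expose independent operations. The software-pipelined version issues the load for the next iteration before finishing the current one.

```c
#include <stddef.h>

/* Reference loop: one load, one multiply-add, one store per iteration. */
void scale_ref(double *y, const double *x, double a, double b, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}

/* Loop unrolling (factor 4, chosen arbitrarily for illustration):
 * fewer branches and index updates, and four independent operations
 * per trip for the scheduler to interleave. */
void scale_unrolled(double *y, const double *x, double a, double b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = a * x[i]     + b;
        y[i + 1] = a * x[i + 1] + b;
        y[i + 2] = a * x[i + 2] + b;
        y[i + 3] = a * x[i + 3] + b;
    }
    for (; i < n; i++)          /* remainder iterations */
        y[i] = a * x[i] + b;
}

/* Software pipelining (two stages, written out by hand):
 * the load for iteration i+1 is issued before the compute and
 * store of iteration i, so load latency overlaps useful work. */
void scale_pipelined(double *y, const double *x, double a, double b, size_t n) {
    if (n == 0)
        return;
    double cur = x[0];                  /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        double next = x[i + 1];         /* stage 1 of iteration i+1 */
        y[i] = a * cur + b;             /* stage 2 of iteration i */
        cur = next;
    }
    y[n - 1] = a * cur + b;             /* epilogue: drain the pipeline */
}
```

In this sketch the unrolled body is about four times larger and keeps more values live at once, which is consistent with the review's remark that unrolling benefits from extra scalar registers and a larger instruction buffer. The pipelined loop keeps the original body size and needs only one extra temporary.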
The paper is meant for compiler writers and computer architects. It is easy to read, and the experimental results provide some insight into the usefulness of the two compiler techniques. Some prior knowledge of these techniques would be helpful to the reader. A few related papers have appeared since this paper was written.