Sohn, Woo, and Yoo propose a low-power vertex shader for mobile terminals, characterized by programmability, single instruction multiple data (SIMD) vector processing, and a multithreaded programming paradigm and microarchitecture. They report a geometry processing rate of 7.2 megavertices per second at a power consumption of 115 milliwatts (mW).
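To put the two reported figures in perspective, they can be combined into an energy cost per vertex. The short sketch below is only arithmetic on the numbers quoted above; the function name is hypothetical, and the calculation assumes that the 115 mW figure was measured at the 7.2 megavertices per second operating point.

```python
def energy_per_vertex_nj(vertices_per_second, power_watts):
    """Return the energy spent per vertex, in nanojoules.

    Assumes the power figure was measured at the stated throughput.
    """
    joules_per_vertex = power_watts / vertices_per_second
    return joules_per_vertex * 1e9


# Figures quoted in the review: 7.2 Mvertices/s at 115 mW.
reported_rate = 7.2e6    # vertices per second
reported_power = 0.115   # watts

print(f"{energy_per_vertex_nj(reported_rate, reported_power):.1f} nJ per vertex")
```

Under that assumption, the design spends roughly 16 nanojoules per processed vertex.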
The three-dimensional (3D) graphics pipeline has been addressed quite successfully in current- and next-generation video game consoles. Only recently have users come to require very high geometry processing capability on mobile terminals. Currently, such terminals perform this processing on the main reduced instruction set computer (RISC) central processing unit (CPU), using its integer datapath or a floating-point datapath, coprocessor, or instruction set extension. Such a solution clearly cannot provide sufficient performance for vertex shading, because the operating frequency it would require is constrained by the power budget of the system on a chip (SoC).
Key characteristics of the architecture reported in this work are the fixed-point, 128-bit SIMD datapath, a multithreaded communication channel to the ARM10 CPU, and a programmable vertex engine. The authors present the fixed-point format as a winning solution for portable graphics by virtue of its microarchitectural simplicity and, consequently, its ability to operate at a higher frequency than a full Institute of Electrical and Electronics Engineers (IEEE) 754 floating-point (FP) implementation. The second aspect they elaborate on is the multithreaded coprocessor interface to the ARM10 processor. Lockstep operation with the controlling CPU ensures support for precise exceptions, and removes the need for dedicated memory infrastructure.
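As an illustration of why fixed-point arithmetic is attractive here, the following minimal sketch transforms one vertex by a 4x4 matrix using only integer multiplies, shifts, and adds. It is not the authors' datapath: the Q16.16 format (four 32-bit lanes filling a 128-bit vector) is an assumption, since the review does not state the bit allocation, and all function names are hypothetical. Python integers do not overflow, so the 32-bit wraparound or saturation of real hardware is not modeled.

```python
FRAC_BITS = 16          # assumption: Q16.16, not stated in the review
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to Q16.16 (round to nearest)."""
    return int(round(x * ONE))


def to_float(q):
    """Convert a Q16.16 value back to a float."""
    return q / ONE


def fx_mul(a, b):
    """Multiply two Q16.16 values: integer multiply, then shift.

    The right shift floors the result (rounds toward minus infinity).
    """
    return (a * b) >> FRAC_BITS


def fx_dot4(a, b):
    """Four-lane dot product, as a 4-wide SIMD unit might compute it."""
    return sum(fx_mul(x, y) for x, y in zip(a, b))


def transform_vertex(matrix, vertex):
    """Multiply a 4x4 fixed-point matrix by a 4-component vertex."""
    return [fx_dot4(row, vertex) for row in matrix]


# Translate the point (1.5, -2.25, 0.5) by (2, 3, 4).
translate = [[to_fixed(v) for v in row] for row in (
    (1, 0, 0, 2),
    (0, 1, 0, 3),
    (0, 0, 1, 4),
    (0, 0, 0, 1),
)]
vertex = [to_fixed(v) for v in (1.5, -2.25, 0.5, 1.0)]
print([to_float(c) for c in transform_vertex(translate, vertex)])
```

Each lane needs only an integer multiplier and a shifter, with none of the exponent alignment, normalization, or rounding-mode logic of an IEEE 754 unit, which is the simplicity argument the authors make.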
An interesting observation made by the authors is that the ARM CPU does not execute an integer instruction in parallel with a coprocessor instruction. As a result, they chose to have the vertex engine execute its code independently of the main CPU. Figure 6a illustrates this limitation and depicts the advantage of their method. The authors proceed to describe the instruction set architecture of the multithreaded coprocessor. Finally, they discuss the application-specific integrated circuit (ASIC) implementation, and provide performance data for the processor-coprocessor combination.
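The benefit of decoupling can be seen with a deliberately crude cycle count. The toy model below is not taken from the paper: it assumes one instruction per cycle, no stalls, and a hypothetical fixed launch overhead, and its function names are invented for illustration. It only contrasts serialized issue, where the CPU and coprocessor share one instruction slot, with independent execution, where the two instruction streams overlap.

```python
def serialized_cycles(cpu_instrs, vertex_instrs):
    """Cycles when only one of CPU or coprocessor can issue per cycle.

    Toy assumption: one instruction per cycle, no stalls.
    """
    return cpu_instrs + vertex_instrs


def decoupled_cycles(cpu_instrs, vertex_instrs, launch_overhead=0):
    """Cycles when the vertex engine runs its own code independently.

    After a (hypothetical) launch overhead, both streams run
    concurrently, so the longer one determines the total.
    """
    return launch_overhead + max(cpu_instrs, vertex_instrs)


# Illustrative workload: 100 CPU instructions, 80 vertex instructions.
print(serialized_cycles(100, 80))                    # 180
print(decoupled_cycles(100, 80, launch_overhead=2))  # 102
```

In this simplified picture, the gain approaches a factor of two when the two streams are of similar length, and shrinks as one stream dominates.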
Overall, the authors demonstrate excellent technical ability and innovation in a number of ways, including using fixed-point arithmetic, developing a multithreaded coprocessor, and, finally, working around the apparently quite severe limitation of the ARM10 coprocessor interface, which mandates that, in any given cycle, either the ARM or the coprocessor, but not both, can execute an instruction.