In high-performance reconfigurable computing (HPRC), a reconfigurable device accelerates the computation-intensive parts of an application. HPRC is an emerging field, and its importance can be seen in the number of companies that have launched systems with a field-programmable gate array (FPGA) attached to their computational nodes, such as SGI’s reconfigurable application-specific computing (RASC), Nallatech’s front-side bus (FSB) modules, and XtremeData’s XD1000 system. All of these share a similar architecture: the system’s microprocessor connects to the FPGA over a high-speed channel. The authors use an Opteron processor that connects to Altera’s Stratix II FPGA via HyperTransport links.
The paper presents an optical lithography simulation algorithm that is accelerated using reconfigurable hardware. “Optical lithography is the technology used for printing circuit patterns onto wafers. As the technology scales down and the feature size is even smaller than the wavelength of the light employed, significant light interference and diffraction may occur during the imaging process.” It is therefore necessary to simulate the imaging process prior to manufacturing, in order to ensure its correctness.
The method used to solve the problem is based on decomposing the “system into many coherent systems with decreasing importance.” As the authors explain, “the image corresponding to each coherent system can be obtained via numerical image convolution, and the final image is the weighted sum of the image of each coherent system.”
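To make the decomposition concrete, here is a minimal one-dimensional sketch of the idea: each coherent system contributes a convolved image, and the final image is the weighted sum of those images. All names, array sizes, and coefficient values are illustrative assumptions, not taken from the paper.

```c
#define N 8   /* mask samples (illustrative) */
#define M 3   /* kernel taps per coherent system (illustrative) */
#define K 2   /* number of coherent systems (illustrative) */

static const double mask[N]       = {0, 1, 1, 1, 0, 0, 1, 0};
static const double kernels[K][M] = {{0.25, 0.5, 0.25}, {-0.5, 1.0, -0.5}};
static const double weight[K]     = {0.8, 0.2};   /* decreasing importance */
static double image[N];

/* Zero-padded, "same"-length 1-D convolution of in[] with kernel[]. */
static void convolve(const double *in, const double *kernel, double *out)
{
    for (int i = 0; i < N; i++) {
        double acc = 0.0;
        for (int j = 0; j < M; j++) {
            int idx = i - j + M / 2;
            if (idx >= 0 && idx < N)
                acc += in[idx] * kernel[j];
        }
        out[i] = acc;
    }
}

/* Final image = weighted sum of each coherent system's convolved image. */
void compute_image(void)
{
    double partial[N];
    for (int i = 0; i < N; i++)
        image[i] = 0.0;
    for (int k = 0; k < K; k++) {
        convolve(mask, kernels[k], partial);
        for (int i = 0; i < N; i++)
            image[i] += weight[k] * partial[i];
    }
}
```

Calling `compute_image()` fills `image[]` with the weighted sum; because each coherent system's convolution is independent, the `k` loop is a natural candidate for parallel hardware.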
In the frequency domain, the convolution is performed by applying fast Fourier transforms to the data. Since very large-scale integration (VLSI) circuit layouts are composed solely of rectangles, the convolution values can be precomputed and stored. Although this method is accurate enough to solve the problem, it is computationally demanding. The authors present a new hardware architecture to solve the problem, and then compare it with existing architectures. Using C, they explore the problem and propose an optimized architecture. Next, a synthesis tool--AutoPilot--generates the final hardware implementation. The algorithm’s kernel is a loop that can be rearranged to exploit its intrinsic parallelism. The authors analyze the results from this rearranged loop to decide on a hardware/software partition and a communication pattern for the system.
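The kind of loop rearrangement described here can be sketched with a generic loop interchange; the variable names and loop structure below are assumptions for illustration, not the paper's actual kernel. Interchanging the loops turns one serialized accumulation into independent per-pixel reductions:

```c
#define NPIX 4   /* output pixels (illustrative) */
#define NK   3   /* precomputed convolution terms per pixel (illustrative) */

/* Kernel-major form: the outer loop walks the precomputed terms, and
   every iteration updates all pixels, serializing on out[]. */
void accumulate_kernel_major(const double table[NK][NPIX], double out[NPIX])
{
    for (int p = 0; p < NPIX; p++)
        out[p] = 0.0;
    for (int k = 0; k < NK; k++)
        for (int p = 0; p < NPIX; p++)
            out[p] += table[k][p];
}

/* Pixel-major form after interchange: each pixel's sum is now an
   independent reduction, so the outer iterations can run in parallel
   (e.g., as replicated datapaths in hardware). */
void accumulate_pixel_major(const double table[NK][NPIX], double out[NPIX])
{
    for (int p = 0; p < NPIX; p++) {
        double acc = 0.0;
        for (int k = 0; k < NK; k++)
            acc += table[k][p];
        out[p] = acc;
    }
}
```

Both forms produce identical sums; the rearranged version simply exposes the independence that a hardware implementation can exploit.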
The paper mainly discusses how to parallelize the hardware implementation and partition the memory, based on the data extracted from the high-level C implementation of the system. In Section 4.2, the authors describe how they rewrote the C code to implement specific architectural decisions. The section concludes that there is still a gap between the software C code and the C code suitable for hardware generation.
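As a rough illustration of what memory partitioning means in a C-to-hardware flow, here is a generic sketch of cyclic array partitioning, a common transformation in such tools; nothing in it is drawn from the paper's actual code.

```c
#define N     8   /* logical array size (illustrative) */
#define BANKS 2   /* number of physical memory banks (illustrative) */

/* Split a single logical array cyclically across BANKS memories so
   that BANKS consecutive elements can be fetched in one cycle. */
void partition_cyclic(const double src[N], double bank[BANKS][N / BANKS])
{
    for (int i = 0; i < N; i++)
        bank[i % BANKS][i / BANKS] = src[i];
}

/* Recover element i from its bank and offset. */
double read_element(const double bank[BANKS][N / BANKS], int i)
{
    return bank[i % BANKS][i / BANKS];
}
```

A hardware pipeline reading `BANKS` elements per cycle would access each bank once, rather than issuing multiple reads to a single memory.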
The paper ends with results from different experiments, and a critique of the Compute Unified Device Architecture (CUDA) version of the algorithm, running on a graphics processing unit (GPU). Unfortunately, the authors fail to explain the scalability advantages of FPGAs over GPUs. The authors conclude that, while a C-based tool is useful and reduces design time, it remains difficult to extract the algorithm’s parallelism and manage the system’s memory mapping. In summary, readers may find ideas in this paper for future research on HPRC machines.