Colic, Kalva, and Furht discuss techniques for optimizing full search motion estimation using NVIDIA’s compute unified device architecture (CUDA) and graphics coprocessors.
Full search was chosen because, among motion estimation techniques, it is the best candidate for efficient parallelization. The authors offer a brief but insightful description of graphics processing unit (GPU) architecture, as well as of the CUDA programming model. They then present their optimization strategies, followed by a series of experiments ranging from an unoptimized baseline to the fully optimized motion search. Many comparison charts emphasize the speedup gained at each optimization step. Guidelines provided at the end of the paper apply to nearly all GPU parallelization designs, whenever parallelization is possible, of course.
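To see why full search parallelizes so well, consider a minimal sketch of the technique (not the authors' code, and simplified to a serial form): every candidate displacement in the search window is scored independently by a sum of absolute differences (SAD), so on a GPU each candidate can be assigned to its own thread.

```python
# Hypothetical illustration of full search block matching. The function and
# parameter names are assumptions for this sketch, not from the paper.

def sad(block, candidate):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block, candidate)
               for a, b in zip(row_a, row_b))

def full_search(cur, ref, bx, by, bsize, radius):
    """Return the displacement (dx, dy) minimizing SAD for the block of
    size bsize at (bx, by) in cur, searched over a +/-radius window in ref.
    Each (dx, dy) candidate is scored independently -- the key property
    that makes full search amenable to massive GPU parallelization."""
    block = [row[bx:bx + bsize] for row in cur[by:by + bsize]]
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x < 0 or y < 0 or x + bsize > len(ref[0]) or y + bsize > len(ref):
                continue
            cand = [row[x:x + bsize] for row in ref[y:y + bsize]]
            score = sad(block, cand)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best[1], best[2]
```

In a CUDA implementation, the two nested loops over (dx, dy) collapse into a thread grid, with a parallel reduction selecting the minimum SAD.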
Overall, the paper makes a valuable contribution by showing how to approach optimization tasks using CUDA.