This well-written paper evaluates and compares the performance of several high-performance computing (HPC) accelerator/coprocessor devices. It analyzes the behavior of the Xeon Phi, NVIDIA K20c, and AMD FirePro S9000 using the Open Computing Language (OpenCL) framework. To obtain a fine-grained evaluation, the authors propose and develop FeatureBench, a benchmark suite. The comparison considers only a single accelerator configuration, however.
Although I agree with the authors that OpenCL is a portable framework and probably the best fit for this evaluation, it does not offer the productivity features available for hybrid message passing interface (MPI+X) models in HPC systems, such as CUDA-aware MPI and OpenACC-aware MPI. I wish this aspect had been addressed in the discussion and comparison. Also, it is not clear how the comparison between hardware-accelerated and non-hardware-accelerated transcendental operations is performed: is it through different benchmarks or application programming interface (API) calls, or through compiler options? Finally, what about memory bandwidth and behavior? An analysis and comparison of memory bandwidth and cache effects would have been a welcome contribution.
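
To make the last point concrete, the following is a minimal sketch of the kind of measurement I have in mind; it is not taken from the paper and is not part of FeatureBench. It times a plain host-side buffer copy over a range of buffer sizes and reports effective bandwidth as bytes moved divided by elapsed time. The function names (`effective_bandwidth_gbs`, `time_copy`, `sweep`) and the buffer sizes are my own assumptions for illustration. On an accelerator, the timed region would instead be an OpenCL copy kernel, which this host-only sketch does not attempt.

```python
import time


def effective_bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s, taking 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_moved / seconds / 1e9


def time_copy(n_bytes: int, repeats: int = 5) -> float:
    """Best wall-clock time (seconds) over `repeats` copies of an n_bytes buffer."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one read of src plus one write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == n_bytes  # sanity check that the copy happened
    return best


def sweep(sizes):
    """Map each buffer size to its measured copy bandwidth in GB/s.

    A copy reads n bytes and writes n bytes, so 2 * n bytes are counted.
    """
    return {n: effective_bandwidth_gbs(2 * n, time_copy(n)) for n in sizes}


if __name__ == "__main__":
    # Hypothetical sizes chosen to span typical cache and main-memory footprints.
    for size, gbs in sweep([1 << 16, 1 << 20, 1 << 24, 1 << 26]).items():
        print(f"{size:>10d} bytes: {gbs:8.2f} GB/s")
```

Plotting bandwidth against buffer size in this way typically shows plateaus wherever the working set fits in a given cache level, which is the kind of cache-effect comparison across devices that would have strengthened the paper.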