Grover, Shang, and Li present a bit-level matrix multiply algorithm where products are not computed completely at the word level, but, instead, partial sums and carries in the products are sent to the accumulation operation, resulting in the removal of carry propagation from the critical path. This work extends earlier fixed-point work to fixed-point two’s complement numbers.

Bit-level dependence structures are developed for the matrix multiply algorithm, and used for mapping to a bit-level architecture. The design is proven to be time optimal and conflict free, and for word length *p*, the algorithm is shown to be *O(log p)* faster than word-level matrix multiply.

The paper is well structured, with the inclusion of background material, and the use of examples and figures to illustrate the method and hardware implementation. The text presents a good level of detail for the method, mapping, and implementation of the algorithm. This paper should be of interest to hardware developers, as well as to readers interested in algorithm design and analysis.