Computing Reviews
The design and implementation of heterogeneous multicore systems for energy-efficient speculative thread execution
Luo Y., Hsu W., Zhai A. ACM Transactions on Architecture and Code Optimization 10(4):1-29, 2013. Type: Article
Date Reviewed: Apr 18 2014

Readers without a solid background in computer architecture will find this paper hard to follow. It revolves around four concepts: parallelism, reconfiguration, heterogeneity, and power efficiency. Parallelism is the main source of the performance we get from multicore processors. One way of extracting parallelism from a program is to have the compiler aggressively extract threads from sequential code, ignoring dependences among the threads as it does so. At runtime, the hardware detects violated dependences and recovers if needed. These threads are referred to as speculative threads. Wrong speculations lead to more dynamic power dissipation, and much work prior to this paper has tried to produce solutions for avoiding ineffective speculations.
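To make the idea concrete, here is a minimal sketch (not the paper's mechanism, and with invented names) of what a speculation check does: iterations run optimistically in parallel, and any logically later iteration that read a location written by an earlier one must be squashed and re-executed.

```python
# Hypothetical sketch of thread-level speculation. Each loop iteration
# runs as a speculative thread; a runtime check squashes any iteration
# that read a location a logically earlier iteration wrote.

def find_squashed(iterations, reads, writes):
    """Return the set of iteration IDs that violated a true dependence.

    reads/writes map each iteration ID to the set of locations touched."""
    squashed = set()
    for i in iterations:
        for j in iterations:
            # Iteration j (logically later) read something iteration i
            # (logically earlier) wrote: the speculation was wrong.
            if j > i and reads[j] & writes[i]:
                squashed.add(j)
    return squashed

# Iteration 1 reads location 'x', which iteration 0 writes -> squash 1.
print(find_squashed([0, 1, 2],
                    reads={0: set(), 1: {"x"}, 2: {"y"}},
                    writes={0: {"x"}, 1: set(), 2: set()}))  # {1}
```

The squashed iteration's re-execution is wasted work, which is exactly the dynamic power cost the paper is concerned with.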

This paper targets speculative threads (extracted from loops) running on heterogeneous multicore processors. Its main goal is to dynamically characterize speculative thread behavior and configure the processor to match that behavior, maintaining good performance while reducing power consumption. The authors consider cores that share the same instruction set architecture but may differ in issue width and in whether they support simultaneous multithreading (SMT), hence the name heterogeneous multicore. They also vary the level-1 (L1) caches but keep the shared level-2 (L2) cache (the last-level cache in this study) intact.

The main criterion the hardware must track is memory access, because a compiler cannot prove the independence of memory references (the memory disambiguation problem). The authors adopt a method (published earlier by other authors) in which the cache coherence protocol is extended with speculatively shared (SpS) and speculatively exclusive (SpE) states. If an invalidation message arrives from a logically earlier thread (each speculative thread has a unique ID that determines its logical order), speculation fails: the thread is squashed and re-executed from the beginning.
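The squash condition can be sketched as follows; this is an illustrative model, not the authors' exact protocol, and the class and field names are invented.

```python
# Illustrative model of the violation check: a thread tracks which cache
# lines it has touched speculatively (SpS/SpE states); an invalidation
# from a logically earlier thread for such a line forces a squash.

class SpecThread:
    def __init__(self, tid):
        self.tid = tid              # unique ID = logical order
        self.spec_lines = set()     # lines held in SpS or SpE state
        self.squashed = False

    def spec_read(self, line):
        self.spec_lines.add(line)   # line enters a speculative state

    def on_invalidate(self, sender_tid, line):
        # Only an invalidation from a *logically earlier* thread that
        # hits a speculatively accessed line is a violation.
        if sender_tid < self.tid and line in self.spec_lines:
            self.squashed = True    # re-execute from the beginning

t = SpecThread(tid=2)
t.spec_read("0x40")
t.on_invalidate(sender_tid=1, line="0x40")
print(t.squashed)  # True: an earlier thread wrote a line we read
```

An invalidation from a logically later thread, or for a non-speculative line, leaves the thread running.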

The authors limit the number of speculative threads in any parallel segment (the sequence of dynamic instructions that is speculatively parallelized) to four. They base this decision on previous work [1] showing that, for applications in the Standard Performance Evaluation Corporation (SPEC) central processing unit (CPU) benchmark suite--the benchmarks the authors use in this paper--returns diminish after four speculative threads. Speculative threads can be allocated either to a single SMT core (SMT mode) or to multiple non-SMT cores, each executing one thread (chip multiprocessor, or CMP, mode).

The authors build their scheme in two steps. In the first step, they profile the benchmark programs that will be used, trying several configurations (varying core capabilities from superscalar to SMT, and varying L1 cache configurations). They determine the best configuration for each benchmark based on the energy-delay-squared product (ED2P), that is, power efficiency without sacrificing performance. Based on this study, they identify a small set of configurations that covers most of the benchmarks used.

The second step is to build the reconfigurable hardware and design the runtime that makes the reconfiguration decisions. The hardware is organized as a group of processing blocks, each consisting of one four-issue SMT core, one two-issue non-SMT core, and one resizable L1 cache. Processing blocks connect to a unified L2 cache through a bus (the authors make it clear that the bus decision is orthogonal to the design and any kind of interconnect could be used). They experiment with four processing blocks.

The runtime maintains a hardware-based resource allocation table, indexed by the program counter of the first instruction of each program segment. The resource allocation scheme uses hardware performance counters to monitor program execution and selects the most energy-efficient configuration for each segment. The first time a thread is encountered, it is assigned a default configuration of a four-issue SMT core with a 64-kilobyte (KB) L1 cache. The runtime then monitors the number of instructions issued from beyond the reorder buffer (ROB) head at a distance equal to the ROB size of a two-issue core; this estimates whether a smaller ROB (that is, a two-issue core instead of a four-issue one) would suffice. Cache-block reuse serves as an indicator of whether the cache is used efficiently or needs resizing. The runtime also measures contention at the issue stage on the default four-issue SMT core; if there is contention, CMP mode is used instead. The paper also presents several tweaks to reduce overhead from sources such as thread migration and reconfiguration.
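The allocation logic described above can be sketched roughly as follows. The counter names, thresholds, and configuration labels here are invented for illustration; the paper's actual counters and policies differ in detail.

```python
# Hedged sketch of a resource-allocation table keyed by the program
# counter of a segment's first instruction. Counter names and threshold
# values are hypothetical, not taken from the paper.

DEFAULT = ("4-issue SMT", "64KB L1")

alloc_table = {}   # pc -> chosen (core, L1) configuration

def choose_config(pc, counters):
    if pc not in alloc_table:
        cfg = list(DEFAULT)
        # Few instructions issue from deep in the ROB -> the smaller
        # ROB of a two-issue core would suffice.
        if counters["deep_rob_issues"] < 100:
            cfg[0] = "2-issue"
        # Poor cache-block reuse -> shrink the L1.
        if counters["block_reuse_rate"] < 0.5:
            cfg[1] = "32KB L1"
        # Heavy issue-stage contention on SMT -> switch to CMP mode.
        if counters["issue_contention"] > 0.8:
            cfg[0] = "CMP (two 2-issue cores)"
        alloc_table[pc] = tuple(cfg)
    return alloc_table[pc]

# A segment that rarely issues deep in the ROB but reuses its cache
# well gets a narrower core with the full-size L1.
print(choose_config(0x401000, {"deep_rob_issues": 40,
                               "block_reuse_rate": 0.9,
                               "issue_contention": 0.1}))
```

Caching the decision per program counter is what lets later dynamic instances of the same segment skip the monitoring phase and go straight to their matched configuration.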

The paper presents extensive, detailed results. The proposed heterogeneous thread-level speculative system outperforms the most energy-efficient homogeneous configuration in 21 out of 25 benchmarks.

Reviewer: Mohamed Zahran Review #: CR142194 (1407-0541)
1) Steffan, J.; Colohan, C.; Zhai, A.; Mowry, T. A scalable approach to thread-level speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM, 2000, 1-12.
Multiple-Instruction-Stream, Multiple-Data-Stream Processors (MIMD) (C.1.2 ... )
