CMP-MSI, Feb. 11th 2007
Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors
Carmelo Acosta (1), Francisco J. Cazorla (2), Alex Ramírez (1,2), Mateo Valero (1,2)
(1) UPC-Barcelona  (2) Barcelona Supercomputing Center
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Introduction
- As process technology advances, deciding what to do with the additional transistors becomes ever more important.
- The current trend is to replicate cores:
  - Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad
  - AMD: Opteron Dual-Core, Opteron Quad-Core
  - IBM: POWER4, POWER5
  - Sun Microsystems: Niagara T1, Niagara T2
Introduction
(Die photos: POWER4 (CMP) and POWER5 (CMP+SMT).)
- The memory subsystem (shown in green) spreads over more than half of the chip area.
Introduction
- Each L1 cache is connected to each L2 bank through a bus-based interconnection network.
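As a side note on the banked L2 organization shown in the figures, the C sketch below illustrates one possible way a cache-line address could be interleaved across the four shared L2 banks; the 64-byte line size and line-level interleaving are assumptions for illustration, not taken from the slides.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE     64   /* assumed L2 line size in bytes         */
#define NUM_L2_BANKS   4   /* four banks (b0-b3), as in the figures */

/* Drop the line offset, then use the low line-address bits as the bank index. */
static unsigned l2_bank(uint64_t paddr)
{
    return (unsigned)((paddr / LINE_SIZE) % NUM_L2_BANKS);
}

int main(void)
{
    uint64_t addrs[] = { 0x1000, 0x1040, 0x1080, 0x10C0 };
    for (int i = 0; i < 4; i++)
        printf("addr 0x%llx -> L2 bank %u\n",
               (unsigned long long)addrs[i], l2_bank(addrs[i]));
    return 0;
}

With this interleaving, consecutive lines map to consecutive banks, so requests from the cores spread over the bus-connected banks.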
Goal
- Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?
- No: well-known SMT ideas, such as the instruction fetch policy, have to be revisited.
ICOUNT
(Diagram: two threads sharing the fetch stage and ROB; a load from the blue thread misses in L2 and its fetch stalls.)
- ICOUNT keeps the processor's resources balanced between the running threads.
- All resources held by the blue thread remain unused until the L2 miss is resolved.
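To make the policy concrete, here is a minimal C sketch of ICOUNT-style thread selection: each cycle, fetch from the runnable thread holding the fewest in-flight instructions, so no thread monopolizes the shared resources. The structure and field names are illustrative assumptions, not the simulator's code.

#include <limits.h>
#include <stdio.h>

#define NUM_THREADS 2

struct thread_state {
    int in_flight;   /* instructions fetched but not yet committed   */
    int stalled;     /* fetch currently blocked (e.g., I-cache miss) */
};

/* Return the thread to fetch from this cycle, or -1 if none can fetch. */
static int icount_pick(const struct thread_state t[NUM_THREADS])
{
    int best = -1, best_count = INT_MAX;
    for (int i = 0; i < NUM_THREADS; i++)
        if (!t[i].stalled && t[i].in_flight < best_count) {
            best_count = t[i].in_flight;
            best = i;
        }
    return best;
}

int main(void)
{
    struct thread_state t[NUM_THREADS] = { { 5, 0 }, { 3, 0 } };
    printf("fetch from thread %d\n", icount_pick(t));   /* thread 1 */
    return 0;
}

Note that when a thread's load misses in L2 its in-flight count stays high, so ICOUNT fetches less from it, but the resources it already holds remain allocated until the miss returns.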
FLUSH
(Diagram: on the L2 miss, the blue thread's pending instructions are flushed and its fetch stalls.)
- When a thread experiences an L2 miss, FLUSH is triggered: all resources devoted to that thread's pending instructions are freed.
- The freed resources allow the other threads to make additional forward progress.
- Because an L2 miss is detected late, FLUSH can be combined with L2 miss prediction to act earlier.
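The C sketch below illustrates the FLUSH reaction, assuming a simple per-thread list of in-flight instruction sequence numbers (the data structures are assumptions for illustration): on a detected or predicted L2 miss, squash the offending thread's instructions younger than the load, free their entries, and stall its fetch until the data returns.

#include <stdio.h>

#define ROB_SIZE 16

struct thread {
    unsigned long rob[ROB_SIZE];  /* sequence numbers of in-flight insts */
    int rob_count;
    int fetch_stalled;
};

/* Squash every instruction younger than the missing load. */
static void flush_on_l2_miss(struct thread *t, unsigned long load_seq)
{
    int kept = 0;
    for (int i = 0; i < t->rob_count; i++)
        if (t->rob[i] <= load_seq)
            t->rob[kept++] = t->rob[i];
    printf("flushed %d instructions\n", t->rob_count - kept);
    t->rob_count = kept;          /* freed entries return to the shared pool */
    t->fetch_stalled = 1;         /* no fetch until the L2 fill arrives      */
}

int main(void)
{
    struct thread t = { { 100, 101, 102, 103, 104 }, 5, 0 };
    flush_on_l2_miss(&t, 101);    /* seq 101 is the load that missed */
    return 0;
}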
Single vs Multi Core
(Diagrams: one core vs. several cores, each with private I$/D$, connected to four shared L2 banks b0-b3.)
- With more cores there is more pressure on both the interconnection network and the shared L2 banks.
- As a result, the L2 access latency becomes more unpredictable, which is bad for FLUSH.
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Simulation Methodology
- Trace-driven SMT simulator derived from SMTsim.
- C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = threads per core).
(Figure and table omitted: multicore diagram with shared L2 banks, and a table of core details where * marks per-thread resources.)
Simulation Methodology
- Instruction fetch policies: ICOUNT and FLUSH.
- Workloads classified by type:
  - ILP: all threads have good memory behavior.
  - MEM: all threads have bad memory behavior.
  - MIX: mixes both types of threads.
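As a rough illustration of this classification, the C sketch below labels a two-thread workload as ILP, MEM, or MIX from per-thread L2 miss rates; the 1% cut-off is an assumed value, not the criterion used in the study.

#include <stdio.h>

enum wtype { ILP, MEM, MIX };

/* Classify a two-thread workload from its per-thread L2 miss rates. */
static enum wtype classify(double miss_rate0, double miss_rate1)
{
    const double bad = 0.01;  /* assumed "bad memory behavior" threshold */
    int bad0 = miss_rate0 > bad;
    int bad1 = miss_rate1 > bad;
    if (bad0 && bad1) return MEM;
    if (!bad0 && !bad1) return ILP;
    return MIX;
}

int main(void)
{
    printf("%d\n", classify(0.002, 0.050));  /* prints 2 (MIX) */
    return 0;
}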
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Results: Single-Core (2 threads)
- FLUSH yields a 22% average speedup over ICOUNT, obtained mainly in the MEM and MIX workloads.
Results: Multi-Core (2 threads/core)
- FLUSH drops to a 9% average slowdown with respect to ICOUNT in the four-core configuration.
- The more cores, the lower the speedup of FLUSH.
Results: L2 Hit Latency on Multi-Core
- The more cores, the higher and more dispersed the L2 hit latency.
(Figure: distribution of L2 hit latency, in cycles, per configuration.)
Results: L2 Miss Prediction
- In this four-core example, the best choice is predicting an L2 miss after 90 cycles.
- But in this other four-core example, the best choice is not to predict L2 misses at all.
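One plausible reading of "predicting an L2 miss after N cycles" is a timeout-based predictor, sketched in C below (the mechanism and names are assumptions for illustration): a load still outstanding after the threshold is treated as an L2 miss, so FLUSH can act before the miss is actually detected; a threshold of 0 disables prediction.

#include <stdio.h>

struct load {
    long issue_cycle;   /* cycle the load accessed the cache hierarchy */
    int  completed;     /* data already returned                       */
};

/* Predict an L2 miss once the load has been outstanding >= threshold cycles. */
static int predict_l2_miss(const struct load *ld, long now, long threshold)
{
    if (threshold <= 0 || ld->completed)
        return 0;
    return (now - ld->issue_cycle) >= threshold;
}

int main(void)
{
    struct load ld = { 1000, 0 };
    printf("%d\n", predict_l2_miss(&ld, 1095, 90));  /* 1: predicted miss */
    printf("%d\n", predict_l2_miss(&ld, 1095, 0));   /* 0: prediction off */
    return 0;
}

With a banked, bus-based L2 whose hit latency varies, a fixed threshold can misfire on slow hits, which may be why not predicting is the better choice in the second example.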
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Conclusions
- Future high-degree CMPs open new challenging research topics in CMP+SMT cooperation.
- The characteristics of the CMP's outer cache level and interconnection network may heavily affect SMT intra-core performance.
- For example, FLUSH relies on a predictable L2 hit latency, which is heavily affected in a CMP+SMT scenario: FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from the single-core to the quad-core configuration.
CMP-MSI, Feb. 11th 2007
Thank you. Questions?