MorphCore: An Energy-Efficient Architecture for High-Performance ILP and High-Throughput TLP Khubaib * M. Aater Suleman + Milad Hashemi Chris Wilkerson.

Slides:

Advertisements

Similar presentations

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.

Advertisements

Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.

How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison.

Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

Bottleneck Identification and Scheduling in Multithreaded Applications José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt.

CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures M. Aater Suleman* Onur Mutlu† Moinuddin K. Qureshi‡ Yale N. Patt* *The.

1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)

June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.

1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)

Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

Simultaneous Multithreading: Multiplying Alpha Performance Dr. Joel Emer Principal Member Technical Staff Alpha Development Group Compaq.

1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)

Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.

Simultaneous Multithreading:Maximising On-Chip Parallelism Dean Tullsen, Susan Eggers, Henry Levy Department of Computer Science, University of Washington,Seattle.

Ryota Shioya, Masahiro Goshimay and Hideki Ando Micro 47 Presented by Kihyuk Sung.

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.

Veynu Narasiman The University of Texas at Austin Michael Shebanow

Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs José A. Joao * M. Aater Suleman * Onur Mutlu ‡ Yale N. Patt * * HPS Research.

Architecture Basics ECE 454 Computer Systems Programming

Copyright © 2012 Houman Homayoun 1 Dynamically Heterogeneous Cores Through 3D Resource Pooling Houman Homayoun Vasileios Kontorinis Amirali Shayan Ta-Wei.

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.

By Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas This paper appears in: Micro, IEEE March/April 2011 (vol. 31 no. 2) pp 마이크로 프로세서.

Hybrid-Scheduling: A Compile-Time Approach for Energy–Efficient Superscalar Processors Madhavi Valluri and Lizy John Laboratory for Computer Architecture.

Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University.

Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.

University of Michigan Electrical Engineering and Computer Science Composite Cores: Pushing Heterogeneity into a Core Andrew Lukefahr, Shruti Padmanabha,

Hardware Multithreading. Increasing CPU Performance By increasing clock frequency By increasing Instructions per Clock Minimizing memory access impact.

Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.

Comparing Intel’s Core with AMD's K8 Microarchitecture IS 3313 December 14 th.

SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong.

Pipelining and Parallelism Mark Staveley

Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.

1 Dynamically Heterogeneous Cores Through 3D Resource Pooling Houman Homayoun Vasileios Kontorinis Amirali Shayan Ta-Wei Lin Dean M. Tullsen Speaker: Houman.

1/25 June 28 th, 2006 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control BranchTap Improving Performance With.

Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal

Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations.

CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.

Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.

Spring 2011 Parallel Computer Architecture Lecture 11: Core Fusion and Multithreading Prof. Onur Mutlu Carnegie Mellon University.

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,

Simultaneous Multithreading

Computer Structure Multi-Threading

Lecture: Out-of-order Processors

Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)

/ Computer Architecture and Design

DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores

The University of Texas at Austin

Lecture 11: Out-of-order Processors

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Levels of Parallelism within a Single Processor

Hardware Multithreading

Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)

Lois Orosa, Rodolfo Azevedo and Onur Mutlu

MICRO 2012 MorphCore An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP Khubaib Milad Hashemi Yale N. Patt M. Aater.

Presentation transcript:

MorphCore: An Energy-Efficient Architecture for High-Performance ILP and High-Throughput TLP Khubaib * M. Aater Suleman *+ Milad Hashemi * Chris Wilkerson ‡ Yale N. Patt * * HPS Research Group The University of Texas at Austin + Calxeda Inc. ‡ Intel Labs

The Need for an Adaptive Core Sometimes a single thread with high ILP – Need a heavy-weight out-of-order core – Provides high performance by exploiting ILP Sometimes many threads – Out-of-order is unnecessary – Need a power-efficient core – Provides high performance by exploiting thread-level parallelism We need an adaptive core that can do both – Exploits instruction-level parallelism when needed – Exploits thread-level parallelism when needed 1

Problem Large cores – Good: High single-thread performance – Bad: Inefficient when TLP is available Small cores – Good: High multithreaded performance – Bad: Poor single thread performance 2 Current core architectures do not adapt Large cores limit performance when TLP is high Small cores limit performance when TLP is low

Outline Problem Statement Previous Work – Asymmetric chip multiprocessors – Reconfigurable core architectures MorphCore Evaluation 3

Asymmetric Chip Multiprocessors One or few large out-of-order cores with many small in-order cores [Morad+ CAL’06, Suleman+ TR’07, Hill+ Computer’07, Suleman+ ASPLOS’09] – Limited flexibility Fixed number of large and small cores – Migration overhead Migrate the thread state/data to large core 4

Reconfigurable Core Architectures Fundamental Idea – Build a chip with “simpler cores” and “combine” them at runtime using additional logic to form a high-performance out-of-order core – Core Fusion - Ipek+ ISCA’07, TFlex - Kim+ MICRO’07, Federation Cores - Tarjan+ DAC’08, and many others Fused core has low performance and low energy-efficiency – Increased latencies among its pipeline stages Significant mode switching overhead 5

Outline Problem Statement Previous Work MorphCore – Key Insights and Basic Idea – Design and Operation Evaluation 6

Key Insight 1: The Potential of In-Order SMT With 8 threads, the in-order core’s performance almost matches the out-of-order core’s 7 Black- Scholes Program

Key Insight 2 Minimal changes to a traditional OOO core can transform it into a highly-threaded in-order SMT core 8 Existing structures in an OOO core can be re-used to support highly-threaded in-order SMT execution

MorphCore: Basic Idea 9 A) The base design: OOO core out-of-order core Exploits ILP High single-thread performance InOrder highly-threaded in-order SMT core Exploits TLP High multi-thread performance No OOO execution  Energy savings OutOfOrder Two modes: The opposite of previous proposals: B) Then we add in-order SMT

Outline Problem Statement Previous Work MorphCore – Key Insights and Basic Idea – Design and Operation Evaluation 10

Baseline OOO Pipeline 11 FETCH + DECODE RENAME + Insert in RS SELECT + WAKEUP REG READ EXECOMMIT Branch Pred + I-cache 2-way SMT Alloc ROB STQ RS Free List RS OOO Select + Wakeup Physical Reg File (PRF) Store Buffer D-cache ALUs LDQ/ STQ Lookup ROB Commit Speculative RATs Permanent RATs LDQ

MorphCore Pipeline 12

Permanent RATs MorphCore Pipeline 13 FETCH + DECODE RENAME + Insert in RS SELECT + WAKEUP REG READ EXECOMMIT Branch Pred + I-cache 2-way SMT RS Free List RS OOO Select + Wakeup Physical Reg File (PRF) LDQ/ STQ Lookup ROB Commit Speculative RATs 8-way SMT Alloc ROB STQ LDQ LDQ Alloc RS FIFO Insert In-Order Select + Wakeup STQ Lookup Store Buffer D-cache ALUs LDQ Lookup Delayed write back into PRF Shared OOO Only In-order Only Concatenate TID with Arch RegID

Microarchitecture Summary Use existing structures without modification – Physical Register File (PRF), Decode, Execution pipeline Use existing structures with minor modification – OOO Reservation Stations  InOrder instruction queues – Because of InOrder execution, delayed writeback into PRF (extra bypass) SMT related changes – Front-end (e.g. multiple PCs, branch history regs), changes in resource allocation algorithms In-Order instruction scheduler 14

Overheads Core area increases by 1.5% – Increase in SMT contexts (0.5%) (Note that added contexts are in-order, so no additional rename tables and physical registers) – InOrder Wakeup and Select Logic (0.5%) – Extra bypass (0.5%) Core frequency decreases by 2.5% – Add multiplexers in the critical path of 2 stages Rename and Scheduling 15

Mode Switching Policy Number of active threads ≤ 2 ? OutofOrder when active threads ≤ 2 – MorphCore can support up to 2 OOO threads – TLP is limited so execute OOO to obtain performance InOrder when active threads > 2 – More than 2 threads can only run simultaneously in InOrder mode – TLP is high so high core throughput and energy savings can be obtained by executing threads in-order 16

How Mode Switching Happens? (1) Drains the core pipeline (2) Spills architectural registers of currently active threads to reserved ways in the private 256KB L2 (3) Turns off/on Renaming, OOO Scheduling, Load Queue (4) Fills the architectural registers of next-active threads into PRF (update RATs when going into OutofOrder) Currently an overhead of cycles 17

Outline Problem Statement Previous Work MorphCore Evaluation 18

Methodology Detailed cycle-level x86 simulator McPAT (modified) to calculate energy/area Performance/energy evaluation of MorphCore vs. alternative architectures – Large OOO cores: optimized for single-thread – Medium and Small cores: optimized for multi-thread Workloads – Single-threaded (ST): 14 – SPEC 2006 – Multi-threaded (MT): 14 – Databases, SPLASH, others 19

Evaluated Architectures Core# of cores Freq. (GHz) TypeIssue width SMT threads Per core Total threads Peak throughput ops/cycle ST MT OOO-213.4OOO OOO-41-5%OOO MED3sameOOO SMALL3sameInO MorphCore1-2.5%OOO/ InO 42 OOO/ 8 InO All comparisons on approximately equal area ST : single-threadMT: multi-thread OOO : out-of-order InO : in-order

Evaluated Architectures Core# of cores Freq. (GHz) TypeIssue width SMT threads Per core Total threads Peak throughput ops/cycle ST MT OOO-213.4OOO OOO-41-5%OOO MED3sameOOO SMALL3sameInO MorphCore1-2.5%OOO/ InO 42 OOO/ 8 InO All comparisons on approximately equal area ST : single-threadMT: multi-thread OOO : out-of-order InO : in-order

Evaluated Architectures Core# of cores Freq. (GHz) TypeIssue width SMT threads Per core Total threads Peak throughput ops/cycle ST MT OOO-213.4OOO OOO-41-5%OOO MED33.4OOO SMALL33.4InO MorphCore1-2.5%OOO/ InO 42 OOO/ 8 InO All comparisons on approximately equal area ST : single-threadMT: multi-thread OOO : out-of-order InO : in-order

Evaluated Architectures Core# of cores Freq. (GHz) TypeIssue width SMT threads Per core Total threads Peak throughput ops/cycle ST MT OOO-213.4OOO OOO-41-5%OOO MED33.4OOO SMALL33.4InO MorphCore1-2.5%OOO/ InO 42 OOO/ 8 InO All comparisons on approximately equal area ST : single-threadMT: multi-thread OOO : out-of-order InO : in-order

Evaluated Architectures Core# of cores Freq. (GHz) TypeIssue width SMT threads Per core Total threads Peak throughput (ops/cycle) ST MT OOO-213.4OOO OOO-41-5%OOO MED33.4OOO SMALL33.4InO MorphCore1-2.5%OOO/ InO 42 OOO/ 8 InO All comparisons on approximately equal area ST : single-threadMT: multi-thread OOO : out-of-order InO : in-order

Performance: Single-thread 25 MorphCore: -1.2% MED: -25% SMALL: -59%

Performance: Multi-thread 26 MorphCore: +22% MED: +30% SMALL: +33%

Performance: Both ST and MT 27 MorphCore over OOO-2: +10% over OOO-4: +4% over MED: +11% over SMALL: +49%

Energy 28 For MT workloads, MorphCore is the second-best in energy-efficiency Consumes 9% less energy than OOO-2

Energy-delay-squared (ED 2 ) On average, across all workloads, MorphCore provides the lowest ED 2 22% lower than OOO-2 and 44% lower than SMALL

Summary MorphCore adapts well to both single-thread and multi-thread workloads Requires minimal changes to a traditional OOO core Operates in two modes: – OOO core when TLP is low – Highly-threaded in-order SMT core when TLP is high Significantly outperforms other alternative architectures 30

MorphCore: An Energy-Efficient Architecture for High-Performance ILP and High-Throughput TLP Khubaib * M. Aater Suleman *+ Milad Hashemi * Chris Wilkerson ‡ Yale N. Patt * * HPS Research Group The University of Texas at Austin + Calxeda Inc. ‡ Intel Labs