18-742 Spring 2011 Parallel Computer Architecture Lecture 11: Core Fusion and Multithreading Prof. Onur Mutlu Carnegie Mellon University.


Announcements
No class Monday (Feb 14)
Interconnection Networks lectures on Wed and Fri (Feb 16, 18)

Reviews
Due Today (Feb 11) midnight
 Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA.
Due Tuesday (Feb 15) midnight
 Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA.
 Dally, “Route packets, not wires: on-chip interconnection networks,” DAC.
 Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA.

Last Lecture
Speculative Lock Elision (SLE)
SLE vs. Accelerated Critical Sections (ACS)
Data Marshaling

Today
Dynamic Core Combining (Core Fusion)
Maybe start multithreading

How to Build a Dynamic ACMP
Frequency boosting
DVFS
Core combining: Core Fusion
 Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA.
 Idea: Dynamically fuse multiple small cores to form a single large core
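As general CMOS background (not a result from the slide), the reason frequency boosting via DVFS is energy-costly can be seen from the standard dynamic-power relation: supply voltage must rise roughly linearly with frequency in the normal operating region, so dynamic power grows roughly cubically with frequency:

```latex
P_{\text{dyn}} = \alpha\, C\, V_{dd}^{2}\, f,
\qquad f \propto V_{dd}
\;\Rightarrow\; P_{\text{dyn}} \propto f^{3}
```

Here $\alpha$ is the activity factor and $C$ the switched capacitance. This is why combining cores (Core Fusion) is an attractive alternative to simply boosting one core's frequency.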

Core Fusion: Motivation
Programs are incrementally parallelized in stages
Each parallelization stage is best executed on a different “type” of multi-core

Core Fusion Idea
Combine multiple simple cores dynamically to form a larger, more powerful core

Core Fusion Microarchitecture
Concept: Add enveloping hardware to make cores combinable

How to Make Multiple Cores Operate Collectively as a Single Core
Reconfigurable I-cache

Collective Fetch
Each core fetches two instructions from its own i-cache
 Misaligned fetch targets re-align in one cycle
Fetch Management Unit (FMU) controls redirection
 Cores process branches locally and communicate predictions to the FMU
 The FMU communicates outcomes and GHR updates back to the cores
Two-cycle interconnect
 Two-cycle bubble per taken branch (+1 if the target is misaligned)
Core “zero” provides the RAS
 A return encountered on another core gets its prediction from Core 0’s RAS
The FMU updates the i-TLBs on a miss

Branching in Fused Mode
[Slide diagrams: each core's BTB and branch predictor, a global GHR, and the RAS provided by Core 0]

Centralized Renaming and Steering
Centralized structure: Cores send predecoded information to the Steering Management Unit (SMU)
The SMU steers and dispatches regular and copy instructions
 At most two regular + two copy instructions per core per cycle
Eight extra pipeline stages (only in fused mode)

Operand Communication via Copy Instructions
[Slide diagram: per-core copy-out and copy-in structures around issue, with Out/In ports between cores]

Collective Commit (No Blocking Case)
[Slide diagram: instructions i0–i7 distributed across the cores' ROB banks, with both a pre-commit ROB head and a conventional ROB head pointer]

Collective Commit (Blocked Case)
[Slide diagram: same ROB banks as the previous slide, with the pre-commit ROB head and conventional ROB head pointers in the blocked case]

Collective Load/Store Queue
LD/ST instructions are bank-assigned to cores based on their effective addresses
 Distributed disambiguation
PC-based steering prediction of which bank the ld/st should access
 Re-steer on a misprediction
Core-fusion-aware indexing: the effective address is split into Tag | Index | Bank ID
 Full utilization in both fused and split modes
 Cache coherence avoids flushing or shuffling

Dynamic Reconfiguration
Run-time control of granularity
 Serial vs. parallel sections
 Variable granularity within parallel sections
Mechanism: Fusion and fission instructions in the ISA
 Typically encapsulated in macros or directives (e.g., OpenMP sections)
 Can be safely ignored (single execution model)
Reconfiguration actions
 Flush pipelines and i-caches
 Reconfigure i-cache tags
 Transfer architectural state as needed

Core Fusion Evaluation
Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA.

Core Fusion Evaluation

Single Thread Performance

Parallel Application Performance

Core Fusion vs. Tile-Small Symmetric CMP
Core Fusion advantages
+ Better single-thread performance when needed
Core Fusion disadvantages
- Possibly lower parallel throughput (area spent on glue logic)
- Reconfiguration overhead to dynamically create a large core (can reduce performance)
- More complex design: glue logic between cores, reconfigurable I-cache

Core Fusion vs. Asymmetric CMP
Core Fusion advantages
+ Cores are not fixed at design time: more adaptive
+ Possibly better parallel throughput (assuming the ACMP does not use SMT)
+ Potentially higher-frequency design: all cores are the same
Core Fusion disadvantages
- Reconfiguration overhead to dynamically create a large core (can reduce performance)
- Single-thread performance on the fused core is less than that of a large core statically optimized for single-thread execution
- Additional pipeline stages, fine-grained operand communication between cores, collective operation constraints
- Potentially complex design: glue logic between cores, reconfigurable I-cache

Core Fusion vs. Clustered Superscalar/OoO
Core fusion: build small cores, add glue logic to combine them dynamically
Clustered superscalar/OoO: build a large superscalar core in a scalable fashion (clustered scheduling windows, register files, and execution units)
 Can use SMT to execute multiple threads in different clusters
Both require:
 Steering instructions to different clusters (cores)
 Operand communication between clusters
 Memory disambiguation

Core Fusion vs. Clustered Superscalar/OoO
Some core fusion advantages
+ No resource contention between threads in non-fused mode
+ No need to build a wide fetch engine
Some disadvantages
- Single-thread performance can be lower due to additional communication latencies in fetch and commit
- The I-cache is not shared between threads in non-fused mode

Review: Performance Asymmetry
What to do with it?
 Improve serial performance (accelerate the sequential bottleneck)
 Reduce energy consumption: adapt to phase behavior
 Optimize energy-delay: adapt to phase behavior
 Improve parallel performance (accelerate critical sections)
How to build it?
 Static: multiple different core microarchitectures or frequencies
 Dynamic: combine cores, adapt frequency

Research in Asymmetric Multi-Core
How to design asymmetric cores
 Static
 Dynamic: Can you fuse in-order cores easily to build an OoO core?
How to divide the program to best take advantage of asymmetry?
 Explicit vs. transparent
How to match arbitrary program phases to the best-fitting core?
 Staged execution models
How to minimize code/data migration overhead?
How to satisfy the shared resource requirements of different cores?

Multithreading

Readings: Multithreading
Required
 Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session.
 Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro.
 Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA.
 Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA.
Recommended
 Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992.
 Smith, “A pipelined, shared resource MIMD computer,” ICPP.
 Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO.
 Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA.

Multithreading (Outline)
Multiple hardware contexts
Purpose
Initial incarnations
 CDC 6600
 HEP
 Tera
Levels of multithreading
 Fine-grained (cycle-by-cycle)
 Coarse-grained (multitasking): switch-on-event
 Simultaneous
Uses: traditional + creative (now that we have multiple contexts, why do we not do …)