COMPUTER ARCHITECTURE CS 6354: Asymmetric Multi-Cores
Samira Khan, University of Virginia, Feb 25, 2016
The content and concept of this course are adapted from CMU ECE 740

AGENDA
Logistics
Review from last lecture
– Latency tolerance of the OoO engine
– Multi-core alternatives
Asymmetric Multi-Core

LOGISTICS
Exam I: Mar 17
– Solve the sample questions; the exam will be similar in spirit
– Many more sample questions and solutions are available online
Milestone I Presentations: Mar 22, 24
– Sample milestone presentation slides will be provided
– Follow the format of the project proposal: problem, novelty, key ideas, mechanisms, results so far, plan for the rest of the semester
– Each group will have a 10-minute slot + 3 minutes of Q&A

EFFICIENT SCALING OF INSTRUCTION WINDOW SIZE
One of the major research issues in out-of-order execution: how to achieve the benefits of a large window with a small one (or in a simpler way)?
– Runahead execution? Upon an L2 miss, checkpoint architectural state, speculatively execute only for prefetching, re-execute when the data is ready
– Continual flow pipelines? Upon an L2 miss, deallocate everything dependent on the L2 miss, then reallocate/re-rename and re-execute when the data is ready
– Dual-core execution? One core runs ahead and does not stall on L2 misses; it feeds another core that commits instructions

REVIEW: RUNAHEAD EXECUTION
When the oldest instruction is a long-latency cache miss:
– Checkpoint architectural state and enter runahead mode
In runahead mode:
– Speculatively pre-execute instructions
Runahead mode ends when the original miss returns:
– The checkpoint is restored and normal execution resumes
Advantages, disadvantages?
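To make the checkpoint/enter/restore flow concrete, here is a toy software analogy in C (purely illustrative: runahead is a hardware mechanism, and the Core type and miss conditions below are invented):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of runahead: on a long-latency miss, checkpoint the
     * state, keep fetching speculatively to generate prefetches, and
     * roll back to the checkpoint when the miss returns. */
    typedef struct { int pc; int checkpoint_pc; bool runahead; } Core;

    static bool long_latency_miss(int pc) { return pc % 100 == 0; }  /* stand-in */
    static bool miss_returned(int pc)     { return pc % 100 == 20; } /* stand-in */

    static void step(Core *c) {
        if (!c->runahead && long_latency_miss(c->pc)) {
            c->checkpoint_pc = c->pc;   /* checkpoint architectural state */
            c->runahead = true;         /* enter runahead mode */
        } else if (c->runahead && miss_returned(c->pc)) {
            c->pc = c->checkpoint_pc;   /* restore the checkpoint */
            c->runahead = false;        /* resume normal execution */
        }
        c->pc++;  /* in runahead, nothing commits; execution only warms
                     the caches and trains the prefetchers */
    }

    int main(void) {
        Core c = {1, 0, false};
        for (int i = 0; i < 250; i++) step(&c);
        printf("final pc = %d\n", c.pc);
        return 0;
    }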

REVIEW: MULTI-CORE
Idea: Put multiple processors on the same die.
Technology scaling (Moore's Law) enables more transistors to be placed on the same die area.
What else could you do with the die area you dedicate to multiple processors?
– Have a bigger, more powerful core
– Have larger caches in the memory hierarchy
– Simultaneous multithreading
– Integrate platform components on chip (e.g., network interface, memory controllers)

WHY MULTI-CORE? Other alternatives?
– Dataflow?
– Vector processors (SIMD)?
– Integrating DRAM on chip?
– Reconfigurable logic? (general purpose?)

WITH MULTIPLE CORES ON CHIP
What we want:
– N times the performance with N times the cores when we parallelize an application on N cores
What we get:
– Amdahl's Law (serial bottleneck)
– Bottlenecks in the parallel portion

CAVEATS OF PARALLELISM
Amdahl's Law:
Speedup = 1 / ((1 - f) + f/N)
– f: parallelizable fraction of a program
– N: number of processors
– Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
Maximum speedup is limited by the serial portion: the serial bottleneck.
The parallel portion is usually not perfectly parallel:
– Synchronization overhead (e.g., updates to shared data)
– Load imbalance overhead (imperfect parallelization)
– Resource sharing overhead (contention among N processors)
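To see the serial bottleneck numerically, here is a minimal C sketch (not from the slides) that evaluates the formula above:

    #include <stdio.h>

    /* Amdahl's Law: speedup with parallelizable fraction f on N cores. */
    static double amdahl_speedup(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        const double fs[] = { 0.50, 0.90, 0.99 };
        for (int i = 0; i < 3; i++)
            printf("f = %.2f, N = 16 -> speedup = %.2f\n",
                   fs[i], amdahl_speedup(fs[i], 16));
        return 0;
    }

Even at f = 0.99, 16 cores give only about 13.9x; at f = 0.90 the speedup drops to 6.4x.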

THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely.
Causes of serialized code sections:
– Sequential portions (Amdahl's "serial part")
– Critical sections
– Barriers
– Limiter stages in pipelined programs
Serialized code sections:
– Reduce performance
– Limit scalability
– Waste energy

EXAMPLE FROM MYSQL
Critical section: open the database tables (access to the shared open-tables cache)
Parallel: perform the operations …
[Figure: speedup vs. chip area (cores) for MySQL — today's curve saturates as cores are added; what could it look like instead?]
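In code, this pattern is a standard lock-protected region. Here is a minimal pthreads sketch of the shape (the helper functions are hypothetical stand-ins, not MySQL's actual code):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t open_tables_lock = PTHREAD_MUTEX_INITIALIZER;

    static void open_database_tables(void) { /* stand-in for the serialized work */ }
    static void perform_operations(void)   { /* stand-in for the parallel work */ }

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&open_tables_lock);   /* critical section: one thread at a time */
        open_database_tables();
        pthread_mutex_unlock(&open_tables_lock);
        perform_operations();                    /* runs fully in parallel */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }

As the thread count grows, more time is spent waiting on open_tables_lock, which is exactly the saturation the figure shows.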

Demands in Different Code Sections
What we want:
– In a serialized code section → one powerful "large" core
– In a parallel code section → many wimpy "small" cores
These two demands conflict with each other:
– If you have a single powerful core, you cannot have many cores
– A small core is much more energy and area efficient than a large core

"LARGE" VS. "SMALL" CORES
Large core:
– Out-of-order
– Wide fetch, e.g., 4-wide
– Deeper pipeline
– Aggressive branch predictor (e.g., hybrid)
– Multiple functional units
– Trace cache
– Memory dependence speculation
Small core:
– In-order
– Narrow fetch, e.g., 2-wide
– Shallow pipeline
– Simple branch predictor (e.g., gshare)
– Few functional units
Large cores are power inefficient: e.g., 2x performance for 4x area (power).

LARGE VS. SMALL CORES
Grochowski et al., "Best of Both Latency and Throughput," ICCD 2004.

MEET LARGE: IBM POWER4
Tendler et al., "POWER4 system microarchitecture," IBM J. R&D, 2002.
A symmetric multi-core chip… two powerful cores.

IBM POWER4
– 2 cores, out-of-order execution
– 100-entry instruction window in each core
– 8-wide instruction fetch, issue, execute
– Large, local+global hybrid branch predictor
– 1.5MB, 8-way L2 cache
– Aggressive stream-based prefetching

IBM POWER5
Kalla et al., "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro 2004.

MEET SMALL: SUN NIAGARA (ULTRASPARC T1)
Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro 2005.

NIAGARA CORE
– 4-way fine-grain multithreaded, 6-stage, dual-issue, in-order
– Round-robin thread selection (unless a thread is stalled on a cache miss)
– FP unit shared among the cores

REMEMBER THE DEMANDS
What we want:
– In a serialized code section → one powerful "large" core
– In a parallel code section → many wimpy "small" cores
These two demands conflict with each other:
– If you have a single powerful core, you cannot have many cores
– A small core is much more energy and area efficient than a large core
Can we get the best of both worlds?

PERFORMANCE VS. PARALLELISM
Assumptions:
1. A small core takes an area budget of 1 and has performance of 1
2. A large core takes an area budget of 4 and has performance of 2

TILE-LARGE APPROACH
Tile a few large cores: IBM POWER5, AMD Barcelona, Intel Core2Quad, Intel Nehalem
+ High performance on single-thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)
[Figure: "Tile-Large" — four large cores filling the area budget]

TILE-SMALL APPROACH
Tile many small cores: Sun Niagara, Intel Larrabee, Tilera TILE (tile ultra-small)
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit)
[Figure: "Tile-Small" — sixteen small cores filling the area budget]

CAN WE GET THE BEST OF BOTH WORLDS?
Tile-Large:
+ High performance on single-thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)
Tile-Small:
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit); reduced single-thread performance compared to existing single-thread processors
Idea: Have both large and small cores on the same chip → performance asymmetry

ASYMMETRIC MULTI-CORE

Asymmetric Chip Multiprocessor (ACMP)
Provide one large core and many small cores:
+ Accelerate the serial part using the large core (2 units)
+ Execute the parallel part on the small cores and the large core for high throughput (12 + 2 units)
[Figure: ACMP — one large core plus twelve small cores, shown next to "Tile-Small" and "Tile-Large" for comparison]

ACCELERATING SERIAL BOTTLENECKS
ACMP approach: single thread → large core
[Figure: ACMP with the single thread scheduled on the large core]

PERFORMANCE VS. PARALLELISM
Assumptions:
1. A small core takes an area budget of 1 and has performance of 1
2. A large core takes an area budget of 4 and has performance of 2

ACMP PERFORMANCE VS. PARALLELISM
Area budget = 16 small cores

                         Tile-Large    Tile-Small    ACMP
    Large cores               4             0           1
    Small cores               0            16          12
    Serial performance        2             1           2
    Parallel throughput   2 x 4 = 8    1 x 16 = 16   1x2 + 1x12 = 14

AMDAHL'S LAW MODIFIED
Simplified Amdahl's Law for an asymmetric multiprocessor:
Speedup = 1 / ((1 - f)/X + f/(S + X*L))
Assumptions:
– The serial portion executes on the large core
– The parallel portion executes on both the small cores and the large cores
– f: parallelizable fraction of a program
– L: number of large processors
– S: number of small processors
– X: speedup of a large processor over a small one
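A small C sketch (assuming the reconstruction of the formula above) that plugs in the three configurations from the earlier table, with a 16-small-core area budget and X = 2:

    #include <stdio.h>

    /* Modified Amdahl's Law for an asymmetric multiprocessor: the serial
     * part runs on one large core (X times a small core); the parallel
     * part runs on S small cores plus L large cores. */
    static double acmp_speedup(double f, int s, int l, double x) {
        return 1.0 / ((1.0 - f) / x + f / (s + x * l));
    }

    int main(void) {
        double f = 0.9;
        printf("Tile-Large: %.2f\n", acmp_speedup(f,  0, 4, 2.0)); /* 4 large cores */
        printf("Tile-Small: %.2f\n", acmp_speedup(f, 16, 0, 1.0)); /* serial runs on a small core */
        printf("ACMP:       %.2f\n", acmp_speedup(f, 12, 1, 2.0)); /* 12 small + 1 large */
        return 0;
    }

For f = 0.9 this gives roughly 6.2x for Tile-Large, 6.4x for Tile-Small, and 8.8x for ACMP: the asymmetric design wins on both the serial and the parallel end.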

CAVEATS OF PARALLELISM, REVISITED
Amdahl's Law:
Speedup = 1 / ((1 - f) + f/N)
– f: parallelizable fraction of a program
– N: number of processors
– Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
Maximum speedup is limited by the serial portion: the serial bottleneck.
The parallel portion is usually not perfectly parallel:
– Synchronization overhead (e.g., updates to shared data)
– Load imbalance overhead (imperfect parallelization)
– Resource sharing overhead (contention among N processors)

ACCELERATING PARALLEL BOTTLENECKS
Serialized or imbalanced execution in the parallel portion can also benefit from a large core.
Examples:
– Critical sections that are contended
– Parallel stages that take longer than others to execute
Idea: Dynamically identify the code portions that cause serialization and execute them on a large core.

CONTENTION FOR CRITICAL SECTIONS
[Figure: execution timelines broken into critical section, parallel, and idle time for P = 1, 2, 3, 4 threads; 12 iterations, 33% of instructions inside the critical section.]
With 33% of each iteration inside the critical section, the serialized critical-section work alone takes 4 iterations' worth of time, so speedup saturates near 3x regardless of the thread count.

CONTENTION FOR CRITICAL SECTIONS
[Figure: the same timelines with the critical section accelerated by 2x.]
Accelerating critical sections increases performance and scalability.

IMPACT OF CRITICAL SECTIONS ON SCALABILITY
Contention for critical sections leads to serial execution (serialization) of threads in the parallel program portion.
Contention for critical sections increases with the number of threads and limits scalability.
[Figure: MySQL (oltp-1) speedup vs. chip area (cores) — today's curve saturates, while the asymmetric design keeps scaling.]

A CASE FOR ASYMMETRY
Execution time of sequential kernels, critical sections, and limiter stages must be short.
It is difficult for the programmer to shorten these serialized sections:
– Insufficient domain-specific knowledge
– Variation in hardware platforms
– Limited resources
Goal: A mechanism to shorten serial bottlenecks without requiring programmer effort.
Idea: Accelerate serialized code sections by shipping them to powerful cores in an asymmetric multi-core (ACMP).

AN EXAMPLE: ACCELERATED CRITICAL SECTIONS
Idea: HW/SW ships critical sections to a large, powerful core in an asymmetric multi-core architecture.
Benefits:
– Reduces serialization due to contended locks
– Reduces the performance impact of hard-to-parallelize sections
– The programmer does not need to (heavily) optimize parallel code → fewer bugs, improved productivity
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009 and IEEE Micro Top Picks.
Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010 and IEEE Micro Top Picks.

ACCELERATED CRITICAL SECTIONS
[Figure: small cores P2–P4 and a large core P1 on the on-chip interconnect; the large core holds the Critical Section Request Buffer (CSRB). The small core runs EnterCS(); PriorityQ.insert(…); LeaveCS().]
1. P2 encounters a critical section (CSCALL)
2. P2 sends a CSCALL request to the CSRB
3. P1 executes the critical section
4. P1 sends the CSDONE signal

ACCELERATED CRITICAL SECTIONS (ACS)
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.

Original code:
    A = compute()
    LOCK X
    result = CS(A)
    UNLOCK X
    print result

Small core (sends a CSCALL request carrying X, TPC, STACK_PTR, CORE_ID; the request waits in the Critical Section Request Buffer until the CSDONE response arrives):
    A = compute()
    PUSH A
    CSCALL X, Target PC
    TPC: POP result
    print result

Large core (executes the shipped critical section):
    Acquire X
    POP A
    result = CS(A)
    PUSH result
    Release X
    CSRET X

FALSE SERIALIZATION
ACS can serialize independent critical sections.
Selective Acceleration of Critical Sections (SEL):
– Saturating counters track false serialization
[Figure: the CSRB holding CSCALL(A) and CSCALL(B) entries from the small cores, with per-lock saturating counters deciding which requests go to the large core.]
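As a loose software sketch of the SEL idea (invented details: the table size, counter width, and update policy here are not the paper's exact hardware), per-lock saturating counters could gate acceleration like this:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ENTRIES 16
    #define SAT_MAX     15   /* 4-bit saturating counter */

    static uint8_t sel_counter[NUM_ENTRIES];  /* one counter per hashed lock address */

    /* Called when a CSCALL for `lock` waited in the CSRB behind an
     * independent critical section (false serialization observed). */
    static void sel_observe_false_serialization(unsigned lock) {
        unsigned i = lock % NUM_ENTRIES;
        if (sel_counter[i] < SAT_MAX) sel_counter[i]++;
    }

    /* Called when acceleration completed without false serialization. */
    static void sel_observe_benefit(unsigned lock) {
        unsigned i = lock % NUM_ENTRIES;
        if (sel_counter[i] > 0) sel_counter[i]--;
    }

    /* Ship this critical section to the large core only while the counter
     * suggests acceleration outweighs false serialization. */
    static bool sel_should_accelerate(unsigned lock) {
        return sel_counter[lock % NUM_ENTRIES] < SAT_MAX / 2;
    }

    int main(void) {
        sel_observe_false_serialization(0x10);
        sel_observe_benefit(0x10);
        return sel_should_accelerate(0x10) ? 0 : 1;
    }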

ACS PERFORMANCE TRADEOFFS
Pluses:
+ Faster critical section execution
+ Shared locks stay in one place: better lock locality
+ Shared data stays in the large core's (large) caches: better shared-data locality, less ping-ponging
Minuses:
- A large core is dedicated to critical sections: reduced parallel throughput
- CSCALL and CSDONE control transfer overhead
- Thread-private data needs to be transferred to the large core: worse private-data locality

ACS PERFORMANCE TRADEOFFS
Fewer parallel threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases:
   – The fractional loss in parallel performance decreases
   – Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data

CACHE MISSES FOR PRIVATE DATA
Example from the Puzzle benchmark:
    PriorityHeap.insert(NewSubProblems)
– Private data: NewSubProblems
– Shared data: the priority heap
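A minimal C sketch of that pattern (hypothetical types and sizes; a real priority heap would also sift the new element into place):

    #include <pthread.h>

    #define MAX_HEAP 1024
    typedef struct { int prio; int data; } SubProblem;

    /* Shared data: the priority heap. Under ACS it stays hot in the
     * large core's caches. */
    static SubProblem heap[MAX_HEAP];
    static int heap_size;
    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread generates private sub-problems and inserts them into
     * the shared heap inside the critical section. Under ACS the private
     * `sp` must migrate to the large core; the heap itself does not move. */
    static void insert_subproblem(SubProblem sp) {
        pthread_mutex_lock(&heap_lock);
        if (heap_size < MAX_HEAP)
            heap[heap_size++] = sp;   /* simplified: no sift-up shown */
        pthread_mutex_unlock(&heap_lock);
    }

    int main(void) {
        SubProblem sp = { 1, 42 };
        insert_subproblem(sp);
        return 0;
    }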

ACS PERFORMANCE TRADEOFFS
Fewer parallel threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases:
   – The fractional loss in parallel performance decreases
   – Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data:
– Overall cache misses decrease when there is more shared data than private data
– This problem can be solved
