Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, Yale N. Patt
ASPLOS 2009. Presented by Lara Lazier

Outline Introduction and Background Proposed Solution - ACS Methodology and Results Strengths Weaknesses Takeaways Questions and Discussion

Executive Summary PROBLEM: Critical sections limit both performance and scalability. SOLUTION: Accelerating Critical Sections (ACS) improves performance by moving the computation of critical sections to a larger, faster core; it is the first approach to accelerate critical sections in hardware. RESULTS: ACS improves performance on average by 34% compared to a symmetric CMP and by 23% compared to an asymmetric CMP, and it improves the scalability of 7 out of 12 workloads.

Introduction and Background

Background Large single-core processors are complex and have high power consumption. Chip multiprocessors (CMPs) are less complex and consume less power. [Figure: a Symmetric CMP, made of many identical small cores, vs. an Asymmetric CMP, which replaces some small cores with one large core.] To extract high performance from a CMP, programs must be split into threads.

Background - Threads and Critical Sections Threads operate on different portions of the same problem. Threads are not allowed to update shared data at the same time → MUTUAL EXCLUSION. A critical section is a portion of a program that only one thread can execute at a given time. Critical sections limit both performance and scalability (Amdahl's Law).
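
To make this concrete, here is a minimal C/pthreads sketch (my illustration, not from the paper): the mutex enforces mutual exclusion, so the increment is serialized no matter how many cores run workers. By Amdahl's Law, if a fraction p of the execution is parallel, speedup on N cores is at most 1 / ((1 - p) + p / N); time spent serialized inside critical sections shrinks the effective p.

#include <pthread.h>
#include <stdio.h>

/* Shared counter protected by a lock: only one thread at a time may
 * execute the critical section, so the update is serialized no matter
 * how many cores are available. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        shared_counter++;             /* update shared data */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", shared_counter);  /* always 2000000 */
    return 0;
}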

Problem Overview and Goal [Figure: as the core count grows from 2 (Core1, Core2) to 4 (Core1..Core4), more threads wait on the same critical section.] Contention increases when the number of cores increases. The goal is to accelerate the execution of critical sections.

Key Insight Accelerating critical sections can provide significant performance improvement. An asymmetric CMP can already accelerate the serial part of a program using the large core; moving the computation of critical sections to the large core(s) can likewise improve performance.

Accelerating Critical Sections (ACS)

Overview ACS runs on a homogeneous-ISA asymmetric CMP with cache coherence: one large core (P0) and twelve small cores (P1..P12). [Figure: example code running on the CMP: A = Compute(); Lock X; result = CS(A); Unlock X; return result.]

Implementation [Figure: a small core (P1) computes A = Compute(); instead of executing Lock X; r = CS(A); Unlock X itself, it pushes A onto its stack and sends a CSCALL, carrying LOCK_ADDR, TARGET_PC, STACK_PTR, and CORE_ID, to the large core (P0). P0 enqueues the request in the CSRB (Critical Section Request Buffer), pops A, executes r = CS(A), pushes r, and replies with CSDONE; P1 then pops r and continues.] The lock and the shared data never need to move: ACS reduces the number of L2 cache misses inside critical sections by 20%.
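
As a thought experiment, the same shipping pattern can be modeled in software. The sketch below is my illustration only (names and the buffer discipline are assumptions; real ACS does all of this in hardware with CSCALL/CSDONE messages and stalls the small core's pipeline): worker threads play the small cores, one dedicated thread plays the large core, and a shared queue plays the CSRB.

#include <pthread.h>
#include <stdio.h>

typedef struct {
    int arg;           /* argument A, "pushed on the stack" by the caller */
    int result;        /* return value, "popped" after CSDONE             */
    int done;          /* CSDONE flag                                     */
} cs_request;

#define CSRB_SIZE 16
static cs_request *csrb[CSRB_SIZE];          /* the request buffer */
static unsigned head = 0, tail = 0;
static pthread_mutex_t csrb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t csrb_nonempty = PTHREAD_COND_INITIALIZER;

/* "CSCALL": enqueue a request and stall until CSDONE.
 * (The sketch assumes at most CSRB_SIZE outstanding requests.) */
static int cscall(int a) {
    cs_request req = { .arg = a, .result = 0, .done = 0 };
    pthread_mutex_lock(&csrb_lock);
    csrb[tail++ % CSRB_SIZE] = &req;
    pthread_cond_signal(&csrb_nonempty);
    pthread_mutex_unlock(&csrb_lock);
    while (!__atomic_load_n(&req.done, __ATOMIC_ACQUIRE))
        ;                                    /* "small core" stalls */
    return req.result;
}

/* The "large core": dequeues requests and runs the outlined CS body. */
static int shared_data = 0;
static void *large_core(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&csrb_lock);
        while (head == tail)
            pthread_cond_wait(&csrb_nonempty, &csrb_lock);
        cs_request *req = csrb[head++ % CSRB_SIZE];
        pthread_mutex_unlock(&csrb_lock);
        shared_data += req->arg;             /* the critical section */
        req->result = shared_data;
        __atomic_store_n(&req->done, 1, __ATOMIC_RELEASE); /* CSDONE */
    }
    return NULL;
}

static void *small_core(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++)
        cscall(1);                           /* ship CS to large core */
    return NULL;
}

int main(void) {
    pthread_t big, t1, t2;
    pthread_create(&big, NULL, large_core, NULL);
    pthread_create(&t1, NULL, small_core, NULL);
    pthread_create(&t2, NULL, small_core, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_data = %d\n", shared_data); /* 2000 */
    return 0;
}

Note how shared_data is only ever touched by the large-core thread: that locality is the intuition behind the 20% L2-miss reduction mentioned above.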

Implementation: Required Support ISA support: new CSCALL and CSRET instructions. Compiler support: insert CSCALL and CSRET and remove any register dependencies via function outlining (25 bytes of code overhead). Small-core modifications: support for executing instructions remotely. Large-core modifications: the CSRB (Critical Section Request Buffer). Interconnect extensions: CSCALL and CSDONE messages. OS support: allocate the large core to a single application and set the same program context for all cores.
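
A hedged sketch of what function outlining means here (my example, not the paper's compiler output): the critical-section body becomes a standalone function with no register dependencies on its caller, so the large core can execute it given only the TARGET_PC and STACK_PTR shipped in the CSCALL.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t X = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

/* Before: the critical section is inlined in the caller. */
int update_inline(int a) {
    pthread_mutex_lock(&X);
    int r = shared + a;           /* CS body reads and writes shared state */
    shared = r;
    pthread_mutex_unlock(&X);
    return r;
}

/* After outlining (sketch): the CS body is a standalone function with no
 * register dependencies on its caller. In ACS, the CSCALL's TARGET_PC
 * would point here and STACK_PTR would locate the argument and result. */
int cs_outlined(int a) {
    int r = shared + a;
    shared = r;
    return r;
}

int main(void) {
    update_inline(1);                 /* shared becomes 1 */
    printf("%d\n", cs_outlined(2));   /* prints 3 */
    return 0;
}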

False Serialization [Figure: execution timelines. Without ACS, cores P1..P4 execute critical sections protected by independent locks X, Y, and Z in parallel. With ACS, all of their CSCALLs queue up at the single large core and execute one after another.] Critical sections protected by different locks could run concurrently, but ACS forces them to serialize at the large core: this is false serialization.

SEL How do we solve false serialization? SEL (SELective Acceleration of Critical Sections) estimates the occurrence of false serialization and adaptively decides whether or not to execute each critical section on the large core. Storage overhead: 36 bytes in total (16 saturating counters of 6 bits at the large core, and 16 ACS_DISABLE bits at each of the 12 small cores; lock addresses are hashed into this small number of sets). To implement SEL we need: on each small core, a bit vector containing the ACS_DISABLE bits (0 means low false serialization); on the large core, logic to estimate false serialization, namely a table of saturating counters, one per critical section tracked in the CSRB. When ACS_DISABLE_i = 0 for critical section i, the small core sends a CSCALL to the large core. Counter update: if there are at least 2 critical sections in the CSRB, add the number of critical sections (+#CS); if there is exactly 1, subtract 1. When a counter saturates, the corresponding ACS_DISABLE bit is set and that critical section executes locally. The ACS_DISABLE bits are reset and the saturating counters halved every 10 million cycles. (Speaker notes: the idea is to count how many distinct critical sections are in the CSRB at the same time. Nested critical sections never issue a CSCALL, otherwise there would be a risk of deadlock; this needs compiler support plus the logic above. More large cores would make this better.)
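
A minimal C sketch of this decision logic, under my own naming and hashing assumptions (the paper specifies the counters and bits, not this exact code):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS 16
#define CTR_MAX  63            /* 6-bit saturating counter */

static uint8_t sel_ctr[NUM_SETS];      /* at the large core            */
static bool acs_disable[NUM_SETS];     /* mirrored at each small core  */

static unsigned cs_set(uintptr_t lock_addr) {
    return (unsigned)(lock_addr >> 4) % NUM_SETS;  /* simple hash */
}

/* Called when a CSCALL enters the CSRB; n_waiting counts the distinct
 * critical sections currently in the CSRB. */
void sel_on_cscall(uintptr_t lock_addr, unsigned n_waiting) {
    unsigned s = cs_set(lock_addr);
    if (n_waiting >= 2) {
        /* false serialization likely: add the number of CSs (+#CS) */
        unsigned v = sel_ctr[s] + n_waiting;
        sel_ctr[s] = v > CTR_MAX ? CTR_MAX : (uint8_t)v;
    } else if (sel_ctr[s] > 0) {
        sel_ctr[s]--;          /* only one CS waiting: back off */
    }
    if (sel_ctr[s] == CTR_MAX)
        acs_disable[s] = true; /* run this CS locally from now on */
}

/* Every 10 million cycles: re-enable everything, decay the counters. */
void sel_periodic_reset(void) {
    for (unsigned s = 0; s < NUM_SETS; s++) {
        acs_disable[s] = false;
        sel_ctr[s] /= 2;
    }
}

/* Small-core side: send a CSCALL only if acceleration is enabled. */
bool should_cscall(uintptr_t lock_addr) {
    return !acs_disable[cs_set(lock_addr)];
}

int main(void) {
    /* Simulate heavy false serialization on one lock until disabled. */
    for (int i = 0; i < 100; i++)
        sel_on_cscall(0x1000, 3);      /* 3 distinct CSs in the CSRB */
    printf("CSCALL allowed? %d\n", should_cscall(0x1000)); /* 0 */
    sel_periodic_reset();
    printf("after reset:    %d\n", should_cscall(0x1000)); /* 1 */
    return 0;
}

Hashing many locks into 16 sets keeps the storage at 36 bytes, at the cost of occasional aliasing between unrelated critical sections.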

Performance Trade-offs Faster critical sections vs. fewer threads: dedicating the large core to critical sections reduces parallel throughput, but as the number of cores increases the loss of throughput shrinks while contention grows, so ACS benefits more. CSCALL/CSDONE signals vs. lock acquire/release: the communication over the on-chip interconnect is an overhead, but ACS keeps the lock at the large core, so locks stay in one place (better lock locality) and cache misses on locks are reduced. Cache misses due to private data vs. cache misses due to shared data: thread-private data must be transferred to the large core (worse private-data locality), but ACS eliminates transfers of shared data by keeping it at the large core (better shared-data locality and faster critical-section execution).

Results

Methodology CMPs are simulated using a cycle-accurate x86 simulator. The large core occupies the same area as 4 small cores and is modeled after the Intel Pentium-M; the small cores are modeled after the Intel Pentium processor.

Workloads The workloads are evaluated on: a symmetric CMP (SCMP); an asymmetric CMP (ACMP) with one large core with 2-way SMT; and an asymmetric CMP with ACS.

Results ACS reduces the average execution time by 34% compared to an equal-area 32-core SCMP and by 23% compared to an equal-area ACMP. ACS improves the scalability of 7 of the 12 workloads.

Coarse-Grained vs. Fine-Grained Results [Figure: results across area budgets of 8, 16, and 32.]

SEL On average, across all 12 workloads, ACS with SEL outperforms ACS without SEL by 15%.

Summary

Summary PROBLEM: Critical sections limit both performance and scalability. SOLUTION: Accelerating Critical Sections (ACS) improves performance by moving the computation of critical sections to a larger, faster core; it is the first approach to accelerate critical sections in hardware. RESULTS: ACS improves performance on average by 34% compared to a symmetric CMP and by 23% compared to an asymmetric CMP, and it improves the scalability of 7 out of 12 workloads.

Strengths

Strengths Novel, intuitive idea: the first approach to accelerate critical sections directly in hardware. The results will only become more relevant as core counts (and thus contention) grow. Low hardware overhead. The paper analyzes all the relevant trade-offs very thoroughly. The figures complement the explanations very well.

Weaknesses

Weaknesses ACS only accelerates critical sections. SEL might overcomplicate the problem; there might be simpler schemes that do not need additional hardware. The area budget required to outperform both SCMP and ACMP makes ACS less attractive for everyday use. Costly to implement: changes to the ISA, compiler, interconnect...

Thoughts and Ideas

Thoughts and Ideas How would it work with more than one large core? How could we also accelerate other bottlenecks, such as barriers and slow pipeline stages? Jose A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt, "Bottleneck Identification and Scheduling in Multithreaded Applications" (ASPLOS '12). Improving locality in staged execution: M. Aater Suleman, Onur Mutlu, Jose A. Joao, Khubaib, and Yale N. Patt, "Data Marshaling for Multi-core Architectures" (ISCA '10). Accelerating more (BIS with lagging threads): Jose A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt, "Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs" (ISCA '13).

Takeaways

Key Takeaways The idea of moving specialized sections of computation to a different "core" (an accelerator, a GPU, ...) has a lot of potential. ACS is a novel way to accelerate critical sections in hardware. The key idea is very intuitive and easy to understand. Software is not the only solution.

Questions? (Backup answers: Multiple large cores are out of the scope of this paper. Multiple parallel applications: SMT or time-share the large core; ACS can be enabled or disabled per application. Handling interrupts and exceptions: outside a critical section they are handled as in the baseline; inside one, the large core disables ACS for all critical sections and pushes the CSRB contents onto the stack. ACS on a symmetric CMP (SymACS) gains only 4%.)

Discussion Starters Do you think the trend of specializing hardware will keep increasing in the future? What other things could be done? Do you think this could create new security threats? Can you imagine a way modularity could increase security? Could ACS be combined with MorphCore? (Speaker notes: The paper moves special parts of the computation to a "special" core, similar to what is done today with GPUs or other accelerators, and the trend toward modularity is increasing.)


More Results "An Asymmetric Multi-core Architecture for Accelerating Critical Sections", HPS Technical Report TR-HPS-2008-003, September 2008, contains further results, for example with and without a prefetcher.

Backup Slides

Hardware specialization? Adi Fuchs, David Wentzlaff, “Scaling Datacenter Accelerators With Compute-Reuse Architectures” (ISCA ‘18)

Coarse-Grained Workloads For an area budget of 8: ACS improves performance by 22% compared to SCMP and by 11% compared to ACMP. For an area budget of 16: by 32% compared to SCMP and by 22% compared to ACMP. For an area budget of 32: by 42% compared to SCMP and by 31% compared to ACMP. (Speaker notes: ACMP outperforms SCMP when the serial part is large. At the smallest budget, ACS gives up the large core's 2 SMT contexts, so it can only execute 4 parallel threads. Workloads with high contention and more shared than private data improve a lot; qsort and tsp have more private data, which reduces cache locality under ACS, contention is low, and these workloads can use more parallel throughput. With a larger area budget, the overhead of ACS decreases and contention increases, so ACS helps more.)

Fine-Grained Workloads The area budget required to outperform SCMP or ACMP on all workloads is at most 24 cores. For an area budget of 8: ACS increases execution time for all workloads except iplookup compared to SCMP, and for all workloads compared to ACMP. For an area budget of 16: ACS improves performance by 2% compared to SCMP and by 6% compared to ACMP. For an area budget of 32: by 17% compared to SCMP and by 6% compared to ACMP. (Speaker note: fine-grained locking reduces contention.)

Scalability ACS improves the scalability of 7 out of 12 applications (scalability is assessed by running each application at every thread count from 1 to 32). ACS scales better because it decreases contention. (Speaker note: on benchmarks that are not critical-section-intensive, from the NAS (NASA) and SPLASH suites, ACS with an area budget of 32 performs 1% better than ACMP and 2% worse than SCMP, approaching SCMP as the area budget grows.)