
Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs José A. Joao * M. Aater Suleman * Onur Mutlu ‡ Yale N. Patt * * HPS Research Group University of Texas at Austin ‡ Computer Architecture Laboratory Carnegie Mellon University

Asymmetric CMP (ACMP)
- One or a few large, out-of-order cores: fast
- Many small, in-order cores: power-efficient
- Critical code segments run on large cores
- The rest of the code runs on small cores
[Figure: an ACMP with one large core and many small cores]

Bottlenecks
[Figure: threads T0–T3 executing between Barrier 1 and Barrier 2, before and after bottleneck acceleration]
- Accelerating Critical Sections (ACS), Suleman et al., ASPLOS'09
- Bottleneck Identification and Scheduling (BIS), Joao et al., ASPLOS'12

Lagging Threads
[Figure: threads T0–T3 between Barrier 1 and Barrier 2, with per-thread progress P0–P3; T2 is the lagging thread, and accelerating it reduces execution time]
Lagging thread = potential future bottleneck
Previous work on the progress of multithreaded applications:
- Meeting points, Cai et al., PACT'08
- Thread criticality predictors, Bhattacharjee and Martonosi, ISCA'09
- Age-based scheduling (AGETS), Lakshminarayana et al., SC'09

Two problems
1) Do we accelerate bottlenecks or lagging threads?
2) With multiple applications, which application do we accelerate?
[Figure: two applications, each with threads T0–T3, competing for acceleration]
Acceleration decisions need to consider both:
- the criticality of code segments
- how much speedup lagging threads and bottlenecks get from acceleration

Utility-Based Acceleration (UBA)
Goal: identify performance-limiting bottlenecks or lagging threads from any running application and accelerate them on the large cores of an ACMP
Key insight: a Utility of Acceleration metric that combines speedup and criticality of each code segment
Utility of accelerating code segment c of length t on an application of length T:
U = ΔT / T = (Δt / t) × (t / T) × (ΔT / Δt) = L × R × G
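As an illustrative sketch (Python, with hypothetical cycle counts, not measurements from the talk), the metric multiplies the three factors; because they telescope, U reduces to the fraction of total application time saved:

```python
# Sketch of the Utility of Acceleration metric, U = L * R * G.
# The decomposition telescopes: (dt/t) * (t/T) * (dT/dt) = dT/T.

def utility(dt, t, dT, T):
    L = dt / t    # local acceleration: fraction of the segment's time saved
    R = t / T     # relevance: fraction of application time spent in the segment
    G = dT / dt   # global effect: app-level saving per segment-level saving
    return L * R * G

# A 40-cycle segment saves 10 cycles on the large core; with G = 1 the whole
# saving shows up in a 400-cycle application: U = 10/400 = 0.025.
u = utility(dt=10, t=40, dT=10, T=400)
```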

L: Local acceleration of c
How much code segment c is accelerated
S: speedup of c. Estimate S = estimate the performance of c on a large core while it runs on a small core
Performance Impact Estimation (PIE, Van Craeynest et al., ISCA'12): considers both instruction-level parallelism (ILP) and memory-level parallelism (MLP) to estimate CPI
[Figure: c takes time t running on a small core and less on a large core; Δt is the time saved]
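Given a PIE-style speedup estimate S, the local factor follows directly. A minimal sketch of the conversion only (not of PIE itself, which needs ILP/MLP hardware counters), assuming L is the fraction of the segment's time saved, Δt/t:

```python
def local_acceleration(S):
    """L = dt/t. With t_large = t/S, dt = t - t/S, so L = 1 - 1/S."""
    return 1.0 - 1.0 / S

# A segment estimated to run 2x faster on the large core saves half its time.
L = local_acceleration(2.0)
```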

R: Relevance of code segment c
How relevant code segment c is for the application
Q: scheduling quantum
[Figure: R is measured from the time spent in c during the last scheduling quantum Q, as a proxy for c's share of the whole application of length T]
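A sketch of the relevance estimate, assuming (as the slide's figure suggests) that R is measured over the last scheduling quantum as a stand-in for the whole-run ratio t/T:

```python
def relevance(t_in_c_last_quantum, quantum):
    """R ~ time the application spent in segment c during the last quantum Q,
    divided by Q; recent behavior stands in for the whole-run ratio t/T."""
    return t_in_c_last_quantum / quantum

# A segment active for 2,500 of the last 10,000 cycles has R = 0.25.
R = relevance(2_500, 10_000)
```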

G: Global effect of accelerating c
How much accelerating c reduces total execution time
Criticality of c: acceleration of c ⇒ acceleration of the application
[Figure: single-thread case; saving Δt on c shortens the application by ΔT = Δt, so G = 1]

G: Global effect of accelerating c
Critical sections: classify into strongly-contended and weakly-contended and estimate G differently (in the paper)
Criticality of c: acceleration of c ⇒ acceleration of the application
[Figure: barrier cases with threads T1–T3; accelerating a thread the others do not wait for (they would idle anyway) gives G = 0; spending 2 quanta to get the benefit of 1 gives G = 1/2]
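The barrier cases above can be sketched with a toy model (hypothetical per-thread remaining times to the barrier; the paper's strongly/weakly-contended critical-section estimates are not modeled here):

```python
def global_effect(remaining, i, dt):
    """G = dT/dt under a simple barrier model: the phase ends when the slowest
    thread arrives. `remaining` holds each thread's time to the barrier;
    accelerating thread i saves it dt cycles."""
    after = list(remaining)
    after[i] -= dt
    return (max(remaining) - max(after)) / dt

# Accelerating the critical (slowest) thread helps fully: G = 1.
g1 = global_effect([10, 4, 5], i=0, dt=3)
# Accelerating a thread the others do not wait for helps not at all: G = 0.
g2 = global_effect([10, 4, 5], i=1, dt=3)
# The saving only partially shortens the phase: G = 1/2.
g3 = global_effect([10, 8], i=0, dt=4)
```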

Utility-Based Acceleration (UBA)
[Diagram: Lagging Thread Identification produces the Set of Highest-Utility Lagging Threads; Bottleneck Identification produces the Set of Highest-Utility Bottlenecks; Acceleration Coordination uses both for large core control]

Lagging thread identification
Lagging threads are those that are making the least progress
How to define and measure progress? An application-specific problem
 We borrow from Age-Based Scheduling (SC'09): progress metric = committed instructions
 Assumption: the same number of instructions is committed between barriers
 But any other progress metric could easily be used
Minimum progress = minP
Set of lagging threads = { any thread with progress < minP + ∆P }
Compute Utility for each lagging thread
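The selection rule in the last two lines can be sketched directly (committed instructions as the progress metric, per the slide; the thread names and numbers are hypothetical):

```python
def lagging_threads(progress, delta_p):
    """progress: committed instructions per thread id since the last barrier.
    Returns every thread within delta_p of the minimum progress (minP)."""
    min_p = min(progress.values())
    return {tid for tid, p in progress.items() if p < min_p + delta_p}

# T2 is far behind the others; with delta_p = 100 only it qualifies as lagging.
lag = lagging_threads({"T0": 1000, "T1": 980, "T2": 700, "T3": 990}, delta_p=100)
```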

Utility-Based Acceleration (UBA)
[Diagram repeated: Lagging Thread Identification, Bottleneck Identification, Acceleration Coordination; one lagging thread is selected per large core]

Bottleneck identification
Software (programmer, compiler, or library):
 Delimit potential bottlenecks with BottleneckCall and BottleneckReturn instructions
 Replace code that waits with a BottleneckWait instruction
Hardware (Bottleneck Table):
 Keep track of threads executing or waiting for bottlenecks
 Compute Utility for each bottleneck
 Determine the set of Highest-Utility Bottlenecks
Similar to our previous work BIS (ASPLOS'12), which uses thread waiting cycles instead of Utility
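A hedged software model of the Bottleneck Table (the real structure is hardware; here a caller-supplied utility function stands in for the hardware's Utility computation, and a BIS-style waiter count is used in the example):

```python
from collections import defaultdict

class BottleneckTable:
    """Toy model: tracks which threads execute or wait on each bottleneck
    (identified e.g. by a lock address), mirroring the three instructions."""

    def __init__(self):
        self.executing = {}              # bottleneck id -> executing thread
        self.waiters = defaultdict(set)  # bottleneck id -> waiting threads

    def call(self, bid, tid):            # BottleneckCall: thread enters c
        self.executing[bid] = tid

    def wait(self, bid, tid):            # BottleneckWait: thread stalls on c
        self.waiters[bid].add(tid)

    def ret(self, bid):                  # BottleneckReturn: thread leaves c
        self.executing.pop(bid, None)
        self.waiters[bid].clear()

    def highest_utility(self, utility_of):
        """Return the tracked bottleneck with the highest estimated Utility."""
        bids = set(self.executing) | set(self.waiters)
        return max(bids, key=utility_of, default=None)

# Two threads pile up on B1 while one waits on B2; with waiter count as a
# stand-in for Utility, B1 is selected for acceleration.
table = BottleneckTable()
table.call("B1", "T0"); table.wait("B1", "T1"); table.wait("B1", "T2")
table.call("B2", "T3"); table.wait("B2", "T4")
top = table.highest_utility(lambda b: len(table.waiters[b]))
```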

Acceleration coordination
[Figure: large cores and small cores; lagging threads LT1–LT4 with U_LT1 > U_LT2 > U_LT3 > U_LT4; an LT is assigned to each large core every quantum; bottlenecks B1 and B2, both with utility above the Bottleneck Acceleration Utility Threshold (BAUT), arrive: B1 preempts lagging thread LT3, B2 is enqueued in the Scheduling Buffer; once no more bottlenecks remain, LT3 returns to its large core]
LT: lagging thread, B: bottleneck, U: utility
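One quantum of this coordination, as a simplified sketch (the LT*/B* names and numbers are hypothetical; only lagging threads can be preempted, and excess bottlenecks queue in the scheduling buffer):

```python
def schedule_quantum(lagging, bottlenecks, n_large, baut):
    """lagging, bottlenecks: lists of (name, utility). Returns what runs on
    the large cores this quantum and what waits in the scheduling buffer."""
    # Highest-utility lagging threads get the large cores first.
    running = [("LT", n, u) for n, u in
               sorted(lagging, key=lambda x: x[1], reverse=True)[:n_large]]
    buffer_ = []
    for name, u in sorted(bottlenecks, key=lambda x: x[1], reverse=True):
        if u <= baut:
            continue                       # below the BAUT threshold: ignore
        lts = [r for r in running if r[0] == "LT"]
        if lts:                            # preempt the lowest-utility LT
            victim = min(lts, key=lambda r: r[2])
            running[running.index(victim)] = ("B", name, u)
        else:
            buffer_.append((name, u))      # all cores busy: scheduling buffer
    return running, buffer_

lts = [("LT1", 0.9), ("LT2", 0.8), ("LT3", 0.7), ("LT4", 0.6)]
bts = [("B1", 0.95), ("B2", 0.85), ("B3", 0.40)]
running, waiting = schedule_quantum(lts, bts, n_large=3, baut=0.5)
# B1 and B2 preempt LT3 and LT2; B3 stays below BAUT; LT1 keeps its core.
```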

Methodology
Workloads:
 Single-application: 9 multithreaded applications with different impact from bottlenecks
 2-application: all 55 combinations of (9 MT + 1 ST)
 4-application: 50 random combinations of (9 MT + 1 ST)
Processor configuration:
 x86 ISA
 Area of large core = 4 × area of small core
 Large core: 4 GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
 Small core: 4 GHz, in-order, 2-wide, 5-stage
 Private 32KB L1, private 256KB L2, shared 8MB L3
 On-chip interconnect: bidirectional ring, 2-cycle hop latency

Comparison points
Single application:
 ACMP (Morad et al., Comp. Arch. Letters'06): only accelerates Amdahl's serial bottleneck
 Age-based scheduling (AGETS, Lakshminarayana et al., SC'09): only accelerates lagging threads
 Bottleneck Identification and Scheduling (BIS, Joao et al., ASPLOS'12): only accelerates bottlenecks
Multiple applications:
 AGETS+PIE: select the most lagging thread with AGETS and use PIE across applications; only accelerates lagging threads
 MA-BIS: BIS with the large cores shared across applications; only accelerates bottlenecks

Single application, 1 large core
Optimal number of threads, 28 small cores, 1 large core.
[Figure: per-benchmark speedups; benchmarks limited by critical sections benefit from BIS and UBA, those with lagging threads benefit from AGETS and UBA, and some have neither bottlenecks nor lagging threads]
- UBA outperforms both AGETS and BIS by 8%
- UBA's benefit increases with area budget and number of large cores

Multiple applications
2-application workloads, 60 small cores, 1 large core.
UBA improves harmonic speedup (Hspeedup) over AGETS+PIE and MA-BIS by 2 to 9%.

Summary
 To effectively use ACMPs, accelerate both fine-grained bottlenecks and lagging threads, for single and multiple applications
 Utility-Based Acceleration (UBA) is a cooperative software-hardware solution to both problems
 Our Utility of Acceleration metric combines a measure of acceleration and a measure of criticality to allow meaningful comparisons between code segments
 Utility is implemented for an ACMP but is general enough to be extended to other acceleration mechanisms
 UBA outperforms previous proposals for single applications and their aggressive extensions for multiple-application workloads
 UBA is a comprehensive fine-grained acceleration proposal for parallel applications, requiring no programmer effort

Thank You! Questions?
