University of Michigan Electrical Engineering and Computer Science: Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems
Hyoun Kyu Cho and Scott Mahlke
University of Michigan, Ann Arbor
December 2, 2012

Critical Path
Longest path between source and sink in a DAG
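The definition above can be made concrete with a small sketch. The C code below computes the length of the longest source-to-sink path in a weighted DAG. It is an illustration only, not the authors' tool: the names `struct edge` and `critical_path_length` are hypothetical, edge costs are assumed non-negative, and nodes are assumed to be numbered in topological order (every edge goes from a lower to a higher index).

```c
#include <assert.h>
#include <stddef.h>

/* One weighted edge of the DAG: from -> to, taking `cost` time units. */
struct edge {
    int from;
    int to;
    long cost;
};

/*
 * Length of the longest (critical) path from `source` to `sink`.
 * Assumptions of this sketch: at most 64 nodes, non-negative costs,
 * and nodes numbered in topological order (from < to for every edge).
 * Returns -1 if `sink` cannot be reached from `source`.
 */
long critical_path_length(int num_nodes, const struct edge *edges,
                          size_t num_edges, int source, int sink)
{
    enum { MAX_NODES = 64 };
    long dist[MAX_NODES];

    assert(num_nodes > 0 && num_nodes <= MAX_NODES);
    assert(source >= 0 && source < num_nodes);
    assert(sink >= 0 && sink < num_nodes);

    for (int v = 0; v < num_nodes; v++)
        dist[v] = -1;               /* -1 marks "not reached" */
    dist[source] = 0;

    /* Visit nodes in topological (index) order and relax outgoing edges. */
    for (int u = 0; u < num_nodes; u++) {
        if (dist[u] < 0)
            continue;
        for (size_t e = 0; e < num_edges; e++) {
            if (edges[e].from != u)
                continue;
            assert(edges[e].to > u && edges[e].to < num_nodes);
            assert(edges[e].cost >= 0);
            long candidate = dist[u] + edges[e].cost;
            if (candidate > dist[edges[e].to])
                dist[edges[e].to] = candidate;
        }
    }
    return dist[sink];
}
```

For a diamond-shaped graph with edges 0->1 (3), 0->2 (2), 1->3 (4), and 2->3 (1), the call `critical_path_length(4, g, 4, 0, 3)` follows 0->1->3 and returns 7; shortening any edge off that path would not reduce the total.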

Critical Path [Saidi '08]

Critical Path for Multithreaded Programs
Figure with two panels, (a) Mutex Lock and (b) Barrier, showing the events Call, StartLock, EndLock, Unlock, ArBarrier, and LvBarrier across threads T1, T2, and T3 [Hollingsworth '98]

Scalability of Multithreaded Programs
Some benchmarks do not scale very well!

CPU Time Wasted on Synchronizations
Synchronization is a major bottleneck!

Arrival Time Variation

Accelerating Critical Path
ACS [Suleman et al. ASPLOS '09]
  - Critical sections
Voltage Boosting [Dreslinski '11]
  - Transactional bottlenecks
Booster [Miller et al. HPCA '12]
  - Alleviate performance variation
  - Reactive acceleration for barriers

Challenges and Opportunities of NTC (near-threshold computing)
Poor single-thread performance
Very sensitive to process variation
  - Running all cores at the speed of the slowest one leads to severe loss
  - Likely to have performance heterogeneity
Potential for bigger frequency boosting

Objectives
Systematic way of identifying critical paths
Dealing with performance variation
Flexible control of core boosting

System Architecture
Figure: block diagram split into offline and online parts. Labels: Target Program, Intermediate Representation, Compilation, Parallelism Analysis, Monitor instrumentation, Instrumented Executable, Monitoring Logic, Observe, Adjust Priority, Weighted Probabilistic Priority Scheduler, Schedule.

Lottery Scheduling
Each thread holds a number of tickets
Scheduler selects the fast-mode thread by picking a ticket at random
Efficient implementation of proportional-share resource management
Responsive, flexible control over relative execution rate [Waldspurger '94]
Example (figure): total = 20 tickets, the first thread holds 10, and the random draw in [0..19] is compared against 15. Running sum over the thread list: ∑ = 10, ∑ > 15? no; ∑ = 12, ∑ > 15? no; ∑ = 17, ∑ > 15? yes, so the third thread is selected.
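The running-sum walk in the lottery example can be sketched in C as follows. This is a minimal illustration, not the scheduler from the talk: `lottery_pick` and `total_tickets` are hypothetical names, and the caller is assumed to supply a draw in the range [0, total).

```c
#include <assert.h>
#include <stddef.h>

/* Sum of all tickets currently held by the threads. */
unsigned total_tickets(const unsigned *tickets, size_t num_threads)
{
    unsigned total = 0;

    assert(tickets != NULL);
    for (size_t i = 0; i < num_threads; i++)
        total += tickets[i];
    return total;
}

/*
 * Return the index of the thread that holds the winning ticket.
 * tickets[i] is the ticket count of thread i; `draw` is expected to be
 * in [0, total). The winner is the first thread whose running ticket
 * sum exceeds the draw. Returns num_threads if the draw is out of range.
 */
size_t lottery_pick(const unsigned *tickets, size_t num_threads,
                    unsigned draw)
{
    unsigned running_sum = 0;

    assert(tickets != NULL);
    for (size_t i = 0; i < num_threads; i++) {
        running_sum += tickets[i];
        if (running_sum > draw)
            return i;
    }
    return num_threads;   /* no winner: draw >= total */
}
```

With ticket counts {10, 2, 5, 1, 2} (the last two values are assumed here so that the total is 20) and a draw of 15, the running sums are 10, 12, and 17, so thread index 2 wins. A caller could obtain the draw with something like `rand() % total_tickets(...)`; the modulo introduces a small bias, which a production scheduler may want to avoid. Because a thread's chance of winning is its share of the total, changing one thread's ticket count immediately changes its relative rate of running in fast mode.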

Progress Monitoring
For data-parallel threads
Slower threads are more likely to be on the critical path
Divide the task into multiple smaller chunks and instrument it with monitoring code
Monitoring code reduces the thread's number of tickets

Example of Progress Monitoring

    ...
    pthread_barrier_wait(barrier);
    long PROGRESS_GRANULE = (k2 - k1) / NUM_STEPS;
    if ( PROGRESS_GRANULE < 1 )
        PROGRESS_GRANULE = 1;   /* avoid a modulo by zero on short ranges */
    for ( i = k1 ; i < k2 ; i++ ) {
        /* Loop body */
        float x_cost = dist(points->p[i], points->p[x], points->dim)
                       * points->p[i].weight;
        float current_cost = points->p[i].cost;
        if ( x_cost < current_cost ) {
            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;
        } else {
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
        /* Inserted monitoring code: report progress once per granule */
        if ( (i - k1) % PROGRESS_GRANULE == 0 )
            halve_priority_tickets();
    }
    pthread_barrier_wait(barrier);
    ...
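The transcript does not show what `halve_priority_tickets()` does internally. The sketch below is one plausible reading, under loudly stated assumptions: the type `struct thread_tickets`, the constants `INITIAL_TICKETS` and `MIN_TICKETS`, and the explicit pointer argument are all hypothetical (the call in the slide takes no argument, so the original presumably locates the calling thread's state on its own).

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_TICKETS 1024u   /* assumed starting count, not from the talk */
#define MIN_TICKETS     1u      /* assumed floor so a thread can still win */

/* Hypothetical per-thread ticket state seen by the lottery scheduler. */
struct thread_tickets {
    unsigned tickets;
};

/* Give the thread its full ticket count, e.g. right after a barrier. */
void reset_tickets(struct thread_tickets *t)
{
    assert(t != NULL);
    t->tickets = INITIAL_TICKETS;
}

/*
 * Called by the monitoring code each time the thread finishes one
 * progress granule. A thread that is further along holds fewer
 * tickets, so slower threads become more likely to be boosted.
 */
void halve_tickets(struct thread_tickets *t)
{
    assert(t != NULL);
    if (t->tickets > MIN_TICKETS)
        t->tickets /= 2;
}
```

Under these assumptions, a thread that has completed three granules holds 128 tickets while a thread that has completed one holds 512, which makes the lagging thread four times as likely to be picked for fast mode.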

Priority Delegation
A thread holding a mutex is likely to be on the critical path
  - Increase its tickets when it acquires the mutex
It is even more likely to be on the critical path if other threads are waiting
  - Temporarily transfer the waiting threads' tickets to the thread holding the mutex
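The two rules for mutexes can be sketched as simple ticket bookkeeping. The code below is an assumption-laden illustration rather than the authors' implementation: `struct thread_state`, `MUTEX_BONUS`, and every function name are hypothetical, a waiter lends all of its own tickets, and nested locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MUTEX_BONUS 4u   /* assumed bonus for holding a mutex */

/* Hypothetical per-thread ticket bookkeeping. */
struct thread_state {
    unsigned own;        /* tickets the thread holds by itself */
    unsigned bonus;      /* extra tickets while it holds a mutex */
    unsigned borrowed;   /* tickets lent to it by waiting threads */
    unsigned lent;       /* tickets it has lent to a mutex holder */
};

/* Ticket count the lottery scheduler would use for this thread. */
unsigned effective_tickets(const struct thread_state *t)
{
    assert(t != NULL && t->lent <= t->own);
    return t->own - t->lent + t->bonus + t->borrowed;
}

/* Rule 1: a thread that acquires a mutex gets more tickets. */
void on_mutex_acquire(struct thread_state *holder)
{
    holder->bonus += MUTEX_BONUS;
}

/* Rule 2: a thread that blocks on the mutex lends its tickets. */
void on_mutex_wait(struct thread_state *waiter, struct thread_state *holder)
{
    unsigned amount = waiter->own - waiter->lent;

    waiter->lent += amount;
    holder->borrowed += amount;
}

/* The waiter stops waiting: its tickets return from the holder. */
void on_mutex_wake(struct thread_state *waiter, struct thread_state *holder)
{
    assert(holder->borrowed >= waiter->lent);
    holder->borrowed -= waiter->lent;
    waiter->lent = 0;
}

/* The holder releases the mutex and loses the bonus. */
void on_mutex_release(struct thread_state *holder)
{
    assert(holder->bonus >= MUTEX_BONUS);
    holder->bonus -= MUTEX_BONUS;
}
```

With a holder that owns 8 tickets and a waiter that owns 6, the holder's effective count goes from 8 to 12 on acquire and to 18 once the waiter blocks, while the waiter drops to 0; both return to their original counts after wake and release. The total number of tickets in the system changes only by the bonus, so delegation shifts boosting probability toward the lock holder without starving unrelated threads.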

Performance Evaluation
Post-processing traces
  - Generated on a 32-core machine: four 8-core Intel Xeon X… chips, … MB L3 cache per chip, 32 GB total memory
  - Augmented with progress time indication
1.5x, 2x, 5x, 10x acceleration for 1 fast-mode core
Varying scheduling quantum from 1 us to 1 ms

Speedup for Streamcluster
Figure: speedup results, with labels H/W, OS, and User mode.

Current Status
Figure: the system architecture diagram (Target Program, Intermediate Representation, Compilation, Parallelism Analysis, Monitor instrumentation, Instrumented Executable, Monitoring Logic, Observe, Adjust Priority, Weighted Probabilistic Priority Scheduler, Schedule), now with cores labeled Normal and Turbo.

Conclusion & Future Work
Introduces a S/W framework to improve multithreaded programs' performance using core boosting
Combines static analysis, dynamic monitoring, and probabilistic priority scheduling to predict critical paths
Shows 5% to 27% performance improvement for streamcluster
Future work:
  - Better modeling of the tradeoff between performance and energy
  - Predicting critical paths for other types of parallelism

Thank you!