BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers.

Presentation transcript:

BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers 2011, Ischia, Italy.

Program's time-varying behavior.
[Figure: NoC traffic over time for Bodytrack, 16-thread parallel execution.]
Adaptive CMP architectures can take advantage of this time-varying behavior.
Challenge: how to detect behavioral changes?

Tracking program behavior
Traditionally, two methods for tracking program phases:
1. Run-time monitoring of the program execution.
– Observations are limited by the monitoring metric.
– Cost of monitoring mechanisms.
– Granularity of monitoring intervals: fine- vs. coarse-grain?
2. Profile-based analysis.
– Static program analysis, complicated algorithms.
– Binary rewriting.
– Architectural support.
Code-based metrics are not directly suitable for parallel workloads.

Overview of our proposal
• Track the program behavior at run time.
• Effective
• Simple
• Low-cost
View the program execution at 'epoch' granularity.

Outline
• Introduction
• Program epochs and characterization
• Run-time epoch change detection
• Case study
• Summary

Observation / Motivation
Natural alignment of barriers with the changes in program behavior.
Intervals enclosed by barriers repeat with consistent behavior.
[Figure: NoC traffic over time.]

Program epochs
Epoch: an execution interval between two consecutive barriers.
[Figure: NoC traffic over time, with barriers (A, B, C, D) marking the boundaries of repeating epochs such as AB, BA, and DC.]
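The epoch definition can be illustrated with a minimal sketch that splits a trace of barrier events into epochs. The trace format, the function name `split_into_epochs`, and the choice to name an epoch by its (opening barrier, closing barrier) pair are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

def split_into_epochs(barrier_events):
    """Group consecutive barrier events into epochs.

    barrier_events: list of (timestamp, barrier_id) pairs in time order
    (a hypothetical trace format). Returns a dict mapping an epoch ID,
    here the pair (opening barrier, closing barrier), to the list of
    (start_time, end_time) dynamic instances of that epoch.
    """
    instances = defaultdict(list)
    # Each pair of consecutive barriers encloses one epoch instance.
    for (t0, b0), (t1, b1) in zip(barrier_events, barrier_events[1:]):
        instances[(b0, b1)].append((t0, t1))
    return dict(instances)

# Barriers A and B alternate, so epochs AB and BA each repeat twice.
trace = [(0, "A"), (10, "B"), (14, "A"), (25, "B"), (29, "A")]
epochs = split_into_epochs(trace)
```

Because epoch boundaries come from the barriers themselves, the instances naturally have variable lengths (10 and 11 time units for AB in this toy trace).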

Epochs' effectiveness: characterization
Are epochs effective in characterizing the variability of program behavior?
• How similar is program behavior among the different dynamic instances of the same epoch?
• How different is the behavior across different epochs?
• How does the program behave within the epochs?

Characterization across epochs.
[Figure: NoC traffic per epoch.]
Error bars: variability across the dynamic instances of an epoch (LOW variability).
Dispersion across points: variability across different epochs (HIGH variability).

Characterization across epochs.
Low variability across instances of an epoch: high predictability of behavior across epoch instances.
High variability across different epochs: fundamental correlation between epoch boundaries and changes in program behavior.
[Figure panels: NoC Traffic, L2 Miss Ratio, Global IPC, C2C Transfers.]

Characterization across epochs.
Low variability across instances of an epoch; high variability across different epochs.
Ratio = (variability across instances of an epoch) / (variability across different epochs)
The smaller the ratio, the sharper the behavioral shifts on epoch boundaries and the more predictable the program behavior across repeating epoch instances.

Ratio for PARSEC and SPLASH2 programs: less than 0.2 for most benchmarks.
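A minimal sketch of how such a variability ratio could be computed from per-epoch measurements. The slides do not state the exact variability metric, so the use of population standard deviation, the averaging over epochs, and the function name `variability_ratio` are all assumptions.

```python
from statistics import mean, pstdev

def variability_ratio(samples_by_epoch):
    """Ratio of within-epoch to across-epoch variability (assumed metric).

    samples_by_epoch: dict mapping an epoch ID to a list of metric values
    (e.g. NoC traffic), one value per dynamic instance of that epoch.
    """
    # Variability across the dynamic instances of each epoch, averaged.
    within = mean([pstdev(values) for values in samples_by_epoch.values()])
    # Variability across different epochs, using each epoch's mean value.
    across = pstdev([mean(values) for values in samples_by_epoch.values()])
    if across == 0:
        raise ValueError("all epochs have the same mean; ratio is undefined")
    return within / across

# Toy data: two epochs whose instances are stable but whose levels differ.
traffic = {"AB": [10, 11, 9], "BA": [50, 52, 48]}
ratio = variability_ratio(traffic)
```

In this toy data the instances of each epoch stay close together while the two epochs differ widely, so the ratio comes out well below 0.2.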


Characterization within epochs.
Epochs may exhibit stable or other behavioral patterns within their boundaries.
Internal behavior patterns reoccur and thus can be accurately predicted.
[Figure: example epochs with stable, unstable, and multiphase behavior.]

Characterization within epochs.
Most epochs exhibit stable behavior within their boundaries.
Close relation to the classic definition of a program phase.
Reoccurring internal patterns can be predictable.
[Figure: per-benchmark results for bodytrack, fluidan., streamcl., barnes, fmm, lu, ocean, radiosity, water-ns, and the average.]

Epoch characterization summary
• Epochs repeat in a consistent and predictable way, providing a reliable granularity for the cyclic pattern of program behavior.
• Epoch boundaries are likely to naturally indicate changes in program behavior.
• Most epochs exhibit stable behavior within their boundaries, or other reoccurring, predictable patterns.

Epochs: Advantages
• Independent of the underlying architecture.
• Naturally adopt variable-length intervals.
• Deterministic boundaries (global sync points).
• Barriers can be easily captured at run time.
• Many multithreaded workloads are written with barrier synchronizations.


Run-time epoch change detection.
[Figure: barrier_wait(barrier) calls in the application's source code appear in the application's instruction stream. Detected barriers (Barrier A, Barrier B) index an Epoch Table with the fields EPOCH ID, Decision signature, and F bit; the entry for epoch AB supplies Config AB to the reconfiguration units.]
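A software sketch of an epoch table, to make the lookup flow concrete. This is not the hardware design from the slides: the class name `EpochTable`, the successor-based prediction of the upcoming epoch, and the string configurations are assumptions, and the decision signature and F bit are not modeled.

```python
class EpochTable:
    """Toy epoch table keyed by (opening barrier, closing barrier)."""

    def __init__(self, default_config):
        self.default_config = default_config
        self.configs = {}    # epoch ID -> configuration learned for it
        self.successor = {}  # barrier -> barrier that followed it last time
        self.prev = None     # barrier that opened the current epoch

    def on_barrier(self, barrier, learned_config=None):
        """Handle a barrier; return the configuration for the next epoch."""
        # Close the epoch that just ended and record what was learned.
        if self.prev is not None:
            self.successor[self.prev] = barrier
            if learned_config is not None:
                self.configs[(self.prev, barrier)] = learned_config
        self.prev = barrier
        # Assumption: the next epoch ends at the barrier seen last time.
        predicted_end = self.successor.get(barrier)
        return self.configs.get((barrier, predicted_end), self.default_config)

table = EpochTable(default_config="default")
first = table.on_barrier("A")                              # nothing known yet
second = table.on_barrier("B", learned_config="Config AB")  # AB learned
third = table.on_barrier("A", learned_config="Config BA")   # AB predicted next
fourth = table.on_barrier("B")                              # BA predicted next
```

The first pass over each epoch falls back to the default configuration; once an epoch has been seen, its stored configuration is returned as soon as its opening barrier is reached again.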


Case study: Overview
• Purpose: Demonstrate the applicability of the BarrierWatch approach in the context of dynamic adaptation.
• Goal: Optimize the energy/performance trade-off in a CMP architecture using BarrierWatch.
• Adaptation technique: DVFS applied to the NoC (at epoch granularity).

Experimental methodology
Benchmarks:
• From the PARSEC and SPLASH2 suites (pthread).
Architectural model:
• Full-system simulator (Simics) augmented with a cycle-accurate memory hierarchy model.
• Tile-based CMP model / 16 in-order cores / 2-issue width.
• Shared, physically distributed L2 cache.
• Mesh NoC, x-y routing.
• Two-stage router pipeline, buffer size 2 per VC.

On-Chip DVFS
On-chip power consumption model:
• NoC power + background power
• NoC voltage/frequency levels (table columns: Frequency (GHz), Voltage (V), alias): 30.8; aliases f100%, f75%, f50%, f25%
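A hedged sketch of how a per-epoch frequency level might be chosen from the four aliases. The slides do not describe the selection policy, so the mapping of aliases to fractions of the nominal frequency, the `headroom` parameter, and the function name `pick_level` are all hypothetical.

```python
# Assumption: each alias denotes that fraction of the nominal NoC frequency.
LEVELS = [("f25%", 0.25), ("f50%", 0.50), ("f75%", 0.75), ("f100%", 1.00)]

def pick_level(predicted_load, headroom=0.8):
    """Pick the lowest level that can carry the predicted epoch load.

    predicted_load: expected NoC load of the upcoming epoch, as a fraction
    (0..1) of what the NoC can carry at f100% (hypothetical input).
    headroom: fraction of a level's capacity the policy is willing to use.
    """
    for alias, fraction in LEVELS:
        if predicted_load <= fraction * headroom:
            return alias
    # No lower level has enough capacity, so run at full frequency.
    return "f100%"

light = pick_level(0.10)   # low-traffic epoch
medium = pick_level(0.30)  # moderate-traffic epoch
heavy = pick_level(0.90)   # traffic-heavy epoch
```

Applying the choice at epoch boundaries means the level changes only at barriers, where the behavior is expected to shift.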

Evaluated schemes

Case study: Results

Case study: Results Run-time Epoch-based DVFS: 12.5% energy savings for 2.7% slowdown

Case study: Results
Epoch-based dynamic schemes outperform all static schemes.


Summary
• Program-defined epochs represent well the repetitive and varying behavior of multithreaded programs.
• BarrierWatch is a prominent method for effective run-time management in CMPs.
• Desirable properties:
1. Simple and lightweight.
2. Effective at run time.
3. Independent of the underlying architecture.
4. Well suited for parallel applications.

Thank you! Computer Frontiers 2011, Ischia, Italy.