MAESTRO: Orchestrating Predictive Resource Management in Future Multicore Systems Sangyeun Cho, Socrates Demetriades Computer Science Department University.

Presentation transcript:

MAESTRO: Orchestrating Predictive Resource Management in Future Multicore Systems Sangyeun Cho, Socrates Demetriades Computer Science Department University of Pittsburgh

Prelude Heterogeneity in multicore processors will grow 1. Designers adopt asymmetry [Kumar et al., ’03]: large, fast, high-power cores next to small, slower, low-power cores

Prelude Heterogeneity in multicore processors will grow 2. Processor variations render processor cores “unintentionally” different [Borkar, ’04]: cores 0 to 3 range from fast, high power to slow, low power

Prelude Heterogeneity in multicore processors will grow 3. Imperfect resource management results in unbalanced and unfair resource usage [Iyer, ’04], e.g. core 0 and core 1 competing for a shared cache

Prelude Heterogeneity in multicore processors will grow 4. Intermittent and permanent faults degrade a system [Borkar, ’04]

Our contributions
Observation
–Heterogeneity in computing resources grows
–Need to manage resources differently
MAESTRO: a system design framework
–To better deal with heterogeneous resources in multicore chips; to better scale them
Case study
–Parallel program is split into “epochs”
–Remember how each epoch behaved
–Utilize past behavior to predict and control the future
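The epoch idea in the case-study bullets (remember how each epoch behaved, then use that to predict the next run of the same epoch) can be sketched as a small history table. This is only an illustration: the class `EpochHistory`, its methods, and the choice of a mean-of-past-values predictor are assumptions, not the mechanism described in the talk.

```python
from collections import defaultdict


class EpochHistory:
    """Hypothetical helper: remembers per-epoch observations and predicts the next one."""

    def __init__(self):
        self._samples = defaultdict(list)  # epoch id -> list of observed values

    def record(self, epoch_id, value):
        """Store what was observed when this epoch last ran."""
        self._samples[epoch_id].append(value)

    def predict(self, epoch_id, default=None):
        """Predict the next value for an epoch as the mean of its past values."""
        seen = self._samples.get(epoch_id)
        if not seen:
            return default  # nothing learned yet for this epoch
        return sum(seen) / len(seen)


history = EpochHistory()
history.record("A", 0.25)  # e.g. NoC traffic seen in epoch "A" (made-up value)
history.record("A", 0.75)
history.record("B", 0.10)
print(history.predict("A"))  # mean of the two recorded values for epoch "A"
```

A last-value or weighted predictor would fit the same interface; the mean is used here only because it is the simplest choice to read.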

Deal with or not? [Figure: average program performance relative to RND on cores 0 to 3, for σ/μ = 0.08 and σ/μ = 0.16, when offered load is low; annotated values: 3%, 18%, 35%]
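The σ/μ labels in the figure are a ratio of standard deviation to mean (a coefficient of variation). As a rough sketch, assuming the quantity being measured is per-core speed (the slide does not say this explicitly), it could be computed as follows; the function name `variation` and the speed values are made up for illustration.

```python
from statistics import mean, pstdev


def variation(core_speeds):
    """Return sigma/mu: population standard deviation divided by the mean."""
    return pstdev(core_speeds) / mean(core_speeds)


# Made-up relative speeds for four cores.
speeds = [0.9, 1.0, 1.0, 1.1]
print(f"sigma/mu = {variation(speeds):.3f}")
```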

AWARENESS is key… Two types of awareness: (1) execution environment; and (2) application behavior. Most systems, however, are NOT aware of heterogeneity (except NUMA)!

MAESTRO: Vision
1. Learn environment automatically and annotate it
2. Learn application automatically and annotate it
3. System does better and better in matching an application with resources
There are many “how”s we need to study
–The paper lists many research questions

MAESTRO: Big picture [Figure: applications, “???”, execution environment w/ asymmetric resources]

MAESTRO: Learning environment [Figure: microbench, “environment profiler”]

MAESTRO: Learning application [Figure: program run, “application profiler”]

MAESTRO: Leveraging annotations [Figure: program run, “resource manager”]

Example problems
Initial task mapping
–Map a new task to the processor that fits best at the time of mapping (cf. random, round-robin, shortest queue, …)
Last-level cache management
–Allocate cache capacity based on prediction
Power and energy management
–Select a low-power core to minimize energy while meeting QoS
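As a sketch of the first example problem, initial task mapping can be framed as choosing the core with the earliest estimated finish time, using annotations about each core. Everything named here is hypothetical: the `Core` fields, the `pick_core` function, and the cost model (queued work plus new work, divided by annotated speed) are illustrative assumptions rather than the policy proposed in the talk.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    speed: float        # annotated relative speed (assumed annotation)
    queued_work: float  # work already waiting on this core


def pick_core(cores, task_work):
    """Return the core with the earliest estimated finish time for the new task."""
    return min(cores, key=lambda c: (c.queued_work + task_work) / c.speed)


cores = [
    Core("core0", speed=2.0, queued_work=6.0),
    Core("core1", speed=1.0, queued_work=1.0),
    Core("core2", speed=1.0, queued_work=4.0),
]
best = pick_core(cores, task_work=2.0)
print(best.name)
```

Unlike shortest-queue mapping, this estimate lets a fast but busier core win when the new task is large, which is the kind of decision that needs awareness of both the environment and the application.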

Research questions
What parameters do we study? Dependency between resource parameters?
Which resource to characterize? How to represent? Microbenchmark?
At which level do we characterize an application? Program? Phase? Instruction? How?
What architectural support will enable effective and efficient learning?
See paper for details

Cadenza: Case study
Purpose
–Prove the concept of predictive resource management
Goal
–Evaluate “epoch”-based performance-energy adaptation of the on-chip network
Adaptation mechanism
–All-router DVFS (dynamic voltage-frequency scaling)

Case study: Program epochs [Figure: NoC traffic over time, divided into epoch “A”, epoch “B”, …] [Demetriades and Cho, ’11]

Case study: Methodology
Benchmark
–PARSEC and SPLASH-2 (pthread)
Simulation setting
–Simics (full-system simulator) + cycle-accurate memory hierarchy module
–16 2-issue in-order cores
–Distributed shared L2 cache
–2D mesh NoC, x-y routing
–2-stage router pipeline, 2-entry buffer per VC

Case study: Power model
Power consumption
–NoC power + others (background)
NoC power: DVFS
[Table: Frequency (GHz), Voltage (V), alias; surviving entries: “30.8”; aliases f_100%, f_75%, f_50%, f_25%]

Case study: Evaluation space
Schemes with fixed NoC frequency
–f_100% (baseline), f_75%, f_50%, f_25%
Epoch-based DVFS (adaptive strategies)
–f_DVFS-dyn: Run-time adaptation
–f_DVFS-static: Statically (off-line) determined adaptation
Best frequency: one that minimizes the energy-delay product
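The "best frequency" rule above (minimize the energy-delay product) can be sketched per epoch as follows. The function name `best_frequency`, the data layout, and all numbers are made up for illustration; they are not measurements from the case study.

```python
def best_frequency(measurements):
    """Pick the frequency alias with the lowest energy-delay product.

    measurements maps a frequency alias to an (energy, delay) pair.
    """
    return min(measurements, key=lambda f: measurements[f][0] * measurements[f][1])


# Hypothetical numbers for one epoch, normalized to the f_100% baseline.
epoch_a = {
    "f_100%": (1.00, 1.00),
    "f_75%": (0.80, 1.05),
    "f_50%": (0.70, 1.30),
    "f_25%": (0.65, 2.00),
}
print(best_frequency(epoch_a))
```

In this made-up epoch the lowest frequency saves the most energy but loses on the product because its delay doubles, so an intermediate setting is selected.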

Case study: Results


Case study: Results Run-time epoch-based DVFS shows 12.5% energy savings for a 2.7% slowdown
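To relate these two figures to the energy-delay product used as the selection metric, a short calculation helps. It assumes both percentages are measured against the same baseline and cover the same quantity (for example whole-run energy and runtime), which the slide does not state.

```python
energy_ratio = 1.0 - 0.125  # 12.5% energy savings relative to the baseline
delay_ratio = 1.0 + 0.027   # 2.7% slowdown relative to the baseline

# Energy-delay product relative to the baseline (lower is better).
edp_ratio = energy_ratio * delay_ratio
print(f"relative EDP: {edp_ratio:.3f}")  # prints "relative EDP: 0.899"
```

Under that assumption the energy-delay product drops by roughly 10%.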

Case study: Results Epoch-based strategies are robust and outperform all static schemes…

Postlude
We predict and examine the impact of growing heterogeneity in processor resources
We propose MAESTRO, a hypothetical system design framework to tackle heterogeneity with little manual intervention
–We envision a system that performs better and better over time
Our detailed case study reveals that learning an application can pay off
