
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
Ramazan Bitirgen, Engin Ipek and Jose F. Martinez, MICRO'08
Presented by PAK, EUNJI

Introduction
- Resource sharing problem in CMPs
  - Increasing levels of pressure on shared system resources
  - Efficient sharing is necessary for high utilization and performance
- Multiple interacting resources
  - Cache space, DRAM bandwidth, and power budget
  - Allocating one resource changes an application's demand for the others
- Proposal: a resource allocation framework
  - At runtime, it monitors the execution of each application, learns a predictive model of performance as a function of resource allocation decisions, and periodically allocates resources to each core using that model

Resource Allocation Framework
- Per-application HW performance model
  - Uses Artificial Neural Networks (ANNs)
  - Predicts each application's performance as a function of the resources allocated to it
- Global resource manager
  - At every interval, searches the space of possible resource allocations by querying the application performance models

How to Predict Performance? (Artificial Neural Networks)
- Input units, hidden units, and an output unit are connected via a set of weighted edges
- Each hidden (output) unit computes a weighted sum of its inputs (hidden values) based on the edge weights
- Edge weights are trained from training examples (data sets)
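The layered structure above can be sketched as a minimal forward pass. The layer sizes and weights below are illustrative toy values, not the configuration used in the paper:

```python
import math

def sigmoid(x):
    # Squash a weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def ann_predict(inputs, w_hidden, w_out):
    """One forward pass through a single-hidden-layer network.

    w_hidden[j][i] is the weight of the edge from input i to hidden
    unit j; w_out[j] is the weight from hidden unit j to the output.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Toy network: 3 inputs, 2 hidden units, 1 output unit.
w_hidden = [[0.5, -0.2, 0.1],
            [0.3, 0.8, -0.5]]
w_out = [1.0, -1.0]
prediction = ann_predict([0.2, 0.7, 0.1], w_hidden, w_out)
```

Training adjusts the edge weights (e.g. by backpropagation) so that predictions match observed performance on the training examples.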

How to Predict Performance? (Per-Application Performance Model)
- Input units
  - Allocated L2 cache space, off-chip bandwidth, and power budget
  - Numbers of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
  - Fraction of cache ways that are dirty (a proxy for writeback traffic)
- Activation function
  - Sigmoid (maps a weighted sum to a value in [0, 1])
- The model captures performance as a function of allocated resources and recent behavior
- Training
  - During the first 1.2 billion cycles, resources are allocated randomly to gather samples
  - A training set of 300 points is always maintained
  - Models are retrained every 2,500,000 cycles
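Since sigmoid units work on values in [0, 1], the raw counters above must be scaled into a comparable range before feeding the input units. A hypothetical normalization sketch; the per-feature bounds here are invented for illustration, not taken from the paper:

```python
def normalize(value, upper_bound):
    # Scale a raw counter or allocation into [0, 1]; upper_bound is an
    # assumed per-feature maximum, not a value specified in the paper.
    return min(value / upper_bound, 1.0)

# Hypothetical feature vector for one application.
features = [
    normalize(6, 16),        # allocated L2 ways (of 16)
    normalize(1600, 4000),   # allocated off-chip bandwidth, MB/s
    normalize(20, 60),       # allocated power budget, W
    normalize(1800, 20000),  # read hits over the last 20K instructions
    normalize(350, 20000),   # read misses over the last 20K instructions
    normalize(5, 16),        # dirty cache ways (of 16)
]
```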

How to Predict Performance? (Ensembles and Cross-Validation)
- Optimization goal: prevent the model from memorizing outliers in the sample data (overfitting)
- Cross-validation
  - The data set is divided into N equal-sized folds (N-1 folds for training, 1 fold for testing)
  - The ensemble consists of N ANN models
- Performance is predicted by averaging the predictions of all ANNs in the ensemble
- Prediction error is estimated as a function of the CoV (coefficient of variation) of the individual ANNs' predictions (used later during resource allocation)
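The ensemble's averaged prediction and its CoV-based error estimate can be sketched as follows; the stand-in "models" are plain callables taking the place of trained ANNs:

```python
import statistics

def ensemble_predict(models, features):
    """Average the ensemble's predictions and estimate error as the
    coefficient of variation (CoV) of the individual predictions."""
    preds = [m(features) for m in models]
    mean = statistics.mean(preds)
    cov = statistics.pstdev(preds) / mean if mean else float("inf")
    return mean, cov

# Three stand-in models predicting IPC for the same input.
models = [lambda f: 1.10, lambda f: 1.05, lambda f: 1.15]
ipc, error = ensemble_predict(models, features=None)
# A high CoV signals disagreement among the folds' models, i.e. an
# unreliable prediction for this application.
```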

Resource Allocation
- Resource allocation decisions are made every 500,000 cycles using the trained per-application performance models
- Queries involving an application with a high error estimate are discarded
  - Such applications receive a fair share of the resources
  - Predict performance and compute the prediction error; if the prediction is estimated to be inaccurate (error > 9%), the application is excluded from global resource allocation
- The allocation space is searched with stochastic hill climbing
  - Start with a random solution and iteratively make small changes, each time improving the solution a little
  - The algorithm terminates when it can no longer find an improvement
  - 2,000 trials produce the best tradeoff between search performance and overhead
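A minimal sketch of the stochastic hill-climbing search over one resource (L2 ways), assuming a `score` callback that stands in for querying the performance models; the move semantics and the toy objective are invented for illustration:

```python
import random

def hill_climb(score, n_apps=4, total_units=12, trials=2000, seed=0):
    """Stochastic hill climbing over a discrete allocation.

    score(alloc) returns the predicted system performance for an
    allocation (units of one resource per application).
    """
    rng = random.Random(seed)

    def random_alloc():
        # Random split of total_units among n_apps applications.
        cuts = sorted(rng.randint(0, total_units) for _ in range(n_apps - 1))
        bounds = [0] + cuts + [total_units]
        return [bounds[i + 1] - bounds[i] for i in range(n_apps)]

    best = random_alloc()
    best_score = score(best)
    for _ in range(trials):
        cand = best[:]
        # Small change: move one unit from one application to another.
        src, dst = rng.randrange(n_apps), rng.randrange(n_apps)
        if src == dst or cand[src] == 0:
            continue
        cand[src] -= 1
        cand[dst] += 1
        cand_score = score(cand)
        if cand_score > best_score:  # keep only improving moves
            best, best_score = cand, cand_score
    return best, best_score

# Toy objective: application 0 benefits three times as much per unit.
weights = [3, 1, 1, 1]
alloc, _ = hill_climb(lambda a: sum(w * x for w, x in zip(weights, a)))
```

With this linear toy objective the search concentrates all units on application 0; with real ANN models the landscape is not linear, which is why a fixed trial budget (2,000 here, as on the slide) bounds the search cost.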

Implementation & Overhead
- HW implementation
  - A single HW ANN whose edge weights are multiplexed on the fly, yielding 16 "virtual" ANNs
  - As many multipliers as weighted edges (12)
  - A 50-entry table implements the quantized sigmoid function
  - Computation is pipelined; prediction (search) takes 16 cycles for the 16 virtual ANNs
- Area, power, and delay
  - About 3% of the chip's area
  - About 3W of power consumption
  - 2,000 queries can be made within 5% of an interval
- OS interface
  - The training set and the ANN weights are embedded in the process state
  - The OS communicates the desired objective function through a control register (CR)

Experimental Setup
- Tools & architecture
  - Heavily modified version of SESC, with Wattch (power) and HotSpot (temperature)
  - Baseline modeled after Intel's Core2Quad with DDR2-800
  - 4-core CMP; per-core frequency 0.9-4.0 GHz in 0.1 GHz steps
  - 4MB, 16-way shared L2 cache
- A 60W power budget is distributed among the 4 applications via per-core DVFS
  - Ours is limited to 57W
  - 5W is statically allocated
- L2 cache space is partitioned at the granularity of cache ways
  - One way is allocated to each application; the remaining 12 ways are distributed
- Each application is statically allocated 800MB/s of off-chip DRAM bandwidth; the remaining 3.2GB/s is distributed

Experimental Setup
- Metrics
  - Weighted speedup
  - Sum of IPCs
  - Harmonic mean of normalized IPCs
  - Weighted sum of IPCs
- Workloads
  - 9 quad-core multiprogrammed workloads from the SPEC2000 and NAS suites
  - Applications are classified into 3 categories: CPU-bound, memory-bound, and cache-sensitive
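The first and third metrics can be made concrete with a small sketch; the IPC values below are invented, and `ipc_alone` denotes each application's IPC when running alone:

```python
def weighted_speedup(ipc, ipc_alone):
    # Sum over applications of IPC_shared / IPC_alone.
    return sum(s / a for s, a in zip(ipc, ipc_alone))

def hmean_normalized_ipc(ipc, ipc_alone):
    # Harmonic mean of normalized IPCs; penalizes unfair slowdowns.
    norm = [s / a for s, a in zip(ipc, ipc_alone)]
    return len(norm) / sum(1.0 / x for x in norm)

ipc = [1.2, 0.8, 0.5, 1.0]        # IPC of each app when sharing
ipc_alone = [1.5, 1.0, 1.0, 1.0]  # IPC of each app running alone
ws = weighted_speedup(ipc, ipc_alone)
hm = hmean_normalized_ipc(ipc, ipc_alone)
```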

Experimental Setup
- Configurations
  - Unmanaged
  - Isolated cache management (Cache): utility-based cache partitioning (MICRO 2006); distributes L2 cache ways to minimize miss rate
  - Isolated power management (Power): "An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget" (MICRO 2006)
  - Isolated bandwidth management (BW): fair queuing memory system (MICRO 2006)
  - Uncoordinated combinations: Cache+Power, Cache+BW, Power+BW, Cache+Power+BW
  - Continuous stochastic hill climbing (Coordinated-HC): learning-based SMT processor resource distribution (issue queue, ROB, and register file) (ISCA 2006)
  - Fair-Share
  - Proposed scheme (Coordinated-ANN): ANN-based models of each application's IPC response to resource allocation guide a stochastic hill-climbing search

Evaluation Results
- Performance
  - Results are normalized to Fair-Share
  - 14% average speedup over Fair-Share
  - Similar improvements for the other metrics

Evaluation Results
- Sensitivity to the confidence threshold
  - Results are normalized to Fair-Share

Evaluation Results
- Confidence estimation mechanism
  - Fraction of the total execution time during which the ANN could predict the resource allocation optimization for each application

Conclusions
- Proposed a resource allocation framework that manages multiple shared CMP resources in a coordinated fashion through ANNs and a periodic resource allocation scheme
- A coordinated approach to multiple-resource management is key to delivering high performance on multiprogrammed workloads

Extras
[Backup slides: additional per-workload result charts, with workload mixes labeled by application category]