Resource Aware Scheduler – Initial Results


Resource Aware Scheduler – Initial Results
Tomer Morad, Noam Shalev, Avinoam Kolodny, Idit Keidar, Uri Weiser
May 8, 2013

Main Message: Balance Systems to Avoid Bottlenecks
Motivation:
- Different programs have different resource requirements: number of cores, cache, memory bandwidth, energy, branch prediction, etc.
- Hence, no computing system can be balanced for every program
- Heterogeneous systems are even worse (more unbalanced)
- Contention on resources wastes energy and usually degrades performance (for example: cache contention)
Proposal:
- Dynamically tune the workload to the (dynamically tuned) hardware in order to minimize contention on resources by balancing the system
- The OS scheduler can do this

CMP Shared Resource Effects
- Examples of shared resources: last-level cache, memory bus, network bandwidth, disk bandwidth, etc.
- Three effects are observed when several threads access a shared resource:
  1. Wasted Peripheral Energy (⬆ Energy)
     - Observed when adding threads in the presence of a bottleneck
     - For example: many floating-point programs running in parallel on a Niagara processor (many cores with a shared floating-point unit)
  2. Collisions (⬆ Energy, ⬇ Throughput)
     - Observed when several threads access a shared resource and the requests are queued
     - In the example above, requests to the shared unit are served more slowly
  3. Destructive Interference (⬆ Energy, ⬇ Throughput)
     - Observed when threads evict each other's cached data

Resource Aware OS Scheduler
Main components:
- Sampling: sample the resource usage of the tasks that have run, so that the information is available for the prediction stage
- Prediction: predict each task's resource usage based on its past resource usage
- Scheduling: schedule only tasks that the system has enough resources to run (idle cores are OK)
Implementation:
- Implemented in Linux 3.2.0
- Uses performance counters for sampling
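The three components above can be illustrated with a small user-space sketch. This is not the authors' kernel code: the `Task` class, the exponentially weighted moving average used for prediction, the lightest-first admission order, and the GB/s figures are all assumptions chosen for illustration. The slides only state that prediction is based on past usage and that tasks are scheduled only when enough resources are available.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; not a kernel data structure."""
    name: str
    samples: list = field(default_factory=list)  # observed bandwidth per quantum (GB/s)
    predicted: float = 0.0                       # predicted bandwidth for the next quantum


def sample(task, observed_bw):
    """Sampling: record the bandwidth the task used in its last quantum."""
    task.samples.append(observed_bw)


def predict(task, alpha=0.5):
    """Prediction: EWMA over past samples (an assumed predictor)."""
    estimate = 0.0
    for i, s in enumerate(task.samples):
        estimate = s if i == 0 else alpha * s + (1 - alpha) * estimate
    task.predicted = estimate
    return estimate


def schedule(tasks, n_cores, bw_budget):
    """Scheduling: admit tasks while cores and bandwidth remain.

    Cores may be left idle when the bandwidth budget is exhausted.
    """
    chosen, used = [], 0.0
    for t in sorted(tasks, key=lambda t: t.predicted):  # lightest first (assumption)
        if len(chosen) == n_cores:
            break
        if used + t.predicted <= bw_budget:
            chosen.append(t)
            used += t.predicted
    return chosen


# Four bandwidth-hungry tasks on a 4-core machine with a 20 GB/s budget.
tasks = [Task(f"bw{i}") for i in range(4)]
for t in tasks:
    sample(t, 8.0)
    sample(t, 8.0)
    predict(t)

running = schedule(tasks, n_cores=4, bw_budget=20.0)
print([t.name for t in running])
```

In this toy run only two of the four tasks are admitted, leaving two cores idle rather than oversubscribing the memory bus. A real scheduler would also need a fairness mechanism so that deferred tasks are not starved, which this sketch omits.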

Memory Bandwidth – An Example
- Core count is increasing
- Core frequency does not decrease
- Pin count is not increasing
- Chip bandwidth demand is increasing, but chip bandwidth to memory is not
- We are approaching the memory bandwidth wall!
- No real remedies in the near future

Memory Bus Usage

SPEC-CPU2006 on the baseline scheduler
[Figure: four panels, one per instance count; plot data and panel numbers not recoverable]

BW-Hungry Program – Initial Results
- Implemented a resource aware scheduler in Linux 3.2.0
- BW-hungry program, single run: 5.58 sec, 132 Joules
- Run 4 times sequentially: 22.3 sec, 526 Joules
- Run 4 times in parallel (4-core i5-2500): 27.86 sec (+25%), 1368 Joules (+160%) relative to sequential
- Using the new scheduler with memory bandwidth limit enforcement: 23.71 sec (+6%), 569 Joules (+8%) relative to sequential
- Baseline scheduler vs. resource aware scheduler: 17.5% speedup, 58% energy reduction
- Disclaimers: (a) initial results; (b) energy sampled using a hardware counter (MSR_PKG_ENERGY_STATUS) that reports the energy consumed by the package. Results are consistent with Wattsup measurements.
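The percentages on this slide follow directly from the reported times and energies. The short check below recomputes them; the variable names and the `pct_change` helper are illustrative, and only the raw numbers come from the slide.

```python
def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Reported measurements (seconds, Joules) for four instances of the program.
seq_time, seq_energy = 22.3, 526.0    # run sequentially
par_time, par_energy = 27.86, 1368.0  # run in parallel, baseline scheduler
ras_time, ras_energy = 23.71, 569.0   # run in parallel, resource aware scheduler

results = {
    "parallel time vs sequential": pct_change(par_time, seq_time),
    "parallel energy vs sequential": pct_change(par_energy, seq_energy),
    "resource aware time vs sequential": pct_change(ras_time, seq_time),
    "resource aware energy vs sequential": pct_change(ras_energy, seq_energy),
    # Speedup of the resource aware scheduler over the baseline scheduler.
    "speedup vs baseline": (par_time / ras_time - 1.0) * 100.0,
    # Energy saved by the resource aware scheduler relative to the baseline.
    "energy reduction vs baseline": (1.0 - ras_energy / par_energy) * 100.0,
}

for label, value in results.items():
    print(f"{label}: {value:.1f}%")
```

The recomputed values round to the figures on the slide (+25%, +160%, +6%, +8%, 17.5%, 58%).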

SPEC-CPU2006 – Initial Results
- Each run included four instances of the same SPEC-CPU2006 benchmark
- Average: +3.3% throughput, -3.5% energy
- Notable results:
  - 429: +106% throughput, -43% energy
  - 473: +3.3% throughput, -13% energy
- Out of 25 benchmarks:
  - 16 consumed less energy (9 consumed more)
  - 10 ran faster (11 ran slower)
- Other results:
  - Energy efficiency improved by 11% on average
  - 15 benchmarks' energy efficiency improved by 20% on average
  - 10 benchmarks' energy efficiency degraded by 3% on average
- A soft limit for the bandwidth is anticipated to improve the results
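The overall 11% energy-efficiency figure is consistent with the two group averages if each benchmark is weighted equally. The slides do not say how the average was computed, so the arithmetic mean below is an assumption.

```python
# (number of benchmarks, average energy-efficiency change in percent)
groups = [
    (15, +20.0),  # benchmarks whose energy efficiency improved
    (10, -3.0),   # benchmarks whose energy efficiency degraded
]

total_benchmarks = sum(count for count, _ in groups)
# Equal-weight mean across all benchmarks (assumed averaging method).
overall = sum(count * change for count, change in groups) / total_benchmarks

print(f"{total_benchmarks} benchmarks, overall change: {overall:.1f}%")
```

The result is 10.8%, which rounds to the reported 11%.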