
The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez (Computer Systems Laboratory), Michael C. Huang (Electrical & Computer Engineering)

The Thrifty Barrier – Li, Martínez, and Huang

Motivation
- Multiprocessor architectures are sprouting everywhere: large compute servers; small servers and desktops; chip multiprocessors
- High energy consumption is a problem – even more so in multiprocessors
- Most power-aware techniques are tailored to uniprocessors
- Multiprocessors present unique challenges: processor coordination and synchronization

Case: Barrier Synchronization
- Fast threads spin-wait for slower ones
- Spin-wait is wasteful by definition: quick reaction, but only the last iteration is useful
[Figure: thread timelines alternating compute and spin-wait phases at a barrier]
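The spin-wait the slide describes is the classic centralized sense-reversing barrier. A minimal sketch (illustrative, not the paper's exact implementation; names are my own):

```c
#include <stdatomic.h>

/* Centralized sense-reversing spin barrier. All threads but the last
 * spin-wait on the shared sense flag -- this busy loop is the energy
 * waste the thrifty barrier replaces with a processor sleep state. */
typedef struct {
    atomic_int count;   /* threads yet to arrive in this episode  */
    atomic_int sense;   /* global phase flag, flipped per episode */
    int nthreads;
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int n) {
    atomic_init(&b->count, n);
    atomic_init(&b->sense, 0);
    b->nthreads = n;
}

/* Each thread keeps a private local_sense, toggled every episode. */
void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {  /* last arriver        */
        atomic_store(&b->count, b->nthreads);   /* reset for next time */
        atomic_store(&b->sense, *local_sense);  /* release the others  */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin-wait: quick reaction, but burns energy */
    }
}
```

The last arriver flips the sense and resets the count; everyone else spins until the flip is observed, which is exactly the "quick reaction but wasteful" behavior the slide points out.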

Proposal: Thrifty Barrier
- Reduce spin-wait energy waste in barriers: leverage existing processor sleep states (e.g., ACPI)
- Minimize impact on execution time: achieve timely wake-up
[Figure: conventional vs. thrifty barrier timelines]

Challenges
- Should we sleep? Transition times (sleep + wake-up) are non-negligible
- Which sleep state? More energy savings → longer transition times
- When to wake up? Early w.r.t. barrier release may hurt energy savings; late w.r.t. barrier release may hurt performance
- Must predict barrier stall time accurately

Findings
- Many barrier stall times are large enough to leverage sleep states
- Stall times are predictable: discriminate through PC indexing; predict indirectly using barrier interval times
- Timely wake-up via a combination of two mechanisms: a coherence message bounds wake-up latency; a watchdog timer anticipates wake-up

Thrifty Barrier Mechanism
[Flowchart: barrier arrival → stall-time prediction → sleep? (No → spin) → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]

Sleep Mechanism
[Same flowchart, highlighting the sleep path: stall-time prediction → sleep? → sleep state S1/S2/S3]

Predicting Stall Time
- Splash-2's FMM example: 3 important barriers, 4 iterations; randomly picked thread (always the same)
- PC indexing reduces variability
- Interval time (BIT) is a more stable metric than stall time (BST)

Stall Time vs. Interval Time
- Barriers separate computation phases; PC indexing reduces variability
- Barrier stall time (BST) varies considerably even with PC indexing: barrier- but also thread-dependent – computation shifts among threads across invocations
- Barrier interval time (BIT) varies much less: quite stable if PC indexing is used; barrier- but not thread-dependent; last-value prediction is adequate for most applications
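A PC-indexed, last-value BIT predictor along the lines the slides describe might look like this (a sketch; the table size, hashing, and names are my own illustrative choices, not the paper's):

```c
#include <stdint.h>

#define PRED_ENTRIES 64

/* One entry per static barrier site, indexed by the PC of the
 * barrier call -- PC indexing keeps different barriers' interval
 * times from polluting each other's history. */
typedef struct {
    long last_bit;   /* most recently observed interval time */
    int  valid;      /* has this entry been trained yet?      */
} pred_entry_t;

static pred_entry_t pred_table[PRED_ENTRIES];

static pred_entry_t *pred_lookup(uintptr_t barrier_pc) {
    /* direct-mapped index; drop the low instruction-alignment bits */
    return &pred_table[(barrier_pc >> 2) % PRED_ENTRIES];
}

long predict_bit(uintptr_t pc) {
    pred_entry_t *e = pred_lookup(pc);
    return e->valid ? e->last_bit : -1;  /* -1: no history yet */
}

void train_bit(uintptr_t pc, long observed_bit) {
    pred_entry_t *e = pred_lookup(pc);
    e->last_bit = observed_bit;          /* last-value prediction */
    e->valid = 1;
}
```

Last-value prediction is the simplest possible predictor; the slides note it works because BIT is stable per barrier site once PC indexing separates the sites.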

Predicting Stall Time Indirectly
- BIT can be used to predict BST indirectly: compute time is measurable upon arrival at the barrier; subtract it from the predicted BIT to derive the predicted BST
- How to manage time info?
[Figure: timeline between consecutive barriers showing Compute_t, BST, and BIT]
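The subtraction the slide describes can be written down directly; a minimal sketch (function name is illustrative):

```c
/* Indirect stall-time prediction: the compute time since the last
 * release is measured on arrival and subtracted from the predicted
 * barrier interval time -- pBST = pBIT - Compute. */
long predict_bst(long pbit, long compute_time) {
    long pbst = pbit - compute_time;
    return pbst > 0 ? pbst : 0;  /* slower than predicted: no stall expected */
}
```

Clamping at zero covers a thread that arrives later than the prediction anticipated; such a thread should not sleep at all.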

Managing Time Info
- Threads depart from barrier instance b−1 toward instance b
- Each thread t has a local record of the release timestamp BRTS_{t,b−1}
- Assumptions: no global clock; the local wallclock stays active even if the CPU sleeps; all CPUs run at the same nominal clock frequency
[Figure: timeline from barrier b−1 to barrier b, marking BRTS_{t,b−1}]

Managing Time Info
- Thread t arrives, knowing BRTS_{t,b−1} and Compute_{t,b}: make prediction pBIT_b; derive pBST_{t,b} = pBIT_b − Compute_{t,b}; use pBST_{t,b} to pick a sleep state (if warranted) – best fit based on transition time
[Figure: timeline marking Compute_{t,b}, pBIT_b, and pBST_{t,b}]
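The "best fit based on transition time" step can be sketched as choosing the deepest sleep state whose round-trip transition still fits within the predicted stall. The cycle counts below are made-up illustrations, not measurements from any real CPU:

```c
/* Best-fit sleep-state selection: deepest state whose transition
 * time (sleep + wake-up) fits within the predicted stall time. */
typedef struct {
    const char *name;
    long transition_cycles;  /* illustrative round-trip cost */
} sleep_state_t;

static const sleep_state_t sleep_states[] = {  /* shallow -> deep */
    { "S1",   100 },
    { "S2",  1000 },
    { "S3", 10000 },
};
enum { N_STATES = sizeof sleep_states / sizeof sleep_states[0] };

/* Returns an index into sleep_states, or -1: stall too short, just spin. */
int pick_sleep_state(long pbst) {
    int best = -1;
    for (int i = 0; i < N_STATES; i++)
        if (sleep_states[i].transition_cycles <= pbst)
            best = i;
    return best;
}
```

This captures the slide's trade-off: deeper states save more energy but cost more transition time, so they are only worthwhile for long predicted stalls.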

Managing Time Info
- The last thread u arrives, knowing BRTS_{u,b−1}: derive the actual BIT_b = time() − BRTS_{u,b−1}; update the (shared) predictor with BIT_b; release the barrier
[Figure: timeline marking BIT_b measured from BRTS_{u,b−1}]

Managing Time Info
- Every thread t (possibly after waking up late): read BIT_b from the updated predictor; compute the actual BRTS_{t,b} = BRTS_{t,b−1} + BIT_b
- Threads never use timestamps (BRTS) from other threads: no global clock is needed
[Figure: timeline marking BRTS_{t,b} = BRTS_{t,b−1} + BIT_b]
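The two timestamp updates in the slides' scheme are simple local-clock arithmetic; a sketch (function names are illustrative):

```c
/* Local-clock bookkeeping around a barrier release. The last arriver
 * derives the actual interval time from its own clock and publishes
 * it; every thread then advances its private release timestamp by
 * that shared BIT -- no thread ever reads another thread's clock. */

/* Last arriver u: BIT_b = time() - BRTS_{u,b-1}. */
long derive_actual_bit(long now, long brts_last_arriver) {
    return now - brts_last_arriver;
}

/* Every thread t: BRTS_{t,b} = BRTS_{t,b-1} + BIT_b. */
long advance_brts(long brts_prev, long actual_bit) {
    return brts_prev + actual_bit;
}
```

Because each thread's new timestamp is its own old timestamp plus a shared duration, small offsets between local clocks never mix, which is why no global clock is needed.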


Wake-up Mechanism
[Same flowchart, highlighting the wake-up path: wake-up signal → residual spin → barrier departure]

Wake-up Mechanism
- Communicate barrier completion to sleeping CPUs: a signal is sent to a CPU pin; options: external vs. internal wake-up
- External (passive): initiated by the processor that releases the barrier; leverages the coherence protocol – invalidation of the spinlock; must supply the spinlock address to the cache controller
- Internal (active): triggered by a watchdog timer, programmed with the predicted BST before going to sleep

Early vs. Late Wake-up
- Early wake-up (underprediction): energy waste – residual spin
- Late wake-up (overprediction): possible impact on execution time
- External wake-up guarantees late wake-up (but bounded)
- Internal wake-up can lead to both (late wake-up not bounded)
- Our approach: hybrid wake-up – external provides an upper bound; internal strives for timely wake-up using prediction
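The hybrid scheme can be modeled as a race between the two wake-up sources: whichever fires first wakes the CPU. A small simulation sketch (purely illustrative; real hardware hooks are abstracted into timestamps):

```c
/* Hybrid wake-up model: an internal watchdog armed with the predicted
 * stall races the external coherence signal sent at barrier release.
 * The external signal bounds how late a wake-up can be; an early
 * internal wake-up (underprediction) causes residual spinning. */
typedef struct {
    long wake_time;  /* when the CPU actually wakes              */
    int  early;      /* woke before release: residual spin ensues */
} wakeup_t;

wakeup_t hybrid_wakeup(long sleep_start, long pbst,
                       long release_time, long coherence_latency) {
    long internal = sleep_start + pbst;               /* watchdog fires     */
    long external = release_time + coherence_latency; /* invalidation lands */
    wakeup_t w;
    w.wake_time = internal < external ? internal : external;
    w.early = w.wake_time < release_time;
    return w;
}
```

Underprediction wakes the CPU via the watchdog before release (energy lost to residual spin); overprediction is capped by the external signal at release time plus one coherence-message latency, which is the bound the slides refer to.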

Other Considerations (see paper)
- Sleep states that do not snoop coherence requests: flush dirty data before sleeping; defer invalidations to clean data
- Overprediction threshold: for frequent, swinging BITs of modest size, turn off prediction if overprediction exceeds the threshold
- Interaction with context switching and I/O: underprediction threshold
- Time-sharing issues: multiprogramming, overthreading

Experimental Setup
- Simulated system: 64-node CC-NUMA; 6-way dynamic superscalar; L1 16KB 64B 2-way 2clk; L2 64KB 64B 8-way 12clk; 16B/4clk memory bus, 60ns SDRAM; hypercube, wormhole, 4clk pipelined routers – 16clk pin to pin
- Energy modeling: Wattch (CPU + L1 + L2); sleep states modeled along the lines of the Pentium family

Experimental Setup
- All Splash-2 applications except: Raytrace – no barriers; LU – a better version without barriers is widely available
- Efficiency (64p): 40–82%, avg. 58%; target group ≥ 10%

Energy Savings
[Chart: per-application energy savings]

Performance Impact
[Chart: per-application performance impact]

Related Work Highlights
- Quite a bit of work in the uniprocessor domain
- Elnozahy et al.: server farms, clusters – the thrifty barrier targets shared memory and parallel applications
- Moshovos et al., Saldanha and Lipasti: energy-aware cache coherence – probably compatible with and complementary to the thrifty barrier

Conclusions
- Energy-aware multiprocessor mechanisms can and should be pursued
- Case study: energy-aware barrier synchronization – simple indirect prediction of barrier stall time; hybrid wake-up scheme to minimize impact on execution time
- Encouraging results on target applications: 17% avg. energy savings; 2% avg. performance impact
