Accelerating Asynchronous Programs through Event Sneak Peek

Presentation transcript:

Accelerating Asynchronous Programs through Event Sneak Peek Gaurav Chadha, Scott Mahlke, Satish Narayanasamy 17 June 2015 University of Michigan Electrical Engineering and Computer Science

Asynchronous programs are ubiquitous: mobile, web, Internet-of-Things, servers (node.js), sensor networks.

Asynchronous programming hides I/O latency: where the sequential model leaves the processor waiting for I/O between tasks, the asynchronous model runs other tasks (Task 2, Task 3) during the wait, yielding a speedup.
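
A minimal node.js sketch of this contrast (handleData, runOtherTasks, and input.txt are hypothetical stand-ins for application code, not names from the talk):

    const fs = require("fs");

    function handleData(data)  { console.log(`got ${data.length} bytes`); }
    function runOtherTasks()   { console.log("running Task 2, Task 3 ..."); }

    // Sequential model: the thread blocks here, waiting for the I/O.
    handleData(fs.readFileSync("input.txt"));

    // Asynchronous model: the read is issued, the thread moves on, and the
    // callback runs once the I/O completes, so the I/O latency is hidden.
    fs.readFile("input.txt", (err, data) => {
      if (err) throw err;
      handleData(data);
    });
    runOtherTasks(); // executes while the read is still in flight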

Asynchronous programming is well-suited to handling a wide array of asynchronous inputs. Computation is driven by events: the Hollywood Principle ("Don't call us, we'll call you").
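
A browser-flavored sketch of the Hollywood Principle; the handler names mirror the slide's events, while the element selectors are our assumptions:

    // The page never polls for input; it hands callbacks to the runtime,
    // and the runtime calls back when the matching event fires.
    function onClick()     { /* handle the click */ }
    function onImageLoad() { /* handle the loaded image */ }
    function onLocation(p) { /* handle the geolocation fix */ }

    document.getElementById("buy").addEventListener("click", onClick);
    document.querySelector("img").addEventListener("load", onImageLoad);
    navigator.geolocation.getCurrentPosition(onLocation);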

Illustration: Asynchronous Programming Model. A looper thread waits on the event queue (onClick, getLocation, onImageLoad, ...) and pops one event at a time for execution.
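
A schematic model of the looper thread, not the browser's real implementation:

    const eventQueue = [];

    function post(handler, data) {
      eventQueue.push({ handler, data }); // I/O, timers, and input enqueue here
    }

    function looperIteration() {
      if (eventQueue.length === 0) return false; // looper waits on events
      const e = eventQueue.shift();              // pop an event for execution
      e.handler(e.data);                         // run to completion
      return true;
    }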

Conventional architectures are not optimized for asynchronous programs: short events execute varied tasks, so the large instruction footprint destroys cache locality, and the scarcity of hot code causes poor branch prediction.

Large performance improvement potential in asynchronous programs. [chart: maximum performance improvement (%) per application]

Execute the asynchronous program on a specialized Event Sneak Peek (ESP) core of a heterogeneous multi-core processor.

Execute the asynchronous program on a specialized Event Sneak Peek (ESP) core. [figure: asynchronous JavaScript events run on the ESP core, while the browser engine's Parse, CSS, Layout, and Render tasks run on the other (WebCore) cores of the heterogeneous multi-core processor; Zhu & Reddi, ISCA '14]

How to customize a core for asynchronous programs?

The HTML5 asynchronous programming model guarantees sequential execution of events: the looper thread runs one event from the event queue at a time, to completion.

Opportunity: Event-Level Parallelism (ELP), a previously unexplored kind of parallelism. The event queue provides advance knowledge of future events, and events are functionally independent. How can this ELP be exploited?

#1: Parallel Execution. Ruled out: the events in the queue are not provably independent.

#2: Optimistic Concurrency, i.e., speculative parallelization (e.g., transactions). This fails too: >99% of event pairs conflict, primarily through low-level memory dependencies in maintenance code and memory pool recycling, as sketched below.
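
A hypothetical illustration of such a low-level conflict: two handlers whose visible work is disjoint still collide on shared runtime state, here a recycled object pool:

    const pool = [];
    const allocNode = () => (pool.length > 0 ? pool.pop() : {});
    const freeNode  = (n) => pool.push(n);

    function eventA() { const n = allocNode(); n.tag = "A"; freeNode(n); }
    function eventB() { const n = allocNode(); n.tag = "B"; freeNode(n); }

    // Run speculatively in parallel, eventA and eventB both read and write
    // `pool`, so a transactional scheme flags a conflict and rolls back.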

Observation: 98% of events "match" with 99% accuracy, in both control-flow paths and addresses, between a speculative pre-execution and the eventual normal execution. This makes speculative pre-execution a good fit.
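
An illustrative metric only; the paper's precise definition of a "match" may differ. The idea is to compare what a speculative pre-execution touched against what the eventual normal execution touches:

    function matchAccuracy(specTrace, realTrace) {
      if (realTrace.length === 0) return 1;
      const seen = new Set(specTrace); // addresses / branch outcomes seen early
      let hits = 0;
      for (const item of realTrace) if (seen.has(item)) hits++;
      return hits / realTrace.length; // ~0.99 for 98% of events, per the talk
    }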

How to customize a core for asynchronous programs? Our solution: exploit ELP through speculative pre-execution.

ESP Design: Expose the event queue to hardware. The runtime is already aware of future events through the software event queue; new ISA support propagates them into a hardware event queue, so the hardware gets to know them too.
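
A sketch of the software side; espEnqueue models the effect of the new ISA support, and the name is ours, not the paper's:

    const eventQueue = [];   // software event queue (the runtime has this today)
    const hwEventQueue = []; // models the hardware event queue

    function espEnqueue(handlerEntryPoint) {
      // In real hardware this would be a new instruction that writes the
      // future event's entry point into the H/W event queue.
      hwEventQueue.push(handlerEntryPoint);
    }

    function post(handler, data) {
      eventQueue.push({ handler, data });
      espEnqueue(handler); // make the future event visible to hardware
    }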

ESP Design: Speculatively pre-execute future events on stalls. When the normal execution misses in the LLC, a lightweight switch to another hardware context pre-executes a future event from the H/W event queue. The pre-execution's updates are isolated, its bottlenecks are memoized, and prefetches are later triggered from the memoized information during the real execution, which may begin millions of instructions later. Together, isolation, memoization, and triggering deliver the speedup.
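
Our reconstruction of the control policy, in JavaScript with `core` bundling the assumed hardware state:

    function onLongStall(core, hwEventQueue) {
      if (hwEventQueue.length === 0) return; // no future event to peek at
      const esp = core.espContext;           // lightweight extra HW context
      esp.pc = hwEventQueue[0];              // entry point of a future event
      esp.speculative = true;                // its updates stay isolated
      core.activeContext = esp;              // cheap switch while the LLC
    }                                        // miss is outstanding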

Realizing the ESP design: isolation, memoization, triggering. Isolation matters for correctness (isolate speculative updates) and for performance (avoid destructive interference between execution contexts).

Isolation of multiple execution contexts covers register state, memory state, and the branch predictor. [diagram: core pipeline with per-context PC and RRAT in the fetch unit, feeding the L1-I cache]

Isolation of memory state: cachelets isolate speculative updates from the L1-I and L1-D caches. Beyond correctness, this helps performance: the cachelets avoid polluting the L1 yet still capture 95% of reuse. A toy model follows.
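
A toy cachelet model, assuming a backing cache object with a read() method; speculative reads fill the cachelet, and speculative writes never reach the L1:

    class Cachelet {
      constructor(backing) {
        this.lines = new Map(); // small structure, far smaller than the L1
        this.backing = backing; // the regular L1/L2 hierarchy
      }
      read(addr) {
        if (!this.lines.has(addr)) {
          this.lines.set(addr, this.backing.read(addr)); // fill on first use
        }
        return this.lines.get(addr);
      }
      write(addr, value) {
        this.lines.set(addr, value); // isolated speculative update
      }
      discard() {
        this.lines.clear(); // drop speculative state when the event ends
      }
    }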

Isolation of the branch predictor: the PIR (path information register) tracks path history, and isolating just the PIR per context is adequate; the predictor tables themselves are left shared.

Realizing the ESP design: memoization. Merely warming up structures during speculative pre-execution is ineffective, because the future event might execute millions of instructions later.

Memoization of architectural bottlenecks: addresses. Record instruction and data addresses, along with the instruction count, into the I-List and D-List.

Memoization of architectural bottlenecks: branches. Record branch outcomes (branch address, directions and targets), along with the instruction count, into the B-List. A sketch of these lists follows.
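
A sketch of the memoized lists; field names are ours. Each entry pairs the dynamic instruction count at which a bottleneck occurred with enough information to remove it the next time:

    function makeList() {
      return { entries: [], record(e) { this.entries.push(e); } };
    }

    const iList = makeList(); // { instrCount, addr } of instruction fetches
    const dList = makeList(); // { instrCount, addr } of data accesses
    const bList = makeList(); // { instrCount, branchAddr, direction, target }

    // e.g., on a data-cache miss during pre-execution:
    //   dList.record({ instrCount: 1047, addr: 0x7f3a20 });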

Realizing the ESP design: triggering. Use the memoized lists to launch timely prefetches and to warm up the branch predictor ahead of branches.

Triggering timely prefetches using memoized information: ESP compares the current instruction count against each memoized (instruction count, address) entry and starts the prefetch ~100 instructions ahead of the recorded point, so the data arrives just in time.
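
A sketch of the trigger logic, reusing the list shape from the previous sketch; LEAD is the ~100-instruction head start from the slide, and prefetch is a caller-supplied callback:

    const LEAD = 100;

    function maybeTrigger(list, currentInstrCount, prefetch) {
      // Issue every prefetch whose recorded point is within LEAD
      // instructions of where the real execution currently is.
      while (
        list.entries.length > 0 &&
        list.entries[0].instrCount <= currentInstrCount + LEAD
      ) {
        prefetch(list.entries.shift().addr);
      }
    }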

Baseline architecture. [diagram: core pipeline with RRAT; fetch unit with PC; branch predictor (PIR + predictor tables); L1-I and L1-D caches with next-line (NL-I) and next-line/stride (NL-D,S) prefetchers; L2 cache]

ESP architecture, step 1. [diagram: the baseline plus a hardware event queue and an ESP-mode flag]

ESP architecture, step 2. [diagram: adds a second PC and PIR for the ESP context, plus an I-cachelet and D-cachelet alongside the L1 caches]

ESP architecture, step 3. [diagram: adds the B-List, I-List, and D-List memoization structures]

ESP architecture, step 4. [diagram: a second ESP context (ESP-1, ESP-2) with its own PC, so two future events can be pre-executed]

Methodology. Timing: trace-driven simulation with Sniper; Chromium was instrumented to collect traces of JavaScript events, which were then simulated. Energy: McPAT and CACTI.

Architectural model. Core: 4-wide issue, OoO, 1.66 GHz. L1-(I,D) caches: 32 KB, 2-way. L2 cache: 2 MB, 16-way. Energy modeling: Vdd = 1.2 V, 32 nm.

Limitations of runahead execution [Dundas et al. '97; Mutlu et al. '03]: runahead speculatively pre-executes past a data cache miss and so reduces data cache misses, but those are not a significant problem in web applications; it cannot mitigate I-cache misses; and, having no notion of events, it does not exploit ELP, even though future events are a rich source of independent instructions.

Events are short. Short events execute varied tasks; the large instruction footprint destroys cache locality, and little hot code causes poor branch prediction. Average event size in instructions per web app: amazon 55k, bing 53k, cnn 91k, facebook 232k, gdocs 372k, gmaps 472k, pixlr 56k. For example, the amazon trace (action: "Buy headphones") comprises 7,787 events and 433 million instructions, i.e., roughly 55k instructions per event.

ESP outperforms other designs. Performance improvement w.r.t. no prefetching (%): ESP 21.8, Runahead 12.5, Baseline 14.0. (Baseline: next-line (NL) + stride prefetching.)

ESP also outperforms the other designs when combined with next-line prefetching. Performance improvement w.r.t. no prefetching (%): ESP + NL 32.1, Runahead + NL 21.3, Baseline 14.0. (Baseline: next-line (NL) + stride prefetching.)

The largest performance improvement comes from improved I-cache performance. [chart]

ESP consumes less static energy but expends more dynamic energy. [chart: energy consumed w.r.t. no prefetching] ESP executes 21% more instructions, but consumes only 8% more energy.

Hardware area overhead comes from the cachelets, lists, and registers: 12.6 KB for ESP-1 and 1.2 KB for ESP-2.

Summary: an accelerator for asynchronous programs. ESP exploits Event-Level Parallelism (ELP) by exposing the event queue to hardware and speculatively pre-executing future events. Performance improvement: 16%.

Accelerating Asynchronous Programs through Event Sneak Peek Gaurav Chadha, Scott Mahlke, Satish Narayanasamy 17 June 2015 University of Michigan Electrical Engineering and Computer Science

Jumping ahead two events is sufficient

Impact of JS execution on response time. [chart: breakdown among JavaScript, DOM, CSS, Network, and Server; Chow et al. '14]

Client delay. [chart; Chow et al. '14]