EFetch: Optimizing Instruction Fetch for Event-Driven Web Applications Gaurav Chadha, Scott Mahlke, Satish Narayanasamy University of Michigan August,

Presentation transcript:

Slide 1: EFetch: Optimizing Instruction Fetch for Event-Driven Web Applications
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
University of Michigan, Electrical Engineering and Computer Science
August 2014

Slide 2: Evolution of the Web
Web 1.0: static web pages; passively view content (published content)
Web 2.0: dynamic web pages; collaborate and generate content (published content and user generated content)
Diagram labels: server, client

Slide 3: Evolution of the Web
Diagram labels: Web 1.0, Web 2.0, server, client, published content, user generated content, rich user experience, compute

Slide 4: Evolution of the Web
Web 1.0: yahoo.com in 1996
Web 2.0: yahoo.com in x more instructions executed
Good client-side performance: rich user experience, browser responsiveness

Slide 5: Core Specialization
Diagram labels: multi-core processor, Core 1 to Core 4, private caches

Slide 6: Web Core
Diagram labels: multi-core processor, Core 1 to Core 4, private caches, WebBoost, Web Core

Slide 7: WebBoost
Script performance: high L1-I cache misses
Goal: Specialized instruction prefetcher for web client-side script
Web browser computational components: web client-side script, other (compared for Web 1.0 and Web 2.0)
Web client-side script performance: browser responsiveness

Slide 8: Poor I-Cache Performance
Web pages tend to support numerous functionalities
– Large instruction footprint
– Lack hot code
Functionalities: graphics effects, image editing, online forms, document editing, web personalization, games, audio & video
Web client-side script inefficiencies: code bloat
– JIT compiled by JS engine (V8, IonMonkey, Nitro, Chakra)
– Dynamic typing

Slide 9: Lack of Hot Code
Figure labels: 95% 86020,400

Slide 10: Poor I-Cache Performance
Compared to conventional programs, JS code incurs many more L1-I misses
Perfect I-Cache: 53% speedup

Slide 11: Problem Statement
Problem: Poor web client-side script I-Cache performance
Opportunity: Web client-side scripts are executed in an event-driven model
Solution:
– Specialized prefetcher that is customized for the event-driven execution model
– Identifies distinct events in the instruction stream

Slide 12: Outline
– Event-driven Web Applications
– EFetch
– Facets of Instruction Prefetching
– Design and Architecture
– Methodology
– Results
– Conclusion

Slide 13: Web Browser Events
External input event: Mouse Click
Internal browser event: On Load

Slide 14: Event-driven Web Applications
Renderer thread and event queue (diagram shows events E1, E2, E3 and the queue head):
– Events are inserted into the queue
– The renderer thread pops an event from the head of the queue for execution
– Events execute on the JS engine
– Events generate other events
– When the event queue is empty, the program waits
External input events: mouse click, keyboard key press, GPS events
Internal events: timer event, DOMContentLoaded
Poor I-Cache performance:
– Different events tend to execute different code
– Events typically execute for a very short duration
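The event-queue model on this slide can be illustrated with a minimal Python sketch. This is not browser code: the `EventLoop` class, its method names, and the handler behavior are hypothetical and only mirror the slide's points that events are queued, popped from the head, and may generate further events.

```python
from collections import deque


class EventLoop:
    """Toy stand-in for the renderer thread's event queue (hypothetical names)."""

    def __init__(self):
        self.queue = deque()   # pending events, head at the left
        self.handlers = {}     # event type -> handler function
        self.log = []          # order in which events were executed

    def on(self, event_type, handler):
        self.handlers[event_type] = handler

    def post(self, event_type):
        # Events are inserted at the tail of the queue.
        self.queue.append(event_type)

    def run(self):
        # Pop events from the head until the queue is empty; a real renderer
        # would then wait for new events instead of returning.
        while self.queue:
            event_type = self.queue.popleft()
            self.log.append(event_type)
            handler = self.handlers.get(event_type)
            if handler is not None:
                handler(self)  # a handler may post further events
        return self.log


loop = EventLoop()
loop.on("DOMContentLoaded", lambda lp: lp.post("timer"))  # event generates an event
loop.on("click", lambda lp: None)
loop.post("DOMContentLoaded")
loop.post("click")
order = loop.run()
```

Each popped event runs a different handler for a short time, which is the access pattern the slide blames for poor I-Cache behavior.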

Slide 15: EFetch
Event Fetch: an instruction prefetcher for event-driven web applications
Technique:
– Uses an event ID to identify distinct events in the instruction stream
– The event ID is augmented to create an event signature that predicts control flow well

Slide 16: Event Signature
Event ID (from the event type and event handler):
– Formed by the browser
– Uniquely identifies an event
Event Signature (from the event ID and the function call context):
– Formed in hardware from the context depth (3) ancestor functions in the call stack
– Correlates well with the program control flow
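A minimal sketch of how an event signature could be derived, assuming a simple rotate-and-XOR hash. The slides do not state the actual hash, so `event_id`, `event_signature`, the 32-bit width, and the use of CRC32 are illustrative assumptions; only the context depth of 3 comes from the slide.

```python
import zlib

CONTEXT_DEPTH = 3        # from the slide: three ancestor functions
MASK32 = 0xFFFFFFFF      # assumed signature width


def event_id(event_type: str, handler_addr: int) -> int:
    """Hypothetical browser-side ID combining event type and handler address."""
    return (zlib.crc32(event_type.encode()) ^ handler_addr) & MASK32


def event_signature(eid: int, call_stack: list) -> int:
    """Fold the top CONTEXT_DEPTH call-stack entries into the event ID.

    The rotate-and-XOR mixing is an assumption, not the authors' hash.
    """
    sig = eid & MASK32
    for addr in call_stack[-CONTEXT_DEPTH:]:
        sig = ((sig << 5) | (sig >> 27)) & MASK32  # rotate left by 5 bits
        sig ^= addr & MASK32
    return sig


eid = event_id("click", 0x7F001000)
stack = [0x1000, 0x2000, 0x3000, 0x4000]   # oldest frame first
sig = event_signature(eid, stack)
```

Only the three most recent frames affect the result, so two call paths that differ deeper in the stack share a signature, while a different immediate caller yields a different one.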

Slide 17: Instruction Prefetcher: Facets
– What to prefetch?
– When to prefetch?

Slide 18: What to Prefetch?
Naïve solution: On a function call, prefetch the function body
– But this is too late
Our approach: On a function call, predict its callees and prefetch their function body addresses
Diagram: the event ID forms an event signature, which maps to callees c1, c2, c3 (ci = callee)

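Slide 19: Duplication of Addresses
A function can appear in two distinct event signatures
– Its body addresses might be duplicated
Diagram: for one event, callee h appears under both f and g, so the I-Cache addresses of h are recorded for each context

Slide 20: Compacting I-Cache Addresses
Diagram: the contexts (event, f, h) and (event, g, h) share one set of I-Cache addresses, and each context keeps its own callee bit vector, for example ( 1, 1, 1, 0 ) and ( 1, 0, 1, 1 )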

Slide 21: Recording Callees and Function Bodies
Diagram: the event signature indexes the Context Table, which records callees (c1, c2) and their callee bit vectors; callees are looked up in the Function Table
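The two-table organization can be sketched as follows. This is a software illustration under stated assumptions, not the hardware design: the class name `EFetchTables`, the 64-byte block size, and the choice of the lowest touched block as a callee's base address are all hypothetical.

```python
from collections import defaultdict

BLOCK = 64  # assumed I-Cache block size in bytes


class EFetchTables:
    """Toy model: Function Table holds per-callee base addresses,
    Context Table holds per-signature bit vectors over that callee's blocks."""

    def __init__(self):
        self.function_table = {}                # callee -> base block address
        self.context_table = defaultdict(dict)  # signature -> {callee: bit vector}

    def record(self, signature, callee, touched_addrs):
        """Record which blocks of `callee` were fetched under `signature`."""
        blocks = sorted({a // BLOCK * BLOCK for a in touched_addrs})
        if not blocks:
            return
        # The base is fixed the first time a callee is seen (assumption);
        # later blocks below the base are simply not recorded in this sketch.
        base = self.function_table.setdefault(callee, blocks[0])
        bv = self.context_table[signature].get(callee, 0)
        for b in blocks:
            if b >= base:
                bv |= 1 << ((b - base) // BLOCK)
        self.context_table[signature][callee] = bv

    def predict(self, signature):
        """Return block addresses to prefetch for the callees of `signature`."""
        out = []
        for callee, bv in self.context_table.get(signature, {}).items():
            base = self.function_table[callee]
            idx = 0
            while bv:
                if bv & 1:
                    out.append(base + idx * BLOCK)
                bv >>= 1
                idx += 1
        return out


tables = EFetchTables()
tables.record(0xA1, "h", [0x4000, 0x4044, 0x4080])  # h under one signature
tables.record(0xB2, "h", [0x4000, 0x40C0])          # h under another signature
```

Here `h` has a single Function Table entry, while each signature keeps only a small bit vector, which is the deduplication the preceding slides motivate.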

Slide 22: Instruction Prefetcher: Facets
– What to prefetch?
– When to prefetch?

Slide 23: When to Prefetch?
When: It is important to prefetch sufficiently in advance, but not too early
Goal: Prefetch the next predicted function
– Able to hide LLC hit latency
– Typically sufficient due to the low instruction miss rate in the LLC
Our design: Keep track of a speculative call stack, the Predictor Stack

Slide 24: Predictor Stack
– Maintains the call stack as predicted by the prefetcher
– Helps prefetch the next function predicted to be called
Diagram: f calls h, h returns, f calls i, i returns; the Predictor Stack tracks the Call Stack, and h and then i are the functions prefetched
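A minimal sketch of a predictor stack that stays one function ahead of execution. The class `PredictorStack`, its per-frame callee index, and the rule of advancing that index on every call are assumptions made for illustration; the slides only say that a speculative call stack is kept and used to prefetch the next predicted function.

```python
class PredictorStack:
    """Toy speculative call stack (hypothetical structure)."""

    def __init__(self, predicted_callees):
        # predicted_callees: function -> ordered list of callees, as a
        # context-table lookup might supply (assumption).
        self.predicted = predicted_callees
        self.stack = []  # entries: [function, index of next predicted callee]

    def _next_prediction(self):
        # Search from the top frame downward for the next function expected
        # to be called; a frame with no callees left defers to its caller.
        for func, idx in reversed(self.stack):
            callees = self.predicted.get(func, [])
            if idx < len(callees):
                return callees[idx]
        return None

    def on_call(self, func):
        if self.stack:
            self.stack[-1][1] += 1  # the caller's current callee is consumed
        self.stack.append([func, 0])
        return self._next_prediction()  # function to prefetch next

    def on_return(self):
        self.stack.pop()
        return self._next_prediction()


# Mirrors the slide's example: f calls h, h returns, f calls i, i returns.
ps = PredictorStack({"f": ["h", "i"]})
trace = [ps.on_call("f"), ps.on_call("h"), ps.on_return(),
         ps.on_call("i"), ps.on_return()]
```

While `h` is still executing, the sketch already names `i` as the next function to prefetch, which is the "one function ahead" behavior.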

Slide 25: Architecture
Diagram labels: Call Stack, Function Call Context, Event-ID, X, Event Signature, Context Table (ci, bv), Function Table (b1, b2, d), EA, predicted callees and addresses, Predictor Stack, Prefetch Queue

Slide 26: Methodology
Instrumented an open source browser, Chromium
– It uses the V8 JS engine, shared with Google Chrome
Browsing sessions of popular websites were studied
– Their instruction traces were simulated with Sniper Sim
Our focus was on JS code execution, which was simulated

Slide 27: Architectural Details
Modeled after the Samsung Exynos 5250
Core: 4-wide OoO, 1.66 GHz
L1-(I,D) Cache: 32 KB, 2-way
L2 Cache: 2 MB, 16-way
Energy Modeling: Vdd = 1.2 V, 45 nm

Slide 28: Related Work
We compare EFetch with the following designs:
– L1I-64KB: hardware overhead of EFetch provisioned towards extra L1-I cache capacity (64 KB)
– N2L: Next-2 line prefetcher
– CGP: Call Graph Prefetching (Annavaram et al., HPCA '01)
– PIF: Proactive Instruction Fetch (Ferdman et al., MICRO '11)
– RDIP: Return address stack Directed Instruction Prefetching (Kolli et al., MICRO '13)

Slide 29: Prefetcher Efficacy

Slide 30: Performance

Slide 31: Energy Consumption
Table: Overhead (KB) per design (CGP, PIF, RDIP, EFetch)
Prefetching hardware structures consume little energy
– Ranging from 0.01% of the total energy consumed for EFetch to 1.06% for PIF
Erroneous prefetches consume a significant fraction of energy

Slide 32: Energy, Performance, Area
Chart labels: EFetch, PIF, CGP, RDIP, N2L; Performance, Energy

Slide 33: Conclusion
– Web 2.0 places greater demands on client-side computing
– I-Cache performance is poor for web client-side script execution
– EFetch exploits the event-driven nature of web client-side script execution
– It achieves a 29% performance improvement over no prefetching

Slide 35: Performance Potential
Perfect I-Cache: 53% speedup

Slide 36: Web Core
A core equipped with simple microarchitectural enhancements that accelerate the browser, like MMX extensions for multimedia

Slide 37: Duplication of Addresses
A function can appear in two distinct event signatures
– Its body addresses might be duplicated
Diagram: event e1 with functions g1, h1, g2
– event sign 1: e1 g1
– event sign 2: e1 g2
I-Cache addresses shown: A B C; W X Y Z; A D E C

Slide 38: What to Prefetch?
Naïve solution: Keep track of all I-Cache blocks accessed for each event signature
Diagram: an event signature maps to prefetch addresses (I-Cache addresses A B C, W X Y Z)

Slide 39: Duplication of Addresses
A function can appear in two distinct event signatures
– Its body addresses might be duplicated
Callees and callee body addresses:
– c1: A, B, C, D, E
– c2: L, M, N, O
– c3: W, X, Y, Z
Aggregated addresses: A, B, C, D, E, L, M, N, O, W, X, Y, Z
Aggregating addresses over different contexts loses context information

Slide 40: Preserving Context Information
I-Cache block addresses are stored as a base address (b) and a bit vector (bv):
– c1 (A, B, C, D, E): b11 bv11, b12 bv12
– c2 (L, M, N, O): b21 bv21, b22 bv22
– c3 (W, X, Y, Z): b31 bv31, b32 bv32
Base addresses are constant for a callee for all event signatures
Bit vectors are specific to an event signature and are stored together
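The base-address-plus-bit-vector encoding can be shown with a short sketch. The 64-byte block size, the 8-bit vector width, and the function names `encode_blocks` and `decode_blocks` are assumptions for illustration; the slides do not give these parameters.

```python
BLOCK_SIZE = 64   # assumed I-Cache block size in bytes
BV_WIDTH = 8      # assumed number of blocks one bit vector can cover


def encode_blocks(base: int, blocks: list) -> int:
    """Encode block addresses as a bit vector relative to `base`."""
    bv = 0
    for addr in blocks:
        idx = (addr - base) // BLOCK_SIZE
        if not 0 <= idx < BV_WIDTH:
            raise ValueError(f"block {addr:#x} is outside the bit vector's range")
        bv |= 1 << idx
    return bv


def decode_blocks(base: int, bv: int) -> list:
    """Expand a bit vector back into block addresses to prefetch."""
    out = []
    idx = 0
    while bv:
        if bv & 1:
            out.append(base + idx * BLOCK_SIZE)
        bv >>= 1
        idx += 1
    return out


base = 0x4000                                          # shared by all contexts
bv_ctx_a = encode_blocks(base, [0x4000, 0x4040, 0x4080])
bv_ctx_b = encode_blocks(base, [0x4000, 0x4080, 0x40C0])
```

The same callee under two event signatures costs one base address plus two small bit vectors, instead of two full lists of block addresses.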

Slide 41: Recording Callees and Function Bodies
Context Table: indexed by event signature; for each callee (c1, c2, c3) it holds the callee address and its bit vectors (bv11 bv12, bv21 bv22, bv31 bv32)
Function Table: holds the base addresses (b1, b2)

Slide 42: Predictor Stack
Speculative call stack
– Maintains the call stack as predicted by the prefetcher
– Helps prefetch one function ahead of program execution
Synchronized with the Call Stack after every function call and return
Diagram labels: predicted callees and addresses, Predictor Stack, prefetch addresses

Slide 43: Evolution of the Web
Web 1.0: static web pages; passively view content (published content)
Web 2.0: dynamic web pages; collaborate and generate content (published content and user generated content)
Diagram labels: server, client

Slide 44: Evolution of the Web
Diagram labels: Web 1.0, Web 2.0, server, client, compute, published content, user generated content
Rich user experience: client-side performance matters