Accelerating Asynchronous Programs through Event Sneak Peek Gaurav Chadha, Scott Mahlke, Satish Narayanasamy 17 June 2015 University of Michigan Electrical Engineering and Computer Science
Asynchronous programs are ubiquitous: mobile, web, Internet-of-Things, servers (node.js), sensor networks
Asynchronous programming hides I/O latency. In the sequential model, execution waits for I/O between tasks; in the asynchronous model, Task 2 and Task 3 run while Task 1 waits for I/O, yielding a speedup.
Asynchronous programming is well-suited to handling a wide array of asynchronous inputs. Computation is driven by events: the Hollywood Principle ("Don't call us, we'll call you").
Illustration: asynchronous programming model. A looper thread waits on events (onClick, getLocation, onImageLoad); one event at a time is popped from the event queue for execution.
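To make the model concrete, here is a minimal C++ sketch of the looper thread (illustrative only; the handler names mirror the examples above): a single thread pops callbacks from a FIFO event queue and runs each one to completion, which is why handlers never overlap.

```cpp
#include <functional>
#include <iostream>
#include <queue>

// One event = one callback to run to completion.
using Event = std::function<void()>;

int main() {
    std::queue<Event> eventQueue;

    // Events arrive from asynchronous sources (UI, sensors, network, timers).
    eventQueue.push([] { std::cout << "onClick handler\n"; });
    eventQueue.push([] { std::cout << "getLocation callback\n"; });
    eventQueue.push([] { std::cout << "onImageLoad handler\n"; });

    // Looper thread: pop one event at a time and run it to completion.
    // Handlers never overlap, which gives the sequential semantics.
    while (!eventQueue.empty()) {
        Event e = std::move(eventQueue.front());
        eventQueue.pop();
        e();
    }
    return 0;
}
```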
Conventional architectures are not optimized for asynchronous programs: short events execute varied tasks, the large instruction footprint destroys cache locality, and little hot code causes poor branch prediction.
Large performance improvement potential in asynchronous programs. [Figure: maximum performance improvement (%) for each web application studied.]
Execute the asynchronous program on a specialized Event Sneak Peek (ESP) core within a heterogeneous multi-core processor, alongside conventional CPU cores.
Execute asynchronous JavaScript events on the specialized ESP core of a heterogeneous multi-core processor, alongside a conventional CPU core and specialized WebCore cores for the browser-engine stages (Parse, CSS, Layout, Render) [Zhu & Reddi, ISCA '14].
How to customize a core for asynchronous programs?
The HTML5 asynchronous programming model guarantees sequential execution of events: a single looper thread processes the event queue one event at a time.
Opportunity: Event-Level Parallelism (ELP), a previously unexplored kind of parallelism. The event queue gives advance knowledge of future events, and events are functionally independent. How do we exploit this ELP?
#1: Parallel execution. Events in the queue are not provably independent, so they cannot safely be run in parallel.
#2: Optimistic concurrency. Speculative parallelization (e.g., transactions) does not help: >99% of event pairs conflict, primarily through low-level memory dependencies such as maintenance code, memory pool recycling, …
Observation: 98% of events "match" a future event with 99% accuracy in control-flow paths and addresses. This enables speculative pre-execution of a future event from the event queue, followed later by its normal execution.
How to customize a core for asynchronous programs? Our solution: exploit ELP through speculative pre-execution.
ESP design: expose the event queue to hardware. The runtime already knows about future events through the software event queue; ISA support passes that knowledge on to a hardware event queue, so the hardware learns about future events too (sketch below).
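A rough sketch of what that runtime-to-hardware hand-off could look like, assuming an ESP-style hint instruction and a hardware queue depth of 2 (the names esp_hint_event, PendingEvent, and expose_event_queue are illustrative, not the paper's ISA):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// What the hardware needs to know about a pending event.
struct PendingEvent {
    uint64_t handler_pc;  // entry point of the event handler
    uint64_t arg_ptr;     // pointer to its argument block
};

// Stand-in for an ESP-style "expose event" instruction; a real design
// would write this record into the hardware event queue.
inline void esp_hint_event(const PendingEvent& e) { (void)e; }

// Called by the runtime whenever the software event queue changes:
// mirror the first few pending events into the (small) hardware queue.
void expose_event_queue(const std::deque<PendingEvent>& sw_queue) {
    constexpr std::size_t kHwQueueDepth = 2;  // assumed hardware depth
    for (std::size_t i = 0; i < sw_queue.size() && i < kHwQueueDepth; ++i)
        esp_hint_event(sw_queue[i]);
}
```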
ESP design: speculatively pre-execute future events on stalls. Having exposed the event queue to hardware, the core reacts to an LLC miss with a lightweight switch to another hardware context and pre-executes a future event, even though that event may run millions of instructions later. Three mechanisms tie this together: Isolate the speculative execution, Memoize what it learns (warm-up), and Trigger that information later, yielding speedup.
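A simulator-flavored sketch of this policy (the structure and helper names are assumptions; the microarchitectural actions are placeholders): enter a sneak-peek mode on an LLC miss, pre-execute a future event in isolation while memoizing what it learns, and switch back when the miss returns.

```cpp
#include <cstdint>

enum class Mode { Normal, SneakPeek };

struct EspCore {
    Mode mode = Mode::Normal;

    // On a long-latency LLC miss, make a lightweight switch to a spare
    // hardware context and start pre-executing the next future event.
    void on_llc_miss() {
        if (mode == Mode::Normal && has_future_event()) {
            save_normal_context();
            mode = Mode::SneakPeek;
            begin_pre_execution(next_future_event());
        }
    }

    // When the miss returns, keep the memoized lists for later triggering
    // and resume normal execution of the current event.
    void on_miss_return() {
        if (mode == Mode::SneakPeek) {
            checkpoint_memoized_lists();
            restore_normal_context();
            mode = Mode::Normal;
        }
    }

    // Placeholders for microarchitectural actions (not modeled here).
    bool has_future_event() { return false; }
    uint64_t next_future_event() { return 0; }
    void save_normal_context() {}
    void restore_normal_context() {}
    void begin_pre_execution(uint64_t /*handler_pc*/) {}
    void checkpoint_memoized_lists() {}
};
```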
Realizing the ESP design: Isolation, Memoization, Triggering. Isolation provides correctness by isolating speculative updates, and performance by avoiding destructive interference between execution contexts.
Isolation of multiple execution contexts: register state, memory state, and branch predictor. For register state, the sneak-peek context gets its own PC at the fetch unit and its own RRAT in the core pipeline.
Isolation of multiple execution contexts: memory state. Small I-cachelets and D-cachelets next to the L1-I and L1-D caches isolate speculative updates; for performance, they avoid L1 pollution while still capturing 95% of reuse.
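A minimal sketch of a cachelet as a tiny direct-mapped structure (the 64-line, 64-byte-line organization is an assumption, not the paper's parameters): speculative fills land here instead of the L1 and are flushed when the sneak peek ends.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Cachelet {
    static constexpr std::size_t kLines = 64;     // assumed: 64 x 64 B = 4 KB
    static constexpr std::size_t kLineBytes = 64;

    std::array<uint64_t, kLines> tag{};
    std::array<bool, kLines> valid{};

    static std::size_t index(uint64_t addr) { return (addr / kLineBytes) % kLines; }
    static uint64_t line_tag(uint64_t addr) { return addr / kLineBytes; }

    // Hit if the speculative context already brought this line in.
    bool lookup(uint64_t addr) const {
        std::size_t i = index(addr);
        return valid[i] && tag[i] == line_tag(addr);
    }

    // Speculative fills land in the cachelet instead of the L1.
    void fill(uint64_t addr) {
        std::size_t i = index(addr);
        tag[i] = line_tag(addr);
        valid[i] = true;
    }

    // Discard all speculative state when the sneak peek ends.
    void flush() { valid.fill(false); }
};
```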
Isolation of multiple execution contexts: branch predictor. The PIR tracks path history; isolating just the PIR per context is adequate, so the predictor tables remain shared.
Realizing the ESP design: Memoization. Simply warming up structures during speculative pre-execution is ineffective, because future events might execute millions of instructions later.
Memoization of architectural bottlenecks: addresses. Record instruction and data addresses, along with the instruction count, in an I-List and a D-List.
Memoization of architectural bottlenecks: branches. Record branch outcomes (branch address, directions and targets, and the instruction count) in a B-List alongside the branch predictor.
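A sketch of the three memoized lists from this slide and the previous one, under assumed record formats: each entry pairs an address or branch outcome with the instruction count at which the speculative pre-execution observed it.

```cpp
#include <cstdint>
#include <vector>

// Address record: what was touched, and at which instruction count.
struct AddrRecord { uint64_t instr_count; uint64_t addr; };

// Branch record: address, outcome, target, and instruction count.
struct BranchRecord { uint64_t instr_count; uint64_t pc; bool taken; uint64_t target; };

struct MemoLists {
    std::vector<AddrRecord> i_list;    // instruction addresses (I-List)
    std::vector<AddrRecord> d_list;    // data addresses (D-List)
    std::vector<BranchRecord> b_list;  // branch outcomes (B-List)

    void record_fetch(uint64_t ic, uint64_t pc) { i_list.push_back({ic, pc}); }
    void record_load_store(uint64_t ic, uint64_t ea) { d_list.push_back({ic, ea}); }
    void record_branch(uint64_t ic, uint64_t pc, bool taken, uint64_t target) {
        b_list.push_back({ic, pc, taken, target});
    }
};
```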
Realizing the ESP design: Triggering. Use the memoized lists to launch timely prefetches and to warm up the branch predictor just ahead of branches.
Triggering timely prefetches using memoized information: ESP compares the current instruction count against the instruction counts recorded with each address and starts prefetching an entry about 100 instructions before it is expected to be needed.
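A sketch of that triggering logic; the roughly 100-instruction lead is from the slide, while the cursor-based walk over a memoized list is an assumed implementation detail:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A memoized entry: the address and the instruction count at which the
// sneak peek observed it (same shape as the I-List/D-List records).
struct MemoEntry { uint64_t instr_count; uint64_t addr; };

struct PrefetchTrigger {
    std::vector<MemoEntry> list;            // one memoized list (I or D)
    std::size_t cursor = 0;                 // next entry to consider
    static constexpr uint64_t kLead = 100;  // issue ~100 instructions early

    // Called as the matching event retires instructions during its
    // normal execution: prefetch entries just before they are needed.
    void on_commit(uint64_t current_instr_count) {
        while (cursor < list.size() &&
               list[cursor].instr_count <= current_instr_count + kLead) {
            issue_prefetch(list[cursor].addr);
            ++cursor;
        }
    }

    // Hand the address off to the cache hierarchy (placeholder).
    void issue_prefetch(uint64_t addr) { (void)addr; }
};
```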
Baseline architecture: core pipeline (fetch unit, PC, RRAT), branch predictor (PIR and predictor tables), L1-I and L1-D caches with next-line (NL-I, NL-D) and stride prefetchers, and an L2 cache.
ESP architecture: the baseline is extended with a hardware event queue and an ESP mode flag at the fetch unit. A sneak-peek context (ESP-1) adds its own PC and PIR, an I-cachelet and D-cachelet next to the L1 caches, and the memoized B-List, I-List, and D-List; a second context (ESP-2) replicates these per-context structures.
Methodology. Timing: trace-driven simulation with Sniper; we instrumented Chromium and collected and simulated traces of JavaScript events. Energy: McPAT and CACTI.
Architectural model. Core: 4-wide issue, OoO, 1.66 GHz. L1-(I,D) caches: 32 KB, 2-way. L2 cache: 2 MB, 16-way. Energy modeling: Vdd = 1.2 V, 32 nm.
Limitations of runahead execution [Dundas et al. '97; Mutlu et al. '03]. Runahead speculatively pre-executes past a data cache miss and so reduces data cache misses, which are not a significant problem in web applications; it cannot mitigate I-cache misses; and it does not exploit ELP, since it has no notion of events, even though future events are a rich source of independent instructions.
Events are short. Short events execute varied tasks; the large instruction footprint destroys cache locality, and little hot code causes poor branch prediction.
Example (amazon): the action "Buy headphones" triggers 7,787 events totaling 433 million instructions, about 55k instructions per event.
Event size (instructions per event): amazon 55k, bing 53k, cnn 91k, facebook 232k, gdocs 372k, gmaps 472k, pixlr 56k.
ESP outperforms other designs. Performance improvement w.r.t. no prefetching (%): ESP 21.8, Runahead 12.5, Baseline 14.0. (Baseline: next-line (NL) + stride prefetching.)
ESP outperforms other designs. Performance improvement w.r.t. no prefetching (%): ESP + NL 32.1, Runahead + NL 21.3, Baseline 14.0. (Baseline: next-line (NL) + stride prefetching.)
The largest performance improvement comes from improved I-cache performance. [Figure: achieved improvement (21, 28, 32%) compared against the maximum potential (52, 69, 79%).]
ESP consumes less static energy but expends more dynamic energy. Energy consumed w.r.t. no prefetching: ESP executes 21% more instructions, but consumes only 8% more energy.
Hardware area overhead. [Figure: storage breakdown of the cachelets, lists, and registers for ESP-1 and ESP-2; labeled values are 12.6 KB and 1.2 KB.]
Summary. ESP is an accelerator design for asynchronous programs that exploits Event-Level Parallelism (ELP): it exposes the event queue to hardware and speculatively pre-executes future events. Performance improvement: 16%.
Accelerating Asynchronous Programs through Event Sneak Peek Gaurav Chadha, Scott Mahlke, Satish Narayanasamy 17 June 2015 University of Michigan Electrical Engineering and Computer Science
Jumping ahead two events is sufficient
Impact of JS execution on response time. [Figure: response-time breakdown across JavaScript, DOM, CSS, network, and server; Chow et al. '14.]
Client delay [Chow et al. '14]