Accelerating Asynchronous Programs through Event Sneak Peek

1 Accelerating Asynchronous Programs through Event Sneak Peek
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy. 17 June 2015. University of Michigan, Electrical Engineering and Computer Science.

2 Asynchronous programs are ubiquitous
Mobile, Web, Internet-of-Things, servers (Node.js), sensor networks.

3 Asynchronous programming hides I/O latency
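[Diagram: Task 1, Task 2, and Task 3 on a timeline under the sequential model and under the asynchronous model. The asynchronous model overlaps the time spent waiting for I/O with other tasks, giving a speedup.]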

4 Asynchronous programming is well-suited to handle a wide array of asynchronous inputs
Computation is driven by events. The Hollywood Principle: “Don’t call us, we’ll call you”.

5 Illustration: Asynchronous Programming Model
The Looper Thread waits on events and pops one event at a time from the Event Queue for execution. Events shown in the queue: onClick, getLocation, onImageLoad. [Diagram also carries the label "Web".]
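The looper model on this slide can be sketched in a few lines of Python. This is only an illustration of the queue-and-pop pattern, not code from the talk: the `Looper` class and its method names are hypothetical, and the handlers merely log the event names that appear on the slide.

```python
from collections import deque
from typing import Callable, Deque, List

Event = Callable[[], None]


class Looper:
    """Single-threaded event loop: events run one at a time, in queue order."""

    def __init__(self) -> None:
        self.queue: Deque[Event] = deque()

    def post(self, event: Event) -> None:
        """Append an event to the back of the queue."""
        self.queue.append(event)

    def run_until_empty(self) -> None:
        # A real looper would block and wait for new events.
        # This sketch simply stops once the queue has drained.
        while self.queue:
            event = self.queue.popleft()  # pop an event for execution
            event()


log: List[str] = []
looper = Looper()
looper.post(lambda: log.append("onClick"))
looper.post(lambda: log.append("getLocation"))
looper.post(lambda: log.append("onImageLoad"))
looper.run_until_empty()
print(log)
```

Because only one event runs at a time, the handlers never interleave, which is the sequential-execution guarantee a later slide relies on.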

6 Conventional architecture is not optimized for asynchronous programs
Short events execute varied tasks: large instruction footprint, destroys cache locality. Little hot code causes poor branch prediction. [Diagram: the Event Queue under the asynchronous model next to the processor view.]

7 Large performance improvement potential in asynchronous programs
[Chart: Maximum Performance Improvement (%). Bar values as extracted, series labels lost: 9.8, 8.4, 6.3, 24, 2.3, 0.7, 4.4, 5.5, 1.3, 52, 69, 79.]

8 Execute asynchronous program on a specialized Event Sneak Peek (ESP) core
[Diagram: Heterogeneous Multi-core Processor with CPU cores.]

9 Asynchronous JavaScript Events
Execute asynchronous program on a specialized Event Sneak Peek (ESP) core. [Diagram labels: Asynchronous JavaScript Events; Browser Engine (Parse, CSS, Layout, Render); CPU; ESP; WebCore (Zhu & Reddi, ISCA ’14); Heterogeneous Multi-core Processor.]

10 How to customize a core for asynchronous programs?

11 HTML5 asynchronous programming model guarantees sequential execution of events
Event Queue Looper Thread

12 Opportunity: Event-Level Parallelism (ELP)
Advance knowledge of future events in the Event Queue. Events are functionally independent. ELP is a previously unexplored kind of parallelism. How to exploit this ELP?

13 #1: Parallel Execution
Events in the Event Queue are not provably independent.

14 #2: Optimistic Concurrency
Speculative parallelization (e.g., transactions) of events in the Event Queue. >99% of event pairs conflict, primarily through low-level memory dependencies: maintenance code, memory pool recycling.

15 Observation: 98% of events “match” with 99% accuracy
Speculative pre-execution of an event from the Event Queue is compared with its later normal execution. Matched on: control flow paths, addresses.

16 How to customize a core for asynchronous programs?
Our solution: exploit ELP through speculative pre-execution.

17 ESP Design: Expose event-queue to hardware
The runtime is aware of future events because of the software Event Queue. Exposing the queue through the ISA lets hardware know about them as well, in a H/W Event Queue.

18 ESP Design: Speculatively pre-execute future events on stalls
Having exposed the event queue to hardware, we speculatively pre-execute future events on stalls such as an LLC miss, using a lightweight switch to another hardware context. Three mechanisms work together: Isolate, Memoize, Trigger. [Diagram labels: H/W Event Queue; LLC miss; Warm-Up; millions of instructions; speedup.]

19 Realizing ESP design
Isolation, Memoization, Triggering. Isolation serves correctness (isolate speculative updates) and performance (avoid destructive interference between execution contexts).

20 Isolation of multiple execution contexts
Register State, Memory State, Branch Predictor. [Diagram labels for register state: Core Pipeline; RRAT; Fetch Unit; PC (x2); L1-I cache; ESP.]

21 Isolation of multiple execution contexts
Register State, Memory State, Branch Predictor. Memory state: cachelets isolate speculative updates. Performance: avoid L1 pollution, capture 95% of reuse. [Diagram labels: L1-I Cache; L1-D Cache; I-Cachelet; D-Cachelet; ESP.]
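As a rough software analogy for the cachelet idea, the sketch below keeps every speculative store in a small private buffer so that architectural memory is never modified. The class names, the capacity, and the LRU eviction policy are all assumptions made for illustration; the slides do not say how cachelets are organized or replaced.

```python
from collections import OrderedDict
from typing import Dict, Optional


class Cachelet:
    """Small private buffer for one speculative context (LRU is an assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines: "OrderedDict[int, int]" = OrderedDict()

    def read(self, addr: int) -> Optional[int]:
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return self.lines[addr]
        return None

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            # Evicted speculative data is dropped, never written back.
            self.lines.popitem(last=False)


class SpeculativeContext:
    """Loads see speculative values first; stores never reach memory."""

    def __init__(self, memory: Dict[int, int], cachelet: Cachelet) -> None:
        self.memory = memory
        self.cachelet = cachelet

    def load(self, addr: int) -> int:
        hit = self.cachelet.read(addr)
        return hit if hit is not None else self.memory.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        self.cachelet.write(addr, value)


memory = {0x10: 1}
ctx = SpeculativeContext(memory, Cachelet(capacity=4))
ctx.store(0x10, 99)
print(ctx.load(0x10), memory[0x10])
```

Dropping evicted lines is tolerable in this sketch because the speculative run only produces hints; a stale value can make a hint less accurate but cannot corrupt the normal execution.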

22 Isolation of multiple execution contexts
Register State, Memory State, Branch Predictor. Branch predictor state: PIR tracks path history. Isolating PIR is adequate. [Diagram labels: Branch Predictor; PIR (x2); Predictor Tables; ESP.]

23 Realizing ESP design Isolation Memoization Triggering
Warm-up during speculative pre-execution is ineffective: future events might execute millions of instructions later.

24 Memoization of architectural bottlenecks
Addresses, Branches. Addresses: record instruction and data addresses, along with instruction count, in the I-List and D-List. [Diagram labels: L1-I Cache; L1-D Cache; I-Cachelet; D-Cachelet; ESP.]

25 Memoization of architectural bottlenecks
Addresses, Branches. Branches: record branch outcomes (branch address, directions and targets, instruction count) in the B-List. [Diagram labels: Branch Predictor; PIR (x2); Predictor Tables; ESP.]
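The three memoization lists can be pictured as simple append-only records keyed by instruction count. The sketch below follows the fields named on these two slides; the record types, the `MemoLists` class, and its method names are hypothetical, and nothing here reflects the actual hardware encoding or list sizes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AddrRecord:
    instr_count: int  # instruction count at which the address was seen
    address: int


@dataclass(frozen=True)
class BranchRecord:
    instr_count: int
    branch_address: int
    taken: bool       # branch direction
    target: int


@dataclass
class MemoLists:
    """I-List, D-List and B-List filled during speculative pre-execution."""
    i_list: List[AddrRecord] = field(default_factory=list)
    d_list: List[AddrRecord] = field(default_factory=list)
    b_list: List[BranchRecord] = field(default_factory=list)

    def record_instruction_address(self, instr_count: int, address: int) -> None:
        self.i_list.append(AddrRecord(instr_count, address))

    def record_data_address(self, instr_count: int, address: int) -> None:
        self.d_list.append(AddrRecord(instr_count, address))

    def record_branch(self, instr_count: int, branch_address: int,
                      taken: bool, target: int) -> None:
        self.b_list.append(BranchRecord(instr_count, branch_address, taken, target))


lists = MemoLists()
lists.record_instruction_address(120, 0x4000)
lists.record_data_address(135, 0x9F00)
lists.record_branch(140, 0x4010, taken=True, target=0x4400)
print(len(lists.i_list), len(lists.d_list), len(lists.b_list))
```

Storing the instruction count with each entry is what later allows the entries to be replayed at roughly the right moment during normal execution.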

26 Realizing ESP design Isolation Memoization Triggering
Use memoized lists to launch timely prefetches and to warm up the branch predictor ahead of branches.

27 Triggering timely prefetches using memoized information
[Diagram: list of (Instr. Count, Address) entries recorded by ESP. Prefetches start about 100 instructions ahead of the Current Instr. Count.]
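One way to read this slide: during normal execution, an entry from a memoized list is turned into a prefetch once the current instruction count comes within about 100 instructions of the count recorded for that entry. The sketch below implements that reading. The `PrefetchTrigger` class is hypothetical, the lookahead of 100 is taken from the "~100 instr." label, and the comparison rule is an assumption rather than the documented hardware behavior.

```python
from typing import List, Tuple

LOOKAHEAD = 100  # "~100 instr." on the slide; the exact value is an assumption


class PrefetchTrigger:
    """Issues prefetches for memoized (instr_count, address) entries in order."""

    def __init__(self, memo_list: List[Tuple[int, int]],
                 lookahead: int = LOOKAHEAD) -> None:
        self.entries = sorted(memo_list)  # ordered by recorded instruction count
        self.lookahead = lookahead
        self.next_index = 0

    def advance(self, current_instr_count: int) -> List[int]:
        """Return addresses whose recorded count is now within the lookahead."""
        issued: List[int] = []
        while (self.next_index < len(self.entries)
               and self.entries[self.next_index][0]
               <= current_instr_count + self.lookahead):
            issued.append(self.entries[self.next_index][1])
            self.next_index += 1
        return issued


trigger = PrefetchTrigger([(150, 0xA000), (400, 0xB000), (420, 0xC000)])
first = trigger.advance(0)     # nothing recorded within the first 100 instructions
second = trigger.advance(60)   # entry at count 150 is now within reach
third = trigger.advance(330)   # entries at counts 400 and 420 are within reach
print(first, second, third)
```

Each entry is issued at most once, so a prefetch is neither repeated nor launched so early that the line could be evicted before use.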

28 Baseline Architecture
[Diagram labels: Branch Predictor; PIR; Predictor; Core Pipeline; RRAT; Fetch Unit; PC; NL-I; NL-D,S; L1-I Cache; L1-D Cache; L2 cache.]

29 ESP Architecture
[Diagram labels: Branch Predictor; PIR; Predictor; Core Pipeline; RRAT; Fetch Unit; Event Queue; PC; ESP Mode; NL-I; NL-D,S; L1-I Cache; L1-D Cache; L2 cache; ESP.]

30 ESP Architecture: I-Cachelet, D-Cachelet
[Diagram labels: Branch Predictor; PIR (x2); Predictor; Core Pipeline; RRAT; Fetch Unit; Event Queue; PC (x2); ESP Mode; NL-I; NL-D,S; L1-I Cache; L1-D Cache; I-Cachelet; D-Cachelet; L2 cache; ESP.]

31 ESP Architecture: B-List, I-List, D-List, I-Cachelet, D-Cachelet
[Diagram labels: Branch Predictor; PIR (x2); Predictor; Core Pipeline; RRAT; B-List; Fetch Unit; Event Queue; PC (x2); ESP Mode; I-List; D-List; NL-I; NL-D,S; L1-I Cache; L1-D Cache; I-Cachelet; D-Cachelet; L2 cache; ESP.]

32 ESP Architecture
[Diagram labels: Branch Predictor; PIR (x2); Predictor; Core Pipeline; RRAT; B-List; Fetch Unit; Event Queue; PC (x3); ESP Mode; I-List; D-List; NL-I; NL-D,S; L1-I Cache; L1-D Cache; I-Cachelet; D-Cachelet; L2 cache; ESP-1; ESP-2.]

33 Methodology
Timing: trace-driven simulator, Sniper Sim. Instrumented Chromium; collected and simulated traces of JavaScript events. Energy: McPAT and CACTI.

34 Architectural Model
Core: 4-wide issue, OoO, 1.66 GHz. L1-(I,D) Cache: 32 KB, 2-way. L2 Cache: 2 MB, 16-way. Energy modeling: Vdd = 1.2 V, 32 nm.

35 Limitations of Runahead
[Dundas et al. ’97, Mutlu et al. ’03] Runahead starts speculative pre-execution on a data cache miss. It reduces data cache misses, which are not a significant problem in web applications. It cannot mitigate I-cache misses. It does not exploit ELP: it has no notion of events, although future events are a rich source of independent instructions.

36 Events are short Short events execute varied tasks
Short events execute varied tasks: large instruction footprint, destroys cache locality. Little hot code causes poor branch prediction.
Web App    Action           # Events   # Instructions   Event Size (instr)
amazon     Buy headphones   7,787      433 million      55k
bing       -                -          -                53k
cnn        -                -          -                91k
facebook   -                -          -                232k
gdocs      -                -          -                372k
gmaps      -                -          -                472k
pixlr      -                -          -                56k
(-: value not present in the transcript)

37 ESP outperforms other designs
Performance improvement w.r.t. no prefetching (%): ESP 21.8, Runahead 12.5, Baseline 14.0. Baseline = Next-line (NL) + Stride.

38 ESP outperforms other designs
Performance improvement w.r.t. no prefetching (%): ESP + NL 32.1, Runahead + NL 21.3, Baseline 14.0. Baseline = Next-line (NL) + Stride.

39 Largest performance improvement comes from improved I-cache performance
[Chart: bar values as extracted, series and axis labels lost: 52, 69, 79, 21, 28, 32.]

40 ESP consumes less static energy, but expends more dynamic energy
[Chart: energy consumed w.r.t. no prefetching.] ESP executes 21% more instructions, but consumes only 8% more energy.

41 Hardware area overhead
[Table: columns ESP-1 and ESP-2; rows Cachelets, Lists, Registers. Only two values survive in the transcript: 12.6 KB and 1.2 KB.]

42 Summary
Accelerators for asynchronous programs. ESP exploits Event-Level Parallelism (ELP): expose the event queue to hardware, speculatively pre-execute future events. Performance: 16%
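The slides do not say how the 16% figure was obtained. One reading that is consistent with the numbers on slide 38 is that it compares ESP + NL (32.1% over no prefetching) against the baseline with next-line and stride prefetching (14.0% over no prefetching). The sketch below works through that arithmetic under the assumption that both percentages are speedups relative to the same no-prefetching reference; the function name is illustrative only.

```python
def relative_improvement(new_pct: float, base_pct: float) -> float:
    """Improvement of `new` over `base`, both given as % over a common reference."""
    return ((1 + new_pct / 100) / (1 + base_pct / 100) - 1) * 100


# Slide 38: ESP + NL = 32.1, Baseline (NL + Stride) = 14.0
esp_nl_over_baseline = relative_improvement(32.1, 14.0)
print(f"{esp_nl_over_baseline:.1f}%")  # prints "15.9%"
```

Under that assumption the result rounds to the 16% quoted in the summary.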

43 Accelerating Asynchronous Programs through Event Sneak Peek
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy. 17 June 2015. University of Michigan, Electrical Engineering and Computer Science.

44 Jumping ahead two events is sufficient

45 Impact of JS execution on response time
[Chart categories: JavaScript, DOM, CSS, Network, Server.] Chow et al., ’14

46 Client delay (Chow et al., ’14)
