Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm
Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra
University of Edinburgh http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester http://intranet.cs.man.ac.uk/apt/projects/iTLS
Intl. Symp. on Workload Characterization - December 2010

Introduction
Thermal/power constraints, complexity, and time-to-market pressures lead to CMPs
Many simple cores = high TLP but low ILP
  – OK for throughput computing and embarrassingly parallel applications
Problem:
  – No benefits for sequential applications
  – Parallel applications with large sequential parts are still limited by Amdahl's law
=> Thread Level Speculation (TLS)
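The Amdahl limit on applications with large sequential parts can be made concrete with a small sketch. The function name `amdahl_speedup` and its parameters are illustrative assumptions and do not come from the talk.

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup when only part of the
 * sequential execution time can run in parallel.
 * parallel_fraction: share of sequential time that is parallelizable (0..1).
 * cores: number of cores applied to the parallel part (>= 1).
 * Name and signature are hypothetical, chosen for illustration only. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, if half of the execution time is parallelizable, four cores give at most 1 / (0.5 + 0.125) = 1.6x, which is why the sequential part matters even on many-core chips.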
Motivation
Shortcomings of prior work in assessing TLS performance potential
  – Evaluations are often tied to a particular TLS architectural configuration
  – Proposals of new extensions naturally focus on those extensions and do not investigate their interplay with other features
  – Workload choice is often limited to one particular domain or programming style
Contributions
In-depth implementation-independent study of TLS performance potential
Evaluate TLS architectural features
Evaluate workloads from a variety of domains
Investigate load imbalance and coverage within the context of TLS
Outline
Introduction
Background
Methodology
Results
Conclusions
Thread Level Speculation
Compiler deals with:
  – Task selection
  – Code generation
HW deals with:
  – Different contexts
  – Spawning threads
  – Detecting violations
  – Replaying
  – Arbitrating commit
[Figure: Thread 1 and Thread 2 (speculative) on a time axis]
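The hardware duty of detecting violations and replaying can be sketched in software. This is a minimal illustration under stated assumptions: a speculative thread logs every address it loads, and a store by an earlier (less speculative) thread to a logged address counts as a violation that squashes the later thread. The type `spec_thread`, the function names, and the fixed-size read set are hypothetical. Real TLS hardware tracks this in the cache hierarchy and also handles cases ignored here, such as a load that follows the thread's own store to the same address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_READS 64  /* illustrative capacity of the read set */

/* Speculative state of one thread: the addresses it has loaded so far. */
typedef struct {
    uintptr_t read_set[MAX_READS];
    size_t    n_reads;
    bool      squashed;
} spec_thread;

void spec_init(spec_thread *t)
{
    t->n_reads = 0;
    t->squashed = false;
}

/* Record a load performed by the speculative thread. */
void spec_load(spec_thread *t, uintptr_t addr)
{
    assert(t->n_reads < MAX_READS);
    t->read_set[t->n_reads++] = addr;
}

/* A store by an earlier thread. If the later thread has already loaded
 * this address, it consumed a stale value: mark it squashed and clear
 * its read set so that it can be replayed. Returns true on a violation. */
bool earlier_store(spec_thread *later, uintptr_t addr)
{
    for (size_t i = 0; i < later->n_reads; i++) {
        if (later->read_set[i] == addr) {
            later->squashed = true;
            later->n_reads = 0;
            return true;
        }
    }
    return false;
}
```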
Architectural Extensions
Multiversioned caches
Support for out-of-order spawning
Dynamic dependence synchronization
Intermediate checkpointing
Data value prediction
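As an illustration of the last extension, data value prediction lets a speculative thread proceed with a guessed value instead of waiting for the producer. The slide does not say which predictor is meant. The sketch below assumes a simple last-value predictor indexed by the address of the load instruction, and all names (`lv_predictor`, `lv_predict`, `lv_update`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256  /* illustrative number of predictor entries */

/* Last-value predictor: remembers the most recent value seen per entry. */
typedef struct {
    uint64_t last_value[TABLE_SIZE];
    bool     valid[TABLE_SIZE];
} lv_predictor;

static size_t lv_index(uintptr_t pc)
{
    return (size_t)(pc % TABLE_SIZE);
}

void lv_init(lv_predictor *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        p->last_value[i] = 0;
        p->valid[i] = false;
    }
}

/* Returns true and writes the prediction to *out if the entry is warm. */
bool lv_predict(const lv_predictor *p, uintptr_t pc, uint64_t *out)
{
    size_t i = lv_index(pc);
    if (!p->valid[i])
        return false;
    *out = p->last_value[i];
    return true;
}

/* Train the predictor with the value the load actually produced. */
void lv_update(lv_predictor *p, uintptr_t pc, uint64_t actual)
{
    size_t i = lv_index(pc);
    p->last_value[i] = actual;
    p->valid[i] = true;
}
```

A thread that used a prediction would compare it against the actual value once that value is known and replay on a mismatch, in the same way as for a dependence violation.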
Methodology
Benchmarks
  – Imperative: SPEC CPU 2006, Mediabench II
  – Object oriented: SPEC JVM 98, DaCapo
Instrumentation
  – GCC4 pass
      Annotate loop iterations and method bodies
      Mark induction and reduction variables and uses of return values
      Operate after the intermediate optimizations
  – Jikes RVM modification
Methodology
Trace Generation
  – Simics, full-system functional simulator
  – Non-intrusive trace of memory accesses
Trace-Driven Simulation
  – In-house simulator tool
      Extracts threads out of loop iterations and/or method call continuations
      Simulates: multi-versioned caches, OoO spawning, dynamic dependence synchronization, and value prediction
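To give a feel for what a trace-driven tool can derive from a memory-access trace, the sketch below counts cross-iteration read-after-write dependences, which are the accesses that could cause violations if loop iterations ran as speculative threads. It is not the authors' simulator. The record layout (`access`), the function name, and the quadratic backward scan are assumptions made for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One record of a memory-access trace (hypothetical layout). */
typedef struct {
    int       iteration;  /* loop iteration, i.e. candidate thread */
    uintptr_t addr;       /* address accessed */
    bool      is_write;   /* true for a store, false for a load */
} access;

/* Count loads whose most recent preceding store (in trace order) to the
 * same address was issued by an earlier iteration. */
size_t count_cross_iteration_raw(const access *trace, size_t n)
{
    assert(trace != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i].is_write)
            continue;
        /* Scan backwards from i-1 down to 0 for the latest store. */
        for (size_t j = i; j-- > 0; ) {
            if (trace[j].is_write && trace[j].addr == trace[i].addr) {
                if (trace[j].iteration < trace[i].iteration)
                    count++;
                break;
            }
        }
    }
    return count;
}
```

A real tool would replace the backward scan with a hash table of last writers. The scan keeps the example short.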
Methodology
Task Selection
  – In-order loop-level speculation
      Innermost loops
      Best loops out of three dynamic depth levels
  – In-order method and Out-of-Order speculation
      Dynamic thread spawning policy favoring safer threads
      Maximum thread size heuristic
  – All loops and/or methods are candidates
Loop-level speculation - Innermost
[Figure: Iter. 1, Iter. 2, …, Iter. n, with later iterations speculative]
    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            inner_loop_body1
            for (k = 0; k < n; k++) {
                spawn_thread();   /* spawn at the innermost loop level */
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }
Loop-level speculation - Innermost [figure]
Loop-level speculation – Best loop depth
[Figure: Iter. 1, Iter. 2, …, Iter. n, with later iterations speculative]
    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            spawn_thread();   /* spawn at the selected loop depth */
            inner_loop_body1
            for (k = 0; k < n; k++) {
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }
Loop-level speculation – Best loop depth [figure]
Method-level speculation - In-Order
[Figure: method and its continuation (Cont.), with the continuation speculative]
    pid = spawn_thread();
    if (pid != 0)
        method();
    method_Cont.
Method-level speculation - In-Order [figure]
Method-level speculation - OoO
[Figure: method1, method2, and the method1 continuation (Cont.) on a time axis, with speculative threads marked]
    pid = spawn_thread();
    if (pid != 0)
        method1();
    method1_Cont.

    method1() {
        method1_body1
        pid = spawn_thread();
        if (pid != 0)
            method2();
        method2_Cont.
    }
Method-level speculation - OoO [figure]
Mixed speculation - In-Order [figure]
Mixed speculation - OoO [figure]
Load Imbalance and Coverage [figure]
Results – Multi-versioning to the rescue? [figure]
25
Intl. Symp. on Workload Characterization - December 201025 Outline Introduction Background Methodology Results Conclusions
Conclusions
Load imbalance and limited coverage are important factors in realizing TLS performance
Support for OoO spawning does not provide significant benefits for the task policy employed
Multi-versioned caches unlock performance in some cases but are not a panacea
Task selection is critical
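The first conclusion can be illustrated with a back-of-the-envelope model. It is a sketch under strong assumptions and not the methodology of the paper: each task of a speculative region runs on its own core, nothing is squashed, and the region finishes when its longest task does. Coverage is the fraction of sequential time spent inside speculative regions. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup inside one speculative region: total work divided by the
 * longest task. Unequal task lengths (load imbalance) lower it. */
double region_speedup(const double *task_time, size_t n)
{
    assert(task_time != NULL && n > 0);
    double sum = 0.0, longest = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(task_time[i] > 0.0);
        sum += task_time[i];
        if (task_time[i] > longest)
            longest = task_time[i];
    }
    return sum / longest;
}

/* Whole-program speedup when only `coverage` of the sequential time
 * benefits from the in-region speedup. */
double program_speedup(double coverage, double in_region_speedup)
{
    assert(coverage >= 0.0 && coverage <= 1.0);
    assert(in_region_speedup > 0.0);
    return 1.0 / ((1.0 - coverage) + coverage / in_region_speedup);
}
```

Under this model, four equal tasks give a region speedup of 4, while tasks of lengths 4, 1, 1, and 2 give only 2. With 50% coverage the latter yields about 1.33x overall, so both factors cap the achievable gain.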
Also in the paper
In-depth analysis of high coverage loops for selected benchmarks
Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler
OoO loop-level speculation
Outline of most of the proposed architectural and compiler extensions for TLS systems
Backup slides – Auto parallelizing compiler comparison [figure]
Backup slides – OoO loop [figure]