
The Potential for Software-Only Thread-Level Speculation. Depth Oral Presentation. Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza. Committee Members.


1 The Potential for Software-Only Thread-Level Speculation
Depth Oral Presentation
Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik
By: Chuck (Chengyan) Zhao
April 25, 2005

2 Chip Multi-Processor (CMP) is now everywhere
From all major companies:
- IBM: Power 4, Power 5, …
- Intel: Montecito, Smithfield, …
- AMD: Dual-core Opteron
- Sun: MAJC
- Sony, Toshiba, IBM: Cell
- …
[Figure labels: Power 4, dual-core Intel chip, Cell, dual-core Opteron]
Abundant Chip Multiprocessors

3 Improving Throughput with a Chip Multi-Processor
[Figure: processors (P) and caches (C) of a chip multiprocessor running several applications. Labels: Processor, Caches, Applications; axis: Execution Time]
Multiprogramming workload: improve throughput

4 Improving Single Application Performance with a Chip Multi-Processor
[Figure: processors (P) and caches (C) of a chip multiprocessor running one application; axis: Exec. Time]
Single application: need parallel threads to reduce execution time

5 Using Chip Multi-Processor for improvements
Improve throughput for multi-programming workload
- Easy
- CMP behaves like a normal MP
Improve single-application performance
- Hard
- Control and data dependences
- Proposed approach: Thread-Level Speculation (TLS)
CMP trade-offs

6 Thread-Level Speculation (TLS)
Enable the compiler to create parallel threads despite ambiguous data dependences
- Optimistically parallelize at compile time
- Detect violations and recover at runtime
[Flowchart] Compile time: parallelize without dependence detection. Run time: detect violation? No: commit modifications. Yes: squash and re-execute.
Optimistic at compile time, detect and recover at runtime
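The run-time half of the flowchart can be sketched in C. This is only an illustration under stated assumptions, not the mechanism proposed in the talk: the names `addr_set_t`, `track`, `has_violation`, and `resolve` are hypothetical, the capacity is arbitrary, and the sketch ignores the case where the later thread wrote a location itself before reading it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 32  /* arbitrary capacity for this sketch */

/* Hypothetical set of addresses touched by one speculative thread. */
typedef struct {
    const void *addr[MAX_TRACKED];
    size_t      count;
} addr_set_t;

typedef enum { TLS_COMMIT, TLS_SQUASH } tls_outcome_t;

/* Record one access; would be called from an instrumented load or store. */
void track(addr_set_t *set, const void *a)
{
    assert(set->count < MAX_TRACKED);
    set->addr[set->count++] = a;
}

/* A violation occurs when the logically later thread has read a location
 * that the logically earlier thread wrote. */
bool has_violation(const addr_set_t *earlier_writes,
                   const addr_set_t *later_reads)
{
    for (size_t w = 0; w < earlier_writes->count; w++)
        for (size_t r = 0; r < later_reads->count; r++)
            if (earlier_writes->addr[w] == later_reads->addr[r])
                return true;
    return false;
}

/* Decision box of the flowchart: no violation commits, a violation
 * squashes the later thread so that it can be re-executed. */
tls_outcome_t resolve(const addr_set_t *earlier_writes,
                      const addr_set_t *later_reads)
{
    return has_violation(earlier_writes, later_reads) ? TLS_SQUASH
                                                      : TLS_COMMIT;
}
```

The quadratic comparison is chosen for brevity; it also hints at why tracking every read and write in software is costly.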

7 Example of Thread-Level Speculation
Code to parallelize:
    for ( … ) {
        …
        *p = …;
        …
        … = … … *q;
        …
    }
- Cannot be parallelized by parallelizing compilers
- Uncertain dependence between *p and *q
- Might be runtime or user-input dependent
Break loop iterations into threads, explore uncertainty in each thread
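A concrete instance of the loop above may help. In this sketch, which is an assumption for illustration and not code from the talk, the targets of `p` and `q` come from index arrays that are only known at run time; `update` and `count_cross_iteration_deps` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Each iteration writes through p and reads through q. Whether one
 * iteration depends on an earlier one is decided by the index arrays,
 * which a compiler cannot see at compile time. */
void update(int *data, const size_t *wr, const size_t *rd, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int *p = &data[wr[i]];   /* *p = ...  */
        int *q = &data[rd[i]];   /* ... = *q  */
        *p = *q + 1;
    }
}

/* Counts iterations that read a location written by an earlier
 * iteration, i.e. true cross-iteration dependences. */
size_t count_cross_iteration_deps(const size_t *wr, const size_t *rd,
                                  size_t n)
{
    size_t deps = 0;
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (rd[i] == wr[j]) {
                deps++;
                break;
            }
        }
    }
    return deps;
}
```

With index arrays that never overlap, every iteration is independent and the loop could run fully in parallel; with overlapping indices, some iterations must observe earlier writes. TLS speculates on the first case and recovers in the second.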

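8 How Thread-Level Speculation works
[Figure: execution timeline (Exec. Time) under TLS. Labels: a load "… = *q", a store "*p = …", "violation", and "Recover" followed by a re-executed load "… = *q"]
Exploit available thread-level parallelism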

9 Thread-Level Speculation quick summary
Benefits
- Reduce inter-thread communication time among cores
- Scale
- New parallel programming model
Types of implementations
- Hardware only
- Combined hardware and software
- Software only
Thread-Level Speculation is good for Chip Multi-Processor

10 Thread-Level Speculation Implementation Diagram
[Diagram labels: Thread-Level Speculation, HW-only approach, SW-only approach, Our approach]
Overall picture of Thread-Level Speculation

11 Thread-Level Speculation Implementation Comparison
Hardware-only approach
- Lots of research
- Good speedup in simulation
- Nobody has built it yet: costly, risky, needs both HW and SW at the same time
- Outcome
  - HW-only TLS looks promising
  - Significant hardware changes
Software-only approach: limited work, limited progress
- Major problem: high overhead
  - Buffer memory for speculative state
  - Track each memory read and write for violation detection
  - Recover from failed speculation by re-execution
Quick summary on HW-only and SW-only approaches

12 Outline for the rest of the talk
- Hardware TLS schemes
- Software TLS schemes
- Our scheme
  - Our goals
  - Starting point
  - Potential applications
- Conclusion

13 Hardware-only Thread-Level Speculation
[Diagram labels: Thread-Level Speculation, HW-only approach, SW-only approach, Our approach]
Overall picture of HW-only TLS approach

14 Hardware Thread-Level Speculation Schemes
Lots of hardware TLS research
- CMU Stampede
- Stanford Hydra
- Wisconsin Multiscalar
- UIUC IA-COMA
- UMN Super-threaded architecture
- …
Convergence of hardware schemes
- Use cache to buffer speculative state
- Extend cache coherence protocol to track data dependence
Convergence of HW-only Thread-Level Speculation

15 Hardware TLS Schemes: quick summary
Result
- TLS is promising
- SPEC int improvement: 30% - 100%, depending on the aggressiveness of the hardware support
[Figure: CMP with hardware speculative buffer and enhanced cache coherence protocol. Labels: C (non-speculative), C, P, Sp-state]
Convergence of HW-only Thread-Level Speculation

16 Software-only Thread-Level Speculation
[Diagram labels: Thread-Level Speculation, HW-only approach, SW-only approach, Our approach]
Overall picture of SW-only TLS approach

17 Software-only Thread-Level Speculation Schemes
- LRPD Test: UIUC
- VM for dependence tracking: Spiros's, CMU
- Cintra's SW TLS: U Edinburgh
Problem of software-only approach: high overhead
- Try to reduce it
Overview of SW-only TLS approach

18 LRPD Test (UIUC)
+ Implemented entirely in software
– Applies only to array-based code
– No partial parallelism: the entire loop re-executes sequentially if there is any dependence
[Figure: timeline (Exec. Time) of parallel execution with software dependence tracking, followed by the check "was parallel execution safe?"]
Pros + Cons of LRPD

19 Dependence tracking using Virtual Memory
[Figure: timeline (Exec. Time) of software dependence tracking through VM pages. Labels: Virtual Memory; Synchronize: transfer VM pages?]
Pros + Cons of VM Tracking

20 CMU Spiros's approach: dependence tracking using Virtual Memory
- Coarse-grain, software-only
- Based on memory tracking
  - Virtual memory page protection mechanism
  - Uses software DSM (TreadMarks)
  - Synchronization through VM pages
- Through cost analysis: overhead is prohibitive
  - 2 sec (sequential) vs. 5 min (parallel)
  - Not a viable approach at this level of coarse granularity
SW-TLS through VM Tracking is not attractive

21 Cintra's SW TLS: Memory tracking tuned for performance
[Figure: timeline (Exec. Time) showing efficient tracking for array references]
Efficient but custom-made for arrays only

22 Cintra's software-only Thread-Level Speculation: quick summary
Features
- Software simulation of an extended cache coherence protocol
  - Provides a speculative state transition table
- Violation detection through speculative state comparison
- Instruments each load and store
Pros + Cons
+ Advanced implementation of the LRPD test
+ Implemented entirely in software
+ Covers partial parallelism
– Hand-crafted code for performance
– Applies only to array-based code
Summary of Cintra's work

23 Problems with Software Thread-Level Speculation
High overhead
- Buffer speculative state
- Track data dependences for all memory references
- Re-execute in case of failed speculation
Potential speedup
- Largely unexplored
Possible directions for future research
- Reduce overhead
- Achieve speedup from TLS parallelism
Summary of Software TLS

24 Our current Thread-Level Speculation approach
[Diagram labels: Thread-Level Speculation, HW-only approach, SW-only approach, Our approach]
Overall position for our SW TLS approach

25 Long term future plan
Goals
- Target Chip Multi-Processors: tightly-coupled MPs
- Apply to general-purpose code, not only arrays
- Minimize overhead
Capitalize on compiler analysis and optimizations
- Idempotency analysis
- Synchronization and communications
- PPA: Probabilistic Pointer Analysis framework (Jeff's work)
- Minimal backup and buffer retrieval analysis
- … more analyses we will invent
SW-only approach: room to improve
Starting point: highly efficient software checkpointing
Goals and Plans

26 Starting point: efficient software checkpointing
- Checkpoints are placed at some program points in the source code
- Buffer state changes between the current execution point and its latest checkpoint
- Execution can always efficiently rewind to its latest checkpoint
[Figure: program execution timeline with software checkpointing. Labels: Buffer memory changes; Buffer more memory changes]
Introduce software checkpointing

27 Potential use of Software checkpointing
Software rollback
- Automatic software TLS support
- Foundation of future automatic TLS parallelization
Debug
- Controlled rewind
Enhance application reliability
Speculative optimizations in uni-processor programs
- Larger window size
- Deep branch speculation
- Speculative code motion
What can software checkpointing do

28 Compiler analysis
- Local: Basic Block level
  - Back up only the memory writes that need it
  - Optimize to minimize the number of backups and the number of buffer retrievals
- Global: procedural level
  - Populate buffers through the control-flow graph
  - Iterate until the buffers stabilize
- Inter-procedural level
Potential approaches for software backup
- Undo backup
- Todo backup
Software checkpointing schemes
Build software checkpointing

29 Undo backup
- Compile-time analysis
- Backup once per distinct memory write per Basic Block
- Program continues to operate on non-backup memory
- Action upon execution completion
  - Commit: trash buffer
  - Rollback: restore from buffer
Undo backup properties

30 Undo backup example
Program, Basic Block level:
    …
    a = 10;
    b = 12;
    …
    c = a + b;
    …
Undo backup memory: (&a, [a]) (&b, [b]) (&c, [c])
Undo backup action, after the conflict check:
- N (no conflict): trash undo memory, continue with the next Basic Block
- Y (conflict): restore undo memory
Undo backup process
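The undo scheme can be sketched as a small log of (address, old value) pairs. This is a minimal illustration under assumptions: in the scheme described in the talk the backups would be inserted by the compiler, whereas here they are written by hand, and the names `undo_log_t`, `undo_backup`, `undo_commit`, `undo_rollback`, and `basic_block` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAPACITY 64  /* arbitrary limit for this sketch */

typedef struct {
    int *addr;  /* location that is about to be overwritten */
    int  old;   /* value it held at the checkpoint          */
} undo_entry_t;

typedef struct {
    undo_entry_t entry[UNDO_CAPACITY];
    size_t       count;
} undo_log_t;

/* Save the current value once, before the first write in the block. */
void undo_backup(undo_log_t *log, int *addr)
{
    assert(log->count < UNDO_CAPACITY);
    log->entry[log->count].addr = addr;
    log->entry[log->count].old  = *addr;
    log->count++;
}

/* Commit: memory was already updated in place, so just trash the log. */
void undo_commit(undo_log_t *log)
{
    log->count = 0;
}

/* Rollback: restore the saved values, newest first. */
void undo_rollback(undo_log_t *log)
{
    while (log->count > 0) {
        log->count--;
        *log->entry[log->count].addr = log->entry[log->count].old;
    }
}

/* The basic block from the slide, with one backup per distinct write. */
void basic_block(undo_log_t *log, int *a, int *b, int *c)
{
    undo_backup(log, a);  *a = 10;
    undo_backup(log, b);  *b = 12;
    undo_backup(log, c);  *c = *a + *b;
}
```

Reads such as `*a + *b` go straight to memory, which is the source of the undo scheme's speed: no buffer lookup is needed on the common path.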

31 Todo backup
- Performed at runtime
- Happens on every memory write inside a Basic Block
- Each following read might need to retrieve from the buffer
- Action upon completion (reverse of Undo type)
  - Commit: write back from buffer
  - Rollback: trash buffer
Todo backup properties

32 Todo backup example
Program, Basic Block level:
    …
    *p = a;
    *q = b;
    …
    … *p + *q;
    …
Todo backup memory: (p, a) (q, b)
Todo backup action, after the conflict check:
- N (no conflict): write todo backup to memory, continue with the next Block
- Y (conflict): trash todo backup
Todo backup process
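The todo scheme can be sketched as a write buffer of (address, new value) pairs that every read must consult. As before, this is a hedged illustration and not the talk's implementation: `todo_buf_t`, `todo_store`, `todo_load`, `todo_commit`, `todo_rollback`, and `basic_block_todo` are hypothetical names, and the linear search is chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_CAPACITY 64  /* arbitrary limit for this sketch */

typedef struct {
    int *addr;  /* destination of the deferred write */
    int  val;   /* value to be written at commit     */
} todo_entry_t;

typedef struct {
    todo_entry_t entry[TODO_CAPACITY];
    size_t       count;
} todo_buf_t;

/* Buffered write: memory itself is left untouched until commit. */
void todo_store(todo_buf_t *buf, int *addr, int val)
{
    assert(buf->count < TODO_CAPACITY);
    buf->entry[buf->count].addr = addr;
    buf->entry[buf->count].val  = val;
    buf->count++;
}

/* Every read checks the buffer first, newest entry first. */
int todo_load(const todo_buf_t *buf, const int *addr)
{
    for (size_t i = buf->count; i-- > 0; )
        if (buf->entry[i].addr == addr)
            return buf->entry[i].val;
    return *addr;
}

/* Commit: write the buffered values back in program order. */
void todo_commit(todo_buf_t *buf)
{
    for (size_t i = 0; i < buf->count; i++)
        *buf->entry[i].addr = buf->entry[i].val;
    buf->count = 0;
}

/* Rollback: trash the buffer; memory was never modified. */
void todo_rollback(todo_buf_t *buf)
{
    buf->count = 0;
}

/* The basic block from the slide: *p = a; *q = b; ... *p + *q */
int basic_block_todo(todo_buf_t *buf, int *p, int *q, int a, int b)
{
    todo_store(buf, p, a);
    todo_store(buf, q, b);
    return todo_load(buf, p) + todo_load(buf, q);
}
```

Because the buffer is keyed by run-time addresses, the sketch stays correct even when `p` and `q` alias, at the price of a lookup on every read.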

33 Backup Comparison
Undo
- Pro: fast
  - Few backups
  - No need to retrieve from the buffer on reads
- Con: memory address needs to be known statically
  - Scalar
  - Pointer to fixed location
Todo
- Pro: handles both scalar and general-purpose pointer cases
- Con: slow
  - Backup once per memory write
  - Each following read must be retrieved from the buffer
In reality: both types are used
Pros + cons of undo and todo

34 An example in reality: mixed mode
Code to execute:
    int a, b, c;
    int *p, *q;
    …
    (d) a = 1;
    (d) b = 2;
    (d) *p = 5;
    …
    (u) c = a + b;
    …
    (u) … = *q;
    …
Undo buffer: (&a, [a]) (&b, [b]) (&c, [c])
Todo buffer: (p, 5)
Combined-backup process in reality

35 Selection of backups in reality
Combined approach
- Undo: memory address known
  - Scalars
  - Pointers to fixed address
  - Compile-time analysis
- Todo: memory address unknown
  - Normal pointers
  - Run-time analysis
Plan for implementation
- Put into SUIF, as an optimization pass
- Minimize performance drop
Use both types together in reality

36 Conclusion
Thread-Level Speculation is compelling
- Potentially large performance gains
Challenge
- Software overhead
Limited SW TLS work
- No previous SW TLS works on general-purpose programs
- Killer advantage: compiler analyses
- Modest starting point: efficient software checkpointing
Summary

37 Questions and Answers

38 Concurrent HW-only Related Work
Approach          Composition    Compiler-assisted or Translator-only
DMT               HW-only
CSMP              HW-only
Trace Processor   HW-only
Krishnan99        SW/HW
Hydra             SW/HW
SVC               SW/HW
SUDS              SW/HW
Zhang99           SW/HW
Cintra00          SW/HW
STAMPede          SW/HW
Another view of HW-only Thread-Level Speculation Schemes

