Presentation transcript: "Compiler Optimization of Scalar Value Communication Between Speculative Threads", Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry, Carnegie Mellon University.

Slide 1: Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry
School of Computer Science, Carnegie Mellon University

Slide 2: Motivation
- Industry is delivering multithreaded processors, and improving throughput on them is straightforward.
- How can we use multithreaded processors to improve the performance of a single application? We need parallel programs.
- Example: the IBM Power4 processor has 2 processor cores per die, 4 dies per module, and 8 64-bit processors per unit.
[Figure: diagram of processors and cache.]

Slide 3: Automatic Parallelization
Finding independent threads in integer programs is limited by:
- Complex control flow
- Ambiguous data dependences
- Runtime inputs
More fundamentally, parallelization is determined at compile time.
=> Thread-Level Speculation: detect data dependences at runtime.

Slide 4: Thread-Level Speculation (TLS)
[Figure: threads T1, T2, and T3 executing in parallel over time; a load in a later thread conflicts with a store in an earlier thread, so the later thread must retry.]
How do we communicate values between threads?

Slide 5: Speculation

    while(1) {
      ... = *q;
      ...
      *p = ...;
    }

[Figure: threads Ti through Ti+3 each execute '...=*q' and '*p=...'; a conflicting store/load pair forces one thread to retry.]
Speculation is efficient when the dependence occurs infrequently.

Slide 6: Synchronization

    while(1) {
      ... = a;
      a = ...;
    }

[Figure: with synchronization, thread Ti+1 executes wait(a) before using a and stalls until thread Ti executes signal(a) after its last definition of a.]
Synchronization is efficient when the dependence occurs frequently.
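
The slides treat wait and signal as primitives for forwarding a scalar between consecutive threads. Purely for concreteness, here is a minimal software sketch of such a pair; the names fwd_slot, fwd_wait, and fwd_signal are illustrative assumptions, not the hardware mechanism the talk assumes.

    #include <stdatomic.h>

    /* Hypothetical forwarding slot between a producer thread and the next
     * (consumer) thread; the real design forwards values between cores. */
    struct fwd_slot {
        atomic_int ready;   /* set by the producer                        */
        int        value;   /* the forwarded scalar, e.g. the value of a  */
    };

    /* Consumer side of wait(a): stall until the previous thread produces a. */
    int fwd_wait(struct fwd_slot *s) {
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            ;   /* this spin is the synchronization stall measured on slide 8 */
        return s->value;
    }

    /* Producer side of signal(a): publish the value and release the consumer. */
    void fwd_signal(struct fwd_slot *s, int v) {
        s->value = v;
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }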

Slide 7: Critical Forwarding Path in TLS

    while(1) {
      wait(a);
      ... = a;
      a = ...;
      signal(a);
    }

The instructions between wait(a) and signal(a), i.e. the use and redefinition of a, form the critical forwarding path: the next thread cannot proceed past its own wait(a) until this path has executed.

Slide 8: Cost of Synchronization
The processors spend 43% of total execution time on synchronization.
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average, broken down into sync and other.]

Slide 9: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
- Performance
- Conclusions

Slide 10: Reducing the Critical Forwarding Path
[Figure: two thread schedules, one with a long critical forwarding path and one with a short one; the shorter path yields less execution time.]
A shorter critical forwarding path means less execution time.
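
To make the idea concrete, here is a small sketch of the transformation (my own illustration, not code from the talk): work that does not contribute to the forwarded value is moved out of the wait/signal window, so the stall seen by the next thread shrinks. The helper names (wait_a, signal_a, next_value, use, unrelated_work) are hypothetical.

    /* Sketch only: wait_a/signal_a stand for the slides' wait(a)/signal(a). */
    extern void wait_a(void);
    extern void signal_a(void);
    extern int  next_value(int);
    extern void use(int);
    extern void unrelated_work(void);

    void iteration_before(int *a) {
        wait_a();            /* stall until the previous thread forwards a  */
        use(*a);
        *a = next_value(*a);
        unrelated_work();    /* still inside the forwarding window          */
        signal_a();          /* the next thread waits for all of the above  */
    }

    void iteration_after(int *a) {
        wait_a();
        use(*a);
        *a = next_value(*a);
        signal_a();          /* forward a as soon as its final value is known */
        unrelated_work();    /* now off the critical forwarding path          */
    }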

Slide 11: A Simplified Example from GCC

    do {
      counter = 0;
      if (p->jmp)
        p = p->jmp;
      if (!p->real) {
        p = p->next;
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p->next;
    } while (p);

[Figure: the corresponding control flow graph (start; counter=0; p->jmp?; p=p->jmp; p->real?; p=p->next; q=p; the inner loop counter++; q=q->next; q?; p=p->next; end) and the linked list that p traverses through its jmp, real, and next fields.]

Slide 12: Insert Wait Before First Use of P
[Figure: the control flow graph from slide 11, before the wait is inserted.]

Slide 13: Insert Wait Before First Use of P
[Figure: the same graph with wait(p) inserted after counter=0, immediately before the first use of p at the p->jmp? test.]

Slide 14: Insert Signal After Last Definition of P
[Figure: the graph with wait(p) in place, before the signals are inserted.]

Slide 15: Insert Signal After Last Definition of P
[Figure: the graph with signal(p) inserted after each of the two definitions p=p->next, one on the !p->real path and one at the end of the loop body.]
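
Written back as source, the conservatively synchronized loop from slides 12 through 15 would look roughly like this (wait and signal are the TLS primitives from slide 6; the placement mirrors the graph above):

    do {
      counter = 0;
      wait(p);              /* before the first use of p                    */
      if (p->jmp)
        p = p->jmp;
      if (!p->real) {
        p = p->next;
        signal(p);          /* after the last definition of p on this path  */
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p->next;
      signal(p);            /* after the last definition of p on this path  */
    } while (p);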

Slide 16: Earlier Placement for Signals
[Figure: the same graph, but with the forwarded value computed and signaled earlier along each path instead of only after p=p->next.]
How can we systematically find these insertion points?

Slide 17: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
    - How to compute the forwarding value
    - Where to compute the forwarding value
  - Aggressive instruction scheduling
- Performance
- Conclusions

Slide 18: A Dataflow Algorithm
Given: the control flow graph (entry to exit). For each node in the graph, determine:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
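
The slides do not spell out the dataflow equations, so the sketch below is only a simplified stand-in: a backward fixed-point pass that marks the nodes at whose entry p already has its final value for the thread, which is the conservative condition for placing signal(p) there. The talk's actual analysis is stronger, since it also moves the defining instructions themselves (such as p=p->next) upward, as the next slides illustrate.

    #include <stdbool.h>

    enum { MAX_SUCC = 2 };

    struct node {
        struct node *succ[MAX_SUCC];
        int          nsucc;
        bool         defines_p;   /* this node contains a definition of p        */
        bool         p_final;     /* fact: p is already final at this node's entry */
    };

    /* One backward pass over all nodes; returns true if any fact changed. */
    static bool backward_pass(struct node *nodes, int n) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            struct node *nd = &nodes[i];
            bool fact = !nd->defines_p;      /* with no successors, only this matters */
            for (int s = 0; s < nd->nsucc && fact; s++)
                fact = nd->succ[s]->p_final;
            if (fact != nd->p_final) {
                nd->p_final = fact;
                changed = true;
            }
        }
        return changed;
    }

    /* Iterate to a fixed point.  Starting from p_final = true everywhere and
     * only ever lowering facts gives the precise all-paths solution. */
    void solve(struct node *nodes, int n) {
        for (int i = 0; i < n; i++)
            nodes[i].p_final = true;
        while (backward_pass(nodes, n))
            ;
    }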

Slide 19: Moving Instructions Across Basic Blocks
[Figure: three cases, (A), (B), and (C), of moving 'signal p' and, where necessary, the computation p=p->next upward across basic-block boundaries.]

Slides 20-26: Example (1)-(7)
[Figures: the control flow graph from slide 11, stepped through as the backward analysis attaches 'signal p', together with the computation p=p->next, to more and more nodes, starting from the 'end' node and moving up through the loop body.]

Slide 27: Nodes with Multiple Successors
[Figure: a node whose two successors each carry 'signal p; p=p->next'; because the successors agree, the pair can be moved up into the node itself.]

Slides 28-31: Example (7)-(10)
[Figures: the propagation continues farther up the graph; on the path that takes the jmp pointer, the annotation now also involves the assignment p=p->jmp.]

Slide 32: Nodes with Multiple Successors
[Figure: a node whose two successors carry different annotations, 'signal p; p=p->next' on one side and a computation that also involves p=p->jmp on the other, so the pair cannot simply be hoisted above the branch.]

Slides 33-34: Example (10)-(11)
[Figures: the annotated control flow graph after the backward propagation has gone as far as it can.]

Slide 35: Finding the Earliest Insertion Point
Find, at each node in the CFG:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
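
The slide gives only the definition, so the following is just one way to phrase it as an analysis over per-node facts (my paraphrase, not the paper's equations); can_compute is assumed to be supplied by the forwarding-value analysis of slide 18.

    #include <stdbool.h>

    struct cfg_node {
        struct cfg_node **pred;
        int               npred;
        bool              can_compute;  /* forwarding value computable here     */
        bool              clean_path;   /* some path from the entry reaches this
                                           node with no earlier node able to
                                           compute the forwarding value         */
    };

    /* One forward pass: the entry trivially has a clean path; any other node
     * has one if some predecessor has a clean path and cannot itself compute
     * the value.  Returns true if any fact changed. */
    bool forward_pass(struct cfg_node *nodes, int n) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            struct cfg_node *nd = &nodes[i];
            bool fact = (nd->npred == 0);        /* the entry node */
            for (int k = 0; k < nd->npred && !fact; k++)
                fact = nd->pred[k]->clean_path && !nd->pred[k]->can_compute;
            if (fact && !nd->clean_path) {
                nd->clean_path = true;
                changed = true;
            }
        }
        return changed;
    }

    void solve_clean_paths(struct cfg_node *nodes, int n) {
        for (int i = 0; i < n; i++)
            nodes[i].clean_path = false;
        while (forward_pass(nodes, n))
            ;   /* iterate to a fixed point */
    }

    /* 'Earliest' as defined on the slide: the node can compute the forwarding
     * value, and along some path no earlier node could. */
    bool is_earliest(const struct cfg_node *nd) {
        return nd->can_compute && nd->clean_path;
    }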

Slides 36-39: The Earliest Insertion Point (1)-(4)
[Figures: the earliest analysis stepping over the annotated graph and marking the earliest nodes at which the forwarding value can be computed and signaled.]

Slide 40: The Resulting Graph
[Figure: the transformed control flow graph: wait(p) follows counter=0; once the p->jmp branch is resolved, the forwarded value is computed into p1 (p1=p->next) and signaled immediately (signal(p1)); the later updates of p become p=p1.]
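
Rendered back as source, the graph on this slide corresponds roughly to the loop below; the graph places the p1=p->next; signal(p1) pair separately on each arm of the p->jmp branch, which at source level amounts to placing it right after that branch:

    do {
      counter = 0;
      wait(p);
      if (p->jmp)
        p = p->jmp;
      p1 = p->next;         /* compute the forwarded value early...        */
      signal(p1);           /* ...and release the next thread right away   */
      if (!p->real) {
        p = p1;
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p1;               /* reuse the value that was already signaled   */
    } while (p);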

Slide 41: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
    - Speculate on control flow
    - Speculate on data dependence
- Performance
- Conclusions

Slide 42: We Cannot Compute the Forwarded Value When...
- Control dependence: which definition produces the forwarded value (p=p->jmp versus p=p->next) depends on a branch that has not yet been resolved.
- Ambiguous data dependence: a possibly aliasing operation such as update(p) may change the value between the hoisted computation and the signal.
=> Speculation

Slide 43: Speculating on Control Dependences
[Figure: the pair p=p->next; signal(p) is hoisted above the branch on the assumption that the p->jmp arm is not taken; on the arm where p=p->jmp does execute, the compiler inserts violate(p), then recomputes p=p->next and signals the correct value.]
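
In source form the transformation might look roughly like this (my reconstruction, reusing the p1 temporary from slide 40; violate(p) is the slide's primitive for telling the consumer that the value it received was wrong):

    counter = 0;
    wait(p);
    p1 = p->next;        /* speculate that the p->jmp branch is not taken    */
    signal(p1);          /* forward the possibly wrong value immediately     */
    if (p->jmp) {
        violate(p);      /* mis-speculated: squash the consumer's use of p1  */
        p = p->jmp;
        p1 = p->next;    /* recompute and re-send the correct value          */
        signal(p1);
    }
    /* ... rest of the loop body as on slide 40, using p = p1 ... */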

Slide 44: Speculating on Data Dependence
[Figure: the pair p=load(addr); signal(p) is hoisted above a call to update(p) and above intervening stores (store1, store2, store3) that might write the loaded location.]
=> Hardware support

Slide 45: Hardware Support
[Figure: the hoisted load is bracketed by mark_load(addr) and unmark_load(addr); in the example, p=load(0xADE8) is marked, the later store2(0x8438) and store3(0x88E8) are checked against the marked address (here they do not conflict), and the mark is removed by unmark_load(0xADE8).]
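
Put together, the compiler-generated sequence for a data-speculative hoist might look roughly as follows; mark_load, unmark_load, and signal are the primitives named on the slides, while the exact calling convention and the violation/recovery path are assumptions of this sketch:

    mark_load(addr);     /* ask the hardware to watch this address            */
    p = load(addr);      /* the load, hoisted above the ambiguous stores      */
    signal(p);           /* forward the speculative value to the next thread  */

    update(p);           /* may execute store1/store2/store3; if any of them
                            writes the marked address, the hardware raises a
                            violation and the consumer is squashed            */

    unmark_load(addr);   /* past the last ambiguous store: stop watching      */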

Slide 46: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
- Performance
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
  - Hardware optimization for scalar value communication
  - Hardware optimization for all value communication
- Conclusions

Slide 47: Conservative Instruction Scheduling
[Chart: normalized execution time (busy, other, fail, sync) for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average; U = no instruction scheduling, A = conservative instruction scheduling.]
Conservative instruction scheduling improves performance by 15%.

Slide 48: Optimizing Induction Variables Only
[Chart: U = no instruction scheduling, I = induction variable scheduling only, A = conservative instruction scheduling.]
Induction variable scheduling alone accounts for a 10% performance improvement.

Slide 49: Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98] uses local analysis to schedule instructions across basic blocks and does not allow scheduling of instructions across inner loops.
[Chart: normalized execution time for gcc and go; M = Multiscalar scheduling, A = conservative scheduling.]

Slide 50: Aggressively Scheduling Instructions Across Control Dependences
[Chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control).]
Speculating on control dependences alone shows no performance improvement over conservative scheduling.

Slide 51: Aggressively Scheduling Instructions Across Control and Data Dependences
[Chart: A = conservative instruction scheduling, C = aggressive (control), D = aggressive (control + data).]
Speculating on both control and data dependences improves performance by 9% over conservative scheduling.

Slide 52: Hardware Optimization Over Non-Optimized Code
[Chart: U = no instruction scheduling, E = no instruction scheduling + hardware optimization.]
Hardware optimization alone improves performance by 4%.

Slide 53: Hardware Optimization + Conservative Instruction Scheduling
[Chart: A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization.]
Adding hardware optimization on top of conservative scheduling improves performance by 2%.

Slide 54: Hardware Optimization + Aggressive Instruction Scheduling
[Chart: D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization.]
Adding hardware optimization degrades performance slightly: hardware optimization matters less once the compiler schedules instructions aggressively.

Slide 55: Hardware Optimization for Communicating Both Scalar and Memory Values
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, and vortex.]
Reduces the cost of violations by 27.6%, which translates into a 4.7% performance improvement.

Slide 56: Impact of Hardware and Software Optimizations
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, and vortex.]
Six benchmarks are improved by 6.2-28.5%, and four others by 2.7-3.6%.

Slide 57: Conclusions
- The critical forwarding path is an important bottleneck in TLS.
- Loop induction variables serialize parallel threads but can be eliminated with our instruction scheduling algorithm.
- Non-inductive scalars can benefit from conservative instruction scheduling.
- Aggressive instruction scheduling should be applied selectively:
  - Speculating on control dependences alone is not very effective.
  - Speculating on both control and data dependences can reduce synchronization significantly; gcc is the biggest beneficiary.
- Hardware optimization is less important as the compiler schedules instructions more aggressively: the critical forwarding path is best addressed by the compiler.

