Presentation transcript: "Compiler Optimization of Scalar Value Communication Between Speculative Threads", Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry, Carnegie Mellon University.

Slide 1: Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry
School of Computer Science, Carnegie Mellon University

Slide 2: Motivation
- Industry is delivering multithreaded processors, and improving throughput on them is straightforward.
- How can we use multithreaded processors to improve the performance of a single application? We need parallel programs.
- Example: the IBM Power4 processor has 2 processor cores per die, 4 dies per module, and 8 64-bit processors per unit.
[Figure: diagram of processors and cache.]

Slide 3: Automatic Parallelization
Finding independent threads in integer programs is limited by:
- Complex control flow
- Ambiguous data dependences
- Runtime inputs
More fundamentally, parallelization is determined at compile time.
=> Thread-Level Speculation: detect data dependences at runtime.

Slide 4: Thread-Level Speculation (TLS)
[Figure: threads T1, T2, and T3 executing in parallel over time; a load in a later thread conflicts with a store in an earlier thread, so the later thread must retry.]
How do we communicate values between threads?

Slide 5: Speculation

    while(1) {
      ... = *q;
      ...
      *p = ...;
    }

[Figure: threads Ti through Ti+3 each execute '...=*q' and '*p=...'; a conflicting store/load pair forces one thread to retry.]
Speculation is efficient when the dependence occurs infrequently.

Slide 6: Synchronization

    while(1) {
      ... = a;
      a = ...;
    }

[Figure: with synchronization, thread Ti+1 executes wait(a) before using a and stalls until thread Ti executes signal(a) after its last definition of a.]
Synchronization is efficient when the dependence occurs frequently.
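
The slides treat wait and signal as primitives for forwarding a scalar between consecutive threads. Purely for concreteness, here is a minimal software sketch of such a pair; the names fwd_slot, fwd_wait, and fwd_signal are illustrative assumptions, not the hardware mechanism the talk assumes.

    #include <stdatomic.h>

    /* Hypothetical forwarding slot between a producer thread and the next
     * (consumer) thread; the real design forwards values between cores. */
    struct fwd_slot {
        atomic_int ready;   /* set by the producer                        */
        int        value;   /* the forwarded scalar, e.g. the value of a  */
    };

    /* Consumer side of wait(a): stall until the previous thread produces a. */
    int fwd_wait(struct fwd_slot *s) {
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            ;   /* this spin is the synchronization stall measured on slide 8 */
        return s->value;
    }

    /* Producer side of signal(a): publish the value and release the consumer. */
    void fwd_signal(struct fwd_slot *s, int v) {
        s->value = v;
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }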

Slide 7: Critical Forwarding Path in TLS

    while(1) {
      wait(a);
      ... = a;
      a = ...;
      signal(a);
    }

The instructions between wait(a) and signal(a), i.e. the use and redefinition of a, form the critical forwarding path: the next thread cannot proceed past its own wait(a) until this path has executed.

Slide 8: Cost of Synchronization
The processors spend 43% of total execution time on synchronization.
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average, broken down into sync and other.]

Slide 9: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
- Performance
- Conclusions

Slide 10: Reducing the Critical Forwarding Path
[Figure: two thread schedules, one with a long critical forwarding path and one with a short one; the shorter path yields less execution time.]
A shorter critical forwarding path means less execution time.
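
To make the idea concrete, here is a small sketch of the transformation (my own illustration, not code from the talk): work that does not contribute to the forwarded value is moved out of the wait/signal window, so the stall seen by the next thread shrinks. The helper names (wait_a, signal_a, next_value, use, unrelated_work) are hypothetical.

    /* Sketch only: wait_a/signal_a stand for the slides' wait(a)/signal(a). */
    extern void wait_a(void);
    extern void signal_a(void);
    extern int  next_value(int);
    extern void use(int);
    extern void unrelated_work(void);

    void iteration_before(int *a) {
        wait_a();            /* stall until the previous thread forwards a  */
        use(*a);
        *a = next_value(*a);
        unrelated_work();    /* still inside the forwarding window          */
        signal_a();          /* the next thread waits for all of the above  */
    }

    void iteration_after(int *a) {
        wait_a();
        use(*a);
        *a = next_value(*a);
        signal_a();          /* forward a as soon as its final value is known */
        unrelated_work();    /* now off the critical forwarding path          */
    }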

Slide 11: A Simplified Example from GCC

    do {
      counter = 0;
      if (p->jmp)
        p = p->jmp;
      if (!p->real) {
        p = p->next;
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p->next;
    } while (p);

[Figure: the corresponding control flow graph (start; counter=0; p->jmp?; p=p->jmp; p->real?; p=p->next; q=p; the inner loop counter++; q=q->next; q?; p=p->next; end) and the linked list that p traverses through its jmp, real, and next fields.]

Slide 12: Insert Wait Before First Use of P
[Figure: the control flow graph from slide 11, before the wait is inserted.]

Slide 13: Insert Wait Before First Use of P
[Figure: the same graph with wait(p) inserted after counter=0, immediately before the first use of p at the p->jmp? test.]

Slide 14: Insert Signal After Last Definition of P
[Figure: the graph with wait(p) in place, before the signals are inserted.]

Slide 15: Insert Signal After Last Definition of P
[Figure: the graph with signal(p) inserted after each of the two definitions p=p->next, one on the !p->real path and one at the end of the loop body.]
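
Written back as source, the conservatively synchronized loop from slides 12 through 15 would look roughly like this (wait and signal are the TLS primitives from slide 6; the placement mirrors the graph above):

    do {
      counter = 0;
      wait(p);              /* before the first use of p                    */
      if (p->jmp)
        p = p->jmp;
      if (!p->real) {
        p = p->next;
        signal(p);          /* after the last definition of p on this path  */
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p->next;
      signal(p);            /* after the last definition of p on this path  */
    } while (p);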

Slide 16: Earlier Placement for Signals
[Figure: the same graph, but with the forwarded value computed and signaled earlier along each path instead of only after p=p->next.]
How can we systematically find these insertion points?

Slide 17: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
    - How to compute the forwarding value
    - Where to compute the forwarding value
  - Aggressive instruction scheduling
- Performance
- Conclusions

Slide 18: A Dataflow Algorithm
Given: the control flow graph (entry to exit). For each node in the graph, determine:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
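
The slides do not spell out the dataflow equations, so the sketch below is only a simplified stand-in: a backward fixed-point pass that marks the nodes at whose entry p already has its final value for the thread, which is the conservative condition for placing signal(p) there. The talk's actual analysis is stronger, since it also moves the defining instructions themselves (such as p=p->next) upward, as the next slides illustrate.

    #include <stdbool.h>

    enum { MAX_SUCC = 2 };

    struct node {
        struct node *succ[MAX_SUCC];
        int          nsucc;
        bool         defines_p;   /* this node contains a definition of p        */
        bool         p_final;     /* fact: p is already final at this node's entry */
    };

    /* One backward pass over all nodes; returns true if any fact changed. */
    static bool backward_pass(struct node *nodes, int n) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            struct node *nd = &nodes[i];
            bool fact = !nd->defines_p;      /* with no successors, only this matters */
            for (int s = 0; s < nd->nsucc && fact; s++)
                fact = nd->succ[s]->p_final;
            if (fact != nd->p_final) {
                nd->p_final = fact;
                changed = true;
            }
        }
        return changed;
    }

    /* Iterate to a fixed point.  Starting from p_final = true everywhere and
     * only ever lowering facts gives the precise all-paths solution. */
    void solve(struct node *nodes, int n) {
        for (int i = 0; i < n; i++)
            nodes[i].p_final = true;
        while (backward_pass(nodes, n))
            ;
    }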

Slide 19: Moving Instructions Across Basic Blocks
[Figure: three cases, (A), (B), and (C), of moving 'signal p' and, where necessary, the computation p=p->next upward across basic-block boundaries.]

Slides 20-26: Example (1)-(7)
[Figures: the control flow graph from slide 11, stepped through as the backward analysis attaches 'signal p', together with the computation p=p->next, to more and more nodes, starting from the 'end' node and moving up through the loop body.]

Slide 27: Nodes with Multiple Successors
[Figure: a node whose two successors each carry 'signal p; p=p->next'; because the successors agree, the pair can be moved up into the node itself.]

Slides 28-31: Example (7)-(10)
[Figures: the propagation continues farther up the graph; on the path that takes the jmp pointer, the annotation now also involves the assignment p=p->jmp.]

Slide 32: Nodes with Multiple Successors
[Figure: a node whose two successors carry different annotations, 'signal p; p=p->next' on one side and a computation that also involves p=p->jmp on the other, so the pair cannot simply be hoisted above the branch.]

Slides 33-34: Example (10)-(11)
[Figures: the annotated control flow graph after the backward propagation has gone as far as it can.]

Slide 35: Finding the Earliest Insertion Point
Find, at each node in the CFG:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
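
The slide gives only the definition, so the following is just one way to phrase it as an analysis over per-node facts (my paraphrase, not the paper's equations); can_compute is assumed to be supplied by the forwarding-value analysis of slide 18.

    #include <stdbool.h>

    struct cfg_node {
        struct cfg_node **pred;
        int               npred;
        bool              can_compute;  /* forwarding value computable here     */
        bool              clean_path;   /* some path from the entry reaches this
                                           node with no earlier node able to
                                           compute the forwarding value         */
    };

    /* One forward pass: the entry trivially has a clean path; any other node
     * has one if some predecessor has a clean path and cannot itself compute
     * the value.  Returns true if any fact changed. */
    bool forward_pass(struct cfg_node *nodes, int n) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            struct cfg_node *nd = &nodes[i];
            bool fact = (nd->npred == 0);        /* the entry node */
            for (int k = 0; k < nd->npred && !fact; k++)
                fact = nd->pred[k]->clean_path && !nd->pred[k]->can_compute;
            if (fact && !nd->clean_path) {
                nd->clean_path = true;
                changed = true;
            }
        }
        return changed;
    }

    void solve_clean_paths(struct cfg_node *nodes, int n) {
        for (int i = 0; i < n; i++)
            nodes[i].clean_path = false;
        while (forward_pass(nodes, n))
            ;   /* iterate to a fixed point */
    }

    /* 'Earliest' as defined on the slide: the node can compute the forwarding
     * value, and along some path no earlier node could. */
    bool is_earliest(const struct cfg_node *nd) {
        return nd->can_compute && nd->clean_path;
    }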

Slides 36-39: The Earliest Insertion Point (1)-(4)
[Figures: the earliest analysis stepping over the annotated graph and marking the earliest nodes at which the forwarding value can be computed and signaled.]

Slide 40: The Resulting Graph
[Figure: the transformed control flow graph: wait(p) follows counter=0; once the p->jmp branch is resolved, the forwarded value is computed into p1 (p1=p->next) and signaled immediately (signal(p1)); the later updates of p become p=p1.]
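
Rendered back as source, the graph on this slide corresponds roughly to the loop below; the graph places the p1=p->next; signal(p1) pair separately on each arm of the p->jmp branch, which at source level amounts to placing it right after that branch:

    do {
      counter = 0;
      wait(p);
      if (p->jmp)
        p = p->jmp;
      p1 = p->next;         /* compute the forwarded value early...        */
      signal(p1);           /* ...and release the next thread right away   */
      if (!p->real) {
        p = p1;
        continue;
      }
      q = p;
      do {
        counter++;
        q = q->next;
      } while (q);
      p = p1;               /* reuse the value that was already signaled   */
    } while (p);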

Slide 41: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
    - Speculate on control flow
    - Speculate on data dependence
- Performance
- Conclusions

Slide 42: We Cannot Compute the Forwarded Value When...
- Control dependence: which definition produces the forwarded value (p=p->jmp versus p=p->next) depends on a branch that has not yet been resolved.
- Ambiguous data dependence: a possibly aliasing operation such as update(p) may change the value between the hoisted computation and the signal.
=> Speculation

Slide 43: Speculating on Control Dependences
[Figure: the pair p=p->next; signal(p) is hoisted above the branch on the assumption that the p->jmp arm is not taken; on the arm where p=p->jmp does execute, the compiler inserts violate(p), then recomputes p=p->next and signals the correct value.]
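
In source form the transformation might look roughly like this (my reconstruction, reusing the p1 temporary from slide 40; violate(p) is the slide's primitive for telling the consumer that the value it received was wrong):

    counter = 0;
    wait(p);
    p1 = p->next;        /* speculate that the p->jmp branch is not taken    */
    signal(p1);          /* forward the possibly wrong value immediately     */
    if (p->jmp) {
        violate(p);      /* mis-speculated: squash the consumer's use of p1  */
        p = p->jmp;
        p1 = p->next;    /* recompute and re-send the correct value          */
        signal(p1);
    }
    /* ... rest of the loop body as on slide 40, using p = p1 ... */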

Slide 44: Speculating on Data Dependence
[Figure: the pair p=load(addr); signal(p) is hoisted above a call to update(p) and above intervening stores (store1, store2, store3) that might write the loaded location.]
=> Hardware support

Slide 45: Hardware Support
[Figure: the hoisted load is bracketed by mark_load(addr) and unmark_load(addr); in the example, p=load(0xADE8) is marked, the later store2(0x8438) and store3(0x88E8) are checked against the marked address (here they do not conflict), and the mark is removed by unmark_load(0xADE8).]
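
Put together, the compiler-generated sequence for a data-speculative hoist might look roughly as follows; mark_load, unmark_load, and signal are the primitives named on the slides, while the exact calling convention and the violation/recovery path are assumptions of this sketch:

    mark_load(addr);     /* ask the hardware to watch this address            */
    p = load(addr);      /* the load, hoisted above the ambiguous stores      */
    signal(p);           /* forward the speculative value to the next thread  */

    update(p);           /* may execute store1/store2/store3; if any of them
                            writes the marked address, the hardware raises a
                            violation and the consumer is squashed            */

    unmark_load(addr);   /* past the last ambiguous store: stop watching      */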

Slide 46: Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
- Performance
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
  - Hardware optimization for scalar value communication
  - Hardware optimization for all value communication
- Conclusions

Slide 47: Conservative Instruction Scheduling
[Chart: normalized execution time (busy, other, fail, sync) for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average; U = no instruction scheduling, A = conservative instruction scheduling.]
Conservative instruction scheduling improves performance by 15%.

Slide 48: Optimizing Induction Variables Only
[Chart: U = no instruction scheduling, I = induction variable scheduling only, A = conservative instruction scheduling.]
Induction variable scheduling alone accounts for a 10% performance improvement.

Slide 49: Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98] uses local analysis to schedule instructions across basic blocks and does not allow scheduling of instructions across inner loops.
[Chart: normalized execution time for gcc and go; M = Multiscalar scheduling, A = conservative scheduling.]

Slide 50: Aggressively Scheduling Instructions Across Control Dependences
[Chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control).]
Speculating on control dependences alone shows no performance improvement over conservative scheduling.

Slide 51: Aggressively Scheduling Instructions Across Control and Data Dependences
[Chart: A = conservative instruction scheduling, C = aggressive (control), D = aggressive (control + data).]
Speculating on both control and data dependences improves performance by 9% over conservative scheduling.

Slide 52: Hardware Optimization Over Non-Optimized Code
[Chart: U = no instruction scheduling, E = no instruction scheduling + hardware optimization.]
Hardware optimization alone improves performance by 4%.

Slide 53: Hardware Optimization + Conservative Instruction Scheduling
[Chart: A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization.]
Adding hardware optimization on top of conservative scheduling improves performance by 2%.

Slide 54: Hardware Optimization + Aggressive Instruction Scheduling
[Chart: D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization.]
Adding hardware optimization degrades performance slightly: hardware optimization matters less once the compiler schedules instructions aggressively.

Slide 55: Hardware Optimization for Communicating Both Scalar and Memory Values
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, and vortex.]
Reduces the cost of violations by 27.6%, which translates into a 4.7% performance improvement.

Slide 56: Impact of Hardware and Software Optimizations
[Chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, and vortex.]
Six benchmarks are improved by 6.2-28.5%, and four others by 2.7-3.6%.

Slide 57: Conclusions
- The critical forwarding path is an important bottleneck in TLS.
- Loop induction variables serialize parallel threads but can be eliminated with our instruction scheduling algorithm.
- Non-inductive scalars can benefit from conservative instruction scheduling.
- Aggressive instruction scheduling should be applied selectively:
  - Speculating on control dependences alone is not very effective.
  - Speculating on both control and data dependences can reduce synchronization significantly; gcc is the biggest beneficiary.
- Hardware optimization is less important as the compiler schedules instructions more aggressively: the critical forwarding path is best addressed by the compiler.

