Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
Motivation
Industry is delivering multithreaded processors.
Improving throughput on multithreaded processors is straightforward.
How can we use multithreaded processors to improve the performance of a single application?
We need parallel programs.
[Chip diagram] IBM Power4 processor: 2 processor cores per die, 4 dies per module, 8 64-bit processors per unit.
Automatic Parallelization
Finding independent threads in integer programs is limited by:
  complex control flow
  ambiguous data dependences
  runtime inputs
More fundamentally, parallelization is determined at compile time.
Thread-Level Speculation: detect data dependences at runtime.
Thread-Level Speculation (TLS)
[Figure: threads 1-3 run speculatively in parallel; a load in a later thread that conflicts with a store in an earlier thread forces the later thread to retry]
How do we communicate values between threads?
Speculation
while(1) {
  …= *q;
  ...
  *p = …;
}
[Figure: threads Ti through Ti+3 each execute "…=*q" and "*p=…"; when a conflict is detected, the offending thread retries]
Speculation is efficient when the dependence occurs infrequently.
Synchronization
while(1) {
  …= a;
  a =…;
}
[Figure: the consumer thread stalls in wait(a) until the producer executes signal(a); wait(a) is placed before the use of a, signal(a) after its definition]
Synchronization is efficient when the dependence occurs frequently.
Critical Forwarding Path in TLS
while(1) {
  wait(a);
  …= a;
  a = …;
  signal(a);
}
[Figure: the instructions between wait(a) and signal(a) form the critical forwarding path]
Cost of Synchronization
The processors spend 43% of total execution time on synchronization.
[Bar chart: normalized execution time, broken into sync and other, for gcc, go, mcf, parser, perlbmk, twolf, vpr and the average]
Outline
Compiler optimization
  Optimization opportunity
  Conservative instruction scheduling
  Aggressive instruction scheduling
Performance
Conclusions
Reducing the Critical Forwarding Path
[Figure: long critical path vs. short critical path]
A shorter critical forwarding path means less execution time.
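To make the idea concrete, here is a minimal sketch (not taken from the paper's code): wait() and signal() stand in for the TLS value-forwarding primitives, and next_value() / long_unrelated_work() are hypothetical placeholders for the rest of the loop body.

    /* Before scheduling: signal(a) sits at the end of the iteration,
       so the next thread stalls in wait(a) for almost the whole body. */
    while (1) {
        wait(a);
        use = a;                  /* first use of a                    */
        long_unrelated_work();
        a = next_value(a);        /* last definition of a              */
        signal(a);
    }

    /* After scheduling: the definition of a and its signal are hoisted
       above the unrelated work, shortening the critical forwarding
       path seen by the next thread.                                   */
    while (1) {
        wait(a);
        use = a;
        a = next_value(a);
        signal(a);
        long_unrelated_work();
    }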
A Simplified Example from GCC
do {
  counter = 0;
  if (p->jmp)
    p = p->jmp;
  if (!p->real) {
    p = p->next;
    continue;
  }
  q = p;
  do {
    counter++;
    q = q->next;
  } while (q);
  p = p->next;
} while (p);
[Figures: the corresponding control flow graph (start, counter=0, p->jmp?, p=p->jmp, p->real?, p=p->next, q=p, inner loop, p=p->next, end) and the linked list traversed through the jmp/real/next fields]
Insert Wait Before First Use of P
[CFG figures: wait(p) is inserted at the top of the loop body, before the first use of p in "p->jmp?"]
Insert Signal After Last Definition of P
[CFG figures: signal(p) is inserted after the last definition of p on each path, i.e., after both "p=p->next" nodes]
Earlier Placement for Signals
[CFG figure: the same loop, suggesting that the signal(p) calls could be placed earlier]
How can we systematically find these insertion points?
Outline
Compiler optimization
  Optimization opportunity
  Conservative instruction scheduling
    How to compute the forwarding value
    Where to compute the forwarding value
  Aggressive instruction scheduling
Performance
Conclusions
A Dataflow Algorithm
Given: a control flow graph (entry, exit).
For each node in the graph:
  Can we compute the forwarding value at this node?
  If so, how do we compute the forwarding value?
  Can we forward the value at an earlier node?
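As a rough illustration of the flavor of such an analysis (this is not the paper's algorithm: the real pass also records which instructions must be copied to recompute the value, and handles branches by duplicating code into the arms or speculating), here is a self-contained backward fixed-point pass over a toy CFG. It only tracks whether the forwarding value of p is computable from each node onward: a call that may modify p ("update(p)") blocks computability, while a copyable definition like "p=p->next" does not.

    #include <stdbool.h>
    #include <stdio.h>

    enum { N = 6 };

    typedef struct {
        const char *label;
        bool kills;        /* node may modify p in a way we cannot copy */
        int  succ[2];      /* successor indices, -1 if unused           */
    } Node;

    /* Meet: computable only if it is computable along every successor. */
    static bool meet(const Node *cfg, const bool *computable, int i)
    {
        bool ok = true;
        for (int s = 0; s < 2; s++)
            if (cfg[i].succ[s] >= 0)
                ok = ok && computable[cfg[i].succ[s]];
        return ok;
    }

    int main(void)
    {
        /* Toy CFG loosely based on the talk's example. */
        Node cfg[N] = {
            {"start",     false, { 1, -1}},
            {"p->jmp?",   false, { 2,  3}},
            {"p=p->jmp",  false, { 3, -1}},
            {"update(p)", true,  { 4, -1}},   /* ambiguous update of p  */
            {"p=p->next", false, { 5, -1}},   /* copyable definition    */
            {"end",       false, {-1, -1}},
        };
        bool computable[N] = { false };

        for (bool changed = true; changed; ) {       /* iterate to a fixed point */
            changed = false;
            for (int i = N - 1; i >= 0; i--) {
                bool v = cfg[i].kills ? false : meet(cfg, computable, i);
                if (v != computable[i]) { computable[i] = v; changed = true; }
            }
        }
        /* signal(p), together with the copied defining instructions,   */
        /* may be placed at the nodes reported as computable.           */
        for (int i = 0; i < N; i++)
            printf("%-10s computable=%s\n",
                   cfg[i].label, computable[i] ? "yes" : "no");
        return 0;
    }

On this toy graph the pass reports that only "p=p->next" and "end" are computable: the ambiguous update(p) call blocks any earlier placement unless we speculate, which is exactly the case addressed later in the talk.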
Moving Instructions Across Basic Blocks
[Figure: cases (A), (B) and (C) showing "signal p", together with the copied instruction "p=p->next", being moved upward across basic block boundaries]
Example (1)-(11) and Nodes with Multiple Successors
[CFG figures: step-by-step animation of the dataflow algorithm on the GCC example. "signal p" and the copied instruction "p=p->next" (and, along the jmp path, "p=p->jmp") are propagated backward through the graph one node at a time; at a node with multiple successors, the instructions required on each successor path are compared, and the signal can move above the branch only where they can be reconciled.]
Finding the Earliest Insertion Point
Find, at each node in the CFG:
  Can we compute the forwarding value at this node?
  If so, how do we compute the forwarding value?
  Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
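Stated symbolically (one possible formalization of the slide's wording, not necessarily the paper's exact equation), with CanCompute(n) meaning the forwarding value is computable at node n:

    \mathit{Earliest}(n) \;=\; \mathit{CanCompute}(n) \;\wedge\; \exists\,\text{path } \mathit{entry} \rightsquigarrow n \ \text{s.t.}\ \forall\, m \neq n \text{ on the path}:\ \neg\,\mathit{CanCompute}(m)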
The Earliest Insertion Point (1)-(4)
[CFG figures: the earliest-placement analysis applied to the GCC example, step by step]
The Resulting Graph
[CFG figure: wait(p) at the top of the loop body; on each path the forwarded value is computed early into p1 ("p1=p->next") and sent with signal(p1); the later uses "p=p->next" are replaced by "p=p1"]
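In source form the result looks roughly like the following sketch (p1 is the temporary introduced by the scheduler; wait()/signal() again stand in for the forwarding primitives; the CFG actually duplicates the hoisted block on both arms of the p->jmp? branch, which is equivalent to placing it once after the if):

    do {
        counter = 0;
        wait(p);                   /* receive p from the previous thread */
        if (p->jmp)
            p = p->jmp;
        p1 = p->next;              /* compute the forwarded value early  */
        signal(p1);                /* ...and send it to the next thread  */
        if (!p->real) {
            p = p1;
            continue;
        }
        q = p;
        do {                       /* the inner loop is no longer on the */
            counter++;             /* critical forwarding path           */
            q = q->next;
        } while (q);
        p = p1;
    } while (p);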
Outline
Compiler optimization
  Optimization opportunity
  Conservative instruction scheduling
  Aggressive instruction scheduling
    Speculate on control flow
    Speculate on data dependence
Performance
Conclusions
We Cannot Compute the Forwarded Value When…
Control dependence: the instructions that define p ("p=p->jmp" vs. "p=p->next") depend on a branch that has not yet been resolved.
Ambiguous data dependence: a call such as update(p) may modify p before the defining "p=p->next".
In both cases, speculation allows the signal to be hoisted anyway.
Speculating on Control Dependences
[CFG figure: "p=p->next; signal(p)" is hoisted above the p->jmp? branch speculatively; if the p=p->jmp path is taken, the forwarded value was wrong, so recovery code recomputes p, re-signals it, and issues violate(p) to squash the consumer's speculative work]
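A sketch of what the control-speculative version might look like in source form (violate() is the slide's primitive for squashing the consumer of a mis-forwarded value; the exact recovery sequence shown here is an assumption):

    wait(p);
    p1 = p->next;                  /* guess that the p->jmp path is not  */
    signal(p1);                    /* taken and forward p->next early    */
    if (p->jmp) {                  /* guessed wrong:                     */
        violate(p);                /* squash consumers of the stale value*/
        p = p->jmp;
        p1 = p->next;              /* recompute and re-forward           */
        signal(p1);
    }
    /* ... rest of the loop body ... */
    p = p1;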
Speculating on Data Dependence
[Figure: "p=load(addr); signal(p)" is hoisted above a call (update(p)) that may store to the same address; if any of the intervening stores (store1, store2, store3) actually writes that address, the early forwarded value is wrong]
This requires hardware support.
Hardware Support
[Figure: the speculatively hoisted load is bracketed by mark_load(addr) and unmark_load(addr); while the address (e.g., 0xADE8) is marked, the hardware checks each later store (store2 to 0x8438, store3 to 0x88E8) against it and raises a violation on a match]
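A sketch of how the marked load might be used (mark_load()/unmark_load() are the primitives named on the slide, but their exact signatures and the surrounding code are assumptions here; update(p) is the ambiguous call from the earlier example):

    wait(p);
    mark_load(&p->next);           /* hardware watches this address      */
    p1 = p->next;                  /* speculatively hoisted load         */
    signal(p1);                    /* forward the (possibly stale) value */

    update(p);                     /* if this stores to p->next, the     */
                                   /* hardware detects the conflict and  */
                                   /* the mis-forwarded value is revoked */

    unmark_load(&p->next);         /* past the last ambiguous store      */
    p = p1;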
Outline
Compiler optimization
  Optimization opportunity
  Conservative instruction scheduling
  Aggressive instruction scheduling
Performance
  Conservative instruction scheduling
  Aggressive instruction scheduling
  Hardware optimization for scalar value communication
  Hardware optimization for all value communication
Conclusions
Conservative Instruction Scheduling
[Bar chart: normalized execution time (busy, sync, fail, other) for gcc, go, mcf, parser, perlbmk, twolf, vpr and the average; U = no instruction scheduling, A = conservative instruction scheduling]
Conservative instruction scheduling improves performance by 15%.
Optimizing Induction Variables Only
[Bar chart: U = no instruction scheduling, I = induction variable scheduling only, A = conservative instruction scheduling; benchmarks as above]
Induction variable scheduling alone is responsible for a 10% performance improvement.
Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98]:
  uses local analysis to schedule instructions across basic blocks;
  does not allow scheduling of instructions across inner loops.
[Bar chart for gcc and go: M = Multiscalar scheduling, A = conservative scheduling; execution time broken into busy, sync, fail, other]
Aggressively Scheduling Instructions Across Control Dependences
[Bar chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control)]
Speculating on control dependences alone yields no performance improvement over conservative scheduling.
Aggressively Scheduling Instructions Across Control and Data Dependences
[Bar chart: A = conservative instruction scheduling, C = aggressive (control), D = aggressive (control + data)]
Speculating on both control and data dependences improves performance by 9% over conservative scheduling.
Hardware Optimization Over Non-Optimized Code
[Bar chart: U = no instruction scheduling, E = no instruction scheduling + hardware optimization]
Hardware optimization alone improves performance by 4%.
Hardware Optimization + Conservative Instruction Scheduling
[Bar chart: A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization]
On top of conservative scheduling, the hardware optimization improves performance by 2%.
Hardware Optimization + Aggressive Instruction Scheduling
[Bar chart: D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization]
Here the hardware optimization degrades performance slightly: hardware optimization is less important with compiler optimization.
Hardware Optimization for Communicating Both Scalar and Memory Values
[Bar chart: execution time (busy, sync, fail, other) for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, vortex]
Reduces the cost of violations by 27.6%, translating into a 4.7% performance improvement.
Impact of Hardware and Software Optimizations
[Bar chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, vortex]
6 benchmarks are improved by …%, 4 others by …%.
Conclusions
The critical forwarding path is an important bottleneck in TLS.
Loop induction variables serialize parallel threads and can be eliminated with our instruction scheduling algorithm.
Non-inductive scalars can benefit from conservative instruction scheduling.
Aggressive instruction scheduling should be applied selectively:
  speculating on control dependences alone is not very effective;
  speculating on both control and data dependences can reduce synchronization significantly;
  gcc is the biggest beneficiary.
Hardware optimization is less important as the compiler schedules instructions more aggressively.
The critical forwarding path is best addressed by the compiler.