Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry
School of Computer Science, Carnegie Mellon University

Motivation
Industry is delivering multithreaded processors, and improving throughput on them is straightforward. The IBM Power4, for example, has 2 processor cores per die, 4 dies per module, and 8 64-bit processors per unit. How can we use multithreaded processors to improve the performance of a single application? We need parallel programs.

Automatic Parallelization
Finding independent threads in integer programs is limited by complex control flow, ambiguous data dependences, and runtime inputs. More fundamentally, parallelization is determined at compile time. Thread-Level Speculation instead detects data dependences at runtime.

Thread-Level Speculation (TLS)
[Timeline figure: Thread1, Thread2 and Thread3 run speculatively in parallel; when a store in an earlier thread conflicts with a load already performed by a later thread, the later thread must retry.]
How do we communicate values between threads?

Speculation

    while(1) {
        … = *q;
        ...
        *p = …;
    }

[Figure: threads Ti through Ti+3 each execute "… = *q" and "*p = …"; a conflicting pair forces a retry.]
Speculation is efficient when the dependence occurs infrequently.

Synchronization

    while(1) {
        wait(a);
        … = a;
        a = …;
        signal(a);
    }

[Figure: thread Ti executes signal(a); thread Ti+1 waits (stalls) at wait(a) until the value arrives.]
Synchronization is efficient when the dependence occurs frequently.
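To make the wait/signal pattern concrete, here is a minimal sketch of a forwarding channel between consecutive threads. The channel_t type and the wait_value/signal_value names are hypothetical stand-ins for the TLS forwarding hardware, modeled in software with pthreads:

    #include <pthread.h>

    /* Hypothetical one-slot forwarding channel between consecutive
       threads; real TLS hardware provides this directly. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             value;
        int             valid;
    } channel_t;

    channel_t ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    /* Consumer side of wait(a): stall until the predecessor forwards. */
    int wait_value(channel_t *c) {
        pthread_mutex_lock(&c->lock);
        while (!c->valid)
            pthread_cond_wait(&c->ready, &c->lock);
        int v = c->value;
        pthread_mutex_unlock(&c->lock);
        return v;
    }

    /* Producer side of signal(a): forward as soon as the last
       definition has executed; signaling earlier shortens the stall. */
    void signal_value(channel_t *c, int v) {
        pthread_mutex_lock(&c->lock);
        c->value = v;
        c->valid = 1;
        pthread_cond_signal(&c->ready);
        pthread_mutex_unlock(&c->lock);
    }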

Critical Forwarding Path in TLS

    while(1) {
        wait(a);
        … = a;
        a = …;
        signal(a);
    }

The instructions between wait(a) and signal(a) form the critical forwarding path: each thread must finish them before the next thread can proceed past its wait.

Cost of Synchronization
[Bar chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr and the average, split into sync and other time.]
The processors spend 43% of total execution time on synchronization.

Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
- Performance
- Conclusions

Reducing the Critical Forwarding Path
[Figure: two schedules of the same threads; with a long critical forwarding path the successor thread stalls longer than with a short one.]
A shorter critical forwarding path means less execution time.

A Simplified Example from GCC

    do {
        counter = 0;
        if (p->jmp)
            p = p->jmp;
        if (!p->real) {
            p = p->next;
            continue;
        }
        q = p;
        do {
            counter++;
            q = q->next;
        } while (q);
        p = p->next;
    } while (p);

[CFG figure: start; counter=0; p->jmp?; p=p->jmp; p->real?; p=p->next; q=p; counter++, q=q->next; q?; p=p->next; end. List figure: p points into a linked list of nodes with jmp, next and real fields.]
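For reference, a minimal declaration that makes the fragment self-contained; the field names follow the slide, while the types are assumptions for illustration:

    /* Hypothetical node type for the example above. */
    struct node {
        struct node *jmp;   /* optional jump target      */
        struct node *next;  /* next node in the list     */
        int          real;  /* nonzero for a "real" node */
    };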

Insert Wait Before First Use of P
[CFG figures, before and after: wait(p) is inserted after counter=0 and before the first use of p at the p->jmp? branch.]

Insert Signal After Last Definition of P
[CFG figures, before and after: signal(p) is inserted after each last definition of p, i.e. after p=p->next on the !p->real path and after the final p=p->next.]

Earlier Placement for Signals
[CFG figure: the same graph, suggesting that the signal(p) insertion points can move earlier.]
How can we systematically find these insertion points?

Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
    - How to compute the forwarding value
    - Where to compute the forwarding value
  - Aggressive instruction scheduling
- Performance
- Conclusions

A Dataflow Algorithm
Given: a control flow graph with entry and exit. For each node in the graph, ask:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
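As a sketch of how such a backward analysis might be organized (the node representation, the effect classification, and the meet rule below are illustrative assumptions, not the paper's exact formulation):

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_SUCCS 2

    /* How a node affects the forwarded variable p: it leaves p alone,
       it defines p in a way the signal code can replicate (p = p->next),
       or it defines p in a way it cannot (ambiguous store, unknown call). */
    enum effect { TRANSPARENT, REPLICABLE_DEF, OPAQUE_DEF };

    struct node {
        enum effect eff;
        int nsuccs;
        int succ[MAX_SUCCS];
    };

    /* Backward analysis: computable[i] is true if the forwarded value of
       p can already be computed at the entry of node i.  The meet is
       intersection over successors (no control speculation), so we start
       from true and iterate down to the greatest fixpoint. */
    static void solve(const struct node g[], int n, bool computable[]) {
        for (int i = 0; i < n; i++) computable[i] = true;
        for (bool changed = true; changed; ) {
            changed = false;
            for (int i = n - 1; i >= 0; i--) {
                bool out = true;                 /* exit node: final value */
                for (int s = 0; s < g[i].nsuccs; s++)
                    out = out && computable[g[i].succ[s]];
                bool in = (g[i].eff == OPAQUE_DEF) ? false : out;
                if (in != computable[i]) { computable[i] = in; changed = true; }
            }
        }
    }

    int main(void) {
        /* Tiny CFG: 0 -> 1 -> 2 (exit); node 1 replicably defines p. */
        struct node g[] = {
            { TRANSPARENT,    1, {1} },
            { REPLICABLE_DEF, 1, {2} },
            { TRANSPARENT,    0, {0} },
        };
        bool c[3];
        solve(g, 3, c);
        for (int i = 0; i < 3; i++)
            printf("node %d: computable at entry = %d\n", i, c[i]);
        return 0;
    }

Where computable flips from false to true along an edge is a candidate point for inserting the signal and the replicated computation.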

Moving Instructions Across Basic Blocks
[Figure: three cases (A), (B), (C) showing how signal p, together with the instructions that compute p (p=p->next, q=p), is moved upward across basic-block boundaries.]

Example (1)-(7)
[CFG figures: the algorithm walks the example CFG backward from the end node, propagating the annotation "signal p; p = p->next" upward one node at a time, through q?, the inner loop, q = p, and the p->real? branch.]

Nodes with Multiple Successors
[Figure: a branch node whose successors all carry the annotation "signal p; p = p->next"; because the annotations agree, they can be moved above the branch.]

Example (7)-(10)
[CFG figures: the merged annotation continues upward past the p->real? branch; at the p->jmp? branch, the taken path additionally contains p = p->jmp, so the successor annotations no longer agree.]

Nodes with Multiple Successors
[Figure: a branch node whose successors carry different annotations ("signal p; p = p->next" with and without p = p->jmp); because they disagree, the annotation cannot move above the branch without speculating on its outcome.]

Example (10)-(11)
[CFG figures: the final state of the conservative analysis, with each node annotated by the instructions that would compute the forwarded value of p at that point.]

Finding the Earliest Insertion Point
Find at each node in the CFG:
- Can we compute the forwarding value at this node?
- If so, how do we compute the forwarding value?
- Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
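One plausible way to state the earliest condition as a dataflow predicate, in the spirit of partial-redundancy elimination; this formulation is an assumption, not copied from the paper:

    \[
    \mathit{earliest}(n) \;=\; \mathit{computable}(n) \,\wedge\,
    \Bigl( n = \mathit{entry} \;\vee \bigvee_{m \in \mathit{pred}(n)} \neg\,\mathit{computable}(m) \Bigr)
    \]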

The Earliest Insertion Point (1)-(4)
[CFG figures: the earliest predicate is evaluated over the annotated CFG, marking the frontier nodes where the replicated computation of p and the signal should be inserted.]

The Resulting Graph
[CFG figure: start; counter=0; wait(p); p->jmp?; p=p->jmp; p1=p->next; signal(p1); p->real?; on the !real path p=p1; q=p; counter++, q=q->next; q?; p=p1; end. The computation p1=p->next and signal(p1) appear on both arms of the p->jmp? branch, as early as possible.]
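Linearized as C, the transformed loop might read as the sketch below; wait and signal stand for the forwarding primitives, and p1 is the temporary introduced by the scheduler:

    do {
        counter = 0;
        wait(p);            /* stall until the previous thread forwards p */
        if (p->jmp)
            p = p->jmp;
        p1 = p->next;       /* compute the forwarded value early...       */
        signal(p1);         /* ...and release the next thread             */
        if (!p->real) {
            p = p1;
            continue;
        }
        q = p;
        do {
            counter++;
            q = q->next;
        } while (q);
        p = p1;
    } while (p);

The critical forwarding path shrinks from the whole loop body to wait(p), the p->jmp test, p1 = p->next, and signal(p1).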

Outline
- Compiler optimization
  - Optimization opportunity
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
    - Speculate on control flow
    - Speculate on data dependence
- Performance
- Conclusions

We Cannot Compute the Forwarded Value When…
- Control dependence: the next definition of p depends on an unresolved branch (p = p->jmp versus p = p->next).
- Ambiguous data dependence: an intervening update(p) may redefine p through an ambiguous pointer before p = p->next.
In both cases, moving the computation earlier requires speculation.

Speculating on Control Dependences
[CFG figure: "signal p; p = p->next" is hoisted above the branch, assuming the common path; on the mispredicted path, compiler-inserted code calls violate(p) to squash the consumer thread, recomputes p via p = p->jmp; p = p->next, and signals the corrected value.]
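As a linearized sketch (violate is the primitive that squashes the consumer thread so it re-reads the corrected value; the schedule below is illustrative):

    p1 = p->next;          /* speculate that the p->jmp path is not taken */
    signal(p1);            /* release the next thread early               */
    if (p->jmp) {          /* misspeculation: fix up                      */
        violate(p);        /* squash the consumer of the stale value      */
        p = p->jmp;
        p1 = p->next;
        signal(p1);        /* forward the corrected value                 */
    }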

Speculating on Data Dependence
[Figure: the load that computes the forwarded value, p = load(addr); signal(p), is hoisted above stores (store1, store2, store3) that may alias with addr.]
This requires hardware support.

Hardware Support
[Figure: the hoisted load p = load(0xADE8) is preceded by mark_load(0xADE8); the hardware checks the addresses of subsequent stores (store2 to 0x8438, store3 to 0x88E8) against the marked address, and unmark_load(0xADE8) ends the watch.]
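A small software model of this protocol; mark_load, unmark_load and the per-store check are stand-ins for the hardware behavior described on the slide:

    #include <stdio.h>

    /* Software model of the hardware watch on a hoisted load. */
    static unsigned watched = 0;   /* address under a mark_load watch */
    static int violated = 0;

    static void mark_load(unsigned addr)   { watched = addr; }
    static void unmark_load(unsigned addr) { if (watched == addr) watched = 0; }

    /* In hardware, every store is checked against the marked address. */
    static void store_check(unsigned addr) {
        if (addr == watched) {
            violated = 1;          /* would squash the consumer thread */
            printf("violation: store to 0x%X hit the marked load\n", addr);
        }
    }

    int main(void) {
        mark_load(0xADE8);         /* start watching the load address  */
        /* p = load(0xADE8); signal(p);  forward the speculative value */
        store_check(0x8438);       /* store2: different address, OK    */
        store_check(0x88E8);       /* store3: different address, OK    */
        unmark_load(0xADE8);       /* end of the speculative window    */
        printf("violated = %d\n", violated);
        return 0;
    }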

Outline
- Compiler optimization
- Performance
  - Conservative instruction scheduling
  - Aggressive instruction scheduling
  - Hardware optimization for scalar value communication
  - Hardware optimization for all value communication
- Conclusions

Conservative Instruction Scheduling
[Bar chart: normalized execution time (sync, fail, other, busy) for gcc, go, mcf, parser, perlbmk, twolf, vpr and the average; U = no instruction scheduling, A = conservative instruction scheduling.]
Conservative scheduling improves performance by 15%.

Optimizing Induction Variables Only
[Bar chart: U = no instruction scheduling, I = induction variable scheduling only, A = conservative instruction scheduling.]
Induction variable scheduling alone is responsible for a 10% performance improvement.
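The induction variable case is the simplest instance of the scheduling algorithm: when the only value a thread must forward is a counter, the signal can move all the way to the top of the thread. A hedged sketch, with wait/signal as before:

    /* Before: the consumer stalls for the whole loop body. */
    while (1) {
        wait(i);
        /* ... long loop body using i ... */
        i = i + 1;
        signal(i);
    }

    /* After: forward i+1 immediately, so the critical forwarding
       path shrinks to wait; add; signal. */
    while (1) {
        wait(i);
        signal(i + 1);
        /* ... long loop body using i ... */
        i = i + 1;
    }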

Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98] uses local analysis to schedule instructions across basic blocks, and does not allow scheduling of instructions across inner loops.
[Bar chart: gcc and go; M = Multiscalar scheduling, A = conservative scheduling.]

Aggressively Scheduling Instructions Across Control Dependences
[Bar chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control).]
Speculating on control dependences alone yields no performance improvement over conservative scheduling.

Aggressively Scheduling Instructions Across Control and Data Dependences
[Bar chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control), D = aggressive instruction scheduling (control + data).]
Speculating on both control and data dependences improves performance by 9% over conservative scheduling.

Hardware Optimization Over Non-Optimized Code
[Bar chart: U = no instruction scheduling, E = no instruction scheduling + hardware optimization.]
Hardware optimization alone improves performance by 4%.

Hardware Optimization + Conservative Instruction Scheduling
[Bar chart: A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization.]
Hardware optimization improves performance by a further 2%.

Hardware Optimization + Aggressive Instruction Scheduling
[Bar chart: D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization.]
Hardware optimization degrades performance slightly here; it becomes less important once the compiler optimizes.

Hardware Optimization for Communicating Both Scalar and Memory Values
[Bar chart: normalized execution time (sync, fail, other, busy) for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k and vortex.]
Reduces the cost of violations by 27.6%, translating into a 4.7% performance improvement.

Impact of Hardware and Software Optimizations
[Bar chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k and vortex.]
6 benchmarks are improved by …%, 4 others by …%.

Conclusions
- The critical forwarding path is an important bottleneck in TLS.
- Loop induction variables serialize parallel threads; they can be eliminated with our instruction scheduling algorithm.
- Non-inductive scalars can benefit from conservative instruction scheduling.
- Aggressive instruction scheduling should be applied selectively:
  - Speculating on control dependences alone is not very effective.
  - Speculating on both control and data dependences can reduce synchronization significantly; GCC benefits the most.
- Hardware optimization becomes less important as the compiler schedules instructions more aggressively: the critical forwarding path is best addressed by the compiler.