Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry
School of Computer Science, Carnegie Mellon University

Motivation
Industry is delivering multithreaded processors, and improving throughput on them is straightforward: the IBM Power4 processor has 2 processor cores per die and 4 dies per module, for 8 64-bit processors per unit. How can we use multithreaded processors to improve the performance of a single application? We need parallel programs.

Automatic Parallelization
Finding independent threads in integer programs is limited by complex control flow, ambiguous data dependences, and runtime inputs. More fundamentally, parallelization is determined at compile time. Thread-Level Speculation instead detects data dependences at runtime.

Thread-Level Speculation (TLS)
[Figure: speculative threads executing over time; a load that runs before an earlier thread's conflicting store is detected and the later thread retries.]
How do we communicate values between threads?

Speculation
    while(1) {
        …= *q;
        . . .
        *p = …;
    }
[Figure: threads Ti through Ti+3 each execute …=*q speculatively; when an earlier thread's *p=… conflicts, the later thread retries its load.]
Speculation is efficient when the dependence occurs infrequently.

Synchronization
    while(1) {
        …= a;
        a = …;
    }
wait(a) is inserted before the use of a, and signal(a) after its definition.
[Figure: thread Ti+1 stalls at wait(a) until thread Ti executes signal(a) after defining a.]
Synchronization is efficient when the dependence occurs frequently.

Critical Forwarding Path in TLS
    while(1) {
        wait(a);
        …= a;
        a = …;
        signal(a);
    }
The critical forwarding path is the span from wait(a), through the use and redefinition of a, to signal(a).

Cost of Synchronization
[Chart: normalized execution time, broken down into sync and other, for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average.]
The processors spend 43% of total execution time on synchronization.

Outline
    Compiler optimization
        Optimization opportunity
        Conservative instruction scheduling
        Aggressive instruction scheduling
    Performance
    Conclusions

Reducing the Critical Forwarding Path
[Figure: a long critical forwarding path vs. a short one, compared by execution time.]
A shorter critical forwarding path means less execution time.

A Simplified Example from GCC
    do {
        counter = 0;
        if (p->jmp)
            p = p->jmp;
        if (!p->real) {
            p = p->next;
            continue;
        }
        q = p;
        do {
            counter++;
            q = q->next;
        } while (q);
    } while (p);
[Figure: the control flow graph of this loop, with nodes for counter=0, p->jmp?, p=p->jmp, p->real?, q=p, the inner counter++/q=q->next loop, and the p=p->next definitions on each path to the back edge.]

Insert Wait Before First Use of P
[CFG snapshots: wait(p) is inserted after counter=0, immediately before the first use of p at the p->jmp? test.]

Insert Signal After Last Definition of P
[CFG snapshots: signal(p) is inserted after each of the two p=p->next definitions that reach the end of the loop body.]
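To make the placement concrete, here is a minimal C sketch of the simplified GCC loop after this conservative insertion, with wait(p) before the first use of p and signal(p) after each last definition of p. The wait/signal primitives are the ones named on the slides; the rendering itself is an illustration, not the compiler's actual output.

    do {
        counter = 0;
        wait(p);                  /* before the first use of p                */
        if (p->jmp)
            p = p->jmp;
        if (!p->real) {
            p = p->next;
            signal(p);            /* after the last definition on this path   */
            continue;
        }
        q = p;
        do {
            counter++;
            q = q->next;
        } while (q);
        p = p->next;
        signal(p);                /* after the last definition on this path   */
    } while (p);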

Earlier Placement for Signals
[CFG with signal(p) moved to earlier points along each path.]
How can we systematically find these insertion points?

Outline
    Compiler optimization
        Optimization opportunity
        Conservative instruction scheduling
            How to compute the forwarding value
            Where to compute the forwarding value
        Aggressive instruction scheduling
    Performance
    Conclusions

A Dataflow Algorithm
Given: the control flow graph (entry to exit). For each node in the graph, the analysis answers:
    Can we compute the forwarding value at this node?
    If so, how do we compute the forwarding value?
    Can we forward the value at an earlier node?
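As a rough illustration of how such an analysis can be structured, the C sketch below runs a backward fixed-point pass over the CFG. The Node representation, the Value lattice, and the meet() and transfer() hooks are placeholders invented here for illustration; they stand in for the paper's actual dataflow equations rather than reproducing them.

    #include <stdbool.h>

    #define MAX_SUCC 4

    typedef int Value;                 /* how the forwarding value is computed   */
    #define TOP    (-1)                /* nothing known yet                       */
    #define BOTTOM (-2)                /* no single forwarding value computable   */

    typedef struct {
        int succ[MAX_SUCC];            /* successor node ids                      */
        int nsucc;
    } Node;

    static Value meet(Value a, Value b)        /* combine info from successors    */
    {
        if (a == TOP) return b;
        if (b == TOP) return a;
        return (a == b) ? a : BOTTOM;          /* paths disagree                  */
    }

    static Value transfer(int node, Value out) /* placeholder transfer function:  */
    {                                          /* a real one would rewrite `out`  */
        (void)node;                            /* through the node's definitions  */
        return out;                            /* of p                            */
    }

    /* After convergence, in[n] describes whether and how the forwarding value
       can be computed at node n. */
    static void solve(const Node cfg[], int nnodes, Value in[])
    {
        for (int n = 0; n < nnodes; n++)
            in[n] = TOP;

        bool changed = true;
        while (changed) {
            changed = false;
            for (int n = nnodes - 1; n >= 0; n--) {
                Value out = TOP;
                for (int s = 0; s < cfg[n].nsucc; s++)
                    out = meet(out, in[cfg[n].succ[s]]);
                Value v = transfer(n, out);
                if (v != in[n]) {
                    in[n] = v;
                    changed = true;
                }
            }
        }
    }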

Moving Instructions Across Basic Blocks
[Figure: three cases (A), (B), (C) showing how signal p, together with the defining instruction p=p->next, is moved across basic block boundaries.]

Example (1)-(7)
[Sequence of CFG snapshots: starting from signal p placed after the final p=p->next, the scheduling algorithm moves the signal, together with the computation of the forwarded value, to progressively earlier nodes along each path of the loop body.]

Nodes with Multiple Successors
[Figure: handling a node with multiple successors; each outgoing path carries its own copy of p=p->next; signal p.]

Example (7)-(10)
[Further CFG snapshots: the forwarding code continues to move past branch nodes, with a copy of p=p->next; signal p placed along each path as needed.]

Nodes with Multiple Successors
[Figure: at the p->jmp? branch, the two paths compute the forwarded value differently (p=p->jmp; p=p->next versus p=p->next), so each keeps its own copy of the forwarding code.]

Example (10)-(11)
[Final CFG snapshots of the conservative scheduling pass, with the forwarding code placed as early as it can be computed on each path.]

Finding the Earliest Insertion Point
Find at each node in the CFG:
    Can we compute the forwarding value at this node?
    If so, how do we compute the forwarding value?
    Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
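One way to phrase the earliest test on top of such an analysis, again only as an illustrative sketch (can_compute[] is the assumed result of the dataflow pass, and pred[] lists node n's predecessors; neither is a structure taken from the paper):

    #include <stdbool.h>

    /* A node is an earliest insertion point if it can compute the forwarding
       value but, along some incoming path, its predecessor cannot. */
    static bool is_earliest(int n, const bool can_compute[],
                            int npred, const int pred[])
    {
        if (!can_compute[n])
            return false;
        if (npred == 0)
            return true;               /* entry node: nothing comes earlier    */
        for (int i = 0; i < npred; i++)
            if (!can_compute[pred[i]])
                return true;           /* that path cannot compute it earlier  */
        return false;
    }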

The Earliest Insertion Point (1)-(4)
[CFG snapshots: the earliest analysis marks, on each path, the first node at which the forwarding value can be computed, and the signal is placed there.]

The Resulting Graph
[CFG after scheduling: wait(p) follows counter=0; on each arm of the p->jmp? branch the forwarded value is computed into p1 (p1=p->next) and signalled with signal(p1); the later definitions of p at the bottom of the loop body become p=p1.]
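In source form, the resulting schedule corresponds roughly to the loop below, with p1 holding the forwarded value as in the graph above. This is a hand-written rendering of the graph for readability, not generated compiler output.

    do {
        counter = 0;
        wait(p);
        if (p->jmp)
            p = p->jmp;
        p1 = p->next;             /* forwarding value computed early...        */
        signal(p1);               /* ...and signalled before the inner loop    */
        if (!p->real) {
            p = p1;
            continue;
        }
        q = p;
        do {
            counter++;
            q = q->next;
        } while (q);
        p = p1;
    } while (p);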

Outline
    Compiler optimization
        Optimization opportunity
        Conservative instruction scheduling
        Aggressive instruction scheduling
            Speculate on control flow
            Speculate on data dependence
    Performance
    Conclusions

We Cannot Compute the Forwarded Value When…
Control dependence: the definition that reaches the signal depends on which path executes (e.g., p=p->jmp versus p=p->next).
Ambiguous data dependence: a call such as update(p) may modify the value before p=p->next.
In both cases we resort to speculation.

Speculating on Control Dependences
[Figure: p=p->next; signal(p) is hoisted above the branch; on the path where p=p->jmp executes, violate(p) notifies the consumer that the forwarded value was wrong, and p=p->next; signal(p) re-forwards the correct value.]
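Rendered as code, the control speculation above might look like the sketch below; violate() is the recovery primitive named on the slide, and the exact placement is illustrative rather than the compiler's actual output.

    p1 = p->next;             /* guess that the jmp path will not be taken    */
    signal(p1);               /* forward the guessed value early              */
    if (p->jmp) {             /* the guess was wrong on this path             */
        violate(p1);          /* squash consumers of the bad forwarded value  */
        p = p->jmp;
        p1 = p->next;
        signal(p1);           /* re-forward the correct value                 */
    }
    p = p1;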

Speculating on Data Dependence
[Figure: the load that produces the forwarded value (p=load(addr); signal(p)) is hoisted above potentially conflicting stores (store1, store2, store3) inside update(p).]
This requires hardware support.

Hardware Support
[Figure: the speculative load p=load(addr) is bracketed by mark_load(addr) and unmark_load(addr); intervening stores (e.g., store2(0x8438), store3(0x88E8)) are checked by hardware against the marked address (0xADE8).]
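As a rough code-level picture of the same mechanism, the speculative load is bracketed by the mark_load/unmark_load operations shown on the slide. They are written here as intrinsics, and the address expression is a hypothetical stand-in for the actual forwarded location.

    mark_load(&p->next);      /* hardware begins watching this address         */
    p1 = p->next;             /* speculative early load of the forwarded value */
    signal(p1);               /* forward it before the intervening stores      */
    update(p);                /* store1/store2/store3 may execute here; a      */
                              /* store to the marked address triggers recovery */
    unmark_load(&p->next);    /* stop watching                                 */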

Outline
    Compiler optimization
        Optimization opportunity
        Conservative instruction scheduling
        Aggressive instruction scheduling
    Performance
        Hardware optimization for scalar value communication
        Hardware optimization for all value communication
    Conclusions

Conservative Instruction Scheduling
[Chart: normalized execution time (busy, sync, fail, other) for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average. U = no instruction scheduling, A = conservative instruction scheduling.]
Conservative instruction scheduling improves performance by 15%.

Optimizing Induction Variables Only
[Chart: U = no instruction scheduling, I = induction variable scheduling only, A = conservative instruction scheduling.]
Induction variable scheduling alone is responsible for a 10% performance improvement.

Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98] uses local analysis to schedule instructions across basic blocks and does not allow scheduling of instructions across inner loops.
[Chart for gcc and go: M = Multiscalar scheduling, A = conservative scheduling, with execution time broken down into busy, sync, fail, and other.]

Aggressively Scheduling Instructions Across Control Dependences
[Chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control).]
Aggressive scheduling across control dependences alone gives no performance improvement over conservative scheduling.

Aggressively Scheduling Instructions Across Control and Data Dependences
[Chart: A = conservative instruction scheduling, C = aggressive instruction scheduling (control), D = aggressive instruction scheduling (control + data).]
Scheduling aggressively across both control and data dependences improves performance by 9% over conservative scheduling.

Hardware Optimization Over Non-Optimized Code
[Chart: U = no instruction scheduling, E = no instruction scheduling + hardware optimization.]
Hardware optimization alone improves performance by 4%.

Hardware Optimization + Conservative Instruction Scheduling
[Chart: A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization.]
Adding hardware optimization improves performance by 2%.

Hardware Optimization + Aggressive Instruction Scheduling
[Chart: D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization.]
Adding hardware optimization degrades performance slightly; hardware optimization is less important in the presence of compiler optimization.

Hardware Optimization for Communicating Both Scalar and Memory Values
[Chart: normalized execution time for compress, crafty, gap, gcc, go, gzip, ijpeg, m88k, mcf, parser, perlbmk, twolf, vortex, and vpr.]
It reduces the cost of violations by 27.6%, translating into a 4.7% performance improvement.

Impact of Hardware and Software Optimizations
[Chart: normalized execution time for compress, crafty, gap, gcc, go, gzip, ijpeg, m88k, mcf, parser, perlbmk, twolf, vortex, and vpr.]
6 benchmarks are improved by 6.2-28.5%, and 4 others by 2.7-3.6%.

Conclusions
The critical forwarding path is an important bottleneck in TLS.
Loop induction variables serialize parallel threads, but they can be eliminated with our instruction scheduling algorithm.
Non-inductive scalars can benefit from conservative instruction scheduling.
Aggressive instruction scheduling should be applied selectively: speculating on control dependences alone is not very effective, while speculating on both control and data dependences can reduce synchronization significantly; GCC is the biggest beneficiary.
Hardware optimization is less important as the compiler schedules instructions more aggressively.
The critical forwarding path is best addressed by the compiler.