Compiler Optimization of Scalar and Memory-Resident Value Communication Between Speculative Threads. Antonia Zhai et al.

Improving Performance of a Single Application: finding parallel threads is difficult. (Figure: a chip with four processor/cache pairs running threads T1-T4; a store to 0x86 in one thread and a load from 0x86 in a later thread form a cross-thread dependence.)

The Compiler Makes Conservative Decisions. Pros: it can examine the entire program. Cons: it must make conservative decisions for ambiguous dependences caused by complex control flow, pointer and indirect references, and runtime input. (Figure: four iterations of a loop for (i = 0; i < N; i++) { ... *q ... *p ... } access different address pairs at runtime, such as 0x80/0x86, 0x60/0x66, 0x90/0x96, and 0x50/0x56, yet the compiler cannot prove the iterations independent.)
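
To make the problem concrete, here is a minimal C sketch (not taken from the slides; the function and its arguments are invented for illustration): because p and q arrive at runtime, the compiler cannot prove that the store through p in one iteration never feeds the load through q in a later iteration, so it must assume a loop-carried dependence.

    /* Hypothetical example: p and q are passed in at runtime, so the
     * compiler cannot rule out that *p in one iteration aliases *q in
     * a later one, and must conservatively serialize the loop. */
    void loop(int *p, int *q, int n) {
        for (int i = 0; i < n; i++) {
            int t = *q;     /* may read a value written by an earlier *p = ... */
            *p = t + 1;     /* may be read by a later ... = *q                 */
            p++;
            q++;
        }
    }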

Using Hardware to Exploit Parallelism: the hardware searches for independent instructions within an instruction window. Pros: it disambiguates dependences accurately at runtime. Cons: it can exploit parallelism only among the small number of instructions in the window (e.g., Load 0x88, Store 0x66, Add a1, a2). The question is how to exploit parallelism with both the hardware and the compiler.

Thread-Level Speculation (TLS): the compiler creates parallel threads and the hardware disambiguates dependences. My goal is to speed up programs with the help of TLS. (Figure: a sequential Store *p followed by Load *q is executed speculatively in parallel; at runtime the store to 0x86 and the load from 0x86 form a dependence.)

Single-Chip Multiprocessor (CMP): our TLS support and optimizations are built upon a CMP (processor/cache pairs connected by an on-chip interconnect to memory) but can be applied to other architectures that support multiple threads. A CMP offers replicated processor cores (reducing design cost), a scalable and decentralized design (localizing wires and eliminating centralized structures), and infrastructure transparency (handling legacy code).

Thread-Level Speculation (TLS): to support TLS we must recover from failed speculation, which requires buffering speculative writes from memory, and track data dependences, which requires detecting data dependence violations (e.g., a speculative Load *q executing before an earlier thread's Store *p to the same address).

Buffering Speculative Writes from Memory: each cache holds the data contents, directory state to maintain cache coherence, and speculative state; the cache coherence protocol is extended to support TLS.

Detecting Dependence Violations: the producer forwards all store addresses across the interconnect, and the consumer detects a data dependence violation when a forwarded store address matches a location it has speculatively loaded; this also extends the cache coherence protocol [ISCA'00].

Synchronizing Scalars: identifying the scalars that cause inter-thread dependences is straightforward (a producer thread defines a = ..., and a consumer thread uses ... = a).

Synchronizing Scalars: dependent scalars should be synchronized; the producer executes signal(a) after its last definition of a, and the consumer executes wait(a) before its first use and then uses the forwarded value.

Reducing the Critical Forwarding Path: within each thread, the code executed between wait(a) and signal(a) forms the critical forwarding path; instruction scheduling can shorten it by moving the definition of a and its signal earlier, turning a long critical path into a short one.
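
A minimal sketch of the idea, assuming hypothetical wait_value() and signal_value() helpers that stand in for the TLS forwarding support: scheduling the update of a and its signal ahead of unrelated work shortens the critical forwarding path seen by the next thread.

    /* Hypothetical TLS primitives; the real ISA-level wait/signal differ. */
    extern int  wait_value(void);        /* block until the previous thread signals a */
    extern void signal_value(int v);     /* forward a to the next thread              */
    extern void unrelated_work(int a);

    /* Long critical path: the next thread waits for the whole iteration. */
    void iteration_long(void) {
        int a = wait_value();
        unrelated_work(a);
        a = a + 1;
        signal_value(a);
    }

    /* Short critical path: a and its signal are scheduled before the
     * unrelated work, so the next thread can start much sooner. */
    void iteration_short(void) {
        int a = wait_value();
        int a_next = a + 1;
        signal_value(a_next);
        unrelated_work(a);
    }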

Potential (figure-only slide).

Compiler Infrastructure: loops are targeted for speculation. Selected loops are discarded if they have low coverage (<0.1% of execution time), fewer than 30 instructions per iteration, or more than 16384 instructions per iteration. The remaining loops are profiled and simulated to measure an optimistic upper bound on their speedup, and good loops are chosen based on the results. The compiler inserts instructions to create and manage epochs, allocates forwarded variables on the stack (the forwarding frame), and inserts wait and signal instructions.
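
As a rough illustration of this transformation (the epoch-management and forwarding primitives below are hypothetical names, not the actual compiler output):

    /* Hypothetical epoch-management and forwarding primitives. */
    extern void spawn_next_epoch(void *forwarding_frame);
    extern void wait_scalar(int *slot);     /* stall until slot is forwarded  */
    extern void signal_scalar(int *slot);   /* forward slot to the next epoch */
    extern int  work(int x);

    struct forwarding_frame { int x; };     /* forwarded scalars live on the stack */

    void epoch_body(struct forwarding_frame *ff) {
        spawn_next_epoch(ff);        /* create the next speculative thread   */
        wait_scalar(&ff->x);         /* wait before the first use of x       */
        int local = work(ff->x);
        ff->x = local;               /* last definition of the forwarded x   */
        signal_scalar(&ff->x);       /* signal after the last definition     */
    }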

Synchronization Constraints: wait before the first use; signal after the last definition; signal on every possible path; the wait should be as late as possible and the signal as early as possible.

Dataflow Analysis for Synchronization: the set of forwarded scalars is defined as the intersection of the scalars with a downward-exposed definition and those with an upward-exposed use; scalars that are live outside the loop are also included. The CFG is modeled as a graph whose nodes are basic blocks.
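
A small sketch of the set computation, assuming each scalar is represented by a bit in a word and that the three input sets come from standard dataflow analyses; the exact treatment of loop-live scalars is an interpretation of the slide.

    #include <stdint.h>

    /* Each bit identifies one scalar. */
    uint64_t forwarded_scalars(uint64_t downward_exposed_defs,
                               uint64_t upward_exposed_uses,
                               uint64_t live_after_loop) {
        /* Forwarded = defs exposed to the next iteration that it actually uses. */
        uint64_t forwarded = downward_exposed_defs & upward_exposed_uses;
        /* Scalars defined in the loop and live outside it are also included. */
        forwarded |= downward_exposed_defs & live_after_loop;
        return forwarded;
    }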

Instruction Scheduling: conservative scheduling, and aggressive scheduling that speculates across control dependences and data dependences.

Instruction Scheduling: starting from the initial synchronization (wait(a) before the use ... = a and signal(a) after the definition a = ...), scheduling moves the definition of a and its signal(a) as early as possible in the thread; speculatively scheduling instructions goes further and moves them above a potentially conflicting store such as *q = ..., issuing the signal before the store is known to be safe.

Instruction Scheduling: dataflow analysis handles complex control flow. We define two dataflow analyses: the “stack” analysis finds the instructions needed to compute the forwarded value, and the “earliest” analysis finds the earliest node at which the forwarded value can be computed.

Computation Stack: a stack stores the instructions needed to compute a forwarded value, and a stack is associated with every node for every forwarded scalar. A node's stack is in one of three states: not yet evaluated, we don't know how to compute the forwarded value, or we know how to compute it (e.g., a stack holding signal a; a = a*11).
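
One possible representation of this per-node state (a sketch, not the paper's actual data structure):

    #define MAX_STACK 8

    enum stack_status {
        NOT_EVALUATED,      /* node not visited yet                       */
        UNKNOWN,            /* we don't know how to compute the value     */
        KNOWN               /* stack holds the instructions to compute it */
    };

    /* One of these per CFG node, per forwarded scalar. */
    struct computation_stack {
        enum stack_status status;
        int depth;
        const char *insts[MAX_STACK];  /* e.g. { "signal a", "a = a*11" } */
    };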

A Simplified Example from GCC: a do { ... } while(p) loop whose body, shown as a CFG, contains counter=0; wait(p); a p->jmp? test leading to p=p->jmp; then q = p; a p->real? test leading to p=p->next; and an inner loop counter++; q=q->next guarded by q?, before reaching the end of the iteration.
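
Reassembled into source form, the example corresponds roughly to the following C loop (a reconstruction from the node labels; the types and the exact branch structure are assumptions and may differ from the real gcc code):

    struct node { struct node *next; struct node *jmp; int real; };

    void example_loop(struct node *p) {
        int counter;
        do {
            counter = 0;
            /* wait(p): p arrives from the previous iteration's thread */
            if (p->jmp)
                p = p->jmp;
            struct node *q = p;
            if (p->real)
                p = p->next;
            while (q) {
                counter++;
                q = q->next;
            }
            /* signal(p): forward p to the next iteration's thread */
        } while (p);
        (void)counter;
    }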

Stack Analysis (shown as an animation over the GCC example): the analysis starts at the end node with the stack containing only signal p and propagates backward through the CFG; passing the definition p=p->next pushes it onto the stack (signal p; p=p->next), and on the infrequent path the definition p=p->jmp is pushed as well (signal p; p=p->next; p=p->jmp). Whenever a node's stack changes, the nodes that depend on it are marked to be revisited, and the analysis iterates until no stack changes, at which point the solution is consistent.

Scheduling Instructions: dataflow analysis handles complex control flow. We define two dataflow problems: the “stack” analysis finds the instructions needed to compute the forwarded value, and the “earliest” analysis finds the earliest node at which to compute the forwarded value.

The Earliest Analysis (shown as an animation over the same example): each node is classified as earliest, not earliest, or not yet evaluated; the earliest nodes are the first points along each path at which the forwarded value can be computed, and the analysis iterates over the CFG until every node is classified.

Code Transformation: the computation of the forwarded value and its signal are inserted at the earliest nodes. On the infrequent path the inserted code is p2=p->jmp; p1=p2->next; signal(p1), and on the frequent path it is p1=p->next; signal(p1); the rest of the original body (counter=0, wait(p), the p->jmp? and p->real? tests, q = p, p=p->next, and the counter++/q=q->next loop) executes as before.
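
In source form, the transformed iteration might look roughly like this; wait_p, signal_p, and rest_of_body are hypothetical helpers standing in for the TLS primitives and the remaining loop body.

    struct node { struct node *next; struct node *jmp; int real; };
    extern void wait_p(struct node **p);      /* hypothetical TLS wait   */
    extern void signal_p(struct node *p);     /* hypothetical TLS signal */
    extern void rest_of_body(struct node *p, int *counter);  /* q = p, the
                                                 p->real? test, counter loop */

    void transformed_iteration(struct node *p, int *counter) {
        struct node *p1, *p2;
        *counter = 0;
        wait_p(&p);
        if (p->jmp) {            /* infrequent path: its earliest node     */
            p2 = p->jmp;
            p1 = p2->next;
            signal_p(p1);        /* forwarded value is ready this early    */
            p = p2;
        } else {                 /* frequent path: its earliest node       */
            p1 = p->next;
            signal_p(p1);
        }
        rest_of_body(p, counter);  /* the remaining original statements run
                                      after the signal has been sent       */
    }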

Instruction Scheduling (recap): from the initial synchronization, scheduling moves a = ... and signal(a) earlier in the thread, and speculatively scheduling instructions additionally moves them above a potentially conflicting store *q = ....

Aggressive Scheduling: optimize the common case by moving the signal up. On the wrong path, send violate_epoch to the next thread and then forward the correct value. If instructions are scheduled past branches, then when an exception occurs a violation is sent and the non-speculative code is executed. Changes to the analysis: the meet operator is modified to compute the stack for frequently executed paths only, and a new node is added for each infrequent path into which violate_epoch signals are inserted; earliest is set to true for these nodes.

Aggressive Scheduling: two new instructions are added. mark_load tells the hardware to remember the address of the loaded location and is placed where a load is moved above a store; unmark_load clears the mark and is placed at the original position of the load instruction. In the meet operation, conflicting load marks are merged using logical OR.
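
A sketch of where the two new instructions would go, writing mark_load and unmark_load as hypothetical tls_mark_load/tls_unmark_load intrinsics: the load of p->next is hoisted above a possibly conflicting store, and the hardware checks the marked address against intervening stores.

    struct node { struct node *next; };
    extern void tls_mark_load(const void *addr);    /* hypothetical: remember addr   */
    extern void tls_unmark_load(const void *addr);  /* hypothetical: clear the mark  */
    extern void signal_p(struct node *p);
    extern int *q;

    void hoisted_load(struct node *p) {
        tls_mark_load(&p->next);        /* placed where the load was moved to      */
        struct node *p1 = p->next;      /* load hoisted above the store below      */
        signal_p(p1);

        *q = 0;                         /* store that may conflict with the load   */

        tls_unmark_load(&p->next);      /* placed at the load's original position  */
    }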

Speculating Beyond a Control Dependence: on the frequently executed path the compiler computes and signals the forwarded value early (p1=p->next; signal(p1)); if the infrequent path through p=p->jmp is taken instead, the thread issues violate(p), recomputes p, and forwards the corrected value.
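
A sketch of the same idea in source form, assuming hypothetical signal_p and violate_epoch helpers: the frequent path's value is signaled before the branch resolves, and the infrequent path squashes the next thread and forwards the corrected value.

    struct node { struct node *next; struct node *jmp; };
    extern void signal_p(struct node *p);        /* hypothetical: forward p          */
    extern void violate_epoch(void);             /* hypothetical: squash next thread */

    void aggressive_iteration(struct node *p) {
        struct node *p1 = p->next;   /* assume the frequent path                */
        signal_p(p1);                /* signal moved above the p->jmp? branch   */

        if (p->jmp) {                /* infrequent path actually taken          */
            violate_epoch();         /* the value signaled above was wrong      */
            p = p->jmp;
            p1 = p->next;
            signal_p(p1);            /* forward the corrected value             */
        }
        /* ... rest of the loop body ... */
    }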

Speculating Beyond a Potential Data Dependence: guided by profiling information, the load in p = p->next; signal(p) is moved above a potentially conflicting store such as *q = NULL, making it a speculative load; the required hardware support is similar to the memory conflict buffer [Gallagher et al., ASPLOS'94].

Experimental Framework. Benchmarks: SPECint95 and SPECint2000, compiled with -O3 optimization. Underlying architecture: a 4-processor single-chip multiprocessor with speculation supported by coherence. Simulator: detailed simulation of an out-of-order superscalar similar to the MIPS R14K that simulates communication latency and models all bandwidth and contention.

Impact of Synchronization Stalls for Scalars: the performance bottleneck is synchronization, about 40% of execution time. (Figure: normalized region execution time, broken into synchronization stall and other, for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88ksim, and vortex; detailed simulation with TLS support, 4-processor CMP, 4-way issue out-of-order superscalar, 10-cycle communication latency.)

Instruction Scheduling: comparing no instruction scheduling (U) against instruction scheduling (A) across the same benchmarks, scheduling improves performance by 18%, with individual gains such as 5%, 22%, and 40%, but there is still room for improvement. (Figure: normalized region execution time broken into synchronization stall, failed speculation, other, and busy.)

Aggressively Scheduling Instructions: comparing instruction scheduling (A) against speculating across control and data dependences (S) for gcc, mcf, parser, perlbmk, and twolf, aggressive scheduling improves performance significantly for some benchmarks, with gains such as 17%, 19%, and 15%. (Figure: normalized region execution time broken into synchronization stall, failed speculation, other, and busy.)

Conclusions: 6 of the 14 applications show a performance improvement. Synchronization and parallelization expose some new data dependences, and speculatively scheduling past control and data dependences gives better performance.

Memory-Resident Values: these are difficult because of aliasing (traditional dataflow analysis does not help) and because there is no clear way to define the location of the last definition or the first use. The potential performance gain from reducing failed-speculation cycles can be seen below.

Synchronizing Hardware: the producer signals both the address and the value to the next thread. The producer requires a signal address buffer to ensure correct execution: if a later store address is found in the signal address buffer, it is a misspeculation. The consumer has a local use-forwarded flag to decide whether to load the value from the speculative cache or from local memory; the flag is reset if the same thread writes to the memory location before reading it. A NULL address is handled like any other address: when it causes an exception, the non-speculative code is used.
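
As a purely illustrative software model of the consumer-side decision (the real mechanism is hardware state, and the names below are invented): the use-forwarded flag selects between the forwarded value and local memory, and a local write to the same location before the read clears the flag.

    #include <stdbool.h>
    #include <stdint.h>

    struct forwarded_slot {
        bool      use_forwarded;  /* set when the producer forwards addr+value */
        uintptr_t addr;
        int       value;
    };

    /* Consumer-side load: pick the forwarded value or the local memory copy. */
    int consumer_load(struct forwarded_slot *slot, const int *local_addr) {
        if (slot->use_forwarded && slot->addr == (uintptr_t)local_addr)
            return slot->value;     /* use the value signaled by the producer     */
        return *local_addr;         /* otherwise read local (speculative) memory  */
    }

    /* A store to the same location before the read clears the flag. */
    void consumer_store(struct forwarded_slot *slot, int *local_addr, int v) {
        *local_addr = v;
        if (slot->addr == (uintptr_t)local_addr)
            slot->use_forwarded = false;
    }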

Compiler Support: each load and store is assigned an ID based on the call stack and is profiled for dependences. A dependence graph is constructed and all IDs accessing the same location are grouped; all loads and stores belonging to the same group are synchronized by the compiler. The compiler then clones all procedures on the call stacks containing frequent data dependences, so that synchronization is executed only in the context of those call stacks, and modifies the original code to call the cloned procedures. Dataflow analysis, similar to that for scalar variables, inserts a signal for the last store at the end of every path.
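
A sketch of the resulting synchronization, with hypothetical signal_mem and wait_mem helpers: the producer signals the address and value after its last store to the location, and the consumer waits before its first load.

    extern void signal_mem(const int *addr, int value);  /* hypothetical: forward addr+value */
    extern int  wait_mem(const int *addr);               /* hypothetical: stall for addr     */

    int shared;   /* memory-resident value with a frequent cross-thread dependence */

    /* Producer thread: signal right after the last store to the location. */
    void producer_epoch(int v) {
        shared = v;
        signal_mem(&shared, v);
    }

    /* Consumer thread (cloned procedure): wait before the first load. */
    int consumer_epoch(void) {
        int v = wait_mem(&shared);
        return v + 1;
    }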

Analysis of Data Dependence Patterns: a potential study of when to speculate and when to synchronize. Significant performance improvement is obtained when inter-epoch dependences occurring in more than 5% of all epochs are predicted correctly, and dependences with a distance of one epoch have a significant effect on speedup.

Impact of Failed Speculation on Memory-Resident Values: the next performance bottleneck is failed speculation, about 30% of execution time. (Figure: normalized region execution time, broken into failed speculation and other, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, and twolf; detailed simulation with TLS support, 4-processor CMP, 4-way issue out-of-order superscalar, 10-cycle communication latency.)

Compiler-Inserted Synchronization: comparing no synchronization (U) against compiler-inserted synchronization (C), seven benchmarks speed up by 5% to 46% (individual gains include 10%, 46%, 13%, 5%, 8%, 5%, and 21%). (Figure: normalized region execution time, broken into failed speculation, synchronization stall, other, and busy, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, and bzip2_comp.)

Compiler- vs. Hardware-Inserted Synchronization: comparing compiler-inserted (C) against hardware-inserted (H) synchronization across the same benchmarks, the compiler and the hardware each benefit different benchmarks; the hardware does better on some and the compiler on others. (Figure: normalized region execution time broken into failed speculation, synchronization stall, other, and busy.)

Combining Hardware and Compiler Synchronization: comparing compiler-inserted synchronization (C), hardware-inserted synchronization (H), and the combination of both (B) for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, and gap, the combination is more robust. (Figure: normalized region execution time broken into failed speculation, synchronization stall, other, and busy.)

Conclusion: compiler-inserted synchronization of memory-resident values improves performance, and hardware synchronization and compiler synchronization target different memory instructions, so together they do a better job than either alone.

Some Other Issues: register pressure due to scheduling scalar and memory instructions; profiling time for obtaining the frequency of data dependences; and heuristics, i.e., the choice of thresholds.