Enforcing Sequential Consistency in SPMD Programs with Arrays
Wei Chen, Arvind Krishnamurthy, Katherine Yelick

Motivation
Compiler and hardware optimizations are legal only if the resulting execution is indistinguishable from one that follows program order.
In terms of memory accesses:
- Uniprocessor: never reorder accesses that have data dependencies
- Multiprocessor: not enough to satisfy only local dependencies
Programmers intuitively rely on the notion of Sequential Consistency (SC) for parallel programs.

Examples of SC violation

    // initially A = B = 0
    T1:                     T2:
    A = 1                   B = 1
    if (B == 0)             if (A == 0)
        critical section        critical section

If we consider only one thread, the accesses to A and B can be reordered. But that would allow both threads to enter the critical section. SC prevents this by restricting reordering on shared memory accesses: reordering is allowed on neither T1 nor T2, because the other thread may observe the effect.

Problem with Sequential Consistency
SC is easy to understand, but expensive to enforce.
Naïve approach: insert fences between every consecutive pair of shared accesses. This is bad for performance:
- Most compiler optimizations (pipelining, prefetching, code motion) need to reorder memory accesses
- Fences are expensive (they drain the processor pipeline)
The cost is especially high for GAS (global address space) languages:
- An alternative to MPI for distributed memory machines
- Threads can read and write remote memory directly
- Overlapping communication overhead is critical
Goal: find the minimal amount of ordering needed to guarantee sequential consistency.
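To make the fence cost concrete, here is a minimal sketch of the naïve scheme applied to the earlier example. It is our illustration, not code from the paper: the fence macro stands in for whatever full memory barrier the target provides (GCC's __sync_synchronize builtin is used here), and the point is only the placement, one barrier between every consecutive pair of shared accesses.

    extern int A, B;                      /* shared variables */
    #define fence() __sync_synchronize()  /* full memory barrier (GCC builtin) */

    void t1(void) {
        A = 1;          /* shared write */
        fence();        /* naive: a fence between every consecutive
                           pair of shared accesses */
        if (B == 0) {   /* shared read */
            /* critical section */
        }
    }

Every such barrier stalls the pipeline, which is why shrinking the delay set matters.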

Problem Statement
Input: an SPMD program with program order P
- Represented as a graph, with shared accesses as its nodes
Delay: for u, v on the same thread, a delay guarantees that u happens before v
- i.e., there is a fence between u and v
- A "delay set" is a subset of P
Output: find the minimal delay set D such that any execution consistent with it satisfies SC
The solution uses the idea of cycle detection.

Cycle Detection
Conflicting accesses: for u, v on different threads
- u, v conflict if they access the same shared variable, with at least one write
A parallel execution instance E defines a happens-before relation for accesses to the same shared memory location.
- E is memory centric; P is thread centric
E is correct iff it is consistent with P
- In other words, there can be no cycles in P ∪ E
But we don't know E
- Use C, the set of conflict edges, to approximate E

Example
SC restriction: (x, y) on P2 can't be (0, 1).
The analysis finds a critical cycle and enforces all the delays on it. The figure-eight shape is the only way to get a cycle for straight-line code.

    (initially X = Y = 0)
    P1:              P2:
    Write X = 1      Read Y  (into y)
    Write Y = 1      Read X  (into x)

    P edges: Write X -> Write Y and Read Y -> Read X (these become the delays)
    C edges: Write X <-> Read X and Write Y <-> Read Y; they cross, closing the figure-eight cycle

Example II
No restriction from SC: (x, y) on P2 can be (0,0), (0,1), (1,0), or (1,1).
The analysis finds no cycles in the graph, so no delays are necessary.

    (initially X = Y = 0)
    P1:              P2:
    Write X = 1      Read X  (into x)
    Write Y = 1      Read Y  (into y)

    P edges: Write X -> Write Y and Read X -> Read Y
    C edges: Write X <-> Read X and Write Y <-> Read Y; they run in parallel, so P ∪ C has no cycle

Cycle Detection for SPMD Programs
Krishnamurthy and Yelick created polynomial-time algorithms for SPMD programs:
- Keep two copies of P: P_L and P_R
- Add the internal C (conflict) edges to P_R
- Remove all edges from P_L
- Consider the conflict graph P_L ∪ C ∪ P_R
- For each pair (u_L, v_L) in P, check whether there is a back-path (v_L, u_L)
- The algorithm takes O(n³) time: one depth-first search for each node, where n is the number of shared accesses
- It computes the minimal delay set for programs with scalar variables
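A minimal C sketch of this check, under our own simplifications: the graph lives in a fixed-size adjacency matrix that the caller fills in, and the names (adj, reach, is_delay, MAXN) are invented for illustration rather than taken from the implementation.

    #include <string.h>

    #define MAXN 256        /* max number of shared accesses (illustrative) */

    /* One graph over 2n nodes: 0..n-1 are the left copies (only conflict
     * edges leave them, since P_L keeps no P edges), n..2n-1 are the
     * right copies (P edges plus conflict edges). */
    static int adj[2*MAXN][2*MAXN];
    static int reach[2*MAXN];   /* marked during one DFS */
    static int n;               /* number of shared accesses */

    static void dfs(int x) {
        reach[x] = 1;
        for (int w = 0; w < 2*n; w++)
            if (adj[x][w] && !reach[w])
                dfs(w);
    }

    /* P edge (u, v) is a delay iff u_L is reachable from v_L. */
    int is_delay(int u, int v) {
        memset(reach, 0, sizeof reach);
        dfs(v);             /* explore the conflict graph from v_L */
        return reach[u];    /* back-path to u_L found? */
    }

Running one DFS from each node and reusing the result for every partner, instead of re-running it per query as this sketch does, is what gives the O(n³) bound quoted above.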

Algorithm at Work

The code:
    while (turn != MYPROC) ;
    numTrans++;
    fund = c;
    turn++;

(Figure: the conflict graph built from this code, with the access sequence Read turn, Read numTrans, Write numTrans, Write fund, Write turn duplicated into the two copies P_L and P_R; P edges run inside P_R, and C edges link conflicting accesses across the copies.)

Faster SPMD Cycle Detection
For each P edge (u_L, v_L), we want to know whether u_L is reachable from v_L in the conflict graph.
Since the graph is static, we can use strongly connected components to cache reachability:
- For (u_L, v_L), a back-path (v_L, u_L) exists iff the component C(u_L) is reachable from C(v_L) (or they are the same)
This gives an O(n²) running time.
It computes the same delay set as Krishnamurthy and Yelick's algorithm.
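A sketch of the cached query, assuming the components and the component-level reachability table have already been precomputed (say with Tarjan's SCC algorithm followed by a traversal of the component DAG); scc_id and scc_reach are our invented names:

    #define MAXN 256

    /* Assumed precomputed: the component of each node, and reachability
     * between components in the condensed DAG. */
    static int scc_id[2*MAXN];
    static unsigned char scc_reach[2*MAXN][2*MAXN];

    /* Back-path (v_L, u_L) exists iff u_L's component is reachable
     * from v_L's component, or the two coincide. */
    int is_delay_cached(int u, int v) {
        int cu = scc_id[u], cv = scc_id[v];
        return cu == cv || scc_reach[cv][cu];
    }

Once the table exists, each delay query is O(1), which is how the whole analysis fits in the O(n²) bound stated above.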

Extending Cycle Detection to Array Accesses
The previous algorithm reports many false delays for array accesses in loops:

    for (i = 0; i < N; i++) {
        A[i] = 1;    (S1)
        B[i] = 2;    (S2)
    }

- Cycle detection finds a back-path from S2 to S1
- But each S1, S2 accesses a different memory location, and
- threads iterate the loop in the same order, so there is no SC violation
We can improve the accuracy by incorporating array indices into the analysis.
(Figure: accesses S1, S2 on threads T1 and T2, with P edges along each thread and C edges between the threads.)

Concept behind SPMD Cycle Detection for Arrays
Imagine the loop fully unrolled:
- All cycles have the figure-eight shape
- A conflict edge means the two array accesses have the same subscript
- For a cycle on (u_L, v_L) with back-path (v_L, v_R, …, u_R, u_L):
      index(v_L) == index(v_R), index(u_L) == index(u_R),
      iter(v_L) >= iter(u_L), iter(u_R) >= iter(v_R)
- Iteration information is encoded in the edge direction
How do we incorporate information about the index into the conflict graph?

Augmenting the Conflict Graph with Weights
For an edge (A[f(i)], B[g(i)]), assign it the weight g(i) - f(i).
For a loop back edge, use the loop increment.
Conflict edges always have zero weight.
New goal: for (u_L, v_L), find a back-path (v_R, …, u_R) such that W(u_L, v_L) + W(v_R, u_R) == 0.

    for (i = 0; i < N; i++) {
        A[i] = 1;    (S1)
        B[i] = 2;    (S2)
    }
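For affine subscripts of the form a*i + b with matching strides, the edge weight is just the difference of the constant offsets. A small sketch under that assumption (the struct and function names are ours):

    /* An affine subscript a*i + b; e.g. A[i+3] has a = 1, b = 3. */
    struct affine { int a, b; };

    /* Weight of the P edge from access A[f(i)] to access B[g(i)],
     * i.e. g(i) - f(i). Assumes f.a == g.a, so the difference is a
     * constant; otherwise the weight is not constant and the
     * integer-programming formulation below is needed. */
    int edge_weight(struct affine f, struct affine g) {
        return g.b - f.b;
    }

    /* A loop back edge carries the loop increment as its weight. */
    int back_edge_weight(int loop_increment) {
        return loop_increment;
    }

In the loop above, the edge from S1 (A[i]) to S2 (B[i]) gets weight 0, and the loop back edge gets weight 1.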

Three Polynomial-Time Algorithms
Zero-cycle detection:
- Applies when all edge weights are constants
- Uses graph theory to detect zero-weight cycles (simple and non-simple)
- O(n³) if there are no negative cycles, O(n⁵) otherwise
Data-flow analysis:
- Uses the signs of the edge weights to approximate the answer
- O(n³) time
Integer programming with 4 variables:
- Useful for generic affine terms
- For each (u, v), find all possible pairs of conflict edges (C(u), C(v))
- Create linear systems with 4 equations
- O(n⁴) time

Data-flow Analysis Approximation
Check whether any cycle through the pair must have non-zero weight:
(u_L, v_L) is NOT a delay if sgn(u_L, v_L) ∏ sgn(v_L, u_L) is in {+, -}
Sign lattice: {⊤, +, -, 0}
Data-flow equations (the outer ∏ is the lattice meet over predecessors; the inner ∏ composes the edge sign with the incoming fact):
    IN(B) = ∏ over P in pred(B) of ( sgn(P, B) ∏ OUT(P) )
    OUT(B) = IN(B)
O(n³) time:
- Only 3n initial conditions for the data-flow analysis
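A small C sketch of the sign domain and its two operations as we read them off this slide; the exact tables are our interpretation of the {⊤, +, -, 0} lattice, not code from the paper.

    /* Sign abstraction of a path weight: zero, positive, negative,
     * or unknown (top). */
    enum sign { S_ZERO, S_POS, S_NEG, S_TOP };

    /* Composing two path segments: a zero-weight segment preserves
     * the sign; two positive (negative) segments stay positive
     * (negative); mixing + and - can cancel to anything, hence TOP. */
    enum sign sign_compose(enum sign a, enum sign b) {
        if (a == S_ZERO) return b;
        if (b == S_ZERO) return a;
        if (a == b) return a;
        return S_TOP;
    }

    /* Meet of two facts arriving at the same node: agreement is kept,
     * disagreement collapses to TOP. */
    enum sign sign_meet(enum sign a, enum sign b) {
        return (a == b) ? a : S_TOP;
    }

A pair (u_L, v_L) is then discharged as a non-delay when sign_compose applied to sgn(u_L, v_L) and sgn(v_L, u_L) yields S_POS or S_NEG: every cycle through the pair has strictly non-zero weight.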

Integer Programming Example

    if (MYTHREAD == 1)
        for (i = 0; i < N; i += 3) {
            A[i] = c1;      (S1)
            B[i+1] = c2;    (S2)
        }
    if (MYTHREAD == 2)
        for (j = 2; j < N; j += 2) {
            B[j] = c3;      (S3)
            A[j-2] = c4;    (S4)
        }

For (S1, S2), any cycle must include (S2, S3) and (S4, S1). The edges yield the system:
    S1 -> S4: i = j - 2                 S2 -> S3: i' + 1 = j'
    S1 -> S2: i' = i + 3k1, k1 >= 0     S3 -> S4: j = j' + 2k2, k2 >= 0
    S2 -> S1: i = i' + 3k1, k1 >= 1     S4 -> S3: j' = j + 2k2, k2 >= 1
The system has no solution, so (S1, S2) is not a delay.
Zero-cycle detection likewise finds no delay; data-flow analysis conservatively reports one.
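To see why the system is infeasible (our worked substitution, using the cycle S1 -> S2 -> S3 -> S4 -> S1): combining i' = i + 3k1, j' = i' + 1, j = j' + 2k2, and i = j - 2 gives i = i + 3k1 + 2k2 - 1, i.e. 3k1 + 2k2 = 1. With k1, k2 >= 0 integers there is no solution: k1 = 0 forces 2k2 = 1, and k1 >= 1 already overshoots. Hence no execution can realize this cycle.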

Algorithm Evaluation
Speed: data-flow > IP4 > zero-cycle
Accuracy: zero-cycle > IP4 > data-flow
Applicability: data-flow > IP4 > zero-cycle
Ease of implementation: data-flow > zero-cycle > IP4
Possible implementation strategy:
- Use data-flow analysis for most cases
- Use zero-cycle detection where it applies, and for the "hot spots" of the program
- Use integer programming to deal with complex affine terms

Conclusion
Cycle detection is important for enforcing sequential consistency efficiently.
We present an O(n²) algorithm for handling scalar accesses in SPMD programs.
We present three polynomial-time algorithms to support array accesses.
We plan to experiment with these techniques in UPC, a global address space SPMD language:
- Communication scheduling (prefetching, pipelining)