Amir Kamil and Katherine Yelick


Concurrency Analysis for Parallel Programs with Textually Aligned Barriers
Amir Kamil and Katherine Yelick
Titanium Group, U.C. Berkeley
http://titanium.cs.berkeley.edu
October 20, 2005

Motivation/Goals
Many program analyses and optimizations benefit from knowing which expressions can run concurrently.
Develop a basic concurrency analysis for Titanium programs.
Refine the analysis to ignore infeasible paths.
Evaluate the analysis using two applications: race detection and memory model enforcement.

Barrier Alignment
Many parallel languages make no attempt to ensure that barriers line up.
Example code that is legal but will deadlock:
    if (Ti.thisProc() % 2 == 0)
        Ti.barrier(); // even ID threads
    else
        ; // odd ID threads

Structural Correctness
Aiken and Gay introduced structural correctness (POPL'98).
It ensures that every thread executes the same number of barriers.
Example of structurally correct code:
    if (Ti.thisProc() % 2 == 0)
        Ti.barrier(); // even ID threads
    else
        Ti.barrier(); // odd ID threads

Textual Barrier Alignment
Titanium has textual barriers: all threads must execute the same textual sequence of barriers.
This is a stronger guarantee than structural correctness. The following example is illegal:
    if (Ti.thisProc() % 2 == 0)
        Ti.barrier(); // even ID threads
    else
        Ti.barrier(); // odd ID threads
Single-valued expressions are used to enforce textual barriers.
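The difference between the two guarantees can be illustrated on recorded barrier traces. The sketch below is not Titanium's static check: it assumes each thread has logged the textual site of every barrier it executed, and the site labels ("then", "else") and function names are hypothetical.

```python
# Assumption: traces[i] is the list of barrier site labels executed by thread i.
# This inspects one recorded run; the compiler checks described in the slides
# are static and cover all runs.

def structurally_correct(traces):
    """Every thread executed the same number of barriers."""
    return len({len(trace) for trace in traces}) <= 1

def textually_aligned(traces):
    """Every thread executed the same textual sequence of barriers."""
    return len({tuple(trace) for trace in traces}) <= 1

# Even threads hit the barrier in the "then" branch, odd threads skip it.
deadlocking = [["then"], []]
# Even threads hit the "then" barrier, odd threads hit the "else" barrier.
structural_only = [["then"], ["else"]]
# All threads hit the same barrier site.
aligned = [["then"], ["then"]]

results = {
    "deadlocking": (structurally_correct(deadlocking), textually_aligned(deadlocking)),
    "structural_only": (structurally_correct(structural_only), textually_aligned(structural_only)),
    "aligned": (structurally_correct(aligned), textually_aligned(aligned)),
}
```

Only the last trace satisfies both properties, which matches the slides: the two-barrier conditional is structurally correct but not textually aligned.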

Single-Valued Expressions
A single-valued expression has the same value on all threads when evaluated.
Example: Ti.numProcs() > 1
All threads are guaranteed to take the same branch of a conditional guarded by a single-valued expression.
Only single-valued conditionals may have barriers.
Example of legal barrier use:
    if (Ti.numProcs() > 1)
        Ti.barrier(); // multiple threads
    else
        ; // only one thread

Concurrency Graph
Represents concurrency as a graph.
Nodes are program expressions.
If a path exists between two nodes, they can run concurrently.
Generated from the control flow graph.
Example code:
    x++;
    Ti.barrier();
    if (b) z--; else z++;
    if (c [single]) w--; else y++;
    foo();
Graph nodes in the figure: x++, b, z--, z++, c, w--, y++, call foo, method foo, ret foo.

Barriers
Barriers prevent code before and after from running concurrently.
Nodes for barrier expressions are removed from the concurrency graph.
Example code:
    x++;
    Ti.barrier();
    ...
Graph nodes in the figure: x++, barrier (removed), ...

Non-Single Conditionals
Branches of a non-single conditional can run concurrently.
Different threads can take different branches.
An edge is added in the concurrency graph between the branches.
Example code:
    if (b) z--; else z++;
    ...
Graph nodes in the figure: b, z--, z++, ...

Single Conditionals
Branches of a single conditional cannot run concurrently.
All threads take the same branch.
No edge is added in the concurrency graph between the branches.
Example code:
    if (c [single]) w--; else y++;
    ...
Graph nodes in the figure: c, w--, y++, ...

Method Calls
Method call nodes are split into a call node and a return node.
Edges are added from the call node to the target method's subgraph, and from the target method to the return node.
Example code:
    ...
    foo();
    ...
Graph nodes in the figure: call foo, method foo, ret foo.

Concurrency Algorithm
Two accesses can run concurrently if at least one is reachable from the other.
Concurrent accesses are computed by doing N depth first searches.
Example code:
    x++;
    Ti.barrier();
    if (b) z--; else z++;
    if (c [single]) w--; else y++;
    foo();
Graph nodes in the figure: x++, b, z--, z++, c, w--, y++, call foo, method foo, ret foo.
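The graph construction and the reachability test can be sketched on the slide's example. This is a minimal illustration in Python, not the Titanium compiler's implementation: the node names are strings copied from the figure, each conditional branch is assumed to be a single node, and all function names are hypothetical.

```python
def build_concurrency_graph(cfg_edges, barriers, non_single_branches):
    """Start from CFG edges, add cross edges between the branches of
    non-single conditionals, then remove barrier nodes."""
    graph = {}
    for src, dst in cfg_edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    # Assumption: each branch is one node, so the cross edge joins the two
    # branch nodes directly, in both directions.
    for left, right in non_single_branches:
        graph[left].add(right)
        graph[right].add(left)
    for barrier in barriers:
        graph.pop(barrier, None)
    for successors in graph.values():
        successors.difference_update(barriers)
    return graph

def reachable(graph, start):
    """Depth first search: all nodes reachable from start by one or more edges."""
    seen = set()
    stack = list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

def concurrent_pairs(graph):
    """One search per node; a pair is concurrent if either reaches the other."""
    pairs = set()
    for node in graph:
        for other in reachable(graph, node):
            if other != node:
                pairs.add(frozenset((node, other)))
    return pairs

cfg_edges = [
    ("x++", "barrier"), ("barrier", "b"),
    ("b", "z--"), ("b", "z++"), ("z--", "c"), ("z++", "c"),
    ("c", "w--"), ("c", "y++"), ("w--", "call foo"), ("y++", "call foo"),
    ("call foo", "method foo"), ("method foo", "ret foo"),
]
graph = build_concurrency_graph(cfg_edges, {"barrier"}, [("z--", "z++")])
pairs = concurrent_pairs(graph)
```

In this sketch `z--` and `z++` come out as concurrent through the cross edge, `w--` and `y++` do not, and `x++` is concurrent with nothing because the barrier node that connected it to the rest of the code was removed.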

Infeasible Paths (I)
The handling of method calls allows infeasible control flow paths.
A path exists into one call site and out of another.
Graph nodes in the figure: method bar (call foo, ret foo), method baz (call foo, ret foo), method foo.

Infeasible Paths (II)
Solution: label call and return edges with matching parentheses.
Only follow paths that correspond to balanced parentheses.
Graph nodes in the figure: method bar (call foo, ret foo), method baz (call foo, ret foo), method foo.
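The parenthesis matching can be sketched as a search that carries a stack of pending call sites. This is an illustrative Python version under stated assumptions, not the authors' algorithm: call sites are numbered 1 and 2, the program has no recursion (otherwise the stack could grow without bound), and a return taken with an empty stack is allowed because a search that starts inside a callee does not know its caller.

```python
def feasible_reachable(edges, start):
    """edges: (src, dst, kind, site) with kind in {"plain", "call", "ret"}.
    A return edge is followed only if it matches the most recent unmatched
    call edge on the path, or if no call is pending."""
    out = {}
    for src, dst, kind, site in edges:
        out.setdefault(src, []).append((dst, kind, site))
    seen = set()
    reached = set()
    work = [(start, ())]
    while work:
        node, stack = work.pop()
        if (node, stack) in seen:
            continue
        seen.add((node, stack))
        for dst, kind, site in out.get(node, []):
            if kind == "call":
                next_stack = stack + (site,)      # open parenthesis
            elif kind == "ret":
                if stack and stack[-1] != site:   # mismatched close: infeasible
                    continue
                next_stack = stack[:-1]           # matching close, or no pending call
            else:
                next_stack = stack
            reached.add(dst)
            work.append((dst, next_stack))
    return reached

def plain_reachable(edges, start):
    """Same search with the labels ignored, for comparison."""
    unlabeled = [(src, dst, "plain", None) for src, dst, _, _ in edges]
    return feasible_reachable(unlabeled, start)

# Hypothetical encoding of the figure: bar and baz both call foo.
edges = [
    ("bar.call foo", "method foo", "call", 1),
    ("baz.call foo", "method foo", "call", 2),
    ("method foo", "bar.ret foo", "ret", 1),
    ("method foo", "baz.ret foo", "ret", 2),
]
from_bar_plain = plain_reachable(edges, "bar.call foo")
from_bar_feasible = feasible_reachable(edges, "bar.call foo")
from_foo_feasible = feasible_reachable(edges, "method foo")
```

Ignoring the labels, the call in `bar` reaches the return in `baz`; with matching enabled it reaches only its own return site.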

Bypass Edges (I)
Reachability now depends on context.
It is inefficient to revisit a method in every context.
Solution: add edges that bypass method calls.
Graph nodes in the figure: call foo, method foo, ret foo, with a bypass edge from call foo to ret foo.

Bypass Edges (II)
Can only bypass method calls that can actually complete (without executing a barrier).
Iteratively compute the set of methods that can complete:
    CanComplete ← ∅
    Do (until a fixed point is reached):
        CanComplete ← CanComplete ∪ {all methods that can complete by only calling methods in CanComplete}
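The fixed-point iteration can be sketched as follows. The input encoding is an assumption made for illustration: each method is summarized as a list of control flow paths, and each path lists the barriers and callees it passes through. The names `BARRIER`, `can_complete`, and the sample methods are hypothetical.

```python
BARRIER = "<barrier>"  # hypothetical marker for a barrier on a path

def can_complete(methods):
    """methods: name -> list of paths; each path is a list of steps, where a
    step is BARRIER or the name of a called method.
    Returns the methods that have some path to their exit that executes no
    barrier and only calls methods already known to complete."""
    done = set()
    changed = True
    while changed:  # iterate until a fixed point is reached
        changed = False
        for name, paths in methods.items():
            if name in done:
                continue
            if any(all(step != BARRIER and step in done for step in path)
                   for path in paths):
                done.add(name)
                changed = True
    return done

methods = {
    "leaf": [[]],                 # no calls, no barriers
    "sync": [[BARRIER]],          # always executes a barrier
    "caller": [["leaf"]],         # completes because leaf completes
    "mixed": [["sync"], ["leaf"]],  # one path avoids the barrier
    "blocked": [["sync"]],        # every path goes through a barrier
    "loop": [["loop"]],           # only calls itself, never proven to complete
}
completing = can_complete(methods)
```

Starting from the empty set means a method is added only when there is positive evidence that it can finish, so the self-recursive `loop` is never added.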

Static Race Detection
Two heap accesses compose a data race if they can concurrently access the same location, and at least one is a write.
Alias and concurrency analysis are used to statically compute the set of possible data races.
The analyses are sound, so all real races are detected.
The goal is to minimize the number of false races detected.
Example: initially, x = 0.
    T1: set x = x + 1
    T2: set x = x - 1
Possible final values of x: -1, 0, 1
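The race condition above combines three tests, which can be sketched directly. This is a simplified illustration: the alias analysis is stood in for by sets of abstract location names, the concurrency analysis by a precomputed set of concurrent pairs, and every identifier is hypothetical.

```python
from itertools import combinations

def possible_races(accesses, concurrent):
    """accesses: name -> (set of abstract locations, is_write).
    concurrent: set of frozenset({name1, name2}) pairs that may overlap in time.
    Reports pairs that may alias, may run concurrently, and include a write."""
    races = set()
    for first, second in combinations(sorted(accesses), 2):
        locations1, write1 = accesses[first]
        locations2, write2 = accesses[second]
        may_alias = bool(locations1 & locations2)
        may_overlap = frozenset((first, second)) in concurrent
        if may_alias and may_overlap and (write1 or write2):
            races.add((first, second))
    return races

# The slide's example, split into reads and writes, plus an unrelated read of y.
accesses = {
    "T1.read_x": ({"x"}, False),
    "T1.write_x": ({"x"}, True),
    "T2.read_x": ({"x"}, False),
    "T2.write_x": ({"x"}, True),
    "T2.read_y": ({"y"}, False),
}
# Assumption: accesses on different threads are concurrent, same-thread ones are not.
concurrent = {
    frozenset((a, b))
    for a, b in combinations(accesses, 2)
    if a.split(".")[0] != b.split(".")[0]
}
races = possible_races(accesses, concurrent)
```

A more precise concurrency analysis shrinks `concurrent`, which is how it reduces the number of false races without missing real ones.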

Sequential Consistency
Definition: A parallel execution must behave as if it were an interleaving of the serial executions by individual threads, with each individual execution sequence preserving the program order [Lamport79].
Example: initially, flag = data = 0.
    T1: a [set data = 1]
        x [set flag = 1]
    T2: y [read flag]
        b [read data]
Legal execution: a x y b
Illegal execution: x y b a
The accesses a, x, y, b form a critical cycle.
Titanium and most other languages do not provide sequential consistency due to the (perceived) cost of enforcing it.
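The flag/data example can be checked by brute force. The sketch below enumerates every interleaving that preserves each thread's program order and records what T2 reads; the helper names are hypothetical, and the reordered run is supplied by hand to represent what a compiler or machine could produce without sequential consistency.

```python
def interleavings(t1, t2):
    """All merges of t1 and t2 that keep each thread's own order."""
    if not t1:
        yield list(t2)
        return
    if not t2:
        yield list(t1)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest

def run(order):
    """Execute the four operations in the given order.
    Returns (value read from flag, value read from data)."""
    memory = {"data": 0, "flag": 0}
    reads = {}
    for op in order:
        if op == "a":
            memory["data"] = 1
        elif op == "x":
            memory["flag"] = 1
        elif op == "y":
            reads["flag"] = memory["flag"]
        elif op == "b":
            reads["data"] = memory["data"]
    return reads["flag"], reads["data"]

T1 = ["a", "x"]  # set data = 1, then set flag = 1
T2 = ["y", "b"]  # read flag, then read data

sc_orders = list(interleavings(T1, T2))
sc_outcomes = {run(order) for order in sc_orders}
# The slide's illegal execution: x overtakes a, violating T1's program order.
reordered_outcome = run(["x", "y", "b", "a"])
```

No sequentially consistent interleaving lets T2 see the flag set and still read the old data; that outcome appears only in the reordered run.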

Enforcing Sequential Consistency
The compiler and architecture must not reorder memory accesses that are part of a critical cycle.
Fences are inserted into the program to enforce order.
Fences are potentially costly: they can prevent optimizations such as code motion and communication aggregation.
At runtime, a fence can cost a round-trip time (RTT) on a distributed machine.
The goal is to minimize the number of inserted fences.

Benchmarks
    Benchmark   Lines [1]   Description
    lu-fact     420         Dense linear algebra
    gsrb        1090        Computational fluid dynamics kernel
    spmv        1493        Sparse matrix-vector multiply
    pps         3673        Parallel Poisson equation solver
    gas         8841        Hyperbolic solver for gas dynamics
[1] Line counts do not include the reachable portion of the 37,000 line Titanium/Java 1.0 libraries.

Analysis Levels
We tested analyses of varying levels of precision.
All levels use alias analysis to distinguish memory locations.
    Analysis    Description
    base        All expressions assumed concurrent
    concur      Basic concurrency analysis
    feasible    Feasible paths concurrency analysis

Race Detection Results

Sequential Consistency Results

Conclusion
Textual barriers and single-valued expressions allow for simple but precise concurrency analysis.
Concurrency analysis is useful both for detecting races and for enforcing sequential consistency.
It is not sufficient for race detection: too many false positives.
It is good enough for sequential consistency to be provided at low cost (SC|05).
Ignoring infeasible paths significantly improves analysis results.