Making Sequential Consistency Practical in Titanium

Slides:

Advertisements

Similar presentations

Dataflow Analysis for Datarace-Free Programs (ESOP 11) Arnab De Joint work with Deepak DSouza and Rupesh Nasre Indian Institute of Science, Bangalore.

Advertisements

Bounded Model Checking of Concurrent Data Types on Relaxed Memory Models: A Case Study Sebastian Burckhardt Rajeev Alur Milo M. K. Martin Department of.

A Randomized Dynamic Program Analysis for Detecting Real Deadlocks Pallavi Joshi  Chang-Seo Park  Koushik Sen  Mayur Naik ‡  Par Lab, EECS, UC Berkeley‡

A Process Splitting Transformation for Kahn Process Networks Sjoerd Meijer.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

Compilation 2011 Static Analysis Johnni Winther Michael I. Schwartzbach Aarhus University.

Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.

Pointer Analysis – Part I Mayur Naik Intel Research, Berkeley CS294 Lecture March 17, 2009.

CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,

Stanford University CS243 Winter 2006 Wei Li 1 Register Allocation.

A Randomized Dynamic Program Analysis for Detecting Real Deadlocks Koushik Sen CS 265.

Enforcing Sequential Consistency in SPMD Programs with Arrays Wei Chen Arvind Krishnamurthy Katherine Yelick.

Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers Martin C. Rinard Pedro C. Diniz April 7 th, 2010 Youngjoon Jo.

PARALLEL PROGRAMMING with TRANSACTIONAL MEMORY Pratibha Kona.

Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.

Evaluation and Optimization of a Titanium Adaptive Mesh Refinement Amir Kamil Ben Schwarz Jimmy Su.

Java for High Performance Computing Jordi Garcia Almiñana 14 de Octubre de 1998 de la era post-internet.

Code Generation for Basic Blocks Introduction Mooly Sagiv html:// Chapter

Making Sequential Consistency Practical in Titanium Amir Kamil and Jimmy Su.

Graph Algorithms. Overview Graphs are very general data structures – data structures such as dense and sparse matrices, sets, multi-sets, etc. can be.

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

Rahul Sharma (Stanford) Michael Bauer (NVIDIA Research) Alex Aiken (Stanford) Verification of Producer-Consumer Synchronization in GPU Programs June 15,

Evaluation of Memory Consistency Models in Titanium.

Optimization software for apeNEXT Max Lukyanov,  apeNEXT : a VLIW architecture  Optimization basics  Software optimizer for apeNEXT  Current.

Department of Computer Science A Static Program Analyzer to increase software reuse Ramakrishnan Venkitaraman and Gopal Gupta.

Shared Memory Consistency Models: A Tutorial Sarita V. Adve Kouroush Ghrachorloo Western Research Laboratory September 1995.

1 Evaluating the Impact of Thread Escape Analysis on Memory Consistency Optimizations Chi-Leung Wong, Zehra Sura, Xing Fang, Kyungwoo Lee, Samuel P. Midkiff,

Synchronization Transformations for Parallel Computing Pedro Diniz and Martin Rinard Department of Computer Science University of California, Santa Barbara.

Static Program Analyses of DSP Software Systems Ramakrishnan Venkitaraman and Gopal Gupta.

Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.

Software Caching for UPC Wei Chen Jason Duell Jimmy Su Spring 2003.

Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.

October 11, 2007 © 2007 IBM Corporation Multidimensional Blocking in UPC Christopher Barton, Călin Caşcaval, George Almási, Rahul Garg, José Nelson Amaral,

Detecting and Eliminating Potential Violation of Sequential Consistency for concurrent C/C++ program Duan Yuelu, Feng Xiaobing, Pen-chung Yew.

Communication Optimizations in Titanium Programs Jimmy Su.

CPE 779 Parallel Computing - Spring Creating and Using Threads Based on Slides by Katherine Yelick

Single Instruction Multiple Threads

Auburn University

TensorFlow– A system for large-scale machine learning

Static Analysis of Object References in RMI-based Java Software

Support for Program Analysis as a First-Class Design Constraint in Legion Michael Bauer 02/22/17.

Optimizing Compilers Background

Memory Consistency Models

Compositional Pointer and Escape Analysis for Java Programs

Threads Cannot Be Implemented As a Library

Static Single Assignment

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.

3- Parallel Programming Models

Memory Consistency Models

Programming Models for SimMillennium

Optimization Code Optimization ©SoftMoore Consulting.

Martin Rinard Laboratory for Computer Science

Henk Corporaal TUEindhoven 2009

Amir Kamil and Katherine Yelick

L21: Putting it together: Tree Search (Ch. 6)

Threads and Memory Models Hal Perkins Autumn 2011

Machine-Independent Optimization

Performance Optimization for Embedded Software

Threads and Memory Models Hal Perkins Autumn 2009

Background and Motivation

Gary M. Zoppetti Gagan Agrawal

Test Case Test case Describes an input Description and an expected output Description. Test case ID Section 1: Before execution Section 2: After execution.

Jinquan Dai, Long Li, Bo Huang Intel China Software Center

Memory Consistency Models

Amir Kamil and Katherine Yelick

Type Systems For Distributed Data Sharing

Foundations and Definitions

Programming with Shared Memory Specifying parallelism

Gary M. Zoppetti Gagan Agrawal Rishi Kumar

Pointer analysis John Rollinson & Kaiyuan Li

Presentation transcript:

Making Sequential Consistency Practical in Titanium Amir Kamil, Jimmy Su, and Katherine Yelick Titanium Group http://titanium.cs.berkeley.edu U.C. Berkeley November 15, 2005

Reordering in Sequential Programs Two accesses can be reordered as long as the reordering does not violate a local dependency. Initially, flag = data = 0 data = 1 flag = 1 flag = 1 data = 1 In both orderings, the end result is {data == flag == 1}.

Reordering in Parallel Programs In parallel programs, a reordering can change the semantics even if no local dependencies exist. Initially, flag = data = 0 T1 T1 data = 1 flag = 1 T2 T2 f = flag f = flag d = data d = data flag = 1 data = 1 {f == 1, d == 0} is a possible result in the reordered code, but not in the original code.

Memory Models In relaxed consistency, reordering is allowed if no local dependencies or synchronization operations are violated In sequential consistency, a reordering is illegal if it can be observed by another thread Titanium, Java, UPC, and many other languages do not provide sequential consistency due to the (perceived) cost of enforcing it

Software and Hardware Reordering Compiler can reorder accesses as part of an optimization Example: copy propagation Logical fences inserted where reordering is illegal – optimizations respect these fences Hardware can reorder accesses Examples: out of order execution, remote accesses Fence instructions inserted into generated code – waits until all prior memory operations have completed Can cost a complete round trip time due to remote accesses

Conflicts Reordering of an access is observable only if it conflicts with some other access: The accesses can be to the same memory location At least one access is a write The accesses can run concurrently Fences only need to be inserted around accesses that conflict T1 T2 data = 1 f = flag flag = 1 d = data Conflicts

Sequential Consistency in Titanium Minimize number of fences – allow same optimizations as relaxed model Concurrency analysis identifies concurrent accesses Relies on Titanium’s textual barriers and single-valued expressions Alias analysis identifies accesses to the same location Relies on SPMD nature of Titanium

Barrier Alignment Many parallel languages make no attempt to ensure that barriers line up Example code that is legal but will deadlock: if (Ti.thisProc() % 2 == 0) Ti.barrier(); // even ID threads else ; // odd ID threads

Structural Correctness Aiken and Gay introduced structural correctness (POPL’98) Ensures that every thread executes the same number of barriers Example of structurally correct code: if (Ti.thisProc() % 2 == 0) Ti.barrier(); // even ID threads else Ti.barrier(); // odd ID threads

Textual Barrier Alignment Titanium has textual barriers: all threads must execute the same textual sequence of barriers Stronger guarantee than structural correctness – this example is illegal: if (Ti.thisProc() % 2 == 0) Ti.barrier(); // even ID threads else Ti.barrier(); // odd ID threads Single-valued expressions used to enforce textual barriers

Single-Valued Expressions A single-valued expression has the same value on all threads when evaluated Example: Ti.numProcs() > 1 All threads guaranteed to take the same branch of a conditional guarded by a single-valued expression Only single-valued conditionals may have barriers Example of legal barrier use: if (Ti.numProcs() > 1) Ti.barrier(); // multiple threads else ; // only one thread total

Concurrency Analysis (I) Graph generated from program as follows: Node added for each code segment between barriers and single-valued conditionals Edges added to represent control flow between segments 1 // code segment 1 if ([single]) // code segment 2 else // code segment 3 // code segment 4 Ti.barrier() // code segment 5 2 3 4 barrier 5

Concurrency Analysis (II) Two accesses can run concurrently if: They are in the same node, or One access’s node is reachable from the other access’s node without hitting a barrier Algorithm: remove barrier edges, do DFS 1 Concurrent Segments 1 2 3 4 5 X 2 3 4 barrier 5

Alias Analysis Allocation sites correspond to abstract locations (a-locs) All explicit and implict program variables have points-to sets A-locs are typed and have points-to sets for each field of the corresponding type Arrays have a single points-to set for all indices Analysis is flow,context-insensitive Experimental call-site sensitive version – doesn’t seem to help much

Thread-Aware Alias Analysis Two types of abstract locations: local and remote Local locations reside in local thread’s memory Remote locations reside on another thread Exploits SPMD property Results are a summary over all threads Independent of the number of threads at runtime

Alias Analysis: Allocation Creates new local abstract location Result of allocation must reside in local memory class Foo { Object z; } static void bar() { L1: Foo a = new Foo(); Foo b = broadcast a from 0; Foo c = a; L2: a.z = new Object(); A-locs 1, 2 Points-to Sets a b c

Alias Analysis: Assignment Copies source abstract locations into points-to set of target class Foo { Object z; } static void bar() { L1: Foo a = new Foo(); Foo b = broadcast a from 0; Foo c = a; L2: a.z = new Object(); A-locs 1, 2 Points-to Sets a 1 b c 1.z 2

Alias Analysis: Broadcast Produces both local and remote versions of source abstract location Remote a-loc points to remote analog of what local a-loc points to class Foo { Object z; } static void bar() { L1: Foo a = new Foo(); Foo b = broadcast a from 0; Foo c = a; L2: a.z = new Object(); A-locs 1, 2, 1r Points-to Sets a 1 b 1, 1r c 1.z 2 1r.z 2r

Alias [Across Threads]: Aliasing Results Two variables A and B may alias if: $ xÎpointsTo(A). xÎpointsTo(B) Two variables A and B may alias across threads if: R(x)ÎpointsTo(B), (where R(x) is the remote counterpart of x) Points-to Sets a 1 b 1, 1r c Alias [Across Threads]: a b, c [b] b a, c [a, c] c a, b [b]

Benchmarks Benchmark Lines1 Description pi 56 Monte Carlo integration demv 122 Dense matrix-vector multiply sample-sort 321 Parallel sort lu-fact 420 Dense linear algebra 3d-fft 614 Fourier transform gsrb 1090 Computational fluid dynamics kernel gsrb* 1099 Slightly modified version of gsrb spmv 1493 Sparse matrix-vector multiply gas 8841 Hyperbolic solver for gas dynamics 1 Line counts do not include the reachable portion of the 1 37,000 line Titanium/Java 1.0 libraries

Analysis Levels We tested analyses of varying levels of precision Description naïve All heap accesses sharing All shared accesses concur Concurrency analysis + type-based AA concur/saa Concurrency analysis + sequential AA concur/taa Concurrency analysis + thread-aware AA concur/taa/cycle Concurrency analysis + thread-aware AA + cycle detection

Static (Logical) Fences GOOD Percentages are for number of static fences reduced over naive

Dynamic (Executed) Fences GOOD Percentages are for number of dynamic fences reduced over naive

Dynamic Fences: gsrb gsrb relies on dynamic locality checks slight modification to remove checks (gsrb*) greatly increases precision of analysis GOOD

Two Example Optimizations Consider two optimizations for GAS languages Overlap bulk memory copies Communication aggregation for irregular array accesses (i.e. a[b[i]]) Both optimizations reorder accesses, so sequential consistency can inhibit them Both are addressing network performance, so potential payoff is high

Array Copies in Titanium Array copy operations are commonly used dst.copy(src); Content in the domain intersection of the two arrays is copied from dst to src Communication (possibly with packing) required if arrays reside on different threads Processor blocks until the operation is complete. src dst

Non-Blocking Array Copy Optimization Automatically convert blocking array copies into non-blocking array copies Push sync as far down the instruction stream as possible to allow overlap with computation Interprocedural: syncs can be moved across method boundaries Optimization reorders memory accesses – may be illegal under sequential consistency

Communication Aggregation on Irregular Array Accesses (Inspector/Executor) A loop containing indirect array accesses is split into phases Inspector examines loop and computes reference targets Required remote data gathered in a bulk operation Executor uses data to perform actual computation Can be illegal under sequential consistency schd = inspect(remote, b); tmp = get(remote, schd); for (...) { a[i] = tmp[i]; // other accesses } for (...) { a[i] = remote[b[i]]; // other accesses }

Relaxed + SC with 3 Analyses We tested performance using analyses of varying levels of precision Name Description relaxed Uses Titanium’s relaxed memory model naïve Uses sequential consistency, puts fences around every heap access sharing Uses sequential consistency, puts fences around every shared heap access concur/taa/cycle Uses sequential consistency, uses our most aggressive analysis

Dense Matrix Vector Multiply Non-blocking array copy optimization applied Strongest analysis is necessary: other SC implementations suffer relative to relaxed

Sparse Matrix Vector Multiply Inspector/executor optimization applied Strongest analysis is again necessary and sufficient

Conclusion Titanium’s textual barriers and single-valued expressions allow for simple but precise concurrency analysis Sequential consistency can eliminate nearly all fences for the benchmarks tested On two linear algebra kernels, sequential consistency can be provided with little or no performance cost with our analysis Analysis allows the same optimizations to be performed as in the relaxed memory model