Program Analysis and Synthesis of Parallel Systems Roman Manevich, Ben-Gurion University

Three papers 1. A Shape Analysis for Optimizing Parallel Graph Programs [POPL’11] 2. Elixir: a System for Synthesizing Concurrent Graph Programs [OOPSLA’12] 3. Parameterized Verification of Transactional Memories [PLDI’10]

What’s the connection? A Shape Analysis for Optimizing Parallel Graph Programs [POPL’11] Elixir: a System for Synthesizing Concurrent Graph Programs [OOPSLA’12] Parameterized Verification of Transactional Memories [PLDI’10] From analysis to language design Creates opportunities for more optimizations. Requires other analyses Similarities between abstract domains


A Shape Analysis for Optimizing Parallel Graph Programs Dimitrios Prountzos 1 Keshav Pingali 1,2 Roman Manevich 2 Kathryn S. McKinley 1 1: Department of Computer Science, The University of Texas at Austin 2: Institute for Computational Engineering and Sciences, The University of Texas at Austin

Motivation 6 Graph algorithms are ubiquitous Goal: Compiler analysis for optimization of parallel graph algorithms Computational biology Social Networks Computer Graphics

Minimum Spanning Tree Problem (figure: example graph with nodes a–g)

Boruvka’s Minimum Spanning Tree Algorithm Build MST bottom-up:
repeat {
  pick arbitrary node ‘a’
  merge with lightest neighbor ‘lt’
  add edge ‘a-lt’ to MST
} until graph is a single node
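The merge loop above can be written down as a small sequential sketch. This is illustrative only, not the Galois implementation from the talk: the adjacency-map representation and the function name are ours.

```python
# Sequential sketch of the Boruvka-style merge loop from the slide.
# Graph representation (assumed): dict mapping node -> dict of neighbor -> weight.

def boruvka_mst(graph):
    # Work on a copy; merge nodes until a single node remains.
    g = {u: dict(nbrs) for u, nbrs in graph.items()}
    mst_weight = 0
    while len(g) > 1:
        a = next(iter(g))                 # pick an arbitrary node 'a'
        lt = min(g[a], key=g[a].get)      # lightest neighbor 'lt'
        mst_weight += g[a][lt]            # add edge 'a-lt' to the MST
        # merge 'lt' into 'a', keeping the minimum weight on parallel edges
        for n, w in g[lt].items():
            if n == a:
                continue
            if n in g[a]:
                g[a][n] = min(g[a][n], w)
            else:
                g[a][n] = w
            g[n][a] = g[a][n]
            del g[n][lt]
        del g[a][lt]
        del g[lt]
    return mst_weight
```

Each merge contracts the lightest edge leaving a component, which is safe by the cut property, so the accumulated weight is the MST weight.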

Parallelism in Boruvka (same merge loop, applied to the example graph)

Non-conflicting iterations (merges whose neighborhoods are disjoint, e.g. the edges of weight 4 and 6)

Non-conflicting iterations (resulting graph: merged nodes a,c and f,g)

Conflicting iterations (two merges whose neighborhoods overlap)

Parallelism in Boruvka
Algorithm = repeated application of operator to graph
– Active node: node where computation is needed
– Activity: application of the operator to an active node
– Neighborhood: sub-graph read/written to perform the activity
– Unordered algorithms: active nodes can be processed in any order
Amorphous data-parallelism – parallel execution of activities, subject to neighborhood constraints
Neighborhoods are functions of runtime values – parallelism cannot be uncovered at compile time in general

Optimistic parallelization in Galois
Programming model – client code has sequential semantics – library of concurrent data structures
Parallel execution model – thread-level speculation (TLS) – activities executed speculatively
Conflict detection – each node/edge has an associated exclusive lock – graph operations acquire locks on read/written nodes/edges – lock owned by another thread ⇒ conflict ⇒ iteration rolled back – all locks released at the end
Two main overheads – locking – undo actions

Generic optimization structure Program Annotated Program Program Analyzer Program Transformer Optimized Program

Overheads (I): locking
Optimizations – redundant locking elimination – lock removal for iteration-private data – lock removal for lock domination
ACQ(P): set of definitely acquired locks per program point P
Given method call M at P: Locks(M) ⊆ ACQ(P) ⇒ redundant locking
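ACQ(P) is must-information, so it can be computed by a forward dataflow analysis that intersects locksets at control-flow joins. A minimal sketch, assuming a hypothetical CFG encoding (node id maps to its acquired locks, released locks, and successors; none of this is the paper's actual analysis):

```python
# Illustrative must-lockset analysis: ACQ(P) = locks definitely held at P.
# Forward dataflow over a CFG; joins take set intersection (must-information).

def compute_acq(cfg, entry):
    """cfg: node -> (acquired_locks, released_locks, successor_list)."""
    acq_in = {n: None for n in cfg}        # None = "not yet reached" (top)
    acq_in[entry] = frozenset()
    work = [entry]
    while work:
        n = work.pop()
        acquired, released, succs = cfg[n]
        out = (acq_in[n] | acquired) - released
        for s in succs:
            # intersect with what was known at s, unless s is fresh
            new = out if acq_in[s] is None else acq_in[s] & out
            if new != acq_in[s]:
                acq_in[s] = new
                work.append(s)
    return acq_in
```

With this, the redundant-locking check of the slide is `Locks(M) <= acq[P]` at the call site.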

Overheads (II): undo actions
Lockset grows → lockset stable → failsafe
foreach (Node a : wl) {
  Set aNghbrs = g.neighbors(a);
  Node lt = null;
  for (Node n : aNghbrs) {
    minW,lt = minWeightEdge((a,lt), (a,n));
  }
  g.removeEdge(a, lt);
  Set ltNghbrs = g.neighbors(lt);
  for (Node n : ltNghbrs) {
    Edge e = g.getEdge(lt, n);
    Weight w = g.getEdgeData(e);
    Edge an = g.getEdge(a, n);
    if (an != null) {
      Weight wan = g.getEdgeData(an);
      if (wan.compareTo(w) < 0) w = wan;
      g.setEdgeData(an, w);
    } else {
      g.addEdge(a, n, w);
    }
  }
  g.removeNode(lt);
  mst.add(minW);
  wl.add(a);
}
Program point P is failsafe if: ∀Q: Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)

GSet wl = new GSet();
wl.addAll(g.getNodes());
GBag mst = new GBag();
foreach (Node a : wl) {
  Set aNghbrs = g.neighbors(a);
  Node lt = null;
  for (Node n : aNghbrs) {
    minW,lt = minWeightEdge((a,lt), (a,n));
  }
  g.removeEdge(a, lt);
  Set ltNghbrs = g.neighbors(lt);
  for (Node n : ltNghbrs) {
    Edge e = g.getEdge(lt, n);
    Weight w = g.getEdgeData(e);
    Edge an = g.getEdge(a, n);
    if (an != null) {
      Weight wan = g.getEdgeData(an);
      if (wan.compareTo(w) < 0) w = wan;
      g.setEdgeData(an, w);
    } else {
      g.addEdge(a, n, w);
    }
  }
  g.removeNode(lt);
  mst.add(minW);
  wl.add(a);
}
Lockset analysis
Redundant locking: Locks(M) ⊆ ACQ(P)
Undo elimination: ∀Q: Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)
Need to compute ACQ(P): runtime overhead

The optimization, technically
Each graph method m(arg1, …, argk, flag) takes an optimization-level flag:
flag=LOCK – acquire locks
flag=UNDO – log undo (backup) data
flag=LOCK_UNDO – (default) acquire locks and log undo
flag=NONE – no extra work
Example: Edge e = g.getEdge(lt, n, NONE)

Analysis challenges
The usual suspects: – unbounded memory ⇒ undecidability – aliasing, destructive updates
Specific challenges: – complex ADTs: unstructured graphs – heap objects are locked – adapt the abstraction to ADTs
We use Abstract Interpretation [CC’77] – balance precision and realistic performance

Shape analysis overview (figure: Graph Spec and Set Spec ADT specifications; concrete ADT implementations in the Galois library, e.g. HashMap-Graph, Tree-based Set; Predicate Discovery feeds Shape Analysis, which turns Boruvka.java into Optimized Boruvka.java)

ADT specification Graph set set edges Set neighbors(Node n); } Graph Spec... Set S1 = g.neighbors(n);... Boruvka.java Abstract ADT state by virtual set + n.rev(src) + n.rev(src).dst + n.rev(dst) + nghbrs = n.rev(src).dst + n.rev(dst).src, ret = new Set >(cont=nghbrs) ) Assumption: Implementation satisfies Spec

Graph set set + n.rev(src) + n.rev(src).dst + n.rev(dst) + nghbrs = n.rev(src).dst + n.rev(dst).src, ret = new Set >(cont=nghbrs) ) Set neighbors(Node n); } Modeling ADTs c ab Graph Spec dst src dst src

Modeling ADTs c ab nodes edges Abstract State cont ret nghbrs Graph Spec dst src dst src Graph set set + n.rev(src) + n.rev(src).dst + n.rev(dst) + nghbrs = n.rev(src).dst + n.rev(dst).src, ret = new Set >(cont=nghbrs) ) Set neighbors(Node n); }

Abstraction scheme (figure: sets S1, S2 with cont edges; invariant (S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont))
Parameterized by a set of LockPaths: L(Path) ≡ ∀o. o ∈ Path ⇒ Locked(o)
– Tracks a subset of must-be-locked objects
Abstract domain elements have the form: Aliasing-configs × 2^LockPaths × …

Joining abstract states
Aliasing is crucial for precision
May-be-locked does not enable our optimizations
#Aliasing-configs: small constant (≤ 6)

lt GSet wl = new GSet (); wl.addAll(g.getNodes()); GBag mst = new GBag (); foreach (Node a : wl) { Set aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } g.removeNode(lt); mst.add(minW); wl.add(a); } Example invariant in Boruvka 28 The immediate neighbors of a and lt are locked a ( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧ L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) …..

Heuristics for finding LockPaths Hierarchy Summarization (HS) – x.( fld )* – Type hierarchy graph acyclic  bounded number of paths – Preflow-Push: L( S.cont) ∧ L(S.cont.nd) Nodes in set S and their data are locked Set S Node NodeData cont nd

Footprint graph heuristic Footprint Graphs (FG)[Calcagno et al. SAS’07] – All acyclic paths from arguments of ADT method to locked objects – x.( fld | rev(fld) )* – Delaunay Mesh Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src) – Nodes in set S and all of their immediate neighbors are locked Composition of HS, FG – Preflow-Push: L(a.rev(src).ed) 30 FG HS

Experimental evaluation
Implemented on top of TVLA – abstraction encoded with 3-valued shape analysis [SRW TOPLAS’02]
Evaluated on 4 Lonestar Java benchmarks
Inferred all available optimizations
# abstract states practically linear in program size
Benchmark – analysis time (sec):
- Boruvka MST: 6
- Preflow-Push Maxflow: 7
- Survey Propagation: 12
- Delaunay Mesh Refinement: 16

Impact of optimizations for 8 threads 8-core Intel 3.00 GHz

Note 1
How to map the abstract domain presented so far to TVLA?
– Example invariant: (x≠y ∧ L(y.nd)) ∨ (x=y ∧ L(x.nd))
– Unary abstraction predicate x(v) for pointer x
– Unary non-abstraction predicate L[x.p] for pointer x and path p
– Use partial join
– Resulting abstraction similar to the one shown

Note 2
How to come up with an abstraction for similar problems?
1. Start by constructing a manual proof (Hoare logic).
2. Examine the resulting invariants and generalize them into a language of formulas.
May need to be further specialized for a given program – an interesting problem (machine learning/refinement)
How to get sound transformers?

Note 3 How did we avoid considering all interleavings? Proved non-interference side theorem

Elixir : A System for Synthesizing Concurrent Graph Programs Dimitrios Prountzos 1 Roman Manevich 2 Keshav Pingali 1 1. The University of Texas at Austin 2. Ben-Gurion University of the Negev

Goal Allow programmer to easily implement correct and efficient parallel graph algorithms Graph algorithms are ubiquitous Social network analysis, Computer graphics, Machine learning, … Difficult to parallelize due to their irregular nature Best algorithm and implementation usually – Platform dependent – Input dependent Need to easily experiment with different solutions Focus: Fixed graph structure Only change labels on nodes and edges Each activity touches a fixed number of nodes

Example: Single-Source Shortest-Path
Problem formulation – compute the shortest distance from source node S to every other node
Many algorithms – Bellman-Ford (1957) – Dijkstra (1959) – Chaotic relaxation (Miranker 1969) – Delta-stepping (Meyer et al. 1998)
Common structure – each node has a label dist with the known shortest distance from S
Key operation – relax-edge(u,v): if dist(A) + W_AC < dist(C) then dist(C) = dist(A) + W_AC

Dijkstra’s algorithm
Scheduling of relaxations: use a priority queue of nodes, ordered by label dist
Iterate over nodes u in priority order
On each step: relax all neighbors v of u – apply relax-edge to all (u,v)

Chaotic relaxation
Scheduling of relaxations: use an unordered set of edges
Iterate over edges (u,v) in any order
On each step: apply relax-edge to edge (u,v)
Example worklist: (S,A) (B,C) (C,D) (C,E)
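Chaotic relaxation can be sketched as a worklist of edges popped in arbitrary order; any order converges to the same shortest distances because an edge is re-activated whenever its source's label improves. An illustrative sketch (the graph encoding is ours, not the talk's):

```python
# Chaotic-relaxation SSSP sketch: edges processed in arbitrary order.
import random

INF = float('inf')

def chaotic_sssp(nodes, edges, src):
    """edges: list of directed (u, v, w) triples."""
    dist = {n: INF for n in nodes}
    dist[src] = 0
    out = {n: [] for n in nodes}
    for u, v, w in edges:
        out[u].append((v, w))
    work = list(edges)
    while work:
        i = random.randrange(len(work))   # pick ANY pending edge - order is free
        u, v, w = work.pop(i)
        if dist[u] + w < dist[v]:         # relax-edge(u, v)
            dist[v] = dist[u] + w
            # v's label changed, so v's outgoing edges become active again
            work.extend((v, x, wx) for x, wx in out[v])
    return dist
```

Termination follows because labels only decrease and edges are re-added only when a label actually improves.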

Insights behind Elixir What should be done How it should be done Unordered/Ordered algorithms Operator Delta : activity Parallel Graph Algorithm Operators Schedule Order activity processing Identify new activities Identify new activities Static Schedule Static Schedule Dynamic Schedule “TAO of parallelism” PLDI 2011

Insights behind Elixir Parallel Graph Algorithm = Operators + Schedule (order activity processing, identify new activities; static schedule, dynamic schedule) Dijkstra-style Algorithm:
q = new PrQueue
q.enqueue(SRC)
while (!q.empty) {
  a = q.dequeue
  for each e = (a,b,w) {
    if dist(a) + w < dist(b) {
      dist(b) = dist(a) + w
      q.enqueue(b)
    }
  }
}

Contributions Language – Operators/Schedule separation – Allows exploration of implementation space Operator Delta Inference – Precise Delta required for efficient fixpoint computations Automatic Parallelization – Inserts synchronization to atomically execute operators – Avoids data-races / deadlocks – Specializes parallelization based on scheduling constraints Parallel Graph Algorithm Operators Schedule Order activity processing Identify new activities Static Schedule Dynamic Schedule Synchronization

SSSP in Elixir
Graph [ nodes(node : Node, dist : int)
        edges(src : Node, dst : Node, wt : int) ]
relax = [ nodes(node a, dist ad)
          nodes(node b, dist bd)
          edges(src a, dst b, wt w)
          bd > ad + w ] ➔ [ bd = ad + w ]
sssp = iterate relax ≫ schedule
Graph type; operator; fixpoint statement

Operators
Graph [ nodes(node : Node, dist : int)
        edges(src : Node, dst : Node, wt : int) ]
relax = [ nodes(node a, dist ad)
          nodes(node b, dist bd)
          edges(src a, dst b, wt w)
          bd > ad + w ] ➔ [ bd = ad + w ]
sssp = iterate relax ≫ schedule
Redex pattern: the nodes/edges clauses; guard: bd > ad + w; update: bd = ad + w
(figure: if bd > ad + w, b’s label changes from bd to ad + w)
Cautious by construction – easy to generalize
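The operator view above separates a guard (read the whole neighborhood first, making the operator cautious) from an update (write only after the guard held). A minimal sketch of that split, with function names of our choosing:

```python
# Sketch of the operator view: a redex (here an edge (a, b, w)) plus a
# guard and an update over node labels held in a dict.

def relax_guard(dist, a, b, w):
    # guard: reads the full neighborhood before any write (cautious operator)
    return dist[a] + w < dist[b]

def relax_update(dist, a, b, w):
    # update: only runs once the guard has been established
    dist[b] = dist[a] + w

def apply_operator(dist, edge):
    a, b, w = edge
    if relax_guard(dist, a, b, w):
        relax_update(dist, a, b, w)
        return True     # the redex fired; b's outgoing edges may need rework
    return False
```

The boolean result is exactly the signal the scheduler needs: a firing may create new active elements, a non-firing cannot.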

Fixpoint statement Graph [ nodes(node : Node, dist : int) edges(src : Node, dst : Node, wt : int) ] relax = [ nodes(node a, dist ad) nodes(node b, dist bd) edges(src a, dst b, wt w) bd > ad + w ] ➔ [ bd = ad + w ] sssp = iterate relax ≫ schedule Apply operator until fixpoint Scheduling expression 46

Scheduling examples
Graph [ nodes(node : Node, dist : int)
        edges(src : Node, dst : Node, wt : int) ]
relax = [ nodes(node a, dist ad)
          nodes(node b, dist bd)
          edges(src a, dst b, wt w)
          bd > ad + w ] ➔ [ bd = ad + w ]
sssp = iterate relax ≫ schedule
Locality-enhanced label-correcting: group b ≫ unroll 2 ≫ approx metric ad
Dijkstra-style: metric ad ≫ group b
q = new PrQueue
q.enqueue(SRC)
while (!q.empty) {
  a = q.dequeue
  for each e = (a,b,w) {
    if dist(a) + w < dist(b) {
      dist(b) = dist(a) + w
      q.enqueue(b)
    }
  }
}

Operator Delta Inference Parallel Graph Algorithm Operators Schedule Order activity processing Identify new activities Static Schedule Static Schedule Dynamic Schedule

Identifying the delta of an operator (figure: after relax fires on edge (a,b), which other redexes become active?)

Delta Inference Example (figure: relax1 on edge (a,b,w1); relax2 on edge (c,b,w2))
SMT Solver Query Program:
assume (da + w1 < db)
assume ¬(dc + w2 < db)
db_post = da + w1
assert ¬(dc + w2 < db_post)
(c,b) does not become active

Delta inference example – active
(figure: relax1 on edge (a,b,w1); relax2 on edge (b,c,w2))
SMT Solver Query Program:
assume (da + w1 < db)
assume ¬(db + w2 < dc)
db_post = da + w1
assert ¬(db_post + w2 < dc)
Apply relax on all outgoing edges (b,c) such that: dc > db + w2 and c ≠ a

Influence patterns (figure: the ways two redexes (a,b) and (c,d) can share nodes: b=c; a=c; a=d; b=d; b=c ∧ a=d; a=c ∧ b=d)

System architecture Elixir Galois/OpenMP Parallel Runtime Algorithm Spec Parallel Thread-Pool Graph Implementations Worklist Implementations Synthesize code Insert synchronization C++ Program

Experiments
Explored dimensions:
- Grouping – statically group multiple instances of the operator
- Unrolling – statically unroll operator applications by a factor K
- Dynamic scheduler – choose a different policy/implementation for the dynamic worklist
Compare against hand-written parallel implementations

SSSP results 24 core Intel 2 GHz USA Florida Road Network (1 M nodes, 2.7 M Edges) Group + Unroll improve locality Implementation Variant

Breadth-First search results Scale-Free Graph 1 M nodes, 8 M edges USA road network 24 M nodes, 58 M edges

Conclusion Graph algorithm = Operators + Schedule – Elixir language: imperative operators + declarative schedule Allows exploring the implementation space Automated reasoning for efficiently computing fixpoints Correct-by-construction parallelization Performance competitive with hand-parallelized code

Parameterized Verification of Software Transactional Memories Michael EmmiRupak Majumdar Roman Manevich

Motivation
Transactional memories [Herlihy ’93] – programmer writes code with coarse-grained atomic blocks – transaction manager takes care of conflicts, providing the illusion of sequential execution
Strict serializability – correctness criterion – formalizes the “illusion of sequential execution”
Parameterized verification – formal proof for a given implementation – for every number of threads – for every number of memory objects – for every number and length of transactions

STM terminology Statements: reads, writes, commit, abort Transaction: reads and writes of variables followed by commit (committing transaction) or abort (aborting transaction) Word: interleaved sequence of transactions of different threads Conflict: two statements conflict if – One is a read of variable X and other is a commit of a transaction that writes to X – Both are commits of transactions that write to X 60
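The conflict definition above can be written as a small predicate over statements and the write sets of their transactions. A sketch with a representation of our choosing (statements as tuples, write sets as Python sets):

```python
# Sketch of the slides' conflict definition.
# A statement is ('rd', var) or ('commit',); ws1/ws2 are the write sets
# of the transactions the two statements belong to.

def conflicts(s1, ws1, s2, ws2):
    # a read of X conflicts with a commit of a transaction that writes X
    if s1[0] == 'rd' and s2[0] == 'commit' and s1[1] in ws2:
        return True
    if s2[0] == 'rd' and s1[0] == 'commit' and s2[1] in ws1:
        return True
    # two commits conflict if both transactions write some common variable
    if s1[0] == 'commit' and s2[0] == 'commit' and ws1 & ws2:
        return True
    return False
```

A serialization is conflict-preserving exactly when every such conflicting pair keeps its relative order, which is the condition the next slide's example exercises.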

Safety property: strict serializability There is a serialization for the committing threads such that order of conflicts is preserved Order of non-overlapping transactions remains the same 61

Safety property: strict serializability
There is a serialization for the committing threads such that the order of conflicts is preserved
The order of non-overlapping transactions remains the same
Example word: (rd X t1), (rd Y t2), (wr X t2), (commit t2), (commit t1) – the rd X by t1 conflicts with t2’s commit, which writes X
=> Can be serialized to: (rd X t1), (commit t1), (rd Y t2), (wr X t2), (commit t2)

Main results First automatic verification of strict serializability for transactional memories – TPL, DSTM, TL2 New proof technique: – Template-based invisible invariant generation – Abstract checking algorithm to check inductive invariants 63 Challenging – requires reasoning on both universal and existential properties

Outline Strict serializability verification approach Automating the proof Experiments Conclusion Related work 64

Proof roadmap 1
Goal: prove that model M is strictly serializable
1. Given a strict-serializability reference model RSS, reduce checking strict serializability to checking that M refines RSS
2. Reduce refinement to checking safety – safety property SIM: whenever M can execute a statement, so can RSS – check SIM on the product system M × RSS

Proof roadmap 2 3.Model STMs M and RSS in first-order logic TM models use set data structures and typestate bits 4.Check safety by generating strong enough candidate inductive invariant and checking inductiveness – Use observations on structure of transactional memories – Use sound first-order reasoning 66

Reference strict serializability model
Guerraoui, Singh, Henzinger, Jobstmann [PLDI’08]
RSS: the most liberal specification of a strictly serializable system – allows the largest language of strictly-serializable executions
M is strictly serializable iff every word of M is also a word of RSS – Language(M) ⊆ Language(RSS) – M refines RSS

Modeling transactional memories
M(n,k) = (predicates, actions)
– Predicate: ranked relation symbol p(t), q(t,v), …
– Binary predicates are used for sets, so instead of rs(t,v) I’ll write v ∈ rs(t)
– Action: a(t,v) = if pre(a) then p’(v)=…, q’(u,v)=…
Universe = set of k thread individuals and n memory individuals
State S = a valuation of the predicates

Reference model (RSS) predicates Typestates: – RSS.finished(t), RSS.started(t), RSS.pending(t), RSS.invalid(t) Read/write sets – RSS.rs(t,v), RSS.ws(t,v) Prohibited read/write sets – RSS.prs(t,v), RSS.pws(t,v) Weak-predecessor – RSS.wp(t 1,t 2 ) 69

DSTM predicates Typestates: – DSTM.finished(t), DSTM.validated(t), DSTM.invalid(t), DSTM.aborted(t) Read/own sets – DSTM.rs(t,v), DSTM.os(t,v) 70

RSS commit(t) action
if ¬RSS.invalid(t) ∧ ¬RSS.wp(t,t) then
  ∀t1,t2. RSS.wp’(t1,t2) ⇔ t1 ≠ t ∧ t2 ≠ t ∧
    (RSS.wp(t1,t2) ∨ RSS.wp(t,t2) ∧ (RSS.wp(t1,t) ∨ ∃v. v ∈ RSS.ws(t1) ∧ v ∈ RSS.ws(t)))
…
(callouts: action precondition; executing thread; post-state vs. current-state predicates; write-write conflict)

DSTM commit(t) action
if DSTM.validated(t) then
  ∀t1. DSTM.validated’(t1) ⇔ t1 ≠ t ∧ DSTM.validated(t1) ∧ ¬∃v. v ∈ DSTM.rs(t1) ∧ v ∈ DSTM.os(t1)
…
(callout: read-own conflict)

FOTS states and execution v rd v t1 DSTM t1 t2 73 memory location individual thread individual state S1

FOTS states and execution v rd v t1 DSTM.rs predicate evaluation DSTM.rs(t1,v)=1 DSTM t1 t2 DSTM.started 74 predicate evaluation DSTM.started(t1)=1 state S2

FOTS states and execution v wr v t2 DSTM.rs DSTM t1 t2 DSTM.ws DSTM.started 75 state S3

Product system
The product of two systems: A × B
Predicates = A.predicates ∪ B.predicates
Actions: commit(t) = { if (A.pre ∧ B.pre) then … }, rd(t,v) = { if (A.pre ∧ B.pre) then … }, …
M refines RSS iff on every execution SIM holds: ∀action a. M.pre(a) ⇒ RSS.pre(a)

Checking DSTM refines RSS
The only precondition in RSS is for commit(t)
We need to check that SIM = ∀t. DSTM.validated(t) ⇒ ¬RSS.invalid(t) ∧ ¬RSS.wp(t,t) holds for DSTM × RSS in all reachable states
Proof rule: DSTM × RSS ⊨ SIM implies DSTM refines RSS
How do we check this safety property?

Checking safety by invisible invariants
How do we prove that a property ψ holds for all reachable states of system M? Pnueli, Ruah, Zuck [TACAS’01]
Come up with an inductive invariant φ that contains the reachable states of M and strengthens SIM:
I1: Initial ⇒ φ
I2: φ ∧ transition ⇒ φ’
I3: φ ⇒ ψ

Strict serializability proof rule
Proof roadmap: 1. Divine a candidate invariant φ 2. Prove I1, I2, I3
I1: Initial ⇒ φ
I2: φ ∧ transition ⇒ φ’
I3: φ ⇒ SIM
together give DSTM × RSS ⊨ SIM

Two challenges
Proof roadmap: 1. Divine a candidate invariant φ 2. Prove I1, I2, I3
But how do we find a candidate φ? infinite space of possibilities
Given a candidate φ, how do we check the proof rule? checking A ⇒ B is undecidable for first-order logic
I1: Initial ⇒ φ I2: φ ∧ transition ⇒ φ’ I3: φ ⇒ SIM

Our solution
Proof roadmap: 1. Divine a candidate invariant φ 2. Prove I1, I2, I3
But how do we find a candidate φ? use templates and iterative weakening
Given a candidate φ, how do we check the proof rule? use abstract checking
Utilize insights on transactional memory implementations

Invariant for DSTM × RSS
P1: ∀t,t1. RSS.wp(t,t1) ∧ ¬RSS.invalid(t) ∧ ¬RSS.pending(t) ⇒ ∃v. v ∈ RSS.ws(t1) ∧ v ∈ RSS.ws(t)
P2: ∀t,v. v ∈ RSS.rs(t) ∧ ¬DSTM.aborted(t) ⇒ v ∈ DSTM.rs(t)
P3: ∀t,v. v ∈ RSS.ws(t) ∧ ¬DSTM.aborted(t) ⇒ v ∈ DSTM.os(t)
P4: ∀t. DSTM.validated(t) ⇒ ¬RSS.wp(t,t)
P5: ∀t. DSTM.validated(t) ⇒ ¬RSS.invalid(t)
P6: ∀t. DSTM.validated(t) ⇒ ¬RSS.pending(t)
Inductive invariant involving only RSS – can be reused for all future proofs

Templates for DSTM × RSS
P1: ∀t,t1. φ1(t,t1) ∧ φ2(t) ∧ φ3(t) ⇒ ∃v. v ∈ φ4(t1) ∧ v ∈ φ5(t)
P2: ∀t,v. v ∈ φ1(t) ∧ φ2(t) ⇒ v ∈ φ3(t)
P3: ∀t,v. v ∈ φ1(t) ∧ φ2(t) ⇒ v ∈ φ3(t)
P4: ∀t. φ1(t) ⇒ φ2(t,t)
P5: ∀t. φ1(t) ⇒ φ2(t)
P6: ∀t. φ1(t) ⇒ φ2(t)

Templates for DSTM × RSS
∀t,t1. φ1(t,t1) ∧ φ2(t) ∧ φ3(t) ⇒ ∃v. v ∈ φ4(t1) ∧ v ∈ φ5(t)
∀t,v. v ∈ φ1(t) ∧ φ2(t) ⇒ v ∈ φ3(t)
∀t. φ1(t) ⇒ φ2(t,t)
Why templates? Makes the invariant separable; controls the complexity of invariants; adding templates enables refinement

Mining candidate invariants
Use a predefined set of templates to specify the structure of candidate invariants
– ∀t,v. φ1 ∧ φ2 ⇒ φ3, where φ1, φ2, φ3 are predicates of M or their negations
– Existential formulas capturing 1-level conflicts: ∃v. v ∈ φ4(t1) ∧ v ∈ φ5(t2)
Mine candidate invariants from a concrete execution

Iterative invariant weakening
Initial candidate invariant C0 = P1 ∧ P2 ∧ … ∧ Pk
Try to prove I2: φ ∧ transition ⇒ φ’
C1 = { Pi | C0 ∧ transition ⇒ Pi’, for Pi ∈ C0 }
If C1 = C0 then we have an inductive invariant
Otherwise, compute C2 = { Pi | C1 ∧ transition ⇒ Pi’, for Pi ∈ C1 }
Repeat until either – we find an inductive invariant, then check I3: Ck ⇒ SIM – or we reach top {}, the trivial inductive invariant
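This weakening loop (a Houdini-style fixpoint) can be sketched generically. The `preserved` callback stands in for the inductiveness check "conj(current) ∧ transition ⇒ p′", which the talk discharges by abstract checking; everything here is an illustrative skeleton, not the paper's implementation:

```python
# Houdini-style weakening sketch: repeatedly drop candidate conjuncts that
# are not preserved by some transition, until the rest is inductive.

def weaken_to_inductive(candidates, preserved):
    """preserved(inv, p): does conj(inv) ∧ transition imply p' ?"""
    current = set(candidates)
    while True:
        kept = {p for p in current if preserved(current, p)}
        if kept == current:
            return current      # inductive (possibly empty = trivial top)
        current = kept          # weakening: fewer conjuncts, larger set of states
```

Dropping a conjunct can invalidate others that relied on it, which is why the loop must re-check until a fixpoint is reached.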

Weakening illustration (figure: candidate conjuncts P1, P2, P3 shrinking to the inductive subset)

Abstract proof rule
I1: α(Initial) ⊑ φ
I2: abs_transition(α(φ)) ⊑ φ’
I3: γ(φ) ⊨ SIM
together give DSTM × RSS ⊨ SIM
(formula abstraction; abstract transformer; approximate entailment)

Conclusion Novel invariant generation using templates – extends applicability of invisible invariants Abstract domain and reasoning to check invariants without state explosion Proved strict-serializability for TPL, DSTM, TL2 – BLAST and TVLA failed 89

Verification results (columns: TPL, DSTM, TL2, RSS)
- Bound for invariant generation: (2,1)
- No. of cubes
- Bounded time
- Invariant mining time
- #templates: 28
- #candidates
- #proved
- #minimal: 485-
- avg. time per invariant
- avg. abs. size: k2.86k
- Total time: 3.5m / 54.3m / 129.3m / 30.9m

Insights on transactional memories
The transition relation is symmetric – thread identifiers are not used: p’(t,v) ⇔ … t1 … t2
The executing thread t interacts only with an arbitrary thread or a conflict-adjacent thread
Arbitrary thread: ∃v. v ∈ TL2.rs(t1) ∨ v ∈ TL2.ws(t1)
Conflict adjacent: ∃v. v ∈ DSTM.rs(t1) ∧ v ∈ DSTM.ws(t)

Conflict adjacency (figure: thread t’s read and write sets; t2 in read-write conflict, t3 in write-write conflict)
∃v. v ∈ rs(t) ∧ v ∈ DSTM.ws(t2)
∃v. v ∈ ws(t1) ∧ v ∈ DSTM.ws(t2)

Conflict adjacency (figure continued: a chain of conflict-adjacent threads t, t2, t3, …)

Related work
Reduction theorems – Guerraoui et al. [PLDI’08, CONCUR’08]
Manually-supplied invariants – fixed number of threads and variables + PVS – Cohen et al. [FMCAD’07]
Predicate abstraction + shape analysis – SLAM, BLAST, TVLA

Related work Invisible invariants Arons et al. [CAV’01] Pnueli et al. [TACAS’01] Templates – for arithmetic constraints Indexed predicate abstraction Shuvendu et al. [TOCL’07] Thread quantification Berdine et al. [CAV’08] Segalov et al. [APLAS’09] 95

Thank You! 96