
Correctness of parallel programs
Shaz Qadeer
Research in Software Engineering
CSEP 506, Spring 2011

Why is correctness important?
Software development is expensive
– testing and debugging are a significant part of the cost
Testing and debugging of parallel and concurrent programs is even more difficult and expensive

The Heisenbug problem
Sequential program: program execution is fully determined by the input
Parallel program: program execution may depend both on the input and on the communication among the parallel tasks
– these communication dependencies are invariably not reproducible, so a failure may vanish when the program is rerun or instrumented
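
To make the nondeterminism concrete, here is a minimal C# sketch (mine, not from the lecture): two unsynchronized tasks increment a shared counter, and the printed total varies from run to run with the interleaving.

    using System;
    using System.Threading.Tasks;

    class HeisenbugDemo
    {
        static int counter = 0;

        static void Main()
        {
            // Two unsynchronized tasks race on 'counter': each counter++
            // is a read, an increment, and a write, and the interleaving
            // of those steps changes from run to run.
            Task t0 = Task.Run(() => { for (int i = 0; i < 100000; i++) counter++; });
            Task t1 = Task.Run(() => { for (int i = 0; i < 100000; i++) counter++; });
            Task.WaitAll(t0, t1);

            // Expected 200000; lost updates make the printed value vary
            // between runs, and adding logging can make the bug vanish.
            Console.WriteLine(counter);
        }
    }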

Essence of reasoning about correctness
– Modeling
– Abstraction
– Specification
– Verification

What is a data race?
An execution in which two tasks simultaneously access the same memory location

    Parallel.For(0, filenames.Length, (int i) => {
        int len = filenames[i].Length;
        count[len]++;
    });

Example
filenames.Length == 2
filenames[0].Length == filenames[1].Length == 1

    Parallel.For(0, filenames.Length, (int i) => {
        int len = filenames[i].Length;
        count[len]++;
    });

Each iteration runs as a task, and count[len]++ expands into a read, an increment, and a write:

Task(0):
    int len, t;
    len = 1;
    t = count[1];
    t++;
    count[1] = t;

Task(1):
    int len, t;
    len = 1;
    t = count[1];
    t++;
    count[1] = t;
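
For reference, a self-contained, compilable version of this example (my wrapper around the slide's code; the two length-1 filenames are the slide's stated assumptions):

    using System;
    using System.Threading.Tasks;

    class RacyHistogram
    {
        static void Main()
        {
            // Matches the slide's setup: two filenames, both of length 1.
            string[] filenames = { "a", "b" };
            int[] count = new int[16];

            // Both iterations update count[1] without synchronization,
            // so the two read-increment-write sequences can interleave.
            Parallel.For(0, filenames.Length, (int i) => {
                int len = filenames[i].Length;
                count[len]++;
            });

            Console.WriteLine(count[1]);   // usually 2, but 1 is possible
        }
    }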

Two executions

[The slide shows two interleavings, E1 and E2, of the statements of Task(0) and Task(1) above; in each execution both tasks run len = 1; t = count[1]; t++; count[1] = t;, but in a different order.]

Which execution has a data race?

Example
count.Length == 2

    for (int i = 0; i < count.Length; i++) {
        count[i] = 0;
    }
    Parallel.For(0, count.Length, (int i) => {
        count[i]++;
    });
    for (int i = 0; i < count.Length; i++) {
        Console.WriteLine(count[i]);
    }

An execution

Parent:
    count[0] = 0;
    count[1] = 0;
    WriteLine(count[0]);
    WriteLine(count[1]);

Task(0):
    int t;
    t = count[0];
    t++;
    count[0] = t;

Task(1):
    int t;
    t = count[1];
    t++;
    count[1] = t;

What is a parallel execution?
Happens-before graph: directed acyclic graph over the set of events in the execution
Three kinds of events
– Access(t, a): task t accessed address a
– Fork(t, u): task t created task u
– Join(t, u): task t waited for task u
Three kinds of edges
– program order: edge from an event by a particular task to the subsequent event by the same task
– fork: edge from Fork(t, u) to the first event performed by task u
– join: edge from the last event performed by task u to Join(t, u)

A note on modeling executions
A real execution on a live system is complicated
– instruction execution on the CPU
– scheduling in the runtime
– hardware events, e.g., cache-coherence messages
Focus on what is relevant to the specification
– memory accesses, because data races are about conflicting memory accesses
– fork and join operations, because otherwise we cannot reason precisely

Example of happens-before graph

[The slide draws the graph for the execution above. Nodes, in program order per task:
Parent: Access(Parent, &count[0]), Access(Parent, &count[1]), Fork(Parent, Task(0)), Fork(Parent, Task(1)), Join(Parent, Task(0)), Join(Parent, Task(1)), Access(Parent, &count[0]), Access(Parent, &count[1])
Task(0): Access(Task(0), &count[0])
Task(1): Access(Task(1), &count[1])
Edges: program order within each task, fork edges into each task's first event, join edges out of each task's last event.]

Definition of data race
e < f in an execution iff there is a path in the happens-before graph from e to f
– read "e happens before f"
An execution has a data race on address x iff there are different events e and f such that
– both e and f access x
– not e < f
– not f < e
An execution is race-free iff there is no data race on any address x
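
As an illustration of this definition (my sketch, not the lecture's code; the Event record and HbGraph class are invented names), the happens-before relation can be represented as an explicit graph, e < f computed as reachability, and races found by checking every unordered pair of accesses to the same address:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // One event in an execution; Addr is non-null only for Access events.
    record Event(int Id, string Task, string Kind, string Addr);

    class HbGraph
    {
        readonly List<Event> events = new();
        readonly Dictionary<int, List<int>> succ = new();

        public Event Add(string task, string kind, string addr = null)
        {
            var e = new Event(events.Count, task, kind, addr);
            events.Add(e);
            succ[e.Id] = new List<int>();
            return e;
        }

        // Program-order, fork, and join edges are all added the same way.
        public void Edge(Event from, Event to) => succ[from.Id].Add(to.Id);

        // e < f iff f is reachable from e (callers pass distinct events).
        public bool HappensBefore(Event e, Event f)
        {
            var stack = new Stack<int>(); stack.Push(e.Id);
            var seen = new HashSet<int>();
            while (stack.Count > 0)
            {
                int cur = stack.Pop();
                if (cur == f.Id) return true;
                if (!seen.Add(cur)) continue;
                foreach (int n in succ[cur]) stack.Push(n);
            }
            return false;
        }

        // A race: two accesses to the same address, unordered either way.
        public IEnumerable<(Event, Event)> Races() =>
            from e in events where e.Kind == "Access"
            from f in events where f.Kind == "Access"
            where e.Id < f.Id && e.Addr == f.Addr
               && !HappensBefore(e, f) && !HappensBefore(f, e)
            select (e, f);
    }

Encoding the race-free execution above with its fork and join edges yields no races; dropping the join edges makes the Parent's final reads race with the tasks' writes.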

Racy execution

[Happens-before graph for the execution above: each task performs two Access(…, &count[1]) events (the read and the write of count[1]), connected only by program-order edges within each task. No edge connects the two tasks, so every access by Task(0) is unordered with every access by Task(1): a data race on &count[1].]

Race-free execution

[Happens-before graph for the second example: Parent performs Access(Parent, &count[0]), Access(Parent, &count[1]), Fork(Parent, Task(0)), Fork(Parent, Task(1)), Join(Parent, Task(0)), Join(Parent, Task(1)), Access(Parent, &count[0]), Access(Parent, &count[1]) in program order; Task(0) performs Access(Task(0), &count[0]); Task(1) performs Access(Task(1), &count[1]). The fork and join edges order every pair of accesses to the same address, so the execution is race-free.]

Vector-clock algorithm
Vector clock: an array of integers indexed by task identifiers
For each task t, maintain a vector clock C(t)
– each clock in C(t) initialized to 0
For each address a, maintain a vector clock X(a)
– each clock in X(a) initialized to 0

Vector-clock operations
Task t executes an event
– increment C(t)[t] by one
Task t forks task u
– initialize C(u) to C(t)
Task t joins with task u
– update C(t) to max(C(t), C(u))
Task t accesses address a
– data race unless X(a) < C(t)
– update X(a) to C(t)
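
Here is a minimal C# sketch of these operations (my illustration; the VectorClockDetector class and its method names are invented). Clocks are int arrays indexed by task id, and X(a) < C(t) is checked pointwise:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class VectorClockDetector
    {
        readonly int numTasks;
        readonly Dictionary<int, int[]> C = new();      // per-task clocks
        readonly Dictionary<string, int[]> X = new();   // per-address clocks

        public VectorClockDetector(int numTasks, int rootTask)
        {
            this.numTasks = numTasks;
            C[rootTask] = new int[numTasks];            // all zeros
        }

        void Tick(int t) => C[t][t]++;                  // every event increments C(t)[t]

        public void Fork(int t, int u)
        {
            Tick(t);
            C[u] = (int[])C[t].Clone();                 // initialize C(u) to C(t)
        }

        public void Join(int t, int u)
        {
            Tick(t);
            for (int i = 0; i < numTasks; i++)          // C(t) := max(C(t), C(u))
                C[t][i] = Math.Max(C[t][i], C[u][i]);
        }

        public void Access(int t, string addr)
        {
            Tick(t);
            var x = X.TryGetValue(addr, out var v) ? v : new int[numTasks];
            // data race unless X(addr) < C(t), compared pointwise
            if (!x.Zip(C[t], (a, b) => a <= b).All(ok => ok))
                Console.WriteLine($"data race on {addr} at task {t}");
            X[addr] = (int[])C[t].Clone();
        }
    }

Replaying the race-free execution through Fork/Join/Access calls reproduces the clocks in the table below; replaying the racy execution makes the Access check fire.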

Running the vector-clock algorithm on the race-free execution (clocks are indexed [Parent, Task(0), Task(1)]):

Event                         C(Parent)   C(Task(0))  C(Task(1))  X(&count[0])  X(&count[1])
(initially)                   [0, 0, 0]   -           -           [0, 0, 0]     [0, 0, 0]
Access(Parent, &count[0])     [1, 0, 0]                           [1, 0, 0]
Access(Parent, &count[1])     [2, 0, 0]                                         [2, 0, 0]
Fork(Parent, Task(0))         [3, 0, 0]   [3, 0, 0]
Fork(Parent, Task(1))         [4, 0, 0]               [4, 0, 0]
Access(Task(0), &count[0])                [3, 1, 0]               [3, 1, 0]
Access(Task(1), &count[1])                            [4, 0, 1]                 [4, 0, 1]
Access(Task(0), &count[0])                [3, 2, 0]               [3, 2, 0]
Access(Task(1), &count[1])                            [4, 0, 2]                 [4, 0, 2]
Join(Parent, Task(0))         [5, 2, 0]
Join(Parent, Task(1))         [6, 2, 2]
Access(Parent, &count[0])     [7, 2, 2]                           [7, 2, 2]
Access(Parent, &count[1])     [8, 2, 2]                                         [8, 2, 2]

Running the algorithm on the racy execution (clocks indexed [Task(0), Task(1)]):

Event                         C(Task(0))  C(Task(1))  X(&count[1])
(initially)                   [0, 0]      [0, 0]      [0, 0]
Access(Task(0), &count[1])    [1, 0]                  [1, 0]
Access(Task(1), &count[1])                [0, 1]      data race: X(&count[1]) = [1, 0] is not < C(Task(1)) = [0, 1]

The race is reported at Task(1)'s first access; the remaining two accesses race as well.

Correctness of vector-clock algorithm
For any execution and any two events e and f, with e followed by f in the execution
– e < f iff C(e) < C(f)
where C(e) is the vector clock of the task performing e, taken at event e, and vector clocks are compared pointwise

Synchronizing using locks

    Parallel.For(0, filenames.Length, (int i) => {
        int len = filenames[i].Length;
        lock (lock[len]) {
            count[len]++;
        }
    });
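
One caveat: the snippet above is slide pseudocode, since lock is a keyword in C# and cannot name an array. A runnable version (my sketch, with an invented locks array of monitor objects) looks like this:

    using System;
    using System.Threading.Tasks;

    class LockedHistogram
    {
        static void Main()
        {
            string[] filenames = { "a", "b" };
            int[] count = new int[16];

            // One monitor object per count entry; 'locks' is this sketch's
            // stand-in for the slide's 'lock' array.
            object[] locks = new object[count.Length];
            for (int i = 0; i < locks.Length; i++) locks[i] = new object();

            Parallel.For(0, filenames.Length, (int i) => {
                int len = filenames[i].Length;
                lock (locks[len]) {   // both tasks contend on locks[1]
                    count[len]++;
                }
            });

            Console.WriteLine(count[1]);   // always 2: the lock orders the updates
        }
    }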

filenames.Length == 2
filenames[0].Length == filenames[1].Length == 1

    count[1] = 0;
    Parallel.For(0, filenames.Length, (int i) => {
        int len = filenames[i].Length;
        lock (lock[len]) {
            count[len]++;
        }
    });
    Console.WriteLine(count[1]);

Parent:
    count[1] = 0;
    Console.WriteLine(count[1]);

Task(0):
    int len, t;
    len = 1;
    acquire(lock[1]);
    t = count[1];
    t++;
    count[1] = t;
    release(lock[1]);

Task(1):
    int len, t;
    len = 1;
    acquire(lock[1]);
    t = count[1];
    t++;
    count[1] = t;
    release(lock[1]);

Execution I

[One interleaving of Parent, Task(0), and Task(1) above: Parent initializes count[1], one task runs its acquire/update/release critical section before the other, and Parent prints count[1] at the end.]

Execution II

[The other interleaving: the two tasks acquire lock[1] in the opposite order; the lock ensures count[1] ends up 2 either way.]

What is a parallel execution?
Happens-before graph: directed acyclic graph over the set of events in an execution
Five kinds of events
– Access(t, a): task t accessed address a
– Fork(t, u): task t created task u
– Join(t, u): task t waited for task u
– Acquire(t, l): task t acquired lock l
– Release(t, l): task t released lock l
Four kinds of edges
– program order: edge from an event by a particular task to the subsequent event by the same task
– fork: edge from Fork(t, u) to the first event performed by task u
– join: edge from the last event performed by task u to Join(t, u)
– release-acquire: edge from Release(t, l) to the subsequent Acquire(u, l)

Happens-before graphs for Executions I and II

[Both graphs contain the events Access(Parent, &count[1]) (initialization and final read), Fork(Parent, Task(0)), Fork(Parent, Task(1)), Join(Parent, Task(0)), Join(Parent, Task(1)); Acquire(Task(0), &lock[1]), Access(Task(0), &count[1]), Release(Task(0), &lock[1]); and Acquire(Task(1), &lock[1]), Access(Task(1), &count[1]), Release(Task(1), &lock[1]), with program-order, fork, and join edges. The two graphs differ only in the release-acquire edge, which runs from one task's Release(&lock[1]) to the other task's Acquire(&lock[1]), ordering the two critical sections one way or the other.]

Vector-clock algorithm extended
Vector clock: an array of integers indexed by the set of tasks
For each task t, maintain a vector clock C(t)
– each clock in C(t) initialized to 0
For each address a, maintain a vector clock X(a)
– each clock in X(a) initialized to 0
For each lock l, maintain a vector clock S(l)
– each clock in S(l) initialized to 0

Vector-clock operations extended
Task t executes an event
– increment C(t)[t] by one
Task t forks task u
– initialize C(u) to C(t)
Task t joins with task u
– update C(t) to max(C(t), C(u))
Task t accesses address a
– data race unless X(a) < C(t)
– update X(a) to C(t)
Task t acquires lock l
– update C(t) to max(C(t), S(l))
Task t releases lock l
– update S(l) to C(t)
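
Extending the earlier vector-clock sketch with lock clocks S(l) takes only two more operations (again my illustration; the class and method names are invented):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class VectorClockDetectorWithLocks
    {
        readonly int n;                                  // number of tasks
        readonly Dictionary<int, int[]> C = new();       // task clocks
        readonly Dictionary<string, int[]> X = new();    // address clocks
        readonly Dictionary<string, int[]> S = new();    // lock clocks

        public VectorClockDetectorWithLocks(int numTasks, int rootTask)
        {
            n = numTasks;
            C[rootTask] = new int[n];
        }

        void Tick(int t) => C[t][t]++;
        int[] Get(Dictionary<string, int[]> m, string k) =>
            m.TryGetValue(k, out var v) ? v : (m[k] = new int[n]);
        void MaxInto(int[] dst, int[] src)
        {
            for (int i = 0; i < n; i++) dst[i] = Math.Max(dst[i], src[i]);
        }

        public void Fork(int t, int u) { Tick(t); C[u] = (int[])C[t].Clone(); }
        public void Join(int t, int u) { Tick(t); MaxInto(C[t], C[u]); }

        public void Access(int t, string a)
        {
            Tick(t);
            var x = Get(X, a);
            if (!x.Zip(C[t], (p, q) => p <= q).All(b => b))   // X(a) < C(t)?
                Console.WriteLine($"data race on {a} at task {t}");
            X[a] = (int[])C[t].Clone();
        }

        // Acquire: C(t) := max(C(t), S(l)); Release: S(l) := C(t).
        public void Acquire(int t, string l) { Tick(t); MaxInto(C[t], Get(S, l)); }
        public void Release(int t, string l) { Tick(t); S[l] = (int[])C[t].Clone(); }
    }

Replaying the locked execution through these calls reproduces the table that follows, with every X(a) < C(t) check passing.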

Running the extended algorithm on the locked execution (count[1]++ is a read access followed by a write access; clocks are indexed [Parent, Task(0), Task(1)]):

Event                          C(Parent)  C(Task(0))  C(Task(1))  S(&lock[1])  X(&count[1])
(initially)                    [0, 0, 0]  -           -           [0, 0, 0]    [0, 0, 0]
Access(Parent, &count[1])      [1, 0, 0]                                       [1, 0, 0]
Fork(Parent, Task(0))          [2, 0, 0]  [2, 0, 0]
Fork(Parent, Task(1))          [3, 0, 0]              [3, 0, 0]
Acquire(Task(0), &lock[1])                [2, 1, 0]
Access(Task(0), &count[1])                [2, 2, 0]                            [2, 2, 0]
Access(Task(0), &count[1])                [2, 3, 0]                            [2, 3, 0]
Release(Task(0), &lock[1])                [2, 4, 0]               [2, 4, 0]
Acquire(Task(1), &lock[1])                            [3, 4, 1]
Access(Task(1), &count[1])                            [3, 4, 2]               [3, 4, 2]
Access(Task(1), &count[1])                            [3, 4, 3]               [3, 4, 3]
Release(Task(1), &lock[1])                            [3, 4, 4]  [3, 4, 4]
Join(Parent, Task(0))          [4, 4, 0]
Join(Parent, Task(1))          [5, 4, 4]
Access(Parent, &count[1])      [6, 4, 4]                                      [6, 4, 4]

Every X(&count[1]) < C(t) check passes, so no race is reported.