Mergesort example: Merge as we return from recursive calls


Mergesort example: Merge as we return from recursive calls. [Figure: the array 8 2 9 4 5 3 1 6 is divided down to 1-element pieces, then merged on the way back up to 1 2 3 4 5 6 8 9.] We need another array in which to do each merging step; merge results into there, then copy back to the original array.
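The merge step described above can be sketched in Java. The class and method names are illustrative, not from the slides; only the idea (merge into a scratch array, then copy back) comes from the slide.

```java
// Sketch of mergesort as described above: divide, then merge as we
// return from the recursive calls, using a scratch array.
class MergeSortExample {
    static void mergeSort(int[] a) {
        mergeSort(a, new int[a.length], 0, a.length);
    }

    // Sort a[lo, hi) using aux as the scratch array.
    private static void mergeSort(int[] a, int[] aux, int lo, int hi) {
        if (hi - lo <= 1) return;          // 1 element: already sorted
        int mid = (lo + hi) / 2;
        mergeSort(a, aux, lo, mid);        // divide
        mergeSort(a, aux, mid, hi);
        merge(a, aux, lo, mid, hi);        // merge as we return
    }

    private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {    // merge into the scratch array...
            if (i < mid && (j >= hi || a[i] <= a[j])) aux[k] = a[i++];
            else aux[k] = a[j++];
        }
        System.arraycopy(aux, lo, a, lo, hi - lo);  // ...then copy back
    }
}
```

Sorting the slide's example array 8 2 9 4 5 3 1 6 produces 1 2 3 4 5 6 8 9.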

Dijkstra's Algorithm Overview. Given a weighted graph and a vertex in the graph (call it A), find the shortest path from A to each other vertex. The cost of a path is defined as the sum of the weights of its edges; negative edges are not allowed.

The algorithm: Create a table like this:

vertex  known?  cost  path
A               0
B               ??
C
D

Init A's cost to 0, others infinity (or just '??'). While there are unknown vertices: select the unknown vertex w/ lowest cost (A initially), mark it as known, and update the cost and path to all unknown vertices adjacent to that vertex.
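A minimal sketch of the table-based algorithm above on an adjacency-list graph. The slide's graph is not in the transcript, so the example graph used below is made up; the 'path' column is omitted and only costs are computed.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of Dijkstra's algorithm: repeatedly pick the unknown vertex
// with lowest cost, mark it known, and relax its unknown neighbors.
class DijkstraExample {
    static final int INF = Integer.MAX_VALUE;  // stands in for '??'

    // adj[u] holds {v, weight} pairs; returns the cost from source to each vertex.
    static int[] shortestPaths(List<int[]>[] adj, int source) {
        int n = adj.length;
        int[] cost = new int[n];
        boolean[] known = new boolean[n];
        Arrays.fill(cost, INF);
        cost[source] = 0;                      // init the source's cost to 0
        for (int iter = 0; iter < n; iter++) {
            int u = -1;                        // unknown vertex w/ lowest cost
            for (int v = 0; v < n; v++)
                if (!known[v] && cost[v] != INF && (u == -1 || cost[v] < cost[u]))
                    u = v;
            if (u == -1) break;                // remaining vertices unreachable
            known[u] = true;                   // mark it as known
            for (int[] e : adj[u])             // update unknown adjacent vertices
                if (!known[e[0]] && cost[u] + e[1] < cost[e[0]])
                    cost[e[0]] = cost[u] + e[1];
        }
        return cost;
    }
}
```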

Parallelism Overview. We say it takes time TP to complete a task with P processors. Adding together an array of n elements would take O(n) time when done sequentially (that is, P = 1); this is called the work, T1. If we have 'enough' processors, we can do it much faster, in O(log n) time; this is called the span, T∞.

Considering Parallel Run-time. Our fork and join frequently look like this: [figure: a tree dividing down to the base cases, then a mirrored tree combining results back up]. Each node takes O(1) time, even the base cases, as they are at the cut-off. Sequentially, we can do this in O(n) time: O(1) for each node, ~3n nodes if there were no cut-off (a linear number on the base-case row, halved each row up/down). Carrying this out in (perfect) parallel will take the time of the longest branch: ~2 log n, if we halve each time.
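One concrete instance of this fork/join shape is summing an array with Java's ForkJoin framework: fork one half, compute the other, and combine on the way back up. The class name and cut-off value are illustrative.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join sum: divide until the cut-off, sum the halves in
// parallel, combine results as the recursion returns.
class SumTask extends RecursiveTask<Long> {
    static final int CUTOFF = 1000;        // base cases run sequentially below this
    final int[] arr; final int lo, hi;
    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }

    protected Long compute() {
        if (hi - lo <= CUTOFF) {           // base case: O(cut-off) sequential work
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();                       // run the left half in parallel
        long r = right.compute();          // do the right half ourselves
        return left.join() + r;            // combine results
    }

    static long sum(int[] arr) {
        return ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));
    }
}
```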

Some Parallelism Definitions. Speed-up on P processors: T1 / TP. We often assume perfect linear speed-up, that is, T1 / TP = P; w/ 2x processors, it's twice as fast. Perfect linear speed-up is usually our goal, but hard to get in practice. Parallelism is the maximum possible speed-up: T1 / T∞. At some point, adding processors won't help; what that point is depends on the span.

The ForkJoin Framework: Expected Performance. If you write your program well, you can get the following expected performance: TP ≤ (T1 / P) + O(T∞). The T1 / P term is the overall work split between P processors (P = 4? each processor takes 1/4 of the total work). The O(T∞) term is for merging results: even if P = ∞, we still need O(T∞) time to merge results. What does this mean? We get decent benefit from adding more processors, effectively linear speed-up at first (in expectation); but with a large # of processors, we're still bounded by T∞, since that term becomes dominant.

Amdahl's Law. Let the work (time to run on 1 processor) be 1 unit of time, and let S be the portion of the execution that cannot be parallelized. Then:

T1 = S + (1 - S) = 1
TP = S + (1 - S)/P

Amdahl's Law: the overall speedup with P processors is T1 / TP = 1 / (S + (1 - S)/P), and the parallelism (infinite processors) is T1 / T∞ = 1 / S.
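Plugging numbers into the formula makes the limit concrete; this small sketch (the values of S and P below are made up for illustration) shows that even a huge processor count cannot push the speedup past 1/S.

```java
// Amdahl's Law as a function: speedup on P processors when a
// fraction s of the execution cannot be parallelized.
class Amdahl {
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }
}
```

For example, with S = 0.5 two processors give a speedup of only 4/3, and with S = 0.25 the speedup approaches but never reaches 1/S = 4 no matter how many processors are added.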

Parallel Prefix Sum. Given an array of numbers, compute an array of their running sums in O(log n) span. Requires 2 passes (each a parallel traversal): the first gathers information; the second figures out the output.

Parallel Prefix Sum (continued). [Figure: a tree built over the input array; each node covers a range (0,8), (0,4), (4,8), (0,2), (2,4), (4,6), (6,8), down to single-element ranges, and stores two fields, 'sum' and 'fromleft'.] Passes: 1. Compute 'sum'. 2. Compute 'fromleft'.
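A sequential Java sketch of the two passes over the tree pictured above; a real implementation would fork the recursive calls in each pass. The Node class mirrors the 'range / sum / fromleft' boxes in the figure; names are illustrative.

```java
// Two-pass prefix sum: pass 1 builds each range's 'sum' bottom-up,
// pass 2 pushes 'fromleft' (sum of everything to the node's left)
// top-down and writes the output at the leaves.
class PrefixSum {
    static class Node {
        int lo, hi, sum;
        Node left, right;
    }

    static int[] prefixSums(int[] input) {
        int[] output = new int[input.length];
        Node root = buildSums(input, 0, input.length);  // pass 1: compute 'sum'
        fillFromLeft(root, input, output, 0);           // pass 2: compute 'fromleft'
        return output;
    }

    static Node buildSums(int[] in, int lo, int hi) {
        Node n = new Node(); n.lo = lo; n.hi = hi;
        if (hi - lo == 1) { n.sum = in[lo]; return n; } // leaf: single element
        int mid = (lo + hi) / 2;
        n.left = buildSums(in, lo, mid);                // forked in the parallel version
        n.right = buildSums(in, mid, hi);
        n.sum = n.left.sum + n.right.sum;
        return n;
    }

    static void fillFromLeft(Node n, int[] in, int[] out, int fromLeft) {
        if (n.hi - n.lo == 1) { out[n.lo] = fromLeft + in[n.lo]; return; }
        fillFromLeft(n.left, in, out, fromLeft);
        fillFromLeft(n.right, in, out, fromLeft + n.left.sum);
    }
}
```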

Parallel Quicksort. 2 optimizations:

1. Do the two recursive calls in parallel. Now the recurrence takes the form O(n) + 1T(n/2), so O(n) span.
2. Parallelize the partitioning step. Partitioning is normally O(n) time. Recall that we can use parallel prefix sum to 'filter' with O(log n) span; partitioning can be done with 2 filters, so O(log n) span for each partitioning step.

These two parallel optimizations bring parallel quicksort to a span of O(log² n).
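The 'filter' used in optimization 2 decomposes into three sub-steps, each with O(log n) span in the parallel version: a map to mark matching elements, a prefix sum over the marks to compute output positions, and a scatter. This sketch writes the three steps sequentially for clarity; the class name is illustrative.

```java
// Filter (pack) for one side of a partition: keep elements <= pivot,
// in order. Each loop corresponds to a parallel step with O(log n) span.
class ParallelPartitionSketch {
    static int[] filterLessEq(int[] in, int pivot) {
        int[] bits = new int[in.length];
        for (int i = 0; i < in.length; i++)        // step 1: parallel map
            bits[i] = in[i] <= pivot ? 1 : 0;
        int[] pos = new int[in.length];            // step 2: parallel prefix sum
        int run = 0;
        for (int i = 0; i < in.length; i++) { run += bits[i]; pos[i] = run; }
        int[] out = new int[run];                  // run = number of kept elements
        for (int i = 0; i < in.length; i++)        // step 3: parallel scatter
            if (bits[i] == 1) out[pos[i] - 1] = in[i];
        return out;
    }
}
```

A second filter with the opposite comparison produces the other side of the partition.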

Race Conditions. A race condition occurs when the computation result depends on scheduling (how threads are interleaved). If T1 and T2 happen to get scheduled in a certain way, things go wrong. We, as programmers, cannot control the scheduling of threads; as a result, we need to write programs that work independent of scheduling. Race conditions are bugs that exist only due to concurrency; there is no interleaved scheduling with 1 thread. Typically, the problem is that some intermediate state can be seen by another thread, corrupting that thread's work. Consider a 'partial' insert in a linked list: say a new node has been added to the end, but 'back' and 'count' haven't been updated yet.
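A minimal illustration of avoiding exposed intermediate state, using a made-up SafeCounter class rather than the slide's linked list: by making the read-modify-write atomic, no other thread can ever observe the counter mid-update.

```java
// Guarding shared state with the object's monitor: increment's
// read-modify-write happens atomically, so no interleaving can
// observe a partially completed update.
class SafeCounter {
    private int count = 0;
    synchronized void increment() { count++; }   // atomic read-modify-write
    synchronized int get() { return count; }
}
```

Without the synchronized keyword, two threads could both read the same old value of count and one increment would be lost; which happens depends entirely on scheduling.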

Data Races. A data race is a specific type of race condition that can happen in 2 ways: two different threads can potentially write a variable at the same time, or one thread can potentially write a variable while another reads it. Simultaneous reads are fine: not a data race, and nothing bad would happen. 'Potentially' is important: we say the code itself has a data race; it is independent of an actual execution. Data races are bad, but we can still have a race condition, and bad behavior, when no data races are present.

Readers/writer locks. A new synchronization ADT: the readers/writer lock. Idea: allow any number of readers OR one writer. This allows more concurrent access (multiple readers). A lock's states fall into three categories: "not held"; "held for writing" by one thread; "held for reading" by one or more threads. Invariant: 0 ≤ writers ≤ 1 && 0 ≤ readers && writers*readers == 0.

new: make a new lock, initially "not held"
acquire_write: block if currently "held for reading" or "held for writing", else make "held for writing"
release_write: make "not held"
acquire_read: block if currently "held for writing", else make/keep "held for reading" and increment the readers count
release_read: decrement the readers count; if 0, make "not held"
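Java provides this ADT as java.util.concurrent.locks.ReentrantReadWriteLock: the read lock may be held by many threads at once, the write lock by only one, and never both. A small usage sketch (the SharedValue class itself is made up):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Guarding a shared value with a readers/writer lock: many
// concurrent readers, or exactly one writer.
class SharedValue {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int value = 0;

    int read() {
        lock.readLock().lock();            // blocks only if "held for writing"
        try { return value; }
        finally { lock.readLock().unlock(); }
    }

    void write(int v) {
        lock.writeLock().lock();           // blocks if held for reading or writing
        try { value = v; }
        finally { lock.writeLock().unlock(); }
    }
}
```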

Deadlock. As illustrated by the 'Dining Philosophers' problem, a deadlock occurs when there are threads T1, ..., Tn such that each is waiting for a lock held by the next, and Tn is waiting for a resource held by T1. In other words, there is a cycle of waiting.

class BankAccount {
  ...
  synchronized void withdraw(int amt) {...}
  synchronized void deposit(int amt) {...}
  synchronized void transferTo(int amt, BankAccount a) {
    this.withdraw(amt);
    a.deposit(amt);
  }
}

Consider simultaneous transfers from account x to account y, and y to x.
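One standard fix for this cycle is to impose a global order on lock acquisition, e.g. always lock the account with the smaller id first, so no cycle of waiting can form. This sketch adds a hypothetical id field that the slide's BankAccount does not have.

```java
// Deadlock avoidance via lock ordering: both of a pair of
// simultaneous transfers (x->y and y->x) acquire the two account
// monitors in the same (id) order, so neither can hold one lock
// while waiting for the other.
class OrderedAccount {
    final int id;                          // hypothetical global ordering key
    private int balance;
    OrderedAccount(int id, int balance) { this.id = id; this.balance = balance; }

    void withdraw(int amt) { balance -= amt; }
    void deposit(int amt)  { balance += amt; }
    int balance()          { return balance; }

    void transferTo(int amt, OrderedAccount a) {
        OrderedAccount first  = this.id < a.id ? this : a;
        OrderedAccount second = this.id < a.id ? a : this;
        synchronized (first) {             // always lower id first
            synchronized (second) {
                this.withdraw(amt);
                a.deposit(amt);
            }
        }
    }
}
```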