Parallel and Distributed Algorithms
Eric Vidal
Reference: R. Johnsonbaugh and M. Schaefer, Algorithms (International Edition), Pearson Education
Outline Introduction (case study: maximum element) – Work-optimality The Parallel Random Access Machine – Shared memory modes – Accelerated cascading Other Parallel Architectures (case study: sorting) – Circuits – Linear processor networks – (Mesh processor networks) Distributed Algorithms – Message-optimality – Broadcast and echo – (Leader election)
Introduction
Why use parallelism? p steps on 1 processor, 1 step on p processors – p = speed-up factor (best case) Given a sequential algorithm, how can we parallelize it? – Some problems are inherently sequential (P-complete)
Case Study: Maximum Element
In: a[]   Out: the maximum element in a

sequential_maximum(a) {
    n = a.length
    max = a[0]
    for i = 1 to n – 1 {
        if (a[i] > max)
            max = a[i]
    }
    return max
}

Running time: O(n)
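The pseudocode above maps directly to Python (a sketch for experimentation, not from the reference):

```python
def sequential_maximum(a):
    """Scan the list once, keeping the largest element seen so far: O(n)."""
    max_val = a[0]
    for x in a[1:]:
        if x > max_val:
            max_val = x
    return max_val
```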
Parallel Maximum
Idea: use ⌈n / 2⌉ processors to compare pairs in a tournament
Note the idle processors after the first step!
Parallel time: O(lg n)
Work-Optimality
Work = number of algorithmic steps × number of processors
Work of the parallel maximum algo = O(lg n) steps × (n / 2) processors = O(n lg n)
Not work-optimal! The sequential algo’s work is O(n)
– Workaround: accelerated cascading…
Formal Algorithm for Parallel Maximum But first!...
The Parallel Random Access Machine
The Parallel Random Access Machine (PRAM) New construct: parallel loop for i = 1 to n in parallel { … } Assumption 1: use n processors to execute this loop (processors are synchronized) Assumption 2: memory shared across all processors
Example: Parallel Search
In: a[], x   Out: true if x is in a, false otherwise

parallel_search(a, x) {
    n = a.length
    found = false
    for i = 0 to n – 1 in parallel {
        if (a[i] == x)
            found = true
    }
    return found
}

Is this work-optimal?
Shared memory modes:
– Exclusive Read (ER) / Concurrent Read (CR)
– Exclusive Write (EW) / Concurrent Write (CW)
Real-world systems are most commonly CREW
parallel_search runs on what type?
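A rough Python illustration of the loop (a hypothetical chunking over 4 threads; a PRAM would dedicate one processor per element). Every writer stores the same value True, so the concurrent writes to found cannot conflict:

```python
import threading

def parallel_search(a, x):
    """Thread-per-chunk sketch of the parallel loop. All threads may set
    found[0] = True concurrently; since every write stores the same value,
    the write order is harmless."""
    found = [False]

    def check(lo, hi):
        for i in range(lo, hi):
            if a[i] == x:
                found[0] = True

    n = len(a)
    k = 4                          # pretend we have 4 processors
    step = max(1, -(-n // k))      # ceil(n / k)
    threads = [threading.Thread(target=check, args=(j, min(j + step, n)))
               for j in range(0, n, step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found[0]
```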
Formal Algorithm for Parallel Maximum
In: a[]   Out: the maximum element in a

parallel_maximum(a) {
    n = a.length
    for i = 0 to ⌈lg n⌉ – 1 {
        for j = 0 to ⌈n / 2^(i+1)⌉ – 1 in parallel {
            if (j × 2^(i+1) + 2^i < n)   // boundary check
                a[j × 2^(i+1)] = max(a[j × 2^(i+1)], a[j × 2^(i+1) + 2^i])
        }
    }
    return a[0]
}

Theorem: parallel_maximum is CREW and finds the maximum element in parallel time O(lg n) and work O(n lg n)
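A sequential Python simulation of the tournament can help check the index arithmetic (each inner iteration would run on its own processor in the PRAM model):

```python
import math

def parallel_maximum(a):
    """Simulate the PRAM tournament: ceil(lg n) rounds; in round i,
    'processor' j compares a[j*2^(i+1)] with a[j*2^(i+1) + 2^i].
    All comparisons within a round are independent (would run in parallel)."""
    a = list(a)
    n = len(a)
    if n == 1:
        return a[0]
    for i in range(math.ceil(math.log2(n))):
        step = 1 << i                         # 2^i
        for j in range(math.ceil(n / (2 * step))):
            lo = j * 2 * step
            hi = lo + step
            if hi < n:                        # boundary check
                a[lo] = max(a[lo], a[hi])
    return a[0]
```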
Accelerated Cascading Phase 1: Use sequential_maximum on blocks of lg n elements – We use n / lg n processors – O(lg n) sequential steps per processor – Total work = O(lg n) steps × (n / lg n) processors = O(n) Phase 2: Use parallel_maximum on the resulting n / lg n elements – lg (n / lg n) parallel steps = lg n – lg (lg n) = O(lg n) – Total work = O(lg n) steps × ((n / lg n) / 2) processors = O(n)
Formal Algorithm for Optimal Maximum
In: a[]   Out: the maximum element in a

optimal_maximum(a) {
    n = a.length
    block_size = ⌈lg n⌉
    block_count = ⌈n / block_size⌉
    create array block_results[block_count]
    for i = 0 to block_count – 1 in parallel {
        start = i × block_size
        end = min(n – 1, start + block_size – 1)
        block_results[i] = sequential_maximum(a[start..end])
    }
    return parallel_maximum(block_results)
}
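A Python sketch of the two phases (Python’s built-in max stands in for both the per-block sequential scan and the phase-2 tournament, so this shows the block structure rather than the parallelism):

```python
import math

def optimal_maximum(a):
    """Accelerated cascading: sequential max on blocks of ceil(lg n)
    elements (phase 1), then a reduction over the ~n/lg n block winners
    (phase 2). Total work is O(n)."""
    n = len(a)
    if n <= 2:
        return max(a)
    block_size = math.ceil(math.log2(n))
    # Phase 1: each of the ceil(n / block_size) 'processors' scans one block.
    winners = [max(a[s:s + block_size]) for s in range(0, n, block_size)]
    # Phase 2: the tournament of parallel_maximum, collapsed here to max().
    return max(winners)
```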
Some Notes
All CR algorithms can be converted to ER algorithms!
– “Broadcasting” an ER variable to all processors for concurrent access takes O(lg n) parallel time
maximum is a “semigroup algorithm”
– Semigroup = a set of elements + an associative binary operation (max, min, +, ×, etc.)
– The same accelerated-cascading method applies to the min element, summation, product of n numbers, etc.!
Other Parallel Architectures
PRAM may not be the best model Shared memory = expensive! – Some algorithms require communication between processors (= memory locking issues) – Better to use channels! Extreme case: very simple processors with no shared memory (just communication channels)
Circuits
Each processor is a gate with a specialized function (e.g., comparator gate)
Circuit = a layout of gates to perform a full task (e.g., sorting)
(Comparator gate: inputs x and y; outputs min(x, y) and max(x, y))
Sorting circuit for 4 elements
(Figure: three steps of comparators; depth of network = 3)
Sorting circuit for n elements? Simpler problem: max element Idea: Add as many of these diagonals as needed
Odd-Even Transposition Network
Theorem: The odd-even transposition network sorts n numbers in n steps using O(n²) processors
Zero-One Principle of Sorting Networks
Lemma: If a sorting network works correctly on all inputs consisting of only 0’s and 1’s, it works for any arbitrary input
– Assume there is a network that sorts 0-1 sequences but fails on some arbitrary input a_0..a_(n–1)
– Let b_0..b_(n–1) be the (unsorted) output of that network
– There must exist positions s < t with b_s > b_t
– Label every value < b_s with 0 and all else with 1; every comparator treats the labels consistently with the values
– If we run a_0..a_(n–1)’s labels through the network, b_s’s label will be 1 and b_t’s label will be 0
– Contradiction: The network is assumed to sort 0-1 sequences properly but did not do so here!
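A sequential sketch of the odd-even transposition network, plus an exhaustive 0-1 check in the spirit of the lemma (the check itself certifies only the length-6 network; the zero-one principle then extends correctness to arbitrary length-6 inputs):

```python
from itertools import product

def odd_even_transposition_sort(a):
    """n rounds; round t fires the comparators (i, i+1) with i ≡ t (mod 2).
    Comparators within a round are disjoint, so the network fires them in
    parallel; this simulation runs them in sequence."""
    a = list(a)
    n = len(a)
    for t in range(n):
        for i in range(t % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

# Exhaustive 0-1 check for n = 6; by the zero-one principle this certifies
# the length-6 network on all inputs.
assert all(odd_even_transposition_sort(list(bits)) == sorted(bits)
           for bits in product([0, 1], repeat=6))
```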
Correctness of the Odd-Even Transposition Network
Assume a binary sequence a_0..a_(n–1)
Let a_i = the first 0 in the sequence
Two cases: i is odd or even
To sort a_0..a_i, we need i steps (worst case)
Induction: Given that a_0..a_k (where k ≥ i) sorts in k steps, will a_0..a_(k+1) get sorted in k + 1 steps?
Better Sorting Networks
Batcher’s Bitonic Sorter (1968)
– Depth O(lg² n), size O(n lg² n)
– Idea: sort 2 groups (recursively), then merge using a network that can sort bitonic sequences
AKS Network (1983)
– Ajtai, Komlós and Szemerédi
– Depth O(lg n), size O(n lg n)
– Not practical! Hides a very large constant c in the c·n lg n size
More Intelligent Processors: Processor Networks
Topologies: star (diameter = 2), linear/ring (diameter = n – 1, or n – 2), completely-connected (diameter = 1), mesh
Sorting on Linear Networks
Emulate an odd-even transposition network! O(n) steps, work O(n²)
– We can’t expect better on a linear network
Sorting on Mesh Networks: Shearsort
Arrange numbers in “boustrophedon” (snake-like) order: a = { 15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3 }
Alternate row phases (sort rows in snake order) and column phases (sort columns); repeat until done
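A sequential Python sketch of shearsort (on a real mesh every row and column would be sorted in parallel; Python’s sort stands in for the per-row and per-column odd-even transposition):

```python
import math

def shearsort(grid):
    """Shearsort on an n×n grid: ceil(lg n) + 1 alternating row/column
    phase pairs, then one final row phase. Snake order: even rows sort
    ascending, odd rows descending; columns always sort ascending."""
    n = len(grid)
    for _ in range(math.ceil(math.log2(n)) + 1):
        for r in range(n):                          # row phase (snake order)
            grid[r].sort(reverse=(r % 2 == 1))
        for c in range(n):                          # column phase (ascending)
            col = sorted(grid[r][c] for r in range(n))
            for r in range(n):
                grid[r][c] = col[r]
    for r in range(n):                              # final row phase
        grid[r].sort(reverse=(r % 2 == 1))
    return grid
```

Running it on the slide’s 4 × 4 example leaves the grid sorted in snake order: reading row 0 left-to-right, row 1 right-to-left, and so on yields 1..16.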
Sorting on Mesh Networks: Shearsort
Theorem: Shearsort sorts n² elements in O(n lg n) steps on an n × n mesh
We can use the Zero-One Principle!
– Only because the algorithm is comparison-exchange (it can be implemented using comparators only)…
– …and oblivious (the outcome of a comparator does not influence which comparisons are made later on)
– (Disclaimer: the reference is actually very unclear about this)
Correctness of Shearsort
(Figure: a 0-1 grid with a full row of 1’s, a full row of 0’s, and another full row of 1’s)
Correctness of Shearsort
2 × lg n phases, each phase takes n steps
– The unsorted (“dirty”) region is guaranteed to be halved after every two phases
Distributed Algorithms
Different concerns altogether…
Problems are usually easy to parallelize
Main problems:
– Inherently asynchronous
– How to broadcast data and ensure every node gets it
– How to minimize bandwidth usage
– What to do when nodes go down (decentralization)
– (Do we trust the results given by the nodes?)
(Examples: distributed prime searches – 2, 3, 5, 7, 13, … – and brute-forcing 56-bit DES)
Message-Optimality
New language constructs: send … to p, receive … from p, terminate
Message complexity = number of messages sent by a distributed algorithm (also uses O-notation)
Broadcast
Initiators vs. noninitiators
Simple case: ring network w/ one initiator

init_ring_broadcast() {
    send token to successor
    receive token from predecessor
    terminate
}

ring_broadcast() {
    receive token from predecessor
    send token to successor
    terminate
}

Theorem: init_ring_broadcast + ring_broadcast broadcasts to n machines with time and message complexity O(n)
Broadcast on a tree network

init_broadcast() {
    N = { q | q is a child neighbor of p }
    for each q ∈ N
        send token to q
    terminate
}

broadcast() {
    receive token from parent
    N = { q | q is a child neighbor of p }
    for each q ∈ N
        send token to q
    terminate
}

Note: no acknowledgment!
Echo
Creates a spanning tree out of any connected network

init_echo() {
    N = { q | q is a neighbor of p }
    for each q ∈ N
        send token to q
    counter = 0
    while (counter < |N|) {
        receive token
        counter = counter + 1
    }
    terminate
}

echo() {
    receive token from parent
    N = { q | q is a neighbor of p } – { parent }
    for each q ∈ N
        send token to q
    counter = 0
    while (counter < |N|) {
        receive token
        counter = counter + 1
    }
    send token to parent
    terminate
}
Echo
Theorem: init_echo + echo has time complexity O(diameter) and message complexity O(edges)
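A sequential simulation of the echo wave under one FIFO delivery schedule (a sketch: it tracks only how each node picks its parent, i.e. the first token it receives, and omits the echo-back counters used for termination detection):

```python
from collections import deque

def echo_spanning_tree(adj, root):
    """Simulate one FIFO schedule of the echo wave on an undirected graph
    given as an adjacency dict. The first token each node receives fixes
    its parent; later tokens on other edges are absorbed. Returns the
    parent map of the resulting spanning tree (root maps to None)."""
    parent = {root: None}
    in_flight = deque((root, nbr) for nbr in adj[root])  # (sender, receiver)
    while in_flight:
        sender, node = in_flight.popleft()
        if node not in parent:                # first token received: adopt parent
            parent[node] = sender
            for nbr in adj[node]:
                if nbr != sender:
                    in_flight.append((node, nbr))
    return parent
```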
Leader Election (for ring networks)

init_election() {
    send token, p.ID to successor
    min = p.ID
    receive token, token_id
    while (p.ID != token_id) {
        if (token_id < min)
            min = token_id
        send token, token_id to successor
        receive token, token_id
    }
    if (p.ID == min)
        i_am_the_leader = true
    else
        i_am_the_leader = false
    terminate
}

election() {
    i_am_the_leader = false
    do {
        receive token, token_id
        send token, token_id to successor
    } while (true)
}

Theorem: init_election + election runs in n steps with message complexity O(n²)
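A synchronous-round simulation of the worst case where all n processes initiate (a hypothetical helper, not from the reference): every process forwards each token one hop per round, tracks the minimum ID it has seen, and stops when its own token returns after n hops. Only the holder of the global minimum declares itself leader, and n tokens × n hops gives the O(n²) message bound.

```python
def ring_election(ids):
    """Simulate the ring election with all n processes initiating, in
    synchronous rounds. Returns (leader flags per position, message count)."""
    n = len(ids)
    seen_min = list(ids)        # minimum ID observed at each position
    tokens = list(ids)          # token currently held at each position
    messages = 0
    for _ in range(n):          # after n hops every token is back home
        tokens = [tokens[(i - 1) % n] for i in range(n)]  # one hop clockwise
        messages += n
        seen_min = [min(seen_min[i], tokens[i]) for i in range(n)]
    leaders = [ids[i] == seen_min[i] for i in range(n)]
    return leaders, messages
```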