PRAM architectures, algorithms, performance evaluation

Shared Memory model and PRAM
- p processors, each may have local memory
- Each processor has an index, available to its local code
- Shared memory
- During each time unit, each processor either
  - performs one compute operation, or
  - performs one memory access
- Challenging: requires a very good shared memory (maybe small)
- Two modes:
  - Synchronous: all processors use the same clock (PRAM)
  - Asynchronous: synchronization is the code's responsibility
- Asynchronous is more realistic
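In the synchronous mode this amounts to a lockstep loop: one tick of the common clock, one operation per processor. A minimal C sketch of that execution pattern (the constant NPROC and the toy step() body are illustrative, not from the slides):

    #include <stdio.h>

    #define NPROC 8                       /* p processors (illustrative) */

    static int shared_mem[NPROC];         /* the shared memory */

    /* One PRAM time unit: processor i performs a single operation,
       here one shared-memory write. */
    static void step(int i, int t)
    {
        shared_mem[i] = i + t;
    }

    int main(void)
    {
        for (int t = 0; t < 4; t++)           /* the common clock */
            for (int i = 0; i < NPROC; i++)   /* every processor, in lockstep */
                step(i, t);
        printf("shared_mem[0] = %d\n", shared_mem[0]);
        return 0;
    }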

The other model: Network
- Linear, ring, mesh, hypercube
- Recall the two key interconnects: FT and torus

A first glimpse, based on:
- Joseph F. JaJa, Introduction to Parallel Algorithms, 1992, www.umiacs.umd.edu/~joseph/
- Uzi Vishkin, PRAM concepts (1981-today), www.umiacs.umd.edu/~vishkin

Definitions
- T*(n): time to solve a problem of input size n on one processor, using the best sequential algorithm
- T_p(n): time to solve on p processors
- SU_p(n) = T*(n) / T_p(n): speedup on p processors
- E_p(n) = T_1(n) / (p·T_p(n)): efficiency (work done on 1 processor / work that could be done on p)
- T_∞(n): shortest run time on any number of processors
- C(n) = P(n)·T(n): cost (processors × time)
- W(n): work = total number of operations

Relations:
- In general T* ≠ T_1; if T* ≈ T_1, then SU_p ≈ T_1/T_p and E_p ≈ SU_p/p
- SU_p ≤ p, E_p ≤ 1
- T_1 ≥ T* ≥ T_p ≥ T_∞
- SU_p ≤ T_1/T_∞, and E_p = T_1/(p·T_p) ≤ T_1/(p·T_∞)
- No use making p larger than the maximum speedup: E → 0 and execution gets no faster
- T_1 ∈ O(C), T_p ∈ O(C/p), W ≤ C
- p ≈ area, W ≈ energy, W/T_p ≈ power
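A worked instance of these relations (the numbers are illustrative, not from the slides): suppose the best sequential algorithm needs T*(n) = T_1(n) = 1000 steps and p = 10 processors finish in T_10(n) = 125 steps. Then

    SU_10 = T*(n)/T_10(n) = 1000/125 = 8          (≤ p = 10)
    E_10  = T_1(n)/(10·T_10(n)) = 1000/1250 = 0.8  (≤ 1)
    C(n)  = 10·125 = 1250, so W(n) ≤ 1250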

SpeedUp and Efficiency
Warning: this is only a (bad) example: an 80%-parallel Amdahl's-law chart. We will see why it is bad when we analyze (and refute) Amdahl's law; for now, consider only the trend.

Example 1: Matrix-vector multiply (Mvm)
- y := Ax, with A of size n×n and x of length n
- Partition A into p row blocks: A = [A_1; A_2; ...; A_p], each A_i of size r×n, with p ≤ n and r = n/p
- Example: n = 256, p = 32: A = [A_1; A_2; ...; A_32], each A_i of size 8×256 (32 processors, each block A_i is 8 rows)
- Processor P_i reads A_i and x, computes and writes y_i
- "Embarrassingly parallel": no cross-dependence
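A minimal shared-memory sketch of the block-row scheme, with OpenMP threads standing in for the processors P_i (the matrix size N and its contents are illustrative; without OpenMP the pragma is simply ignored and the code runs sequentially):

    #include <stdio.h>

    #define N 256                 /* n = 256, as in the example */

    static double A[N][N], x[N], y[N];

    int main(void)
    {
        /* initialize A and x with some values */
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            for (int j = 0; j < N; j++)
                A[i][j] = (i == j) ? 2.0 : 0.0;
        }

        /* each thread owns a block of rows A_i and computes the matching
           slice y_i; rows are independent, so there is no cross-dependence */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j < N; j++)
                s += A[i][j] * x[j];
            y[i] = s;
        }

        printf("y[0] = %f\n", y[0]);
        return 0;
    }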

Performance of Mvm
- T_1(n²) = O(n²)
- T_p(n²) = O(n²/p): linear speedup, SU = p
- Cost = O(p·n²/p) = O(n²), W = C, W/T_p = p: linear power
- E_p = T_1/(p·T_p) = n²/(p·n²/p) = 1: perfect efficiency
- (chart omitted) We use log-log charts

Example 2: SPMD Sum A(1:n) on PRAM
- SPMD? MIMD? SIMD?
- Given n = 2^k; i is the processor index

Begin
  1. global read(a ← A(i))
  2. global write(a → B(i))
  3. for h = 1:k
       if i ≤ n/2^h then begin
         global read(x ← B(2i-1))
         global read(y ← B(2i))
         z := x + y
         global write(z → B(i))
       end
  4. if i = 1 then global write(z → S)
End

Example (n = 8, h = 1): processor i = 1 adds elements 1,2; i = 2 adds 3,4; i = 3 adds 5,6; i = 4 adds 7,8.

Logarithmic sum
The PRAM algorithm:   // sum vector A(*)
Begin
  B(i) := A(i)
  for h = 1:log(n)
    if i ≤ n/2^h then B(i) := B(2i-1) + B(2i)
End
// B(1) holds the sum
(figure: binary reduction tree over a1..a8, levels h = 1, 2, 3)
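A sequential C simulation of this logarithmic sum, sweeping the active processors level by level exactly as the pseudocode prescribes (the input vector is illustrative):

    #include <stdio.h>

    #define N 8                      /* n = 2^k, here k = 3 */

    int main(void)
    {
        double A[N + 1], B[N + 1];   /* 1-based indexing, as in the slides */

        for (int i = 1; i <= N; i++)
            A[i] = (double)i;        /* A = 1, 2, ..., 8; the sum is 36 */

        /* B(i) := A(i) */
        for (int i = 1; i <= N; i++)
            B[i] = A[i];

        /* for h = 1..log n: processor i (i <= n/2^h) does B(i) := B(2i-1)+B(2i) */
        for (int h = 1; (N >> h) >= 1; h++)
            for (int i = 1; i <= (N >> h); i++)
                B[i] = B[2 * i - 1] + B[2 * i];

        printf("sum = %g\n", B[1]);  /* B(1) holds the sum: 36 */
        return 0;
    }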

Performance of Sum (p = n)
- T*(n) = T_1(n) = n
- T_{p=n}(n) = 2 + log n
- SU_p = n / (2 + log n)
- Cost = p·(2 + log n) ≈ n log n
- E_p = T_1/(p·T_p) = n/(n log n) = 1/log n
- (log-log chart, p = n) Speedup and efficiency decrease

Performance of Sum (n >> p)
- T*(n) = T_1(n) = n
- T_p(n) = n/p + log p
- SU_p = n / (n/p + log p) ≈ p
- Cost = p·(n/p + log p) ≈ n
- Work = n + p ≈ n
- E_p = T_1/(p·T_p) = n / (p·(n/p + log p)) ≈ 1
- (log-log chart) Speedup and power are linear, cost is fixed, efficiency is 1 (max)

Work doing Sum
- Example: n = p = 8
- T_8 = 2 + log 8 = 5
- C = 8·5 = 40: could have done 40 operations
- W = 2n = 16: only 16 of the 40 are used, 24 are wasted
- E_p ≈ 2/log n = 2/3 ≈ 0.67
- W/C = 16/40 = 0.4
- (figure: reduction tree with the operations counted per level; total Work = 16)

Which PRAM? Namely, how does it write?
- Exclusive Read Exclusive Write (EREW)
- Concurrent Read Exclusive Write (CREW)
- Concurrent Read Concurrent Write (CRCW)
  - Common: concurrent write allowed only if all processors write the same value
  - Arbitrary: one write succeeds, the others are ignored
  - Priority: the minimum-index processor succeeds
- Computational power: EREW < CREW < CRCW
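A small C sketch of how a PRAM simulator might resolve several pending writes to one cell under the three CRCW rules (the enum, function, and example values are illustrative, not a standard API):

    #include <stdio.h>
    #include <assert.h>

    enum policy { COMMON, ARBITRARY, PRIORITY };

    /* Resolve n concurrent writes to one cell.
       writers[k] is the processor index, values[k] the value it writes. */
    static int resolve(enum policy p, const int *writers, const int *values, int n)
    {
        switch (p) {
        case COMMON:                       /* legal only if all values agree */
            for (int k = 1; k < n; k++)
                assert(values[k] == values[0]);
            return values[0];
        case ARBITRARY:                    /* any one write may win */
            return values[0];
        case PRIORITY: {                   /* minimum processor index wins */
            int best = 0;
            for (int k = 1; k < n; k++)
                if (writers[k] < writers[best]) best = k;
            return values[best];
        }
        }
        return 0;
    }

    int main(void)
    {
        int writers[] = {5, 2, 7}, values[] = {30, 20, 70};
        printf("priority winner writes %d\n", resolve(PRIORITY, writers, values, 3));
        return 0;
    }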

Simplifying pseudo-code
Replace
  global read(x ← B)
  global read(y ← C)
  z := x + y
  global write(z → A)
by
  A := B + C      // A, B, C are shared variables

Example 3: Matrix multiply on PRAM
- C := AB, matrices of size n×n, n = 2^k
- Recall Mm: C_{i,j} = Σ_{l=1..n} A_{i,l}·B_{l,j}
- p = n³ processors
- Steps:
  - Processor P_{i,j,l} computes A_{i,l}·B_{l,j}
  - The n processors P_{i,j,1:n} compute Sum_{l=1..n} A_{i,l}·B_{l,j}

Mm Algorithm
Begin
  1. T_{i,j,l} := A_{i,l}·B_{l,j}
  2. for h = 1:k
       if l ≤ n/2^h then T_{i,j,l} := T_{i,j,2l-1} + T_{i,j,2l}
  3. if l = 1 then C_{i,j} := T_{i,j,1}
End
- Each processor knows its i, j, l indices (or computes them from an instance number)
- Step 1 computes A_{i,l}·B_{l,j}: concurrent read
- Step 2: sum (logarithmic reduction over l)
- Step 3: store: exclusive write
- Runs on a CREW PRAM
- What is the purpose of "if l = 1" in step 3? What happens if it is eliminated?
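A sequential C simulation of the three steps, with the triple loop over (i, j, l) standing in for the n³ processors (n = 4 and the identity-matrix inputs are illustrative):

    #include <stdio.h>

    #define N 4                               /* n = 2^k, here k = 2 */

    static double A[N][N], B[N][N], C[N][N];
    static double T[N][N][N + 1];             /* T[i][j][l], l is 1-based */

    int main(void)
    {
        /* A = B = identity, so C should also be the identity */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[i][j] = (i == j) ? 1.0 : 0.0;

        /* Step 1: processor P(i,j,l) computes A[i][l]*B[l][j]  (concurrent read) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int l = 1; l <= N; l++)
                    T[i][j][l] = A[i][l - 1] * B[l - 1][j];

        /* Step 2: logarithmic reduction over l */
        for (int h = 1; (N >> h) >= 1; h++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    for (int l = 1; l <= (N >> h); l++)
                        T[i][j][l] = T[i][j][2 * l - 1] + T[i][j][2 * l];

        /* Step 3: only the l = 1 processor stores, so the write is exclusive */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = T[i][j][1];

        printf("C[0][0] = %g, C[0][1] = %g\n", C[0][0], C[0][1]);
        return 0;
    }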

Performance of Mm
- T_1 = n³
- T_{p=n³} = log n
- SU = n³ / log n
- (log-log chart)

Prefix Sum
- Take advantage of the idle processors in Sum
- Compute all prefix sums S_i = Σ_{j=1..i} a_j:
  a_1, a_1 + a_2, a_1 + a_2 + a_3, ...
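For reference, a plain sequential C sketch of the definition (the PRAM version is the subject of the next slide and of HW3; the input is illustrative):

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N + 1], s[N + 1];           /* 1-based, as in the slides */

        for (int i = 1; i <= N; i++)
            a[i] = 1.0;                      /* illustrative input: all ones */

        s[1] = a[1];
        for (int i = 2; i <= N; i++)
            s[i] = s[i - 1] + a[i];          /* S_i = S_{i-1} + a_i */

        for (int i = 1; i <= N; i++)
            printf("S_%d = %g\n", i, s[i]);  /* prints 1, 2, ..., 8 */
        return 0;
    }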

Prefix Sum on CREW PRAM
(figure: computing s1..s8 from a1..a8)
HW3: write this as a PRAM algorithm (due May 6, 2012)

Is PRAM implementable?
- It can serve as an ideal model for theoretical algorithms
- Algorithms may be converted to real machine models (XMT, Plural, Tilera, ...)
- Or it can be implemented 'directly':
  - Concurrent read by detect-and-multicast
    - like the Plural P2M net
    - like the XMT read-only buffers
  - Concurrent write: how?
    - Fetch & Op: serializing write
    - Prefix-sum (fetch-and-add) on XMT: serializing write
    - Common CRCW: detect-and-merge
    - Priority CRCW: detect-and-prioritize
    - Arbitrary CRCW: arbitrarily...

Common CRCW example 1: DNF
- Boolean DNF (sum of products): X = a1b1 + a2b2 + a3b3 + ... (AND, OR operations)
- PRAM code (X initialized to 0, task index = $):
    if (a$b$) X = 1;
- Common output: not all processors write X, but those that do all write 1
- Time: O(1)
- Great for other associative operators, e.g. (a1+b1)(a2+b2)...: OR/AND (CNF): initialize X = 1; if NOT(a$+b$) X = 0;
- Works on common / priority / arbitrary CRCW
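A shared-memory sketch of the O(1) DNF evaluation, with OpenMP threads playing the per-term processors; every thread that writes X stores the same value 1, which is exactly what the common rule permits (term count and inputs are illustrative; the atomic write only keeps the C code free of data races):

    #include <stdio.h>

    #define TERMS 4

    int main(void)
    {
        int a[TERMS] = {0, 1, 0, 1};
        int b[TERMS] = {0, 0, 0, 1};
        int X = 0;                         /* X initialized to 0 */

        /* one "processor" per product term; all writers store the same value 1 */
        #pragma omp parallel for
        for (int t = 0; t < TERMS; t++)
            if (a[t] && b[t]) {
                #pragma omp atomic write   /* concurrent writers, same value */
                X = 1;
            }

        printf("X = %d\n", X);             /* 1, because a[3]b[3] = 1 */
        return 0;
    }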

Common CRCW example 2: Transitive Closure
- The transitive closure G* of a directed graph G may be computed by matrix multiplication
- B: adjacency matrix; B^k shows paths of exactly k steps; (B+I)^k shows paths of 1, 2, ..., k steps
- Compute (B+I)^(|V|-1) in log|V| steps: how?
- Boolean matrix multiply (AND, OR) shows only the existence of paths; normal multiply counts the number of paths
- |V| = n, |B| = n×n

                         P      W          T
  Matrix multiply        n³     n³         1
  Transitive closure     n³     n³ log n   log n

Joseph F. JaJa, Introduction to Parallel Algorithms, 1992, Ch. 5
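A sequential C sketch of the (B+I)^(|V|-1) idea: set M := B OR I and square it repeatedly with a Boolean (AND/OR) matrix multiply, so about log|V| multiplications cover all paths of up to |V|-1 steps (the 4-vertex path graph is illustrative):

    #include <stdio.h>
    #include <string.h>

    #define V 4

    /* Boolean matrix multiply: out[i][j] = OR over l of (x[i][l] AND y[l][j]) */
    static void bool_mm(const int x[V][V], const int y[V][V], int out[V][V])
    {
        for (int i = 0; i < V; i++)
            for (int j = 0; j < V; j++) {
                int v = 0;
                for (int l = 0; l < V; l++)
                    v = v || (x[i][l] && y[l][j]);
                out[i][j] = v;
            }
    }

    int main(void)
    {
        /* adjacency matrix B of the directed path 0 -> 1 -> 2 -> 3 */
        int B[V][V] = {{0,1,0,0},{0,0,1,0},{0,0,0,1},{0,0,0,0}};
        int M[V][V], tmp[V][V];

        /* M := B + I (paths of length 0 or 1) */
        for (int i = 0; i < V; i++)
            for (int j = 0; j < V; j++)
                M[i][j] = B[i][j] || (i == j);

        /* square M until it covers paths of length up to |V|-1 */
        for (int len = 1; len < V - 1; len *= 2) {
            bool_mm(M, M, tmp);
            memcpy(M, tmp, sizeof M);
        }

        printf("reachable(0,3) = %d\n", M[0][3]);   /* 1: path 0->1->2->3 */
        return 0;
    }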

Arbitrary CRCW example: Connectivity
Serial algorithm for connected components:
  for each vertex v ∈ V
    MakeSet(v)
  for each edge (u,v) ∈ E                           // arbitrary order
    if (Set(u) ≠ Set(v)) Union(Set(u), Set(v))      // arbitrary union

Parallel version:
- One processor per edge
- set(v) is a shared variable
- Each set is named after one of the nodes it includes
- Union selects the lower available index
- Example (vertices 1, 2, 8, 3 with edges a = (1,2), b = (2,8), c = (8,3)): P(b) writes set(8) = 2 while P(c) writes set(8) = 3. No problem: arbitrary CRCW selects one of them arbitrarily.
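A sequential C sketch of the processor-per-edge idea, realized here as simple label propagation: each edge repeatedly relabels its endpoints with the smaller set name until nothing changes (the 1-2-8-3 graph matches the example; this iteration scheme is one straightforward way to realize the idea, not necessarily the slides' exact algorithm):

    #include <stdio.h>

    #define NV 9        /* vertex names go up to 8 in the example */
    #define NE 3

    int main(void)
    {
        int eu[NE] = {1, 2, 8};          /* edges a=(1,2), b=(2,8), c=(8,3) */
        int ev[NE] = {2, 8, 3};
        int set[NV];

        for (int v = 0; v < NV; v++)     /* MakeSet(v): each vertex its own set */
            set[v] = v;

        /* one "processor" per edge; iterate until no label changes */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int e = 0; e < NE; e++) {
                int u = eu[e], v = ev[e];
                int m = set[u] < set[v] ? set[u] : set[v];
                if (set[u] != m) { set[u] = m; changed = 1; }
                if (set[v] != m) { set[v] = m; changed = 1; }
            }
        }

        printf("set(1)=%d set(2)=%d set(8)=%d set(3)=%d\n",
               set[1], set[2], set[8], set[3]);   /* all end up in set 1 */
        return 0;
    }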

Arbitrary CRCW example: Connectivity (trace)
Edges a = (1,2), b = (2,8), c = (8,3); initially set(v) = v for v = 1, 2, 8, 3.

  T   P(a)        P(b)        P(c)
  1   set(2)=1    set(8)=2    set(8)=3     (concurrent writes to set(8); arbitrary winner, here set(8)=2)
  2               set(8)=1    set(3)=2
  3                           set(3)=1

Try it also with a different arbitrary result.

Why PRAM?
- Large body of algorithms
- Easy to think about
- A synchronous version of shared memory: eliminates synchronization and communication issues, allows focusing on the algorithm
- But it allows adding these issues back in, and allows conversion to asynchronous versions
- Architectures exist for both the synchronous (PRAM) model and the asynchronous (SM) model
- PRAM algorithms can be mapped to other models