1
Parallel Sorting. James Demmel, CS267 Lecture 24, 04/27/2009. www.cs.berkeley.edu/~demmel/cs267_Spr09
2
Some Sorting Algorithms
Choice of algorithm depends on:
- Is the data all in memory, or on disk/tape?
- Do we compare operands, or just use their values?
- Sequential or parallel? Shared or distributed memory? SIMD or not?
We will consider all data in memory and parallel:
- Bitonic Sort: naturally parallel, suitable for SIMD architectures
- Sample Sort: good generalization of Quick Sort to the parallel case
- Radix Sort
Data measured on the CM-5, written in Split-C (a precursor of UPC): www.cs.berkeley.edu/~culler/papers/sort.ps
3
LogP: a model to predict and understand performance. A machine is P processor/memory modules connected by an interconnection network with limited capacity (at most L/g messages in transit to or from any processor), characterized by four parameters:
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/bandwidth)
- P: number of processors
4
Bottom Line on CM-5 using Split-C (Preview)
[Chart: time per key for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
- Good fit between predicted (using the LogP model) and measured times (within 10%)
- No single algorithm is always best: scaling varies with processor count, input size, and sensitivity to key distribution
- All are global/local hybrids; the local part is hard to implement and model
5
Bitonic Sort (1/2)
A bitonic sequence is one that is:
1. monotonically increasing and then monotonically decreasing,
2. or can be circularly shifted to satisfy 1.
A half-cleaner takes a bitonic sequence and produces one in which:
1. every element of the first half is no larger than every element of the second half,
2. both halves are bitonic.
Apply recursively to each half to complete the sort.
Where do we get a bitonic sequence to start with?
[Diagram: half-cleaner network applied to a 0/1 bitonic sequence]
6
Bitonic Sort (2/2)
A bitonic sequence is one that is:
1. monotonically increasing and then monotonically decreasing,
2. or can be circularly shifted to satisfy 1.
Any sequence of length 2 is bitonic, so we can sort it using the bitonic sorter on the last slide.
Sort all length-2 sequences in alternating increasing/decreasing order; two such length-2 sequences form a length-4 bitonic sequence.
Recursively: sort length-2^k sequences in alternating increasing/decreasing order; two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted).
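For concreteness, here is a minimal sequential sketch of this compare-exchange recursion (plain C, not the Split-C code measured in this lecture; it assumes n is a power of two and sorts ascending when dir = 1):

```c
#include <stdlib.h>

/* Sketch only: sequential bitonic sort; dir = 1 ascending, dir = 0 descending. */
static void compare_exchange(int *a, int i, int j, int dir) {
    if (dir == (a[i] > a[j])) {          /* swap if out of order for this direction */
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* "Half-cleaner" + recursion: merge a bitonic sequence of length n into sorted order. */
static void bitonic_merge(int *a, int lo, int n, int dir) {
    if (n <= 1) return;
    int m = n / 2;
    for (int i = lo; i < lo + m; i++)
        compare_exchange(a, i, i + m, dir);   /* half-cleaner stage */
    bitonic_merge(a, lo, m, dir);             /* both halves are now bitonic */
    bitonic_merge(a, lo + m, n - m, dir);
}

/* Build bitonic sequences bottom-up: sort the halves in opposite directions, then merge. */
static void bitonic_sort(int *a, int lo, int n, int dir) {
    if (n <= 1) return;
    int m = n / 2;
    bitonic_sort(a, lo, m, 1);                /* ascending half  */
    bitonic_sort(a, lo + m, n - m, 0);        /* descending half */
    bitonic_merge(a, lo, n, dir);             /* a[lo..lo+n) is bitonic: merge it */
}
```

On a parallel machine each column of compare-exchanges in the merge becomes one communication step, which is why the dependency pattern on the next slide looks like an FFT.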
7
Bitonic Sort for n=16 (all dependencies shown)
[Diagram: bitonic sorting network in block layout]
Similar pattern to the FFT: similar optimizations possible.
8
Bitonic Sort: time per key
[Charts: predicted (LogP model) and measured time per key (µs/key, 0-80) vs. N/P from 16384 to 1048576, for P = 32, 64, 128, 256, 512]
9
Sample Sort
1. Compute P-1 key values that would split the input into roughly equal pieces:
   - take S ~ 64 samples per processor
   - sort the P·S samples (on processor 0)
   - let sample keys S, 2·S, ..., (P-1)·S be the "splitters"
   - broadcast the splitters
2. Distribute keys based on the splitters: keys with Splitter(i-1) < key <= Splitter(i) are all sent to processor i.
3. Local sort of keys on each processor.
[4.] Possibly redistribute, so each processor ends up with N/P keys.
If the samples represent the total population, the splitters should divide it into P roughly equal pieces.
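A rough serial sketch of the splitter-selection and key-routing logic in steps 1-2; P, S, and the helper names are illustrative, not from the CM-5 implementation:

```c
#include <stdlib.h>

static int cmp_int(const void *x, const void *y) {
    return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
}

/* Choose P-1 splitters from the P*S gathered samples (as done on processor 0). */
static void choose_splitters(int *samples, int P, int S, int *splitters) {
    qsort(samples, (size_t)P * S, sizeof(int), cmp_int);   /* sort all P*S samples */
    for (int i = 1; i < P; i++)
        splitters[i - 1] = samples[i * S];                  /* keys S, 2S, ..., (P-1)S */
}

/* Destination processor for one key: the first i with key <= splitters[i], else P-1. */
static int dest_proc(int key, const int *splitters, int P) {
    int lo = 0, hi = P - 1;                  /* binary search over the P-1 splitters */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (key <= splitters[mid]) hi = mid; else lo = mid + 1;
    }
    return lo;
}
```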
10
Sample Sort: Times
[Charts: predicted and measured time per key (µs/key, 0-30) vs. N/P from 16384 to 1048576, for P = 32, 64, 128, 256, 512]
11
Sample Sort: Timing Breakdown. Predicted and measured (-m) times.
12
Sequential Radix Sort: Counting Sort
Idea: build a histogram of the keys and compute the position in the answer array for each element.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temporary array B and write each value into its position: B = [1, 1, 3, 3, 4, 4, 4, 5]
Cost = O(#keys + size of histogram)
What if the histogram is too large (e.g., all 32-bit ints? all words?)
13
Radix Sort: Separate Key Into Parts
Divide keys into parts, e.g., by digits (radix), and apply counting sort to one digit at a time, starting with the least significant.
Example with 3-character keys (sat, run, sat, pin, saw, pin, saw, run, tip): sort on the 3rd character, then the 2nd, then the 1st; each pass is a stable counting sort.
[Diagram: the list of keys after each pass]
Cost = O(#keys * #characters)
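A sequential sketch of this idea for 32-bit keys, using a stable counting sort per 8-bit digit (the 8-bit digit width is an illustrative choice; the CM-5 study used an 11-bit radix):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: LSD radix sort of 32-bit unsigned keys, 4 passes of 8-bit digits. */
static void radix_sort_u32(uint32_t *a, size_t n) {
    uint32_t *b = malloc(n * sizeof *b);
    if (!b) return;
    for (int pass = 0; pass < 4; pass++) {
        size_t count[257] = {0};
        int shift = 8 * pass;
        /* 1. histogram of the current digit */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* 2. prefix sum: starting position of each bucket */
        for (int d = 1; d <= 256; d++)
            count[d] += count[d - 1];
        /* 3. stable distribution into the temporary array */
        for (size_t i = 0; i < n; i++)
            b[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, b, n * sizeof *a);
    }
    free(b);
}
```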
14
Histo-radix sort (P processors, n = N/P keys per processor)
Per pass, for r-bit digits (2^r bins):
1. compute a local histogram of the current digit
2. compute the position of the first member of each bucket in the global array (2^r scans with end-around)
3. distribute all the keys
Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers.
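A minimal serial simulation of step 2: given the P local histograms, compute where each processor's first key of each bucket lands in the global array. The array names and the bucket-major layout are assumptions for illustration, not the Split-C code:

```c
/* Sketch only: r-bit digits give 2^r bins; r = 4 is an illustrative choice. */
#define R 4
#define BINS (1 << R)

/* hist[p][b]   = number of local keys on processor p that fall in bin b.
   offset[p][b] = global index where processor p writes its first key of bin b. */
static void global_offsets(int P, const long hist[][BINS], long offset[][BINS]) {
    long base = 0;                          /* running global position          */
    for (int b = 0; b < BINS; b++)          /* buckets laid out in order ...     */
        for (int p = 0; p < P; p++) {       /* ... and within a bucket, by proc  */
            offset[p][b] = base;
            base += hist[p][b];
        }
}
```

On the real machine this double loop becomes the 2^r parallel scans mentioned in step 2.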
15
Histo-Radix Sort (again)
[Diagram: local data and local histograms across P processors]
Each pass: form local histograms, form the global histogram, globally distribute the data.
16
Radix Sort: Times
[Charts: predicted and measured time per key (µs/key, 0-140) vs. N/P from 16384 to 1048576, for P = 32, 64, 128, 256, 512]
17
Radix Sort: Timing Breakdown. Predicted and measured (-m) times.
18
Local Sort Performance on CM-5 (11-bit radix sort of 32-bit numbers)
[Chart: local sort time per key vs. key entropy]
Entropy = -Σ_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#distinct keys).
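For reference, a small sketch of the entropy measure used on these slides (it assumes the key-value probabilities have already been estimated):

```c
#include <math.h>

/* H = -sum_i p_i log2 p_i, where p[i] is the probability of key value i. */
static double key_entropy(const double *p, int nkeys) {
    double h = 0.0;
    for (int i = 0; i < nkeys; i++)
        if (p[i] > 0.0)
            h -= p[i] * log2(p[i]);
    return h;   /* 0 for a single key value, log2(nkeys) for uniform keys */
}
```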
19
Radix Sort: Timing Dependence on Key Distribution
[Chart: time per key vs. entropy]
The slowdown is due to contention in the redistribution step.
Entropy = -Σ_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#distinct keys).
20
Bottom Line on CM-5 using Split-C
[Chart: time per key for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
- Good fit between predicted (using the LogP model) and measured times (within 10%)
- No single algorithm is always best: scaling varies with processor count, input size, and sensitivity to key distribution
- All are global/local hybrids; the local part is hard to implement and model
21
Sorting Conclusions
- The distributed memory model leads to a hybrid global/local algorithm: use the best local algorithm combined with a global part.
- The LogP model is good enough to model the global part: bandwidth (g) or overhead (o) matter most, including end-point contention; latency (L) only matters when bandwidth doesn't.
- Modeling local computational performance is harder: it is dominated by effects of the storage hierarchy (e.g., TLBs) and depends on entropy.
- See http://www.cs.berkeley.edu/~culler/papers/sort.ps
- See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps (disk-to-disk parallel sorting)
22
EXTRA SLIDES
23
Radix: Stream Broadcast Problem
Time = (P-1)(2o + L + (n-1)g)?
Need to slow the first processor for the broadcast to pipeline well:
- Processor 0 does only sends; the others receive and then send.
- Receives are prioritized over sends, so processor 0 needs to be delayed.
24
What's the right communication mechanism?
- Permutation via writes: consistency model? false sharing?
- Reads?
- Bulk transfers? What do you need to change in the algorithm?
- Network scheduling?
25
Comparison
[Chart: time per key for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
- Good fit between predicted and measured times (within 10%)
- Different sorts for different sorts: scaling varies with processor count, input size, and sensitivity
- All are global/local hybrids; the local part is hard to implement and model
26
Outline
- Some performance laws
- Performance analysis
- Performance modeling
- Parallel sorting: combining models with measurements
Reading: Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text, http://www-unix.mcs.anl.gov/dbpp/text/node26.html), abbreviated DBPP in this lecture; David Bailey's "Twelve Ways to Fool the Masses".
27
Measuring Performance
The performance criterion may vary with the domain:
- There may be limits on acceptable running time; e.g., a climate model must run 1000x faster than real time.
- Any performance improvement may be acceptable; e.g., faster on 4 cores than on 1.
- Throughput may be more critical than latency; e.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.
- Performance per unit cost; e.g., GFlop/s, GFlop/s/$, or GFlop/s/Watt.
- Parallel scalability (speedup or parallel efficiency).
- Percent of best possible performance (some kind of peak).
28
Amdahl's Law (review)
Suppose only part of an application is parallelizable. Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized, and let P be the number of processors. Then
Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
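A quick illustration with made-up numbers (not from the slide): a 1% sequential fraction on 1024 processors gives

```latex
\mathrm{Speedup}(1024) \le \frac{1}{0.01 + 0.99/1024} \approx 91,
\qquad
\lim_{P \to \infty} \mathrm{Speedup}(P) \le \frac{1}{s} = 100 .
```

so over 90% of the machine is idle even though only 1% of the work is sequential.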
29
9/19/2015CS267, Yelick29 Amdahl’s Law (for 1024 processors) Does this mean parallel computing is a hopeless enterprise? Source: Gustafson, Montry, Benner
30
9/19/2015CS267, Yelick30 Scaled Speedup See: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp.609.
31
9/19/2015CS267, Yelick31 Scaled Speedup (background)
32
Little's Law
Latency vs. bandwidth:
- Latency is physics (wire length); e.g., the network latency on the Earth Simulator is only about 2x the time light takes to cross the machine room.
- Bandwidth is cost: add more cables to increase bandwidth (an over-simplification).
Principle (Little's Law): for a production system in steady state, Inventory = Throughput x Flow Time.
For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency: required concurrency = bandwidth x latency (the bandwidth-delay product).
33
Little's Law Examples
Example 1: a single processor. If the latency to memory is 50 ns and the bandwidth is 5 GB/s (0.2 ns/byte, i.e., 12.8 ns per 64-byte cache line), the system must support 50/12.8 ~= 4 outstanding cache-line misses to keep things balanced (run at bandwidth speed). An application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them).
Example 2: a 1000-processor system with a 1 GHz clock, 100 ns memory latency, and 100 words of memory in the data paths between CPU and memory. Main memory bandwidth is ~1000 x 100 words x 10^9/s = 10^14 words/sec. To achieve full performance, an application needs ~10^-7 s x 10^14 words/s = 10^7-way concurrency (some of which may be hidden in the instruction stream).
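A tiny sketch of the Example 1 arithmetic (function and parameter names are illustrative):

```c
/* Outstanding cache-line misses needed to run at bandwidth speed,
   from latency and bandwidth; values below are from Example 1. */
static double required_concurrency(double latency_ns, double bw_bytes_per_ns,
                                   double line_bytes) {
    double line_time_ns = line_bytes / bw_bytes_per_ns;  /* 64 B / 5 B/ns = 12.8 ns */
    return latency_ns / line_time_ns;                    /* 50 / 12.8 ~= 3.9, i.e. ~4 */
}
/* required_concurrency(50.0, 5.0, 64.0) ~= 3.9 outstanding misses */
```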
34
In Performance Analysis: Use More Data
Whenever possible, use a large set of data rather than one or a few isolated points; a single point has little information.
E.g., from DBPP: a serial algorithm scales as N + N^2, and we observe a speedup of 10.8 on 12 processors with problem size N = 100.
- Case 1: T = N + N^2/P
- Case 2: T = (N + N^2)/P + 100
- Case 3: T = (N + N^2)/P + 0.6*P^2
All have speedup ~10.8 on 12 processors, but performance graphs (n = 100, 1000) show large differences in scaling.
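A short sketch that reproduces the point: all three models agree at the single measured point but scale very differently elsewhere (the code is illustrative, not from DBPP):

```c
#include <stdio.h>

int main(void) {
    double N = 100.0, P = 12.0;
    double t1[3], tp[3];
    t1[0] = N + N * N;            tp[0] = N + N * N / P;                 /* case 1 */
    t1[1] = (N + N * N) + 100.0;  tp[1] = (N + N * N) / P + 100.0;       /* case 2 */
    t1[2] = (N + N * N) + 0.6;    tp[2] = (N + N * N) / P + 0.6 * P * P; /* case 3 */
    for (int c = 0; c < 3; c++)
        printf("case %d: speedup = %.1f\n", c + 1, t1[c] / tp[c]);
    /* all three print roughly 10.8; try other P or N to see them diverge */
    return 0;
}
```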
35
Example: Immersed Boundary Simulation
Joint work with Ed Givelberg and Armando Solar-Lezama, using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC.
How useful is this data? What are ways to make it more useful/interesting?
36
9/19/2015CS267, Yelick36 Performance Analysis
37
Building a Performance Model
Based on measurements/scaling of components:
- The FFT time is 5·n·log n flops divided by the measured flop rate (measured for the FFT).
- Other costs are linear in either material or fluid points. Measure constants: (a) flops per point (independent of machine and problem size) and (b) flop rate (measured per machine, per phase); the time is then a x #points / b.
- Communication is modeled similarly: find a formula for message size as a function of problem size, and check the formula using tracing of some kind.
- Use the model to predict running time as a function of problem size.
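A minimal sketch of evaluating such a per-phase model, assuming the constants have been measured separately (function names and parameters are illustrative):

```c
#include <math.h>

/* FFT phase: 5 n log2 n flops at the measured flop rate for the FFT. */
static double fft_phase_time(double n, double flops_per_sec) {
    return 5.0 * n * log2(n) / flops_per_sec;
}

/* Linear phase: a = flops per point (measured), divided by the measured flop rate. */
static double linear_phase_time(double flops_per_point, double points,
                                double flops_per_sec) {
    return flops_per_point * points / flops_per_sec;
}
/* Total predicted time is the sum over phases plus the communication model. */
```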
38
A Performance Model
[Chart: predicted time per timestep vs. processor count]
512^3 in under 1 second per timestep is not possible; primarily limited by bisection bandwidth.
39
9/19/2015CS267, Yelick39 Model Success and Failure
40
OSKI SpMV: What We Expect
Assume Cost(SpMV) = time to read the matrix; 1 double-word = 2 integers; r, c in {1, 2, 4, 8}.
- CSR: 1 int per non-zero
- BCSR(r x c): 1 int per (r*c) non-zeros
As r*c increases, the speedup should increase smoothly and approach 1.5 (eliminating all index overhead).
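A small sketch of this expected-speedup model (8-byte values and 4-byte indices are assumptions consistent with "1 double-word = 2 integers"):

```c
/* Cost(SpMV) = time to read the matrix. CSR stores one index per nonzero;
   BCSR(r x c) amortizes one index over r*c nonzeros. */
static double expected_bcsr_speedup(int r, int c) {
    double csr_bytes_per_nz  = 8.0 + 4.0;             /* value + column index        */
    double bcsr_bytes_per_nz = 8.0 + 4.0 / (r * c);   /* index amortized over block  */
    return csr_bytes_per_nz / bcsr_bytes_per_nz;      /* -> 12/8 = 1.5 as r*c grows  */
}
```

In practice register blocking can also pad blocks with explicit zeros, which this simple model ignores; that is one reason the measured profile on the next slide motivates search.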
41
What We Get (The Need for Search)
[Chart: measured SpMV performance (Mflop/s) across r x c block sizes; reference vs. best, with the best at 4x2]
42
9/19/2015CS267, Yelick42 Using Multiple Models
43
9/19/2015CS267, Yelick43 Multiple Models
44
9/19/2015CS267, Yelick44 Multiple Models
45
Extended Example: Using Performance Modeling (LogP) to Explain Data. Application to Sorting.
46
Deriving the LogP Model
- Processing: powerful microprocessor, large DRAM, cache => P
- Communication:
  + significant latency (100's to 1000's of cycles) => L
  + limited bandwidth (1 - 5% of memory bandwidth) => g
  + significant overhead (10's to 100's of cycles, on both ends) => o
  + limited network capacity
- No consensus on topology => should not exploit structure
- No consensus on programming model => should not enforce one
47
LogP
[Diagram: P processor/memory modules connected by an interconnection network; limited volume (at most L/g messages to or from a processor)]
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/bandwidth)
- P: number of processors
48
Using the LogP Model
- Send n messages from one processor to another: time 2o + L + g(n-1). Each processor spends o·n cycles on overhead and has (g-o)(n-1) + L compute cycles available.
- Sending n total messages from one processor to many takes the same time.
- Sending n messages from many processors to one takes the same time, but all but L/g processors block, so fewer compute cycles are available unless the sends are scheduled carefully.
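A small sketch of these cost expressions, plugged in with the CM-5 parameters quoted later in the deck (L = 6 µs, o = 2.2 µs, g = 4 µs); the exchange formula comes from the next slide:

```c
/* Sketch only: LogP cost estimates in microseconds. */
typedef struct { double L, o, g; } logp_t;

/* Time to stream n small messages from one processor to another. */
static double stream_time(logp_t m, int n) {
    return 2.0 * m.o + m.L + m.g * (n - 1);
}

/* Two processors exchanging n messages each: 2o + L + (n-1) * max(g, 2o). */
static double exchange_time(logp_t m, int n) {
    double step = (m.g > 2.0 * m.o) ? m.g : 2.0 * m.o;
    return 2.0 * m.o + m.L + step * (n - 1);
}
/* Example: stream_time((logp_t){6.0, 2.2, 4.0}, 100) = 4.4 + 6 + 4*99 = 406.4 us */
```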
49
Use of the LogP Model (cont.)
- Two processors sending n words to each other (i.e., an exchange): time 2o + L + (n-1)·max(g, 2o).
- P processors each sending n words to all processors (n/P to each) in a static, balanced pattern without conflicts, e.g., transpose, FFT, cyclic-to-block, block-to-cyclic.
Exercise: what's wrong with the formula above? It assumes an optimal pattern of sends and receives, so it can underestimate the time.
50
LogP "philosophy"
Think about:
- the mapping of N words onto P processors
- computation within a processor, its cost, and balance
- communication between processors, its cost, and balance
given a characterization of processor and network performance. Do not think about what happens within the network. This should be good enough!
51
A typical sort exploits the n = N/P grouping:
- significant local computation
- very general global communication / transformation
- computation of the transformation
52
Costs of Split-C (UPC predecessor) Operations
- Read, Write (x = *G, *G = x): 2(L + 2o)
- Store (*G :- x): L + 2o
- Get (x := *G): o to issue; the value arrives after 2L + 2o; sync(): o; successive operations separated by an interval g
- Bulk store (n words, several words per message): 2o + (n-1)g + L
- Exchange: 2o + 2L + (n - L/g)·max(g, 2o)
- One to many; many to one
53
LogP model parameters
- CM-5: L = 6 µs, o = 2.2 µs, g = 4 µs; P varies from 32 to 1024.
- NOW: L = 8.9 µs, o = 3.8 µs, g = 12.8 µs; P varies up to 100.
What is the processor performance? It is application-specific: tens of Mflops for these machines.
54
9/19/2015CS267, Yelick54 LogP Parameters Today
55
Local Computation Parameters (Empirical)
Parameter | Operation | µs per key | Sort
swap | Simulate cycle of butterfly per key | 0.025 lg N | Bitonic
mergesort | Sort a bitonic sequence | 1.0 | Bitonic
scatter | Move key for cyclic-to-block | 0.46 | Bitonic & Column
gather | Move key for block-to-cyclic | 0.52 if n <= 64K or P <= 64, 1.1 otherwise | Bitonic & Column
local sort | Local radix sort (11 bit) | 4.5 if n < 64K, else 9.0 - (281000/n) | Column
merge | Merge sorted lists | 1.5 | Column
copy | Shift key | 0.5 | Column
zero | Clear histogram bin | 0.2 | Radix
hist | Produce histogram | 1.2 | Radix
add | Produce scan value | 1.0 | Radix
bsum | Adjust scan of bins | 2.5 | Radix
address | Determine destination | 4.7 | Radix
compare | Compare key to splitter | 0.9 | Sample
localsort8 | Local radix sort of samples | 5.0 | Sample
56
Odd-Even Merge: a classic parallel sort
N values to be sorted; treat them as two lists of M = N/2: A_0, A_1, ..., A_{M-1} and B_0, B_1, ..., B_{M-1}.
- Sort each list separately.
- Redistribute into even and odd sublists: A_0, A_2, ..., A_{M-2} and A_1, A_3, ..., A_{M-1}, and likewise for B.
- Merge into two sorted lists: E_0, E_1, ..., E_{M-1} and O_0, O_1, ..., O_{M-1}.
- Pairwise swaps of E_i and O_i then put everything in order.
57
Where's the Parallelism?
[Diagram: merge tree over E_0 ... E_{M-1} and O_0 ... O_{M-1}: one merge of size N, two of size N/2, four of size N/4, ...]
58
Mapping to a Butterfly (or Hypercube)
[Diagram: butterfly network on two sorted sublists A0 A1 A2 A3 and B0 B1 B2 B3 (example keys 2 3 4 8 and 7 6 5 1)]
- Reverse the order of one list via the cross edges: the two sorted sublists form a bitonic sequence.
- Pairwise swaps on the way back yield the sorted output 1 2 3 4 5 6 7 8.
59
Bitonic Sort with N/P per node

all_bitonic(int A[PROCS]::[n])
  sort(tolocal(&A[ME][0]), n, 0)
  for (d = 1; d <= logProcs; d++)
    for (i = d-1; i >= 0; i--) {
      swap(A, T, n, pair(i));
      merge(A, T, n, mode(d, i));
    }
  sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa); bitonic sequences can be merged like monotonic sequences.
60
Bitonic: Breakdown P= 512, random
61
Bitonic: Effect of Key Distributions P = 64, N/P = 1 M
62
Sequential Radix Sort: Counting Sort
Idea: build a histogram of the keys and compute the position in the answer array for each element.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temporary array B and write values into position: B = [1, 1, 3, 3, 4, 4, 4, 5]
[Animation: the count array and B being filled in, one element at a time]
63
Counting Sort Pseudo Code

static void countingSort(int[] A) {
  int N = A.length;                       // A = [3, 5, 4, 1, 3, 4, 1, 4], N = 8
  int L = min(A), U = max(A);             // L = 1, U = 5
  int[] count = new int[U - L + 2];       // count = [0,0,0,0,0,0]
  for (int i = 0; i < N; i += 1)          // histogram, offset by one:
    count[A[i] - L + 1] += 1;             // count = [0,2,0,2,3,1]
  for (int j = 1; j < count.length; j++)  // prefix sums give starting positions:
    count[j] += count[j-1];               // count = [0,2,2,4,7,8]
  ...
}
64
Distribution Sort Continued

static void countingSort(int[] A) {
  ...                                     // count = [0,2,2,4,7,8]
  int[] B = new int[N];                   // B = [0, 0, 0, 0, 0, 0, 0, 0]
  for (int i = 0; i < N; i += 1) {
    B[count[A[i] - L]] = A[i];            // place A[i] at its bucket's next slot
    count[A[i] - L] += 1;                 // after A[0]=3: B = [0,0,3,0,0,0,0,0], count = [0,2,3,4,7,8]
  }                                       // ... finally B = [1, 1, 3, 3, 4, 4, 4, 5]
  for (int i = 0; i < N; i += 1)          // copy back into A
    A[i] = B[i];
}
65
Analysis of Counting Sort
What is the complexity of each step for an n-element array?
- Find min and max: Θ(n)
- Fill in the count histogram: Θ(n)
- Compute running sums of count: Θ(max - min)
- Fill in B (run over A, indexing through count): Θ(n)
- Copy B back to A: Θ(n)
So this is a Θ(n + m) algorithm, where m = max - min. Great if the range of keys isn't too large; if m < n, then Θ(n) overall.