
1 Parallel Sorting
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr10
CS267 Lecture 23, 04/13/2010

2 Some Sorting Algorithms
Choice of algorithm depends on:
- Is the data all in memory, or on disk/tape?
- Do we compare operands, or just use their values?
- Sequential or parallel? Shared or distributed memory? SIMD or not?
We will consider all data in memory and parallel:
- Bitonic Sort: naturally parallel, suitable for SIMD architectures
- Sample Sort: good generalization of Quick Sort to the parallel case
- Radix Sort
Data measured on the CM-5, written in Split-C (precursor of UPC).

3 LogP – model to predict, understand performance
[Figure: P processor/memory modules attached to a limited-volume interconnection network (at most L/g messages in flight to or from a processor), with per-message overhead o, gap g, and latency L.]
- L (latency): time to send a small message between modules.
- o (overhead): time a processor spends sending or receiving a message.
- g (gap): minimum time between successive sends or receives (1/bandwidth).
- P: number of processor/memory modules.

4 Bottom Line on CM-5 using Split-C (Preview)
[Figure: us/key vs. N/P (16384 to 524288) for Bitonic, Column, Radix, and Sample sort at 32 and 1024 processors; vertical axis 0 to 140 us/key.]
- Column sort requires n > p^3, so it is restrictive; it is never uniquely the best, so we discuss the others.
- Good fit between predicted (using the LogP model) and measured times (within about 10%).
- No single algorithm is always best; scaling depends on processor count, input size, and sensitivity to key distribution.
- All are global / local hybrids; the local part is hard to implement and model.

5 Bitonic Sort (1/2)
A bitonic sequence is one that is:
1. monotonically increasing and then monotonically decreasing,
2. or can be circularly shifted to satisfy 1.
A half-cleaner takes a bitonic sequence and produces:
- a first half whose elements are all smaller than the smallest element of the second half,
- two halves that are each bitonic.
Apply recursively to each half to complete the sort.
Where do we get a bitonic sequence to start with?

6 Bitonic Sort (2/2)
A bitonic sequence is one that is:
1. monotonically increasing and then monotonically decreasing,
2. or can be circularly shifted to satisfy this.
Any sequence of length 2 is bitonic, so we can sort it using the bitonic sorter on the last slide:
- sort all length-2 sequences in alternating increasing/decreasing order,
- two such length-2 sequences form a length-4 bitonic sequence.
Recursively:
- sort length-2^k sequences in alternating increasing/decreasing order,
- two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted).
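As an illustration of the recursion just described, here is a minimal sequential C sketch (not the lecture's Split-C code); it assumes the input length is a power of two, and the function names are ours:

```c
#include <stdio.h>

/* Compare-exchange: order a[i], a[j] according to direction dir (1 = ascending). */
static void cmp_exchange(int a[], int i, int j, int dir) {
    if (dir == (a[i] > a[j])) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Half-cleaner plus recursion: sorts a bitonic sequence a[lo..lo+n-1]. */
static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++) cmp_exchange(a, i, i + m, dir);
        bitonic_merge(a, lo, m, dir);
        bitonic_merge(a, lo + m, m, dir);
    }
}

/* Build bitonic sequences of doubling length, then merge, as on this slide. */
static void bitonic_sort(int a[], int lo, int n, int dir) {
    if (n > 1) {
        int m = n / 2;
        bitonic_sort(a, lo, m, 1);        /* ascending half  */
        bitonic_sort(a, lo + m, m, 0);    /* descending half */
        bitonic_merge(a, lo, n, dir);     /* a[lo..lo+n-1] is now bitonic */
    }
}

int main(void) {
    int a[8] = {3, 7, 4, 8, 6, 2, 1, 5};
    bitonic_sort(a, 0, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}
```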

7 Bitonic Sort for n=16 – all dependencies shown
[Figure: butterfly-style dependency diagram for a 16-element bitonic sort, block layout across processors.]
- Similar pattern to the FFT, so similar optimizations are possible.
- The first lg(N/P) stages are a local sort: use the best local sort.
- The remaining stages involve block-to-cyclic remaps and local merges (i - lg(N/P) columns), then cyclic-to-block remaps and local merges (lg(N/P) columns within a stage).

8 Bitonic Sort: time per key
[Figure: predicted (LogP model) and measured us/key vs. N/P (16384 to 524288) for 32 to 512 processors; vertical axis 0 to 80 us/key.]

9 Sample Sort
1. Compute P-1 splitter values that divide the input into roughly equal pieces:
   - take S ~ 64 samples per processor,
   - sort the P*S sampled keys (on processor 0),
   - let keys S, 2*S, ..., (P-1)*S of the sorted sample be the "Splitters",
   - broadcast the Splitters.
2. Distribute keys based on the Splitters: every key with Splitter(i-1) < key <= Splitter(i) is sent to processor i.
3. Local sort of the keys on each processor.
[4.] Possibly re-shift, so each processor ends up with N/P keys.
If the samples represent the total population, the Splitters should divide it into P roughly equal pieces.
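A small sequential C sketch of steps 1 and 2 (splitter selection and key routing), simulating P "processors" in one process; P, S, the per-processor key count, and all names are illustrative choices, not values from the CM-5 experiments:

```c
#include <stdio.h>
#include <stdlib.h>

#define P 4          /* number of processors        */
#define S 8          /* samples taken per processor */
#define N_PER_P 64   /* keys per processor          */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int keys[P][N_PER_P], samples[P * S], splitters[P - 1];
    srand(1);
    for (int p = 0; p < P; p++)
        for (int i = 0; i < N_PER_P; i++) keys[p][i] = rand() % 1000;

    /* Step 1: each proc contributes S samples; "proc 0" sorts them and
       picks every S-th key as a splitter, then (conceptually) broadcasts. */
    for (int p = 0; p < P; p++)
        for (int s = 0; s < S; s++) samples[p * S + s] = keys[p][rand() % N_PER_P];
    qsort(samples, P * S, sizeof(int), cmp_int);
    for (int i = 1; i < P; i++) splitters[i - 1] = samples[i * S];

    /* Step 2: route each key to proc d with splitter[d-1] < key <= splitter[d]. */
    int dest_count[P] = {0};
    for (int p = 0; p < P; p++)
        for (int i = 0; i < N_PER_P; i++) {
            int d = 0;
            while (d < P - 1 && keys[p][i] > splitters[d]) d++;
            dest_count[d]++;
        }

    /* Step 3 would be a local sort on each destination processor. */
    for (int d = 0; d < P; d++) printf("proc %d receives %d keys\n", d, dest_count[d]);
    return 0;
}
```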

10 Sample Sort: Times
[Figure: predicted and measured us/key vs. N/P (16384 to 524288) for 32 to 512 processors; vertical axis 0 to 30 us/key.]

11 Sample Sort Timing Breakdown
[Figure: predicted and measured (-m) us/key vs. N/P (16384 to 524288), broken down into the Split, Sort, and Dist phases.]

12 Sequential Radix Sort: Counting Sort
Idea: build a histogram of the keys and compute each element's position in the answer array.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temp array B and write the values into position:
B = [1, 1, 3, 3, 4, 4, 4, 5]
Cost = O(#keys + size of histogram).
What if the histogram is too large (e.g., all 32-bit ints? all words?)
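A runnable C sketch of the counting sort described above, using the slide's example array; MAX_KEY is an assumed bound on the key values:

```c
#include <stdio.h>
#include <string.h>

#define MAX_KEY 8   /* assumed upper bound on key values */

void counting_sort(const int *a, int *b, int n) {
    int count[MAX_KEY + 1];
    memset(count, 0, sizeof(count));
    for (int i = 0; i < n; i++) count[a[i]]++;           /* histogram               */
    int pos = 0;
    for (int k = 0; k <= MAX_KEY; k++) {                 /* prefix sum: first output */
        int c = count[k];                                /* position of each key     */
        count[k] = pos;
        pos += c;
    }
    for (int i = 0; i < n; i++) b[count[a[i]]++] = a[i]; /* stable scatter           */
}

int main(void) {
    int a[] = {3, 5, 4, 1, 3, 4, 1, 4}, b[8];
    counting_sort(a, b, 8);
    for (int i = 0; i < 8; i++) printf("%d ", b[i]);     /* 1 1 3 3 4 4 4 5 */
    printf("\n");
    return 0;
}
```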

13 Radix Sort: Separate Key Into Parts
Divide keys into parts, e.g., by digits (radix), and use counting sort on each part, starting with the least significant.
Example: sorting the words sat, run, pin, saw, tip: sort on the 3rd character, then the 2nd character, then the 1st character.
Cost = O(#keys * #characters).
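A C sketch of least-significant-digit radix sort on r-bit digits, with one counting-sort pass per digit; the choice r = 8 and the sample keys are ours, not the lecture's:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define R 8                  /* bits sorted per pass  */
#define BINS (1u << R)       /* 2^r histogram bins    */

void radix_sort(unsigned *a, int n) {
    unsigned *tmp = malloc(n * sizeof *a);
    for (int shift = 0; shift < 32; shift += R) {
        unsigned count[BINS] = {0};
        for (int i = 0; i < n; i++) count[(a[i] >> shift) & (BINS - 1)]++;
        unsigned pos = 0;
        for (unsigned k = 0; k < BINS; k++) {            /* positions from histogram */
            unsigned c = count[k]; count[k] = pos; pos += c;
        }
        for (int i = 0; i < n; i++)                      /* stable scatter           */
            tmp[count[(a[i] >> shift) & (BINS - 1)]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

int main(void) {
    unsigned a[] = {90235, 7, 4096, 888, 7, 123456789, 42};
    radix_sort(a, 7);
    for (int i = 0; i < 7; i++) printf("%u ", a[i]);
    printf("\n");
    return 0;
}
```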

14 Histo-radix sort
Per pass (P processors, n = N/P keys each, r-bit digits):
1. compute a local histogram of the r-bit keys into 2^r bins,
2. compute the position of the first member of each bucket in the global array: 2^r scans with end-around,
3. distribute all the keys.
Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers.
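A small C sketch of step 2's offset computation: given each processor's local histogram, compute where processor p's first key of bucket b lands in the globally sorted order. P, the number of bins, and the example histograms are made up for illustration:

```c
#include <stdio.h>

#define P 4
#define BINS 4

int main(void) {
    /* hist[p][b] = processor p's local count for bucket b */
    int hist[P][BINS] = {
        {3, 1, 0, 2}, {1, 2, 2, 1}, {0, 4, 1, 1}, {2, 0, 3, 1}
    };
    int offset[P][BINS];

    /* Keys in bucket b go after all keys of smaller buckets, and after
       bucket-b keys held by lower-numbered processors: scan bucket-major. */
    int base = 0;
    for (int b = 0; b < BINS; b++)
        for (int p = 0; p < P; p++) {
            offset[p][b] = base;
            base += hist[p][b];
        }

    for (int p = 0; p < P; p++) {
        for (int b = 0; b < BINS; b++) printf("%4d", offset[p][b]);
        printf("\n");
    }
    return 0;
}
```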

15 Histo-Radix Sort (again)
[Figure: local data and local histograms across P processors.]
Each pass:
1. form local histograms,
2. form the global histogram,
3. globally distribute the data.

16 Radix Sort: Times
[Figure: predicted and measured us/key vs. N/P (16384 to 524288) for 32 to 512 processors; vertical axis 0 to 140 us/key.]

17 Radix Sort: Timing Breakdown
[Figure: predicted and measured (-m) times; vertical axis is microseconds per key.]

18 Local Sort Performance on CM-5
Entropy = -sum_i p_i log2 p_i, where p_i is the probability of key i; it ranges from 0 to log2(#different keys).
[Figure: us/key vs. log(N/P) for an 11-bit radix sort of 32-bit numbers, one curve per key-value entropy (31, 25.1, 16.9, 10.4, 6.2 bits); the jump in cost is attributed to TLB misses.]

19 Radix Sort: Timing dependence on Key distribution
Entropy = -sum_i p_i log2 p_i, where p_i is the probability of key i; it ranges from 0 to log2(#different keys).
Vertical axis of the plot: microseconds per key.
A cyclic distribution is the worst case: P0 gets 0, P1 gets 1, etc. The slowdown is due to contention in the redistribution.

20 Bottom Line on CM-5 using Split-C
[Figure: us/key vs. N/P (16384 to 524288) for Bitonic, Column, Radix, and Sample sort at 32 and 1024 processors; vertical axis 0 to 140 us/key.]
- Column sort requires n > p^3, so it is restrictive; it is never uniquely the best, so we discuss the others.
- Good fit between predicted (using the LogP model) and measured times (within about 10%).
- No single algorithm is always best; scaling depends on processor count, input size, and sensitivity to key distribution.
- All are global / local hybrids; the local part is hard to implement and model.

21 Sorting Conclusions
- The distributed memory model leads to a hybrid global / local algorithm: use the best local algorithm combined with the global part.
- The LogP model is good enough to model the global part: bandwidth (g) or overhead (o) matter most, including end-point contention; latency (L) only matters when bandwidth doesn't.
- Modeling local computational performance is harder: it is dominated by effects of the storage hierarchy (e.g., TLBs) and depends on key entropy.
- See the references on disk-to-disk parallel sorting.

22 Extra slides

23 Radix: Stream Broadcast Problem
- Processor 0 does only sends; the others receive and then send.
- Receives are prioritized over sends, so processor 0 needs to be delayed.
- n receives: (P-1)(2o + L + (n-1)g)?
- Need to slow the first processor so the stream pipelines well.

24 What's the right communication mechanism?
- Permutation via writes: consistency model? false sharing?
- Reads?
- Bulk transfers? what do you need to change in the algorithm?
- Network scheduling?

25 Comparison
[Figure: us/key vs. N/P (16384 to 524288) for Bitonic, Column, Radix, and Sample sort at 32 and 1024 processors.]
- Good fit between predicted and measured (within about 10%).
- Different sorts for different sorts: scaling depends on processor count, input size, and sensitivity.
- All are global / local hybrids; the local part is hard to implement and model.

26 Outline
- Some performance laws
- Performance analysis
- Performance modeling
- Parallel sorting: combining models with measurements
Reading:
- Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text), abbreviated DBPP in this lecture
- David Bailey's "Twelve Ways to Fool the Masses"

27 Measuring Performance
The performance criterion may vary with the domain:
- There may be limits on acceptable running time, e.g., a climate model must run 1000x faster than real time.
- Any performance improvement may be acceptable, e.g., faster on 4 cores than on 1.
- Throughput may be more critical than latency, e.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.
- Execution rate per unit cost or power, e.g., GFlop/s, GFlop/s/$, or GFlop/s/Watt.
- Parallel scalability (speedup or parallel efficiency).
- Percent of best possible performance (some kind of peak).

28 Amdahl's Law (review)
Suppose only part of an application is parallelizable:
- let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable,
- let P be the number of processors.
Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.

29 Amdahl’s Law (for 1024 processors)
Does this mean parallel computing is a hopeless enterprise? Source: Gustafson, Montry, Benner

30 Scaled Speedup
See: Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024 Processor Hypercube", SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, p. 609.

31 Scaled Speedup (background)

32 Little's Law
Latency vs. bandwidth:
- Latency is physics (wire length); e.g., the network latency on the Earth Simulator is only about 2x the speed-of-light delay across the machine room.
- Bandwidth is cost: add more cables to increase bandwidth (an over-simplification).
Little's Law: for a production system in steady state, Inventory = Throughput x Flow Time.
For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency:
Required concurrency = bandwidth x latency (the bandwidth-delay product).

33 Little's Law Examples
Example 1: a single processor.
- If the latency to memory is 50 ns and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line),
- the system must support 50/12.8 ~= 4 outstanding cache-line misses to keep things balanced (run at bandwidth speed),
- so an application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them).
Example 2: a 1000-processor system.
- 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory.
- Main memory bandwidth is ~ 1000 x 100 words x 10^9/s = 10^14 words/sec.
- To achieve full performance, an application needs ~ 10^-7 x 10^14 = 10^7-way concurrency (some of which may be hidden in the instruction stream).
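Both calculations, repeated as a tiny C program using the numbers from the slide:

```c
#include <stdio.h>

int main(void) {
    /* Example 1: 50 ns memory latency, 5 GB/s bandwidth, 64-byte lines. */
    double latency_ns = 50.0;
    double ns_per_line = 64.0 / 5.0;                 /* 12.8 ns per cache line */
    printf("outstanding cache lines needed: %.1f\n", latency_ns / ns_per_line);

    /* Example 2: 100 ns latency, ~1e14 words/s aggregate memory bandwidth. */
    double bw_words_per_s = 1000.0 * 100.0 * 1e9;    /* P x words x clock rate */
    printf("required concurrency: %.0e\n", 100e-9 * bw_words_per_s);
    return 0;
}
```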

34 In Performance Analysis: Use More Data
Whenever possible, use a large set of data rather than one or a few isolated points; a single point has little information.
E.g., from DBPP: a serial algorithm scales as N + N^2, and we observe a speedup of 10.8 on 12 processors with problem size N = 100. Three models fit that single point:
- Case 1: T = N + N^2/P
- Case 2: T = (N + N^2)/P + 100
- Case 3: T = (N + N^2)/P + 0.6*P^2
All have speedup ~10.8 on 12 procs, but performance graphs (N = 100, 1000) show large differences in scaling.
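A short C sketch that evaluates the three models: it reproduces the roughly 10.8x speedup at P = 12 and shows how differently they scale beyond that single point:

```c
#include <stdio.h>

static double t1(double n, double p) { return n + n * n / p; }
static double t2(double n, double p) { return (n + n * n) / p + 100.0; }
static double t3(double n, double p) { return (n + n * n) / p + 0.6 * p * p; }

int main(void) {
    double n = 100.0;
    int procs[] = {1, 12, 100, 1000};
    for (int i = 0; i < 4; i++) {
        double p = procs[i];
        /* Speedup = T(1) / T(P) for each model. */
        printf("P=%4d  case1 %8.2fx  case2 %8.2fx  case3 %8.2fx\n",
               procs[i], t1(n, 1) / t1(n, p), t2(n, 1) / t2(n, p), t3(n, 1) / t3(n, p));
    }
    return 0;
}
```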

35 Example: Immersed Boundary Simulation
- A 64^3 simulation was possible on a Cray YMP, but 128^3 is required for an accurate model (it would have taken 3 years); it was later done on a Cray C90 (faster, with 100x more memory).
- Until recently, limited to vector machines.
- Now using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC.
- How useful is this data? What are ways to make it more useful/interesting?
Joint work with Ed Givelberg, Armando Solar-Lezama.

36 Performance Analysis

37 Building a Performance Model
Based on measurements/scaling of components:
- FFT time is 5*n*log2(n) flops divided by the flop rate measured for the FFT.
- Other costs are linear in either material or fluid points.
- Measure the constants: flops per point (independent of machine or problem size) and flops/sec (measured per machine, per phase); time is then a * b * #points.
- Communication is modeled similarly: find a formula for message size as a function of problem size, check the formula using tracing of some kind, and use an a/b (latency/bandwidth) model to predict running time: a + b * size.
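A toy C sketch of the a/b prediction mentioned above; the latency and bandwidth constants here are placeholders for illustration, not measurements from the simulation:

```c
#include <stdio.h>

int main(void) {
    double a = 10e-6;       /* assumed per-message latency, seconds        */
    double b = 1.0 / 1e8;   /* assumed seconds per byte (i.e., 100 MB/s)   */

    /* Predicted time for a message of `size` bytes is a + b*size. */
    for (long size = 64; size <= 1L << 20; size *= 16)
        printf("%8ld bytes -> predicted %.6f s\n", size, a + b * (double)size);
    return 0;
}
```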

38 A Performance Model
512^3 in < 1 second per timestep is not possible; primarily limited by bisection bandwidth.

39 Model Success and Failure

40 OSKI SpMV: What We Expect
Assume Cost(SpMV) = time to read the matrix; 1 double word = 2 integers; r, c in {1, 2, 4, 8}.
- CSR: 1 integer index per non-zero.
- BCSR(r x c): 1 integer index per r*c non-zeros.
As r*c increases, the speedup should increase smoothly and approach 1.5 (eliminating all index overhead).
Assuming the cost of SpMV is dominated by the time just to read the matrix entries (i.e., ignore the vectors, or equivalently, assume they "fit" in registers), and that the size of one integer index equals half the size of a double-precision word, we might reasonably expect performance to increase smoothly as the block size r x c increases, and that the best speedup is approximately 1.5.
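The expected-speedup arithmetic from this slide, written out as a small C program; it ignores fill-in from padded blocks, which is one reason measured results on the next slide can differ:

```c
#include <stdio.h>

int main(void) {
    /* CSR reads 1 double word + 0.5 word of index per non-zero = 1.5 words.
       BCSR(r x c) reads 1 + 0.5/(r*c) words per non-zero, so the modeled
       speedup is 1.5 / (1 + 0.5/(r*c)), approaching 1.5 as r*c grows. */
    int sizes[] = {1, 2, 4, 8};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int r = sizes[i], c = sizes[j];
            double words_per_nz = 1.0 + 0.5 / (double)(r * c);
            printf("BCSR %dx%d: expected speedup %.3f\n", r, c, 1.5 / words_per_nz);
        }
    return 0;
}
```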

41 What We Get (The Need for Search)
[Figure: measured Mflop/s for all 16 block sizes r x c dividing 8x8; best is 4x2; the reference (unblocked CSR) Mflop/s is marked.]
Consider an experiment in which we implement SpMV using BCSR format for the matrix shown on the previous slide at all block sizes that divide 8x8, 16 implementations in all. These implementations fully unroll the innermost loop and use scalar replacement for the source and destination vectors. You might reasonably expect performance to increase relatively smoothly as r and c increase, but this is clearly not the case!
Platform: 900 MHz Itanium-2, 3.6 Gflop/s peak, Intel v8.0 compiler.
Good speedups (4x) but at an unexpected block size (4x2).
Figure taken from Im, Yelick, Vuduc, IJHPCA 2005.

42 Using Multiple Models

43 Multiple Models

44 Multiple Models

45 Extended Example: Using Performance Modeling (LogP) to Explain Data – Application to Sorting

46 Deriving the LogP Model
Processing: powerful microprocessor, large DRAM, cache => P
Communication:
- significant latency (100s to 1000s of cycles) => L
- limited bandwidth (1 to 5% of memory bandwidth) => g
- significant overhead (10s to 100s of cycles), on both ends => o
- no consensus on topology => should not exploit structure
- limited network capacity
- no consensus on programming model => should not enforce one

47 LogP
[Figure: P processor/memory modules attached to a limited-volume interconnection network (at most L/g messages in flight to or from a processor), with per-message overhead o, gap g, and latency L.]
- L (latency): time to send a small message between modules.
- o (overhead): time a processor spends sending or receiving a message.
- g (gap): minimum time between successive sends or receives (1/bandwidth).
- P: number of processor/memory modules.

48 Using the LogP Model
[Figure: timeline showing o, L, and g for a stream of messages between two processors.]
- Send n messages from proc to proc in time 2o + L + g(n-1); each processor does o*n cycles of overhead and has (g-o)(n-1) + L available compute cycles.
- Send n total messages from one processor to many in the same time.
- Send n messages from many processors to one: all but L/g processors block, so fewer compute cycles are available unless the sends are scheduled carefully.
- g > o by construction.
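A small C evaluation of the point-to-point formula above, using the CM-5 LogP parameters quoted later in these slides (L = 6 us, o = 2.2 us, g = 4 us):

```c
#include <stdio.h>

int main(void) {
    double L = 6.0, o = 2.2, g = 4.0;   /* CM-5 parameters, microseconds */
    for (int n = 1; n <= 64; n *= 4) {
        /* Time to send n back-to-back messages: 2o + L + g*(n-1).        */
        double t = 2 * o + L + g * (n - 1);
        printf("n=%2d  total %6.1f us  sender overhead %6.1f us  slack %6.1f us\n",
               n, t, o * n, (g - o) * (n - 1) + L);
    }
    return 0;
}
```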

49 Use of the LogP Model (cont)
- Two processors sending n words to each other (i.e., an exchange): 2o + L + max(g, 2o)(n-1) <= n*max(g, 2o) + L. Compare the shaded part of the gap to o: does o fit inside?
- P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, FFT, cyclic-to-block, block-to-cyclic. Exercise: what's wrong with the formula above?
- This assumes an optimal pattern of sends/receives, so it could underestimate the time.

50 LogP "philosophy"
Think about:
- the mapping of N words onto P processors,
- computation within a processor, its cost, and balance,
- communication between processors, its cost, and balance,
given a characterization of processor and network performance.
Do not think about what happens within the network. This should be good enough!

51 Typical Sort
Exploits the n = N/P grouping:
- significant local computation,
- very general global communication / transformation,
- computation of the transformation.

52 Costs of Split-C (UPC predecessor) Operations
- Read, Write (x = *G, *G = x): 2(L + 2o)
- Store (*G := x): L + 2o
- Get (x := *G): o, then ... sync(): 2L + 2o; successive operations with interval g
- Bulk store (n words, w words per message): 2o + (n-1)g + L
- Exchange: 2o + 2L + (ceil(n/w) - L/g) max(g, 2o)
- One to many, Many to one: (as analyzed on the previous slides)

53 LogP model parameters
CM-5: L = 6 us, o = 2.2 us, g = 4 us, P varies from 32 to 1024.
NOW: L = 8.9 us, o = 3.8 us, g = 12.8 us, P varies up to 100.
What is the processor performance? Application-specific; 10s of Mflops for these machines.

54 LogP Parameters Today

55 Local Computation Parameters - Empirical
Parameter (operation): us per key
- swap (simulate cycle butterfly per key): ~ lg N
- Bitonic mergesort (sort bitonic sequence): 1.0
- scatter (move key for cyclic-to-block): 0.46
- gather (move key for block-to-cyclic): 0.52 if n <= 64K or P <= 64 (Bitonic & Column), 1.1 otherwise
- local sort (local radix sort, 11-bit): 4.5 if n < 64K, 9.0 - (281000/n) otherwise
- merge (merge sorted lists): 1.5
- Column copy (shift key): 0.5
- zero (clear histogram bin): 0.2
- Radix hist (produce histogram): 1.2
- add (produce scan value): 1.0
- bsum (adjust scan of bins): 2.5
- address (determine destination): 4.7
- compare (compare key to splitter): 0.9
- Sample localsort8 (local radix sort of samples): 5.0

56 Odd-Even Merge - classic parallel sort
- N values to be sorted; treat as two lists of M = N/2 and sort each separately:
  A0, A1, A2, ..., A(M-1) and B0, B1, B2, ..., B(M-1)
- Redistribute each into even and odd sublists:
  A0, A2, ..., A(M-2) / A1, A3, ..., A(M-1) and B0, B2, ..., B(M-2) / B1, B3, ..., B(M-1)
- Merge into two sorted lists:
  E0, E1, E2, ..., E(M-1) and O0, O1, O2, ..., O(M-1)
- Pairwise swaps of Ei and Oi will put it in order.

57 Where's the Parallelism?
[Figure: recursion tree for the merge: one 1xN problem splits into 2 merges of N/2, then 4 of N/4, producing the E and O lists, followed by a final 1xN pass of pairwise swaps.]

58 Mapping to a Butterfly (or Hypercube)
Two sorted sublists; reverse the order of one list via the cross edges, then do pairwise swaps on the way back:
A0 A1 A2 A3 | B3 B2 B1 B0
2 3 4 8 | 7 6 5 1
2 3 4 1 | 7 6 5 8
2 1 4 3 | 5 6 7 8
1 2 3 4 | 5 6 7 8

59 Bitonic Sort with N/P per node
A bitonic sequence decreases and then increases (or vice versa); bitonic sequences can be merged like monotonic sequences.

all_bitonic(int A[PROCS]::[n])
  sort(tolocal(&A[ME][0]), n, 0)
  for (d = 1; d <= logProcs; d++)
    for (i = d-1; i >= 0; i--) {
      swap(A, T, n, pair(i));
      merge(A, T, n, mode(d, i));
    }
  sort(tolocal(&A[ME][0]), n, mask(i));

60 Bitonic: Breakdown
[Figure: timing breakdown for P = 512, random keys.]

61 Bitonic: Effect of Key Distributions
[Figure: P = 64, N/P = 1M keys.]

62 Sequential Radix Sort: Counting Sort
Idea: build a histogram of the keys and compute each element's position in the answer array.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temp array B and write the values into position.
[Figure: step-by-step animation of placing each key into B.]

