Parallel Sorting. James Demmel, CS267 Lecture 24, 04/27/2009. www.cs.berkeley.edu/~demmel/cs267_Spr09


Some Sorting Algorithms

Choice of algorithm depends on:
- Is the data all in memory, or on disk/tape?
- Do we compare operands, or just use their values?
- Sequential or parallel? Shared or distributed memory? SIMD or not?

We will consider all data in memory, and parallel:
- Bitonic Sort: naturally parallel, suitable for SIMD architectures
- Sample Sort: good generalization of Quick Sort to the parallel case
- Radix Sort

Data measured on the CM-5, written in Split-C (a precursor of UPC).

LogP: a model to predict and understand performance

[Diagram: P processor/memory modules connected by an interconnection network; limited volume (at most L/g messages to or from a processor)]

- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/bandwidth)
- P: number of processor/memory modules

Bottom Line on CM-5 using Split-C (Preview)

[Plot: time per key for Bitonic, Column, Radix, and Sample sort, each on 32 and 1024 processors, by algorithm and #procs]

- Good fit between predicted (using the LogP model) and measured (within 10%)
- No single algorithm is always best: scaling varies with processor count, input size, and sensitivity to key distribution
- All are global/local hybrids; the local part is hard to implement and model

Bitonic Sort (1/2)

A bitonic sequence is one that is:
1. Monotonically increasing and then monotonically decreasing,
2. Or can be circularly shifted to satisfy 1.

A half-cleaner takes a bitonic sequence and produces one where:
1. Every element of the first half is at most the smallest element of the 2nd half,
2. Both halves are bitonic.

Apply recursively to each half to complete sorting. Where do we get a bitonic sequence to start with?

Bitonic Sort (2/2)

A bitonic sequence is one that is:
1. Monotonically increasing and then monotonically decreasing,
2. Or can be circularly shifted to satisfy 1.

Any sequence of length 2 is bitonic, so we can sort it using the bitonic sorter on the last slide.
- Sort all length-2 sequences in alternating increasing/decreasing order
- Two such length-2 sequences form a length-4 bitonic sequence
Recursively:
- Sort length-2^k sequences in alternating increasing/decreasing order
- Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted)
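The recursion above can be sketched in a few lines. This is a minimal sequential Python sketch of the logic, not the Split-C code measured on the CM-5; it assumes the input length is a power of two, and the helper names are illustrative.

```python
def bitonic_merge(a, ascending):
    """Half-cleaner applied recursively: sorts a bitonic sequence."""
    n = len(a)
    if n <= 1:
        return a
    half = n // 2
    for i in range(half):  # half-cleaner: compare-exchange across the two halves
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

def bitonic_sort(a, ascending=True):
    """Sort halves in opposite orders to form a bitonic sequence, then merge."""
    n = len(a)
    if n <= 1:
        return a
    first = bitonic_sort(a[:n // 2], True)     # increasing half
    second = bitonic_sort(a[n // 2:], False)   # decreasing half
    return bitonic_merge(first + second, ascending)
```

Every compare-exchange position is fixed in advance, which is why the same network runs unchanged on SIMD hardware.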

[Figure: Bitonic sort network for n = 16 with all dependencies shown, in block layout. Similar pattern as the FFT: similar optimizations possible.]

Bitonic Sort: time per key

[Plots: predicted (using the LogP model) and measured µs/key vs. N/P, by #procs]

Sample Sort

1. Compute P-1 splitter values that divide the input into roughly equal pieces:
   - take S ~ 64 samples per processor
   - sort the P*S sampled keys (on processor 0)
   - let keys S, 2*S, ..., (P-1)*S of the sorted sample be the "Splitters"
   - broadcast the Splitters
2. Distribute keys based on the Splitters:
   keys with Splitter(i-1) < key <= Splitter(i) are all sent to proc i
3. Local sort of the keys on each proc
[4.] Possibly redistribute, so each proc has N/P keys

If the samples represent the total population, the Splitters should divide it into P roughly equal pieces.
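The four steps above can be simulated sequentially. This is a hedged Python sketch of the splitter/distribute/local-sort logic on one machine, with "processors" modeled as buckets; all names and the bucket representation are illustrative, not the measured Split-C implementation.

```python
import bisect
import random

def sample_sort(keys, P, S=64):
    """Simulate sample sort: P-1 splitters from ~P*S samples, bucket, sort."""
    # Step 1: sample, sort the samples, pick every S-th key as a splitter.
    samples = sorted(random.sample(keys, min(P * S, len(keys))))
    step = len(samples) // P
    splitters = [samples[i * step] for i in range(1, P)]  # keys S, 2S, ..., (P-1)S
    # Step 2: distribute; key <= Splitter(i) and > Splitter(i-1) goes to "proc" i.
    buckets = [[] for _ in range(P)]
    for k in keys:
        buckets[bisect.bisect_left(splitters, k)].append(k)
    # Step 3: local sort on each "processor".
    return [sorted(b) for b in buckets]
```

Concatenating the buckets in processor order yields the globally sorted sequence, since the buckets are range-partitioned.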

Sample Sort: Times

[Plots: predicted and measured µs/key vs. N/P, by #processors]

Sample Sort: Timing Breakdown

[Plot: predicted and measured (-m) times]

Sequential Radix Sort: Counting Sort

Idea: build a histogram of the keys and compute the position in the answer array for each element.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temp array B, and write values into position:
B = [1, 1, 3, 3, 4, 4, 4, 5]
Cost = O(#keys + size of histogram)
What if the histogram is too large (e.g., all 32-bit ints? All words?)

Radix Sort: Separate Key Into Parts

Divide keys into parts, e.g., by digits (radix), and use counting sort on each part, starting with the least significant.

Example (a stable counting sort on each character):
  start:            sat run pin saw tip
  sort on 3rd char: run pin tip sat saw
  sort on 2nd char: sat saw pin tip run
  sort on 1st char: pin run sat saw tip

Cost = O(#keys * #characters)
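The pass structure above is easy to write down. Here is a small Python sketch of least-significant-digit radix sort for equal-length strings; the bucketing stands in for the counting sort of the previous slide, and the function name is illustrative.

```python
def lsd_string_sort(words):
    """LSD radix sort: a stable sort on each character, last position first."""
    if not words:
        return words
    width = len(words[0])
    for pos in reversed(range(width)):      # 3rd char, then 2nd, then 1st
        buckets = {}
        for word in words:                  # stable: preserves earlier order
            buckets.setdefault(word[pos], []).append(word)
        words = [w for c in sorted(buckets) for w in buckets[c]]
    return words
```

Stability of each per-character pass is what makes the earlier (less significant) passes survive the later ones.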

Histo-Radix Sort

n = N/P keys per processor. Per pass:
1. Compute a local histogram: r-bit keys, 2^r bins
2. Compute the position of the first member of each bucket in the global array: 2^r scans with end-around
3. Distribute all the keys

Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers.

Histo-Radix Sort (again)

Each pass, over the local data on each of the P processors:
1. Form local histograms
2. Form the global histogram
3. Globally distribute the data
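One pass of the scheme above can be simulated sequentially. This Python sketch models the P processors as lists and follows the three numbered steps as comments; the names and data layout are illustrative, not the CM-5 code.

```python
def histo_radix_pass(local_keys, shift, r=4):
    """One histo-radix pass over P simulated processors' local keys."""
    P = len(local_keys)
    bins = 1 << r
    # 1. local histograms (r-bit digit, 2^r bins)
    hist = [[0] * bins for _ in range(P)]
    for p in range(P):
        for k in local_keys[p]:
            hist[p][(k >> shift) & (bins - 1)] += 1
    # 2. position of the first member of each bucket in the global array:
    #    scan over buckets, and over processors within a bucket (keeps stability)
    pos = 0
    start = [[0] * bins for _ in range(P)]
    for b in range(bins):
        for p in range(P):
            start[p][b] = pos
            pos += hist[p][b]
    # 3. distribute all the keys to their global positions
    out = [0] * pos
    for p in range(P):
        offset = list(start[p])
        for k in local_keys[p]:
            b = (k >> shift) & (bins - 1)
            out[offset[b]] = k
            offset[b] += 1
    return out
```

With r = 4 and keys below 16, a single pass fully sorts the input, which makes the pass easy to check.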

Radix Sort: Times

[Plots: predicted and measured µs/key vs. N/P, by #procs]

Radix Sort: Timing Breakdown

[Plot: predicted and measured (-m) times]

Local Sort Performance on CM-5 (11-bit radix sort of 32-bit numbers)

[Plot: µs/key vs. entropy of the key distribution]

Entropy = -Σ_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys).
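The entropy formula above is straightforward to compute for a given key set. A small Python helper, with an illustrative name:

```python
from collections import Counter
from math import log2

def key_entropy(keys):
    """Entropy = -sum_i p_i log2 p_i, with p_i = probability of key i."""
    n = len(keys)
    return -sum((c / n) * log2(c / n) for c in Counter(keys).values())
```

A constant key stream has entropy 0; k equally likely keys give log2(k), the upper bound quoted on the slide.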

Radix Sort: Timing Dependence on Key Distribution

[Plot: slowdown due to contention in redistribution, vs. entropy]

Entropy = -Σ_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys).

Bottom Line on CM-5 using Split-C

[Plot: time per key for Bitonic, Column, Radix, and Sample sort, each on 32 and 1024 processors, by algorithm and #procs]

- Good fit between predicted (using the LogP model) and measured (within 10%)
- No single algorithm is always best: scaling varies with processor count, input size, and sensitivity to key distribution
- All are global/local hybrids; the local part is hard to implement and model

Sorting Conclusions

- The distributed memory model leads to a hybrid global/local algorithm: use the best local algorithm combined with a global part
- The LogP model is good enough to model the global part
  - bandwidth (g) or overhead (o) matter most, including end-point contention
  - latency (L) only matters when bandwidth doesn't
- Modeling local computational performance is harder
  - dominated by effects of the storage hierarchy (e.g., TLBs); depends on entropy
- See also disk-to-disk parallel sorting

EXTRA SLIDES

Radix: Stream Broadcast Problem

Broadcast n words down a line of P processors: is the time (P-1)(2o + L + (n-1)g)?

- Need to slow the first processor to pipeline well
- Processor 0 does only sends; the others receive, then send
- Receives are prioritized over sends, so processor 0 needs to be delayed

What's the right communication mechanism?

- Permutation via writes
  - consistency model? false sharing?
- Reads?
- Bulk transfers?
  - what do you need to change in the algorithm?
- Network scheduling?

Comparison

[Plot: time per key for Bitonic, Column, Radix, and Sample sort, each on 32 and 1024 processors]

- Good fit between predicted and measured (within 10%)
- Different sorts for different sorts: scaling varies with processor count, input size, and sensitivity
- All are global/local hybrids; the local part is hard to implement and model

Outline

- Some performance laws
- Performance analysis
- Performance modeling
- Parallel sorting: combining models with measurements

Reading:
- Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text), abbreviated as DBPP in this lecture
- David Bailey's "Twelve Ways to Fool the Masses"

Measuring Performance

The performance criterion may vary with the domain:
- There may be limits on acceptable running time, e.g., a climate model must run 1000x faster than real time.
- Any performance improvement may be acceptable, e.g., faster on 4 cores than on 1.
- Throughput may be more critical than latency, e.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.
- Execution time per unit cost, e.g., GFlop/s, GFlop/s/$, or GFlop/s/Watt.
- Parallel scalability (speedup or parallel efficiency).
- Percent relative to best possible (some kind of peak).

Amdahl's Law (review)

Suppose only part of an application is parallel. Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable, and let P be the number of processors.

Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s

Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
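The bound above is a one-liner, which makes the limit easy to explore numerically. A small Python helper (the name is illustrative):

```python
def amdahl_speedup(s, P):
    """Speedup(P) = 1 / (s + (1 - s)/P); s = sequential fraction."""
    return 1.0 / (s + (1.0 - s) / P)
```

Even a 1% sequential fraction caps 1024 processors at a speedup of about 91, far below the 1/s = 100 asymptote and the ideal 1024.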

Amdahl's Law (for 1024 processors)

[Plot: speedup vs. serial fraction s; source: Gustafson, Montry, Benner]

Does this mean parallel computing is a hopeless enterprise?

Scaled Speedup

See: Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube", SIAM J. Sci. Stat. Comput. 9, No. 4, 1988, p. 609.

Scaled Speedup (background)

Little's Law

Latency vs. bandwidth:
- Latency is physics (wire length): e.g., the network latency on the Earth Simulator is only about 2x the speed of light across the machine room.
- Bandwidth is cost: add more cables to increase bandwidth (an over-simplification).

Principle (Little's Law): for a production system in steady state,
  Inventory = Throughput × Flow Time

For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency:
  Required concurrency = Bandwidth × Latency  (the bandwidth-delay product)

Little's Law Examples

Example 1: a single processor.
- If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte, or 12.8 ns per 64-byte cache line)
- The system must support 50/12.8 ~= 4 outstanding cache-line misses to keep things balanced (run at bandwidth speed)
- An application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them)

Example 2: a 1000-processor system.
- 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory
- Main memory bandwidth is ~ 1000 x 100 words x 10^9/s = 10^14 words/sec
- To achieve full performance, an application needs ~ 10^-7 s x 10^14 words/s = 10^7-way concurrency (some of which may be hidden in the instruction stream)
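Both examples are instances of the same bandwidth-delay product. A Python sketch reproducing the slide's arithmetic (function name is illustrative; Example 2 assumes the ~10^14 words/sec aggregate bandwidth figure above):

```python
def required_concurrency(latency_s, bandwidth_per_s):
    """Little's Law: concurrency = bandwidth x latency (bandwidth-delay product)."""
    return latency_s * bandwidth_per_s

# Example 1: 50 ns memory latency, 5 GB/s = one 64-byte cache line per 12.8 ns.
lines_per_s = 5e9 / 64
misses = required_concurrency(50e-9, lines_per_s)   # ~4 outstanding misses

# Example 2: 100 ns latency, ~1e14 words/sec aggregate bandwidth.
concurrency = required_concurrency(100e-9, 1e14)    # ~1e7-way concurrency
```

The units cancel to a pure count: how many requests must be in flight at once to keep the pipe full.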

In Performance Analysis: Use More Data

Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.

E.g., from DBPP: a serial algorithm scales as N + N^2, and we observe a speedup of 10.8 on 12 processors with problem size N = 100. That single point is consistent with all of:
- Case 1: T = N + N^2/P
- Case 2: T = (N + N^2)/P + 100
- Case 3: T = (N + N^2)/P + 0.6*P^2

All have speedup ~10.8 on 12 procs, but performance graphs (N = 100, 1000) show differences in scaling.
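The point is easy to verify numerically. A Python sketch using Case 1 and Case 3 from the slide (the +0.6*P^2 term is as stated; the function names are illustrative):

```python
def speedup(T, N, P):
    """Speedup relative to the same model run on one processor."""
    return T(N, 1) / T(N, P)

case1 = lambda N, P: N + N**2 / P                  # sequential term outside
case3 = lambda N, P: (N + N**2) / P + 0.6 * P**2   # parallel overhead grows with P

s1 = speedup(case1, 100, 12)   # ~10.8
s3 = speedup(case3, 100, 12)   # also ~10.8, indistinguishable at this one point
```

At P = 12 the models agree, but at P = 1000 Case 1 still speeds up while Case 3's P^2 overhead makes it slower than serial, which is exactly why one data point is not enough.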

Example: Immersed Boundary Simulation

Joint work with Ed Givelberg, Armando Solar-Lezama. Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC.

[Plot: performance data]

How useful is this data? What are ways to make it more useful/interesting?

Performance Analysis

Building a Performance Model

Based on measurements/scaling of components:
- FFT time is: 5 n log n flops × time per flop (measured for the FFT)
- Other costs are linear in either material or fluid points. Measure the constants:
  a) flops/point (independent of machine or problem size)
  b) time per flop (measured per machine, per phase)
  Time is: a × b × #points
- Communication is done similarly:
  - Find a formula for message size as a function of problem size
  - Check the formula using tracing of some kind
  - Use an α-β model to predict running time: α + β × size

A Performance Model

[Plot: predicted vs. achieved time per timestep]

- < 1 second per timestep is not possible
- Primarily limited by bisection bandwidth

Model Success and Failure

OSKI SpMV: What We Expect

- Assume Cost(SpMV) = time to read the matrix
- 1 double-word = 2 integers; r, c in {1, 2, 4, 8}
- CSR: 1 int per nonzero
- BCSR(r x c): 1 int per r*c nonzeros
- As r*c increases, the speedup should:
  - increase smoothly
  - approach 1.5 (eliminating all index overhead)
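The 1.5 limit falls out of the data-volume model above. A Python sketch of that model (illustrative name; it ignores BCSR fill-in of explicit zeros, as the slide's idealized model does): per nonzero, CSR moves 2 integers of value data plus 1 index, while BCSR(r x c) amortizes the index over r*c nonzeros.

```python
def predicted_speedup(r, c):
    """Idealized BCSR(r x c) speedup over CSR, counting data moved per nonzero."""
    csr = 2 + 1              # 2 ints of value + 1 int of index
    bcsr = 2 + 1.0 / (r * c) # index amortized over the r*c block
    return csr / bcsr
```

The model increases smoothly with r*c and approaches 3/2; the next slide shows why measured performance does not follow this curve, motivating search.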

What We Get (The Need for Search)

[Plot: Mflop/s for each register blocking r x c; reference CSR vs. the best, 4x2]

Using Multiple Models

Multiple Models

Multiple Models

Extended Example: Using Performance Modeling (LogP) to Explain Data. Application to Sorting.

Deriving the LogP Model

- Processing: powerful microprocessor, large DRAM, cache => P
- Communication:
  + significant latency (100's to 1000's of cycles) => L
  + limited bandwidth (1% to 5% of memory bandwidth) => g
  + significant overhead (10's to 100's of cycles), on both ends => o
  - no consensus on topology => should not exploit structure
  + limited network capacity
  - no consensus on programming model => should not enforce one

LogP

[Diagram: P processor/memory modules connected by an interconnection network; limited volume (at most L/g messages to or from a processor)]

- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/bandwidth)
- P: number of processor/memory modules

Using the LogP Model

- Send n messages from proc to proc in time 2o + L + g(n-1)
  - each processor does o × n cycles of overhead
  - and has (g-o)(n-1) + L available compute cycles
- Send n total messages from one to many in the same time
- Send n messages from many to one in the same time
  - all but L/g processors block, so fewer compute cycles are available unless scheduled carefully

[Timeline diagram: o, g, and L intervals for a sequence of sends between two processors]

Use of the LogP Model (cont.)

- Two processors sending n words to each other (i.e., an exchange) take time
  2o + L + max(g, 2o)(n-1) ≈ n·max(g, 2o) + L
- P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts (e.g., transpose, FFT, cyclic-to-block, block-to-cyclic) take the same time
- Exercise: what's wrong with the formula above? It assumes an optimal pattern of sends/receives, so it could underestimate the time

[Timeline diagram: overlapping o, g, and L intervals for the exchange]
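The formulas from these two slides can be packaged as predictors. A hedged Python sketch (function names are illustrative; parameters in µs as on the LogP parameters slide):

```python
def logp_send_n(n, L, o, g):
    """One-way stream of n small messages: 2o + L + g*(n-1)."""
    return 2 * o + L + g * (n - 1)

def logp_exchange(n, L, o, g):
    """Two processors exchanging n messages: 2o + L + max(g, 2o)*(n-1)."""
    return 2 * o + L + max(g, 2 * o) * (n - 1)
```

With the CM-5 parameters (L = 6, o = 2.2, g = 4 µs), overhead dominates the gap (2o > g), so the exchange is paced by 2o per message rather than by network bandwidth.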

LogP "Philosophy"

Think about:
- the mapping of N words onto P processors
- computation within a processor, its cost, and balance
- communication between processors, its cost, and balance
given a characterization of processor and network performance.

Do not think about what happens within the network. This should be good enough!

Typical Sort

Exploits the n = N/P grouping:
- significant local computation
- very general global communication / transformation
- computation of the transformation

Costs of Split-C (UPC predecessor) Operations

  Read, Write:  x = *G, *G = x   2(L + 2o)
  Store:        *G :- x          L + 2o
  Get:          x := *G          o; ...; 2L + 2o at sync()
  sync():                        o, with interval g
  Bulk store (n words, several words per message):  2o + (n-1)g + L
  Exchange:                      2o + 2L + (n - L/g) max(g, 2o)
  One to many, many to one: similar

LogP Parameters

CM-5: L = 6 µs, o = 2.2 µs, g = 4 µs; P varies from 32 to 1024
NOW:  L = 8.9 µs, o = 3.8 µs, g = 12.8 µs; P varies up to 100

What is the processor performance? Application-specific: 10s of Mflops for these machines.

LogP Parameters Today

Local Computation Parameters (Empirical)

  Parameter   Operation                          µs per key                     Sort
  swap        simulate cycle butterfly per key   0.025 lg N                     Bitonic
  mergesort   sort bitonic sequence              1.0
  scatter     move key for cyclic-to-block       0.46
  gather      move key for block-to-cyclic       0.52 if n <= 64K or P <= 64;   Bitonic & Column
                                                 1.1 otherwise
  local sort  local radix sort (11 bit)          4.5 if n < 64K (else 281000/n)
  merge       merge sorted lists                 1.5                            Column
  copy        shift key                          0.5
  zero        clear histogram bin                0.2                            Radix
  hist        produce histogram                  1.2
  add         produce scan value                 1.0
  bsum        adjust scan of bins                2.5
  address     determine destination              4.7
  compare     compare key to splitter            0.9                            Sample
  localsort8  local radix sort of samples        5.0

Odd-Even Merge: Classic Parallel Sort

N values to be sorted: treat as two lists of M = N/2,
  A_0, A_1, ..., A_{M-1} and B_0, B_1, ..., B_{M-1}
Sort each separately. Redistribute into even and odd sublists:
  A_0, A_2, ..., A_{M-2} and A_1, A_3, ..., A_{M-1} (likewise for B)
Merge the even sublists and the odd sublists into two sorted lists:
  E_0, E_1, ..., E_{M-1} and O_0, O_1, ..., O_{M-1}
Pairwise swaps of E_i and O_i will put it in order.
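The merge step above can be sketched recursively. A hedged sequential Python version of Batcher's odd-even merge (names are illustrative; it assumes the two sorted inputs have equal power-of-two lengths, as in the butterfly mapping that follows):

```python
def odd_even_merge(A, B):
    """Merge two sorted lists: merge even- and odd-indexed sublists
    separately, interleave, then fix up with pairwise compare-swaps."""
    n = len(A) + len(B)
    if n <= 2:
        return sorted(A + B)
    E = odd_even_merge(A[0::2], B[0::2])   # even-indexed sublists
    O = odd_even_merge(A[1::2], B[1::2])   # odd-indexed sublists
    out = [None] * n
    out[0::2], out[1::2] = E, O            # interleave E and O
    for i in range(1, n - 1, 2):           # pairwise swaps of O_i and E_{i+1}
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

Every comparison position is data-independent, which is what makes the merge a sorting network that maps onto a butterfly.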

Where's the Parallelism?

[Diagram: merge tree for E_0..E_{M-1} and O_0..O_{M-1}: 1 merge of size N, built from 2 merges of size N/2, from 4 of size N/4, ...]

Mapping to a Butterfly (or Hypercube)

[Diagram: two sorted sublists, A0 A1 A2 A3 and B0 B1 B2 B3; reverse the order of one list via the cross edges, then perform pairwise swaps on the way back.]

Bitonic Sort with N/P per Node

A bitonic sequence decreases and then increases (or vice versa). Bitonic sequences can be merged like monotonic sequences.

  all_bitonic(int A[PROCS]::[n])
    sort(tolocal(&A[ME][0]), n, 0)
    for (d = 1; d <= logProcs; d++)
      for (i = d-1; i >= 0; i--) {
        swap(A, T, n, pair(i));
        merge(A, T, n, mode(d, i));
      }
    sort(tolocal(&A[ME][0]), n, mask(i));

Bitonic: Breakdown

[Plot: P = 512, random keys]

Bitonic: Effect of Key Distributions

[Plot: P = 64, N/P = 1M]

Sequential Radix Sort: Counting Sort

Idea: build a histogram of the keys and compute the position in the answer array for each element.
A = [3, 5, 4, 1, 3, 4, 1, 4]
Make a temp array B, and write values into position.

Counting Sort Pseudo Code

  static void countingSort(int[] A) {
    int N = A.length;                   // A = [3, 5, 4, 1, 3, 4, 1, 4], N = 8
    int L = min(A), U = max(A);         // L = 1, U = 5
    int[] count = new int[U - L + 2];   // count = [0,0,0,0,0,0]
    for (int i = 0; i < N; i += 1)
      count[A[i] - L + 1] += 1;         // count = [0,2,0,2,3,1]
    for (int j = 1; j < count.length; j++)
      count[j] += count[j-1];           // count = [0,2,2,4,7,8]
    ...
  }

Distribution Sort, Continued

  static void countingSort(int[] A) {
    ...                                 // count = [0,2,2,4,7,8]
    int[] B = new int[N];
    for (int i = 0; i < N; i += 1) {
      B[count[A[i] - L]] = A[i];        // B fills in: [_,_,3,_,_,_,_,_],
      count[A[i] - L] += 1;             // [_,_,3,_,_,_,_,5], [_,_,3,_,4,_,_,5], ...
    }
    // copy back into A
    for (int i = 0; i < N; i += 1)
      A[i] = B[i];                      // B = [1, 1, 3, 3, 4, 4, 4, 5]
  }

Analysis of Counting Sort

What is the complexity of each step for an n-element array?
- Find min and max: Θ(n)
- Fill in the count histogram: Θ(n)
- Compute sums of count: Θ(max - min)
- Fill in B (run over count): Θ(max - min)
- Copy B to A: Θ(n)

So this is a Θ(n + m) algorithm, where m = max - min. Great if the range of keys isn't too large; if m < n, then Θ(n) overall.
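For a runnable version of the Java fragments on the previous two slides, here is a direct Python port (min/max replace the assumed min(A)/max(A) helpers; it returns a new list rather than copying back in place):

```python
def counting_sort(A):
    """Counting sort: Theta(n + m) time, m = max - min."""
    if not A:
        return []
    lo, hi = min(A), max(A)
    count = [0] * (hi - lo + 2)
    for x in A:                       # histogram, shifted by one slot
        count[x - lo + 1] += 1
    for j in range(1, len(count)):    # prefix sums = start position of each key
        count[j] += count[j - 1]
    B = [0] * len(A)
    for x in A:                       # place each element; stable
        B[count[x - lo]] = x
        count[x - lo] += 1
    return B
```

The stability of the placement loop is what lets counting sort serve as the per-digit pass inside radix sort.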