CSE 326: Data Structures: Sorting. Lecture 17: Wednesday, Feb 19, 2003.


1 CSE 326: Data Structures: Sorting
Lecture 17: Wednesday, Feb 19, 2003

2 Today
Bucket sort (review)
Radix sort
Merge sort for external memory

3 Lower Bound
Recall: any sorting algorithm based on comparisons requires Ω(n log n) time
– More precisely: there exists a “bad input” for that algorithm, on which it takes Ω(n log n)
– The theorem can be extended: the average running time is Ω(n log n)
The next two algorithms (Bucket, Radix) apparently break this theorem!

4 Bucket Sort
Now let’s sort in O(N)
Assume: A[0], A[1], …, A[N-1] ∈ {0, 1, …, M-1}
M = not too big
Example: sort 1,000,000 person records on the first character of their last names:
– Hence M = 128 (in practice: M = 27)

5 Bucket Sort
Queue bucketSort(Array A, int N) {
  for k = 0 to M-1
    Q[k] = new Queue;
  for j = 0 to N-1
    Q[A[j]].enqueue(A[j]);
  Result = new Queue;
  for k = 0 to M-1
    Result = Result.append(Q[k]);
  return Result;
}
Stable sorting!
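The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the function name `bucket_sort`, the `key` parameter, and the sample records are assumptions made for the example.

```python
from collections import deque


def bucket_sort(a, m, key=lambda x: x):
    """Stable bucket sort of items whose key(item) lies in range(m)."""
    # One FIFO queue per possible key value: O(M) setup.
    queues = [deque() for _ in range(m)]
    # Enqueue every item in input order: O(N). FIFO order keeps the sort stable.
    for item in a:
        queues[key(item)].append(item)
    # Concatenate the queues in increasing key order: O(M + N).
    result = []
    for q in queues:
        result.extend(q)
    return result


# Hypothetical records: (small integer key, name).
records = [(2, "Smith"), (0, "Adams"), (2, "Jones"), (1, "Brown"), (0, "Clark")]
sorted_records = bucket_sort(records, 3, key=lambda r: r[0])
```

Records with equal keys ("Smith" and "Jones", "Adams" and "Clark") keep their input order, which is the stability property that radix sort relies on.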

6 Bucket Sort
Running time: O(M+N)
Space: O(M+N)
Recall that M << N, hence time = O(N)
What about the theorem that says sorting takes Ω(N log N)?
This is not based on key comparisons; instead it exploits the fact that keys are small

7 Radix Sort
I still want to sort in time O(N): non-trivial keys
A[0], A[1], …, A[N-1] are strings
– Very common in practice
Each string is: c_{d-1} c_{d-2} … c_1 c_0, where c_0, c_1, …, c_{d-1} ∈ {0, 1, …, M-1}
M = 128
Other example: decimal numbers

8 RadixSort
Radix = “The base of a number system” (Webster’s dictionary)
– alternate terminology: radix is the number of bits needed to represent 0 to base-1; can say “base 8” or “radix 3”
Used in 1890 U.S. census by Hollerith
Idea: BucketSort on each digit, bottom up.

9 The Magic of RadixSort
Input list: 126, 328, 636, 341, 416, 131, 328
BucketSort on lower digit: 341, 131, 126, 636, 416, 328, 328
BucketSort result on next-higher digit: 416, 126, 328, 328, 131, 636, 341
BucketSort that result on highest digit: 126, 131, 328, 328, 341, 416, 636

10 Inductive Proof that RadixSort Works
Keys: d-digit numbers, base B
Claim: after the i-th BucketSort, the keys are sorted on their least significant i digits.
– Base case: i = 0. 0 digits are sorted (that wasn’t hard!)
– Inductive step: assume for i, prove for i+1. Consider two numbers X, Y. Say X_i is the i-th digit of X:
  If X_{i+1} < Y_{i+1}, then the (i+1)-th BucketSort will put them in order
  If X_{i+1} > Y_{i+1}, same thing
  If X_{i+1} = Y_{i+1}, the order depends on the last i digits. The induction hypothesis says they are already sorted on these digits, and they stay in that order because BucketSort is stable

11 Radix Sort
Array radixSort(Array A, int N) {
  for k = 0 to d-1
    A = bucketSort(A, on position k)
  return A
}
Running time: T = O(d(M+N)) = O(dN) = O(Size)
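A possible Python rendering of this loop, run on the seven keys from the worked example on the earlier slide. It is a sketch under assumptions: the name `radix_sort`, the `base` parameter, and the restriction to non-negative integers are choices made for the illustration, and the per-digit bucket pass is inlined so the block stands alone.

```python
def radix_sort(a, d, base=10):
    """LSD radix sort of non-negative integers with at most d digits in `base`."""
    for k in range(d):  # position 0 is the least significant digit
        # One stable bucket pass on digit k.
        buckets = [[] for _ in range(base)]
        for x in a:
            digit = (x // base**k) % base
            buckets[digit].append(x)  # appending preserves the previous order
        # Concatenate the buckets in digit order to form the next array.
        a = [x for bucket in buckets for x in bucket]
    return a


keys = [126, 328, 636, 341, 416, 131, 328]
after_one_pass = radix_sort(keys, d=1)
after_two_passes = radix_sort(keys, d=2)
fully_sorted = radix_sort(keys, d=3)
```

Calling the function with d=1 and d=2 exposes the intermediate lists, which can be compared with the lists on the worked-example slide.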

12 Radix Sort
[Figure: an array A of two-digit keys is distributed into queues Q[0] … Q[9] on one digit and collected back into A, then distributed into Q[0] … Q[9] again on the other digit]

13 Running time of Radixsort
N items, d-digit keys of max value M
How many passes?
How much work per pass?
Total time?

14 Running time of Radixsort
N items, d-digit keys of max value M
How many passes? d
How much work per pass? N + M
– just in case M > N, need to account for time to empty out buckets between passes
Total time? O(d(N+M))

15 Radix Sort
What is the size of the input? Size = dN
Radix sort takes time O(Size)!

          c_{d-1}  c_{d-2}  …               c_0
A[0]      ‘S’      ‘m’      ‘i’     ‘t’     ‘h’
A[1]      ‘J’      ‘o’      ‘n’     ‘e’     ‘s’
…
A[N-1]

16 Radix Sort
Variable length strings:
Can adapt Radix Sort to sort in time O(Size)!
– What about our theorem?
[Figure: strings A[0], A[1], A[2], A[3], A[4] of different lengths]

17 Radix Sort
Suppose we want to sort N distinct numbers
Represent them in decimal:
– Need at least d = log_10 N digits
Hence RadixSort takes time O(Size) = O(dN) = O(N log N)
The total Size of N keys is O(N log N)!
No conflict with theory

18 Sorting HUGE Data Sets
US Telephone Directory:
– 300,000,000 records, 64 bytes per record
– Name: 32 characters
– Address: 54 characters
– Telephone number: 10 characters
– About 2 gigabytes of data
– Sort this on a machine with 128 MB RAM…
Other examples?

19 Merge Sort Good for Something!
Basis for most external sorting routines
Can sort any number of records using a tiny amount of main memory
– in extreme case, only need to keep 2 records in memory at any one time!

20 External MergeSort
Split input into two “tapes” (or areas of disk)
Merge tapes so that each group of 2 records is sorted
Split again
Merge tapes so that each group of 4 records is sorted
Repeat until data entirely sorted
log N passes

21 Sorting
Illustrates the difference in algorithm design when your data is not in main memory:
– Problem: sort 8GB of data with 8MB of RAM
We know we can do it in O(n log n) time, but let’s look at the number of disk I/Os

22 2-Way Merge-sort: Requires 3 Buffers
Pass 1: Read a page, sort it, write it.
– only one buffer page is used
Pass 2, 3, …, etc.:
– three buffer pages used.
Buffer size = 8KB (typically)
[Figure: disk, main memory buffers INPUT 1, INPUT 2 and OUTPUT, disk]

23 2-Way Merge-sort
A run = a sequence of sorted elements
Main property of 2-way merge:
– If the minimum run length is L, then after merge the minimum run length is 2L
Initially: minimum run length = 1 (why?)
After one pass = 2
After two passes = 4
...
[Figure: a sequence of numbers with two adjacent runs marked]

24 Two-Way External Merge Sort
Each pass we read + write each page in file.
N pages in the file => the number of passes = ⌈log_2 N⌉ + 1
So total cost is: 2N (⌈log_2 N⌉ + 1) page I/Os
Improvement: start with larger runs
Sort 1GB with 1MB memory in 10 passes

Example (7-page file, 2 records per page; “|” separates runs):
Input file:             3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):   2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):   2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):   1,2 2,3 3,4 4,5 6,6 7,8 9
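The pass structure can be simulated in memory with a short Python sketch. It assumes a 7-page file with 2 records per page, using the 13 values as best they can be read off the slide's figure; the function names are hypothetical, and Python lists stand in for disk pages, so only the pass count and the resulting I/O estimate are meaningful.

```python
def merge_two(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def two_way_merge_sort(pages):
    """Return (sorted records, number of passes), counting pass 0."""
    # Pass 0: sort every page on its own, giving 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Each later pass merges neighbouring runs pairwise, doubling run length.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge_two(runs[i], runs[i + 1]))
            else:
                merged.append(runs[i])  # odd run out is carried over unchanged
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_records, passes = two_way_merge_sort(pages)
# Every pass reads and writes each page once.
io_cost = 2 * len(pages) * passes
```

For this file the simulation takes 4 passes (pass 0 through pass 3), so under the read-plus-write cost model the estimate is 2 × 7 × 4 = 56 page I/Os.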

25 2-Way Merge-sort
Hence we need exactly log N passes through the entire data
How much is N? N ≈ 10^6 (why?)
Hence we need to read and write the entire 8GB of data log_2(10^6) ≈ 20 times!
It takes about 1 minute to read 1GB of data
– even more
It takes at least 160 minutes to sort ≈ 3 hours

26 2-Way Merge-sort: less dumb
Use the 8MB of main memory better!
Initial step: Run formation
– Read 8MB of data in main memory
– Sort (what algorithm would you use?)
– Write to disk
Now the runs are 8MB after one pass!
After subsequent passes the runs are
– 2 × 8MB, 4 × 8MB, ...
Need log_2(8GB/8MB) = log_2(10^3) ≈ 10 passes
1.5h instead of 3h: have time to see a movie

27 Can We Do Better?
We have more main memory
Use it during all passes
Multiway merge:
Given M sorted sequences, merge them

28 Multiway Merge
[Figure: m sorted input sequences merged into a single sorted output sequence]
At each step: select the smallest value among the M, store in the output
What data structure should we use here?
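One common answer to the question above is a binary min-heap holding the current front element of each sequence, so that each selection costs O(log M) instead of O(M). The sketch below uses Python's standard `heapq` module; the function name `multiway_merge` and the sample runs are made up for illustration and are not taken from the slide's figure.

```python
import heapq


def multiway_merge(runs):
    """Merge any number of sorted lists into one sorted list using a min-heap."""
    # Heap entries are (value, run index, position in run); the run index
    # breaks ties so equal values are taken from the earlier run first.
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, r, i = heapq.heappop(heap)  # smallest front element overall
        out.append(value)
        if i + 1 < len(runs[r]):
            # Refill the heap from the run the value came from.
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out


# Illustrative runs (hypothetical data).
runs = [[1, 9, 27], [2, 3, 45], [10, 20, 30], [2, 4, 32]]
merged = multiway_merge(runs)
```

In an external sort each run would be read through a one-block input buffer rather than held fully in memory; the heap logic stays the same.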

29 Multiway Merge-Sort
Phase one: load M bytes in memory, sort
– Result: runs of length M/B blocks
B = size of one block = 8KB typically
[Figure: disk, M bytes of main memory holding M/B blocks, disk]

30 Pass Two
Merge M/B - 1 runs into a new run
Runs now have M/B (M/B - 1) ≈ (M/B)^2 blocks
[Figure: disk, M bytes of main memory with buffers labeled Input 1, Input 2, …, Input M/B and Output, disk]

31 Pass Three
Merge M/B - 1 runs into a new run
Runs now have M/B (M/B - 1)^2 ≈ (M/B)^3 blocks
[Figure: disk, M bytes of main memory with buffers labeled Input 1, Input 2, …, Input M/B and Output, disk]

32 Multiway Merge-Sort
Input file has N bytes
Need log_{M/B}(N/B) = (log(N/B)) / (log(M/B)) complete passes over the file
N/B = 8GB / 8KB = 10^6
M/B = 8MB / 8KB = 10^3
Hence need log(10^6)/log(10^3) = 2 passes!
About 16 minutes at 1 minute per GB read!
Time for two movies
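The three pass counts quoted in this lecture (about 20, about 10, and 2) can be checked with a small helper. This is only a sketch: `passes_needed` is a hypothetical name, the inputs are the slides' rounded figures (10^6 blocks, 10^3 initial runs or buffers), and integer arithmetic is used on purpose because a floating-point `log(10**6) / log(10**3)` may land slightly above 2 and round up incorrectly.

```python
def passes_needed(total_units, fan):
    """Smallest p such that fan**p >= total_units, using integers only."""
    p, reach = 0, 1
    while reach < total_units:
        reach *= fan  # one more pass multiplies the run length by `fan`
        p += 1
    return p


# Naive 2-way merge: about 10^6 one-block runs, merged two at a time.
naive_two_way = passes_needed(10**6, 2)
# 2-way merge after run formation: about 10^3 runs of 8 MB each.
better_two_way = passes_needed(10**3, 2)
# Multiway merge: N/B = 10^6 blocks, fan of M/B = 10^3 per pass.
multiway = passes_needed(10**6, 10**3)
```

With the exact powers of two (8GB / 8KB = 2^20 blocks and 8MB / 8KB = 2^10 buffers) the multiway count is still 2.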

33 Multiway Merge-Sort
With today’s main memories, we can sort almost any file in two passes
The file can have (M/B)^2 blocks
XML Toolkit:
– xsort sorts using multiway merge, in two passes

