1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003.

Presentation transcript:

1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003

2 Today: Bucket sort (review); Radix sort; Merge sort for external memory

3 Lower Bound Recall: Any sorting algorithm based on comparisons requires Ω(n log n) time –More precisely: there exists a “bad input” for that algorithm, on which it takes Ω(n log n) –The theorem can be extended: the average running time is Ω(n log n) The next two algorithms (Bucket, Radix) apparently break this theorem!

4 Bucket Sort Now let’s sort in O(N). Assume: A[0], A[1], …, A[N-1] ∈ {0, 1, …, M-1}, where M is not too big. Example: sort 1,000,000 person records on the first character of their last names: –Hence M = 128 (in practice: M = 27)

5 Bucket Sort
Queue bucketSort(Array A, int N) {
  for k = 0 to M-1
    Q[k] = new Queue;
  for j = 0 to N-1
    Q[A[j]].enqueue(A[j]);
  Result = new Queue;
  for k = 0 to M-1
    Result = Result.append(Q[k]);
  return Result;
}
Stable sorting!
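
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the function name `bucket_sort`, the explicit `m` parameter, and the optional `key` function are assumptions added so the example is self-contained.

```python
from collections import deque

def bucket_sort(a, m, key=lambda x: x):
    """Stable bucket sort of items whose key(item) is an integer in range(m)."""
    buckets = [deque() for _ in range(m)]   # one FIFO queue per possible key
    for item in a:                          # distribute: O(N)
        buckets[key(item)].append(item)
    result = []
    for q in buckets:                       # concatenate in key order: O(M + N)
        result.extend(q)
    return result
```

Because each queue is filled in input order and emptied in the same order, items with equal keys keep their relative order. For example, `bucket_sort(names, 128, key=lambda s: ord(s[0]))` would group ASCII names by first character, as in the person-records example.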

6 Bucket Sort Running time: O(M+N). Space: O(M+N). Recall that M << N, hence time = O(N). What about the Theorem that says sorting takes Ω(N log N)?? This is not based on key comparisons; instead it exploits the fact that keys are small.

7 Radix Sort I still want to sort in time O(N): non-trivial keys. A[0], A[1], …, A[N-1] are strings –Very common in practice. Each string is: c_{d-1} c_{d-2} … c_1 c_0, where c_0, c_1, …, c_{d-1} ∈ {0, 1, …, M-1}, M = 128. Other example: decimal numbers

8 RadixSort Radix = “The base of a number system” (Webster’s dictionary) –alternate terminology: radix is number of bits needed to represent 0 to base-1; can say “base 8” or “radix 3” Used in 1890 U.S. census by Hollerith Idea: BucketSort on each digit, bottom up.

9 The Magic of RadixSort Input list: 126, 328, 636, 341, 416, 131, 328 BucketSort on lower digit: 341, 131, 126, 636, 416, 328, 328 BucketSort result on next-higher digit: 416, 126, 328, 328, 131, 636, 341 BucketSort that result on highest digit: 126, 131, 328, 328, 341, 416, 636

10 Inductive Proof that RadixSort Works Keys: d-digit numbers, base B –(that wasn’t hard!) Claim: after the i-th BucketSort, the keys are sorted with respect to their least significant i digits. –Base case: i=0. 0 digits are sorted. –Inductive step: Assume for i, prove for i+1. Consider two numbers X, Y, and let X_i denote the i-th digit of X. If X_{i+1} < Y_{i+1}, then the (i+1)-th BucketSort will put them in order. If X_{i+1} > Y_{i+1}, same thing. If X_{i+1} = Y_{i+1}, the order depends on the last i digits: the induction hypothesis says they are already sorted on these digits, and because BucketSort is stable that order is preserved.

11 Radix Sort Array radixSort(Array A, int N) { for k = 0 to d-1 A = bucketSort(A, on position k); return A; } Running time: T = O(d(M+N)) = O(dN) = O(Size)
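
A minimal Python sketch of the same idea, assuming non-negative integer keys with at most `d` digits in base `base` (both parameter names are illustrative, not from the slides). Each pass is a stable bucket sort on one digit, starting from the least significant.

```python
def radix_sort(a, d, base=10):
    """LSD radix sort of non-negative integers with at most d digits in `base`."""
    for k in range(d):                           # digit 0 is the least significant
        buckets = [[] for _ in range(base)]
        for x in a:
            digit = (x // base**k) % base        # k-th digit of x
            buckets[digit].append(x)             # appending keeps the pass stable
        a = [x for bucket in buckets for x in bucket]
    return a
```

Calling it with `d=1`, `d=2`, and `d=3` on the input list from "The Magic of RadixSort" slide reproduces the three intermediate lists shown there.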

12 Radix Sort [Figure: the array A and ten queues Q[0], Q[1], …, Q[9]]

13 Running time of Radixsort N items, d-digit keys, each digit in {0, …, M-1}. How many passes? How much work per pass? Total time?

14 Running time of Radixsort N items, d-digit keys, each digit in {0, …, M-1}. How many passes? d. How much work per pass? N + M –just in case M > N, need to account for time to empty out buckets between passes. Total time? O(d(N+M))

15 Radix Sort What is the size of the input? Size = dN. Radix sort takes time O(Size)!!
         c_{d-1}  c_{d-2}  …           c_0
A[0]     ‘S’      ‘m’      ‘i’   ‘t’   ‘h’
A[1]     ‘J’      ‘o’      ‘n’   ‘e’   ‘s’
…
A[N-1]

16 Radix Sort Variable length strings: Can adapt Radix Sort to sort in time O(Size)! –What about our Theorem?? [Figure: strings A[0], A[1], A[2], A[3], A[4] of different lengths]

17 Radix Sort Suppose we want to sort N distinct numbers Represent them in decimal: –Need d=log N digits Hence RadixSort takes time O(Size) = O(dN) = O(N log N) The total Size of N keys is O(N log N) ! No conflict with theory

18 Sorting HUGE Data Sets US Telephone Directory: –300,000,000 records 64-bytes per record –Name: 32 characters –Address: 54 characters –Telephone number: 10 characters –About 2 gigabytes of data –Sort this on a machine with 128 MB RAM… Other examples?

19 Merge Sort Good for Something! Basis for most external sorting routines Can sort any number of records using a tiny amount of main memory –in extreme case, only need to keep 2 records in memory at any one time!

20 External MergeSort Split input into two “tapes” (or areas of disk). Merge tapes so that each group of 2 records is sorted. Split again. Merge tapes so that each group of 4 records is sorted. Repeat until data entirely sorted: log N passes.

21 Sorting Illustrates the difference in algorithm design when your data is not in main memory: –Problem: sort 8GB of data with 8MB of RAM. We know we can do it in O(n log n) time, but let’s see the number of disk I/Os

22 2-Way Merge-sort: Requires 3 Buffers Pass 1: Read a page, sort it, write it. –only one buffer page is used. Pass 2, 3, …, etc.: –three buffer pages used. [Figure: pages flow from disk through main-memory buffers INPUT 1 and INPUT 2 into OUTPUT, then back to disk] Buffer size = 8KB (typically)

23 2-Way Merge-sort A run = a sequence of sorted elements. Main property of 2-way merge: –If the minimum run length is L, then after merge the minimum run length is 2L. Initially: minimum run length = 1 (why?). After one pass = 2. After two passes = 4. [Figure: two runs merged into one run]

24 Two-Way External Merge Sort Each pass we read + write each page in file. N pages in the file => the number of passes = ⌈log2 N⌉ + 1. So total cost is: 2N (⌈log2 N⌉ + 1) page I/Os. Improvement: start with larger runs. Sort 1GB with 1MB memory in 10 passes.
Figure (7-page input, 2 records per page; runs separated by |):
Input file:           3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs): 1,2 2,3 3,4 4,5 6,6 7,8 9
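
To make the pass structure concrete, here is a small in-memory simulation in Python. It is only a sketch: real external sorting streams pages through disk buffers, whereas this version keeps every run as a Python list. The names `merge` and `two_way_merge_sort` are illustrative choices, and pass 0 (sorting each page on its own) is counted as a pass.

```python
def merge(run_a, run_b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            out.append(run_a[i])
            i += 1
        else:
            out.append(run_b[j])
            j += 1
    out.extend(run_a[i:])    # at most one of these two tails is non-empty
    out.extend(run_b[j:])
    return out

def two_way_merge_sort(pages):
    """Return (sorted_records, number_of_passes) for a non-empty list of pages."""
    runs = [sorted(page) for page in pages]      # pass 0: sort each page alone
    passes = 1
    while len(runs) > 1:                         # each pass merges runs pairwise
        runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes
```

On the seven-page input from the slide's figure, `[[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]`, the simulation finishes in 4 passes (passes 0 through 3), matching the figure.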

25 2-Way Merge-sort Hence we need about log N passes through the entire data. How much is N? N ≈ 10^6 (why?). Hence we need to read and write the entire 8GB of data log(10^6) = 20 times! It takes about 1 minute to read 1GB of data –even more. It takes at least 160 minutes to sort ≈ 3 hours

26 2-Way Merge-sort: less dumb Use the 8MB of main memory better! Initial step: Run formation –Read 8MB of data in main memory –Sort (what algorithm would you use?) –Write to disk. Now the runs are 8MB after one pass! After subsequent passes the runs are –2 × 8MB, 4 × 8MB, ... Need log(8GB/8MB) = log(10^3) = 10 passes. 1.5h instead of 3h: have time to see a movie

27 Can We Do Better? We have more main memory: use it during all passes. Multiway merge: given M sorted sequences, merge them.

28 Multiway Merge At each step: select the smallest value among the M sequences, store it in the output. What data structure should we use here?
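
One common answer to the question above is a binary min-heap holding the current front element of each run. The Python sketch below assumes the runs are in-memory lists (a real external sort would hold one buffered block per run); `multiway_merge` is an illustrative name.

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted lists by repeatedly extracting the heap minimum."""
    # Each heap entry is (value, run index, position within that run).
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, r, i = heapq.heappop(heap)        # smallest front element
        out.append(value)
        if i + 1 < len(runs[r]):                 # refill from the same run
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out
```

With m runs, each selection costs O(log m) instead of the O(m) of a linear scan. Python's standard library also offers `heapq.merge`, which performs the same kind of merge lazily over iterables.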

29 Multiway Merge-Sort Phase one: load M bytes in memory, sort –Result: runs of length M/B blocks. B = size of one block = 8KB typically. [Figure: M bytes of main memory (M/B blocks) filled from disk and written back as sorted runs]

30 Pass Two Merge m = M/B runs into a new run. Runs now have M/B · (M/B – 1) ≈ (M/B)^2 blocks. [Figure: M bytes of main memory holding buffers Input 1, Input 2, …, Input M/B and Output, fed from and written to disk]

31 Pass Three Merge M/B – 1 runs into a new run. Runs now have M/B · (M/B – 1)^2 ≈ (M/B)^3 blocks. [Figure: M bytes of main memory holding buffers Input 1, Input 2, …, Input M/B and Output, fed from and written to disk]

32 Multiway Merge-Sort Input file has N bytes. Need log_{M/B}(N/B) = log(N/B) / log(M/B) complete passes over the file. N/B = 8GB / 8KB = 10^6. M/B = 8MB / 8KB = 10^3. Hence need log(10^6)/log(10^3) = 2 passes! At about 1 minute per GB, that is about 16 minutes! Time for two movies
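
The pass counts quoted across these slides can be checked with a few lines of Python. This is a sketch under the slides' own simplifications: decimal units (8GB = 8 · 10^9 bytes and so on) and a merge fan-in of M/B rather than M/B – 1. The helper name `passes_needed` is hypothetical, and it uses integer arithmetic to avoid floating-point rounding in the logarithm.

```python
def passes_needed(n_units, fan_in):
    """Smallest p with fan_in**p >= n_units, i.e. ceil(log base fan_in of n_units)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes, reach = 0, 1
    while reach < n_units:
        reach *= fan_in          # one more pass multiplies the run length by fan_in
        passes += 1
    return passes

N, M, B = 8 * 10**9, 8 * 10**6, 8 * 10**3   # file, memory, and block size in bytes

print(passes_needed(N // B, 2))         # 2-way merge from 1-block runs: 20
print(passes_needed(N // M, 2))         # 2-way merge from 8MB runs: 10
print(passes_needed(N // B, M // B))    # multiway merge with fan-in M/B: 2
```

The three results line up with the 20, 10, and 2 passes derived on the slides.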

33 Multiway Merge-Sort With today’s main memories, we can sort almost any file in two passes. The file can have (M/B)^2 blocks. XML Toolkit: –xsort sorts using multiway merge, in two passes