Sorting and Lower Bounds Fundamental Data Structures and Algorithms Peter Lee February 25, 2003
Announcements Quiz #2 available today; open until Wednesday midnight. Midterm exam next Tuesday: Tuesday, March 4, 2003, in class; review session in Thursday’s class. Homework #4 is out; you should finish Part 1 this week! Reading: Chapter 8.
Recap
Naïve sorting algorithms Bubble sort Insertion sort.
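As a quick illustration (in Python, which is the editor's choice here, not the course's language), insertion sort builds a sorted prefix one element at a time:

```python
def insertion_sort(a):
    """Sort list a in place by inserting each element into the sorted prefix."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # Shift larger elements of the sorted prefix a[0..i-1] one slot right.
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a
```

Each element moves one slot per step, so the worst case is O(N²) comparisons and moves.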
Heapsort Build heap: O(N). DeleteMin until empty: O(N log N). Total worst case: O(N log N).
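A minimal sketch of the two heapsort phases, using Python's standard-library heap rather than a hand-built one:

```python
import heapq

def heapsort(items):
    """O(N log N) sort: build a heap in O(N), then delete-min N times."""
    heap = list(items)
    heapq.heapify(heap)           # build heap: O(N)
    return [heapq.heappop(heap)   # each delete-min: O(log N)
            for _ in range(len(heap))]
```

`heapify` is the linear-time bottom-up build-heap; the N pops account for the O(N log N) total.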
Shellsort Example with increment sequence 3, …: several inverted pairs are fixed in one exchange.
Divide-and-conquer
Analysis of recursive sorting Suppose it takes time T(N) to sort N elements. Suppose also it takes time N to combine the two sorted arrays. Then: T(1) = 1 T(N) = 2T(N/2) + N, for N>1 Solving for T gives the running time for the recursive sorting algorithm.
Divide-and-Conquer Theorem Theorem: Let a, b, c > 0. The recurrence relation T(1) = b; T(N) = aT(N/c) + bN, for any N that is a power of c, has upper-bound solutions: T(N) = O(N) if a < c; T(N) = O(N log N) if a = c; T(N) = O(N^(log_c a)) if a > c. (a = 2, b = 1, c = 2 for recursive sorting.)
Exact solutions It is sometimes possible to derive closed-form solutions to recurrence relations. Several methods exist for doing this. Telescoping-sum method Repeated-substitution method
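For instance, the telescoping-sum method solves the recursive-sorting recurrence above (T(1) = 1, T(N) = 2T(N/2) + N) exactly, assuming N is a power of 2. Divide both sides by N and stack the resulting equations:

```latex
\begin{aligned}
\frac{T(N)}{N} &= \frac{T(N/2)}{N/2} + 1 \\
\frac{T(N/2)}{N/2} &= \frac{T(N/4)}{N/4} + 1 \\
&\;\;\vdots \\
\frac{T(2)}{2} &= \frac{T(1)}{1} + 1
\end{aligned}
```

Summing all log N equations, the intermediate fractions cancel in pairs, leaving T(N)/N = T(1) + log N, i.e. T(N) = N log N + N.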
Mergesort Mergesort is the most basic recursive sorting algorithm. Divide array in halves A and B. Recursively mergesort each half. Combine A and B by successively looking at the first elements of A and B and moving the smaller one to the result array. Note: be careful to avoid creating lots of intermediate result arrays.
Mergesort Use simple indexes to perform the split. Use a single extra array to hold each intermediate result.
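A sketch of that implementation idea (in Python for illustration): index arithmetic performs the split, and one shared auxiliary array serves every merge.

```python
def mergesort(a):
    """Sort list a using one auxiliary array shared by all merges."""
    tmp = [None] * len(a)

    def sort(lo, hi):             # sorts a[lo..hi] inclusive
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        sort(lo, mid)
        sort(mid + 1, hi)
        # Merge the two sorted halves into tmp, then copy back into a.
        i, j = lo, mid + 1
        for k in range(lo, hi + 1):
            if j > hi or (i <= mid and a[i] <= a[j]):
                tmp[k] = a[i]
                i += 1
            else:
                tmp[k] = a[j]
                j += 1
        a[lo:hi + 1] = tmp[lo:hi + 1]

    sort(0, len(a) - 1)
    return a
```

Because `tmp` is allocated once, no merge step creates a fresh result array.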
Analysis of mergesort Mergesort generates almost exactly the same recurrence relation shown before: T(1) = 1; T(N) = 2T(N/2) + N − 1, for N > 1. Thus, mergesort is O(N log N).
Comparison-based sorting Recall that these are all examples of comparison-based sorting algorithms: Items are stored in an array. Can be moved around in the array. Can compare any two array elements. Comparison has 3 possible outcomes: less than, equal to, or greater than.
Non-comparison-based sorting If we can do more than just compare pairs of elements, we can sometimes sort more quickly. Two simple examples are bucket sort and radix sort.
Bucket Sort
Bucket sort In addition to comparing pairs of elements, we require these additional restrictions: all elements are non-negative integers all elements are less than a predetermined maximum value
Bucket sort characteristics Runs in O(N) time. Easy to implement each bucket as a linked list. Is stable: If two elements (A,B) are equal with respect to sorting, and they appear in the input in order (A,B), then they remain in the same order in the output.
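These properties can be seen in a minimal sketch (Python lists standing in for the linked-list buckets):

```python
def bucket_sort(items, max_key, key=lambda x: x):
    """Stable O(N + max_key) sort of items whose key is an int in [0, max_key)."""
    buckets = [[] for _ in range(max_key)]
    for item in items:                 # appending preserves input order: stable
        buckets[key(item)].append(item)
    return [item for b in buckets for item in b]
```

One pass distributes, one pass collects; equal-keyed items keep their input order because each bucket is filled left to right.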
Radix Sort
Radix sort Another sorting algorithm that goes beyond comparisons is radix sort: sort on one digit at a time, starting with the least-significant digit. Each sorting step must be stable.
Radix sort characteristics Each sorting step can be performed via bucket sort, and is thus O(N). If the numbers are all b bits long, then there are b sorting steps. Hence, radix sort is O(bN). Also, radix sort can be implemented in-place (just like quicksort).
Not just for binary numbers Radix sort can be used for decimal numbers and alphanumeric strings
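A sketch of the decimal case (Python for illustration): each pass is a stable bucket sort on one digit, least-significant first.

```python
def radix_sort(a, digits):
    """Sort non-negative integers of at most `digits` decimal digits
    using `digits` stable bucket-sort passes, least-significant first."""
    for d in range(digits):
        buckets = [[] for _ in range(10)]
        for x in a:
            buckets[(x // 10 ** d) % 10].append(x)   # stable per-digit pass
        a = [x for b in buckets for x in b]
    return a
```

Stability is what makes this work: after pass d, ties on digit d are still ordered by the lower digits sorted in earlier passes.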
Why comparison-based? Bucket and radix sort are much faster than any comparison-based sorting algorithm. Unfortunately, we can’t always live with the restrictions imposed by these algorithms. In such cases, comparison-based sorting algorithms give us general solutions.
Back to Quick Sort
Review: Quicksort algorithm If array A has 1 (or 0) elements, then done. Choose a pivot element x from A. Divide A − {x} into two arrays: B = {y ∈ A − {x} | y ≤ x}, C = {y ∈ A − {x} | y ≥ x}. Quicksort arrays B and C. Result is B + {x} + C.
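A direct transcription of that description (Python, not in-place; the in-place refinements come next):

```python
def quicksort(a):
    """Recursive quicksort, transcribed directly; allocates new lists."""
    if len(a) <= 1:
        return a
    x = a[0]                      # pivot: first element (fine for a sketch)
    b = [y for y in a[1:] if y <= x]
    c = [y for y in a[1:] if y > x]
    return quicksort(b) + [x] + quicksort(c)
```

This version is easy to read but copies the array at every level, which motivates the implementation issues below.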
Implementation issues Quicksort can be very fast in practice, but this depends on careful coding. Three major issues: 1. doing quicksort in-place; 2. picking the right pivot; 3. avoiding quicksort on small arrays.
1. Doing quicksort in place [Figure: two indexes, L and R, start at the ends of the array and scan toward each other, swapping elements that are on the wrong side of the pivot, until they cross.]
2. Picking the pivot In real life, inputs to a sorting routine are often partially sorted (why does this happen?). So, picking the first or last element to be the pivot is usually a bad choice. One common strategy is to pick the middle element; this is an OK strategy.
2. Picking the pivot A more sophisticated approach is to use random sampling (think about opinion polls). For example, the median-of-three strategy: take the median of the first, middle, and last elements to be the pivot.
3. Avoiding small arrays While quicksort is extremely fast for large arrays, experimentation shows that it performs less well on small arrays. For small enough arrays, a simpler method such as insertion sort works better. The exact cutoff depends on the language and machine, but usually is somewhere between 10 and 30 elements.
Putting it all together [Figure: the in-place partitioning scan again, now combined with a median-of-three pivot; L and R move inward and swap until they cross.]
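A sketch combining all three ideas (Python for illustration; the cutoff value of 10 is an assumed choice from the 10–30 range mentioned above):

```python
CUTOFF = 10   # assumed cutoff; tune per language and machine

def quicksort(a, lo=0, hi=None):
    """In-place quicksort: median-of-three pivot, insertion sort on small ranges."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        for i in range(lo + 1, hi + 1):     # insertion sort a[lo..hi]
            x, j = a[i], i - 1
            while j >= lo and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return
    # Median-of-three: order a[lo], a[mid], a[hi], then use the median as pivot.
    mid = (lo + hi) // 2
    if a[mid] < a[lo]: a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[lo]: a[lo], a[hi] = a[hi], a[lo]
    if a[hi] < a[mid]: a[mid], a[hi] = a[hi], a[mid]
    a[mid], a[hi - 1] = a[hi - 1], a[mid]   # stash pivot at hi-1
    pivot = a[hi - 1]
    i, j = lo, hi - 1
    while True:                             # in-place partition: L and R scan inward
        i += 1
        while a[i] < pivot: i += 1
        j -= 1
        while a[j] > pivot: j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]             # both scans stop on elements equal to pivot
    a[i], a[hi - 1] = a[hi - 1], a[i]       # restore pivot to its final position
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)
```

Because median-of-three leaves a[lo] ≤ pivot ≤ a[hi], the two inner scans have natural sentinels and cannot run off the ends of the range.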
A complication! What should happen if we encounter an element that is equal to the pivot? Four possibilities: L stops, R keeps going R stops, L keeps going L and R stop L and R keep going
Quiz Break
Red-green quiz What should happen if we encounter an element that is equal to the pivot? Four possibilities: L stops, R keeps going R stops, L keeps going L and R stop L and R keep going Explain why your choice is the only reasonable one
Quick Sort Analysis
Worst-case behavior If the pivot is always the smallest (or largest) element, one of the two subarrays is empty and the other has N − 1 elements, so the recursion is N levels deep; this is the O(N²) worst case.
Best-case analysis In the best case, the pivot is always the median element. In that case, the splits are always “down the middle”. Hence, same behavior as mergesort. That is, O(N log N).
Average-case analysis Consider the quicksort tree:
Average-case analysis The time spent at each level of the tree is O(N). So, on average, how many levels? That is, what is the expected height of the tree? If on average there are O(log N) levels, then quicksort is O(N log N) on average.
Expected height of qsort tree Assume that pivot is chosen randomly. When is a pivot “good”? “Bad”? Probability of a good pivot is 0.5. After good pivot, each child is at most 3/4 size of parent.
Expected height of qsort tree So, if we descend k levels in the tree, each time being lucky enough to pick a “good” pivot, the maximum size of the k-th child is N(3/4)(3/4)…(3/4) (k times) = N(3/4)^k. Setting N(3/4)^k = 1 shows that k = log_{4/3} N good pivots reach subproblems of size 1. But on average only half of the pivots will be good, so the expected height is about 2 log_{4/3} N = O(log N).
Summary of quicksort A fast sorting algorithm in practice. Can be implemented in-place. But is O(N²) in the worst case. O(N log N) average-case performance.
Lower Bound for the Sorting Problem
How fast can we sort? We have seen several sorting algorithms with O(N log N) running time. In fact, Ω(N log N) is a general lower bound for the sorting problem. A proof appears in Weiss. Informally…
Upper and lower bounds T(N) = O(f(N)) means T(N) ≤ c·f(N) for some constant c and all N ≥ d. T(N) = Ω(g(N)) means T(N) ≥ c·g(N) for some constant c and all sufficiently large N. [Figure: T(N) lies below c·f(N) and above g(N) for N ≥ d.]
Decision tree for sorting [Figure: decision tree for sorting three elements a, b, c; each internal node is a comparison such as a<b or b<c, and each leaf is one of the 3! = 6 possible orderings.] A decision tree for sorting N elements has N! leaves. So, the tree has height at least log(N!). log(N!) = Θ(N log N).
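The last step does not need Stirling’s formula; keeping only the larger half of the factors of N! is enough:

```latex
\log(N!) \;=\; \sum_{i=1}^{N} \log i \;\ge\; \sum_{i=\lceil N/2 \rceil}^{N} \log i \;\ge\; \frac{N}{2}\,\log\frac{N}{2} \;=\; \Omega(N \log N)
```

Combined with the easy upper bound log(N!) ≤ N log N, this gives log(N!) = Θ(N log N).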
Summary on sorting bound If we are restricted to comparisons on pairs of elements, then the general lower bound for sorting is Ω(N log N). A decision tree is a representation of the possible comparisons required to solve a problem.
External Sorting
External sorting In many real-world situations, the amount of data to be sorted is much more than can be stored in memory. So, it is important in some cases to use algorithms that work well when sorting data stored externally. See tomorrow’s recitation…
World’s Fastest Sorters
Sorting competitions There are several world-wide sorting competitions. Unix CoSort has achieved 1GB in under one minute, on a single Alpha. Berkeley’s NOW-sort sorted 8.4GB of disk data in under one minute, using a network of 95 workstations. Sandia Labs was able to sort 1TB of data in under 50 minutes, using a 144-node multiprocessor machine.