Sorting: Implementation Fundamental Data Structures and Algorithms Klaus Sutner February 24, 2004
Announcements Homework #5 Midterm March 4 Review: March 2
Today - Recall Sorting - Implementation Issues - Average case RT for quicksort - Timing Results
Total Recall: Sorting Algorithms
The Bible Robert Sedgewick, Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. Addison-Wesley, 1998.
Multiple Keys We could use a special comparator function (this would require a special function for each combination of keys). It is often easier to - first sort by name - then do a stable sort by year Done!
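For illustration, a small C sketch of the comparator option, with the two-pass alternative described in the trailing comment (the Record type and its field names are assumptions made up for this example):

    #include <stdlib.h>
    #include <string.h>

    typedef struct { char name[32]; int year; } Record;   /* illustrative record type */

    /* The "special comparator" option: one comparator per key combination
       (here primary key year, tie-break on name), usable with qsort. */
    int cmp_year_then_name(const void *a, const void *b) {
        const Record *x = a, *y = b;
        if (x->year != y->year)
            return (x->year > y->year) - (x->year < y->year);
        return strcmp(x->name, y->name);
    }

    /* The two-pass alternative needs no combined comparator: sort by name
       first (any method), then sort by year with a STABLE method such as
       insertion or merge sort; stability keeps the name order within each
       year.  Note that C's qsort itself is not guaranteed to be stable. */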
Sorting Review Several simple, quadratic algorithms (worst case and average). - Bubble Sort - Selection Sort - Insertion Sort Only Insertion Sort is of practical interest: its running time is linear in the number of inversions of the input sequence. Constants are small. Also stable.
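A minimal Insertion Sort sketch for reference (int elements and inclusive bounds lo..hi are conventions assumed throughout these notes); each element only moves past larger elements to its left, which is why the cost is proportional to the number of inversions:

    /* Sorts A[lo..hi] in place; stable, and linear in the number of
       inversions plus the block length. */
    void insertionsort(int A[], int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int v = A[i];
            int j = i;
            while (j > lo && A[j-1] > v) {   /* shift larger elements one slot right */
                A[j] = A[j-1];
                j--;
            }
            A[j] = v;
        }
    }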
Sorting Review Asymptotically optimal O(n log n) algorithms (worst case and average). - Merge Sort - Heap Sort Merge Sort purely sequential and stable. But requires extra memory: 2n + O(log n).
Quick Sort Overall fastest. In place. BUT: Worst case quadratic. Not stable. Implementation details messy.
Picking An Algorithm First Question: Is the input short? Short means something like n < 500. In this case Insertion Sort is probably the best choice. Don't bother with asymptotically faster methods.
Picking An Algorithm Second Question: Does the input have special properties? E.g., if the number of inversions is small, Insertion Sort may be the best choice. Or linear sorting methods may be appropriate.
Otherwise: Quick Sort Large inputs, comparison based method, stability not required (recall our stabilizer trick, though). Quick Sort is worst case quadratic, so why should it be the default candidate? On average, Quick Sort is O(n log n), and the constants are quite small.
Average ??? Average case analysis requires a probability distribution on the inputs: we have to average the running times. t(n) = Σ_x p_x · t(x), where the sum is over all instances x of size n and p_x is the probability of getting instance x. Often simply assume uniform distribution: every instance (of a certain size) is equally likely.
A Computation Can we write down a recurrence equation? Can we solve the equation? At least approximately? Is the solution (if any) practically relevant? (see handout from last time)
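For reference, one standard form of the answer, counting comparisons under the uniform-distribution assumption (every pivot rank equally likely, roughly n + 1 comparisons per partitioning pass), is the recurrence

    C(n) = (n+1) + (2/n) * [ C(0) + C(1) + ... + C(n-1) ],   C(0) = C(1) = 0,

whose solution grows as C(n) ≈ 2 n ln n ≈ 1.39 n log₂ n.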
Implementing Quick Sort
Pivot Selection Ideally, the pivot should be the median, but computing the exact median is much too slow to be of practical value. Instead either - pick the pivot at random, or - take the median of a small sample.
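For instance, a minimal median-of-three sketch (int elements and inclusive bounds lo..hi are the assumed conventions); the returned index would typically be swapped into the pivot position before partitioning:

    /* Index of the median of A[lo], A[mid], A[hi] -- a cheap proxy for
       the true median of the block.  (Random pivoting would instead use
       something like  lo + rand() % (hi - lo + 1).) */
    int median_of_three(const int A[], int lo, int hi) {
        int mid = lo + (hi - lo) / 2;
        int a = A[lo], b = A[mid], c = A[hi];
        if ((a <= b && b <= c) || (c <= b && b <= a)) return mid;
        if ((b <= a && a <= c) || (c <= a && a <= b)) return lo;
        return hi;
    }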
Partitioning Partitioning is easy if we use extra scratch space, but we would like to partition in place: elements have to be moved within the given block of the big array. Basic idea: use two pointers that sweep across the block from the left and from the right until each encounters an out-of-place element; then swap the two.
Doing quicksort in place [Figure: the left (L) and right (R) pointers sweep toward each other, swapping out-of-place elements until they cross.]
Pseudo Code The partitioning loop, assuming the pivot p has been placed at A[hi]:
    i = lo - 1;  j = hi;
    while( true ) {
        while( A[++i] < p )          // scan from the left for an element >= p
            ;
        while( p < A[--j] )          // scan from the right for an element <= p
            if( j == lo ) break;     // guard against running off the block
        if( i >= j ) break;          // pointers have crossed: done
        swap( i, j );                // both elements out of place: swap A[i], A[j]
    }
    swap( i, hi );                   // move the pivot into its final position
    return i;
Getting Out Using Quick Sort on very short arrays is a bad idea: the overhead becomes too large. So, when the block becomes short we should exit Quick Sort and switch to Insertion Sort. But not locally, as in:
    quicksort( A, lo, hi ) {
        if( hi - lo < magic_number )
            insertionsort( A, lo, hi );
        else …
Getting Out Just do nothing when the block is short. Then do one global cleanup with Insertion Sort:
    quicksort( A, 0, n );
    insertionsort( A, 0, n );
This is linear, since the number of remaining inversions is linear.
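Putting the pieces together, a hedged sketch of the driver (inclusive bounds and int elements as before; partition() is assumed to be the loop from the pseudo-code slide wrapped into a function after moving the chosen pivot to A[hi], insertionsort() as sketched earlier, and MAGIC is the small-block cutoff discussed next):

    #define MAGIC 10                                  /* small-block cutoff, see next slide */

    /* Recursive part: blocks shorter than MAGIC are simply left alone. */
    void quicksort(int A[], int lo, int hi) {         /* sorts A[lo..hi] */
        if (hi - lo < MAGIC) return;                  /* do nothing, clean up later */
        int i = partition(A, lo, hi);                 /* pivot ends up at index i */
        quicksort(A, lo, i - 1);
        quicksort(A, i + 1, hi);
    }

    /* Driver: one global Insertion Sort pass finishes the job.  Every
       element already sits inside its short block, so the remaining
       inversions number O(n * MAGIC) = O(n) and the pass is linear. */
    void sort(int A[], int n) {
        quicksort(A, 0, n - 1);
        insertionsort(A, 0, n - 1);
    }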
Magic Number The best way to determine the magic number is to run real-world tests. It seems that for current architectures, some value in the range 5 to 20 will work best.
Equal Elements Note that ideally pivoting should produce three sub-blocks: left:< p middle:== p right:> p Then the recursion could ignore the middle part, possibly omitting many elements.
Equal Elements Three natural strategies when a scan pointer reaches an element equal to the pivot: - both pointers stop - only one pointer stops - neither pointer stops Fact: The first strategy works best overall.
Equal Elements There are clever implementations that partition into three sub-blocks. This is amazingly hard to get both right and fast. Try it!
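One comparatively simple scheme is Dijkstra-style three-way partitioning; it is easy to get right, though not the fastest known. A sketch, again assuming int elements and inclusive bounds, with the pivot taken from A[lo] purely for illustration:

    /* Rearranges A[lo..hi] into   < p  |  == p  |  > p   and reports the
       middle block via *lt..*gt; the recursion only needs to revisit
       A[lo..*lt-1] and A[*gt+1..hi]. */
    void partition3(int A[], int lo, int hi, int *lt, int *gt) {
        int p = A[lo];                         /* pivot choice is illustrative */
        int i = lo;
        *lt = lo;  *gt = hi;
        while (i <= *gt) {
            if (A[i] < p) {                    /* grow the < block */
                int t = A[i]; A[i] = A[*lt]; A[*lt] = t;
                (*lt)++;  i++;
            } else if (A[i] > p) {             /* grow the > block */
                int t = A[i]; A[i] = A[*gt]; A[*gt] = t;
                (*gt)--;
            } else {                           /* equal to pivot: leave in the middle */
                i++;
            }
        }
    }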
Application: Quick Select
Selection (Order Statistics) A classical problem: given a list, find the k-th element in the ordered list. The brute-force approach sorts the whole list first, and thus produces more information than required. Can we get away with less than n log n work (in a comparison based world)?
Easy Cases Needless to say, when k is small there are easy answers. - Scan the array and keep track of the k smallest. - Use a Selection Sort approach. But how about general k?
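A sketch of the Selection Sort approach for small k (int elements assumed); it does k passes, each selecting the next smallest element, for O(n·k) work:

    /* After pass j, A[j] holds the (j+1)-st smallest element, so A[k-1]
       is the k-th smallest.  Only sensible when k is small. */
    int kth_smallest(int A[], int n, int k) {   /* 1 <= k <= n, 1-based rank */
        for (int j = 0; j < k; j++) {
            int min = j;
            for (int i = j + 1; i < n; i++)
                if (A[i] < A[min]) min = i;
            int t = A[j]; A[j] = A[min]; A[min] = t;
        }
        return A[k-1];
    }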
Selection and Partitioning
    qselect( A, lo, hi, k ) {
        if( hi <= lo ) return;
        i = partition( A, lo, hi );
        if( i > k ) qselect( A, lo, i-1, k );
        if( i < k ) qselect( A, i+1, hi, k );
    }
This looks like a typo. What’s really going on here?
Quick Select What should we expect as running time? As usual, if there is a ghost in the machine, it could force quadratic behavior. But on average this algorithm is linear. Don’t get any ideas about using this to find the median in the pivoting step of Quick Sort!
Some Timing Results
The Real World Beyond asymptotic analysis, it is always a good idea to do some real world testing. Construct a small test-bed: - automate testing - flexible but simple - organize the data in a useful way
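A minimal skeleton for such a test-bed, assuming the sort(A, n) driver sketched earlier; the array sizes, the fixed seed, and the uniform random input model are illustrative choices only:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Times sort() on random inputs of increasing size and prints a small
       table.  Swap in other input generators (sorted, reversed, few
       distinct keys) to probe the special cases discussed earlier. */
    int main(void) {
        srand(12345);                                  /* fixed seed: reproducible runs */
        for (int n = 1000; n <= 1000000; n *= 10) {
            int *A = malloc(n * sizeof *A);
            for (int i = 0; i < n; i++) A[i] = rand();
            clock_t start = clock();
            sort(A, n);                                /* driver sketched earlier */
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
            printf("n = %8d   %8.3f s\n", n, secs);
            free(A);
        }
        return 0;
    }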