The Selection Problem
2 Median and Order Statistics
In this section, we will study algorithms for finding the i-th smallest element in a set of n elements.
We will again use divide-and-conquer algorithms.
3 The Selection Problem
Input: A set A of n (distinct) numbers and a number i, with 1 ≤ i ≤ n
Output: The element x ∈ A that is larger than exactly i – 1 other elements of A
– x is the i-th smallest element
– i = 1: minimum
– i = n: maximum
4 The Selection Problem (cont)
A simple solution
– Sort A
– Return A[i]
– This is Θ(n lg n)
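The sort-then-index baseline can be sketched in a few lines of Python. The function name `select_by_sorting` is made up for this illustration, and the rank i is 1-indexed as on the slides.

```python
def select_by_sorting(values, i):
    """Return the i-th smallest element (1-indexed) by sorting a copy."""
    if not 1 <= i <= len(values):
        raise ValueError("i must satisfy 1 <= i <= n")
    # sorted() returns a new list, so the caller's data is left untouched.
    return sorted(values)[i - 1]


print(select_by_sorting([7, 2, 9, 4, 1], 2))  # prints 2
```

The rest of the section is about beating this baseline, since sorting does more work than selection needs.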
5 Minimum and Maximum
Finding the minimum and maximum
– Takes n – 1 comparisons (Θ(n))
– This is the best we can do: n – 1 is optimal with respect to the number of comparisons

MINIMUM(A)
    min ← A[1]
    for i ← 2 to length(A)
        if min > A[i]
            min ← A[i]
    return min

MAXIMUM(A)
    max ← A[1]
    for i ← 2 to length(A)
        if max < A[i]
            max ← A[i]
    return max
6 Minimum and Maximum (cont)
Simultaneous minimum and maximum
– Obvious solution is 2(n – 1) comparisons
– But we can do better, namely at most 3⌊n/2⌋ comparisons
– The algorithm
    If n is odd, set max and min to the first element
    If n is even, compare the first two elements and set max, min
    Process the remaining elements in pairs
    Find the larger and the smaller of the pair
    Compare the larger of the pair with the current max
    And the smaller of the pair with the current min
7 Minimum and Maximum (cont)
– Total number of comparisons
    If n is odd: 3⌊n/2⌋ comparisons
    If n is even
    – 1 initial comparison
    – And 3(n – 2)/2 comparisons
    – For a total of 3n/2 – 2 comparisons
    In either case, the total number of comparisons is at most 3⌊n/2⌋
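The pairing scheme can be sketched in Python as follows. The function name `min_max` and the explicit comparison counter are additions for illustration; the counter makes it possible to check the totals 3⌊n/2⌋ (n odd) and 3n/2 – 2 (n even).

```python
def min_max(values):
    """Return (minimum, maximum, comparisons) using the pairing scheme."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    comparisons = 0
    if n % 2 == 1:
        # Odd n: the first element is both the initial min and max.
        lo = hi = values[0]
        start = 1
    else:
        # Even n: one comparison orders the first two elements.
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    # The remaining elements form whole pairs; each pair costs 3 comparisons.
    for j in range(start, n, 2):
        a, b = values[j], values[j + 1]
        comparisons += 3
        small, large = (a, b) if a < b else (b, a)
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons


print(min_max([3, 1, 4, 1, 5, 9, 2, 6]))  # prints (1, 9, 10)
```

For the 8-element input above, the count is 1 + 3·3 = 10 = 3·8/2 – 2, compared with 2(n – 1) = 14 for two separate passes.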
8 Selection in Expected Linear Time
Goal: Select the i-th smallest element from A[p..r].
Partition into A[p..q-1] and A[q+1..r], with the pivot in A[q]
If the pivot A[q] is the i-th smallest element of A[p..r]
– then return A[q]
If the i-th smallest element is in A[p..q-1]
– then recurse on A[p..q-1]
– else recurse on A[q+1..r]
9 Selection in Expected Linear Time (cont)
Randomized-Select(A, p, r, i)
    if p = r
        return A[p]
    q ← Randomized-Partition(A, p, r)
    k ← q - p + 1        // number of elements in the low side of the partition + pivot
    if i = k             // the pivot value is the answer
        return A[q]
    else if i < k
        return Randomized-Select(A, p, q-1, i)
    else
        return Randomized-Select(A, q+1, r, i-k)
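A minimal Python sketch of Randomized-Select follows. It assumes 0-indexed bounds p and r (the pseudocode is 1-indexed) and a Lomuto-style partition with a randomly chosen pivot; the slides do not show the body of Randomized-Partition, so that part is an assumption of this sketch.

```python
import random


def randomized_partition(a, p, r):
    """Partition a[p..r] in place around a random pivot; return its final index."""
    s = random.randint(p, r)          # random pivot position, inclusive bounds
    a[s], a[r] = a[r], a[s]           # move the pivot to the end
    pivot = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]   # place the pivot between the two sides
    return i + 1


def randomized_select(a, p, r, i):
    """Return the i-th smallest (1-indexed) element of a[p..r]; reorders a."""
    if p == r:
        return a[p]
    q = randomized_partition(a, p, r)
    k = q - p + 1                     # size of the low side plus the pivot
    if i == k:
        return a[q]                   # the pivot value is the answer
    if i < k:
        return randomized_select(a, p, q - 1, i)
    return randomized_select(a, q + 1, r, i - k)


data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
print(randomized_select(list(data), 0, len(data) - 1, 5))  # prints 5
```

Because only one side of the partition is ever searched, the recursion is a simple tail call and could be rewritten as a loop without changing the analysis.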
11 Analysis of Selection Algorithm
Worst-case running time is Θ(n²)
– Partition takes Θ(n)
– If we always partition around the largest remaining element, we reduce the partition size by only one element each time
What is the best case?
12 Analysis of Selection Algorithm (cont)
Average Case
– Average-case running time is Θ(n)
– The time required on an input of n elements is the random variable T(n)
– We want an upper bound on E[T(n)]
– In Randomized-Partition, all elements are equally likely to be the pivot
13 Analysis of Selection Algorithm (cont)
– So, for each k such that 1 ≤ k ≤ n, the subarray A[p..q] has k elements (all less than or equal to the pivot) with probability 1/n
– For k = 1, 2, …, n we define indicator random variables X_k where
    X_k = I{the subarray A[p..q] has exactly k elements}
– So, E[X_k] = 1/n
14 Analysis of Selection Algorithm (cont)
– When we choose the pivot element (which ends up in A[q]) we do not know what will happen next
    Do we return with the i-th element (k = i)?
    Do we recurse on A[p..q-1]?
    Do we recurse on A[q+1..r]?
– The decision depends on i in relation to k
– We will find an upper bound on the average case by assuming that the i-th element is always in the larger partition
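15 Analysis of Selection Algorithm (cont)
– Now, X_k = 1 for just one value of k, 0 for all others
– When X_k = 1, the two subarrays have sizes k – 1 and n – k
– Hence the recurrence:
$$T(n) \le \sum_{k=1}^{n} X_k \cdot \bigl( T(\max(k-1,\, n-k)) + O(n) \bigr) = \sum_{k=1}^{n} X_k \cdot T(\max(k-1,\, n-k)) + O(n)$$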
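16 Analysis of Selection Algorithm (cont)
– Taking the expected values (X_k and T(max(k – 1, n – k)) are independent, and E[X_k] = 1/n):
$$\begin{aligned}
E[T(n)] &\le \sum_{k=1}^{n} E\bigl[ X_k \cdot T(\max(k-1,\, n-k)) \bigr] + O(n) \\
&= \sum_{k=1}^{n} E[X_k] \cdot E\bigl[ T(\max(k-1,\, n-k)) \bigr] + O(n) \\
&= \frac{1}{n} \sum_{k=1}^{n} E\bigl[ T(\max(k-1,\, n-k)) \bigr] + O(n)
\end{aligned}$$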
17 Analysis of Selection Algorithm (cont)
Looking at the expression max(k – 1, n – k)
If n is even, each term from T(⌈n/2⌉) up to T(n – 1) appears exactly twice in the summation
If n is odd, each term from T(⌈n/2⌉) up to T(n – 1) appears twice and T(⌊n/2⌋) appears once in the summation
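18 Analysis of Selection Algorithm (cont)
– Thus we have
$$E[T(n)] \le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + O(n)$$
– We use substitution to solve the recurrence
– Note: T(n) = O(1) for n less than some constant
– Assume that E[T(n)] ≤ cn for some constant c that satisfies the initial conditions of the recurrence
– Let a be a constant such that the O(n) term is bounded above by an for all n > 0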
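19 Analysis of Selection Algorithm (cont)
– Using this inductive hypothesis (E[T(k)] ≤ ck, with the O(n) term bounded above by an)
$$\begin{aligned}
E[T(n)] &\le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} ck + an \\
&= \frac{2c}{n} \left( \sum_{k=1}^{n-1} k - \sum_{k=1}^{\lfloor n/2 \rfloor - 1} k \right) + an \\
&= \frac{2c}{n} \left( \frac{(n-1)n}{2} - \frac{(\lfloor n/2 \rfloor - 1)\lfloor n/2 \rfloor}{2} \right) + an
\end{aligned}$$
20 Analysis of Selection Algorithm (cont)
$$\begin{aligned}
E[T(n)] &\le \frac{2c}{n} \left( \frac{(n-1)n}{2} - \frac{(n/2 - 2)(n/2 - 1)}{2} \right) + an \\
&= \frac{2c}{n} \left( \frac{n^2 - n}{2} - \frac{n^2/4 - 3n/2 + 2}{2} \right) + an \\
&= \frac{c}{n} \left( \frac{3n^2}{4} + \frac{n}{2} - 2 \right) + an \\
&= c \left( \frac{3n}{4} + \frac{1}{2} - \frac{2}{n} \right) + an \\
&\le \frac{3cn}{4} + \frac{c}{2} + an \\
&= cn - \left( \frac{cn}{4} - \frac{c}{2} - an \right)
\end{aligned}$$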
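21 Analysis of Selection Algorithm (cont)
– To complete the proof, we need to show that for sufficiently large n, this last expression, cn – (cn/4 – c/2 – an), is at most cn, i.e.
$$\frac{cn}{4} - \frac{c}{2} - an \ge 0 \iff n \left( \frac{c}{4} - a \right) \ge \frac{c}{2}$$
– As long as we choose the constant c so that c/4 – a > 0 (i.e., c > 4a), we can divide both sides by c/4 – a:
$$n \ge \frac{c/2}{c/4 - a} = \frac{2c}{c - 4a}$$
22 Analysis of Selection Algorithm (cont)
– Thus, if we assume that T(n) = O(1) for n < 2c/(c – 4a), we have E[T(n)] = O(n)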
23 Selection in Worst-Case Linear Time
“Median of Medians” algorithm
It guarantees a good split when the array is partitioned
– Partition is modified so that the pivot now becomes an input parameter
The algorithm:
– If n = 1, return A[1]
24 Selection in Worst-Case Linear Time (cont)
1. Divide the n elements of the input array into ⌊n/5⌋ groups of 5 elements each and at most one group made up of the remaining n mod 5 elements
2. Find the median of each of the ⌈n/5⌉ groups by using insertion sort to sort each group and then picking its median (the 3rd element of a full group)
3. Use Select recursively to find the median x of the ⌈n/5⌉ medians found in step 2.
   – If there is an even number of medians, choose the lower median
25 Selection in Worst-Case Linear Time (cont)
4. Partition the input array around the “median of medians” x using the modified version of Partition. Let k be one more than the number of elements on the low side of the partition, so that x is the k-th smallest element and there are n – k elements on the high side of the partition
5. If i = k, then return x. Otherwise, use Select recursively to find the i-th smallest element on the low side if i < k, or the (i – k)-th smallest element on the high side if i > k
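Steps 1 through 5 can be sketched in Python as below. This is an illustration under stated assumptions rather than the slides' exact procedure: it assumes distinct values (as the problem statement does), builds new lists instead of partitioning in place, uses Python's built-in `sorted` on each group of at most 5 in place of insertion sort, and treats any input of 5 or fewer elements as the base case instead of n = 1.

```python
def select(values, i):
    """Return the i-th smallest (1-indexed) of distinct values in worst-case linear time."""
    if len(values) <= 5:
        # Base case (an assumption of this sketch): sort the tiny input directly.
        return sorted(values)[i - 1]
    # Steps 1-2: split into groups of 5 (the last may be smaller) and take
    # each group's lower median.
    groups = [sorted(values[j:j + 5]) for j in range(0, len(values), 5)]
    medians = [g[(len(g) - 1) // 2] for g in groups]
    # Step 3: recursively find the lower median x of the medians.
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x (new lists rather than in-place swaps).
    low = [v for v in values if v < x]
    high = [v for v in values if v > x]
    k = len(low) + 1                  # x is the k-th smallest element
    # Step 5: return x or recurse into the side that holds the answer.
    if i == k:
        return x
    if i < k:
        return select(low, i)
    return select(high, i - k)


data = [(37 * j) % 125 + 1 for j in range(125)]  # a permutation of 1..125
print(select(data, 63))  # prints 63
```

The list-based partition uses extra memory but keeps the five steps easy to match against the numbered description.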
26 Selection in Worst-Case Linear Time (cont)
Example of “Median of Medians”
– Input array A[1..125]
– Step 1: 25 groups of 5
– Step 2: We get 25 medians
– Step 3:
    Step 1: Using the 25 medians we get 5 groups of 5
    Step 2: We get 5 medians
    Step 3:
        Step 1: Using the 5 medians, we get 1 group of 5
        Step 2: We get 1 median
– Step 4: Partition A around the median
27 Analyzing “Median of Medians” Analyzing “median of medians” – The following diagram might be helpful:
28 Analyzing “Median of Medians” (cont)
– First, we need to put a lower bound on how many elements are greater than x (the pivot)
– How many of the medians are greater than or equal to x?
    At least half of the medians from the ⌈n/5⌉ groups
    – Why “at least half?”
    At least ⌈(1/2)⌈n/5⌉⌉ medians are greater than or equal to x
29 Analyzing “Median of Medians” (cont)
– Each of these medians contributes at least 3 elements greater than x, except for two groups
    The group that contains x – contributes only 2 elements greater than x
    The group that has fewer than 5 elements
– So the total number of elements > x is at least:
$$3 \left( \left\lceil \frac{1}{2} \left\lceil \frac{n}{5} \right\rceil \right\rceil - 2 \right) \ge \frac{3n}{10} - 6$$
    (the – 2 accounts for the two discarded groups)
30 Analyzing “Median of Medians” (cont)
– Similarly, there are at least 3n/10 – 6 elements smaller than x
– Thus, in the worst case, for Step 5 Select is called recursively on the largest partition
    The largest partition has at most 7n/10 + 6 elements
    This is the size of the array minus the number of elements in the smaller partition: n – (3n/10 – 6) = 7n/10 + 6
31 Analyzing “Median of Medians” (cont)
– Developing the recurrence:
    Step 1 takes O(n) time
    Step 2 takes O(n) time
    – O(n) calls to Insertion Sort on sets of size O(1)
    Step 3 takes T(⌈n/5⌉) time
    Step 4 takes O(n) time
    Step 5 takes at most T(7n/10 + 6) time
32 Analyzing “Median of Medians” (cont)
– So the recurrence is
$$T(n) \le T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n)$$
– Now use substitution to solve
    Assume T(n) ≤ cn for some suitably large constant c and all n > ???
    Also pick a constant a such that the function described by the O(n) term is bounded above by an for all n > 0
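33 Analyzing “Median of Medians” (cont)
$$\begin{aligned}
T(n) &\le c \lceil n/5 \rceil + c(7n/10 + 6) + an \\
&\le cn/5 + c + 7cn/10 + 6c + an \qquad \text{(comes from removing the } \lceil\ \rceil \text{)} \\
&= 9cn/10 + 7c + an \\
&= cn + (-cn/10 + 7c + an)
\end{aligned}$$
Which is at most cn if
$$-cn/10 + 7c + an \le 0 \iff c \ge 10a \cdot \frac{n}{n - 70} \quad \text{when } n > 70$$
If n = 70, then this inequality is undefined
34 Analyzing “Median of Medians” (cont)
– We assume that n ≥ 71, so n/(n – 70) ≤ 71
– Choosing c ≥ 710a will satisfy the inequality on the previous slide
– You could choose any constant > 70 to be the base case constant
Thus, the selection problem can be solved in the worst case in linear time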
35 Review of Sorts
Review of sorts seen so far
– Insertion Sort
    Easy to code
    Fast on small inputs (less than ~50)
    Fast on nearly sorted inputs
    Stable
    Θ(n) best case (sorted list)
    Θ(n²) average case
    Θ(n²) worst case (reverse sorted list)
36 Review of Sorts Stable means that numbers with the same value appear in the output array in the same order as they do in the input array. That is, ties between two numbers are broken by the rule that whichever number appears first in the input array appears first in the output array. Normally, the property of stability is important only when satellite data are carried around with the element being sorted.
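Stability can be seen with a small insertion sort over records that carry satellite data. The helper name `insertion_sort_by_key` and the sample records are made up for this illustration; the strict `>` in the inner loop is what keeps equal keys in their input order.

```python
def insertion_sort_by_key(items, key):
    """Return a new list sorted by key(item) using a stable insertion sort."""
    out = list(items)
    for j in range(1, len(out)):
        cur = out[j]
        i = j - 1
        # Shift only strictly larger keys, so equal keys never swap order.
        while i >= 0 and key(out[i]) > key(cur):
            out[i + 1] = out[i]
            i -= 1
        out[i + 1] = cur
    return out


records = [("carol", 3), ("alice", 1), ("bob", 3), ("dave", 1)]
print(insertion_sort_by_key(records, key=lambda rec: rec[1]))
# prints [('alice', 1), ('dave', 1), ('carol', 3), ('bob', 3)]
```

Among the records with key 1, alice still precedes dave, and among those with key 3, carol still precedes bob, exactly as in the input. Changing `>` to `>=` would still sort correctly but would no longer be stable.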
37 Review of Sorts (cont)
– MergeSort
    Divide and Conquer algorithm
    Doesn’t sort in place
    Requires memory as a function of n
    Stable
    Θ(n lg n) best case
    Θ(n lg n) average case
    Θ(n lg n) worst case
38 Review of Sorts (cont)
– QuickSort
    Divide and Conquer algorithm
    – No merge step needed
    Small constants
    Fast in practice
    Not stable
    Θ(n lg n) best case
    Θ(n lg n) average case
    Θ(n²) worst case
39 Review of Sorts (cont)
Several of these algorithms sort in O(n lg n) time
– MergeSort in the worst case
– QuickSort on average
For each of these algorithms, there is some input on which it takes Ω(n lg n) time
The sorted order they determine is based only on comparisons between the input elements
They are called comparison sorts
40 Review of Sorts (cont) Other techniques for sorting exist, such as Linear Sorting which is not based on comparisons Usually with some restrictions or assumptions on input elements Linear Sorting techniques include: – Counting Sort – Radix Sort – Bucket Sort
41 Lower Bounds for Sorting
In general, assuming unique inputs, comparison sorts are expressed in terms of a_i ≤ a_j comparisons.
– a_i ≤ a_j, a_i ≥ a_j, a_i < a_j, and a_i > a_j are equivalent in learning about the order of a_i and a_j
What is the best we can do on the worst-case type of input?
What is the best worst-case running time?
42 The Decision-Tree Model
[Figure: decision tree for n = 3 with input a_1, a_2, a_3. Internal nodes: 1:2, 2:3, 1:3. Leaves: 1,2,3; 2,1,3; 1,3,2; 3,1,2; 2,3,1; 3,2,1]
# possible outputs = 3! = 6
Each possible output is a leaf
43 Analysis of Decision-Tree Model
The worst-case number of comparisons is equal to the height of the decision tree
A lower bound on the worst-case running time is a lower bound on the height of the decision tree.
Note that the number of leaves in the decision tree is ≥ n!, where n = number of elements in the input sequence
44 Theorem 8.1
Any comparison sort algorithm requires Ω(n lg n) comparisons in the worst case
Proof:
– Consider a decision tree of height h that sorts n elements
– Since there are n! permutations of n elements, each permutation representing a distinct sorted order, the tree must have at least n! leaves
45 Theorem 8.1 (cont)
– A binary tree of height h has at most 2^h leaves
$$n! \le 2^h \implies h \ge \lg(n!) = \Omega(n \lg n) \qquad \text{(by equation 3.18)}$$
The best possible worst-case running time for comparison sorts is thus Ω(n lg n)
MergeSort, which is O(n lg n), is asymptotically optimal
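The bound h ≥ lg(n!) can be evaluated exactly for small n. The sketch below computes ⌈lg(n!)⌉ with integer arithmetic; the function name `min_worst_case_comparisons` is made up for this illustration, and the bit-length trick relies on ⌈lg m⌉ = bit length of (m – 1) for any integer m ≥ 1.

```python
import math


def min_worst_case_comparisons(n):
    """Return ceil(lg(n!)), the decision-tree lower bound on worst-case comparisons."""
    # For an integer m >= 1, ceil(log2(m)) equals (m - 1).bit_length().
    return (math.factorial(n) - 1).bit_length()


for n in (3, 4, 5, 10):
    print(n, min_worst_case_comparisons(n))
```

For n = 3 the result is 3, which matches the height of the three-element decision tree, whose 3! = 6 leaves cannot fit in a binary tree of height 2.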
46 Sorting in Linear Time How can we do better? – CountingSort – RadixSort – BucketSort