CSE 326 Sorting David Kaplan Dept of Computer Science & Engineering Autumn 2001.

1 CSE 326 Sorting David Kaplan Dept of Computer Science & Engineering Autumn 2001

2 Outline
- Sorting: The Problem Space
- Sorting by Comparison
  - Lower bound for comparison sorts
  - Insertion Sort
  - Heap Sort
  - Merge Sort
  - Quick Sort
- External Sorting
- Comparison of Sorting by Comparison

3 Sorting: The Problem Space
General problem: given a set of N orderable items, put them in order.
Without (significant) loss of generality, assume:
- Items are integers
- Ordering is ≤
(Most sorting problems map to the above in linear time.)

4 Lower Bound for Sorting by Comparison
Sorting by Comparison:
- The only information available to us is the set of N items to be sorted
- The only operation available to us is pairwise comparison between 2 items
What is the best running time we can possibly achieve?

5 Decision Tree Analysis of Sorting by Comparison

6 Max depth of decision tree
- How many permutations are there of N numbers?
- How many leaves does the tree have?
- What's the shallowest tree with a given number of leaves?
- What is therefore the worst running time (number of comparisons) of the best possible sorting algorithm?

7 Lower Bound for Comparison Sort
Stirling's approximation: n! ≈ √(2πn) (n/e)^n
- Any comparison sort of n items must have a decision tree with n! leaves, and hence of height at least log(n!)
- log(n!) = Ω(n log n)
Thus, no comparison sort can do better than Ω(n log n) comparisons in the worst case.
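The counting argument above can be sanity-checked numerically. A small sketch (the function name `min_comparisons` is ours, not from the slides): the decision tree needs n! leaves, so its height, i.e. the worst-case number of comparisons, is at least ⌈log₂(n!)⌉, which tracks n log₂ n.

```python
import math

def min_comparisons(n):
    """Information-theoretic lower bound on worst-case comparisons:
    a decision tree sorting n items needs n! leaves, hence height >= log2(n!)."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (4, 8, 16):
    # Compare the exact bound with the cruder n*log2(n) estimate.
    print(n, min_comparisons(n), math.ceil(n * math.log2(n)))
```

For n = 4 this gives 5 comparisons minimum (log₂ 24 ≈ 4.58), already below the crude n log₂ n = 8, but both grow at the same Θ(n log n) rate.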

8 Insertion Sort
Basic idea:
- After the kth pass, ensure that the first k+1 elements are sorted
- On the kth pass, swap the (k+1)th element to the left as necessary

Start:        7 2 8 3 5 9 6
After Pass 1: 2 7 8 3 5 9 6
After Pass 2: 2 7 8 3 5 9 6
After Pass 3: 2 3 7 8 5 9 6   (3 swaps leftward: 2 7 8 3 → 2 7 3 8 → 2 3 7 8)

What if the array is initially sorted? Reverse sorted?
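The passes above can be sketched as runnable Python (a minimal version of the slide's idea; it shifts larger elements right rather than performing repeated pairwise swaps, which is equivalent in comparison count):

```python
def insertion_sort(a):
    """Sort list a in place; pass k inserts a[k] into the sorted prefix a[:k]."""
    for k in range(1, len(a)):
        x = a[k]
        j = k
        # Shift larger elements one slot right until x's position is found.
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
    return a

print(insertion_sort([7, 2, 8, 3, 5, 9, 6]))  # -> [2, 3, 5, 6, 7, 8, 9]
```

On an already-sorted array the inner loop never runs, giving O(N); on a reverse-sorted array every element shifts all the way left, giving O(N²).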

9 Why Insertion Sort is Slow
Inversion: a pair (i, j) such that i < j but Array[i] > Array[j]
- An array of size N can have Θ(N²) inversions
- The average number of inversions in a random set of elements is N(N-1)/4
- Insertion Sort only swaps adjacent elements, so each swap removes only 1 inversion!

10 HeapSort
Sorting via Priority Queue (Heap)
Shove items into a priority queue, take them out smallest to largest.
Worst case: Θ(N log N)
Best case: Θ(N log N)
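A sketch of "sort via priority queue" using Python's `heapq` module (a min-heap over a list; this is our illustration, not the slides' code):

```python
import heapq

def heap_sort(items):
    """Sorting via a priority queue: heapify everything, pop smallest-first."""
    heap = list(items)
    heapq.heapify(heap)   # build the heap in O(N)
    # N pops at O(log N) each: Theta(N log N) regardless of input order.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sort([7, 2, 8, 3, 5, 9, 6]))  # -> [2, 3, 5, 6, 7, 8, 9]
```

The N pops dominate, which is why best and worst case are both Θ(N log N): the heap's shape does not depend on how sorted the input was.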

11 MergeSort
Merging cars by key [aggressiveness of driver]: most aggressive goes first.
MergeSort(Table[1..n]):
- Split Table in half
- Recursively sort each half
- Merge the two sorted halves together
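The three steps can be sketched directly in Python (an out-of-place version, which is why MergeSort has the extra-memory cost discussed below):

```python
def merge_sort(table):
    """MergeSort: split the table in half, recursively sort each half, merge."""
    if len(table) <= 1:
        return table
    mid = len(table) // 2
    left, right = merge_sort(table[:mid]), merge_sort(table[mid:])
    # Merge the two sorted halves.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # <= keeps the sort stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]  # append whichever half has leftovers

print(merge_sort([7, 2, 8, 3, 5, 9, 6]))  # -> [2, 3, 5, 6, 7, 8, 9]
```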

12 MergeSort Analysis
Running time:
- Worst case?
- Best case?
- Average case?
Other considerations besides running time?

13 QuickSort
Basic idea:
- Pick a "pivot"
- Divide into less-than and greater-than the pivot
- Sort each side recursively

14 QuickSort Partition
Pick pivot:  7 2 8 3 5 9 6   (pivot = 7)
Partition with cursors, < moving right and > moving left:
  7 2 8 3 5 9 6
2 goes to less-than.

15 QuickSort Partition (cont'd)
  7 2 6 3 5 9 8   (6 and 8 swap between the less-than and greater-than sides)
  7 2 6 3 5 9 8   (3, 5 go to less-than; 9 to greater-than)
Partition done. Recursively sort each side.
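The two-cursor partition above can be sketched as runnable Python (our reconstruction of the slides' walkthrough, with the first element as pivot; real implementations pick the pivot more carefully, as the next slides discuss):

```python
def partition(a, lo, hi):
    """Two-cursor partition of a[lo..hi]; pivot = a[lo], as in the slides."""
    pivot = a[lo]
    i, j = lo + 1, hi
    while True:
        while i <= j and a[i] < pivot:   # left cursor skips less-thans
            i += 1
        while i <= j and a[j] > pivot:   # right cursor skips greater-thans
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]          # swap a misplaced pair (like 8 and 6)
        i += 1
        j -= 1
    a[lo], a[j] = a[j], a[lo]            # put the pivot between the two sides
    return j

def quick_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        quick_sort(a, lo, p - 1)         # recursively sort each side
        quick_sort(a, p + 1, hi)
    return a

print(quick_sort([7, 2, 8, 3, 5, 9, 6]))  # -> [2, 3, 5, 6, 7, 8, 9]
```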

16 Analyzing QuickSort
- Picking the pivot: constant time
- Partitioning: linear time
- Recursion: time to sort the left partition (say of size i) + time for the right (size N-i-1)

T(1) = b
T(N) = T(i) + T(N-i-1) + cN
where i is the number of elements smaller than the pivot.

17 QuickSort Worst Case
Pivot is always the smallest element, so i = 0:
T(N) = T(i) + T(N-i-1) + cN
T(N) = T(N-1) + cN
     = T(N-2) + c(N-1) + cN
     = T(N-k) + c(N + (N-1) + ... + (N-k+1))
     = O(N²)

18 Optimizing QuickSort
Choosing the pivot:
- Randomly choose the pivot: good theoretically and practically, but the call to the random number generator can be expensive
- Pick the pivot cleverly: the "median-of-3" rule takes Median(first, middle, last); works well in practice
Cutoff:
- Use a simpler sorting technique below a certain problem size (Weiss suggests using insertion sort, with a cutoff limit of 5-20)
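The two pivot strategies can be sketched as follows (function names are ours; a full hybrid would also fall back to insertion sort below the cutoff size):

```python
import random

def median_of_3(a, lo, hi):
    """Median-of-3 rule: pivot is the median of a[first], a[middle], a[last]."""
    mid = (lo + hi) // 2
    return sorted((a[lo], a[mid], a[hi]))[1]

def random_pivot(a, lo, hi):
    """Random pivot: good expected behavior, but one RNG call per partition."""
    return a[random.randint(lo, hi)]
```

Median-of-3 also defuses the sorted-input worst case: on an already-sorted array the median of first/middle/last is the true median, giving balanced partitions.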

19 QuickSort Best Case
Pivot is always the median element:
T(N) = T(i) + T(N-i-1) + cN
T(N) = 2T(N/2 - 1) + cN, which solves to O(N log N).

20 QuickSort Average Case
Assume all partition sizes are equally likely, with probability 1/N.
The recurrence then solves to O(N log N) on average (details: Weiss, pp. 278-279).

21 External Sorting
When you just ain't got enough RAM ...
- e.g. sort 10 billion numbers with 1 MB of RAM
- Databases need to be very good at this
MergeSort: good for something!
- Basis for most external sorting routines
- Can sort any number of records using a tiny amount of main memory; in the extreme case, keep only 2 records in memory at once!

22 External MergeSort
- Split the input onto two tapes
- Each group of 1 record is sorted by definition, so merge groups of 1 into groups of 2, again split between two tapes
- Merge groups of 2 into groups of 4
- Repeat until the data is entirely sorted
log N passes

23 Better External MergeSort
Suppose main memory can hold M records:
- Read in groups of M records and sort them (e.g. with QuickSort)
- Number of passes reduced to log(N/M)
k-way mergesort reduces the number of passes to log_k(N/M)
- Requires 2k output devices (e.g. mag tapes)
But wait, there's more ... Polyphase merge does a k-way mergesort using only k+1 output devices (plus kth-order Fibonacci numbers!)
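The "sort M-record runs, then k-way merge" scheme can be sketched with in-memory lists standing in for tapes (names are ours; in a real external sort the runs would live in files, and `heapq.merge` would stream from them):

```python
import heapq

def external_sort(records, M):
    """Sketch of the better external MergeSort:
    sort each M-record chunk into a run, then k-way merge all runs."""
    runs = [sorted(records[i:i + M])              # one sorted "run" per chunk
            for i in range(0, len(records), M)]
    # heapq.merge lazily k-way merges already-sorted iterables.
    return list(heapq.merge(*runs))

print(external_sort([5, 1, 4, 2, 3, 9, 0], M=3))  # -> [0, 1, 2, 3, 4, 5, 9]
```

With memory for M records the run-formation pass leaves only N/M runs to merge, which is where the log(N/M) pass count comes from.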

24 Sorting by Comparison: Summary
- Sorting algorithms that only compare adjacent elements are Ω(N²) worst case, but may be O(N) best case
- HeapSort: Θ(N log N) both best and worst case
  - Suffers from two test-ops per data move
- MergeSort: Θ(N log N) running time
  - Suffers from the extra-memory problem
- QuickSort: Θ(N²) worst case, Θ(N log N) best and average case
  - In practice, median-of-3 almost always gets us Θ(N log N)
  - The big win comes from {sorting in place, one test-op, few swaps}!
- Any comparison-based sorting algorithm is Ω(N log N)
- External sorting: MergeSort with Θ(log(N/M)) passes

25 Sorting: The Problem Space Revisited
General problem: given a set of N orderable items, put them in order.
Without (significant) loss of generality, assume:
- Items are integers
- Ordering is ≤
Most sorting problems can be mapped to the above in linear time.
But what if we have more information available to us?

26 Sorting in Linear Time
Sorting by Comparison:
- The only information available to us is the set of N items to be sorted
- The only operation available to us is pairwise comparison between 2 items
- Best running time is O(N log N), given these constraints
What if we relax the constraints?
- Know something in advance about the item values

27 BinSort (a.k.a. BucketSort)
- If keys are known to be in {1, ..., K}
- Have an array of size K
- Put items into the correct bin (cell) of the array, based on key

28 BinSort Example
K = 5, list = (5,1,3,4,3,2,1,1,5,4,5)

Bins in array:
key = 1: 1, 1, 1
key = 2: 2
key = 3: 3, 3
key = 4: 4, 4
key = 5: 5, 5, 5

Sorted list: 1,1,1,2,3,3,4,4,5,5,5

29 BinSort Pseudocode
procedure BinSort (List L, K)
  LinkedList bins[1..K]
  // Each element of array bins is a linked list.
  // Could also BinSort with an array of arrays.
  For Each number x in L
    bins[x].Append(x)
  End For
  For i = 1..K
    For Each number x in bins[i]
      Print x
    End For
  End For
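The pseudocode translates almost line-for-line into runnable Python (lists stand in for the linked lists; this is our rendering, returning the sorted list rather than printing):

```python
def bin_sort(L, K):
    """BinSort: append each item to the bin for its key (1..K),
    then read the bins out in key order. Appending preserves input
    order within a bin, which is what makes BinSort stable."""
    bins = [[] for _ in range(K + 1)]   # index 0 unused; keys are 1..K
    for x in L:
        bins[x].append(x)
    return [x for b in bins for x in b]

print(bin_sort([5, 1, 3, 4, 3, 2, 1, 1, 5, 4, 5], 5))
# -> [1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5]
```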

30 BinSort Running Time
- K is a constant: BinSort is linear time
- K is variable: not simply linear time
- K is large (e.g. 2^32): impractical

31 BinSort is "Stable"
Stable sorting algorithm:
- Items in the input with the same key end up in the same relative order as when they began
- Important if keys have associated values
- Critical for RadixSort

32 Mr. Radix
Herman Hollerith invented and developed a punch-card tabulation machine system that revolutionized statistical computation. Born in Buffalo, New York, the son of German immigrants, Hollerith enrolled in the City College of New York at age 15 and graduated from the Columbia School of Mines with distinction at the age of 19. His first job was with the U.S. Census effort of 1880. Hollerith successively taught mechanical engineering at the Massachusetts Institute of Technology and worked for the U.S. Patent Office. The young engineer developed an electrically actuated brake system for the railroads, but the Westinghouse steam-actuated brake prevailed. Hollerith began working on the tabulating system during his days at MIT, filing for the first patent in 1884. He developed a hand-fed 'press' that sensed the holes in punched cards; a wire would pass through the holes into a cup of mercury beneath the card, closing the electrical circuit. This process triggered mechanical counters and sorter bins and tabulated the appropriate data. Hollerith's system (including punch, tabulator, and sorter) allowed the official 1890 population count to be tallied in six months, and in another two years all the census data was completed and defined; the cost was $5 million below the forecasts and saved more than two years' time. His later machines mechanized the card-feeding process, added numbers, and sorted cards, in addition to merely counting data. In 1896 Hollerith founded the Tabulating Machine Company, forerunner of the Computing-Tabulating-Recording Company (CTR). He served as a consulting engineer with CTR until retiring in 1921. In 1924 CTR changed its name to IBM, the International Business Machines Corporation.
Herman Hollerith: born February 29, 1860; died November 17, 1929. "Art of Compiling Statistics"; "Apparatus for Compiling Statistics" (Patent Nos. 395,781; 395,782; 395,783). Inducted 1990.
Source: National Institute of Standards and Technology (NIST) Virtual Museum - http://museum.nist.gov/panels/conveyor/hollerithbio.htm

33 RadixSort
- Radix = "the base of a number system" (Webster's dictionary)
- Alternate terminology: radix is the number of bits needed to represent 0 to base-1; can say "base 8" or "radix 3"
- Idea: BinSort on each digit, bottom up

34 RadixSort: magic! It works.
- Input list: 126, 328, 636, 341, 416, 131, 328
- BinSort on the lowest digit: 341, 131, 126, 636, 416, 328, 328
- BinSort that result on the next-higher digit: 416, 126, 328, 328, 131, 636, 341
- BinSort that result on the highest digit: 126, 131, 328, 328, 341, 416, 636

35 RadixSort: how it works
Input: 126, 328, 636, 341, 416, 131, 328
Pass 1 (ones digit):     bin 1: 341, 131 | bin 6: 126, 636, 416 | bin 8: 328, 328
Pass 2 (tens digit):     bin 1: 416 | bin 2: 126, 328, 328 | bin 3: 131, 636 | bin 4: 341
Pass 3 (hundreds digit): bin 1: 126, 131 | bin 3: 328, 328, 341 | bin 4: 416 | bin 6: 636
Output: 126, 131, 328, 328, 341, 416, 636
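The three passes can be sketched as runnable Python: an LSD (least-significant-digit-first) RadixSort that performs a stable BinSort per digit (our rendering of the slides' idea):

```python
def radix_sort(nums, base=10):
    """LSD RadixSort: stable BinSort on each digit, least significant first."""
    if not nums:
        return nums
    digits = len(str(max(nums)))                  # number of passes needed
    for d in range(digits):
        bins = [[] for _ in range(base)]
        for x in nums:
            bins[(x // base ** d) % base].append(x)   # stable append per bin
        nums = [x for b in bins for x in b]       # concatenate bins in order
    return nums

print(radix_sort([126, 328, 636, 341, 416, 131, 328]))
# -> [126, 131, 328, 328, 341, 416, 636]
```

Because appending to a bin preserves the incoming order, ties on the current digit keep the ordering established by the earlier (lower-digit) passes, which is exactly the stability the proof on the next slides relies on.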

36 Not magic. It provably works.
Keys: K-digit numbers, base B
Claim: after the ith BinSort, the least significant i digits are sorted
- e.g. B=10, i=3, keys are 1776 and 8234: 8234 comes before 1776 when comparing the last 3 digits (234 < 776)

37 RadixSort: Proof by Induction
Base case:
- i = 0: 0 digits are sorted (that wasn't hard!)
Induction step: assume the claim for i, prove it for i+1. Consider two numbers X and Y, where X_i is the ith digit of X (from the right):
- X_{i+1} < Y_{i+1}: the (i+1)th BinSort will put them in order
- X_{i+1} > Y_{i+1}: same thing
- X_{i+1} = Y_{i+1}: the order depends on the last i digits, and the induction hypothesis says these are already sorted
(Careful about ensuring that your BinSort preserves order, aka is "stable" ...)

38 What data types can you RadixSort?
- Any type T that can be BinSorted
- Any type T that can be broken into parts A and B, such that:
  - You can reconstruct T from A and B
  - A can be RadixSorted
  - B can be RadixSorted
  - A is always more significant than B in the ordering

39 RadixSort Examples
- 1-digit numbers can be BinSorted
- 2- to 5-digit numbers can be BinSorted without using too much memory
- 6-digit numbers: break up into A = first 3 digits, B = last 3 digits
  - A and B can reconstruct the original 6 digits
  - A and B are each RadixSortable as above
  - A is more significant than B

40 RadixSorting Strings
- 1 character can be BinSorted
- Break strings into characters
- Need to know the length of the biggest string (or calculate this on the fly)
- Null-pad the shorter strings
Running time:
- N is the number of strings
- L is the length of the longest string
- RadixSort takes O(N*L)
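The null-padding scheme can be sketched as follows (our illustration; it assumes single-byte, ASCII-range characters so 256 bins suffice, and relies on '\0' comparing below every real character so shorter strings sort first):

```python
def radix_sort_strings(strs):
    """LSD RadixSort on characters: pad to the longest length with '\0',
    then do one stable BinSort per character position, rightmost first."""
    if not strs:
        return strs
    L = max(len(s) for s in strs)          # length of the longest string
    padded = [s.ljust(L, "\0") for s in strs]
    for pos in range(L - 1, -1, -1):       # least significant position first
        bins = [[] for _ in range(256)]    # one bin per byte value
        for s in padded:
            bins[ord(s[pos])].append(s)    # stable BinSort on one character
        padded = [s for b in bins for s in b]
    return [s.rstrip("\0") for s in padded]

print(radix_sort_strings(["bb", "a", "ab", "ba"]))  # -> ['a', 'ab', 'ba', 'bb']
```

L passes over N strings gives the O(N*L) running time from the slide.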

41 Evaluating Sorting Algorithms
- What factors other than asymptotic complexity could affect performance?
- Suppose two algorithms perform exactly the same number of instructions. Could one be better than the other?

42 Memory Hierarchy Stats (made up¹)

                     CPU cycles   Size
L1 (on-chip) cache   0            32 KB
L2 cache             8            512 KB
RAM                  35           256 MB
Hard drive           500,000      8 GB

¹ But plausible :-)

43 Memory Hierarchy Exploits Locality of Reference
Idea: a small amount of fast memory
- Keep frequently used data in the fast memory
LRU replacement policy:
- Keep recently used data in the cache
- To free space, remove the Least Recently Used data

44 Cache Details (simplified)
[Figure: main memory and cache; cache line size = 4 adjacent memory cells]

45 Traversing an Array
One miss for every 4 accesses in a traversal: the first access to each cache line misses, and the next 3 accesses hit.

46 Iterative MergeSort: Poor Cache Performance
Each merge pass sweeps the entire array; once the array exceeds the cache size, there is no temporal locality!

47 Recursive MergeSort: Better Cache Performance
Once a subproblem fits within the cache size, all of its recursive work stays in cache.

48 QuickSort: Pretty Good Cache Performance
- The initial partition causes a lot of cache misses
- As subproblems become smaller, they fit into the cache
- Generally good cache performance

49 RadixSort: Lousy Cache Performance
On each BinSort:
- Sweep through the input list: cache misses along the way (bad!)
- Append to the output lists, indexed by a pseudorandom digit (ouch!)
- Truly evil for a large radix (e.g. 2^16), which reduces the number of passes

50 Sorting Summary
- Linear-time sorting is possible if we know more about the input (e.g. that all input values fall within a known range)
- BinSort
  - Moderately useful O(n) categorizer
  - Most useful as a building block for RadixSort
- RadixSort
  - O(n), but the input must conform to a fairly narrow profile
  - Poor cache performance
- Memory hierarchy and cache
  - Cache locality can have a major impact on the performance of sorting, and other, algorithms

