1
Data Structures Introduction
Alon Halevy
2
Clever? Efficient?
Operations: Insert, Delete, Find, Merge, Shortest Paths, Union.
Data Structures: Lists, Stacks, Queues, Heaps, Binary Search Trees, AVL Trees, Hash Tables, Graphs, Disjoint Sets.
Data Structures + Algorithms.
3
Used Everywhere! Graphics, Theory, AI, Applications, Systems. Mastery of this material sets you apart. Perhaps the most important course in your CS curriculum! Guaranteed non-obsolescence!
4
Anecdote #1: An O(N²) “pretty print” routine nearly dooms a major expert system project at AT&T. 10 MB of data = 10 days (at 100 MIPS). The programmer was brilliant, but he skipped 326…
5
Asymptotic Complexity
Our notion of efficiency: how the running time of an algorithm scales with the size of its input. Several ways to further refine: worst case, average case, amortized over a series of runs.
6
The Apocalyptic Laptop
Seth Lloyd, SCIENCE, 31 Aug 2000
7
[Chart: number of operations performed by the Ultimate Laptop in 1 second, 1 day, and 1 year, compared with a 1000 MIPS machine running since the Big Bang]
8
Specific Goals of the Course
Become familiar with some of the fundamental data structures in computer science. Improve your ability to solve problems abstractly: data structures are the building blocks. Improve your ability to analyze your algorithms: prove correctness, gauge (and improve) time complexity. Become modestly skilled with the UNIX operating system (you’ll need this in upcoming courses). This course is designed to familiarize you with the most basic and important data structures in computer science, the ones that will form the foundation of all your future work with computers. Moreover, you’ll learn how to analyze your programs and data structures so that you know how well they work and what sort of effort in the program is acceptable. These are the goals of the course as well as my expectations of you.
9
One Preliminary Hurdle
Recall what you learned in CSE 321: proofs by mathematical induction, proofs by contradiction, formulas for calculating sums and products of series, recursion. Know Sections 1.1–1.4 of the text by heart!
10
A Second Hurdle: Unix Experience. 1975 all over again!
Try to log in, edit, create a Makefile, and compile your favorite “hello world” program right away. Programming Project #1 will be distributed Wednesday. Bring your questions and frustrations to Section on Thursday!
11
A Third Hurdle: Templates
class Set_of_ints {
 public:
  void insert( int x );
  bool is_member( int x );
  …
};

template <class Obj>
class Set {
 public:
  void insert( Obj x );
  bool is_member( Obj x );
  …
};

Set<int> SomeNumbers;
Set<char *> SomeWords;
12
In Every Silver Lining, There’s a Big Dark Cloud – George Carlin
Templates were invented 12 years ago, and still no compiler correctly implements them! Using templates with multiple source files is tricky; see the Course Web pages and TAs for the best way. MAINTAINING SANITY RULE: Write/debug first without templates. Templatize as needed. Keep it simple!
13
Handy Libraries From Weiss
vector<int> MySafeIntArray; vector<double> MySafeFloatArray; string MySafeString;
Like arrays and char*, but provide bounds checking and memory management. STL (Standard Template Library): most of CSE 326 in a box; don’t use (unless told); we’ll be rolling our own.
14
C++ Data Structures. One of the all-time great books in computer science: The Art of Computer Programming by Donald Knuth. Examples in assembly language (and English)! American Scientist says: in the top 12 books of the CENTURY! Very little about C++ in class.
15
Abstract Data Types
Abstract Data Type (ADT): a mathematical description of an object and the set of operations on the object; tradeoffs! Given that this is computer science, I know you’d be disappointed if there were no acronyms in the class. Here’s our first one! Now, what an ADT really is, is the interface of a data structure without any specification of the implementation. In this class, we’ll study groups of data structures to implement any given abstract data type. In that context… Data Types: integer, array, pointers, … Algorithms: binary search, quicksort, …
16
ADT Presentation Algorithm
Present an ADT. Motivate with some applications. Repeat until it’s time to move on: develop a data structure and algorithms for the ADT; analyze its properties: efficiency, correctness, limitations, ease of programming. Contrast strengths and weaknesses. Given those definitions, here’s our first algorithm. This is how I’m going to try to present each set of data structures to you. You should hold me to this! You’re not getting enough out of the presentation if you don’t see these. And look, here’s an ADT now…
17
First Example: Queue ADT
Queue operations: create, destroy, enqueue, dequeue, is_empty. Queue property: if x is enqueued before y is enqueued, then x will be dequeued before y is dequeued. FIFO: First In First Out. [diagram: elements G … A passing through the queue from enqueue to dequeue] You’ve probably seen the Queue before. If so, this is a review and a way for us to get comfortable with the format of data structure presentations in this class. If not, this is a simple but very powerful data structure, and you should make sure you understand it thoroughly. This is an ADT description of the queue. Notice that there are no implementation details. Just a general description of the interface and important properties of those interface methods.
18
Applications of the Q: Hold jobs for a printer.
Store packets on network routers. Make waitlists fair. Breadth-first search. Qs are used widely in computer science. This is just a handful of the high-profile uses, but _many_ programs use queues.
19
Circular Array Q Data Structure
[diagram: a circular array of `size` cells holding b c d e f, with front and back indices]
enqueue(Object x) { Q[back] = x; back = (back + 1) % size; }
dequeue() { x = Q[front]; front = (front + 1) % size; return x; }
How do we test for an empty queue? How do we find the k-th element in the queue? What is the complexity of these operations? Limitations of this structure? Here is a data structure implementation of the Q. The queue is stored as an array, and, to avoid shifting all the elements each time an element is dequeued, we imagine that the array wraps around on itself. This is an excellent example of how implementation can affect interface: notice the “is_full” function. There’s also another problem here. What’s wrong with the Enqueue and Dequeue functions? Your data structures should be robust! Make them robust before you even consider thinking about making them efficient! That is an order! (A more robust sketch follows below.)
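As a concrete illustration of the robustness point, here is a minimal sketch of a bounded circular-array queue in C++ that explicitly checks the empty and full cases. The class and member names are illustrative, not taken from the course code.

#include <cassert>
#include <cstddef>

// A fixed-capacity circular queue of ints (sketch).
class CircularQueue {
 public:
  explicit CircularQueue(std::size_t capacity)
      : data_(new int[capacity]), size_(capacity), front_(0), count_(0) {}
  ~CircularQueue() { delete[] data_; }

  bool is_empty() const { return count_ == 0; }
  bool is_full()  const { return count_ == size_; }

  void enqueue(int x) {
    assert(!is_full());                    // robust: refuse to overwrite
    data_[(front_ + count_) % size_] = x;  // back position is derived
    ++count_;
  }

  int dequeue() {
    assert(!is_empty());                   // robust: refuse to underflow
    int x = data_[front_];
    front_ = (front_ + 1) % size_;
    --count_;
    return x;
  }

 private:
  int* data_;
  std::size_t size_, front_, count_;
};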
20
Linked List Q Data Structure
[diagram: linked list b → c → d → e → f, with front and back pointers]
enqueue(Object x) { back->next = new Node(x); back = back->next; }
dequeue() { saved = front->data; temp = front; front = front->next; delete temp; return saved; }
What are the tradeoffs? Simplicity, speed, robustness, memory usage. Notice the tricky memory management.
21
To Do Return your survey before leaving!
Sign up on the cse326 mailing list. Check out the web page. Log on to the PCs in course labs and access an instructional UNIX server. Read Chapters 1 and 2 in the book.
22
Data Structures Analysis of Algorithms
Alon Halevy
23
Analysis of Algorithms
Analysis of an algorithm gives insight into how long the program runs and how much memory it uses: time complexity, space complexity. Why useful? Input size is indicated by a number n; sometimes we have multiple inputs, e.g. m and n. Running time is a function of n, e.g. n, n², n log n, n log(n²) + 5n³.
24
Simplifying the Analysis
Eliminate low-order terms: 4n → 4n; 0.5 n log n − 2n → 0.5 n log n; 2^n + n³ + 3n → 2^n. Eliminate constant coefficients: 4n → n; 0.5 n log n → n log n; log(n²) = 2 log n → log n; log₃ n = (log₃ 2) log n → log n. We didn’t get very precise in our analysis of the UWID info finder; why? We didn’t know the machine we’d use. Is this always true? Do you buy that coefficients and low-order terms don’t matter? When might they matter? (Linked list memory usage)
25
Order Notation
BIG-O: T(n) = O(f(n)): upper bound; there exist constants c and n₀ such that T(n) ≤ c·f(n) for all n ≥ n₀.
OMEGA: T(n) = Ω(f(n)): lower bound; T(n) ≥ c·f(n) for all n ≥ n₀.
THETA: T(n) = θ(f(n)): tight bound; T(n) is both O(f(n)) and Ω(f(n)).
We’ll use some specific terminology to describe asymptotic behavior. There are some analogies here that you might find useful.
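For reference, the three bounds written out formally (a standard formulation, stated here for completeness rather than copied from the slide):

$$T(n) = O(f(n)) \iff \exists\, c, n_0 > 0 \text{ such that } T(n) \le c\, f(n) \text{ for all } n \ge n_0$$
$$T(n) = \Omega(f(n)) \iff \exists\, c, n_0 > 0 \text{ such that } T(n) \ge c\, f(n) \text{ for all } n \ge n_0$$
$$T(n) = \Theta(f(n)) \iff T(n) = O(f(n)) \text{ and } T(n) = \Omega(f(n))$$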
26
Examples
n² + 100n = O(n²) = Ω(n²) = θ(n²):
(n² + 100n) ≤ 2n² for n ≥ 10; (n² + 100n) ≥ 1·n² for n ≥ 0.
n log n = O(n²); n log n = θ(n log n); n log n = Ω(n).
27
More on Order Notation
Order notation is not symmetric; write 2n² + 4n = O(n²), but never O(n²) = 2n² + 4n: the right-hand side is a crudification of the left. Likewise O(n²) = O(n³) and Ω(n³) = Ω(n²).
28
A Few Comparisons: Function #1 vs. Function #2:
n³ + 2n² vs. 100n² + 1000
n^0.1 vs. log n
n + 100n^0.1 vs. 2n + 10 log n
5n⁵ vs. n!
n^(-15)·2^n/100 vs. 1000n^15
8^(2 log n) vs. 3n⁷ + 7n
29
Race I: n³ + 2n² vs. 100n² + 1000
30
Race II: n^0.1 vs. log n. Well, log n looked good out of the starting gate and indeed kept on looking good until about n^17, at which point n^0.1 passed it up forever. Moral of the story? n^ε beats log n for any ε > 0. BUT, which one of these is really better?
31
Race III: n + 100n^0.1 vs. 2n + 10 log n. Notice that these just look like n and 2n once we get way out. That’s because the larger terms dominate. So, the left is less, but not asymptotically less. It’s a TIE!
32
Race IV: 5n⁵ vs. n!. n! is BIG!!!
33
Race V: n^(-15)·2^n/100 vs. 1000n^15. No matter how you put it, any exponential beats any polynomial. It doesn’t even take that long here (input size ~250).
34
Race VI: 8^(2 log n) vs. 3n⁷ + 7n. We can reduce the left-hand term to n⁶, so they’re both polynomials and it’s an open and shut case.
35
The Losers Win: the asymptotically smaller function is the better algorithm!
Race I: n³ + 2n² vs. 100n² + 1000; winner: Function #2, O(n²)
Race II: n^0.1 vs. log n; winner: Function #2, O(log n)
Race III: n + 100n^0.1 vs. 2n + 10 log n; TIE, both O(n)
Race IV: 5n⁵ vs. n!; winner: Function #1, O(n⁵)
Race V: n^(-15)·2^n/100 vs. 1000n^15; winner: Function #2, O(n^15)
Race VI: 8^(2 log n) vs. 3n⁷ + 7n; winner: Function #1, O(n⁶)
Welcome, everyone, to the Silicon Downs. I’m getting race results as we stand here. Let’s start with the first race. I’ll have the first row bet on race #1. Raise your hand if you bet on function #1 (the jockey is n^0.1). And so on. Show the race slides after each race.
36
Common Names
constant: O(1); logarithmic: O(log n); linear: O(n); log-linear: O(n log n); superlinear: O(n^(1+c)) (c is a constant > 0); quadratic: O(n²); polynomial: O(n^k) (k is a constant); exponential: O(c^n) (c is a constant > 1).
Well, it turns out that the old Silicon Downs is fixed. They dope up the horses to make the first few laps interesting, but we can always find out who wins. Here’s a chart comparing some of the functions. Notice that any exponential beats any polynomial. Any superlinear beats any poly-log-linear. Also keep in mind (though I won’t show it) that sometimes the input has more than one parameter, like if you take in two strings. In that case you need to be very careful about what is constant and what can be ignored. O(log m + 2n) is not necessarily O(2n).
37
Kinds of Analysis
Running time may depend on the actual data input, not just the length of the input. Distinguish: worst case (your worst enemy is choosing the input), best case, average case (assumes some probabilistic distribution of inputs), amortized (average time over many operations). We already discussed the bound flavor. All of these can be applied to any analysis case. For example, we’ll later prove that sorting in the worst case takes at least n log n time. That’s a lower bound on a worst case. Average case is hard! What does “average” mean? For example, what’s the average case for searching an unordered list (as precise as possible, not asymptotic)? WRONG! It’s about n, not n/2. Why? You have to search the whole thing if the element is not there. Note there are two senses of tight. I’ll try to avoid the terminology “asymptotically tight” and stick with the lower definition of tight. O(∞) is not tight!
38
Analyzing Code
C++ operations: constant time. Consecutive statements: sum of times. Conditionals: sum of branches plus condition. Loops: sum of iterations. Function calls: cost of function body. Recursive functions: solve a recursive equation. Above all, use your head!
39
Nested Loops for i = 1 to n do for j = 1 to n do sum = sum + 1
This example is pretty straightforward. Each loop goes N times, constant amount of work on the inside. N*N*1 = O(N^2)
40
Nested Dependent Loops
for i = 1 to n do for j = i to n do sum = sum + 1
There’s a little twist here: j goes from i to n, not 1 to n. So, let’s do the sums. The inside is constant. The next loop is the sum from j = i to n of 1, which equals n − i + 1. The outer loop is the sum from i = 1 to n of (n − i + 1). That’s the same as the sum from 1 to n of i, which is n(n+1)/2, or O(n²). (See the derivation below.)
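Writing the same count as a summation (standard algebra, included only to spell out the step):

$$\sum_{i=1}^{n} \sum_{j=i}^{n} 1 \;=\; \sum_{i=1}^{n} (n - i + 1) \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)$$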
41
Conditionals
if C then S1 else S2: time ≤ time(C) + max( time(S1), time(S2) ). OK, so this isn’t exactly an example, just reiterating the rule: time ≤ time of C plus the max of S1 and S2 ≤ time of C plus S1 plus S2. For loops: time ≤ sum of the times of the iterations, often (# of iterations) × time of S (or worst time of S).
42
Coming Up (Thursday, Friday)
Unix tutorial; first programming project! Finishing up analysis; a little on Stacks and Lists; Homework #1 goes out.
43
Data Structures Analysis of Recursive Algorithms
Alon Halevy
45
Recursion A recursive procedure can often be analyzed by solving a recursive equation Basic form: T(n) = if (base case) then some constant else ( time to solve subproblems + time to combine solutions ) Result depends upon how many subproblems how much smaller are subproblems how costly to combine solutions (coefficients) You may want to take notes on this slide as it just vaguely resembles a homework problem! Here’s a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH Answer: O(n)
46
Example: Sum of Integer Queue
sum_queue(Q){ if (Q.length == 0) return 0; else return Q.dequeue() + sum_queue(Q); }
One subproblem. Linear reduction in size (decrease by 1). Combining: constant c (+), 1 × subproblem.
Equation: T(0) ≤ b; T(n) ≤ c + T(n − 1) for n > 0.
Here’s a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH. Answer: O(n)
47
Sum, Continued
Equation: T(0) ≤ b; T(n) ≤ c + T(n − 1) for n > 0.
Solution: T(n) ≤ c + c + T(n − 2) ≤ c + c + c + T(n − 3) ≤ kc + T(n − k) for all k; taking k = n gives ≤ nc + T(0) ≤ cn + b = O(n).
48
Example: Binary Search
7 12 30 35 75 83 87 90 97 99
One subproblem, half as large.
Equation: T(1) ≤ b; T(n) ≤ T(n/2) + c for n > 1.
Solution: T(n) ≤ T(n/2) + c ≤ T(n/4) + c + c ≤ T(n/8) + c + c + c ≤ T(n/2^k) + kc ≤ T(1) + c log n where k = log n ≤ b + c log n = O(log n).
Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation. (A code sketch follows below.)
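For concreteness, here is a minimal recursive binary search over a sorted array, matching the T(n) ≤ T(n/2) + c recurrence above; the function name and signature are illustrative, not the course’s own code.

#include <vector>

// Returns the index of x in the sorted vector v, or -1 if x is absent.
// Each call does constant work and recurses on half the range,
// giving the T(n) <= T(n/2) + c recurrence, i.e. O(log n) time.
int binary_search(const std::vector<int>& v, int x, int lo, int hi) {
  if (lo > hi) return -1;                 // empty range: not found
  int mid = lo + (hi - lo) / 2;           // avoids overflow of (lo + hi)
  if (v[mid] == x) return mid;
  if (v[mid] < x)  return binary_search(v, x, mid + 1, hi);
  return binary_search(v, x, lo, mid - 1);
}

// Usage: binary_search(v, 83, 0, (int)v.size() - 1) on the sorted data above.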
49
Example: MergeSort Split array in half, sort each half, merge together
2 subproblems, each half as large; a linear amount of work to combine.
T(1) ≤ b; T(n) ≤ 2T(n/2) + cn for n > 1.
T(n) ≤ 2T(n/2) + cn ≤ 2(2T(n/4) + cn/2) + cn = 4T(n/4) + cn + cn ≤ 4(2T(n/8) + c(n/4)) + cn + cn = 8T(n/8) + cn + cn + cn ≤ 2^k T(n/2^k) + kcn ≤ 2^k T(1) + cn log n where k = log n = O(n log n).
This is the same sort of analysis as the last slide. Here’s a function defined in terms of itself. WORK THROUGH. Answer: O(n log n). Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation.
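A compact sketch of the split/sort/merge structure described above, in C++; details such as the use of std::inplace_merge are a convenience of this sketch, not the course’s own code.

#include <algorithm>
#include <vector>

// Sorts v[lo, hi) by splitting in half, sorting each half recursively,
// and merging; the merge is the linear "combine" step in T(n) = 2T(n/2) + cn.
void merge_sort(std::vector<int>& v, int lo, int hi) {
  if (hi - lo <= 1) return;                // base case: 0 or 1 elements
  int mid = lo + (hi - lo) / 2;
  merge_sort(v, lo, mid);                  // sort left half
  merge_sort(v, mid, hi);                  // sort right half
  std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Usage: merge_sort(v, 0, (int)v.size());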
50
Example: Recursive Fibonacci
int Fib(n){ if (n == 0 or n == 1) return 1; else return Fib(n - 1) + Fib(n - 2); }
Running time, lower bound analysis: T(0), T(1) ≥ 1; T(n) ≥ T(n − 1) + T(n − 2) + c if n > 1.
Note: T(n) ≥ Fib(n). Fact: Fib(n) ≥ (3/2)^n, so T(n) = Ω((3/2)^n). Why?
This is the same sort of analysis as the last slide. Here’s a function defined in terms of itself. WORK THROUGH. Answer: exponential time. Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.
51
Direct Proof of Recursive Fibonacci
int Fib(n) { if (n == 0 or n == 1) return 1; else return Fib(n - 1) + Fib(n - 2); }
Lower bound analysis: T(0), T(1) ≥ b; T(n) ≥ T(n − 1) + T(n − 2) + c if n > 1.
Analysis: let φ be (1 + √5)/2, which satisfies φ² = φ + 1. Show by induction on n that T(n) ≥ bφ^(n−1).
This is the same sort of analysis as the last slide. Here’s a function defined in terms of itself. WORK THROUGH. Answer: exponential. Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.
52
Direct Proof Continued
Basis: T(0) ≥ b > bφ^(−1) and T(1) ≥ b = bφ^0.
Inductive step: Assume T(m) ≥ bφ^(m−1) for all m < n. Then
T(n) ≥ T(n − 1) + T(n − 2) + c ≥ bφ^(n−2) + bφ^(n−3) + c = bφ^(n−3)(φ + 1) + c = bφ^(n−3)φ² + c ≥ bφ^(n−1).
53
Fibonacci Call Tree [diagram: the call tree for Fib(5); node 5 calls 4 and 3, node 4 calls 3 and 2, and so on; the same subproblems are recomputed repeatedly]
54
Learning from Analysis
To avoid redundant recursive calls: store all basis values in a table; each time you calculate an answer, store it in the table; before performing any calculation for a value n, check if a valid answer for n is in the table and, if so, return it. Memoization: a form of dynamic programming. How much time does the memoized version take? (A sketch follows below.)
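One possible memoized version of the earlier Fib, sketched in C++; the table type and names are illustrative, not part of the course code. Each value is computed at most once, so the running time drops to O(n).

#include <unordered_map>

// Memoized Fibonacci: answers are cached, so each Fib(k) is computed once.
long long fib_memo(int n, std::unordered_map<int, long long>& table) {
  if (n == 0 || n == 1) return 1;               // basis values
  auto it = table.find(n);                      // check the table first
  if (it != table.end()) return it->second;     // reuse a stored answer
  long long result = fib_memo(n - 1, table) + fib_memo(n - 2, table);
  table[n] = result;                            // store before returning
  return result;
}

// Usage:
//   std::unordered_map<int, long long> table;
//   long long f30 = fib_memo(30, table);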
55
Kinds of Analysis So far we have considered worst case analysis
We may want to know how an algorithm performs “on average”. Several distinct senses of “on average”: amortized (average time per operation over a sequence of operations); average case (average time over a random distribution of inputs); expected case (average time for a randomized algorithm over different random seeds, for any input).
56
Amortized Analysis: Consider any sequence of operations applied to a data structure; your worst enemy could choose the sequence! Some operations may be fast, others slow. Goal: show that the average time per operation is still good.
57
Stack ADT Stack operations
[diagram: a stack holding A B C D E F; F is pushed and popped at the top]
Stack operations: push, pop, is_empty. Stack property: if x is on the stack before y is pushed, then x will be popped after y is popped. What is the biggest problem with an array implementation?
58
Stretchy Stack Implementation
int data[]; int maxsize; int top;
Push(e){
  if (top == maxsize){
    temp = new int[2*maxsize];
    copy data into temp;
    deallocate data;
    data = temp;
    maxsize = 2*maxsize;
  }
  data[++top] = e;   // store the element whether or not we stretched
}
Best case Push = O( )   Worst case Push = O( )
59
Stretchy Stack Amortized Analysis
Consider a sequence of n operations: push(3); push(19); push(2); … What is the max number of stretches? What is the total time? Let’s say a regular push takes time a, and stretching an array containing k elements takes time kb, for some constants a and b. Amortized time = (an + b(2n − 1))/n = a + 2b − b/n = O(1)
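The total-time figure above can be spelled out with a doubling argument (assuming, as a sketch, that the array starts at size 1 and doubles at each stretch):

$$\text{total time for } n \text{ pushes} \;\le\; a n + b\,(1 + 2 + 4 + \cdots + n) \;\le\; a n + b\,(2n - 1)$$
$$\text{amortized time} \;=\; \frac{a n + b(2n-1)}{n} \;=\; a + 2b - \frac{b}{n} \;=\; O(1)$$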
60
Wrapup: Having math fun? Homework #1 out Wednesday – due in one week.
Programming assignment #1 handed out. Next week: linked lists
61
Data Structures Alon Halevy
71
Average Case Analysis: Attempt to capture the notion of “typical” performance. Imagine inputs are drawn from some random distribution. Ideally this distribution is a mathematical model of the real world; in practice it is usually much simpler – e.g., a uniform random distribution.
72
Example: Find a Red Card
Input: a deck of n cards, half red and half black Algorithm: turn over cards (from top of deck) one at a time until a red card is found. How many cards will be turned over? Best case = Worst case = Average case: over all possible inputs (ways of shuffling deck)
73
Summary Asymptotic Analysis – scaling with size of input
Upper bound O, lower bound Ω. O(1) or O(log n): great. O(2^n): almost never okay. Worst case most important – strong guarantee. Other kinds of analysis sometimes useful: amortized, average case.
74
List ADT: ( A₁ A₂ … Aₙ₋₁ Aₙ )
List properties: length = n; Aᵢ precedes Aᵢ₊₁ for 1 ≤ i < n; Aᵢ succeeds Aᵢ₋₁ for 1 < i ≤ n; the size-0 list is defined to be the empty list. Key operations: Find(item) = position; Find_Kth(integer) = item; Insert(item, position); Delete(position); Next(position) = position. What are some possible data structures? Now, back to work! We’re going to talk about lists briefly and quickly get to an idea which I hope you haven’t seen. Lists are sets of values. The type of those values is arbitrary but fixed (can’t change from one to another in the same list). Each value is at a position, and those positions are totally ordered.
75
Implementations of Linked Lists
Array: [diagram: an array of cells 1–10 holding the characters of a short string] Can we apply binary search to an array representation? Linked list: (optional header) (a b c): L → a → b → c
76
Linked List vs. Array vs. Sorted Array
Compare the cost of each operation in a linked list, an array, and a sorted array: Find(item) = position; Find_Kth(integer) = item; Find_Kth(1) = item; Insert(item, position); Insert(item); Delete(position); Next(position) = position.
77
Tradeoffs For what kinds of applications is a linked list best?
Examples for an unsorted array? Examples for a sorted array?
78
Implementing in C++ [diagram: list (a b c) with an optional header node]
Create separate classes for: Node; List (contains a pointer to the first node); List Iterator (specifies a position in a list; basically, just a pointer to a node). Pro: syntactically distinguishes uses of node pointers. Con: a lot of verbiage! Also, is a position in a list really distinct from a list?
79
Data Structures Alon Halevy
84
Other Data Structures for Lists
Doubly Linked List; Circular List. [diagrams: a circular list of 7, 11, 3, 2 and a doubly linked list of c, d, e, f] Advantages/disadvantages (a previous pointer for the doubly linked list)? Your book also describes header nodes. Are they just a hack? I’m not going to go into these, but you should be able to (for a test) add and delete nodes in all these types of list, not to mention for your daily coding needs!
85
Implementing Linked Lists Using Arrays
[table: indices 1–10; Data = F O A R N R T; Next = 3 8 6 4 -1 10 5; First = 2]
“Cursor implementation,” Ch. 3.2.8. Often useful in any language. Can use the same array to manage a second list of unused cells. (A sketch follows below.)
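A minimal sketch of the cursor idea in C++, assuming a fixed pool of cells and a free list threaded through the same arrays; all names here are illustrative rather than taken from the text.

#include <cassert>

const int POOL_SIZE = 10;
char data_[POOL_SIZE];
int  next_[POOL_SIZE];   // index of the next cell, or -1 for end of list
int  free_list = 0;      // head of the list of unused cells

// Thread all cells onto the free list once at startup.
void init_pool() {
  for (int i = 0; i < POOL_SIZE; ++i) next_[i] = i + 1;
  next_[POOL_SIZE - 1] = -1;
  free_list = 0;
}

// Insert c at the front of the list whose head index is `head`;
// returns the new head. Cells are "allocated" by popping the free list.
int push_front(int head, char c) {
  assert(free_list != -1);       // pool exhausted
  int cell = free_list;
  free_list = next_[cell];
  data_[cell] = c;
  next_[cell] = head;
  return cell;
}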
86
Application: Polynomial ADT
Aᵢ is the coefficient of the x^(n−i) term: 3x² + 2x + 5 → ( 3 2 5 ); 8x + 7 → ( 8 7 ); x² + 3 → ( 1 0 3 ). Here’s an application of the list abstract data type as a _data structure_ for another abstract data type. Is there a problem here? Why? Problem?
87
3x^2001 + 4 → ( 3 0 0 … 0 4 ). What is it about lists that makes this a problem here and not in stacks and queues? (Answer: kth(int)!) Is there a solution? Will we get anything but zeroes overwhelming this data structure?
88
Sparse List Data Structure: 3x^2001 + 4
(<4 0> <2001 3>) [diagram: a linked list of (coefficient, exponent) nodes] This slide is made possible in part by the sparse list data structure. Now, two questions: 1) Is a sparse list really a data structure or an abstract data type? (Answer: It depends but I lean toward data structure. YOUR ANSWER MUST HAVE JUSTIFICATION!) 2) Which list data structure should we use to implement it? Linked Lists or Arrays?
89
Addition of Two Polynomials
Similar to merging two sorted lists – O(n + m).
p = 15 + 10x^50 + 3x^1200
q = 5 + 30x^50 + 4x^100
r = p + q = 20 + 40x^50 + 4x^100 + 3x^1200
[diagram: the three polynomials as sparse linked lists of (coefficient, exponent) nodes]
90
Multiple Linked Lists: Many ADTs such as graphs, relations, sparse matrices, and multivariate polynomials use multiple linked lists. Several options: array of lists, lists of lists, multi-lists. General principle throughout the course: use one ADT to implement a more complicated one.
91
Array of Linked Lists: Adjacency List for Graphs
Array G of unordered linked lists; each list entry corresponds to an edge in the graph G. [diagram: a 5-node graph and its adjacency lists] Graphs are a very important data type. You might think as you read about your project if there are any graphs there. Here, we’re implementing graphs with adjacency lists. The reason is that this is a sparse graph. We want to have every node in an array (so we can find the first edge quickly), but we just need the edges around.
92
Reachability by Marking
Suppose we want to mark all the nodes in the graph which are reachable from a given node k. Let G[1..n] be the adjacency-list representation of the graph, and let M[1..n] be the mark array, initially all false.
mark(int i){
  M[i] = true;
  x = G[i];
  while (x != NULL) {
    if (M[x->node] == false) mark(x->node);   // recurse on the neighbor's index
    x = x->next;
  }
}
Here’s an algorithm that works on our adjacency-list graph.
93
Multi-Lists Suppose we have a set of movies and cinemas, and we want a structure that stores which movies are playing where.
94
More on Multi-Lists What if we also want to store the playing times of movies?
95
Data Structures (end of Lists, then) Trees
Alon Halevy
105
Trees
Family Trees; Organization Charts; Classification trees (is this mushroom poisonous?); File directory structure; Parse Trees, e.g. (x+y*z); Search Trees (often better than lists for sorted data).
106
Definition of a Tree
Recursive definition: the empty tree has no root; given trees T₁,…,Tₖ and a node r, there is a tree T where r is the root of T and the children of r are the roots of T₁, T₂, …, Tₖ. [diagram: r with subtrees T₁, T₂, T₃]
107
Tree Terminology: root, child, parent, sibling, path, descendant, ancestor.
[diagram: an example tree with nodes a through l] Let’s review the words: root: A. leaf: D, E, F, J, K, L, M, N, I. child: A - C or H - K; leaves have no children. parent: C - A or L - H; the root has no parent. sibling: D - E or F, or J - K, L, M, or N. grandparent: G to A. grandchild: C to H or I. ancestor: the node itself or any ancestor’s parent. descendant: the node itself or any child’s descendant. subtree: a node and all its descendants.
108
More Tree Terminology: subtree, leaf, depth, height, branching factor, n-ary, complete.
[diagram: the same example tree, nodes a through l]
109
Basic Tree Data Structure
Each node holds two pointers: first_child and next_sibling. [diagram: nodes a through e linked by first_child and next_sibling pointers]
110
Logical View of Tree [diagram: the example tree, nodes a through l]
111
Actual Data Structure [diagram: the same tree represented with first_child/next_sibling pointers]
112
Combined View of Tree [diagram: the logical tree overlaid with the first_child/next_sibling links]
113
Traversals: Many algorithms involve walking through a tree and performing some computation at each node. Walking through a tree is called a traversal. Common kinds of traversal: Pre-order, Post-order, Level-order.
114
Pre-Order Traversal: Perform computation at the node, then recursively perform computation on each child.
preorder(node * n){
  node * c;
  if (n != NULL){
    DO SOMETHING;
    c = n->first_child;
    while (c != NULL){
      preorder(c);
      c = c->next_sibling;
    }
  }
}
115
Pre-Order Traversal Example
[diagram: the example tree; the traversal starts with a, then visits each subtree] Start with a -
116
Pre-Order Applications
Use when computation at node depends upon values calculated higher in the tree (closer to root) Example: computing depth depth(node) = 1 + depth( parent of node ) Another example: printing out a directory structure.
117
Computing Depth of All Nodes
Add a field “depth” to all nodes.
Depth(node * n, int d){
  node * c;
  if (n != NULL){
    n->depth = d;
    d = d + 1;
    c = n->first_child;
    while (c != NULL){
      Depth(c, d);
      c = c->next_sibling;
    }
  }
}
Call Depth(root, 0) to set the depth field correctly.
118
Depth Calculation [diagram: the example tree annotated with each node’s depth]
119
Post-Order Traversal: Recursively perform computation on each child, and then perform computation at the node.
postorder(node * n){
  node * c;
  if (n != NULL){
    c = n->first_child;
    while (c != NULL){
      postorder(c);
      c = c->next_sibling;
    }
    DO SOMETHING;
  }
}
120
Post-Order Applications
Use when the computation at a node depends upon values calculated lower in the tree (closer to the leaves). Example: computing height: height(node) = 1 + MAX( height(child1), height(child2), … height(childk) ). Example: size of the tree rooted at a node: size(node) = 1 + size(child1) + size(child2) + … + size(childk).
121
Computing Size of Tree
Size(node * n){
  node * c;
  if (n == NULL) return 0;
  else {
    int m = 1;
    c = n->first_child;
    while (c != NULL){
      m = m + Size(c);
      c = c->next_sibling;
    }
    return m;
  }
}
Call Size(root) to compute the number of nodes in the tree.
122
Depth-First Search Both Pre-Order and Post-Order traversals are examples of depth-first search nodes are visited deeply on the left-most branches before any nodes are visited on the right-most branches visiting the right branches deeply before the left would still be depth-first! Crucial idea is “go deep first!” In DFS the nodes “being worked on” are kept on a stack (where?)
123
Level-Order/Breadth-first Traversal
Consider the task of traversing the tree level by level from top to bottom (alphabetic order). What data structure should we use to keep track of the nodes? [diagram: the example tree]
124
Level-Order (Breadth First) Traversal
Put the root in a Queue. Repeat until the Queue is empty: Dequeue a node; Process it; Add its children to the queue.
125
Example: Printing the Tree
print(node * root){
  node * n, * c;
  queue Q;
  Q.enqueue(root);
  while (! Q.empty()){
    n = Q.dequeue();
    print n->data;
    c = n->first_child;
    while (c != NULL){
      Q.enqueue(c);
      c = c->next_sibling;
    }
  }
}
126
QUEUE contents over the traversal: a; b c d e; c d e f g; d e f g; e f g h i j; h i j k; i j k; j k l; k l; l. [diagram: the example tree]
127
Applications of BFS: Find the shortest path from the root to a given node N; if N is at depth k, BFS will never visit a node at depth > k (important for really deep trees). Generalizes to finding shortest paths in graphs. Spidering the world wide web: from a root URL, fetch pages that are further and further away.
128
Data Structures Binary Search Trees
Alon Halevy
129
Binary Trees
Many algorithms are efficient and easy to program for the special case of binary trees. A binary tree is: a root, a left subtree (maybe empty), and a right subtree (maybe empty). [diagram: a binary tree with nodes A through J] Alright, we’ll focus today on one type of trees called binary trees. Here’s one now. Is this binary tree complete? Why not? (C has just one child, right side is much deeper than left.) What’s the maximum # of leaves a binary tree of depth d can have? What’s the max # of nodes a binary tree of depth d can have? Minimum? We won’t go into this, but if you take N nodes and assume all distinct trees of the nodes are equally likely, you get an average depth of SQRT(N). Is that bigger or smaller than log n? Bigger, so it’s not good enough!
130
Representation: each node holds Data, a left pointer, and a right pointer. [diagram: nodes A through F with their left/right pointers]
131
Properties of Binary Trees
Max # of leaves in a tree of height h = ? Max # of nodes in a tree of height h = ? [diagram: a complete binary tree with nodes A through G]
132
Dictionary & Search ADTs
Operations: create, destroy, insert, find, delete. Dictionary: stores values associated with user-specified keys; keys may be any (homogeneous) comparable type; values may be any (homogeneous) type; implementation: the data field is a struct with two parts. Search ADT: keys = values. [example: kim chi = spicy cabbage; kreplach = tasty stuffed dough; kiwi = Australian fruit; insert(kohlrabi - upscale tuber); find(kreplach) returns kreplach - tasty stuffed dough] Dictionaries associate some key with a value, just like a real dictionary (where the key is a word and the value is its definition). In this example, I’ve stored user-IDs associated with descriptions of their coolness level. This is probably the most valuable and widely used ADT we’ll hit. I’ll give you an example in a minute that should firmly entrench this concept.
133
Naïve Implementations
              unsorted array    sorted array     linked list
insert        find + O(1)       find + O(n)      find + O(1)
find          O(n)              O(log n)         O(n)
delete        find + O(1)       find + O(n)      find + O(1)
              (if no shrink)
Goal: fast find like a sorted array, dynamic inserts/deletes like a linked list.
134
Binary Search Tree Dictionary Data Structure
Search tree property: all keys in the left subtree are smaller than the root’s key; all keys in the right subtree are larger than the root’s key. Result: easy to find any given key; inserts/deletes by changing links. [diagram: a BST with root 8 containing 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14] A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing)
135
Example and Counter-Example
[diagrams: two trees; the left one is a BINARY SEARCH TREE, the right one is NOT A BINARY SEARCH TREE] Why is the one on the left a BST? It’s not complete! (Because BSTs don’t need to be complete.) Why isn’t the one on the right a BST? Three children of 5; 20 has a left child larger than it. What’s wrong with 11? Even though 15 isn’t a direct child, it _still_ needs to be less than 11!
136
In-Order Listing: visit left subtree, visit node, visit right subtree.
[diagram: BST with root 10, children 5 and 15, and nodes 2, 7, 9, 17, 20, 30]
if (n != null) { inorder(n->left); cout << n; inorder(n->right); }
In-order listing: 2 5 7 9 10 15 17 20 30
Anyone notice anything interesting about that in-order listing? Everything in the left subtree is listed first. Then the root. Then everything in the right subtree. OK, let’s work out the code to make the in-order listing. Is there an iterative version that doesn’t use its own stack? Not really, no. So, recursion is probably OK here. Anyway, if the tree’s too deep for recursion, you must have a huge amount of data.
137
Finding a Node [diagram: the BST with root 10] runtime:
Node *& find(Comparable x, Node *& root) {
  if (root == NULL) return root;
  else if (x < root->key) return find(x, root->left);
  else if (x > root->key) return find(x, root->right);
  else return root;
}
Now, let’s try finding a node. Find 9. This time I’ll supply the code. This should look a _lot_ like binary search! How long does it take? Log n is an easy answer, but what if the tree is very lopsided? So really, this is worst case O(n)! A better answer is theta of the depth of the node sought. If we can bound the depth of that node, we can bound the length of time a search takes. What about the code? All those &s and *s should look pretty scary. Let’s talk through them. runtime:
138
Insert
Concept: proceed down the tree as in Find; if the new key is not found, then insert a new node at the last spot traversed.
void insert(Comparable x, Node * root) {
  assert( root != NULL );
  if (x < root->key){
    if (root->left == NULL) root->left = new Node(x);
    else insert( x, root->left );
  } else if (x > root->key){
    if (root->right == NULL) root->right = new Node(x);
    else insert( x, root->right );
  }
}
Let’s do some inserts: insert(8), insert(11), insert(31).
139
BuildTree for BSTs: Suppose a₁, a₂, …, aₙ are inserted into an initially empty BST: a₁, a₂, …, aₙ are in increasing order; a₁, a₂, …, aₙ are in decreasing order; a₁ is the median of all, a₂ is the median of the elements less than a₁, a₃ is the median of the elements greater than a₁, etc.; data is randomly ordered. OK, we had a buildHeap, let’s buildTree. How long does this take? Well, IT DEPENDS! Let’s say we want to build a tree. What happens if we insert in order? Reverse order? What about 5, then 3, then 7, then 2, then 1, then 6, then 8, then 9?
140
Examples of Building from Scratch
1, 2, 3, 4, 5, 6, 7, 8, 9 5, 3, 7, 2, 4, 6, 8, 1, 9
141
Analysis of BuildTree
Worst case is O(n²): 1 + 2 + … + n = O(n²).
Average case, assuming all orderings are equally likely, is O(n log n); not averaging over all binary trees, rather averaging over all input sequences (inserts); equivalently: the average depth of a node is log n; proof: see Introduction to Algorithms, Cormen, Leiserson, & Rivest.
Average runtime is equal to the average depth of a node in the tree. We’ll calculate the average depth by finding the sum of all depths in the tree and dividing by the number of nodes. What’s the sum of all depths? D(N) = D(I) + D(N - I - 1) + N - 1 (left subtree has I nodes, the root is 1 node, so the right subtree has N - I - 1; D(I) is the depth sum of the left subtree, each node sits 1 deeper in the overall tree, the same goes for the right, for a total of I + N - I - 1 = N - 1 extra depth). For BSTs, all subtree sizes are equally likely (because we pick the middle element at random and the rest fall on the left or right deterministically). Each subtree then averages 1/N * the sum from j = 0 to N-1 of D(j).
142
Bonus: FindMin/FindMax
Find minimum; Find maximum. [diagram: the BST with root 10] Every now and then everyone succumbs to the temptation to really overuse color.
143
Deletion [diagram: the BST with root 10] And now for something completely different. Let’s say I want to delete a node. Why might it be harder than insertion? It might happen in the middle of the tree instead of at a leaf. Then, I have to fix the BST. Why might deletion be harder than insertion?
144
Deletion - Leaf Case: Delete(17) [diagram: the BST before and after removing the leaf 17]
Alright, we did it the easy way, but what about real deletions? Leaves are easy; we just prune them.
145
Deletion - One Child Case
Delete(15) [diagram: node 15 has one child, 20; after deletion, 20 takes its place] Single-child nodes we remove and… do what? We can just pull up their children. Is the search tree property intact? Yes.
146
Deletion - Two Child Case
Delete(5) [diagram: node 5 has two children, 2 and 9] Ah, now the hard case. How do we delete a two-child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with its successor and then delete the successor. Replace the node with a value guaranteed to be between the left and right subtrees: the successor. Could we have used the predecessor instead?
147
Finding the Successor
Find the next larger node in this node’s subtree, not the next larger in the entire tree.
Node * succ(Node * root) { if (root->right == NULL) return NULL; else return min(root->right); }
[diagram: the BST with root 10] Here’s a little digression. Maybe it’ll even have an application at some point. Find the next larger node in 10’s subtree. Can we define it in terms of min and max? It’s the min of the right subtree! How many children can the successor of a node have?
148
Predecessor: Find the next smaller node in this node’s subtree.
Node * pred(Node * root) { if (root->left == NULL) return NULL; else return max(root->left); }
[diagram: the BST with root 10] Predecessor is just the mirror problem.
149
Deletion - Two Child Case
Delete(5) [diagram: node 5 has two children, 2 and 9] Ah, now the hard case. How do we delete a two-child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with its successor and then delete the successor. It is always easy to delete the successor – it always has either 0 or 1 children!
150
Delete Code
void delete(Comparable x, Node *& p) {
  Node * q;
  if (p != NULL) {
    if (p->key < x) delete(x, p->right);
    else if (p->key > x) delete(x, p->left);
    else { /* p->key == x */
      if (p->left == NULL) p = p->right;
      else if (p->right == NULL) p = p->left;
      else {
        q = successor(p);
        p->key = q->key;
        delete(q->key, p->right);
      }
    }
  }
}
Here’s the code for deletion using lots of confusing reference pointers BUT no leaders, fake nodes. The iterative version of this can get somewhat messy, but it’s not really any big deal.
151
Lazy Deletion: Instead of physically deleting nodes, just mark them as deleted. Simpler; physical deletions done in batches; some adds just flip the deleted flag. But: extra memory for the deleted flag; many lazy deletions slow finds; some operations may have to be modified (e.g., min and max). [diagram: the BST with root 10] Now, before we move on to all the pains of true deletion, let’s do it the easy way. We’ll just pretend we delete deleted nodes. This has some real advantages: …
152
Lazy Deletion exercise: Delete(17), Delete(15), Delete(5), Find(9), Find(16), Insert(5), Find(17). [diagram: the BST with root 10] OK, let’s do some lazy deletions. Everybody yawn, stretch, and say “Mmmm… doughnut” to get in the mood. Those of you who are already asleep have the advantage.
153
Dictionary Implementations
              unsorted array    sorted array     linked list      BST
insert        find + O(1)       find + O(n)      find + O(1)      O(Depth)
find          O(n)              O(log n)         O(n)             O(Depth)
delete        find + O(1)       find + O(n)      find + O(1)      O(Depth)
BSTs are looking good for shallow trees, i.e. when the depth D is small (log n); otherwise they are as bad as a linked list!
154
Beauty is Only Θ(log n) Deep
Binary Search Trees are fast if they’re shallow: e.g., perfectly complete; e.g., perfectly complete except the “fringe” (leaves); any other good cases? What makes a good BST good? Here are two examples. Are these the only good BSTs? No! Anything without too many long branches is good, right? Problems occur when one branch is much longer than the other! What matters here?
155
Data Structures Binary Search Trees
Alon Halevy
178
Balance
Balance = height(left subtree) − height(right subtree). Zero everywhere: perfectly balanced. Small everywhere: balanced enough. We’ll use the concept of Balance to keep things shallow. Balance between -1 and 1 everywhere gives a maximum height of 1.44 log n.
179
AVL Tree Dictionary Data Structure
Binary search tree properties: binary tree property, search tree property. Balance property: the balance of every node is between -1 and 1 (-1 ≤ b ≤ 1); result: depth is Θ(log n). [diagram: an AVL tree with root 8] So, AVL trees will be Binary Search Trees with one extra feature: they balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths.
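To make the balance property concrete, here is a small sketch of an AVL-style node with a stored height and a balance check; the struct and helper names are illustrative, not the course’s code.

#include <algorithm>
#include <cstdlib>

struct Node {
  int key;
  int height;        // height of the subtree rooted here; a leaf has height 0
  Node *left, *right;
};

// Height of a (possibly empty) subtree; the empty tree has height -1.
int height(Node* n) { return n ? n->height : -1; }

// Balance = height(left) - height(right), recomputed from the children.
int balance(Node* n) { return n ? height(n->left) - height(n->right) : 0; }

// The AVL property at a single node: balance must be -1, 0, or +1.
bool avl_ok(Node* n) { return n == nullptr || std::abs(balance(n)) <= 1; }

// After an insert or delete, a node's stored height is refreshed like this:
void update_height(Node* n) {
  n->height = 1 + std::max(height(n->left), height(n->right));
}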
180
An AVL Tree
[diagram: an AVL tree with root 10 and nodes 2, 3, 5, 9, 12, 15, 17, 20, 30; each node stores its data, its height, and pointers to its children] Here’s a revision of that tree that’s balanced. (Same values, similar tree.) This one _is_ an AVL tree (and isn’t leftist). I also have here how we might store the nodes in the AVL tree. Notice that I’m going to keep track of height all the time. WHY?
181
Not AVL Trees 10 10 0-2 = -2 (-1)-1 = -2 5 15 15 12 20 20 17 30 3 2 2
2 0-2 = -2 (-1)-1 = -2 1 5 15 15 1 12 20 20 These two trees use similar values but are NOT AVL trees: in each, some node has balance -2 (either 0-2 = -2 or (-1)-1 = -2), so the balance property is violated even though the search-tree ordering holds. 17 30
182
Staying Balanced M S T Good case: inserting small, tall and middle.
Insert(middle) Insert(small) Insert(tall) 1 M Let’s make a tree from these people with their height as the keys. We’ll start by inserting [MIDDLE] first. Then, [SMALL] and finally [TALL]. Is this tree balanced? Yes! S T
183
Bad Case #1 S M T Insert(small) Insert(middle) Insert(tall) 2 1
But, let’s start over… Insert [SMALL] Now, [MIDDLE]. Now, [TALL]. Is this tree balanced? NO! Who do we need at the root? [MIDDLE!] Alright, let’s pull er up. T
184
Single Rotation S M M S T T 2 1 1 Basic operation used in AVL trees:
This is the basic operation we’ll use in AVL trees. Since this is a right child, it could legally have the parent as its left child. When we finish the rotation, we have a balanced tree! S T T Basic operation used in AVL trees: A right child could legally have its parent as its left child.
185
General Case: Insert Unbalances
h + 1 h + 2 a a h h - 1 h + 1 h - 1 b X b X h-1 h h - 1 h - 1 Z Y Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here!
186
General Single Rotation
h + 2 h + 1 a a X Y b Z h h + 1 h - 1 b X h h - 1 h h - 1 h - 1 Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here! Height of left subtree same as it was before insert! Height of all ancestors unchanged We can stop here!
187
Will a single rotation fix this?
Bad Case #2 Insert(small) Insert(tall) Insert(middle) 2 S 1 T There’s another bad case, though. What if we insert: [SMALL] [TALL] [MIDDLE] Now, is the tree imbalanced? Will a single rotation fix it? (Try it by bringing up tall; doesn’t work!) Will a single rotation fix this? M
188
Double Rotation S S M T M S T M T 2 2 1 1 1
Let’s try two single rotations, starting a bit lower down. First, we rotate up middle. Then, we rotate up middle again! Is the new tree balanced? S T M T
189
General Double Rotation
h + 2 a h + 1 h + 1 c h - 1 b Z h h b a h - 1 W h c h - 1 h - 1 X Y W Z X Y Here’s the general form of this. Notice that the difference here is that we zigged one way than zagged the other to find the problem. We don’t really know or care which of X or Y was inserted into, but one of them was. To fix it, we pull c all the way up. Then, put a, b, and the subtrees beneath it in the reasonable manner. The height is still the same at the end! h - 1? h - 1? Initially: insert into either X or Y unbalances tree (root height goes to h+2) “Zig zag” to pull up c – restores root height to h+1, left subtree height to h
190
Insert Algorithm Find spot for value Hang new node
Search back up looking for imbalance If there is an imbalance: case #1: Perform single rotation and exit case #2: Perform double rotation and exit OK, thank you BST Three! And those two cases (along with their mirror images) are the only four that can happen! So, here’s our insert algorithm. We just hang the node. Search for a spot where there’s imbalance. If there is, fix it (according to the shape of the imbalance). And then we’re done; there can only be one problem!
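A hedged sketch of that insert algorithm in recursive form; it assumes a height() helper that treats NULL as -1, the Node/Comparable types used elsewhere in the deck, and rotation routines in the style of the RotateRight/DoubleRotateRight code shown on a later slide, plus their left-side mirrors (RotateLeft, DoubleRotateLeft). This is illustrative, not the official course code:

void insertAVL(Comparable x, Node *&p) {
    if (p == NULL) { p = new Node(x); return; }        // found the spot: hang the new node
    if (x < p->key)      insertAVL(x, p->left);
    else if (x > p->key) insertAVL(x, p->right);
    else                 return;                        // duplicate key: nothing to do

    // unwinding the recursion: correct the height, then look for an imbalance
    p->height = max(height(p->left), height(p->right)) + 1;

    if (height(p->left) - height(p->right) == 2) {           // left subtree too tall
        if (x < p->left->key) RotateLeft(p);                  // case #1: zig-zig, single rotation
        else                  DoubleRotateLeft(p);            // case #2: zig-zag, double rotation
    } else if (height(p->right) - height(p->left) == 2) {     // right subtree too tall (mirror)
        if (x > p->right->key) RotateRight(p);                // single rotation (brings up right child)
        else                   DoubleRotateRight(p);
    }
}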
191
Easy Insert Insert(3) 10 5 15 2 9 12 20 17 30 3 1 2 1 Let’s insert 3.
1 2 9 12 20 Let’s insert 3. This is easy! It just goes under 2 (to the left). Update the balances: any imbalance? NO! 17 30
192
Hard Insert (Bad Case #1)
2 3 Insert(33) 10 5 15 2 9 12 20 Now, let’s insert 33. Where does it go? Left of 30. 3 17 30
193
Single Rotation 1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 Here’s the tree with the balances updated. Now, node 15 is bad! Since the problem is in the left subtree of the left child, we can fix it with a single rotation. We pull 20 up. Hang 15 to the left. Pass 17 to 15. And, we’re done! Notice that I didn’t update 10’s height until we checked 15. Did it change after all? 3 17 30 3 12 17 33 33
194
Hard Insert (Bad Case #2)
1 2 3 Insert(18) 10 5 15 2 9 12 20 Now, let’s back up to before 33 and insert 18 instead. Goes right of 17. Again, there’s imbalance. But, this time, it’s a zig-zag! 3 17 30
195
Single Rotation (oops!)
1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 We can try a single rotation, but we end up with another zig-zag! 3 17 30 3 12 17 18 18
196
Double Rotation (Step #1)
2 3 1 2 3 10 10 5 15 5 15 2 9 12 20 2 9 12 17 So, we’ll double rotate. Start by moving the offending grand-child up. We get an even more imbalanced tree. BUT, it’s imbalanced like a zig-zig tree now! 3 17 30 3 20 18 18 30 Look familiar?
197
Double Rotation (Step #2)
1 2 3 1 2 3 10 10 5 15 5 17 2 9 12 17 2 9 15 20 So, let’s pull 17 up again. Now, we get a balanced tree. And, again, 10’s height didn’t need to change. 3 20 3 12 18 30 18 30
198
AVL Algorithm Revisited
Recursive 1. Search downward for spot 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2, double rotate Iterative 1. Search downward for spot, stacking parent nodes 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate and exit b. If imbalance #2, double rotate and exit OK, here's the algorithm again. Notice that there's very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can't I go top down? Now, what's left? Single and double rotate!
199
Single Rotation Code X Y Z root temp void RotateRight(Node *& root) {
Node * temp = root->right; root->right = temp->left; temp->left = root; root->height = max(root->right->height, root->left->height) + 1; temp->height = max(temp->right->height, temp->left->height) + 1; root = temp; } Here’s code for one of the two single rotate cases. RotateRight brings up the right child. We’ve inserted into Z, and now we want to fix it.
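The mirror routine RotateLeft (which the DoubleRotateRight code on the next slide relies on) isn't shown in the deck; a plausible sketch, using a height() helper that returns -1 for NULL so empty subtrees aren't dereferenced, might look like this (illustrative only):

int height(Node *n) { return n == NULL ? -1 : n->height; }

void RotateLeft(Node *& root) {            // brings up the left child
    Node * temp = root->left;
    root->left = temp->right;              // temp's right subtree becomes root's left
    temp->right = root;                    // old root hangs to the right of temp
    root->height = max(height(root->left), height(root->right)) + 1;
    temp->height = max(height(temp->left), height(temp->right)) + 1;
    root = temp;                           // temp is the new subtree root
}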
200
Double Rotation Code First Rotation a Z b W c X Y a Z c b X Y W
void DoubleRotateRight(Node *& root) { RotateLeft(root->right); RotateRight(root); } First Rotation a Z b W c X Y a Z c b X Y W Here’s the double rotation code. Pretty tough, eh?
201
Double Rotation Completed
First Rotation Second Rotation a Z c b X Y W c a b X W Z Y
202
Data Structures AVL II Alon Halevy
Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)
203
Deletion (Really Easy Case)
1 2 3 Delete(17) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let's try deleting. 17 is really easy! It's a leaf, so we just snip it out. Did we disturb the tree? NO! 3 17 30
204
Deletion (Pretty Easy Case)
1 2 3 Delete(15) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 17 30
205
Deletion (Pretty Easy Case cont.)
3 Delete(15) 10 2 2 5 17 1 1 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 30
206
Deletion (Hard Case #1) Delete(12) 10 5 17 2 9 12 20 3 30 3 2 1
2 3 Delete(12) 10 5 17 2 9 12 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 3 30
207
Single Rotation on Deletion
1 2 3 3 10 10 2 1 5 17 5 20 1 2 9 20 2 9 17 30 But what happened on the fix? Something very disturbing. What? The subtree’s height changed!! So, the deletion can propagate. 3 30 3 What is different about deletion than insertion?
208
Deletion (Hard Case) Delete(9) 10 5 17 2 9 12 12 20 20 3 11 15 15 18
3 4 Delete(9) 10 5 17 2 9 12 12 20 20 Now, let's delete 9. 9 goes away. Now, there's trouble. We've put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 5. 1 1 3 11 15 15 18 30 30 13 13 33 33
209
Double Rotation on Deletion
Not finished! 1 2 3 4 2 1 3 4 10 10 5 17 3 17 2 2 12 20 2 5 12 20 1 1 1 1 3 11 15 18 30 11 15 18 30 13 33 13 33
210
Deletion with Propagation
2 1 3 4 10 What's different about this case? 3 17 2 5 12 20 1 1 We get to choose whether to single or double rotate! 11 15 18 30 13 33
211
Propagated Single Rotation
2 1 3 4 4 10 17 3 2 3 17 10 20 1 2 1 2 5 12 20 3 12 18 30 1 1 1 11 15 18 30 2 5 11 15 33 13 33 13
212
Propagated Double Rotation
2 1 3 4 4 10 12 2 3 3 17 10 17 1 1 2 2 5 12 20 3 11 15 20 1 1 1 11 15 18 30 2 5 13 18 30 13 33 33
213
AVL Deletion Algorithm
Recursive 1. If at node, delete it 2. Otherwise recurse to find it in the left or right subtree 3. Correct heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care), double rotate Iterative 1. Search downward for node, stacking parent nodes 2. Delete node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care), double rotate OK, here's the algorithm again. Notice that there's very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can't I go top down? Now, what's left? Single and double rotate!
214
Fun with AVL Trees Input: sequence of n keys (unordered) 19 3 4 18 7
Insert each into initially empty AVL tree Print using inorder traversal O(n) Result? Are we having fun yet?
215
Is There a Faster Way? But suppose input is already sorted 3 4 7 18 19
Can we do better than O(n log n)?
216
AVL buildTree 5 8 10 15 17 20 30 35 40 Divide & Conquer 17
Divide the problem into parts Solve each part recursively Merge the parts into a general solution 17 IT DEPENDS! How long does divide & conquer take? 8 10 15 5 20 30 35 40
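A sketch of that divide & conquer build from a sorted array: the middle key becomes the root and each half is built recursively. The Node constructor and height() helper are assumptions carried over from earlier sketches; this is illustrative only.

#include <algorithm>
#include <vector>
using std::vector;

Node *buildTree(const vector<int> &keys, int lo, int hi) {
    if (lo > hi) return NULL;                        // empty range: empty subtree
    int mid = (lo + hi) / 2;                         // middle key becomes the root
    Node *root = new Node(keys[mid]);
    root->left  = buildTree(keys, lo, mid - 1);      // build the left half
    root->right = buildTree(keys, mid + 1, hi);      // build the right half
    root->height = 1 + std::max(height(root->left), height(root->right));
    return root;
}
// buildTree(keys, 0, (int)keys.size() - 1): O(1) work per node, O(n) total,
// and the halves differ in size by at most one, so the result is balanced.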
217
BuildTree Example 5 8 10 15 17 20 30 35 40 3 17 5 8 10 15 2 2 20 30 35 40 10 35 20 30 5 8 1 1 8 15 30 40 5 20
218
BuildTree Analysis (Approximate)
T(n) = 2T(n/2) + 1
T(n) = 2(2T(n/4) + 1) + 1 = 4T(n/4) + 3
T(n) = 4(2T(n/8) + 1) + 3 = 8T(n/8) + 7
…
T(n) = 2^k T(n/2^k) + (2^k - 1); let 2^k = n, so k = log n
T(n) = nT(1) + (n - 1) = Θ(n)
Summation is 2^(log n) + 2^(log n - 1) + 2^(log n - 2) + … = n + n/2 + n/4 + n/8 + … ≈ 2n
219
BuildTree Analysis (Exact)
Precise Analysis: T(0) = b; T(n) = T(floor((n-1)/2)) + T(ceil((n-1)/2)) + c. By induction on n: T(n) = (b+c)n + b. Base case: T(0) = b = (b+c)·0 + b. Induction step: T(n) = [(b+c)·floor((n-1)/2) + b] + [(b+c)·ceil((n-1)/2) + b] + c = (b+c)(n-1) + 2b + c = (b+c)n + b. QED: T(n) = (b+c)n + b = Θ(n)
220
Application: Batch Deletion
Suppose we are using lazy deletion When there are lots of deleted nodes (n/2), need to flush them all out Batch deletion: Print non-deleted nodes into an array How? Divide & conquer AVL Treebuild Total time:
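A rough sketch of that flush, assuming the lazy-deletion node (with a deleted flag) and the buildTree routine sketched earlier; the names collectLive and flush are illustrative:

#include <vector>
using std::vector;

// Inorder walk that skips deleted nodes: produces the live keys in sorted order, O(n).
void collectLive(Node *root, vector<int> &out) {
    if (root == NULL) return;
    collectLive(root->left, out);
    if (!root->deleted) out.push_back(root->key);
    collectLive(root->right, out);
}

// Total time: O(n) to collect + O(n) to rebuild = O(n).
Node *flush(Node *root) {
    vector<int> live;
    collectLive(root, live);
    return buildTree(live, 0, (int)live.size() - 1);
}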
221
Thinking About AVL Observations
+ Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion + Compatible with lazy deletion - Height fields must be maintained (or 2-bit balance)
222
Alternatives to AVL Trees
Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)
223
Data Structures AVL II Alon Halevy
Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)
224
Imbalance in AVL Trees Last week's conjecture: in AVL trees, if you remove the bottom level, then you get a complete tree. This week's theorems: All nodes, except the leaves and the parents of leaves, have two children. Single-child nodes can be arbitrarily far from the leaves.
225
AVL Tree with Slight Imbalance
8 5 11 2 6 10 12 So, AVL trees will be Binary Search Trees with one extra feature: They balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths 4 7 9 13 14 15
226
Where can we Find Leaves?
Suppose the node N has no children. What is the maximal height of N’s parent? What is the maximal height of N’s grandparent? What is the maximal height of N’s great-grandparent? Conclusion: at what depth can we find a leaf?
227
Deletion (Hard Case #1) Delete(12) 10 5 17 2 9 12 20 3 30 3 2 1
2 3 Delete(12) 10 5 17 2 9 12 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 3 30
228
Single Rotation on Deletion
1 2 3 3 10 10 2 1 5 17 5 20 1 2 9 20 2 9 17 30 But what happened on the fix? Something very disturbing. What? The subtree’s height changed!! So, the deletion can propagate. 3 30 3 What is different about deletion than insertion?
229
Deletion (Hard Case #2) Delete(9) 10 5 17 2 9 12 12 20 20 3 11 15 15
3 4 Delete(9) 10 5 17 2 9 12 12 20 20 Now, let's delete 9. 9 goes away. Now, there's trouble. We've put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 5. 1 1 3 11 15 15 18 30 30 13 13 33 33
230
Double Rotation on Deletion
Not finished! 1 2 3 4 2 1 3 4 10 10 5 17 3 17 2 2 12 20 2 5 12 20 1 1 1 1 3 11 15 18 30 11 15 18 30 13 33 13 33
231
Deletion with Propagation
2 1 3 4 10 What's different about this case? 3 17 2 5 12 20 1 1 We get to choose whether to single or double rotate! 11 15 18 30 13 33
232
Propagated Single Rotation
2 1 3 4 4 10 17 3 2 3 17 10 20 1 2 1 2 5 12 20 3 12 18 30 1 1 1 11 15 18 30 2 5 11 15 33 13 33 13
233
Propagated Double Rotation
2 1 3 4 4 10 12 2 3 3 17 10 17 1 1 2 2 5 12 20 3 11 15 20 1 1 1 11 15 18 30 2 5 13 18 30 13 33 33
234
AVL Deletion Algorithm
Recursive 1. If at node, delete it 2. Otherwise recurse to find it in the left or right subtree 3. Correct heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care), double rotate Iterative 1. Search downward for node, stacking parent nodes 2. Delete node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care), double rotate OK, here's the algorithm again. Notice that there's very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can't I go top down? Now, what's left? Single and double rotate!
235
Fun with AVL Trees Input: sequence of n keys (unordered) 19 3 4 18 7
Insert each into initially empty AVL tree Print using inorder traversal O(n) Result? Are we having fun yet?
236
Is There a Faster Way? But suppose input is already sorted 3 4 7 18 19
Can we do better than O(n log n)?
237
AVL buildTree 5 8 10 15 17 20 30 35 40 Divide & Conquer 17
Divide the problem into parts Solve each part recursively Merge the parts into a general solution 17 IT DEPENDS! How long does divide & conquer take? 8 10 15 5 20 30 35 40
238
BuildTree Example 5 8 10 15 17 20 30 35 40 3 17 5 8 10 15 2 2 20 30 35 40 10 35 20 30 5 8 1 1 8 15 30 40 5 20
239
BuildTree Analysis (Approximate)
T(n) = 2T(n/2) + 1
T(n) = 2(2T(n/4) + 1) + 1 = 4T(n/4) + 3
T(n) = 4(2T(n/8) + 1) + 3 = 8T(n/8) + 7
…
T(n) = 2^k T(n/2^k) + (2^k - 1); let 2^k = n, so k = log n
T(n) = nT(1) + (n - 1) = Θ(n)
Summation is 2^(log n) + 2^(log n - 1) + 2^(log n - 2) + … = n + n/2 + n/4 + n/8 + … ≈ 2n
240
Thinking About AVL Observations
+ Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion - Height fields must be maintained (or 2-bit balance)
241
Alternatives to AVL Trees
Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees (after mid-term) “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)
242
B-Trees
243
Beyond Binary Trees One of the most important applications for search trees is databases If the DB is small enough to fit into RAM, almost any scheme for balanced trees (e.g. AVL) is okay 2000 (WalMart) RAM – 1,000,000 MB DB – 1,000,000 MB (terabyte) 1980 RAM – 1MB DB – 100 MB gap between disk and main memory growing!
244
Time Gap For many corporate and scientific databases, the search tree must mostly be on disk. Accessing disk is about 200,000 times slower than RAM. Visiting a node = accessing the disk. Even perfectly balanced binary trees are a disaster! log2( 10,000,000 ) = 24 disk accesses. Goal: Decrease Height of Tree
245
M-ary Search Tree Maximum branching factor of M
Complete tree has depth = logMN Each internal node in a complete tree has M - 1 keys runtime: Here’s the general idea. We create a search tree with a branching factor of M. Each node has M-1 keys and we search between them. What’s the runtime? O(logMn)? That’s a nice thought, and it’s the best case. What about the worst case? Is the tree guaranteed to be balanced? Is it guaranteed to be complete? Might it just end up being a binary tree?
246
B-Trees B-Trees are specialized M-ary search trees
Each node has many keys subtree between two keys x and y contains values v such that x v < y binary search within a node to find correct subtree Each node takes one full page of memory. 3 7 12 21 To address these problems, we’ll use a slightly more structured M-ary tree: B-Trees. As before, each internal node has M-1 kes. To manage memory problems, we’ll tune the size of a node (or leaf) to the size of a memory unit. Usually, a page or disk block. x<3 3x<7 7x<12 12x<21 21x
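For example, the binary search inside a node might look like the sketch below, assuming the node's search keys sit in a sorted array keys[0..numKeys-1]; subtree i then holds values v with keys[i-1] <= v < keys[i] (ends open). The name childIndex is purely illustrative:

// Return the index of the child subtree to follow when looking for x.
int childIndex(const int keys[], int numKeys, int x) {
    int lo = 0, hi = numKeys;                // answer lies in [lo, hi]
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x < keys[mid]) hi = mid;         // x belongs strictly left of keys[mid]
        else               lo = mid + 1;     // keys[mid] <= x: keep looking right
    }
    return lo;                               // = number of keys <= x
}
// With keys {3, 7, 12, 21}: x = 2 -> child 0, x = 9 -> child 2, x = 21 -> child 4.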
247
B-Tree Properties‡ Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) +/- 1 deep (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time The properties of B-Trees (and the trees themselves) are a bit more complex than previous structures we’ve looked at. Here’s a big, gnarly list; we’ll go one step at a time. The maximum branching factor, as we said, is M (tunable for a given tree). The root has between 2 and M children or at most L keys. (L is another parameter) These restrictions will be different for the root than for other nodes. ‡These are technically B+-Trees
248
B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the other internal nodes (non-leaves) will have between M/2 and M children. The funky symbol is ceiling, the next higher integer above the value. The result of this is that the tree is “pretty” full. Not every node has M children but they’ve all at least got M/2 (a good number). Internal nodes contain only search keys. A search key is a value which is solely for comparison; there’s no data attached to it. The node will have one fewer search key than it has children (subtrees) so that we can search down to each child. The smallest datam between two search keys is equal to the lesser search key. This is how we find the search keys to use.
249
B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the leaves (again, except the root) have a similar restriction. They contain between L/2 and L keys. Notice that means you have to do a search when you get to a leaf to find the item you’re looking for. All the leaves are also at the same depth. So, the tree looks kind of complete. It has the triangle shape, and the nodes branch at least as much as M/2.
250
B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) +/- 1 deep (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time The result of all this is that the tree in the worst case is log n deep. In particular, it’s about logM/2n deep. Does this matter asymptotically? No. What about practically? YES! Since M and L are considered constants, all operations run in log n time. Each operation pulls in at most M search keys or L items at a time. So, we can tune L and M to the size of a disk block!
251
When Big-O is Not Enough
B-Tree is about log_{M/2}( n/(L/2) ) deep = log_{M/2} n - log_{M/2}( L/2 ) = O( log_{M/2} n ) = O(log n) steps per operation (same as BST!) Where's the beef?! log_2( 10,000,000 ) = 24 disk accesses; log_{200/2}( 10,000,000 ) < 4 disk accesses
252
… … B-Tree Nodes Internal node Leaf
i search keys; i+1 subtrees; M - i - 1 inactive entries k1 k2 … ki __ … __ 1 2 i M - 1 Leaf j data keys; L - j inactive entries FIX M-I to M-I-1!! Alright, before we look at any examples, let’s look at what the node structure looks like. Internal nodes are arrays of pointers to children interspersed with search keys. Why must they be arrays rather than linked lists? Because we want contiguous memory! If the node has just I+1 children, it has I search keys, and M-I empty entries. A leaf looks similar (I’ll use green for leaves), and has similar properties. Why are these different? Because internal nodes need subtrees-1 keys. k1 k2 … kj __ … __ 1 2 j L
253
Example B-Tree with M = 4 and L = 4 10 40 3 15 20 30 50 1 2 10 11 12
This is just an example B-tree. Notice that it has 24 entries with a depth of only 2. A BST would be 4 deep. Notice also that the leaves are at the same level in the tree. I’ll use integers as both key and data, but we all know that that could as well be different data at the bottom, right? 1 2 10 11 12 20 25 26 40 42 3 5 6 9 15 17 30 32 33 36 50 60 70
254
Making a B-Tree Insert(3) Insert(14) Now, Insert(1)? The empty B-Tree
M = 3 L = 2 3 3 14 Insert(3) Insert(14) Alright, how do we insert and delete? Let’s start with the empty B-Tree. That’s one leaf as the root. Now, we’ll insert 3 and 14. Fine… What about inserting 1. Is there a problem? Now, Insert(1)?
255
Splitting the Root Insert(1) And create a new root Too many
keys in a leaf! 3 14 14 1 3 1 3 14 Insert(1) And create a new root 1 3 14 Too many keys in a leaf! Run away! How do we solve this? Well, we definitely need to split this leaf in two. But, now we don’t have a tree anymore. So, let’s make a new root and give it as children the two leaves. This is how B-Trees grow deeper. So, split the leaf.
256
Insertions and Split Ends
Too many keys in a leaf! 14 14 14 Insert(59) Insert(26) 1 3 14 26 59 1 3 14 1 3 14 59 14 26 59 So, split the leaf. Now, let’s do some more inserts. 59 is no problem. What about 26? Same problem as before. But, this time the split leaf just goes under the existing node because there’s still room. What if there weren’t room? 14 59 And add a new child 1 3 14 26 59
257
Too many keys in an internal node!
Propagating Splits 14 59 14 59 Insert(5) Add new child 1 3 5 14 26 59 1 3 14 26 59 1 3 5 Too many keys in an internal node! 5 1 3 14 26 59 5 14 26 59 1 3 When we insert 5, the leaf overflows, but its parent already has too many subtrees! What do we do? The same thing as before but this time with an internal node. We split the node. Normally, we'd hang the new subtrees under their parent, but in this case they don't have one. Now we have two trees! Solution: same as before, make a new root and hang these under it. Create a new root So, split the node.
258
Insertion in Boring Text
Insert the key in its leaf If the leaf ends up with L+1 items, overflow! Split the leaf into two nodes: original with ceil((L+1)/2) items, new one with floor((L+1)/2) items Add the new child to the parent If the parent ends up with M+1 items, overflow! If an internal node ends up with M+1 items, overflow! Split the node into two nodes: original with ceil((M+1)/2) items, new one with floor((M+1)/2) items Add the new child to the parent If the parent ends up with M+1 items, overflow! Split an overflowed root in two and hang the new nodes under a new root OK, here's that process as an algorithm. The new funky symbol is floor; that's just like regular C++ integer division. Notice that this can propagate all the way up the tree. How often will it do that? Notice that the two new leaves or internal nodes are guaranteed to have enough items (or subtrees). Because even the floor of (L+1)/2 is as big as the ceiling of L/2. This makes the tree deeper!
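A rough sketch of just the leaf-split step described above, assuming a leaf stores its keys in a sorted array with room for one temporary overflow; L and all names here are illustrative, not the course's B-Tree code:

const int L = 4;                                   // leaf capacity (tunable)

struct Leaf {
    int numKeys;
    int keys[L + 1];                               // one extra slot for the overflow
};

// Split an overflowed leaf (numKeys == L+1) into two legal leaves:
// the original keeps ceil((L+1)/2) keys, the new sibling gets floor((L+1)/2).
Leaf *splitLeaf(Leaf *leaf) {
    Leaf *sibling = new Leaf();
    int keep = (L + 2) / 2;                        // ceil((L+1)/2) via integer division
    sibling->numKeys = (L + 1) - keep;             // floor((L+1)/2)
    for (int i = 0; i < sibling->numKeys; i++)
        sibling->keys[i] = leaf->keys[keep + i];   // move the upper half over
    leaf->numKeys = keep;
    return sibling;                                // caller adds this new child to the parent
}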
259
After More Routine Inserts
14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).
260
Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete!
Just find the key to delete and snip it out! Easy! Done, right?
261
Deletion and Adoption A leaf has too few keys! Delete(5)
14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89
262
Deletion with Propagation
A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89
263
Finishing the Propagation (More Adoption)
Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!
264
A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1
OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89
265
Pulling out the Root A leaf has too few keys!
And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89
266
Pulling out the Root (continued)
has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89
267
Deletion in Two Boring Slides of Text
Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.
268
Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!
269
Thinking about B-Trees
B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion
270
Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?
271
Data Structures B-Trees
Alon Halevy Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)
272
B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) +/- 1 deep (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time The result of all this is that the tree in the worst case is log n deep. In particular, it’s about logM/2n deep. Does this matter asymptotically? No. What about practically? YES! Since M and L are considered constants, all operations run in log n time. Each operation pulls in at most M search keys or L items at a time. So, we can tune L and M to the size of a disk block!
273
When Big-O is Not Enough
B-Tree is about log_{M/2}( n/(L/2) ) deep = log_{M/2} n - log_{M/2}( L/2 ) = O( log_{M/2} n ) = O(log n) steps per operation (same as BST!) Where's the beef?! log_2( 10,000,000 ) = 24 disk accesses; log_{200/2}( 10,000,000 ) < 4 disk accesses
274
… … B-Tree Nodes Internal node Leaf
i search keys; i+1 subtrees; M - i - 1 inactive entries k1 k2 … ki __ … __ 1 2 i M - 1 Leaf j data keys; L - j inactive entries FIX M-I to M-I-1!! Alright, before we look at any examples, let’s look at what the node structure looks like. Internal nodes are arrays of pointers to children interspersed with search keys. Why must they be arrays rather than linked lists? Because we want contiguous memory! If the node has just I+1 children, it has I search keys, and M-I empty entries. A leaf looks similar (I’ll use green for leaves), and has similar properties. Why are these different? Because internal nodes need subtrees-1 keys. k1 k2 … kj __ … __ 1 2 j L
275
Example B-Tree with M = 4 and L = 4 10 40 3 15 20 30 50 1 2 10 11 12
This is just an example B-tree. Notice that it has 24 entries with a depth of only 2. A BST would be 4 deep. Notice also that the leaves are at the same level in the tree. I’ll use integers as both key and data, but we all know that that could as well be different data at the bottom, right? 1 2 10 11 12 20 25 26 40 42 3 5 6 9 15 17 30 32 33 36 50 60 70
276
Making a B-Tree Insert(3) Insert(14) Now, Insert(1)? The empty B-Tree
M = 3 L = 2 3 3 14 Insert(3) Insert(14) Alright, how do we insert and delete? Let’s start with the empty B-Tree. That’s one leaf as the root. Now, we’ll insert 3 and 14. Fine… What about inserting 1. Is there a problem? Now, Insert(1)?
277
Splitting the Root Insert(1) And create a new root Too many
keys in a leaf! 3 14 14 1 3 1 3 14 Insert(1) And create a new root 1 3 14 Too many keys in a leaf! Run away! How do we solve this? Well, we definitely need to split this leaf in two. But, now we don’t have a tree anymore. So, let’s make a new root and give it as children the two leaves. This is how B-Trees grow deeper. So, split the leaf.
278
Insertions and Split Ends
Too many keys in a leaf! 14 14 14 Insert(59) Insert(26) 1 3 14 26 59 1 3 14 1 3 14 59 14 26 59 So, split the leaf. Now, let’s do some more inserts. 59 is no problem. What about 26? Same problem as before. But, this time the split leaf just goes under the existing node because there’s still room. What if there weren’t room? 14 59 And add a new child 1 3 14 26 59
279
Too many keys in an internal node!
Propagating Splits 14 59 14 59 Insert(5) Add new child 1 3 5 14 26 59 1 3 14 26 59 1 3 5 Too many keys in an internal node! 5 1 3 14 26 59 5 14 26 59 1 3 When we insert 5, the leaf overflows, but its parent already has too many subtrees! What do we do? The same thing as before but this time with an internal node. We split the node. Normally, we'd hang the new subtrees under their parent, but in this case they don't have one. Now we have two trees! Solution: same as before, make a new root and hang these under it. Create a new root So, split the node.
280
Insertion in Boring Text
Insert the key in its leaf If the leaf ends up with L+1 items, overflow! Split the leaf into two nodes: original with ceil((L+1)/2) items, new one with floor((L+1)/2) items Add the new child to the parent If the parent ends up with M+1 items, overflow! If an internal node ends up with M+1 items, overflow! Split the node into two nodes: original with ceil((M+1)/2) items, new one with floor((M+1)/2) items Add the new child to the parent If the parent ends up with M+1 items, overflow! Split an overflowed root in two and hang the new nodes under a new root OK, here's that process as an algorithm. The new funky symbol is floor; that's just like regular C++ integer division. Notice that this can propagate all the way up the tree. How often will it do that? Notice that the two new leaves or internal nodes are guaranteed to have enough items (or subtrees). Because even the floor of (L+1)/2 is as big as the ceiling of L/2. This makes the tree deeper!
281
Deletion in B-trees Come to section tomorrow. Slides follow.
282
After More Routine Inserts
14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).
283
Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete!
Just find the key to delete and snip it out! Easy! Done, right?
284
Deletion and Adoption A leaf has too few keys! Delete(5)
14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89
285
Deletion with Propagation
A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89
286
Finishing the Propagation (More Adoption)
Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!
287
A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1
OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89
288
Pulling out the Root A leaf has too few keys!
And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89
289
Pulling out the Root (continued)
has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89
290
Deletion in Two Boring Slides of Text
Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.
291
Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!
292
Thinking about B-Trees
B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion
293
Tree Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?
294
Hash Table Approach Zasha Steve f(x) Nic Brad Ed
But… is there a problem in this pipe-dream?
295
Hash Table Dictionary Data Structure
Hash function: maps keys to integers result: can quickly find the right spot for a given entry Unordered and sparse table result: cannot efficiently list all entries, Cannot find min and max efficiently, Cannot find all items within a specified range efficiently. f(x) Zasha Steve Nic Brad Ed A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing)
296
Hash Table Terminology
hash function Zasha f(x) Steve Nic collision Brad Ed keys load factor = (# of entries in table) / tableSize
297
Hash Table Code First Pass
Value & find(Key & key) { int index = hash(key) % tableSize; return Table[index]; } What should the hash function be? What should the table size be? How should we resolve collisions?
298
A Good Hash Function… is easy (fast) to compute (O(1) and practically fast). distributes the data evenly (hash(a) ≠ hash(b)). uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).
299
Good Hash Function for Integers
Choose tableSize to be prime. hash(n) = n % tableSize Example: tableSize = 7 insert(4) insert(17) find(12) insert(9) delete(17) 1 2 3 4 5 6
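Tracing that example in code, using the first-pass table idea from a few slides back (no collision handling yet; illustrative only):

const int tableSize = 7;                  // prime
int hash(int n) { return n % tableSize; }

// insert(4):  slot 4  % 7 = 4
// insert(17): slot 17 % 7 = 3
// find(12):   look in slot 12 % 7 = 5 -> empty, not found
// insert(9):  slot 9  % 7 = 2
// delete(17): clear slot 17 % 7 = 3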
300
Good Hash Function for Strings?
Ideas?
301
Good Hash Function for Strings?
Sum the ASCII values of the characters. Consider only the first 3 characters. Uses only 2871 out of 17,576 entries in the table on English words. Let s = s1s2s3s4…sn: choose hash(s) = s1 + s2·128 + s3·128^2 + … + sn·128^(n-1). Problems: hash(“really, really big”) = well… something really, really big hash(“one thing”) % 128 = hash(“other thing”) % 128 Think of the string as a base 128 number.
302
Making the String Hash Easy to Compute
Use Horner's Rule int hash(const string & s) { int h = 0; for (int i = s.length() - 1; i >= 0; i--) { h = (s[i] + 128*h) % tableSize; } return h; }
303
Universal Hashing For any fixed hash function, there will be some pathological sets of inputs everything hashes to the same cell! Solution: Universal Hashing Start with a large (parameterized) class of hash functions No sequence of inputs is bad for all of them! When your program starts up, pick one of the hash functions to use at random (for the entire time) Now: no bad inputs, only unlucky choices! If universal class large, odds of making a bad choice very low If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs
304
Universal Hash Function: “Random” Vector Approach
Parameterized by prime size and vector: a = <a0 a1 … ar> where 0 <= ai < size Represent each key as r + 1 integers where ki < size size = 11, key = ==> <3,9,7,5,2> size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4> ha(k) = dot product with a “random” vector!
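A hedged sketch of that dot-product hash: the key is assumed to be already broken into r+1 integers k[0..r], each less than the prime table size, and the vector a is chosen at random once at startup. The names universalHash and pickRandomVector are illustrative:

#include <cstdlib>
#include <vector>
using std::vector;

// h_a(k) = (a[0]*k[0] + a[1]*k[1] + ... + a[r]*k[r]) mod size
int universalHash(const vector<int> &a, const vector<int> &k, int size) {
    long long sum = 0;
    for (size_t i = 0; i < k.size(); i++)
        sum = (sum + (long long)a[i] * k[i]) % size;   // reduce as we go to avoid overflow
    return (int)sum;
}

// Pick the "random" vector once when the program starts, then keep using it.
vector<int> pickRandomVector(int r, int size) {
    vector<int> a(r + 1);
    for (int i = 0; i <= r; i++) a[i] = rand() % size;
    return a;
}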
305
Universal Hash Function
Strengths: works on any type as long as you can form ki’s if we’re building a static table, we can try many a’s a random a has guaranteed good properties no matter what we’re hashing Weaknesses must choose prime table size larger than any ki
306
Hash Function Summary Goals of a hash function Hash functions
reproducible mapping from key to table entry evenly distribute keys across the table separate commonly occurring keys (neighboring keys?) complete quickly Hash functions h(n) = n % size h(n) = string as base 128 number % size Universal hash function #1: dot product with random vector The idea of neighboring keys here may change from application to application. In one context, neighboring keys may be those with the same last characters or first characters… say, when hashing names in a school system. Many people may have the same last names or first names (but few will have the same of both).
307
How to Design a Hash Function
Know what your keys are Study how your keys are distributed Try to include all important information in a key in the construction of its hash Try to make “neighboring” keys hash to very different places Prune the features used to create the hash until it runs “fast enough” (very application dependent)
308
Collisions Pigeonhole principle says we can’t avoid all collisions
try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? open hashing: put little dictionaries in each entry closed hashing: pick a next entry to try The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?