Data Structures Lecture #1 Introduction


1 Data Structures Lecture #1 Introduction
Alon Halevy Spring Quarter 2001

2 Today’s Outline Administrative Stuff Overview of 326 Survey
Introduction to Complexity

3 Course Information Instructor: Alon Halevy <alon@cs>
Office hours: Wed. 4:30-5:30, 310 Sieg Hall. TA: Maya Rodrig; office hours: Monday 1:30-2:30, meet in Sieg 226B. TA (1/2) (and C++ expert): Nicholas Bone. Sections are held in: BLD 392, EE1 031. Text: Data Structures & Algorithm Analysis in C++, 2nd edition, by Mark Allen Weiss

4 Course Policies
Several written homeworks, due at the start of class on the due date. Several programming projects, turned in electronically before 11pm on the due date; 10% penalty for one weekday late, and NOT accepted afterward. Work in teams only on explicit team projects. Grading: homework 25%, projects 25%, midterm 20%, final 30%. Two things here are worth mentioning: first, the late days; second, the final category on the grading. Whichever of the first four you do best on gets an extra 10% of weight. That's good.

5 Course Mechanics 326 Web page: www/education/courses/326/01sp
326 course directory: /cse/courses/cse326 326 mailing list: subscribe to the mailing list using majordomo, see homepage Course labs are 232 and 329 Sieg Hall lab has NT machines w/X servers to access UNIX All programming projects graded on UNIX/g++ GO TO THIS WEB SITE! This and the mailing list will be the primary methods for distributing information. Note that the programming projects will be graded on UNIX with the g++ compiler. This means: TEST ON UNIX WITH THE G++ COMPILER! Even if you don’t develop with those tools.

6 What is this Course About?
Clever ways to organize information in order to enable efficient computation. What do we mean by clever? What do we mean by efficient? Clever – ranging from techniques with which you are already familiar – e.g., representing simple lists – to ones that are more complex, such as hash tables or self-balancing trees. Elegant, mathematically deep, non-obvious. Making the different meanings of “efficient” precise is much of the work of this course!

7 Clever? Efficient? Insert Lists, Stacks, Queues Delete Heaps
Find Merge Shortest Paths Union Lists, Stacks, Queues Heaps Binary Search Trees AVL Trees Hash Tables Graphs Disjoint Sets Data Structures Algorithms

8 Used Everywhere!
Graphics, Theory, AI, Applications, Systems: this material is used everywhere, and mastery of it separates you from the crowd. Perhaps the most important course in your CS curriculum! Guaranteed non-obsolescence!

9 Anecdote #1 An O(N²) “pretty print” routine nearly dooms a major expert system project at AT&T: 10 MB of data meant 10 days of computation (on a 100 MIPS machine). The programmer was brilliant, but he skipped 326…

10 Asymptotic Complexity
Our notion of efficiency: How the running time of an algorithm scales with the size of its input several ways to further refine: worst case average case amortized over a series of runs

11 The Apocalyptic Laptop
Seth Lloyd, SCIENCE, 31 Aug 2000

12 [Chart: problem scales reachable in 1 second, 1 day, and 1 year on a 1000 MIPS machine and on the “Ultimate Laptop”, compared with 1000 MIPS running since the Big Bang.]

13 Specific Goals of the Course
Become familiar with some of the fundamental data structures in computer science Improve ability to solve problems abstractly data structures are the building blocks Improve ability to analyze your algorithms prove correctness gauge (and improve) time complexity Become modestly skilled with the UNIX operating system (you’ll need this in upcoming courses) This course is designed to familiarize you with the most basic and important data structures in computer science. The ones that will form the foundation of all your future work with computers. Moreover, you’ll learn how to analyze your programs and data structures so that you know how well they work and what sort of effort in the program is acceptable. These are the goals of the course as well as my expectations of you.

14 One Preliminary Hurdle
Recall what you learned in CSE 321 … proofs by mathematical induction proofs by contradiction formulas for calculating sums and products of series recursion Know Sec 1.1 – 1.4 of text by heart!

15 A Second Hurdle Unix Experience 1975 all over again!
Try to login, edit, create a Makefile, and compile your favorite “hello world” program right away Programming Project #1 distributed Wednesday Bring your questions and frustrations to Section on Thursday!

16 A Third Hurdle: Templates
class Set_of_ints {
 public:
  void insert( int x );
  bool is_member( int x );
  …
};

template <class Obj>
class Set {
 public:
  void insert( Obj x );
  bool is_member( Obj x );
  …
};

Set <int> SomeNumbers;
Set <char *> SomeWords;
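For concreteness, here is a compilable sketch of the same idea; the linear-scan Set below is just an illustration of the template mechanism, not the course's required implementation, and the method names mirror the slide.

#include <cstddef>
#include <string>
#include <vector>

// One template definition serves every element type.
template <class Obj>
class Set {
public:
    void insert(const Obj& x) { items.push_back(x); }     // no duplicate check, keep it simple
    bool is_member(const Obj& x) const {
        for (std::size_t i = 0; i < items.size(); ++i)
            if (items[i] == x) return true;                // requires Obj to support ==
        return false;
    }
private:
    std::vector<Obj> items;
};

// Usage:  Set<int> someNumbers;  Set<std::string> someWords;

Note that Set<char *> would compare pointer values with ==, so std::string is the safer element type for words.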

17 In Every Silver Lining, There’s a Big Dark Cloud – George Carlin
Templates were invented 12 years ago, and still no compiler correctly implements them! Using templates with multiple source files is tricky; see the course web pages and TAs for the best way. MAINTAINING SANITY RULE: write/debug first without templates, templatize as needed, and keep it simple!

18 Handy Libraries
From Weiss: vector<int> MySafeIntArray; vector<double> MySafeFloatArray; string MySafeString. Like arrays and char*, but they provide bounds checking and memory management. STL (Standard Template Library): most of CSE 326 in a box; don't use it (unless told to); we'll be rolling our own.

19 C++  Data Structures One of the all time great books in computer science: The Art of Computer Programming ( ) by Donald Knuth Examples in assembly language (and English)! American Scientist says: in top 12 books of the CENTURY! Very little about C++ in class.

20 Abstract Data Types
Abstract Data Type (ADT): a mathematical description of an object and the set of operations on the object. Given that this is computer science, I know you'd be disappointed if there were no acronyms in the class. Here's our first one! Now, what an ADT really is, is the interface of a data structure without any specification of the implementation. In this class, we'll study groups of data structures (and the tradeoffs among them) to implement any given abstract data type. In that context… Data Types: integer, array, pointers, … Algorithms: binary search, quicksort, …

21 ADT Presentation Algorithm
Present an ADT Motivate with some applications Repeat until it’s time to move on: develop a data structure and algorithms for the ADT analyze its properties efficiency correctness limitations ease of programming Contrast strengths and weaknesses Given those definitions, here’s our first algorithm. This is how I’m going to try to present each set of data structures to you. You should hold me to this! You’re not getting enough out of the presentation if you don’t see these. And look, here’s an ADT now…

22 First Example: Queue ADT
Queue operations create destroy enqueue dequeue is_empty Queue property: if x is enQed before y is enQed, then x will be deQed before y is deQed FIFO: First In First Out F E D C B enqueue dequeue G A You’ve probably seen the Queue before. If so, this is a review and a way for us to get comfortable with the format of data structure presentations in this class. If not, this is a simple but very powerful data structure, and you should make sure you understand it thoroughly. This is an ADT description of the queue. Notice that there are no implementation details. Just a general description of the interface and important properties of those interface methods.

23 Applications of the Q Hold jobs for a printer
Store packets on network routers Make waitlists fair Breadth first search Qs are used widely in computer science. This is just a handful of the high profile uses, but _many_ programs use queues.

24 Circular Array Q Data Structure
size - 1 b c d e f front back
enqueue(Object x) { Q[back] = x; back = (back + 1) % size; }
dequeue() { x = Q[front]; front = (front + 1) % size; return x; }
How do we test for an empty list? How do we find the k-th element in the queue? What is the complexity of these operations? What are the limitations of this structure? Here is a data structure implementation of the Q. The queue is stored as an array, and, to avoid shifting all the elements each time an element is dequeued, we imagine that the array wraps around on itself. This is an excellent example of how implementation can affect interface: notice the “is_full” function. There's also another problem here. What's wrong with the Enqueue and Dequeue functions? Your data structures should be robust! Make them robust before you even consider thinking about making them efficient! That is an order!
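To make those questions concrete, here is a minimal C++ sketch of a circular-array queue; CircularQueue and its method names are illustrative, not the course's required interface. It deliberately keeps one array slot unused so that front == back can only mean "empty", and it checks for overflow and underflow, which is exactly what the slide's enqueue/dequeue omit.

#include <cassert>

class CircularQueue {
public:
    explicit CircularQueue(int capacity)
        : size(capacity + 1), front(0), back(0), data(new int[size]) {}
    ~CircularQueue() { delete [] data; }

    bool is_empty() const { return front == back; }
    bool is_full()  const { return (back + 1) % size == front; }

    void enqueue(int x) {
        assert(!is_full());            // robustness first, efficiency second
        data[back] = x;
        back = (back + 1) % size;
    }

    int dequeue() {
        assert(!is_empty());
        int x = data[front];
        front = (front + 1) % size;
        return x;
    }

private:
    int  size;    // physical array length (capacity + 1; one slot stays unused)
    int  front;   // index of the oldest element
    int  back;    // index one past the newest element
    int* data;
};

With this layout, is_empty is just front == back, the k-th element (1-based) is data[(front + k - 1) % size], every operation is O(1), and the obvious limitation is the fixed capacity.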

25 Linked List Q Data Structure
b c d e f front back enqueue(Object x) { back->next = new Node(x); back = back->next; } dequeue() { saved = front->data; temp = front; front = front->next; delete temp ; return saved;} What are tradeoffs? simplicity speed robustness memory usage Notice the tricky memory management
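Here is one way to flesh that out in C++; this is a sketch, and Node, ListQueue, and the dummy-header design are assumptions of this example rather than the course's required interface. The slide's two-line enqueue/dequeue assume a non-empty queue; the dummy header removes those special cases, which is the "tricky memory management" the note alludes to.

#include <cassert>
#include <cstddef>

struct Node {
    int   data;
    Node* next;
    Node(int d = 0, Node* n = NULL) : data(d), next(n) {}
};

class ListQueue {
public:
    ListQueue() : header(new Node()), back(header) {}
    ~ListQueue() { while (!is_empty()) dequeue(); delete header; }

    bool is_empty() const { return header->next == NULL; }

    void enqueue(int x) {                 // O(1): append after the last node
        back->next = new Node(x);
        back = back->next;
    }

    int dequeue() {                       // O(1): unlink the node after the header
        assert(!is_empty());
        Node* temp  = header->next;
        int   saved = temp->data;
        header->next = temp->next;
        if (back == temp) back = header;  // queue just became empty
        delete temp;
        return saved;
    }

private:
    Node* header;   // dummy node; the real front is header->next
    Node* back;     // last real node, or header when the queue is empty
};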

26 To Do Return your survey before leaving!
Sign up on the cse326 mailing list Check out the web page Log on to the PCs in course labs and access an instructional UNIX server Read Chapters 1 and 2 in the book

27 CSE 326: Data Structures Lecture #2 Analysis of Algorithms
Alon Halevy Fall Quarter 2000

28 Analysis of Algorithms
Analysis of an algorithm gives insight into how long the program runs and how much memory it uses: time complexity and space complexity. Why is this useful? Input size is indicated by a number n (sometimes there are multiple inputs, e.g. m and n). Running time is a function of n, e.g. n, n², n log n, n log(n²) + 5n³.

29 Simplifying the Analysis
Eliminate low-order terms: 4n ⇒ 4n; 0.5 n log n - 2n ⇒ n log n; 2ⁿ + n³ + 3n ⇒ 2ⁿ. Eliminate constant coefficients: 4n ⇒ n; 0.5 n log n ⇒ n log n; log(n²) = 2 log n ⇒ log n; log₃ n = (log₃ 2) log n ⇒ log n. We didn't get very precise in our analysis of the UWID info finder; why? We didn't know the machine we'd use. Is this always true? Do you buy that coefficients and low-order terms don't matter? When might they matter? (Linked list memory usage.)

30 Order Notation
BIG-O: T(n) = O(f(n)), an upper bound. There exist constants c and n₀ such that T(n) ≤ c f(n) for all n ≥ n₀. OMEGA: T(n) = Ω(f(n)), a lower bound: T(n) ≥ c f(n) for all n ≥ n₀. THETA: T(n) = Θ(f(n)), a tight bound: T(n) = Θ(n) means T(n) = O(n) and T(n) = Ω(n). We'll use some specific terminology to describe asymptotic behavior. There are some analogies here that you might find useful.
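Stated symbolically (a standard formulation consistent with the slide's wording):

\[
\begin{aligned}
T(n) = O(f(n)) &\iff \exists\, c > 0,\ n_0:\ T(n) \le c\, f(n) \ \text{for all } n \ge n_0,\\
T(n) = \Omega(f(n)) &\iff \exists\, c > 0,\ n_0:\ T(n) \ge c\, f(n) \ \text{for all } n \ge n_0,\\
T(n) = \Theta(f(n)) &\iff T(n) = O(f(n)) \ \text{and}\ T(n) = \Omega(f(n)).
\end{aligned}
\]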

31 Examples
n² + 100n = O(n²) = Ω(n²) = Θ(n²): (n² + 100n) ≤ 2n² for n ≥ 10, and (n² + 100n) ≥ 1·n² for n ≥ 0. Also: n log n = O(n²); n log n = Θ(n log n); n log n = Ω(n).

32 More on Order Notation Order notation is not symmetric; write
2n² + 4n = O(n²), but never O(n²) = 2n² + 4n; the right-hand side is a crudification of the left. Likewise, a function that is O(n²) may be crudified to O(n³), and one that is Ω(n³) may be crudified to Ω(n²).

33 A Few Comparisons Function #2 Function #1 100n2 + 1000 n3 + 2n2 log n

34 Race I n³ + 2n² vs. 100n² + 1000

35 Race II n^0.1 vs. log n. Well, log n looked good out of the starting gate and indeed kept on looking good until about n = 10^17, at which point n^0.1 passed it up forever. Moral of the story? n^ε beats log n for any ε > 0. BUT, which one of these is really better?

36 Race III n + 100n^0.1 vs. 2n + 10 log n. Notice that these just look like n and 2n once we get way out. That's because the larger terms dominate. So, the left is less, but not asymptotically less. It's a TIE!

37 Race IV 5n⁵ vs. n!. n! is BIG!!!

38 Race V n^(-15)·2ⁿ/100 vs. 1000n^15. No matter how you put it, any exponential beats any polynomial. It doesn't even take that long here (input size ~250).

39 Race VI 8^(2 log n) vs. 3n⁷ + 7n. We can reduce the left-hand term to n⁶, so they're both polynomial and it's an open-and-shut case.

40 The Losers Win
Race results (the slower-growing function is the better algorithm). Race I: n³ + 2n² vs. 100n² + 1000, better algorithm O(n²). Race II: n^0.1 vs. log n, better O(log n). Race III: n + 100n^0.1 vs. 2n + 10 log n, a TIE at O(n). Race IV: 5n⁵ vs. n!, better O(n⁵). Race V: n^(-15)·2ⁿ/100 vs. 1000n^15, better O(n^15). Race VI: 8^(2 log n) vs. 3n⁷ + 7n, better O(n⁶). Welcome, everyone, to the Silicon Downs. I'm getting race results as we stand here. Let's start with the first race. I'll have the first row bet on race #1. Raise your hand if you bet on function #1 (the jockey is n^0.1). And so on. Show the race slides after each race.

41 Common Names constant: O(1) logarithmic: O(log n) linear: O(n)
log-linear: O(n log n); superlinear: O(n^(1+c)) (c is a constant > 0); quadratic: O(n²); polynomial: O(n^k) (k is a constant); exponential: O(c^n) (c is a constant > 1). Well, it turns out that the old Silicon Downs is fixed. They dope up the horses to make the first few laps interesting, but we can always find out who wins. Here's a chart comparing some of the functions. Notice that any exponential beats any polynomial. Any superlinear beats any poly-log-linear. Also keep in mind (though I won't show it) that sometimes the input has more than one parameter. Like if you take in two strings. In that case you need to be very careful about what is constant and what can be ignored. O(log m + 2^n) is not necessarily O(2^n).

42 Kinds of Analysis Running time may depend on actual data input, not just length of input Distinguish worst case your worst enemy is choosing input best case average case assumes some probabilistic distribution of inputs amortized average time over many operations We already discussed the bound flavor. All of these can be applied to any analysis case. For example, we’ll later prove that sorting in the worst case takes at least n log n time. That’s a lower bound on a worst case. Average case is hard! What does “average” mean. For example, what’s the average case for searching an unordered list (as precise as possible, not asymptotic). WRONG! It’s about n, not 1/2 n. Why? You have to search the whole thing if the elt is not there. Note there’s two senses of tight. I’ll try to avoid the terminology “asymptotically tight” and stick with the lower def’n of tight. O(inf) is not tight!

43 Analyzing Code C++ operations - constant time
consecutive stmts - sum of times conditionals - sum of branches, condition loops - sum of iterations function calls - cost of function body recursive functions - solve recursive equation Above all, use your head!

44 Nested Loops for i = 1 to n do for j = 1 to n do sum = sum + 1
This example is pretty straightforward. Each loop goes N times, constant amount of work on the inside. N*N*1 = O(N^2)

45 Nested Dependent Loops
for i = 1 to n do for j = i to n do sum = sum + 1. There's a little twist here: j goes from i to n, not 1 to n. So, let's do the sums. The inside is constant. The inner loop is the sum from j = i to n of 1, which equals n - i + 1. The outer loop is then the sum from i = 1 to n of (n - i + 1), which is the same as the sum from 1 to n of i, or n(n+1)/2, or O(n²).
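Writing that derivation as a single sum:

\[
\sum_{i=1}^{n} \sum_{j=i}^{n} 1 \;=\; \sum_{i=1}^{n} (n - i + 1) \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; \Theta(n^2).
\]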

46 Conditionals
Conditional: if C then S1 else S2; time ≤ time(C) + Max( time(S1), time(S2) ). OK, so this isn't exactly an example, just a reiteration of the rule: time ≤ time of C plus the max of S1 and S2, which is ≤ time of C plus S1 plus S2. For a loop, time ≤ the sum of the times of its iterations, often #iterations × time of S (or the worst time of S).

47 Coming Up
Thursday: Unix tutorial; first programming project! Friday: finishing up analysis; a little on Stacks and Lists; Homework #1 goes out.

48 CSE 326: Data Structures Lecture #3 Analysis of Recursive Algorithms
Alon Halevy Fall Quarter 2000

50 Recursion A recursive procedure can often be analyzed by solving a recursive equation Basic form: T(n) = if (base case) then some constant else ( time to solve subproblems + time to combine solutions ) Result depends upon how many subproblems how much smaller are subproblems how costly to combine solutions (coefficients) You may want to take notes on this slide as it just vaguely resembles a homework problem! Here’s a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH Answer: O(n)

51 Example: Sum of Integer Queue
sum_queue(Q){ if (Q.length == 0) return 0; else return Q.dequeue() + sum_queue(Q); } One subproblem; linear reduction in size (decrease by 1); combining: constant c (+), 1×subproblem. Equation: T(0) ≤ b; T(n) ≤ c + T(n – 1) for n > 0. Here's a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH. Answer: O(n)

52 Sum, Continued Equation: T(0) ≤ b; T(n) ≤ c + T(n – 1) for n > 0
Solution: T(n) ≤ c + c + T(n-2) ≤ c + c + c + T(n-3) ≤ kc + T(n-k) for all k; taking k = n, ≤ nc + T(0) ≤ cn + b = O(n)

53 Example: Binary Search
7 12 30 35 75 83 87 90 97 99. One subproblem, half as large. Equation: T(1) ≤ b; T(n) ≤ T(n/2) + c for n > 1. Solution: T(n) ≤ T(n/2) + c ≤ T(n/4) + c + c ≤ T(n/8) + c + c + c ≤ T(n/2^k) + kc ≤ T(1) + c log n (where k = log n) ≤ b + c log n = O(log n). Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation.
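For concreteness, here is a minimal iterative C++ sketch of binary search over a sorted array; the names are illustrative, and each pass through the loop does the constant work c while halving the range, exactly the T(n) ≤ T(n/2) + c behavior above.

// Returns the index of x in the sorted array a[0..n-1], or -1 if absent. O(log n).
int binarySearch(const int a[], int n, int x) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;       // avoids overflow of lo + hi
        if (a[mid] == x)      return mid;
        else if (a[mid] < x)  lo = mid + 1; // discard the left half
        else                  hi = mid - 1; // discard the right half
    }
    return -1;
}

// With the slide's array {7, 12, 30, 35, 75, 83, 87, 90, 97, 99},
// binarySearch(a, 10, 83) returns index 5.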

54 Example: MergeSort Split array in half, sort each half, merge together
2 subproblems, each half as large; a linear amount of work to combine. T(1) ≤ b; T(n) ≤ 2T(n/2) + cn for n > 1. T(n) ≤ 2T(n/2) + cn ≤ 2(2T(n/4) + cn/2) + cn = 4T(n/4) + cn + cn ≤ 4(2T(n/8) + c(n/4)) + cn + cn = 8T(n/8) + cn + cn + cn ≤ … ≤ 2^k T(n/2^k) + kcn ≤ 2^k T(1) + cn log n where k = log n, so T(n) = O(n log n). This is the same sort of analysis as the last slide. Here's a function defined in terms of itself. WORK THROUGH. Answer: O(n log n). Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation.
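Here is a compact C++ sketch of the algorithm; mergeSort and the use of std::vector are choices of this example, not Weiss's exact code. The comments mark where the two T(n/2) subproblems and the cn combining step appear.

#include <vector>

static void merge(std::vector<int>& a, std::vector<int>& tmp, int lo, int mid, int hi) {
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    for (k = lo; k <= hi; ++k) a[k] = tmp[k];     // copy the merged run back
}

static void mergeSort(std::vector<int>& a, std::vector<int>& tmp, int lo, int hi) {
    if (lo >= hi) return;                          // base case: 0 or 1 element
    int mid = lo + (hi - lo) / 2;
    mergeSort(a, tmp, lo, mid);                    // sort left half:  T(n/2)
    mergeSort(a, tmp, mid + 1, hi);                // sort right half: T(n/2)
    merge(a, tmp, lo, mid, hi);                    // combine:         cn
}

void mergeSort(std::vector<int>& a) {
    std::vector<int> tmp(a.size());
    if (!a.empty()) mergeSort(a, tmp, 0, (int)a.size() - 1);
}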

55 Example: Recursive Fibonacci
int Fib(n){ if (n == 0 or n == 1) return 1; else return Fib(n - 1) + Fib(n - 2); } Running time, lower bound analysis: T(0), T(1) ≥ 1; T(n) ≥ T(n - 1) + T(n - 2) + c if n > 1. Note: T(n) ≥ Fib(n). Fact: Fib(n) ≥ (3/2)ⁿ for large enough n, so T(n) = Ω( (3/2)ⁿ ). Why? This is the same sort of analysis as the last slide. Here's a function defined in terms of itself. WORK THROUGH. Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.

56 Direct Proof of Recursive Fibonacci
int Fib(n) if (n == 0 or n == 1) return 1 else return Fib(n - 1) + Fib(n - 2). Lower bound analysis: T(0), T(1) >= b; T(n) >= T(n - 1) + T(n - 2) + c if n > 1. Analysis: let φ be (1 + √5)/2, which satisfies φ² = φ + 1; show by induction on n that T(n) >= b·φ^(n - 1). This is the same sort of analysis as the last slide. Here's a function defined in terms of itself. WORK THROUGH. Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.

57 Direct Proof Continued
Basis: T(0)  b > b-1 and T(1)  b = b0 Inductive step: Assume T(m)  bm - 1 for all m < n T(n)  T(n - 1) + T(n - 2) + c  bn-2 + bn-3 + c  bn-3( + 1) + c = bn-32 + c  bn-1

58 Fibonacci Call Tree [Figure: the tree of recursive calls for Fib(5); subproblems such as Fib(3), Fib(2), and Fib(1) appear over and over.]

59 Learning from Analysis
To avoid recursive calls store all basis values in a table each time you calculate an answer, store it in the table before performing any calculation for a value n check if a valid answer for n is in the table if so, return it Memoization a form of dynamic programming How much time does memoized version take?
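A minimal sketch of that idea in C++; the names fibMemo and table are mine, and 0 is used as the "not yet computed" marker since every real Fib value here is at least 1. Each value is computed once, so the exponential recursion collapses to O(n) time, answering the slide's question.

#include <vector>

long long fibMemo(int n, std::vector<long long>& table) {
    if (table[n] != 0) return table[n];          // a valid answer is already stored
    long long result;
    if (n == 0 || n == 1) result = 1;            // basis values, as in the slide's Fib
    else result = fibMemo(n - 1, table) + fibMemo(n - 2, table);
    table[n] = result;                           // store it before returning
    return result;
}

long long fib(int n) {
    std::vector<long long> table(n + 1, 0);      // 0 marks "not yet computed"
    return fibMemo(n, table);
}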

60 Kinds of Analysis So far we have considered worst case analysis
We may want to know how an algorithm performs “on average” Several distinct senses of “on average” amortized average time per operation over a sequence of operations average case average time over a random distribution of inputs expected case average time for a randomized algorithm over different random seeds for any input

61 Amortized Analysis Consider any sequence of operations applied to a data structure your worst enemy could choose the sequence! Some operations may be fast, others slow Goal: show that the average time per operation is still good

62 Stack ADT Stack operations
B C D E F E D C B A F Stack operations push pop is_empty Stack property: if x is on the stack before y is pushed, then x will be popped after y is popped What is biggest problem with an array implementation?

63 Stretchy Stack Implementation
int data[]; int maxsize; int top;
Push(e) {
  if (top == maxsize) {
    temp = new int[2*maxsize];
    copy data into temp;
    deallocate data;
    data = temp;
    maxsize = 2*maxsize;
  }
  data[++top] = e;
}
Best case Push = O( ) Worst case Push = O( )

64 Stretchy Stack Amortized Analysis
Consider a sequence of n operations: push(3); push(19); push(2); … What is the max number of stretches? What is the total time? Let's say a regular push takes time a, and stretching an array containing k elements takes time kb, for some constants a and b. Amortized time = (an + b(2n - 1))/n = a + 2b - b/n = O(1).
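Spelling out the arithmetic (assuming the array doubles starting from size 1 and, for simplicity, that n is a power of two):

\[
\text{total cost} \;\le\; a n \;+\; b\,(1 + 2 + 4 + \cdots + n) \;=\; a n + b\,(2n - 1),
\qquad
\text{amortized cost} \;=\; \frac{a n + b\,(2n - 1)}{n} \;=\; a + 2b - \frac{b}{n} \;=\; O(1).
\]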

65 Wrapup Having math fun? Homework #1 out Wednesday – due in one week
Programming assignment #1 handed out. Next week: linked lists

66 CSE 326: Data Structures Lecture #4
Alon Halevy Spring Quarter 2001

67 Agenda Today: Finish complexity issues.
Linked links (Read Ch 3; skip “radix sort”)

77 Average Case Analysis Attempt to capture the notion of “typical” performance Imagine inputs are drawn from some random distribution Ideally this distribution is a mathematical model of the real world In practice usually is much more simple – e.g., a uniform random distribution

78 Example: Find a Red Card
Input: a deck of n cards, half red and half black Algorithm: turn over cards (from top of deck) one at a time until a red card is found. How many cards will be turned over? Best case = Worst case = Average case: over all possible inputs (ways of shuffling deck)

79 Summary Asymptotic Analysis – scaling with size of input
Upper bound O, lower bound Ω. O(1) or O(log n): great. O(2ⁿ): almost never okay. Worst case is most important – a strong guarantee. Other kinds of analysis are sometimes useful: amortized, average case.

80 List ADT ( A1 A2 … An-1 An ) List properties length = n Key operations
Ai precedes Ai+1 for 1 ≤ i < n; Ai succeeds Ai-1 for 1 < i ≤ n. The size-0 list is defined to be the empty list. Key operations: Find(item) = position; Find_Kth(integer) = item; Insert(item, position); Delete(position); Next(position) = position. What are some possible data structures? ( A1 A2 … An-1 An ) length = n. Now, back to work! We're going to talk about lists briefly and quickly get to an idea which I hope you haven't seen. Lists are sets of values. The type of those values is arbitrary but fixed (can't change from one to another in the same list). Each value is at a position, and those positions are totally ordered.

81 Implementations of Linked Lists
Array: 1 2 3 4 5 6 7 8 9 10 H W 1 I S E A S Y Can we apply binary search to an array representation? Linked list: (optional header) (a b c) a b c L

82 Linked List vs. Array
Compare the cost of each operation for a linked list, an (unsorted) array, and a sorted array: Find(item) = position; Find_Kth(integer) = item; Find_Kth(1) = item; Insert(item, position); Insert(item); Delete(position); Next(position) = position.
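One reasonable way to fill in the comparison (a sketch, assuming a singly linked list with a front pointer, n elements, and that a "position" already identifies the node of interest along with its predecessor):

Find(item): linked list O(n); array O(n); sorted array O(log n) via binary search.
Find_Kth(k): linked list O(k); array O(1); sorted array O(1).
Find_Kth(1): O(1) for all three.
Insert(item, position): linked list O(1); array O(n) (shift elements); sorted array O(n) (must preserve order).
Delete(position): linked list O(1); array O(n); sorted array O(n).
Next(position): O(1) for all three.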

83 Tradeoffs For what kinds of applications is a linked list best?
Examples for an unsorted array? Examples for a sorted array?

84 Implementing in C++ (optional header) (a b c)
Create separate classes for: Node; List (contains a pointer to the first node); List Iterator (specifies a position in a list; basically, just a pointer to a node). Pro: syntactically distinguishes uses of node pointers. Con: a lot of verbiage! Also, is a position in a list really distinct from a list?

85 CSE 326: Data Structures Lecture #5
Alon Halevy Spring Quarter 2001

90 Other Data Structures for Lists
Doubly Linked List; Circular List. 7 11 3 2 Advantages/disadvantages? (A previous pointer for the doubly linked list.) Your book also describes header nodes. Are they just a hack? I'm not going to go into these, but you should be able to (for a test) add and delete nodes in all these types of list; not to mention for your daily coding needs! c d e f

91 Implementing Linked Lists Using Arrays
1 2 3 4 5 6 7 8 9 10 Data F O A R N R T Next 3 8 6 4 -1 10 5 First = 2 “Cursor implementation” Ch 3.2.8 Often useful in any language Can use same array to manage a second list of unused cells
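A small sketch of the idea in C++ (illustrative names, not the book's exact code): "pointers" become array indices into a shared pool, -1 plays the role of NULL, and the unused cells are themselves threaded into a second list through the same array.

struct CursorNode {
    char data;
    int  next;                 // index of the next node, or -1 (the "NULL" index)
};

const int POOL_SIZE = 11;
CursorNode pool[POOL_SIZE];    // every list lives in this shared pool
int freeList = -1;             // head of the list of unused cells

void initPool() {              // thread all cells onto the free list
    for (int i = 0; i < POOL_SIZE - 1; ++i) pool[i].next = i + 1;
    pool[POOL_SIZE - 1].next = -1;
    freeList = 0;
}

int allocateCell() {           // the cursor-world analogue of "new Node"
    int cell = freeList;
    if (cell != -1) freeList = pool[cell].next;
    return cell;
}

void freeCell(int cell) {      // the analogue of "delete": push back onto the free list
    pool[cell].next = freeList;
    freeList = cell;
}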

92 Application: Polynomial ADT
Ai is the coefficient of the x^(n-i) term: 3x² + 2x + 5 is ( 3 2 5 ); 8x + 7 is ( 8 7 ); x² + 3 is ( 1 0 3 ). Here's an application of the list abstract data type as a _data structure_ for another abstract data type. Is there a problem here? Why? Problem?

93 3x^2001 + 4 as a dense coefficient list ( 3 0 0 … 0 4 ): what is it about lists that makes this a problem here and not in stacks and queues? (Answer: kth(int)!) Is there a solution? Will we get anything but zeroes overwhelming this data structure?

95 Sparse List Data Structure: 3x^2001 + 4
(<4 0> <2001 3>) 4 3 2001 This slide is made possible in part by the sparse list data structure. Now, two questions: 1) Is a sparse list really a data structure or an abstract data type? (Answer: It depends but I lean toward data structure. YOUR ANSWER MUST HAVE JUSTIFICATION!) 2) Which list data structure should we use to implement it? Linked Lists or Arrays?

95 Addition of Two Polynomials
Similar to merging two sorted lists – O(n+m). p = 15 + 10x^50 + 3x^1200 (nodes 15, 10|50, 3|1200); q = 5 + 30x^50 + 4x^100 (nodes 5, 30|50, 4|100); r = p + q = 20 + 40x^50 + 4x^100 + 3x^1200.
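A sketch of that merge in C++; Term, addPoly, and the increasing-exponent ordering are assumptions of this example, not the book's exact interface.

#include <cstddef>

struct Term {
    int   coeff;
    int   exp;
    Term* next;
    Term(int c, int e, Term* n = NULL) : coeff(c), exp(e), next(n) {}
};

// Add two sparse polynomials stored as linked lists of (coeff, exp) terms
// in increasing exponent order; a single merge pass, O(n + m).
Term* addPoly(Term* p, Term* q) {
    Term  header(0, 0);                  // dummy header simplifies appending
    Term* tail = &header;
    while (p != NULL && q != NULL) {
        if (p->exp < q->exp)      { tail->next = new Term(p->coeff, p->exp); p = p->next; }
        else if (q->exp < p->exp) { tail->next = new Term(q->coeff, q->exp); q = q->next; }
        else {                           // equal exponents: add the coefficients
            int c = p->coeff + q->coeff;
            if (c != 0) tail->next = new Term(c, p->exp);
            p = p->next;  q = q->next;
        }
        if (tail->next != NULL) tail = tail->next;
    }
    for (; p != NULL; p = p->next) { tail->next = new Term(p->coeff, p->exp); tail = tail->next; }
    for (; q != NULL; q = q->next) { tail->next = new Term(q->coeff, q->exp); tail = tail->next; }
    return header.next;
}

// With the slide's example, addPoly turns p = 15 + 10x^50 + 3x^1200 and
// q = 5 + 30x^50 + 4x^100 into r = 20 + 40x^50 + 4x^100 + 3x^1200.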

96 Multiple Linked Lists Many ADTs such as graphs, relations, sparse matrices, and multivariate polynomials use multiple linked lists. Several options: array of lists; lists of lists; multi-lists. General principle throughout the course: use one ADT to implement a more complicated one.

97 Array of Linked Lists: Adjacency List for Graphs
1 3 2 5 4 Array G of unordered linked lists Each list entry corresponds to an edge in the graph G Graphs are a very important data type. You might think as you read about your project if there are any graphs there. Here, we’re implementing graphs with adjacency lists. The reason is that this is a sparse graph. We want to have every node in an array (so we can find the first edge quickly), but we just need the edges around. 1 5 2 2 4 3 5 3 1 4 4 5 3 5

98 Reachability by Marking
Suppose we want to mark all the nodes in the graph which are reachable from a given node k. Let G[1..n] be the adjacency-list representation of the graph, and let M[1..n] be the mark array, initially all false.
mark(int i) {
  M[i] = true;
  x = G[i];
  while (x != NULL) {
    if (M[x->node] == false)
      mark(x->node);
    x = x->next;
  }
}
Here's an algorithm that works on our adjacency-list graph.

99 Multi-Lists Suppose we have a set of movies and cinemas, and we want a structure that stores which movies are playing where.

100 More on Multi-Lists What if we also want to store the playing times of movies?

101 CSE 326: Data Structures Lecture #6 (end of Lists, then) Trees
Alon Halevy Spring Quarter 2001

111 Trees Family Trees Organization Charts Classification trees
is this mushroom poisonous? File directory structure Parse Trees (x+y*z) Search Trees often better than lists for sorted data

112 Definition of a Tree Recursive definition: r T1 T2 T3
empty tree has no root given trees T1,…,Tk and a node r, there is a tree T where r is the root of T the children of r are the roots of T1, T2, …, Tk r T1 T2 T3

113 Tree Terminology root child parent sibling path descendent ancestor a
j b f k l e c g Let’s review the words: root: A leaf: DEFJKLMNI child:A - C or H - K leaves have no children parent: C - A or L - H the root has no parent sibling: D - E or F or J - K,L,M, or N grandparent: G to A grandchild: C to H or I ancestor: the node itself or any ancestor’s parent descendent: the node itself or any child’s descendent subtree: a node and all its descendents

114 More Tree Terminology a subtree leaf depth height branching factor
n-ary complete e b c d h i j f g Let’s review the words: root: A leaf: DEFJKLMNI child:A - C or H - K leaves have no children parent: C - A or L - H the root has no parent sibling: D - E or F or J - K,L,M, or N grandparent: G to A grandchild: C to H or I ancestor: the node itself or any ancestor’s parent descendent: the node itself or any child’s descendent subtree: a node and all its descendents k l

115 Basic Tree Data Structure
first_child next_sibling a b c d e
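A minimal C++ sketch of this representation (the names are illustrative): each node stores a pointer to its leftmost child and to its next sibling, so a node can have any number of children.

#include <cstddef>

struct TreeNode {
    char      data;
    TreeNode* first_child;    // leftmost child, or NULL for a leaf
    TreeNode* next_sibling;   // next child of the same parent, or NULL
    TreeNode(char d) : data(d), first_child(NULL), next_sibling(NULL) {}
};

// One possible shape, purely as an example: a has children b, c, d; d has child e.
// TreeNode *a = new TreeNode('a'), *b = new TreeNode('b'), *c = new TreeNode('c'),
//          *d = new TreeNode('d'), *e = new TreeNode('e');
// a->first_child = b;  b->next_sibling = c;  c->next_sibling = d;  d->first_child = e;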

116 Logical View of Tree a i d h j b f k l e c g

117 Actual Data Structure a b c d e h i j f g k l

118 Combined View of Tree a b c d e h i j f g k l

119 Traversals Many algorithms involve walking through a tree, and performing some computation at each node Walking through a tree is called a traversal Common kinds of traversal Pre-order Post-order Level-order

120 Pre-Order Traversal Perform computation at the node, then recursively perform computation on each child preorder(node * n){ node * c; if (! n==NULL){ DO SOMETHING; c = n->first_child; while (! c==NULL){ preorder(c); c = c->next_sibling; } }

121 Pre-Order Traversal Example
i d h j b f k l e c g Start with a -

122 Pre-Order Applications
Use when computation at node depends upon values calculated higher in the tree (closer to root) Example: computing depth depth(node) = 1 + depth( parent of node ) Another example: printing out a directory structure.

123 Computing Depth of All Nodes
Add a field “depth” to all nodes.
Depth(node * n, int d) {
  node * c;
  if (n != NULL) {
    n->depth = d;
    d = d + 1;
    c = n->first_child;
    while (c != NULL) {
      Depth(c, d);
      c = c->next_sibling;
    }
  }
}
Call Depth(root, 0) to set the depth field correctly.

124 Depth Calculation a i d h j b f k l e c g

125 Post-Order Traversal
Recursively perform computation on each child, and then perform computation at the node.
postorder(node * n) {
  node * c;
  if (n != NULL) {
    c = n->first_child;
    while (c != NULL) {
      postorder(c);
      c = c->next_sibling;
    }
    DO SOMETHING;
  }
}

126 Post-Order Applications
Use when the computation at a node depends upon values calculated lower in the tree (closer to the leaves). Example: computing height: height(node) = 1 + MAX( height(child1), height(child2), …, height(childk) ). Example: size of the tree rooted at a node: size(node) = 1 + size(child1) + size(child2) + … + size(childk).

127 Computing Size of Tree
Size(node * n) {
  node * c;
  if (n == NULL) return 0;
  else {
    int m = 1;
    c = n->first_child;
    while (c != NULL) {
      m = m + Size(c);
      c = c->next_sibling;
    }
    return m;
  }
}
Call Size(root) to compute the number of nodes in the tree.

128 Depth-First Search Both Pre-Order and Post-Order traversals are examples of depth-first search nodes are visited deeply on the left-most branches before any nodes are visited on the right-most branches visiting the right branches deeply before the left would still be depth-first! Crucial idea is “go deep first!” In DFS the nodes “being worked on” are kept on a stack (where?)

129 Level-Order/Breadth-first Traversal
Consider task of traversing tree level by level from top to bottom (alphabetic order) What data structure to use to keep track of nodes?? a i d h j b f k l e c g

130 Level-Order (Breadth First) Traversal
Put the root in a Queue. Repeat until the Queue is empty: dequeue a node, process it, and add its children to the queue.

131 Example: Printing the Tree
print(node * root) {
  node *n, *c;
  queue Q;
  Q.enqueue(root);
  while (!Q.empty()) {
    n = Q.dequeue();
    print n->data;
    c = n->first_child;
    while (c != NULL) {
      Q.enqueue(c);
      c = c->next_sibling;
    }
  }
}

132 QUEUE [Figure: snapshots of the queue contents during a level-order traversal of the example tree: a; then b c d e after a is dequeued; then c d e f g; and so on until the queue is empty.]

133 Applications of BFS Find the shortest path from the root to a given node N if N is at depth k, BFS will never visit a node at depth>k important for really deep trees Generalizes to finding shortest paths in graphs Spidering the world wide web From a root URL, fetch pages that are further and further away

134 Coming Up Binary Trees & Binary Search Trees (finally!)
Weiss 4.2 – 4.3 Section 4.3 is quite long. It will probably take us two lectures to get through it.

135 CSE 326: Data Structures Lecture #7 Binary Search Trees
Alon Halevy Spring Quarter 2001

136 Binary Trees A Many algorithms are efficient and easy to program for the special case of binary trees Binary tree is a root left subtree (maybe empty) right subtree (maybe empty) B C D E F G H Alright, we’ll focus today on one type of trees called binary trees. Here’s one now. Is this binary tree complete? Why not? (C has just one child, right side is much deeper than left) What’s the maximum # of leaves a binary tree of depth d can have? What’s the max # of nodes a binary tree of depth d can have? Minimum? We won’t go into this, but if you take N nodes and assume all distinct trees of the nodes are equally likely, you get an average depth of SQRT(N). Is that bigger or smaller than log n? Bigger, so it’s not good enough! I J

137 Representation A Data right pointer left A B C B C D E F D E F

138 Properties of Binary Trees
Max # of leaves in a tree of height h = ; max # of nodes in a tree of height h = . A B C D E F G

139 Dictionary & Search ADTs
Operations create destroy insert find delete Dictionary: Stores values associated with user-specified keys keys may be any (homogenous) comparable type values may be any (homogenous) type implementation: data field is a struct with two parts Search ADT: keys = values kim chi spicy cabbage kreplach tasty stuffed dough kiwi Australian fruit insert kohlrabi - upscale tuber find(kreplach) kreplach - tasty stuffed dough Dictionaries associate some key with a value, just like a real dictionary (where the key is a word and the value is its definition). In this example, I’ve stored user-IDs associated with descriptions of their coolness level. This is probably the most valuable and widely used ADT we’ll hit. I’ll give you an example in a minute that should firmly entrench this concept.

140 Naïve Implementations
[Table comparing the cost of insert, find, and delete for an unsorted array, a sorted array, and a linked list; surviving entries include find + O(n), O(n), find + O(1) for insert, and O(log n) for find in the sorted array.] Goal: fast find like a sorted array, dynamic inserts/deletes like a linked list.

141 Binary Search Tree Dictionary Data Structure
Search tree property all keys in left subtree smaller than root’s key all keys in right subtree larger than root’s key result: easy to find any given key inserts/deletes by changing links 8 5 11 2 6 10 12 A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing) 4 7 9 14 13

142 Example and Counter-Example
5 8 4 8 5 18 1 7 11 2 7 6 10 11 Why is the one on the left a BST? It's not complete! (B/c BSTs don't need to be complete.) Why isn't the one on the right a BST? Three children of 5; 20 has a left child larger than it. What's wrong with 11? Even though 15 isn't a direct child, it _still_ needs to be less than 11! 3 4 BINARY SEARCH TREE NOT A BINARY SEARCH TREE

143 In Order Listing visit left subtree visit node visit right subtree 10
5 15 2 9 20 Anyone notice anything interesting about that in-order listing? Everything in the left subtree is listed first. Then the root. Then everything in the right subtree. OK, let's work out the code to make the in-order listing. Is there an iterative version that doesn't use its own stack? Not really, no. So, recursion is probably OK here. Anyway, if the tree's too deep for recursion, you must have a huge amount of data. If (n != null) inorder(n->left) cout << n inorder(n->right) 7 17 30 In order listing: 2 5 7 9 10 15 17 20 30

144 Finding a Node 10 5 15 2 9 20 7 17 30 runtime:
Node *& find(Comparable x, Node *& root) {
  if (root == NULL)
    return root;
  else if (x < root->key)
    return find(x, root->left);
  else if (x > root->key)
    return find(x, root->right);
  else
    return root;
}
10 5 15 2 9 20 Now, let's try finding a node. Find 9. This time I'll supply the code. This should look a _lot_ like binary search! How long does it take? Log n is an easy answer, but what if the tree is very lopsided? So really, this is worst case O(n)! A better answer is theta of the depth of the node sought. If we can bound the depth of that node, we can bound the length of time a search takes. What about the code? All those &s and *s should look pretty scary. Let's talk through them. 7 17 30 runtime:

145 Insert Concept: proceed down tree as in Find; if new key not found, then insert a new node at last spot traversed void insert(Comparable x, Node * root) { assert ( root != NULL ); if (x < root->key){ if (root->left == NULL) root->left = new Node(x); else insert( x, root->left ); } else if (x > root->key){ if (root->right == NULL) root->right = new Node(x); else insert( x, root->right ); } } Let’s do some inserts: insert(8) insert (11) insert(31)

146 BuildTree for BSTs Suppose a1, a2, …, an are inserted into an initially empty BST: a1, a2, …, an are in increasing order a1, a2, …, an are in decreasing order a1 is the median of all, a2 is the median of elements less than a1, a3 is the median of elements greater than a1, etc. data is randomly ordered OK, we had a buildHeap, let’s buildTree. How long does this take? Well, IT DEPENDS! Let’s say we want to build a tree from What happens if we insert in order? Reverse order? What about 5, then 3, then 7, then 2, then 1, then 6, then 8, then 9?

147 Examples of Building from Scratch
1, 2, 3, 4, 5, 6, 7, 8, 9 5, 3, 7, 2, 4, 6, 8, 1, 9

148 Analysis of BuildTree Worst case is O(n2)
1 + 2 + … + n = O(n²). Average case assuming all orderings equally likely is O(n log n): not averaging over all binary trees, rather averaging over all input sequences (inserts); equivalently, the average depth of a node is log n. Proof: see Introduction to Algorithms, Cormen, Leiserson, & Rivest. Average runtime is equal to the average depth of a node in the tree. We'll calculate the average depth by finding the sum of all depths in the tree, and dividing by the number of nodes. What's the sum of all depths? D(N) = D(i) + D(N - i - 1) + N - 1 (the left subtree has i nodes and the root is 1 node, so the right has N - i - 1; D(i) counts the left subtree's depths, each node sits 1 level deeper in the overall tree, and the same goes for the right, for a total of i + (N - i - 1) = N - 1 extra depth). For BSTs, all subtree sizes are equally likely (the first inserted element is equally likely to be any of the values, and the rest fall on the left or right deterministically). Each subtree then averages 1/N * the sum over j from 0 to N-1 of D(j).

149 Bonus: FindMin/FindMax
Find minimum Find maximum 10 5 15 2 9 20 Every now and then everyone succumbs to the temptation to really overuse color. 7 17 30

150 Deletion 10 5 15 2 9 20 And now for something completely different. Let’s say I want to delete a node. Why might it be harder than insertion? Might happen in the middle of the tree instead of at leaf. Then, I have to fix the BST. 7 17 30 Why might deletion be harder than insertion?

151 Deletion - Leaf Case Delete(17) 10 5 15 2 9 20 7 17 30
Alright, we did it the easy way, but what about real deletions? Leaves are easy; we just prune them. 7 17 30

152 Deletion - One Child Case
Delete(15) 10 5 15 2 9 20 Single child nodes we remove and… Do what? We can just pull up their children. Is the search tree property intact? Yes. 7 30

153 Deletion - Two Child Case
Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with it’s successor and then delete the succ. 7 replace node with value guaranteed to be between the left and right subtrees: the successor Could we have used the predecessor instead?

154 Finding the Successor Find the next larger node
in this node’s subtree. not next larger in entire tree Node * succ(Node * root) { if (root->right == NULL) return NULL; else return min(root->right); } 10 5 15 2 9 20 Here’s a little digression. Maybe it’ll even have an application at some point. Find the next larger node in 10’s subtree. Can we define it in terms of min and max? It’s the min of the right subtree! 7 17 30 How many children can the successor of a node have?

155 Predecessor Find the next smaller node in this node’s subtree. 10 5 15
Node * pred(Node * root) { if (root->left == NULL) return NULL; else return max(root->left); } 10 5 15 2 9 20 Predecessor is just the mirror problem. 7 17 30

156 Deletion - Two Child Case
Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with it’s successor and then delete the succ. 7 always easy to delete the successor – always has either 0 or 1 children!

157 Delete Code
void delete(Comparable x, Node *& p) {
  Node * q;
  if (p != NULL) {
    if (p->key < x) delete(x, p->right);
    else if (p->key > x) delete(x, p->left);
    else { /* p->key == x */
      if (p->left == NULL) p = p->right;
      else if (p->right == NULL) p = p->left;
      else {
        q = successor(p);
        p->key = q->key;
        delete(q->key, p->right);
      }
    }
  }
}
(In real C++ this routine would need a name other than delete, which is a reserved word.) Here's the code for deletion using lots of confusing reference pointers BUT no header or fake nodes. The iterative version of this can get somewhat messy, but it's not really any big deal.

158 Lazy Deletion Instead of physically deleting nodes, just mark them as deleted simpler physical deletions done in batches some adds just flip deleted flag extra memory for deleted flag many lazy deletions slow finds some operations may have to be modified (e.g., min and max) 10 5 15 Now, before we move on to all the pains of true deletion, let’s do it the easy way. We’ll just pretend we delete deleted nodes. This has some real advantages: 2 9 20 7 17 30

159 Lazy Deletion Delete(17) Delete(15) Delete(5) Find(9) Find(16)
Insert(5) Find(17) 10 5 15 2 9 20 OK, let’s do some lazy deletions. Everybody yawn, stretch, and say “Mmmm… doughnut” to get in the mood. Those of you who are already asleep have the advantage. 7 17 30

160 Dictionary Implementations
unsorted array sorted linked list BST insert find + O(n) O(n) find + O(1) O(Depth) find O(log n) delete BST’s looking good for shallow trees, i.e. the depth D is small (log n), otherwise as bad as a linked list!

161 Beauty is Only (log n) Deep
Binary Search Trees are fast if they’re shallow: e.g.: perfectly complete e.g.: perfectly complete except the “fringe” (leafs) any other good cases? What makes a good BST good? Here’s two examples. Are these the only good BSTs? No! Anything without too many long branches is good, right? Problems occur when one branch is much longer than the other! What matters here?

162 CSE 326: Data Structures Lecture #8 Binary Search Trees
Alon Halevy Spring Quarter 2001

163 Binary Trees A Many algorithms are efficient and easy to program for the special case of binary trees Binary tree is a root left subtree (maybe empty) right subtree (maybe empty) B C D E F G H Alright, we’ll focus today on one type of trees called binary trees. Here’s one now. Is this binary tree complete? Why not? (C has just one child, right side is much deeper than left) What’s the maximum # of leaves a binary tree of depth d can have? What’s the max # of nodes a binary tree of depth d can have? Minimum? We won’t go into this, but if you take N nodes and assume all distinct trees of the nodes are equally likely, you get an average depth of SQRT(N). Is that bigger or smaller than log n? Bigger, so it’s not good enough! I J

164 Binary Search Tree Dictionary Data Structure
Search tree property all keys in left subtree smaller than root’s key all keys in right subtree larger than root’s key result: easy to find any given key inserts/deletes by changing links 8 5 11 2 6 10 12 A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing) 4 7 9 14 13

165 Example and Counter-Example
5 8 4 8 5 18 1 7 11 2 7 6 10 11 Why is the one on the left a BST? It’s not complete! (B/c BSTs don’t need to be complete) Why isn’t the one on the right a BST? Three children of 5 20 has a left child larger than it. What’s wrong with 11? Even though 15 isn’t a direct child, it _still_ needs to be less than 11! 3 4 BINARY SEARCH TREE NOT A BINARY SEARCH TREE

166 In Order Listing visit left subtree visit node visit right subtree 10
5 15 2 9 20 Anyone notice anything interesting about that in-order listing? Everything in the left subtree is listed first. Then the root. Then everything in the right subtree. OK, let’s work out the code to make the in-order listing. Is there an iterative version that doesn’t use its own stack? Not really, no. So, recursion is probably OK here. Anyway, if the tree’s too deep for recursion, you must have a huge amount of data. If (n != null) inorder(n->left) cout << n inorder(n->right) 7 17 30 In order listing: 25791015172030

167 Finding a Node 10 5 15 2 9 20 7 17 30 runtime:
Node *& find(Comparable x, Node * root) { if (root == NULL) return root; else if (x < root->key) return find(x, root->left); else if (x > root->key) root->right); else } 10 5 15 2 9 20 Now, let’s try finding a node. Find 9. This time I’ll supply the code. This should look a _lot_ like binary search! How long does it take? Log n is an easy answer, but what if the tree is very lopsided? So really, this is worst case O(n)! A better answer is theta of the depth of the node sought. If we can bound the depth of that node, we can bound the length of time a search takes. What about the code? All those &s and *s should look pretty scary. Let’s talk through them. 7 17 30 runtime:

168 Insert Concept: proceed down tree as in Find; if new key not found, then insert a new node at last spot traversed void insert(Comparable x, Node * root) { assert ( root != NULL ); if (x < root->key){ if (root->left == NULL) root->left = new Node(x); else insert( x, root->left ); } else if (x > root->key){ if (root->right == NULL) root->right = new Node(x); else insert( x, root->right ); } } Let’s do some inserts: insert(8) insert (11) insert(31)

169 BuildTree for BSTs Suppose a1, a2, …, an are inserted into an initially empty BST: a1, a2, …, an are in increasing order a1, a2, …, an are in decreasing order a1 is the median of all, a2 is the median of elements less than a1, a3 is the median of elements greater than a1, etc. data is randomly ordered OK, we had a buildHeap, let’s buildTree. How long does this take? Well, IT DEPENDS! Let’s say we want to build a tree from What happens if we insert in order? Reverse order? What about 5, then 3, then 7, then 2, then 1, then 6, then 8, then 9?

170 Examples of Building from Scratch
1, 2, 3, 4, 5, 6, 7, 8, 9 5, 3, 7, 2, 4, 6, 8, 1, 9

171 Analysis of BuildTree Worst case is O(n2)
… + n = O(n2) Average case assuming all orderings equally likely is O(n log n) not averaging over all binary trees, rather averaging over all input sequences (inserts) equivalently: average depth of a node is log n proof: see Introduction to Algorithms, Cormen, Leiserson, & Rivest Average runtime is equal to the average depth of a node in the tree. We’ll calculate the average depth by finding the sum of all depths in the tree, and dividing by the number of nodes. What’s the sum of all depths? D(n) = D(I) + D(N - I - 1) + N - 1 (left subtree = I, root is 1 node, so right = n - I - 1. D(I) is depth of left, 1 node deeper in overall, same goes for right, total of I + N - I - 1 extra depth). For BSTs, all subtree sizes are equally likely (because we pick the middle element and random and the rest fall on the left or right determinically). Each subtree then averages 1/N * sum 0 to N-1 of D(j)

172 Bonus: FindMin/FindMax
Find minimum Find maximum 10 5 15 2 9 20 Every now and then everyone succumbs to the temptation to really overuse color. 7 17 30

173 Deletion 10 5 15 2 9 20 And now for something completely different. Let’s say I want to delete a node. Why might it be harder than insertion? Might happen in the middle of the tree instead of at leaf. Then, I have to fix the BST. 7 17 30 Why might deletion be harder than insertion?

174 Deletion - Leaf Case Delete(17) 10 5 15 2 9 20 7 17 30
Alright, we did it the easy way, but what about real deletions? Leaves are easy; we just prune them. 7 17 30

175 Deletion - One Child Case
Delete(15) 10 5 15 2 9 20 Single child nodes we remove and… Do what? We can just pull up their children. Is the search tree property intact? Yes. 7 30

176 Deletion - Two Child Case
Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with it’s successor and then delete the succ. 7 replace node with value guaranteed to be between the left and right subtrees: the successor Could we have used the predecessor instead?

177 Finding the Successor Find the next larger node
in this node’s subtree. not next larger in entire tree Node * succ(Node * root) { if (root->right == NULL) return NULL; else return min(root->right); } 10 5 15 2 9 20 Here’s a little digression. Maybe it’ll even have an application at some point. Find the next larger node in 10’s subtree. Can we define it in terms of min and max? It’s the min of the right subtree! 7 17 30 How many children can the successor of a node have?

178 Predecessor Find the next smaller node in this node’s subtree. 10 5 15
Node * pred(Node * root) { if (root->left == NULL) return NULL; else return max(root->left); } 10 5 15 2 9 20 Predecessor is just the mirror problem. 7 17 30

179 Deletion - Two Child Case
Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let’s just replace the node’s value with it’s successor and then delete the succ. 7 always easy to delete the successor – always has either 0 or 1 children!

180 Delete Code
// Named remove here because delete is a reserved word in C++.
void remove(Comparable x, Node *& p) {
  Node * q;
  if (p != NULL) {
    if (p->key < x) remove(x, p->right);
    else if (p->key > x) remove(x, p->left);
    else { /* p->key == x */
      if (p->left == NULL) p = p->right;
      else if (p->right == NULL) p = p->left;
      else {                           // two children
        q = succ(p);                   // successor = min of the right subtree
        p->key = q->key;
        remove(q->key, p->right);      // then delete the successor (it has 0 or 1 children)
      }
    }
  }
}
(Note that the removed node is not freed here; real code would also reclaim it.) Here’s the code for deletion using lots of confusing reference pointers BUT no header or fake nodes. The iterative version of this can get somewhat messy, but it’s not really any big deal.

181 Lazy Deletion Instead of physically deleting nodes, just mark them as deleted simpler physical deletions done in batches some adds just flip deleted flag extra memory for deleted flag many lazy deletions slow finds some operations may have to be modified (e.g., min and max) 10 5 15 Now, before we move on to all the pains of true deletion, let’s do it the easy way. We’ll just pretend we delete deleted nodes. This has some real advantages: 2 9 20 7 17 30

182 Lazy Deletion Delete(17) Delete(15) Delete(5) Find(9) Find(16)
Insert(5) Find(17) 10 5 15 2 9 20 OK, let’s do some lazy deletions. Everybody yawn, stretch, and say “Mmmm… doughnut” to get in the mood. Those of you who are already asleep have the advantage. 7 17 30

183 Dictionary Implementations
unsorted array sorted linked list BST insert find + O(n) O(n) find + O(1) O(Depth) find O(log n) delete BSTs look good for shallow trees, i.e. when the depth D is small (O(log n)); otherwise they are as bad as a linked list!

184 Beauty is Only (log n) Deep
Binary Search Trees are fast if they’re shallow: e.g.: perfectly complete e.g.: perfectly complete except the “fringe” (leaves) any other good cases? What makes a good BST good? Here are two examples. Are these the only good BSTs? No! Anything without too many long branches is good, right? Problems occur when one branch is much longer than the other! What matters here?

185 Balance
5 7 Balance = height(left subtree) - height(right subtree) zero everywhere → perfectly balanced small everywhere → balanced enough We’ll use the concept of Balance to keep things shallow. Balance between -1 and 1 everywhere → maximum height of 1.44 log n
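A small sketch of computing this balance directly from the structure (recursive and O(n); AVL trees will instead store heights in the nodes, as on the upcoming slides). It assumes the Node with left/right pointers from the earlier BST slides:

    #include <algorithm>   // std::max

    // height of an empty tree is -1, of a single node is 0
    int height(Node * n) {
      if (n == NULL) return -1;
      return 1 + std::max(height(n->left), height(n->right));
    }

    int balance(Node * n) {
      return height(n->left) - height(n->right);
    }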

186 AVL Tree Dictionary Data Structure
Binary search tree properties binary tree property search tree property Balance property balance of every node is: -1 ≤ b ≤ 1 result: depth is Θ(log n) 8 5 11 2 6 10 12 So, AVL trees will be Binary Search Trees with one extra feature: They balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths 4 7 9 13 14 15

187 An AVL Tree 10 10 3 5 15 2 9 12 20 17 30 data 3 height children 1 2 1
1 2 9 12 20 Here’s a revision of that tree that’s balanced. (Same values, similar tree) This one _is_ an AVL tree (and isn’t leftist). I also have here how we might store the nodes in the AVL tree. Notice that I’m going to keep track of height all the time. WHY? 17 30

188 Not AVL Trees 10 10 0-2 = -2 (-1)-1 = -2 5 15 15 12 20 20 17 30 3 2 2
2 0-2 = -2 (-1)-1 = -2 1 5 15 15 1 12 20 20 Here’s a revision of that tree that’s balanced. (Same values, similar tree) This one _is_ an AVL tree (and isn’t leftist). I also have here how we might store the nodes in the AVL tree. Notice that I’m going to keep track of height all the time. WHY? 17 30

189 Staying Balanced M S T Good case: inserting small, tall and middle.
Insert(middle) Insert(small) Insert(tall) 1 M Let’s make a tree from these people with their height as the keys. We’ll start by inserting [MIDDLE] first. Then, [SMALL] and finally [TALL]. Is this tree balanced? Yes! S T

190 Bad Case #1 S M T Insert(small) Insert(middle) Insert(tall) 2 1
But, let’s start over… Insert [SMALL] Now, [MIDDLE]. Now, [TALL]. Is this tree balanced? NO! Who do we need at the root? [MIDDLE!] Alright, let’s pull er up. T

191 Single Rotation S M M S T T 2 1 1 Basic operation used in AVL trees:
This is the basic operation we’ll use in AVL trees. Since this is a right child, it could legally have the parent as its left child. When we finish the rotation, we have a balanced tree! S T T Basic operation used in AVL trees: A right child could legally have its parent as its left child.

192 General Case: Insert Unbalances
h + 1 h + 2 a a h h - 1 h + 1 h - 1 b X b X h-1 h h - 1 h - 1 Z Y Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here!

193 General Single Rotation
h + 2 h + 1 a a X Y b Z h h + 1 h - 1 b X h h - 1 h h - 1 h - 1 Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here! Height of left subtree same as it was before insert! Height of all ancestors unchanged We can stop here!

194 Will a single rotation fix this?
Bad Case #2 Insert(small) Insert(tall) Insert(middle) 2 S 1 T There’s another bad case, though. What if we insert: [SMALL] [TALL] [MIDDLE] Now, is the tree imbalanced? Will a single rotation fix it? (Try it by bringing up tall; doesn’t work!) Will a single rotation fix this? M

195 Double Rotation S S M T M S T M T 2 2 1 1 1
Let’s try two single rotations, starting a bit lower down. First, we rotate up middle. Then, we rotate up middle again! Is the new tree balanced? S T M T

196 General Double Rotation
h + 2 a h + 1 h + 1 c h - 1 b Z h h b a h - 1 W h c h - 1 h - 1 X Y W Z X Y Here’s the general form of this. Notice that the difference here is that we zigged one way than zagged the other to find the problem. We don’t really know or care which of X or Y was inserted into, but one of them was. To fix it, we pull c all the way up. Then, put a, b, and the subtrees beneath it in the reasonable manner. The height is still the same at the end! h - 1? h - 1? Initially: insert into either X or Y unbalances tree (root height goes to h+2) “Zig zag” to pull up c – restores root height to h+1, left subtree height to h

197 Insert Algorithm Find spot for value Hang new node
Search back up looking for imbalance If there is an imbalance: case #1: Perform single rotation and exit case #2: Perform double rotation and exit OK, thank you BST Three! And those two cases (along with their mirror images) are the only four that can happen! So, here’s our insert algorithm. We just hang the node. Search for a spot where there’s imbalance. If there is, fix it (according to the shape of the imbalance). And then we’re done; there can only be one problem!
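A sketch of this insert algorithm in the recursive style. It is written to return the (possibly new) subtree root rather than using the slides’ reference parameters, and the names (fixHeight, rotateWithLeft, rotateWithRight) are illustrative; the RotateRight code a few slides ahead plays the same role as rotateWithRight here:

    #include <algorithm>   // std::max
    #include <cstddef>     // NULL

    struct Node {
      int key, height;     // height of a leaf is 0
      Node *left, *right;
      Node(int k) : key(k), height(0), left(NULL), right(NULL) {}
    };

    int height(Node * n) { return n == NULL ? -1 : n->height; }
    void fixHeight(Node * n) { n->height = 1 + std::max(height(n->left), height(n->right)); }

    // single rotation that brings up the left child (mirror image for the right)
    Node * rotateWithLeft(Node * k2) {
      Node * k1 = k2->left;
      k2->left = k1->right;
      k1->right = k2;
      fixHeight(k2); fixHeight(k1);
      return k1;
    }
    Node * rotateWithRight(Node * k2) {
      Node * k1 = k2->right;
      k2->right = k1->left;
      k1->left = k2;
      fixHeight(k2); fixHeight(k1);
      return k1;
    }

    Node * insert(Node * root, int x) {
      if (root == NULL) return new Node(x);          // hang the new node
      if (x < root->key) {
        root->left = insert(root->left, x);
        if (height(root->left) - height(root->right) == 2) {   // imbalance found
          if (x < root->left->key)
            root = rotateWithLeft(root);              // case #1: single rotation
          else {
            root->left = rotateWithRight(root->left); // case #2: double rotation
            root = rotateWithLeft(root);
          }
        }
      } else if (x > root->key) {                     // mirror image of the above
        root->right = insert(root->right, x);
        if (height(root->right) - height(root->left) == 2) {
          if (x > root->right->key) root = rotateWithRight(root);
          else { root->right = rotateWithLeft(root->right); root = rotateWithRight(root); }
        }
      }                                               // duplicates: do nothing
      fixHeight(root);
      return root;
    }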

198 Easy Insert Insert(3) 10 5 15 2 9 12 20 17 30 3 1 2 1 Let’s insert 3.
1 2 9 12 20 Let’s insert 3. This is easy! It just goes under 2 (to the left). Update the balances: any imbalance? NO! 17 30

199 Hard Insert (Bad Case #1)
2 3 Insert(33) 10 5 15 2 9 12 20 Now, let’s insert 33. Where does it go? Left of 30. 3 17 30

200 Single Rotation 1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 Here’s the tree with the balances updated. Now, node 15 is bad! Since the problem is in the left subtree of the left child, we can fix it with a single rotation. We pull 20 up. Hang 15 to the left. Pass 17 to 15. And, we’re done! Notice that I didn’t update 10’s height until we checked 15. Did it change after all? 3 17 30 3 12 17 33 33

201 Hard Insert (Bad Case #2)
1 2 3 Insert(18) 10 5 15 2 9 12 20 Now, let’s back up to before 33 and insert 18 instead. Goes right of 17. Again, there’s imbalance. But, this time, it’s a zig-zag! 3 17 30

202 Single Rotation (oops!)
1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 We can try a single rotation, but we end up with another zig-zag! 3 17 30 3 12 17 18 18

203 Double Rotation (Step #1)
2 3 1 2 3 10 10 5 15 5 15 2 9 12 20 2 9 12 17 So, we’ll double rotate. Start by moving the offending grand-child up. We get an even more imbalanced tree. BUT, it’s imbalanced like a zig-zig tree now! 3 17 30 3 20 18 18 30 Look familiar?

204 Double Rotation (Step #2)
1 2 3 1 2 3 10 10 5 15 5 17 2 9 12 17 2 9 15 20 So, let’s pull 17 up again. Now, we get a balanced tree. And, again, 10’s height didn’t need to change. 3 20 3 12 18 30 18 30

205 AVL Algorithm Revisited
Recursive 1. Search downward for spot 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2, double rotate Iterative 1. Search downward for spot, stacking parent nodes 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate and exit b. If imbalance #2, double rotate and exit OK, here’s the algorithm again. Notice that there’s very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can’t I go top down? Now, what’s left? Single and double rotate!

206 Single Rotation Code X Y Z root temp void RotateRight(Node *& root) {
Node * temp = root->right; root->right = temp->left; temp->left = root; root->height = max(root->right->height, root->left->height) + 1; temp->height = max(temp->right->height, temp->left->height) + 1; root = temp; } Here’s code for one of the two single rotate cases. RotateRight brings up the right child. We’ve inserted into Z, and now we want to fix it.

207 Double Rotation Code First Rotation a Z b W c X Y a Z c b X Y W
void DoubleRotateRight(Node *& root) { RotateLeft(root->right); RotateRight(root); } First Rotation a Z b W c X Y a Z c b X Y W Here’s the double rotation code. Pretty tough, eh?

208 Double Rotation Completed
First Rotation Second Rotation a Z c b X Y W c a b X W Z Y

209 CSE 326: Data Structures Lecture #9 AVL II
Alon Halevy Spring Quarter 2001 Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

210 Deletion (Really Easy Case)
1 2 3 Delete(17) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 17 30

211 Deletion (Pretty Easy Case)
1 2 3 Delete(15) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 17 30

212 Deletion (Pretty Easy Case cont.)
3 Delete(15) 10 2 2 5 17 1 1 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 30

213 Deletion (Hard Case #1) Delete(12) 10 5 17 2 9 12 20 3 30 3 2 1
2 3 Delete(12) 10 5 17 2 9 12 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 3 30

214 Single Rotation on Deletion
1 2 3 3 10 10 2 1 5 17 5 20 1 2 9 20 2 9 17 30 But what happened on the fix? Something very disturbing. What? The subtree’s height changed!! So, the deletion can propagate. 3 30 3 What is different about deletion than insertion?

215 Deletion (Hard Case) Delete(9) 10 5 17 2 9 12 12 20 20 3 11 15 15 18
3 4 Delete(9) 10 5 17 2 9 12 12 20 20 Now, let’s delete 9. 9 goes away. Now, there’s trouble. We’ve introduced an imbalance. So, we check up from the point of deletion and fix the first imbalance we find (here at node 5). 1 1 3 11 15 15 18 30 30 13 13 33 33

216 Double Rotation on Deletion
Not finished! 1 2 3 4 2 1 3 4 10 10 5 17 3 17 2 2 12 20 2 5 12 20 1 1 1 1 3 11 15 18 30 11 15 18 30 13 33 13 33

217 Deletion with Propagation
2 1 3 4 10 What different about this case? 3 17 2 5 12 20 1 1 We get to choose whether to single or double rotate! 11 15 18 30 13 33

218 Propagated Single Rotation
2 1 3 4 4 10 17 3 2 3 17 10 20 1 2 1 2 5 12 20 3 12 18 30 1 1 1 11 15 18 30 2 5 11 15 33 13 33 13

219 Propagated Double Rotation
2 1 3 4 4 10 12 2 3 3 17 10 17 1 1 2 2 5 12 20 3 11 15 20 1 1 1 11 15 18 30 2 5 13 18 30 13 33 33

220 AVL Deletion Algorithm
Recursive If at node, delete it Otherwise recurse to find it in 3. Correct heights a. If imbalance #1, single rotate b. If imbalance #2 (or don’t care), double rotate Iterative 1. Search downward for node, stacking parent nodes 2. Delete node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2 (or don’t care) double rotate OK, here’s the algorithm again. Notice that there’s very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can’t I go top down? Now, what’s left? Single and double rotate!

221 Fun with AVL Trees Input: sequence of n keys (unordered) 19 3 4 18 7
Insert each into initially empty AVL tree Print using inorder traversal O(n) Result? Are we having fun yet?

222 Is There a Faster Way? But suppose input is already sorted 3 4 7 18 19
Can we do better than O(n log n)?

223 AVL buildTree 5 8 10 15 17 20 30 35 40 Divide & Conquer 17
Divide the problem into parts Solve each part recursively Merge the parts into a general solution 17 IT DEPENDS! How long does divide & conquer take? 8 10 15 5 20 30 35 40
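A sketch of that divide & conquer, assuming the sorted keys sit in an array and reusing the Node and fixHeight from the insert sketch earlier (both assumptions, not code from the slides):

    #include <vector>

    Node * buildTree(const std::vector<int> & keys, int lo, int hi) {
      if (lo > hi) return NULL;
      int mid = (lo + hi) / 2;                       // middle key becomes the root
      Node * root = new Node(keys[mid]);
      root->left  = buildTree(keys, lo, mid - 1);    // solve each half recursively
      root->right = buildTree(keys, mid + 1, hi);
      fixHeight(root);                               // keep AVL height fields correct
      return root;
    }
    // call as buildTree(keys, 0, keys.size() - 1); total work is O(n)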

224 BuildTree Example 5 8 10 15 17 20 30 35 40 3 17 5 8 10 15 2 2 20 30 35 40 10 35 20 30 5 8 1 1 8 15 30 40 5 20

225 BuildTree Analysis (Approximate)
T(n) = 2T(n/2) + 1
T(n) = 2(2T(n/4) + 1) + 1 = 4T(n/4) + 3
T(n) = 4(2T(n/8) + 1) + 3 = 8T(n/8) + 7
…
T(n) = 2^k T(n/2^k) + 2^k - 1; let 2^k = n, so k = log n
T(n) = nT(1) + n - 1
T(n) = Θ(n)
Summation is 2^(log n) + 2^(log n - 1) + 2^(log n - 2) + … = n + n/2 + n/4 + n/8 + … ≈ 2n

226 BuildTree Analysis (Exact)
Precise Analysis: T(0) = b T(n) = T(⌊(n-1)/2⌋) + T(⌈(n-1)/2⌉) + c By induction on n: T(n) = (b+c)n + b Base case: T(0) = b = (b+c)·0 + b Induction step: T(n) = (b+c)⌊(n-1)/2⌋ + b + (b+c)⌈(n-1)/2⌉ + b + c = (b+c)(n-1) + 2b + c = (b+c)n + b QED: T(n) = (b+c)n + b = Θ(n)

227 Application: Batch Deletion
Suppose we are using lazy deletion When there are lots of deleted nodes (n/2), need to flush them all out Batch deletion: Print non-deleted nodes into an array How? Divide & conquer AVL Treebuild Total time:
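A sketch of that flush, reusing the deleted flag from the lazy-deletion sketch and the buildTree sketch above (both are assumptions layered on the slides):

    // 1) inorder walk: collect only the non-deleted keys; O(n), and already sorted
    void collectLive(Node * root, std::vector<int> & out) {
      if (root == NULL) return;
      collectLive(root->left, out);
      if (!root->deleted) out.push_back(root->key);
      collectLive(root->right, out);
    }
    // 2) rebuild: buildTree(out, 0, out.size() - 1) in O(n)
    // Total time: O(n) + O(n) = O(n)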

228 Thinking About AVL Observations
+ Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion + Compatible with lazy deletion - Height fields must be maintained (or 2-bit balance)

229 Alternatives to AVL Trees
Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)

230 CSE 326: Data Structures Lecture #9 AVL II
Alon Halevy Spring Quarter 2001 Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

231 This and Next Week This week: Next week:
Finish AVL trees, start B-trees B-trees / hashing Hashing Next week: Hashing and midterm review (if you have questions) Midterm (Wednesday) Finish hashing

232 Imbalance in AVL Trees Last week’s conjecture: in AVL trees, if you remove the bottom level, then you get a complete tree. This week’s theorems: All nodes, except parents of the leaves and the leaves have two children. Single-child nodes can be arbitrarily far from the leaves.

233 AVL Tree with Slight Imbalance
8 5 11 2 6 10 12 So, AVL trees will be Binary Search Trees with one extra feature: They balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths 4 7 9 13 14 15

234 Where can we Find Leaves?
Suppose the node N has no children. What is the maximal height of N’s parent? What is the maximal height of N’s grandparent? What is the maximal height of N’s great-grandparent? Conclusion: at what depth can we find a leaf?

235 Deletion (Hard Case #1) Delete(12) 10 5 17 2 9 12 20 3 30 3 2 1
2 3 Delete(12) 10 5 17 2 9 12 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 3 30

236 Single Rotation on Deletion
1 2 3 3 10 10 2 1 5 17 5 20 1 2 9 20 2 9 17 30 But what happened on the fix? Something very disturbing. What? The subtree’s height changed!! So, the deletion can propagate. 3 30 3 What is different about deletion than insertion?

237 Deletion (Hard Case #2) Delete(9) 10 5 17 2 9 12 12 20 20 3 11 15 15
3 4 Delete(9) 10 5 17 2 9 12 12 20 20 Now, let’s delete 9. 9 goes away. Now, there’s trouble. We’ve introduced an imbalance. So, we check up from the point of deletion and fix the first imbalance we find (here at node 5). 1 1 3 11 15 15 18 30 30 13 13 33 33

238 Double Rotation on Deletion
Not finished! 1 2 3 4 2 1 3 4 10 10 5 17 3 17 2 2 12 20 2 5 12 20 1 1 1 1 3 11 15 18 30 11 15 18 30 13 33 13 33

239 Deletion with Propagation
2 1 3 4 10 What different about this case? 3 17 2 5 12 20 1 1 We get to choose whether to single or double rotate! 11 15 18 30 13 33

240 Propagated Single Rotation
2 1 3 4 4 10 17 3 2 3 17 10 20 1 2 1 2 5 12 20 3 12 18 30 1 1 1 11 15 18 30 2 5 11 15 33 13 33 13

241 Propagated Double Rotation
2 1 3 4 4 10 12 2 3 3 17 10 17 1 1 2 2 5 12 20 3 11 15 20 1 1 1 11 15 18 30 2 5 13 18 30 13 33 33

242 AVL Deletion Algorithm
Recursive If at node, delete it Otherwise recurse to find it in 3. Correct heights a. If imbalance #1, single rotate b. If imbalance #2 (or don’t care), double rotate Iterative 1. Search downward for node, stacking parent nodes 2. Delete node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2 (or don’t care) double rotate OK, here’s the algorithm again. Notice that there’s very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can’t I go top down? Now, what’s left? Single and double rotate!

243 Fun with AVL Trees Input: sequence of n keys (unordered) 19 3 4 18 7
Insert each into initially empty AVL tree Print using inorder traversal O(n) Result? Are we having fun yet?

244 Is There a Faster Way? But suppose input is already sorted 3 4 7 18 19
Can we do better than O(n log n)?

245 AVL buildTree 5 8 10 15 17 20 30 35 40 Divide & Conquer 17
Divide the problem into parts Solve each part recursively Merge the parts into a general solution 17 IT DEPENDS! How long does divide & conquer take? 8 10 15 5 20 30 35 40

246 BuildTree Example 5 8 10 15 17 20 30 35 40 3 17 5 8 10 15 2 2 20 30 35 40 10 35 20 30 5 8 1 1 8 15 30 40 5 20

247 BuildTree Analysis (Approximate)
T(n) = 2T(n/2) + 1
T(n) = 2(2T(n/4) + 1) + 1 = 4T(n/4) + 3
T(n) = 4(2T(n/8) + 1) + 3 = 8T(n/8) + 7
…
T(n) = 2^k T(n/2^k) + 2^k - 1; let 2^k = n, so k = log n
T(n) = nT(1) + n - 1
T(n) = Θ(n)
Summation is 2^(log n) + 2^(log n - 1) + 2^(log n - 2) + … = n + n/2 + n/4 + n/8 + … ≈ 2n

248 Thinking About AVL Observations
+ Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion - Height fields must be maintained (or 2-bit balance)

249 Alternatives to AVL Trees
Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees (after mid-term) “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)

250 B-Trees

251 Beyond Binary Trees One of the most important applications for search trees is databases If the DB is small enough to fit into RAM, almost any scheme for balanced trees (e.g. AVL) is okay 2000 (WalMart) RAM – 1,000,000 MB DB – 1,000,000 MB (terabyte) 1980 RAM – 1MB DB – 100 MB gap between disk and main memory growing!

252 Time Gap For many corporate and scientific databases, the search tree must mostly be on disk Accessing disk 200,000 X time slower than RAM Visiting node = accessing disk Even perfectly balance binary trees a disaster! log2( 10,000,000 ) = 24 disk accesses Goal: Decrease Height of Tree

253 M-ary Search Tree Maximum branching factor of M
Complete tree has depth = logMN Each internal node in a complete tree has M - 1 keys runtime: Here’s the general idea. We create a search tree with a branching factor of M. Each node has M-1 keys and we search between them. What’s the runtime? O(logMn)? That’s a nice thought, and it’s the best case. What about the worst case? Is the tree guaranteed to be balanced? Is it guaranteed to be complete? Might it just end up being a binary tree?

254 B-Trees B-Trees are specialized M-ary search trees
Each node has many keys subtree between two keys x and y contains values v such that x ≤ v < y binary search within a node to find correct subtree Each node takes one full page of memory. 3 7 12 21 To address these problems, we’ll use a slightly more structured M-ary tree: B-Trees. As before, each internal node has M - 1 keys. To manage memory problems, we’ll tune the size of a node (or leaf) to the size of a memory unit. Usually, a page or disk block. x < 3, 3 ≤ x < 7, 7 ≤ x < 12, 12 ≤ x < 21, 21 ≤ x
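A sketch of the "binary search within a node" step, with an illustrative node layout; M, the field names, and the array bounds are assumptions, not the slides’ definitions:

    const int M = 4;                       // maximum branching factor (illustrative)

    struct BTreeNode {
      int        numKeys;                  // i keys in use means i + 1 subtrees in use
      int        keys[M - 1];              // sorted search keys
      BTreeNode *children[M];              // children[0]: v < keys[0];
                                           // children[i]: keys[i-1] <= v < keys[i]
    };

    // binary search for the first key strictly greater than x;
    // that index is the child subtree to follow
    int childIndex(const BTreeNode * n, int x) {
      int lo = 0, hi = n->numKeys;
      while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x < n->keys[mid]) hi = mid;
        else                  lo = mid + 1;
      }
      return lo;
    }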

255 B-Tree Properties‡ Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between ⌈M/2⌉ and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between ⌈L/2⌉ and L keys all leaves are at the same depth Result tree is Θ(log⌈M/2⌉ (n/⌈L/2⌉)) +/- 1 deep = Θ(log n) all operations run in time proportional to depth operations pull in at least ⌈M/2⌉ or ⌈L/2⌉ items at a time The properties of B-Trees (and the trees themselves) are a bit more complex than previous structures we’ve looked at. Here’s a big, gnarly list; we’ll go one step at a time. The maximum branching factor, as we said, is M (tunable for a given tree). The root has between 2 and M children or at most L keys. (L is another parameter) These restrictions will be different for the root than for other nodes. ‡These are technically B+-Trees

256 B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the other internal nodes (non-leaves) will have between M/2 and M children. The funky symbol is ceiling, the next higher integer above the value. The result of this is that the tree is “pretty” full. Not every node has M children but they’ve all at least got M/2 (a good number). Internal nodes contain only search keys. A search key is a value which is solely for comparison; there’s no data attached to it. The node will have one fewer search key than it has children (subtrees) so that we can search down to each child. The smallest datam between two search keys is equal to the lesser search key. This is how we find the search keys to use.

257 B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the leaves (again, except the root) have a similar restriction. They contain between L/2 and L keys. Notice that means you have to do a search when you get to a leaf to find the item you’re looking for. All the leaves are also at the same depth. So, the tree looks kind of complete. It has the triangle shape, and the nodes branch at least as much as M/2.

258 B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) +/- 1 deep (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time The result of all this is that the tree in the worst case is log n deep. In particular, it’s about logM/2n deep. Does this matter asymptotically? No. What about practically? YES! Since M and L are considered constants, all operations run in log n time. Each operation pulls in at most M search keys or L items at a time. So, we can tune L and M to the size of a disk block!

259 When Big-O is Not Enough
B-Tree is about logM/2 n/(L/2) deep = logM/2 n - logM/2 L/2 = O(logM/2 n) = O(log n) steps per operation (same as BST!) Where’s the beef?! log2( 10,000,000 ) = 24 disk accesses log200/2( 10,000,000 ) < 4 disk accesses

260 … … B-Tree Nodes Internal node Leaf
i search keys; i+1 subtrees; M - i - 1 inactive entries k1 k2 ki __ __ 1 2 i M - 1 Leaf j data keys; L - j inactive entries FIX M-I to M-I-1!! Alright, before we look at any examples, let’s look at what the node structure looks like. Internal nodes are arrays of pointers to children interspersed with search keys. Why must they be arrays rather than linked lists? Because we want contiguous memory! If the node has just I+1 children, it has I search keys, and M-I empty entries. A leaf looks similar (I’ll use green for leaves), and has similar properties. Why are these different? Because internal nodes need subtrees-1 keys. k1 k2 kj __ __ 1 2 j L

261 Example B-Tree with M = 4 and L = 4 10 40 3 15 20 30 50 1 2 10 11 12
This is just an example B-tree. Notice that it has 24 entries with a depth of only 2. A BST would be 4 deep. Notice also that the leaves are at the same level in the tree. I’ll use integers as both key and data, but we all know that that could as well be different data at the bottom, right? 1 2 10 11 12 20 25 26 40 42 3 5 6 9 15 17 30 32 33 36 50 60 70

262 Making a B-Tree Insert(3) Insert(14) Now, Insert(1)? The empty B-Tree
M = 3 L = 2 3 3 14 Insert(3) Insert(14) Alright, how do we insert and delete? Let’s start with the empty B-Tree. That’s one leaf as the root. Now, we’ll insert 3 and 14. Fine… What about inserting 1. Is there a problem? Now, Insert(1)?

263 Splitting the Root Insert(1) And create a new root Too many
keys in a leaf! 3 14 14 1 3 1 3 14 Insert(1) And create a new root 1 3 14 Too many keys in a leaf! Run away! How do we solve this? Well, we definitely need to split this leaf in two. But, now we don’t have a tree anymore. So, let’s make a new root and give it as children the two leaves. This is how B-Trees grow deeper. So, split the leaf.

264 Insertions and Split Ends
Too many keys in a leaf! 14 14 14 Insert(59) Insert(26) 1 3 14 26 59 1 3 14 1 3 14 59 14 26 59 So, split the leaf. Now, let’s do some more inserts. 59 is no problem. What about 26? Same problem as before. But, this time the split leaf just goes under the existing node because there’s still room. What if there weren’t room? 14 59 And add a new child 1 3 14 26 59

265 Too many keys in an internal node!
Propagating Splits 14 59 14 59 Insert(5) Add new child 1 3 5 14 26 59 1 3 14 26 59 1 3 5 Too many keys in an internal node! 5 1 3 14 26 59 5 14 26 59 1 3 When we insert 5, the leaf overflows, but its parent already has too many subtrees! What do we do? The same thing as before but this time with an internal node. We split the node. Normally, we’d hang the new subtrees under their parent, but in this case they don’t have one. Now we have two trees! Soltuion: same as before, make a new root and hang these under it. Create a new root So, split the node.

266 Insertion in Boring Text
Insert the key in its leaf If the leaf ends up with L+1 items, overflow! Split the leaf into two nodes: original with ⌈(L+1)/2⌉ items new one with ⌊(L+1)/2⌋ items Add the new child to the parent If the parent ends up with M+1 items, overflow! If an internal node ends up with M+1 items, overflow! Split the node into two nodes: original with ⌈(M+1)/2⌉ items new one with ⌊(M+1)/2⌋ items Add the new child to the parent If the parent ends up with M+1 items, overflow! Split an overflowed root in two and hang the new nodes under a new root OK, here’s that process as an algorithm. The new funky symbol is floor; that’s just like regular C++ integer division. Notice that this can propagate all the way up the tree. How often will it do that? Notice that the two new leaves or internal nodes are guaranteed to have enough items (or subtrees). Because even the floor of (L+1)/2 is as big as the ceiling of L/2. This makes the tree deeper!
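A sketch of just the leaf-split step from the list above; the parent hookup is omitted, and L, the field names, and the spare overflow slot are illustrative assumptions:

    const int L = 4;                          // max keys per leaf (illustrative)

    struct BLeaf {
      int numKeys;
      int keys[L + 1];                        // one spare slot so a leaf can overflow, then split
      BLeaf() : numKeys(0) {}
    };

    // called when a leaf holds L + 1 sorted keys; returns the new right sibling
    BLeaf * splitLeaf(BLeaf * leaf) {
      int keep = (L + 2) / 2;                 // ceil((L+1)/2) keys stay in the original
      BLeaf * sibling = new BLeaf();          // floor((L+1)/2) keys move to the new leaf
      for (int i = keep; i <= L; ++i)
        sibling->keys[sibling->numKeys++] = leaf->keys[i];
      leaf->numKeys = keep;
      return sibling;   // caller adds sibling (keyed by its smallest item) to the parent
    }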

267 After More Routine Inserts
14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).

268 Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete!
Just find the key to delete and snip it out! Easy! Done, right?

269 Deletion and Adoption A leaf has too few keys! Delete(5)
14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89

270 Deletion with Propagation
A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89

271 Finishing the Propagation (More Adoption)
Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!

272 A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1
OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89

273 Pulling out the Root A leaf has too few keys!
And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89

274 Pulling out the Root (continued)
has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89

275 Deletion in Two Boring Slides of Text
Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.

276 Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!

277 Thinking about B-Trees
B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion
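Rough arithmetic behind those counts, assuming every node is at its minimum fill (root with 2 children, other internal nodes with M/2 = 64 children, leaves with L/2 = 64 keys):

\[ \text{height } 4: \quad 2 \cdot 64^{3} \text{ leaves} \cdot 64 \text{ keys} = 33{,}554{,}432 \approx 3\times10^{7} \]
\[ \text{height } 5: \quad 2 \cdot 64^{4} \cdot 64 = 2{,}147{,}483{,}648 \approx 2\times10^{9} \]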

278 Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?

279 CSE 326: Data Structures Lecture #11 B-Trees
Alon Halevy Spring Quarter 2001 Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

280 B-Tree Properties Properties Result maximum branching factor of M
the root has between 2 and M children other internal nodes have between ⌈M/2⌉ and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between ⌈L/2⌉ and L keys all leaves are at the same depth Result tree is Θ(log⌈M/2⌉ (n/⌈L/2⌉)) +/- 1 deep = Θ(log n) all operations run in time proportional to depth operations pull in at least ⌈M/2⌉ or ⌈L/2⌉ items at a time The result of all this is that the tree in the worst case is log n deep. In particular, it’s about log⌈M/2⌉ n deep. Does this matter asymptotically? No. What about practically? YES! Since M and L are considered constants, all operations run in log n time. Each operation pulls in at most M search keys or L items at a time. So, we can tune L and M to the size of a disk block!

281 When Big-O is Not Enough
B-Tree is about logM/2 n/(L/2) deep = logM/2 n - logM/2 L/2 = O(logM/2 n) = O(log n) steps per operation (same as BST!) Where’s the beef?! log2( 10,000,000 ) = 24 disk accesses log200/2( 10,000,000 ) < 4 disk accesses

282 … … B-Tree Nodes Internal node Leaf
i search keys; i+1 subtrees; M - i - 1 inactive entries k1 k2 ki __ __ 1 2 i M - 1 Leaf j data keys; L - j inactive entries FIX M-I to M-I-1!! Alright, before we look at any examples, let’s look at what the node structure looks like. Internal nodes are arrays of pointers to children interspersed with search keys. Why must they be arrays rather than linked lists? Because we want contiguous memory! If the node has just I+1 children, it has I search keys, and M-I empty entries. A leaf looks similar (I’ll use green for leaves), and has similar properties. Why are these different? Because internal nodes need subtrees-1 keys. k1 k2 kj __ __ 1 2 j L

283 Example B-Tree with M = 4 and L = 4 10 40 3 15 20 30 50 1 2 10 11 12
This is just an example B-tree. Notice that it has 24 entries with a depth of only 2. A BST would be 4 deep. Notice also that the leaves are at the same level in the tree. I’ll use integers as both key and data, but we all know that that could as well be different data at the bottom, right? 1 2 10 11 12 20 25 26 40 42 3 5 6 9 15 17 30 32 33 36 50 60 70

284 Making a B-Tree Insert(3) Insert(14) Now, Insert(1)? The empty B-Tree
M = 3 L = 2 3 3 14 Insert(3) Insert(14) Alright, how do we insert and delete? Let’s start with the empty B-Tree. That’s one leaf as the root. Now, we’ll insert 3 and 14. Fine… What about inserting 1. Is there a problem? Now, Insert(1)?

285 Splitting the Root Insert(1) And create a new root Too many
keys in a leaf! 3 14 14 1 3 1 3 14 Insert(1) And create a new root 1 3 14 Too many keys in a leaf! Run away! How do we solve this? Well, we definitely need to split this leaf in two. But, now we don’t have a tree anymore. So, let’s make a new root and give it as children the two leaves. This is how B-Trees grow deeper. So, split the leaf.

286 Insertions and Split Ends
Too many keys in a leaf! 14 14 14 Insert(59) Insert(26) 1 3 14 26 59 1 3 14 1 3 14 59 14 26 59 So, split the leaf. Now, let’s do some more inserts. 59 is no problem. What about 26? Same problem as before. But, this time the split leaf just goes under the existing node because there’s still room. What if there weren’t room? 14 59 And add a new child 1 3 14 26 59

287 Too many keys in an internal node!
Propagating Splits 14 59 14 59 Insert(5) Add new child 1 3 5 14 26 59 1 3 14 26 59 1 3 5 Too many keys in an internal node! 5 1 3 14 26 59 5 14 26 59 1 3 When we insert 5, the leaf overflows, but its parent already has too many subtrees! What do we do? The same thing as before but this time with an internal node. We split the node. Normally, we’d hang the new subtrees under their parent, but in this case they don’t have one. Now we have two trees! Soltuion: same as before, make a new root and hang these under it. Create a new root So, split the node.

288 Insertion in Boring Text
Insert the key in its leaf If the leaf ends up with L+1 items, overflow! Split the leaf into two nodes: original with (L+1)/2 items new one with (L+1)/2 items Add the new child to the parent If the parent ends up with M+1 items, overflow! If an internal node ends up with M+1 items, overflow! Split the node into two nodes: original with (M+1)/2 items new one with (M+1)/2 items Add the new child to the parent If the parent ends up with M+1 items, overflow! Split an overflowed root in two and hang the new nodes under a new root OK, here’s that process as an algorithm. The new funky symbol is floor; that’s just like regular C++ integer division. Notice that this can propagate all the way up the tree. How often will it do that? Notice that the two new leaves or internal nodes are guaranteed to have enough items (or subtrees). Because even the floor of (L+1)/2 is as big as the ceiling of L/2. This makes the tree deeper!

289 Deletion in B-trees Come to section tomorrow. Slides follow.

290 After More Routine Inserts
14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).

291 Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete!
Just find the key to delete and snip it out! Easy! Done, right?

292 Deletion and Adoption A leaf has too few keys! Delete(5)
14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89

293 Deletion with Propagation
A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89

294 Finishing the Propagation (More Adoption)
Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!

295 A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1
OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89

296 Pulling out the Root A leaf has too few keys!
And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89

297 Pulling out the Root (continued)
has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89

298 Deletion in Two Boring Slides of Text
Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.

299 Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!

300 Thinking about B-Trees
B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion

301 Tree Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?

302 Hash Table Approach Zasha Steve f(x) Nic Brad Ed
But… is there a problem in this pipe-dream?

303 Hash Table Dictionary Data Structure
Hash function: maps keys to integers result: can quickly find the right spot for a given entry Unordered and sparse table result: cannot efficiently list all entries, Cannot find min and max efficiently, Cannot find all items within a specified range efficiently. f(x) Zasha Steve Nic Brad Ed A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing)

304 Hash Table Terminology
hash function Zasha f(x) Steve Nic collision Brad Ed keys load factor λ = (# of entries in table) / tableSize

305 Hash Table Code First Pass
Value & find(Key & key) { int index = hash(key) % tableSize; return Table[index]; } What should the hash function be? What should the table size be? How should we resolve collisions?

306 A Good Hash Function… is easy (fast) to compute (O(1) and practically fast). distributes the data evenly (hash(a)  hash(b) ). uses the whole hash table (for all 0  k < size, there’s an i such that hash(i) % size = k).

307 Good Hash Function for Integers
Choose tableSize to be prime hash(n) = n % tableSize Example: tableSize = 7 insert(4) insert(17) find(12) insert(9) delete(17) 1 2 3 4 5 6
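One way the example sequence plays out (simple arithmetic; no collision occurs, so collision handling is not needed here):

    insert(4):  4 % 7 = 4  → goes in slot 4
    insert(17): 17 % 7 = 3 → goes in slot 3
    find(12):   12 % 7 = 5 → slot 5 is empty, so 12 is not found
    insert(9):  9 % 7 = 2  → goes in slot 2
    delete(17): 17 % 7 = 3 → remove it from slot 3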

308 Good Hash Function for Strings?
Ideas?

309 Good Hash Function for Strings?
Sum the ASCII values of the characters. Consider only the first 3 characters. Uses only 2871 out of 17,576 entries in the table on English words. Let s = s1s2s3s4…sn: choose hash(s) = s1 + s2·128 + s3·128² + s4·128³ + … + sn·128^(n-1) Problems: hash(“really, really big”) = well… something really, really big hash(“one thing”) % 128 = hash(“other thing”) % 128 Think of the string as a base 128 number.

310 Making the String Hash Easy to Compute
Use Horner’s Rule
int hash(String s) {
  int h = 0;
  for (int i = s.length() - 1; i >= 0; i--) {
    h = (s[i] + 128 * h) % tableSize;   // taking % at every step keeps h from overflowing
  }
  return h;
}

311 Universal Hashing For any fixed hash function, there will be some pathological sets of inputs everything hashes to the same cell! Solution: Universal Hashing Start with a large (parameterized) class of hash functions No sequence of inputs is bad for all of them! When your program starts up, pick one of the hash functions to use at random (for the entire time) Now: no bad inputs, only unlucky choices! If universal class large, odds of making a bad choice very low If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

312 Universal Hash Function: “Random” Vector Approach
Parameterized by prime size and vector: a = <a0 a1 … ar> where 0 <= ai < size Represent each key as r + 1 integers where ki < size size = 11, key = ==> <3,9,7,5,2> size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4> ha(k) = (a0k0 + a1k1 + … + arkr) mod size, i.e. the dot product with a “random” vector!
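A sketch of this scheme; the struct name, the use of rand(), and the vector-of-ints key representation are illustrative assumptions:

    #include <cstdlib>     // std::rand
    #include <vector>

    struct UniversalHash {
      int size;                       // prime table size (assumed prime)
      std::vector<int> a;             // the "random" vector, entries in [0, size)

      UniversalHash(int primeSize, int r) : size(primeSize), a(r + 1) {
        for (int i = 0; i <= r; ++i)
          a[i] = std::rand() % size;  // chosen once, when the program starts up
      }

      // key already broken into at most r + 1 integers k[0..], each < size
      int hash(const std::vector<int> & k) const {
        long long h = 0;
        for (int i = 0; i < (int)k.size() && i < (int)a.size(); ++i)
          h = (h + (long long)a[i] * k[i]) % size;   // dot product, reduced mod size
        return (int)h;
      }
    };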

313 Universal Hash Function
Strengths: works on any type as long as you can form ki’s if we’re building a static table, we can try many a’s a random a has guaranteed good properties no matter what we’re hashing Weaknesses must choose prime table size larger than any ki

314 Hash Function Summary Goals of a hash function Hash functions
reproducible mapping from key to table entry evenly distribute keys across the table separate commonly occurring keys (neighboring keys?) complete quickly Hash functions h(n) = n % size h(n) = string as base 128 number % size Universal hash function #1: dot product with random vector The idea of neighboring keys here may change from application to application. In one context, neighboring keys may be those with the same last characters or first characters… say, when hashing names in a school system. Many people may have the same last names or first names (but few will have the same of both).

315 How to Design a Hash Function
Know what your keys are Study how your keys are distributed Try to include all important information in a key in the construction of its hash Try to make “neighboring” keys hash to very different places Prune the features used to create the hash until it runs “fast enough” (very application dependent)

316 Collisions Pigeonhole principle says we can’t avoid all collisions
try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? open hashing: put little dictionaries in each entry closed hashing: pick a next entry to try The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?

