Data Structures Introduction


Data Structures Introduction Alon Halevy

Clever? Efficient? Data Structures: Lists, Stacks, Queues, Heaps, Binary Search Trees, AVL Trees, Hash Tables, Graphs, Disjoint Sets. Algorithms and operations on them: Insert, Delete, Find, Merge, Union, Shortest Paths.

Used Everywhere! Graphics, Theory, AI, Applications, Systems. Mastery of this material sets you apart. Perhaps the most important course in your CS curriculum! Guaranteed non-obsolescence!

Anecdote #1: an N^2 "pretty print" routine nearly dooms a major expert system project at AT&T. 10 MB of data = 10 days (at 100 MIPS). The programmer was brilliant, but he skipped 326…

Asymptotic Complexity Our notion of efficiency: How the running time of an algorithm scales with the size of its input several ways to further refine: worst case average case amortized over a series of runs

The Apocalyptic Laptop Seth Lloyd, SCIENCE, 31 Aug 2000

[Chart: computation by a 1000 MIPS machine running since the Big Bang, compared against the Ultimate Laptop running for 1 second, 1 day, and 1 year.]

Specific Goals of the Course Become familiar with some of the fundamental data structures in computer science Improve ability to solve problems abstractly data structures are the building blocks Improve ability to analyze your algorithms prove correctness gauge (and improve) time complexity Become modestly skilled with the UNIX operating system (you’ll need this in upcoming courses) This course is designed to familiarize you with the most basic and important data structures in computer science. The ones that will form the foundation of all your future work with computers. Moreover, you’ll learn how to analyze your programs and data structures so that you know how well they work and what sort of effort in the program is acceptable. These are the goals of the course as well as my expectations of you.

One Preliminary Hurdle Recall what you learned in CSE 321 … proofs by mathematical induction proofs by contradiction formulas for calculating sums and products of series recursion Know Sec 1.1 – 1.4 of text by heart!

A Second Hurdle Unix Experience 1975 all over again! Try to login, edit, create a Makefile, and compile your favorite “hello world” program right away Programming Project #1 distributed Wednesday Bring your questions and frustrations to Section on Thursday!

A Third Hurdle: Templates class Set_of_ints { public: insert( int x ); boolean is_member( int x ); … } template <class Obj> class Set { insert( Obj x ); boolean is_member( Obj x ); … } Set <int> SomeNumbers; Set <char *> SomeWords;
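For concreteness, here is a minimal compilable sketch of the templated Set idea on this slide; the std::vector backing store and the exact method signatures beyond insert/is_member are assumptions for illustration, not the course's required interface.

#include <algorithm>
#include <vector>

// Hedged sketch: a tiny templated Set, assuming std::vector as the backing store.
template <class Obj>
class Set {
public:
    void insert(const Obj& x) {
        if (!is_member(x)) items.push_back(x);   // keep at most one copy of each element
    }
    bool is_member(const Obj& x) const {
        return std::find(items.begin(), items.end(), x) != items.end();
    }
private:
    std::vector<Obj> items;
};

// Usage, mirroring the slide:
//   Set<int> SomeNumbers;          SomeNumbers.insert(326);
//   Set<const char*> SomeWords;    // careful: find() compares pointers, not string contents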

In Every Silver Lining, There's a Big Dark Cloud – George Carlin. Templates were invented 12 years ago, and still no compiler correctly implements them! Using templates with multiple source files is tricky; see the course web pages and the TAs for the best way. MAINTAINING SANITY RULE: write/debug first without templates, templatize as needed, keep it simple!

Handy Libraries From Weiss: vector<int> MySafeIntArray; vector<double> MySafeFloatArray; string MySafeString; Like arrays and char*, but they provide bounds checking and memory management. STL (Standard Template Library): most of CSE 326 in a box; don't use it (unless told); we'll be rolling our own.

C++ ≠ Data Structures One of the all-time great books in computer science: The Art of Computer Programming (1968-1973) by Donald Knuth. Examples in assembly language (and English)! American Scientist says: in the top 12 books of the CENTURY! Very little about C++ in class.

Abstract Data Types Abstract Data Type (ADT) Data Types Algorithms Mathematical description of an object and the set of operations on the object tradeoffs! Given that this is computer science, I know you’d be disappointed if there were no acronyms in the class. Here’s our first one! Now, what an ADT really is is the interface of a data structure without any specification of the implementation. In this class, we’ll study groups of data structures to implement any given abstract data type. In that context… Data Types integer, array, pointers, … Algorithms binary search, quicksort, …

ADT Presentation Algorithm Present an ADT Motivate with some applications Repeat until it’s time to move on: develop a data structure and algorithms for the ADT analyze its properties efficiency correctness limitations ease of programming Contrast strengths and weaknesses Given those definitions, here’s our first algorithm. This is how I’m going to try to present each set of data structures to you. You should hold me to this! You’re not getting enough out of the presentation if you don’t see these. And look, here’s an ADT now…

First Example: Queue ADT Queue operations create destroy enqueue dequeue is_empty Queue property: if x is enQed before y is enQed, then x will be deQed before y is deQed FIFO: First In First Out F E D C B enqueue dequeue G A You’ve probably seen the Queue before. If so, this is a review and a way for us to get comfortable with the format of data structure presentations in this class. If not, this is a simple but very powerful data structure, and you should make sure you understand it thoroughly. This is an ADT description of the queue. Notice that there are no implementation details. Just a general description of the interface and important properties of those interface methods.

Applications of the Q Hold jobs for a printer Store packets on network routers Make waitlists fair Breadth first search Qs are used widely in computer science. This is just a handful of the high profile uses, but _many_ programs use queues.

Circular Array Q Data Structure size - 1 b c d e f front back enqueue(Object x) { Q[back] = x ; back = (back + 1) % size } How test for empty list? How to find K-th element in the queue? What is complexity of these operations? Limitations of this structure? Here is a data structure implementation of the Q. The queue is stored as an array, and, to avoid shifting all the elements each time an element is dequeued, we imagine that the array wraps around on itself. This is an excellent example of how implementation can affect interface: notice the “is_full” function. There’s also another problem here. What’s wrong with the Enqueue and Dequeue functions? Your data structures should be robust! Make them robust before you even consider thinking about making them efficient! That is an order! dequeue() { x = Q[front] ; front = (front + 1) % size; return x ; }
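A hedged, compilable C++ sketch of the circular-array queue above; the fixed capacity, the int element type, and the count field (which answers the "how do we test for empty?" question) are assumptions added for illustration.

#include <cassert>

class CircularQueue {
public:
    explicit CircularQueue(int capacity)
        : size(capacity), front(0), back(0), count(0), Q(new int[capacity]) {}
    ~CircularQueue() { delete[] Q; }
    bool is_empty() const { return count == 0; }      // empty test: track the number of elements
    bool is_full()  const { return count == size; }   // a limitation of the fixed-size array
    void enqueue(int x) {                              // O(1)
        assert(!is_full());
        Q[back] = x;
        back = (back + 1) % size;                      // wrap around
        ++count;
    }
    int dequeue() {                                    // O(1)
        assert(!is_empty());
        int x = Q[front];
        front = (front + 1) % size;
        --count;
        return x;
    }
    int kth(int k) const {                             // k-th element from the front, O(1)
        assert(k >= 1 && k <= count);
        return Q[(front + k - 1) % size];
    }
private:
    int size, front, back, count;
    int* Q;
};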

Linked List Q Data Structure b c d e f front back enqueue(Object x) { back->next = new Node(x); back = back->next; } dequeue() { saved = front->data; temp = front; front = front->next; delete temp ; return saved;} What are tradeoffs? simplicity speed robustness memory usage Notice the tricky memory management

To Do Return your survey before leaving! Sign up on the cse326 mailing list Check out the web page Log on to the PCs in course labs and access an instructional UNIX server Read Chapters 1 and 2 in the book

Data Structures Analysis of Algorithms Alon Halevy

Analysis of Algorithms Analysis of an algorithm gives insight into how long the program runs and how much memory it uses: time complexity, space complexity. Why useful? Input size is indicated by a number n; sometimes there are multiple inputs, e.g. m and n. Running time is a function of n: n, n^2, n log n, 18 + 3n(log n^2) + 5n^3

Simplifying the Analysis Eliminate low-order terms: 4n + 5 → 4n; 0.5 n log n - 2n + 7 → 0.5 n log n; 2^n + n^3 + 3n → 2^n. Eliminate constant coefficients: 4n → n; 0.5 n log n → n log n; log(n^2) = 2 log n → log n; log_3 n = (log_3 2) log n → log n. We didn't get very precise in our analysis of the UWID info finder; why? Didn't know the machine we'd use. Is this always true? Do you buy that coefficients and low order terms don't matter? When might they matter? (Linked list memory usage)

Order Notation BIG-O: T(n) = O(f(n)), upper bound; there exist constants c and n0 such that T(n) ≤ c f(n) for all n ≥ n0. OMEGA: T(n) = Ω(f(n)), lower bound; T(n) ≥ c f(n) for all n ≥ n0. THETA: T(n) = Θ(f(n)), tight bound; Θ(f(n)) means both O(f(n)) and Ω(f(n)). We'll use some specific terminology to describe asymptotic behavior. There are some analogies here that you might find useful.

Examples: n^2 + 100n = O(n^2) = Ω(n^2) = Θ(n^2), since (n^2 + 100n) ≤ 2·n^2 for n ≥ 100 and (n^2 + 100n) ≥ 1·n^2 for n ≥ 0. Also n log n = O(n^2), n log n = Θ(n log n), and n log n = Ω(n).

More on Order Notation Order notation is not symmetric; write 2n^2 + 4n = O(n^2) but never O(n^2) = 2n^2 + 4n; the right-hand side is a crudification of the left. Likewise O(n^2) = O(n^3) and Ω(n^3) = Ω(n^2).

A Few Comparisons Function #1 vs. Function #2: n^3 + 2n^2 vs. 100n^2 + 1000; n^0.1 vs. log n; n + 100n^0.1 vs. 2n + 10 log n; 5n^5 vs. n!; n^-15 · 2^(n/100) vs. 1000n^15; 8^(2 log n) vs. 3n^7 + 7n

Race I n^3 + 2n^2 vs. 100n^2 + 1000

Race II n^0.1 vs. log n Well, log n looked good out of the starting gate and indeed kept on looking good until about n^17, at which point n^0.1 passed it up forever. Moral of the story? n^ε beats log n for any ε > 0. BUT, which one of these is really better?

Race III n + 100n^0.1 vs. 2n + 10 log n Notice that these just look like n and 2n once we get way out. That's because the larger terms dominate. So, the left is less, but not asymptotically less. It's a TIE!

Race IV 5n^5 vs. n! n! is BIG!!!

Race V n^-15 · 2^(n/100) vs. 1000n^15 No matter how you put it, any exponential beats any polynomial. It doesn't even take that long here (input size ~250)

Race VI 8^(2 log n) vs. 3n^7 + 7n We can reduce the left-hand term to n^6, so they're both polynomial and it's an open and shut case.

The Losers Win: the slower-growing function is the better algorithm!
Race I: n^3 + 2n^2 vs. 100n^2 + 1000, better: O(n^2)
Race II: n^0.1 vs. log n, better: O(log n)
Race III: n + 100n^0.1 vs. 2n + 10 log n, TIE: both O(n)
Race IV: 5n^5 vs. n!, better: O(n^5)
Race V: n^-15 · 2^(n/100) vs. 1000n^15, better: O(n^15)
Race VI: 8^(2 log n) vs. 3n^7 + 7n, better: O(n^6)
Welcome, everyone, to the Silicon Downs. I'm getting race results as we stand here. Let's start with the first race. I'll have the first row bet on race #1. Raise your hand if you bet on function #1 (the jockey is n^0.1) So on. Show the race slides after each race.

Common Names constant: O(1) logarithmic: O(log n) linear: O(n) log-linear: O(n log n) superlinear: O(n^(1+c)) (c is a constant > 0) quadratic: O(n^2) polynomial: O(n^k) (k is a constant) exponential: O(c^n) (c is a constant > 1) Well, it turns out that the old Silicon Downs is fixed. They dope up the horses to make the first few laps interesting, but we can always find out who wins. Here's a chart comparing some of the functions. Notice that any exponential beats any polynomial. Any superlinear beats any poly-log-linear. Also keep in mind (though I won't show it) that sometimes the input has more than one parameter. Like if you take in two strings. In that case you need to be very careful about what is constant and what can be ignored. O(log m + 2^n) is not necessarily O(2^n)

Kinds of Analysis Running time may depend on actual data input, not just length of input Distinguish worst case your worst enemy is choosing input best case average case assumes some probabilistic distribution of inputs amortized average time over many operations We already discussed the bound flavor. All of these can be applied to any analysis case. For example, we’ll later prove that sorting in the worst case takes at least n log n time. That’s a lower bound on a worst case. Average case is hard! What does “average” mean. For example, what’s the average case for searching an unordered list (as precise as possible, not asymptotic). WRONG! It’s about n, not 1/2 n. Why? You have to search the whole thing if the elt is not there. Note there’s two senses of tight. I’ll try to avoid the terminology “asymptotically tight” and stick with the lower def’n of tight. O(inf) is not tight!

Analyzing Code C++ operations - constant time consecutive stmts - sum of times conditionals - sum of branches, condition loops - sum of iterations function calls - cost of function body recursive functions - solve recursive equation Above all, use your head!

Nested Loops for i = 1 to n do for j = 1 to n do sum = sum + 1 This example is pretty straightforward. Each loop goes N times, constant amount of work on the inside. N*N*1 = O(N^2)

Nested Dependent Loops for i = 1 to n do for j = i to n do sum = sum + 1 There’s a little twist here. J goes from I to N, not 1 to N. So, let’s do the sums inside is constant. Next loop is sum I to N of 1 which equals N - I + 1 Outer loop is sum 1 to N of N - I + 1 That’s the same as sum N to 1 of I or N(N+1)/2 or O(N^2)
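In symbols, the sum the speaker notes walk through is:

\sum_{i=1}^{n}\sum_{j=i}^{n} 1 \;=\; \sum_{i=1}^{n} (n - i + 1) \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)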

Conditionals if C then S1 else S2: time ≤ time(C) + Max( time(S1), time(S2) ) OK, so this isn't exactly an example. Just reiterating the rule. Time ≤ time of C plus max of S1 and S2 ≤ time of C plus S1 plus S2. Loops: time ≤ sum of times of iterations, often # of iterations * time of S (or worst time of S)

Coming Up Thursday: Unix tutorial, first programming project! Friday: finishing up analysis, a little on Stacks and Lists, Homework #1 goes out

Data Structures Analysis of Recursive Algorithms Alon Halevy

Nested Dependent Loops for i = 1 to n do for j = i to n do sum = sum + 1 There’s a little twist here. J goes from I to N, not 1 to N. So, let’s do the sums inside is constant. Next loop is sum I to N of 1 which equals N - I + 1 Outer loop is sum 1 to N of N - I + 1 That’s the same as sum N to 1 of I or N(N+1)/2 or O(N^2)

Recursion A recursive procedure can often be analyzed by solving a recursive equation Basic form: T(n) = if (base case) then some constant else ( time to solve subproblems + time to combine solutions ) Result depends upon how many subproblems how much smaller are subproblems how costly to combine solutions (coefficients) You may want to take notes on this slide as it just vaguely resembles a homework problem! Here’s a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH Answer: O(n)

Example: Sum of Integer Queue sum_queue(Q){ if (Q.length == 0 ) return 0; else return Q.dequeue() + sum_queue(Q); } One subproblem, linear reduction in size (decrease by 1), combining: constant c (+), 1×subproblem. Equation: T(0) ≤ b; T(n) ≤ c + T(n – 1) for n > 0 Here's a function defined in terms of itself. You see this a lot with recursion. This one is a lot like the profile for factorial. WORK THROUGH Answer: O(n)

Sum, Continued Equation: T(0) ≤ b; T(n) ≤ c + T(n – 1) for n > 0. Solution: T(n) ≤ c + T(n-1) ≤ c + c + T(n-2) ≤ c + c + c + T(n-3) ≤ … ≤ kc + T(n-k) for all k; taking k = n, T(n) ≤ nc + T(0) ≤ cn + b = O(n)

Example: Binary Search 7 12 30 35 75 83 87 90 97 99 One subproblem, half as large. Equation: T(1) ≤ b; T(n) ≤ T(n/2) + c for n > 1. Solution: T(n) ≤ T(n/2) + c ≤ T(n/4) + c + c ≤ T(n/8) + c + c + c ≤ T(n/2^k) + kc = T(1) + c log n (taking k = log n) ≤ b + c log n = O(log n) Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation.
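For concreteness, a hedged C++ sketch of a search with exactly this recurrence, T(n) ≤ T(n/2) + c; the sorted std::vector<int> interface and the -1 "not found" convention are assumptions.

#include <vector>

// Returns the index of x in the sorted range [lo, hi) of v, or -1 if absent.
int binary_search(const std::vector<int>& v, int x, int lo, int hi) {
    if (lo >= hi) return -1;                  // base case: empty range, T(1) <= b
    int mid = lo + (hi - lo) / 2;
    if (v[mid] == x) return mid;
    if (x < v[mid]) return binary_search(v, x, lo, mid);   // one subproblem,
    return binary_search(v, x, mid + 1, hi);               // half as large: T(n/2) + c
}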

Example: MergeSort Split array in half, sort each half, merge together: 2 subproblems, each half as large, linear amount of work to combine. T(1) ≤ b; T(n) ≤ 2T(n/2) + cn for n > 1. T(n) ≤ 2T(n/2) + cn ≤ 2(2T(n/4) + cn/2) + cn = 4T(n/4) + cn + cn ≤ 4(2T(n/8) + c(n/4)) + cn + cn = 8T(n/8) + cn + cn + cn ≤ 2^k T(n/2^k) + kcn = 2^k T(1) + cn log n, where k = log n, so T(n) = O(n log n) This is the same sort of analysis as last slide. Here's a function defined in terms of itself. WORK THROUGH Answer: O(n log n) Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series. Tip: Look for powers/multiples of the numbers that appear in the original equation.
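A hedged C++ sketch of the MergeSort being analyzed, using std::inplace_merge for the combine step; the std::vector<int> interface is an assumption.

#include <algorithm>
#include <vector>

// Sort v[lo, hi): split in half, sort each half, merge: T(n) = 2T(n/2) + cn.
void merge_sort(std::vector<int>& v, int lo, int hi) {
    if (hi - lo <= 1) return;                 // base case: 0 or 1 element
    int mid = lo + (hi - lo) / 2;
    merge_sort(v, lo, mid);                   // 2 subproblems, each half as large
    merge_sort(v, mid, hi);
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);  // combine step
}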

Example: Recursive Fibonacci int Fib(n){ if (n == 0 or n == 1) return 1 ; else return Fib(n - 1) + Fib(n - 2); } Running time, lower bound analysis: T(0), T(1) ≥ 1; T(n) ≥ T(n - 1) + T(n - 2) + c if n > 1. Note: T(n) ≥ Fib(n). Fact: Fib(n) ≥ (3/2)^n, so the running time is Ω( (3/2)^n ). Why? This is the same sort of analysis as last slide. Here's a function defined in terms of itself. WORK THROUGH Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.

Direct Proof of Recursive Fibonacci int Fib(n) if (n == 0 or n == 1) return 1 else return Fib(n - 1) + Fib(n - 2) Lower bound analysis: T(0), T(1) ≥ b; T(n) ≥ T(n - 1) + T(n - 2) + c if n > 1. Analysis: let φ be (1 + √5)/2, which satisfies φ^2 = φ + 1; show by induction on n that T(n) ≥ b·φ^(n - 1). This is the same sort of analysis as last slide. Here's a function defined in terms of itself. WORK THROUGH Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.

Direct Proof Continued Basis: T(0) ≥ b > b·φ^(-1) and T(1) ≥ b = b·φ^0. Inductive step: assume T(m) ≥ b·φ^(m - 1) for all m < n. Then T(n) ≥ T(n - 1) + T(n - 2) + c ≥ b·φ^(n-2) + b·φ^(n-3) + c = b·φ^(n-3)(φ + 1) + c = b·φ^(n-3)·φ^2 + c ≥ b·φ^(n-1)

Fibonacci Call Tree [Figure: the recursive call tree for Fib(5); the node labels 5 3 4 3 2 2 1 1 2 1 1 1 show the same subproblems being recomputed over and over]

Learning from Analysis To avoid recursive calls store all basis values in a table each time you calculate an answer, store it in the table before performing any calculation for a value n check if a valid answer for n is in the table if so, return it Memoization a form of dynamic programming How much time does memoized version take?
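A hedged C++ sketch of the memoized Fibonacci described above; the std::vector table indexed by n and the 0 sentinel are assumptions. With the table, each value is computed once, so the running time drops to O(n).

#include <vector>

long long fib_memo(int n, std::vector<long long>& table) {
    if (table[n] != 0) return table[n];        // valid answer already in the table? return it
    long long result;
    if (n == 0 || n == 1) result = 1;          // basis values
    else result = fib_memo(n - 1, table) + fib_memo(n - 2, table);
    table[n] = result;                         // store the answer before returning
    return result;
}

long long fib(int n) {
    std::vector<long long> table(n + 1, 0);    // 0 marks "not computed yet"
    return fib_memo(n, table);
}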

Kinds of Analysis So far we have considered worst case analysis We may want to know how an algorithm performs “on average” Several distinct senses of “on average” amortized average time per operation over a sequence of operations average case average time over a random distribution of inputs expected case average time for a randomized algorithm over different random seeds for any input

Amortized Analysis Consider any sequence of operations applied to a data structure your worst enemy could choose the sequence! Some operations may be fast, others slow Goal: show that the average time per operation is still good

Stack ADT Stack operations B C D E F E D C B A F Stack operations push pop is_empty Stack property: if x is on the stack before y is pushed, then x will be popped after y is popped What is biggest problem with an array implementation?

Stretchy Stack Implementation int * data; int maxsize; int top; Push(e){ if (top == maxsize){ temp = new int[2*maxsize]; copy data into temp; deallocate data; data = temp; maxsize = 2*maxsize; } data[top++] = e; } Best case Push = O( ) Worst case Push = O( )

Stretchy Stack Amortized Analysis Consider a sequence of n operations: push(3); push(19); push(2); … What is the max number of stretches? What is the total time? Let's say a regular push takes time a, and stretching an array containing k elements takes time kb, for some constants a and b. Amortized time = (an + b(2n-1))/n = O(1)
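Spelled out (assuming, as on the previous slide, that the array doubles each time it fills, so the copies have sizes 1, 2, 4, …, n):

T(n) \;\le\; \underbrace{an}_{\text{regular push work}} \;+\; \underbrace{b(1 + 2 + 4 + \cdots + n)}_{\text{stretch work}} \;\le\; an + b(2n - 1),
\qquad
\frac{T(n)}{n} \;\le\; a + 2b - \frac{b}{n} \;=\; O(1).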

Wrapup Having math fun? Homework #1 out Wednesday – due in one week. Programming assignment #1 handed out. Next week: linked lists

Data Structures Alon Halevy

Direct Proof of Recursive Fibonacci int Fib(n) if (n == 0 or n == 1) return 1 else return Fib(n - 1) + Fib(n - 2) Lower bound analysis: T(0), T(1) ≥ b; T(n) ≥ T(n - 1) + T(n - 2) + c if n > 1. Analysis: let φ be (1 + √5)/2, which satisfies φ^2 = φ + 1; show by induction on n that T(n) ≥ b·φ^(n - 1). This is the same sort of analysis as last slide. Here's a function defined in terms of itself. WORK THROUGH Generally, then, the strategy is to keep expanding these things out until you see a pattern. Then, write the general form. Finally, sub in for the series bounds to make T(?) come out to a known value and solve all the series.

Direct Proof Continued Basis: T(0) ≥ b > b·φ^(-1) and T(1) ≥ b = b·φ^0. Inductive step: assume T(m) ≥ b·φ^(m - 1) for all m < n. Then T(n) ≥ T(n - 1) + T(n - 2) + c ≥ b·φ^(n-2) + b·φ^(n-3) + c = b·φ^(n-3)(φ + 1) + c = b·φ^(n-3)·φ^2 + c ≥ b·φ^(n-1)

Fibonacci Call Tree [Figure: the recursive call tree for Fib(5); the node labels 5 3 4 3 2 2 1 1 2 1 1 1 show the same subproblems being recomputed over and over]

Learning from Analysis To avoid recursive calls store all basis values in a table each time you calculate an answer, store it in the table before performing any calculation for a value n check if a valid answer for n is in the table if so, return it Memoization a form of dynamic programming How much time does memoized version take?

Kinds of Analysis So far we have considered worst case analysis We may want to know how an algorithm performs “on average” Several distinct senses of “on average” amortized average time per operation over a sequence of operations average case average time over a random distribution of inputs expected case average time for a randomized algorithm over different random seeds for any input

Amortized Analysis Consider any sequence of operations applied to a data structure your worst enemy could choose the sequence! Some operations may be fast, others slow Goal: show that the average time per operation is still good

Stack ADT Stack operations B C D E F E D C B A F Stack operations push pop is_empty Stack property: if x is on the stack before y is pushed, then x will be popped after y is popped What is biggest problem with an array implementation?

Stretchy Stack Implementation int * data; int maxsize; int top; Push(e){ if (top == maxsize){ temp = new int[2*maxsize]; copy data into temp; deallocate data; data = temp; maxsize = 2*maxsize; } data[top++] = e; } Best case Push = O( ) Worst case Push = O( )

Stretchy Stack Amortized Analysis Consider a sequence of n operations: push(3); push(19); push(2); … What is the max number of stretches? What is the total time? Let's say a regular push takes time a, and stretching an array containing k elements takes time kb, for some constants a and b. Amortized time = (an + b(2n-1))/n = a + 2b - b/n = O(1)

Average Case Analysis Attempt to capture the notion of "typical" performance. Imagine inputs are drawn from some random distribution. Ideally this distribution is a mathematical model of the real world. In practice it is usually much simpler, e.g., a uniform random distribution.

Example: Find a Red Card Input: a deck of n cards, half red and half black Algorithm: turn over cards (from top of deck) one at a time until a red card is found. How many cards will be turned over? Best case = Worst case = Average case: over all possible inputs (ways of shuffling deck)

Summary Asymptotic Analysis – scaling with size of input. Upper bound O, lower bound Ω. O(1) or O(log n): great. O(2^n): almost never okay. Worst case most important – strong guarantee. Other kinds of analysis sometimes useful: amortized, average case

List ADT ( A1 A2 … An-1 An ), length = n. List properties: Ai precedes Ai+1 for 1 ≤ i < n; Ai succeeds Ai-1 for 1 < i ≤ n; the size-0 list is defined to be the empty list. Key operations: Find(item) = position; Find_Kth(integer) = item; Insert(item, position); Delete(position); Next(position) = position. What are some possible data structures? Now, back to work! We're going to talk about lists briefly and quickly get to an idea which I hope you haven't seen. Lists are sets of values. The type of those values is arbitrary but fixed (can't change from one to another in the same list). Each value is at a position, and those positions are totally ordered.

Implementations of Linked Lists Array: 1 2 3 4 5 6 7 8 9 10 H W 1 I S E A S Y Can we apply binary search to an array representation? Linked list: (optional header) (a b c) a b c  L

Linked List vs. Array linked list array sorted array Find(item) = position Find_Kth(integer)=item Find_Kth(1)=item Insert(item, position) Insert(item) Delete(position) Next(position) = position

Tradeoffs For what kinds of applications is a linked list best? Examples for an unsorted array? Examples for a sorted array?

Implementing in C++ Create separate classes for Node, List (contains a pointer to the first node), and List Iterator (specifies a position in a list; basically, just a pointer to a node). Pro: syntactically distinguishes uses of node pointers. Con: a lot of verbiage! Also, is a position in a list really distinct from a list?

Data Structures Alon Halevy

Implementations of Linked Lists Array: 1 2 3 4 5 6 7 8 9 10 H W 1 I S E A S Y Can we apply binary search to an array representation? Linked list: (optional header) (a b c) a b c  L

Linked List vs. Array linked list array sorted array Find(item) = position Find_Kth(integer)=item Find_Kth(1)=item Insert(item, position) Insert(item) Delete(position) Next(position) = position

Tradeoffs For what kinds of applications is a linked list best? Examples for an unsorted array? Examples for a sorted array?

Implementing in C++ Create separate classes for Node, List (contains a pointer to the first node), and List Iterator (specifies a position in a list; basically, just a pointer to a node). Pro: syntactically distinguishes uses of node pointers. Con: a lot of verbiage! Also, is a position in a list really distinct from a list?

Other Data Structures for Lists Doubly Linked List Circular List 7 11 3 2 Advantages/disadvantages (previous for doubly linked list) your book also describes header nodes. Are they just a hack? I’m not going to go into these, but: You should be able to (for a test) add and delete nodes in all these types of list; not to mention for your daily coding needs! c d e f

Implementing Linked Lists Using Arrays 1 2 3 4 5 6 7 8 9 10 Data F O A R N R T Next 3 8 6 4 -1 10 5 First = 2 “Cursor implementation” Ch 3.2.8 Often useful in any language Can use same array to manage a second list of unused cells

Application: Polynomial ADT Ai is the coefficient of the x^(n-i) term: 3x^2 + 2x + 5 ( 3 2 5 ); 8x + 7 ( 8 7 ); x^2 + 3 ( 1 0 3 ). Here's an application of the list abstract data type as a _data structure_ for another abstract data type. Is there a problem here? Why? Problem?

3x^2001 + 4 ( 4 0 0 0 … [about 2000 zeros] … 0 0 3 ) What is it about lists that makes this a problem here and not in stacks and queues? (Answer: kth(int)!) Is there a solution? Will we get anything but zeroes overwhelming this data structure?

Sparse List Data Structure: 3x^2001 + 4 (<4 0> <2001 3>) 4 3 2001 This slide is made possible in part by the sparse list data structure. Now, two questions: 1) Is a sparse list really a data structure or an abstract data type? (Answer: It depends but I lean toward data structure. YOUR ANSWER MUST HAVE JUSTIFICATION!) 2) Which list data structure should we use to implement it? Linked Lists or Arrays?

Addition of Two Polynomials Similar to merging two sorted lists – O(n+m). p = 15 + 10x^50 + 3x^1200; q = 5 + 30x^50 + 4x^100; r = p + q = 20 + 40x^50 + 4x^100 + 3x^1200
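A hedged C++ sketch of this addition, treating each polynomial as a singly linked list of (coefficient, exponent) nodes sorted by increasing exponent; the Term layout is an assumption. It is the same merge as for two sorted lists, so the cost is O(n + m).

struct Term {                  // one (coefficient, exponent) node of a sparse polynomial
    int coef;
    int exp;
    Term* next;
};

// Add polynomials p and q (both sorted by increasing exponent); returns a new list.
Term* add_poly(const Term* p, const Term* q) {
    Term head{0, 0, nullptr};             // dummy header to simplify appends
    Term* tail = &head;
    while (p != nullptr && q != nullptr) {
        int coef, exp;
        if (p->exp < q->exp)      { coef = p->coef; exp = p->exp; p = p->next; }
        else if (q->exp < p->exp) { coef = q->coef; exp = q->exp; q = q->next; }
        else { coef = p->coef + q->coef; exp = p->exp; p = p->next; q = q->next; }
        if (coef != 0) tail = tail->next = new Term{coef, exp, nullptr};
    }
    for (const Term* r = (p != nullptr ? p : q); r != nullptr; r = r->next)
        tail = tail->next = new Term{r->coef, r->exp, nullptr};   // copy the leftover tail
    return head.next;
}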

Multiple Linked Lists Many ADTs such as graphs, relations, sparse matrices, and multivariate polynomials use multiple linked lists. Several options: array of lists, lists of lists, multi-lists. General principle throughout the course: use one ADT to implement a more complicated one.

Array of Linked Lists: Adjacency List for Graphs 1 3 2 5 4 Array G of unordered linked lists Each list entry corresponds to an edge in the graph G Graphs are a very important data type. You might think as you read about your project if there are any graphs there. Here, we’re implementing graphs with adjacency lists. The reason is that this is a sparse graph. We want to have every node in an array (so we can find the first edge quickly), but we just need the edges around. 1 5 2 2 4 3 5 3 1 4 4 5 3 5

Reachability by Marking Suppose we want to mark all the nodes in the graph which are reachable from a given node k. Let G[1..n] be the adjacency list rep. of the graph. Let M[1..n] be the mark array, initially all false. mark(int i){ M[i] = true; x = G[i]; while (x != NULL) { if (M[x->node] == false) mark(x->node); x = x->next; } } Here's an algorithm that works on our adj list graph.

Multi-Lists Suppose we have a set of movies and cinemas, and we want a structure that stores which movies are playing where.

More on Multi-Lists What if we also want to store the playing times of movies?

Data Structures (end of Lists, then) Trees Alon Halevy

Application: Polynomial ADT Ai is the coefficient of the x^(n-i) term: 3x^2 + 2x + 5 ( 3 2 5 ); 8x + 7 ( 8 7 ); x^2 + 3 ( 1 0 3 ). Here's an application of the list abstract data type as a _data structure_ for another abstract data type. Is there a problem here? Why? Problem?

3x^2001 + 4 ( 4 0 0 0 … [about 2000 zeros] … 0 0 3 ) What is it about lists that makes this a problem here and not in stacks and queues? (Answer: kth(int)!) Is there a solution? Will we get anything but zeroes overwhelming this data structure?

Sparse List Data Structure: 3x^2001 + 4 (<4 0> <2001 3>) 4 3 2001 This slide is made possible in part by the sparse list data structure. Now, two questions: 1) Is a sparse list really a data structure or an abstract data type? (Answer: It depends but I lean toward data structure. YOUR ANSWER MUST HAVE JUSTIFICATION!) 2) Which list data structure should we use to implement it? Linked Lists or Arrays?

Addition of Two Polynomials Similar to merging two sorted lists – O(n+m). p = 15 + 10x^50 + 3x^1200; q = 5 + 30x^50 + 4x^100; r = p + q = 20 + 40x^50 + 4x^100 + 3x^1200

Multiple Linked Lists Many ADTs such as graphs, relations, sparse matrices, and multivariate polynomials use multiple linked lists. Several options: array of lists, lists of lists, multi-lists. General principle throughout the course: use one ADT to implement a more complicated one.

Array of Linked Lists: Adjacency List for Graphs 1 3 2 5 4 Array G of unordered linked lists Each list entry corresponds to an edge in the graph G Graphs are a very important data type. You might think as you read about your project if there are any graphs there. Here, we’re implementing graphs with adjacency lists. The reason is that this is a sparse graph. We want to have every node in an array (so we can find the first edge quickly), but we just need the edges around. 1 5 2 2 4 3 5 3 1 4 4 5 3 5

Reachability by Marking Suppose we want to mark all the nodes in the graph which are reachable from a given node k. Let G[1..n] be the adjacency list rep. of the graph. Let M[1..n] be the mark array, initially all false. mark(int i){ M[i] = true; x = G[i]; while (x != NULL) { if (M[x->node] == false) mark(x->node); x = x->next; } } Here's an algorithm that works on our adj list graph.

Multi-Lists Suppose we have a set of movies and cinemas, and we want a structure that stores which movies are playing where.

More on Multi-Lists What if we also want to store the playing times of movies?

Trees Family Trees Organization Charts Classification trees is this mushroom poisonous? File directory structure Parse Trees (x+y*z) Search Trees often better than lists for sorted data

Definition of a Tree Recursive definition: r T1 T2 T3 empty tree has no root given trees T1,…,Tk and a node r, there is a tree T where r is the root of T the children of r are the roots of T1, T2, …, Tk r T1 T2 T3

Tree Terminology root child parent sibling path descendent ancestor a j b f k l e c g Let’s review the words: root: A leaf: DEFJKLMNI child:A - C or H - K leaves have no children parent: C - A or L - H the root has no parent sibling: D - E or F or J - K,L,M, or N grandparent: G to A grandchild: C to H or I ancestor: the node itself or any ancestor’s parent descendent: the node itself or any child’s descendent subtree: a node and all its descendents

More Tree Terminology a subtree leaf depth height branching factor n-ary complete e b c d h i j f g Let’s review the words: root: A leaf: DEFJKLMNI child:A - C or H - K leaves have no children parent: C - A or L - H the root has no parent sibling: D - E or F or J - K,L,M, or N grandparent: G to A grandchild: C to H or I ancestor: the node itself or any ancestor’s parent descendent: the node itself or any child’s descendent subtree: a node and all its descendents k l

Basic Tree Data Structure first_child next_sibling a b c d e 
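A hedged C++ sketch of the first_child / next_sibling node on this slide; the char payload is an assumption.

struct TreeNode {
    char data;                 // payload stored at this node (assumed type)
    TreeNode* first_child;     // leftmost child, or NULL
    TreeNode* next_sibling;    // next child of this node's parent, or NULL
    explicit TreeNode(char d) : data(d), first_child(nullptr), next_sibling(nullptr) {}
};

// e.g., a root 'a' whose first two children are 'b' and 'c':
//   TreeNode* a = new TreeNode('a');
//   a->first_child = new TreeNode('b');
//   a->first_child->next_sibling = new TreeNode('c');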

Logical View of Tree a i d h j b f k l e c g

Actual Data Structure a b c d e h i j f g k l

Combined View of Tree a b c d e h i j f g k l

Traversals Many algorithms involve walking through a tree, and performing some computation at each node Walking through a tree is called a traversal Common kinds of traversal Pre-order Post-order Level-order

Pre-Order Traversal Perform computation at the node, then recursively perform computation on each child preorder(node * n){ node * c; if (! n==NULL){ DO SOMETHING; c = n->first_child; while (! c==NULL){ preorder(c); c = c->next_sibling; } }

Pre-Order Traversal Example i d h j b f k l e c g Start with a -

Pre-Order Applications Use when computation at node depends upon values calculated higher in the tree (closer to root) Example: computing depth depth(node) = 1 + depth( parent of node ) Another example: printing out a directory structure.

Computing Depth of All Nodes Add a field “depth” to all nodes Depth(node * n, int d){ node * c; if (! n==NULL){ n->depth = d; d = d+1; c = n->first_child; while (! c==NULL){ Depth(c, d); c = c->next_sibling; } } Call Depth(root,0) to set depth field correctly

Depth Calculation a i d h j b f k l e c g

Post-Order Traversal Recursively perform computation on each child, and then perform computation at node postorder(node * n){ node * c; if (! n==NULL){ c = n->first_child; while (! c==NULL){ postorder(c); c = c->next_sibling; } DO SOMETHING;

Post-Order Applications Use when computation at node depends upon values calculated lower in the tree (closer to leaves). Example: computing height: height(node) = 1 + MAX( height(child1), height(child2), … height(childk) ). Example: size of tree rooted at node: size(node) = 1 + size(child1) + size(child2) + … + size(childk)

Computing Size of Tree Size(node * n){ node * c; if (n == NULL) return 0; else { int m = 1; c = n->first_child; while (c != NULL){ m = m + Size(c); c = c->next_sibling; } return m; } } Call Size(root) to compute the number of nodes in the tree

Depth-First Search Both Pre-Order and Post-Order traversals are examples of depth-first search nodes are visited deeply on the left-most branches before any nodes are visited on the right-most branches visiting the right branches deeply before the left would still be depth-first! Crucial idea is “go deep first!” In DFS the nodes “being worked on” are kept on a stack (where?)

Level-Order/Breadth-first Traversal Consider task of traversing tree level by level from top to bottom (alphabetic order) What data structure to use to keep track of nodes?? a i d h j b f k l e c g

Level-Order (Breadth First) Traversal Put the root in a Queue. Repeat until the Queue is empty: Dequeue a node, process it, add its children to the queue

Example: Printing the Tree print(node * root){ node *n, *c; queue Q; Q.enqueue(root); while (! Q.empty()){ n = Q.dequeue(); print n->data; c = n->first_child; while (c != NULL){ Q.enqueue(c); c = c->next_sibling; } } }

QUEUE a i d h j b f k l e c g a b c d e c d e f g d e f g e f g h i j h i j k i j k j k l k l l a i d h j b f k l e c g

Applications of BFS Find the shortest path from the root to a given node N if N is at depth k, BFS will never visit a node at depth>k important for really deep trees Generalizes to finding shortest paths in graphs Spidering the world wide web From a root URL, fetch pages that are further and further away

Data Structures Binary Search Trees Alon Halevy

Binary Trees A Many algorithms are efficient and easy to program for the special case of binary trees Binary tree is a root left subtree (maybe empty) right subtree (maybe empty) B C D E F G H Alright, we’ll focus today on one type of trees called binary trees. Here’s one now. Is this binary tree complete? Why not? (C has just one child, right side is much deeper than left) What’s the maximum # of leaves a binary tree of depth d can have? What’s the max # of nodes a binary tree of depth d can have? Minimum? We won’t go into this, but if you take N nodes and assume all distinct trees of the nodes are equally likely, you get an average depth of SQRT(N). Is that bigger or smaller than log n? Bigger, so it’s not good enough! I J

Representation Each node stores its Data plus a left pointer and a right pointer. [Figure: the tree A, B, C, D, E, F shown both as a picture and as linked nodes.]
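A hedged C++ sketch of this node layout; the char payload is an assumption.

struct BinaryNode {
    char data;            // payload at this node (assumed type)
    BinaryNode* left;     // left subtree (maybe empty)
    BinaryNode* right;    // right subtree (maybe empty)
    explicit BinaryNode(char d) : data(d), left(nullptr), right(nullptr) {}
};

// e.g., the top of the slide's tree:
//   BinaryNode* A = new BinaryNode('A');
//   A->left  = new BinaryNode('B');
//   A->right = new BinaryNode('C');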

Properties of Binary Trees Max # of leaves in a tree of height h = Max # of nodes in a tree of height h = A B C D E F G

Dictionary & Search ADTs Operations: create, destroy, insert, find, delete. Dictionary: stores values associated with user-specified keys; keys may be any (homogeneous) comparable type; values may be any (homogeneous) type; implementation: data field is a struct with two parts. Search ADT: keys = values. Example: { kim chi → spicy cabbage, kreplach → tasty stuffed dough, kiwi → Australian fruit }; insert kohlrabi - upscale tuber; find(kreplach) → kreplach - tasty stuffed dough. Dictionaries associate some key with a value, just like a real dictionary (where the key is a word and the value is its definition). In this example, I've stored user-IDs associated with descriptions of their coolness level. This is probably the most valuable and widely used ADT we'll hit. I'll give you an example in a minute that should firmly entrench this concept.

Naïve Implementations unsorted array sorted linked list insert find + O(n) O(n) find + O(1) find O(log n) delete (if no shrink) Goal: fast find like sorted array, dynamic inserts/deletes like linked list

Binary Search Tree Dictionary Data Structure Search tree property all keys in left subtree smaller than root’s key all keys in right subtree larger than root’s key result: easy to find any given key inserts/deletes by changing links 8 5 11 2 6 10 12 A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing) 4 7 9 14 13

Example and Counter-Example 5 8 4 8 5 18 1 7 11 2 7 6 10 11 Why is the one on the left a BST? It’s not complete! (B/c BSTs don’t need to be complete) Why isn’t the one on the right a BST? Three children of 5 20 has a left child larger than it. What’s wrong with 11? Even though 15 isn’t a direct child, it _still_ needs to be less than 11! 3 4 BINARY SEARCH TREE NOT A BINARY SEARCH TREE

In Order Listing visit left subtree, visit node, visit right subtree. 10 5 15 2 9 20 7 17 30 Anyone notice anything interesting about that in-order listing? Everything in the left subtree is listed first. Then the root. Then everything in the right subtree. OK, let's work out the code to make the in-order listing. Is there an iterative version that doesn't use its own stack? Not really, no. So, recursion is probably OK here. Anyway, if the tree's too deep for recursion, you must have a huge amount of data. if (n != null) { inorder(n->left); cout << n; inorder(n->right); } In order listing: 2 5 7 9 10 15 17 20 30

Finding a Node Node *& find(Comparable x, Node *& root) { if (root == NULL) return root; else if (x < root->key) return find(x, root->left); else if (x > root->key) return find(x, root->right); else return root; } 10 5 15 2 9 20 7 17 30 runtime: Now, let's try finding a node. Find 9. This time I'll supply the code. This should look a _lot_ like binary search! How long does it take? Log n is an easy answer, but what if the tree is very lopsided? So really, this is worst case O(n)! A better answer is theta of the depth of the node sought. If we can bound the depth of that node, we can bound the length of time a search takes. What about the code? All those &s and *s should look pretty scary. Let's talk through them.

Insert Concept: proceed down tree as in Find; if new key not found, then insert a new node at last spot traversed void insert(Comparable x, Node * root) { assert ( root != NULL ); if (x < root->key){ if (root->left == NULL) root->left = new Node(x); else insert( x, root->left ); } else if (x > root->key){ if (root->right == NULL) root->right = new Node(x); else insert( x, root->right ); } } Let’s do some inserts: insert(8) insert (11) insert(31)

BuildTree for BSTs Suppose a1, a2, …, an are inserted into an initially empty BST: a1, a2, …, an are in increasing order a1, a2, …, an are in decreasing order a1 is the median of all, a2 is the median of elements less than a1, a3 is the median of elements greater than a1, etc. data is randomly ordered OK, we had a buildHeap, let’s buildTree. How long does this take? Well, IT DEPENDS! Let’s say we want to build a tree from 123456789 What happens if we insert in order? Reverse order? What about 5, then 3, then 7, then 2, then 1, then 6, then 8, then 9?

Examples of Building from Scratch 1, 2, 3, 4, 5, 6, 7, 8, 9 5, 3, 7, 2, 4, 6, 8, 1, 9

Data Structures Binary Search Trees Alon Halevy

Binary Trees A Many algorithms are efficient and easy to program for the special case of binary trees Binary tree is a root left subtree (maybe empty) right subtree (maybe empty) B C D E F G H Alright, we’ll focus today on one type of trees called binary trees. Here’s one now. Is this binary tree complete? Why not? (C has just one child, right side is much deeper than left) What’s the maximum # of leaves a binary tree of depth d can have? What’s the max # of nodes a binary tree of depth d can have? Minimum? We won’t go into this, but if you take N nodes and assume all distinct trees of the nodes are equally likely, you get an average depth of SQRT(N). Is that bigger or smaller than log n? Bigger, so it’s not good enough! I J

Binary Search Tree Dictionary Data Structure Search tree property all keys in left subtree smaller than root’s key all keys in right subtree larger than root’s key result: easy to find any given key inserts/deletes by changing links 8 5 11 2 6 10 12 A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing) 4 7 9 14 13
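For concreteness, here is a minimal sketch of the node type these slides seem to assume (the int key stands in for whatever comparable type is actually stored; this exact struct is not from the slides):

  #include <cstddef>
  struct Node {
    int key;          // any type with comparison and equality testing would do
    Node * left;      // every key in this subtree is smaller than key
    Node * right;     // every key in this subtree is larger than key
    Node(int k) : key(k), left(NULL), right(NULL) {}
  };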

Example and Counter-Example 5 8 4 8 5 18 1 7 11 2 7 6 10 11 Why is the one on the left a BST? It’s not complete! (B/c BSTs don’t need to be complete) Why isn’t the one on the right a BST? Three children of 5 20 has a left child larger than it. What’s wrong with 11? Even though 15 isn’t a direct child, it _still_ needs to be less than 11! 3 4 BINARY SEARCH TREE NOT A BINARY SEARCH TREE

In Order Listing visit left subtree visit node visit right subtree 10 5 15 2 9 20 Anyone notice anything interesting about that in-order listing? Everything in the left subtree is listed first. Then the root. Then everything in the right subtree. OK, let's work out the code to make the in-order listing. Is there an iterative version that doesn't use its own stack? Not really, no. So, recursion is probably OK here. Anyway, if the tree's too deep for recursion, you must have a huge amount of data. if (n != NULL) { inorder(n->left); cout << n; inorder(n->right); } 7 17 30 In order listing: 2 5 7 9 10 15 17 20 30
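A runnable version of that pseudocode (a sketch; it assumes the Node type sketched earlier, with an int key):

  #include <iostream>
  void printInOrder(Node * n) {
    if (n != NULL) {
      printInOrder(n->left);          // everything smaller is printed first
      std::cout << n->key << " ";     // then the node itself
      printInOrder(n->right);         // then everything larger
    }
  }

Calling printInOrder on the example tree prints 2 5 7 9 10 15 17 20 30.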

Finding a Node 10 5 15 2 9 20 7 17 30
Node *& find(Comparable x, Node * root) {
  if (root == NULL)
    return root;
  else if (x < root->key)
    return find(x, root->left);
  else if (x > root->key)
    return find(x, root->right);
  else
    return root;
}
Now, let's try finding a node. Find 9. This time I'll supply the code. This should look a _lot_ like binary search! How long does it take? Log n is an easy answer, but what if the tree is very lopsided? So really, this is worst case O(n)! A better answer is theta of the depth of the node sought. If we can bound the depth of that node, we can bound the length of time a search takes. What about the code? All those &s and *s should look pretty scary. Let's talk through them. runtime: theta of the depth of the node sought (worst case O(n))

Insert Concept: proceed down tree as in Find; if new key not found, then insert a new node at last spot traversed
void insert(Comparable x, Node * root) {
  assert( root != NULL );
  if (x < root->key) {
    if (root->left == NULL)
      root->left = new Node(x);
    else
      insert( x, root->left );
  }
  else if (x > root->key) {
    if (root->right == NULL)
      root->right = new Node(x);
    else
      insert( x, root->right );
  }
}
Let's do some inserts: insert(8) insert(11) insert(31)

BuildTree for BSTs Suppose a1, a2, …, an are inserted into an initially empty BST: a1, a2, …, an are in increasing order a1, a2, …, an are in decreasing order a1 is the median of all, a2 is the median of elements less than a1, a3 is the median of elements greater than a1, etc. data is randomly ordered OK, we had a buildHeap, let’s buildTree. How long does this take? Well, IT DEPENDS! Let’s say we want to build a tree from 123456789 What happens if we insert in order? Reverse order? What about 5, then 3, then 7, then 2, then 1, then 6, then 8, then 9?

Examples of Building from Scratch 1, 2, 3, 4, 5, 6, 7, 8, 9 5, 3, 7, 2, 4, 6, 8, 1, 9

Analysis of BuildTree Worst case is O(n²): 1 + 2 + 3 + … + n = O(n²) Average case assuming all orderings equally likely is O(n log n) not averaging over all binary trees, rather averaging over all input sequences (inserts) equivalently: average depth of a node is O(log n) proof: see Introduction to Algorithms, Cormen, Leiserson, & Rivest Average runtime is equal to the average depth of a node in the tree. We'll calculate the average depth by finding the sum of all depths in the tree and dividing by the number of nodes. What's the sum of all depths? D(N) = D(I) + D(N - I - 1) + N - 1 (the left subtree has I nodes, the root is 1 node, so the right subtree has N - I - 1; D(I) counts the left subtree's depths, but each of those nodes is 1 level deeper in the overall tree, and the same goes for the right subtree, for a total of I + (N - I - 1) = N - 1 extra depth). For BSTs built from a random insertion order, all left-subtree sizes I are equally likely (the root is whichever key happens to arrive first, effectively a random pick, and the rest fall to its left or right deterministically). Each subtree term then averages (1/N) * sum over j = 0 to N-1 of D(j).
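Written out (a sketch of the standard argument, using the same D as above; not a transcription of the cited proof):

$$D(N) = D(I) + D(N - I - 1) + N - 1, \qquad I \text{ uniform on } \{0, \dots, N-1\}$$
$$\bar{D}(N) = \frac{2}{N}\sum_{j=0}^{N-1} \bar{D}(j) + N - 1 = O(N \log N)$$

so the average node depth $\bar{D}(N)/N$ is $O(\log N)$, which is what makes the average-case Find and Insert $O(\log n)$.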

Bonus: FindMin/FindMax Find minimum Find maximum 10 5 15 2 9 20 Every now and then everyone succumbs to the temptation to really overuse color. 7 17 30
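A minimal sketch of these two operations (the names findMin/findMax are mine, not the slides'): the minimum is the leftmost node and the maximum is the rightmost, so both run in O(depth).

  Node * findMin(Node * root) {
    if (root == NULL) return NULL;
    while (root->left != NULL) root = root->left;     // keep walking left
    return root;
  }
  Node * findMax(Node * root) {
    if (root == NULL) return NULL;
    while (root->right != NULL) root = root->right;   // keep walking right
    return root;
  }

The succ code on a later slide uses exactly this kind of min, applied to the right subtree.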

Deletion 10 5 15 2 9 20 And now for something completely different. Let’s say I want to delete a node. Why might it be harder than insertion? Might happen in the middle of the tree instead of at leaf. Then, I have to fix the BST. 7 17 30 Why might deletion be harder than insertion?

Deletion - Leaf Case Delete(17) 10 5 15 2 9 20 7 17 30 Alright, we did it the easy way, but what about real deletions? Leaves are easy; we just prune them. 7 17 30

Deletion - One Child Case Delete(15) 10 5 15 2 9 20 Single child nodes we remove and… Do what? We can just pull up their children. Is the search tree property intact? Yes. 7 30

Deletion - Two Child Case Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two-child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let's just replace the node's value with its successor and then delete the successor. 7 replace node with value guaranteed to be between the left and right subtrees: the successor Could we have used the predecessor instead?

Finding the Successor Find the next larger node in this node’s subtree. not next larger in entire tree Node * succ(Node * root) { if (root->right == NULL) return NULL; else return min(root->right); } 10 5 15 2 9 20 Here’s a little digression. Maybe it’ll even have an application at some point. Find the next larger node in 10’s subtree. Can we define it in terms of min and max? It’s the min of the right subtree! 7 17 30 How many children can the successor of a node have?

Predecessor Find the next smaller node in this node’s subtree. 10 5 15 Node * pred(Node * root) { if (root->left == NULL) return NULL; else return max(root->left); } 10 5 15 2 9 20 Predecessor is just the mirror problem. 7 17 30

Deletion - Two Child Case Delete(5) 10 5 20 2 9 30 Ah, now the hard case. How do we delete a two-child node? We remove it and replace it with what? It has all these left and right children that need to be greater and less than the new value (respectively). Is there any value that is guaranteed to be between the two subtrees? Two of them: the successor and predecessor! So, let's just replace the node's value with its successor and then delete the successor. 7 always easy to delete the successor – it always has either 0 or 1 children!

Delete Code
void delete(Comparable x, Node *& p) {
  Node * q;
  if (p != NULL) {
    if (p->key < x)
      delete(x, p->right);
    else if (p->key > x)
      delete(x, p->left);
    else {  /* p->key == x */
      if (p->left == NULL)
        p = p->right;
      else if (p->right == NULL)
        p = p->left;
      else {
        q = successor(p);
        p->key = q->key;
        delete(q->key, p->right);
      }
    }
  }
}
Here's the code for deletion using lots of confusing reference pointers BUT no leaders, fake nodes. The iterative version of this can get somewhat messy, but it's not really any big deal.

Lazy Deletion Instead of physically deleting nodes, just mark them as deleted simpler physical deletions done in batches some adds just flip deleted flag extra memory for deleted flag many lazy deletions slow finds some operations may have to be modified (e.g., min and max) 10 5 15 Now, before we move on to all the pains of true deletion, let’s do it the easy way. We’ll just pretend we delete deleted nodes. This has some real advantages: … 2 9 20 7 17 30
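One way find might change under lazy deletion (a sketch under the assumption that Node gains a bool deleted flag; neither the flag name nor lazyFind appears in the slides):

  Node * lazyFind(Comparable x, Node * root) {
    if (root == NULL) return NULL;
    if (x < root->key) return lazyFind(x, root->left);
    if (x > root->key) return lazyFind(x, root->right);
    return root->deleted ? NULL : root;   // the key is in the tree but may be marked deleted
  }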

Lazy Deletion Delete(17) Delete(15) Delete(5) Find(9) Find(16) Insert(5) Find(17) 10 5 15 2 9 20 OK, let’s do some lazy deletions. Everybody yawn, stretch, and say “Mmmm… doughnut” to get in the mood. Those of you who are already asleep have the advantage. 7 17 30

Dictionary Implementations unsorted array sorted linked list BST insert find + O(n) O(n) find + O(1) O(Depth) find O(log n) delete BSTs look good for shallow trees, i.e. when the depth D is small (log n); otherwise they are as bad as a linked list!

Beauty is Only Θ(log n) Deep Binary Search Trees are fast if they're shallow: e.g.: perfectly complete e.g.: perfectly complete except the “fringe” (leaves) any other good cases? What makes a good BST good? Here are two examples. Are these the only good BSTs? No! Anything without too many long branches is good, right? Problems occur when one branch is much longer than the other! What matters here?

Balance balance = height(left subtree) - height(right subtree) zero everywhere → perfectly balanced small everywhere → balanced enough balance between -1 and 1 everywhere → maximum height of about 1.44 log n 5 7 We'll use the concept of balance to keep things shallow.
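In code, that definition might look like this (a sketch; it assumes each node caches its height as on the following AVL slides, and that an empty subtree counts as height -1):

  int height(Node * n) {
    return (n == NULL) ? -1 : n->height;          // empty subtree has height -1
  }
  int balance(Node * n) {
    return height(n->left) - height(n->right);    // AVL wants this in {-1, 0, 1} everywhere
  }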

AVL Tree Dictionary Data Structure Binary search tree properties binary tree property search tree property Balance property balance of every node is: -1 ≤ b ≤ 1 result: depth is Θ(log n) 8 5 11 2 6 10 12 So, AVL trees will be Binary Search Trees with one extra feature: They balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths 4 7 9 13 14 15

An AVL Tree 10 10 3 5 15 2 9 12 20 17 30 data 3 height children 1 2 1 1 2 9 12 20 Here’s a revision of that tree that’s balanced. (Same values, similar tree) This one _is_ an AVL tree (and isn’t leftist). I also have here how we might store the nodes in the AVL tree. Notice that I’m going to keep track of height all the time. WHY? 17 30

Not AVL Trees 10 10 0-2 = -2 (-1)-1 = -2 5 15 15 12 20 20 17 30 3 2 2 2 0-2 = -2 (-1)-1 = -2 1 5 15 15 1 12 20 20 Here’s a revision of that tree that’s balanced. (Same values, similar tree) This one _is_ an AVL tree (and isn’t leftist). I also have here how we might store the nodes in the AVL tree. Notice that I’m going to keep track of height all the time. WHY? 17 30

Staying Balanced M S T Good case: inserting small, tall and middle. Insert(middle) Insert(small) Insert(tall) 1 M Let’s make a tree from these people with their height as the keys. We’ll start by inserting [MIDDLE] first. Then, [SMALL] and finally [TALL]. Is this tree balanced? Yes! S T

Bad Case #1 S M T Insert(small) Insert(middle) Insert(tall) 2 1 But, let’s start over… Insert [SMALL] Now, [MIDDLE]. Now, [TALL]. Is this tree balanced? NO! Who do we need at the root? [MIDDLE!] Alright, let’s pull er up. T

Single Rotation S M M S T T 2 1 1 Basic operation used in AVL trees: This is the basic operation we’ll use in AVL trees. Since this is a right child, it could legally have the parent as its left child. When we finish the rotation, we have a balanced tree! S T T Basic operation used in AVL trees: A right child could legally have its parent as its left child.

General Case: Insert Unbalances h + 1 h + 2 a a h h - 1 h + 1 h - 1 b X b X h-1 h h - 1 h - 1 Z Y Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here!

General Single Rotation h + 2 h + 1 a a X Y b Z h h + 1 h - 1 b X h h - 1 h h - 1 h - 1 Z Y Here’s the general form of this. We insert into the red tree. That ups the three heights on the left. Basically, you just need to pull up on the child. Then, ensure that everything falls in place as legal subtrees of the nodes. Notice, though, the height of this subtree is the same as it was before the insert into the red tree. So? So, we don’t have to worry about ancestors of the subtree becoming imbalanced; we can just stop here! Height of left subtree same as it was before insert! Height of all ancestors unchanged We can stop here!

Will a single rotation fix this? Bad Case #2 Insert(small) Insert(tall) Insert(middle) 2 S 1 T There’s another bad case, though. What if we insert: [SMALL] [TALL] [MIDDLE] Now, is the tree imbalanced? Will a single rotation fix it? (Try it by bringing up tall; doesn’t work!) Will a single rotation fix this? M

Double Rotation S S M T M S T M T 2 2 1 1 1 Let’s try two single rotations, starting a bit lower down. First, we rotate up middle. Then, we rotate up middle again! Is the new tree balanced? S T M T

General Double Rotation h + 2 a h + 1 h + 1 c h - 1 b Z h h b a h - 1 W h c h - 1 h - 1 X Y W Z X Y Here’s the general form of this. Notice that the difference here is that we zigged one way than zagged the other to find the problem. We don’t really know or care which of X or Y was inserted into, but one of them was. To fix it, we pull c all the way up. Then, put a, b, and the subtrees beneath it in the reasonable manner. The height is still the same at the end! h - 1? h - 1? Initially: insert into either X or Y unbalances tree (root height goes to h+2) “Zig zag” to pull up c – restores root height to h+1, left subtree height to h

Insert Algorithm Find spot for value Hang new node Search back up looking for imbalance If there is an imbalance: case #1: Perform single rotation and exit case #2: Perform double rotation and exit OK, thank you BST Three! And those two cases (along with their mirror images) are the only four that can happen! So, here’s our insert algorithm. We just hang the node. Search for a spot where there’s imbalance. If there is, fix it (according to the shape of the imbalance). And then we’re done; there can only be one problem!

Easy Insert Insert(3) 10 5 15 2 9 12 20 17 30 3 1 2 1 Let’s insert 3. 1 2 9 12 20 Let’s insert 3. This is easy! It just goes under 2 (to the left). Update the balances: any imbalance? NO! 17 30

Hard Insert (Bad Case #1) 2 3 Insert(33) 10 5 15 2 9 12 20 Now, let’s insert 33. Where does it go? Left of 30. 3 17 30

Single Rotation 1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 Here’s the tree with the balances updated. Now, node 15 is bad! Since the problem is in the left subtree of the left child, we can fix it with a single rotation. We pull 20 up. Hang 15 to the left. Pass 17 to 15. And, we’re done! Notice that I didn’t update 10’s height until we checked 15. Did it change after all? 3 17 30 3 12 17 33 33

Hard Insert (Bad Case #2) 1 2 3 Insert(18) 10 5 15 2 9 12 20 Now, let’s back up to before 33 and insert 18 instead. Goes right of 17. Again, there’s imbalance. But, this time, it’s a zig-zag! 3 17 30

Single Rotation (oops!) 1 2 3 1 2 3 10 10 5 15 5 20 2 9 12 20 2 9 15 30 We can try a single rotation, but we end up with another zig-zag! 3 17 30 3 12 17 18 18

Double Rotation (Step #1) 2 3 1 2 3 10 10 5 15 5 15 2 9 12 20 2 9 12 17 So, we’ll double rotate. Start by moving the offending grand-child up. We get an even more imbalanced tree. BUT, it’s imbalanced like a zig-zig tree now! 3 17 30 3 20 18 18 30 Look familiar?

Double Rotation (Step #2) 1 2 3 1 2 3 10 10 5 15 5 17 2 9 12 17 2 9 15 20 So, let’s pull 17 up again. Now, we get a balanced tree. And, again, 10’s height didn’t need to change. 3 20 3 12 18 30 18 30

AVL Algorithm Revisited Recursive 1. Search downward for spot 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2, double rotate Iterative 1. Search downward for spot, stacking parent nodes 2. Insert node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate and exit b. If imbalance #2, double rotate and exit OK, here's the algorithm again. Notice that there's very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can't I go top down? Now, what's left? Single and double rotate!
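A recursive sketch of that insert (hedged: it uses the RotateRight / DoubleRotateRight code from the next two slides, assumes RotateLeft / DoubleRotateLeft as their mirror images, and reuses the height() helper sketched earlier; this exact function is not in the slides):

  void insertAVL(Comparable x, Node *& root) {
    if (root == NULL) { root = new Node(x); return; }
    if (x < root->key) {
      insertAVL(x, root->left);
      if (height(root->left) - height(root->right) == 2) {    // left side became too tall
        if (x < root->left->key) RotateLeft(root);             // case #1 (outside): single rotation
        else                     DoubleRotateLeft(root);       // case #2 (zig-zag): double rotation
      }
    } else if (x > root->key) {
      insertAVL(x, root->right);
      if (height(root->right) - height(root->left) == 2) {    // right side became too tall
        if (x > root->right->key) RotateRight(root);           // case #1 (outside): single rotation
        else                      DoubleRotateRight(root);     // case #2 (zig-zag): double rotation
      }
    }
    root->height = max(height(root->left), height(root->right)) + 1;
  }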

Single Rotation Code X Y Z root temp
void RotateRight(Node *& root) {
  Node * temp = root->right;        // the right child comes up
  root->right = temp->left;         // its old left subtree becomes our right subtree
  temp->left = root;
  // the next two lines assume both children exist here; a full version would use a
  // height() helper that returns -1 for NULL
  root->height = max(root->right->height, root->left->height) + 1;
  temp->height = max(temp->right->height, temp->left->height) + 1;
  root = temp;
}
Here's code for one of the two single rotate cases. RotateRight brings up the right child. We've inserted into Z, and now we want to fix it.

Double Rotation Code First Rotation a Z b W c X Y a Z c b X Y W
void DoubleRotateRight(Node *& root) {
  RotateLeft(root->right);
  RotateRight(root);
}
Here's the double rotation code. Pretty tough, eh?

Double Rotation Completed First Rotation Second Rotation a Z c b X Y W c a b X W Z Y

Data Structures AVL II Alon Halevy Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

Deletion (Really Easy Case) 1 2 3 Delete(17) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 17 30

Deletion (Pretty Easy Case) 1 2 3 Delete(15) 10 5 15 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 17 30

Deletion (Pretty Easy Case cont.) 3 Delete(15) 10 2 2 5 17 1 1 2 9 12 20 OK, if we have a bit of extra time, do this. Let’s try deleting. 15 is easy! It has two children, so we do BST deletion. 17 replaces 15. 15 goes away. Did we disturb the tree? NO! 3 30

Deletion (Hard Case #1) Delete(12) 10 5 17 2 9 12 20 3 30 3 2 1 2 3 Delete(12) 10 5 17 2 9 12 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 3 30

Single Rotation on Deletion 1 2 3 3 10 10 2 1 5 17 5 20 1 2 9 20 2 9 17 30 But what happened on the fix? Something very disturbing. What? The subtree’s height changed!! So, the deletion can propagate. 3 30 3 What is different about deletion than insertion?

Deletion (Hard Case) Delete(9) 10 5 17 2 9 12 12 20 20 3 11 15 15 18 3 4 Delete(9) 10 5 17 2 9 12 12 20 20 Now, let’s delete 12. 12 goes away. Now, there’s trouble. We’ve put an imbalance in. So, we check up from the point of deletion and fix the imbalance at 17. 1 1 3 11 15 15 18 30 30 13 13 33 33

Double Rotation on Deletion Not finished! 1 2 3 4 2 1 3 4 10 10 5 17 3 17 2 2 12 20 2 5 12 20 1 1 1 1 3 11 15 18 30 11 15 18 30 13 33 13 33

Deletion with Propagation 2 1 3 4 10 What's different about this case? 3 17 2 5 12 20 1 1 We get to choose whether to single or double rotate! 11 15 18 30 13 33

Propagated Single Rotation 2 1 3 4 4 10 17 3 2 3 17 10 20 1 2 1 2 5 12 20 3 12 18 30 1 1 1 11 15 18 30 2 5 11 15 33 13 33 13

Propagated Double Rotation 2 1 3 4 4 10 12 2 3 3 17 10 17 1 1 2 2 5 12 20 3 11 15 20 1 1 1 11 15 18 30 2 5 13 18 30 13 33 33

AVL Deletion Algorithm Recursive 1. If at the node, delete it 2. Otherwise recurse to find it in the correct subtree 3. Correct heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care), double rotate Iterative 1. Search downward for node, stacking parent nodes 2. Delete node 3. Unwind stack, correcting heights a. If imbalance #1, single rotate b. If imbalance #2 (or don't care) double rotate OK, here's the algorithm again. Notice that there's very little difference between the recursive and iterative. Why do I keep a stack for the iterative version? To go bottom to top. Can't I go top down? Now, what's left? Single and double rotate!

Fun with AVL Trees Input: sequence of n keys (unordered) 19 3 4 18 7 19 3 4 18 7 Insert each into initially empty AVL tree Print using inorder traversal O(n) Result? Are we having fun yet?

Is There a Faster Way? But suppose input is already sorted 3 4 7 18 19 3 4 7 18 19 Can we do better than O(n log n)?

AVL buildTree 5 8 10 15 17 20 30 35 40 Divide & Conquer 17 Divide the problem into parts Solve each part recursively Merge the parts into a general solution 17 IT DEPENDS! How long does divide & conquer take? 8 10 15 5 20 30 35 40
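A sketch of that divide & conquer build (the function name and array interface are mine; it assumes the keys arrive sorted in an array and reuses the Node and height helpers from earlier):

  Node * buildTree(int a[], int lo, int hi) {     // build from the sorted keys a[lo..hi]
    if (lo > hi) return NULL;
    int mid = (lo + hi) / 2;                      // middle key becomes the root
    Node * root = new Node(a[mid]);
    root->left  = buildTree(a, lo, mid - 1);      // recursively build the smaller half
    root->right = buildTree(a, mid + 1, hi);      // recursively build the larger half
    root->height = max(height(root->left), height(root->right)) + 1;
    return root;
  }

For the example keys above, buildTree(a, 0, 8) on a = {5, 8, 10, 15, 17, 20, 30, 35, 40} puts 17 at the root.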

BuildTree Example 5 8 10 15 17 20 30 35 40 3 17 5 8 10 15 2 2 20 30 35 40 10 35 20 30 5 8 1 1 8 15 30 40 5 20

BuildTree Analysis (Approximate) T(n) = 2T(n/2) + 1 T(n) = 2(2T(n/4) + 1) + 1 T(n) = 4T(n/4) + 2 + 1 T(n) = 4(2T(n/8) + 1) + 2 + 1 T(n) = 8T(n/8) + 4 + 2 + 1 … T(n) = 2^k T(n/2^k) + (2^(k-1) + … + 4 + 2 + 1) = 2^k T(n/2^k) + 2^k - 1 let 2^k = n, so k = log n T(n) = nT(1) + n - 1 T(n) = Θ(n) The summation of calls is 2^(log n) + 2^(log n - 1) + 2^(log n - 2) + … = n + n/2 + n/4 + n/8 + … ≈ 2n

BuildTree Analysis (Exact) Precise Analysis: T(0) = b T(n) = T(⌊(n-1)/2⌋) + T(⌈(n-1)/2⌉) + c By induction on n: T(n) = (b+c)n + b Base case: T(0) = b = (b+c)·0 + b Induction step: T(n) = (b+c)⌊(n-1)/2⌋ + b + (b+c)⌈(n-1)/2⌉ + b + c = (b+c)(n-1) + 2b + c = (b+c)n + b QED: T(n) = (b+c)n + b = Θ(n)

Application: Batch Deletion Suppose we are using lazy deletion When there are lots of deleted nodes (n/2), need to flush them all out Batch deletion: Print non-deleted nodes into an array How? Divide & conquer AVL buildTree Total time:

Thinking About AVL Observations + Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion + Compatible with lazy deletion - Height fields must be maintained (or 2-bit balance)

Alternatives to AVL Trees Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)

Data Structures AVL II Alon Halevy Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

Imbalance in AVL Trees Last week's conjecture: in AVL trees, if you remove the bottom level, then you get a complete tree. This week's theorems: All nodes, except the leaves and the parents of leaves, have two children. Single-child nodes can be arbitrarily far from the leaves.

AVL Tree with Slight Imbalance 8 5 11 2 6 10 12 So, AVL trees will be Binary Search Trees with one extra feature: They balance themselves! The result is that all AVL trees at any point will have a logarithmic asymptotic bound on their depths 4 7 9 13 14 15

Where can we Find Leaves? Suppose the node N has no children. What is the maximal height of N’s parent? What is the maximal height of N’s grandparent? What is the maximal height of N’s great-grandparent? Conclusion: at what depth can we find a leaf?

Thinking About AVL Observations + Worst case height of an AVL tree is about 1.44 log n + Insert, Find, Delete in worst case O(log n) + Only one (single or double) rotation needed on insertion - O(log n) rotations needed on deletion - Height fields must be maintained (or 2-bit balance)

Alternatives to AVL Trees Weight balanced trees keep about the same number of nodes in each subtree not nearly as nice Splay trees (after mid-term) “blind” adjusting version of AVL trees no height information maintained! insert/find always rotates node to the root! worst case time is O(n) amortized time for all operations is O(log n) mysterious, but often faster than AVL trees in practice (better low-order terms)

B-Trees

Beyond Binary Trees One of the most important applications for search trees is databases If the DB is small enough to fit into RAM, almost any scheme for balanced trees (e.g. AVL) is okay 2000 (WalMart) RAM – 1,000,000 MB DB – 1,000,000 MB (terabyte) 1980 RAM – 1MB DB – 100 MB gap between disk and main memory growing!

Time Gap For many corporate and scientific databases, the search tree must mostly be on disk Accessing disk is about 200,000 times slower than accessing RAM Visiting a node = accessing disk Even perfectly balanced binary trees are a disaster! log2( 10,000,000 ) = 24 disk accesses Goal: Decrease Height of Tree

M-ary Search Tree Maximum branching factor of M Complete tree has depth = logMN Each internal node in a complete tree has M - 1 keys runtime: Here’s the general idea. We create a search tree with a branching factor of M. Each node has M-1 keys and we search between them. What’s the runtime? O(logMn)? That’s a nice thought, and it’s the best case. What about the worst case? Is the tree guaranteed to be balanced? Is it guaranteed to be complete? Might it just end up being a binary tree?

B-Trees B-Trees are specialized M-ary search trees Each node has many keys subtree between two keys x and y contains values v such that x ≤ v < y binary search within a node to find correct subtree Each node takes one full page of memory. 3 7 12 21 To address these problems, we'll use a slightly more structured M-ary tree: B-Trees. As before, each internal node has M-1 keys. To manage memory problems, we'll tune the size of a node (or leaf) to the size of a memory unit. Usually, a page or disk block. x < 3 3 ≤ x < 7 7 ≤ x < 12 12 ≤ x < 21 21 ≤ x

B-Tree Properties‡ Properties Result maximum branching factor of M the root has between 2 and M children other internal nodes have between ⌈M/2⌉ and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between ⌈L/2⌉ and L keys all leaves are at the same depth Result tree is logM/2( n/(L/2) ) +/- 1 deep, i.e. Θ(log n) all operations run in time proportional to depth operations pull in at least ⌈M/2⌉ or ⌈L/2⌉ items at a time The properties of B-Trees (and the trees themselves) are a bit more complex than previous structures we've looked at. Here's a big, gnarly list; we'll go one step at a time. The maximum branching factor, as we said, is M (tunable for a given tree). The root has between 2 and M children or at most L keys. (L is another parameter) These restrictions will be different for the root than for other nodes. ‡These are technically B+-Trees

B-Tree Properties Properties Result maximum branching factor of M the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the other internal nodes (non-leaves) will have between M/2 and M children. The funky symbol is ceiling, the next higher integer above the value. The result of this is that the tree is “pretty” full. Not every node has M children but they’ve all at least got M/2 (a good number). Internal nodes contain only search keys. A search key is a value which is solely for comparison; there’s no data attached to it. The node will have one fewer search key than it has children (subtrees) so that we can search down to each child. The smallest datam between two search keys is equal to the lesser search key. This is how we find the search keys to use.

B-Tree Properties Properties Result maximum branching factor of M the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time All the leaves (again, except the root) have a similar restriction. They contain between L/2 and L keys. Notice that means you have to do a search when you get to a leaf to find the item you’re looking for. All the leaves are also at the same depth. So, the tree looks kind of complete. It has the triangle shape, and the nodes branch at least as much as M/2.

B-Tree Properties Properties Result maximum branching factor of M the root has between 2 and M children other internal nodes have between M/2 and M children internal nodes contain only search keys (no data) smallest datum between search keys x and y equals x each (non-root) leaf contains between L/2 and L keys all leaves are at the same depth Result tree is (logM/2 n/(L/2)) +/- 1 deep (log n) all operations run in time proportional to depth operations pull in at least M/2 or L/2 items at a time The result of all this is that the tree in the worst case is log n deep. In particular, it’s about logM/2n deep. Does this matter asymptotically? No. What about practically? YES! Since M and L are considered constants, all operations run in log n time. Each operation pulls in at most M search keys or L items at a time. So, we can tune L and M to the size of a disk block!

When Big-O is Not Enough B-Tree is about logM/2 n/(L/2) deep = logM/2 n - logM/2 L/2 = O(logM/2 n) = O(log n) steps per operation (same as BST!) Where’s the beef?! log2( 10,000,000 ) = 24 disk accesses log200/2( 10,000,000 ) < 4 disk accesses

B-Tree Nodes Internal node Leaf i search keys; i+1 subtrees; M - i - 1 inactive entries k1 k2 … ki __ … __ 1 2 i M - 1 Leaf j data keys; L - j inactive entries Alright, before we look at any examples, let's look at what the node structure looks like. Internal nodes are arrays of pointers to children interspersed with search keys. Why must they be arrays rather than linked lists? Because we want contiguous memory! If the node has just i+1 children, it has i search keys, and M - i - 1 empty entries. A leaf looks similar (I'll use green for leaves), and has similar properties. Why are these different? Because internal nodes need subtrees-1 keys. k1 k2 … kj __ … __ 1 2 j L

Example B-Tree with M = 4 and L = 4 10 40 3 15 20 30 50 1 2 10 11 12 This is just an example B-tree. Notice that it has 24 entries with a depth of only 2. A BST would be 4 deep. Notice also that the leaves are at the same level in the tree. I’ll use integers as both key and data, but we all know that that could as well be different data at the bottom, right? 1 2 10 11 12 20 25 26 40 42 3 5 6 9 15 17 30 32 33 36 50 60 70

Making a B-Tree Insert(3) Insert(14) Now, Insert(1)? The empty B-Tree M = 3 L = 2 3 3 14 Insert(3) Insert(14) Alright, how do we insert and delete? Let’s start with the empty B-Tree. That’s one leaf as the root. Now, we’ll insert 3 and 14. Fine… What about inserting 1. Is there a problem? Now, Insert(1)?

Splitting the Root Insert(1) And create a new root Too many keys in a leaf! 3 14 14 1 3 1 3 14 Insert(1) And create a new root 1 3 14 Too many keys in a leaf! Run away! How do we solve this? Well, we definitely need to split this leaf in two. But, now we don’t have a tree anymore. So, let’s make a new root and give it as children the two leaves. This is how B-Trees grow deeper. So, split the leaf.

Insertions and Split Ends Too many keys in a leaf! 14 14 14 Insert(59) Insert(26) 1 3 14 26 59 1 3 14 1 3 14 59 14 26 59 So, split the leaf. Now, let’s do some more inserts. 59 is no problem. What about 26? Same problem as before. But, this time the split leaf just goes under the existing node because there’s still room. What if there weren’t room? 14 59 And add a new child 1 3 14 26 59

Too many keys in an internal node! Propagating Splits 14 59 14 59 Insert(5) Add new child 1 3 5 14 26 59 1 3 14 26 59 1 3 5 Too many keys in an internal node! 5 1 3 14 26 59 5 14 26 59 1 3 When we insert 5, the leaf overflows, but its parent already has too many subtrees! What do we do? The same thing as before but this time with an internal node. We split the node. Normally, we'd hang the new subtrees under their parent, but in this case they don't have one. Now we have two trees! Solution: same as before, make a new root and hang these under it. Create a new root So, split the node.

Insertion in Boring Text Insert the key in its leaf If the leaf ends up with L+1 items, overflow! Split the leaf into two nodes: original with ⌈(L+1)/2⌉ items new one with ⌊(L+1)/2⌋ items Add the new child to the parent If the parent ends up with M+1 items, overflow! If an internal node ends up with M+1 items, overflow! Split the node into two nodes: original with ⌈(M+1)/2⌉ items new one with ⌊(M+1)/2⌋ items Add the new child to the parent If the parent ends up with M+1 items, overflow! Split an overflowed root in two and hang the new nodes under a new root OK, here's that process as an algorithm. The funky symbols are ceiling and floor; floor is just like regular C++ integer division. Notice that this can propagate all the way up the tree. How often will it do that? Notice that the two new leaves or internal nodes are guaranteed to have enough items (or subtrees). Because even the floor of (L+1)/2 is as big as the ceiling of L/2. This makes the tree deeper!
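To make the leaf-overflow step concrete, here is a small sketch of splitting one overflowed leaf (the BLeaf type, the L value, and the splitLeaf name are all hypothetical; the parent bookkeeping described above is omitted):

  const int L = 4;                                  // max keys per leaf in this sketch
  struct BLeaf { int numKeys; int keys[L + 1]; };   // one spare slot so a leaf can briefly hold L+1 keys

  BLeaf * splitLeaf(BLeaf * leaf) {                 // precondition: leaf->numKeys == L + 1
    BLeaf * newLeaf = new BLeaf();
    int keep = (L + 2) / 2;                         // ceil((L+1)/2) keys stay in the original leaf
    newLeaf->numKeys = (L + 1) - keep;              // floor((L+1)/2) keys move to the new leaf
    for (int i = 0; i < newLeaf->numKeys; i++)
      newLeaf->keys[i] = leaf->keys[keep + i];      // keys stay in sorted order
    leaf->numKeys = keep;
    return newLeaf;                                 // caller hangs newLeaf off the parent, which may itself overflow
  }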

After More Routine Inserts 14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).

Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete! Just find the key to delete and snip it out! Easy! Done, right?

Deletion and Adoption A leaf has too few keys! Delete(5) 14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89

Deletion with Propagation A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89

Finishing the Propagation (More Adoption) Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!

A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1 OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89

Pulling out the Root A leaf has too few keys! And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89

Pulling out the Root (continued) has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89

Deletion in Two Boring Slides of Text Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.

Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!

Thinking about B-Trees B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion

Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?

Data Structures B-Trees Alon Halevy Alright, today we’ll get a little Yin and Yang. We saw B-Trees, but they were just too hard to use! Let’s see something easier! (a bit)

Deletion in B-trees Come to section tomorrow. Slides follow.

After More Routine Inserts 14 Insert(89) Insert(79) 5 59 1 3 5 14 26 59 5 1 3 14 26 59 79 89 OK, we’ve done insertion. What about deletion? For didactic purposes, I will now do two more regular old insertions (notice these cause a split).

Deletion Delete(59) 5 1 3 14 26 59 79 89 Now, let’s delete! Just find the key to delete and snip it out! Easy! Done, right?

Deletion and Adoption A leaf has too few keys! Delete(5) 14 14 Delete(5) 5 79 89 ? 79 89 1 3 5 14 26 79 89 1 3 14 26 79 89 So, borrow from a neighbor Of course not! What if we delete an item in a leaf and drive it below L/2 items (in this case to zero)? In that case, we have two options. The easy option is to borrow a neighbor’s item. We just move it over from the neighbor and fix the parent’s key. DIGRESSION: would it be expensive to maintain neighbor pointers in B-Trees? No. Because those leaves are normally going to be huge, and two pointers per leaf is no big deal (might cut down L by 1). How about parent pointers? No problem. In fact, I’ve been assuming we have them! 3 1 14 26 79 89

Deletion with Propagation A leaf has too few keys! 14 14 Delete(3) 3 79 89 ? 79 89 1 3 14 26 79 89 1 14 26 79 89 And no neighbor with surplus! But, what about if the neighbors are too low on items as well? Then, we need to propagate the delete… like an _unsplit_. We delete the node and fix up the parent. Note that if I had a larger M/L, we might have keys left in the deleted node. Why? Because the leaf just needs to drop below ceil(L/2) to be deleted. If L=100, L/2 = 50 and there are 49 keys to distribute! Solution: Give them to the neighbors. Now, what happens to the parent here? It’s down to one subtree! STRESS AGAIN THAT LARGER M and L WOULD MEAN NO NEED TO “RUN OUT”. 14 But now a node has too few subtrees! So, delete the leaf 79 89 1 14 26 79 89

Finishing the Propagation (More Adoption) Adopt a neighbor 1 14 26 79 89 We just do the same thing here that we did earlier: Borrow from a rich neighbor!

A Bit More Adoption Delete(1) (adopt a neighbor) 79 79 14 89 26 89 1 OK, let’s do a bit of setup. This is easy, right? 1 14 26 79 89 14 26 79 89

Pulling out the Root A leaf has too few keys! And no neighbor with surplus! 79 79 Delete(26) So, delete the leaf 26 89 89 14 26 79 89 14 79 89 But now the root has just one subtree! A node has too few subtrees and no neighbor with surplus! Now, let’s delete 26. It can’t borrow from its neighbor, so we delete it. Its parent is too low on children now and it can’t borrow either: Delete it. Here, we give its leftovers to its neighbors as I mentioned earlier. But now the root has just one subtree!! 79 Delete the leaf 79 89 89 14 79 89 14 79 89

Pulling out the Root (continued) has just one subtree! Just make the one child the new root! 79 89 14 79 89 But that’s silly! The root having just one subtree is both illegal and silly. Why have the root if it just branches straight down? So, we’ll just delete the root and replace it with its child! 79 89 14 79 89

Deletion in Two Boring Slides of Text Remove the key from its leaf If the leaf ends up with fewer than L/2 items, underflow! Adopt data from a neighbor; update the parent If borrowing won’t work, delete node and divide keys between neighbors If the parent ends up with fewer than M/2 items, underflow! Why will dumping keys always work if borrowing doesn’t? Alright, that’s deletion. Let’s talk about a few of the details. Why will dumping keys always work? If the neighbors were too low on keys to loan any, they must have L/2 keys, but we have one fewer. Therefore, putting them together, we get at most L, and that’s legal.

Deletion Slide Two If a node ends up with fewer than M/2 items, underflow! Adopt subtrees from a neighbor; update the parent If borrowing won’t work, delete node and divide subtrees between neighbors If the parent ends up with fewer than M/2 items, underflow! If the root ends up with only one child, make the child the new root of the tree The same applies here for dumping subtrees as on the previous slide for dumping keys. This reduces the height of the tree!

Thinking about B-Trees B-Tree insertion can cause (expensive) splitting and propagation B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation Propagation is rare if M and L are large (Why?) Repeated insertions and deletion can cause thrashing If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items height 5: 2,000,000,000! B*-Trees fix thrashing. Propagation is rare because (in a good case) only about 1/L inserts cause a split and only about 1/M of those go up even one level! 30 million’s not so big, right? How about height 5? 2 billion

Tree Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? How about: O(1) finds and inserts?

Hash Table Approach Zasha Steve f(x) Nic Brad Ed But… is there a problem in this pipe-dream?

Hash Table Dictionary Data Structure Hash function: maps keys to integers result: can quickly find the right spot for a given entry Unordered and sparse table result: cannot efficiently list all entries, cannot find min and max efficiently, cannot find all items within a specified range efficiently. f(x) Zasha Steve Nic Brad Ed

Hash Table Terminology hash function Zasha f(x) Steve Nic collision Brad Ed keys load factor λ = (# of entries in table) / tableSize

Hash Table Code First Pass
Value & find(Key & key) {
  int index = hash(key) % tableSize;
  return Table[index];
}
What should the hash function be? What should the table size be? How should we resolve collisions?

A Good Hash Function… is easy (fast) to compute (O(1) and practically fast). distributes the data evenly (hash(a) ≠ hash(b) for a ≠ b, as far as possible). uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

Good Hash Function for Integers Choose tableSize to be prime hash(n) = n % tableSize Example: tableSize = 7 insert(4) insert(17) find(12) insert(9) delete(17) 1 2 3 4 5 6
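Working the example through (hash7 is just a throwaway name for this sketch):

  int hash7(int n) { return n % 7; }     // tableSize = 7, slots 0..6
  // insert(4)  -> slot hash7(4)  = 4
  // insert(17) -> slot hash7(17) = 3
  // find(12)   -> look in slot hash7(12) = 5 (nothing there, so not found)
  // insert(9)  -> slot hash7(9)  = 2
  // delete(17) -> remove it from slot 3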

Good Hash Function for Strings? Ideas?

Good Hash Function for Strings? Sum the ASCII values of the characters. Consider only the first 3 characters. Uses only 2871 out of 17,576 entries in the table on English words. Let s = s1s2s3s4…sn: choose hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + … + sn·128^(n-1) Problems: hash(“really, really big”) = well… something really, really big hash(“one thing”) % 128 = hash(“other thing”) % 128 Think of the string as a base 128 number.

Making the String Hash Easy to Compute Use Horner's Rule
int hash(String s) {
  int h = 0;
  for (int i = s.length() - 1; i >= 0; i--) {
    h = (s[i] + 128*h) % tableSize;
  }
  return h;
}

Universal Hashing For any fixed hash function, there will be some pathological sets of inputs everything hashes to the same cell! Solution: Universal Hashing Start with a large (parameterized) class of hash functions No sequence of inputs is bad for all of them! When your program starts up, pick one of the hash functions to use at random (for the entire time) Now: no bad inputs, only unlucky choices! If universal class large, odds of making a bad choice very low If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

Universal Hash Function: “Random” Vector Approach Parameterized by prime size and vector: a = <a0 a1 … ar> where 0 <= ai < size Represent each key as r + 1 integers where ki < size size = 11, key = 39752 ==> <3,9,7,5,2> size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4> ha(k) = (a0·k0 + a1·k1 + … + ar·kr) % size, i.e. the dot product with a “random” vector, reduced mod size!
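A sketch of that hash as code (the function name and vector interface are mine; it assumes a was chosen at random once at startup, with 0 <= a[i] < size and at least as many entries as the key has digits):

  #include <vector>
  int universalHash(const std::vector<int> & a, const std::vector<int> & k, int size) {
    long long sum = 0;
    for (int i = 0; i < (int) k.size(); i++)
      sum += (long long) a[i] * k[i];     // dot product of the key digits with the random vector
    return (int) (sum % size);            // reduce mod the prime table size
  }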

Universal Hash Function Strengths: works on any type as long as you can form ki’s if we’re building a static table, we can try many a’s a random a has guaranteed good properties no matter what we’re hashing Weaknesses must choose prime table size larger than any ki

Hash Function Summary Goals of a hash function Hash functions reproducible mapping from key to table entry evenly distribute keys across the table separate commonly occurring keys (neighboring keys?) complete quickly Hash functions h(n) = n % size h(n) = string as base 128 number % size Universal hash function #1: dot product with random vector The idea of neighboring keys here may change from application to application. In one context, neighboring keys may be those with the same last characters or first characters… say, when hashing names in a school system. Many people may have the same last names or first names (but few will have the same of both).

How to Design a Hash Function Know what your keys are Study how your keys are distributed Try to include all important information in a key in the construction of its hash Try to make “neighboring” keys hash to very different places Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Collisions Pigeonhole principle says we can’t avoid all collisions try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? open hashing: put little dictionaries in each entry closed hashing: pick a next entry to try The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?
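For the open hashing option, a minimal sketch of separate chaining (all names here are hypothetical; it assumes non-negative integer keys):

  #include <vector>
  #include <list>
  struct ChainedTable {
    std::vector< std::list<int> > table;                        // one little list (dictionary) per entry
    ChainedTable(int size) : table(size) {}
    int hash(int key) { return key % (int) table.size(); }
    void insert(int key) { table[hash(key)].push_back(key); }   // colliding keys simply share a bucket
    bool find(int key) {
      std::list<int> & bucket = table[hash(key)];
      for (std::list<int>::iterator it = bucket.begin(); it != bucket.end(); ++it)
        if (*it == key) return true;
      return false;
    }
  };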