Copyright 2004-2006 Curt Hill Tries An N-Way tree with unusual properties.

The word
Derived from the middle of "retrieve"
Pronounced either like “try” or “tree”
Pronouncing it as “try” is less confusing, so this is what we will use
However, before looking at a trie we must consider a radix sort

Radix sort
In the olden days we had card decks
We typically put a sequence number in the last 8 columns
If the deck ever got shuffled we took it to operations and they could re-sort it based on that sequence
They used a machine called a card sorter
It had one input and 12 output bins
  – A card had 12 rows

It worked like this:
The operator set which single column it would sort on
The deck was read in and each card was put into one of 12 slots based on the value in that column
If two cards had the same value in that column, they landed in the slot in the same relative order that they were originally in
  – In modern terms, the sort is stable
Many sorts do not preserve the input order of equal keys

How should we sort a deck?
Sort first or sort last?
Sort first
  – Most of us would sort on the first character and then have 10 decks
  – Then sort each of those decks on the second character, giving 100 decks
  – Then sort each of those decks on the third character, giving 1000 decks
  – etc.

Sort Last
Sort last only works because the card sorter maintains order for equal keys
Sort on the last digit
  – Recombine the decks into one, based on their slot order
Next sort on the next to last digit
  – Recombine the decks into one, based on their slot order
Keep doing this until you run out of digits

Example
Consider the following data: 434, 214, 123, 432, 124, 431, 223
Three passes based on three digits
Sort on last digit
  – 431, 432, 123, 223, 434, 214, 124
Sort on middle digit
  – 214, 123, 223, 124, 431, 432, 434
Sort on first digit
  – 123, 124, 214, 223, 431, 432, 434
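
The three passes above can be sketched in C++. This is only an illustration of the sort-last idea, not code from the presentation; the names sort_on_digit and radix_sort are made up here, and the sketch assumes non-negative decimal keys with a known number of digits.

```cpp
#include <array>
#include <cassert>
#include <vector>

// One pass of the card sorter: drop each card into the bin for one
// decimal digit, then recombine the bins in slot order. Cards keep
// their relative order inside a bin, so the pass is stable.
std::vector<int> sort_on_digit(const std::vector<int>& deck, int divisor) {
    std::array<std::vector<int>, 10> bins;
    for (int card : deck) {
        assert(card >= 0);  // assumption: keys are non-negative
        bins[(card / divisor) % 10].push_back(card);
    }
    std::vector<int> out;
    for (const auto& bin : bins)
        out.insert(out.end(), bin.begin(), bin.end());
    return out;
}

// Sort-last radix sort: last digit first, first digit last.
std::vector<int> radix_sort(std::vector<int> deck, int digits) {
    int divisor = 1;
    for (int pass = 0; pass < digits; ++pass) {
        deck = sort_on_digit(deck, divisor);
        divisor *= 10;
    }
    return deck;
}
```

Calling sort_on_digit(deck, 1) on the sample data reproduces the first pass, and radix_sort(deck, 3) reproduces the final order. Each pass is correct only because the earlier passes' order survives among equal digits.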

Quaint?
The importance of the radix sort at this point is that it deals with the key as a sequence of digits rather than a unified whole
A trie will do the same
It will also combine tree searches and subscripting

Subscripting as a search
Pros
  – The advantage of a vector or an array is that subscripting is extremely quick
  – The advantage of a binary tree or BTree is that the key can have any form
  – Hashing attempts to make a key from things that are not a key
Cons
  – Vectors/arrays only allow integer subscripts
  – The O(log n) search of a tree is much slower than the O(1) subscripting of an array
  – Hash tables maul the sorted order

Trie Again
The trie is an attempt to bring subscripting back to searching
The key concept is to think of a string, not as a single indivisible item, but as a sequence of characters
  – String is the most general key
Works well for dense keys

The organization of the trie
A trie is a multiway tree where no search occurs on the keys
  – Instead there is a subscript evaluation
Suppose that we have a string of 5 digits for a key
  – Each node will contain 10 possibilities, one for each digit
Therefore the trie is a 10-way tree
The root node has one subtree for each digit

[Figure: root node and three of its 10 descendants]

Notes
If the key is constant length
  – Only the leaves have any data
  – The path to the leaf is the key
  – The digits are not actually stored in the nodes
  – Just the pointers to subnodes
If the key is not constant length then each node has the data corresponding to the key that ends at that node
Subscripting and pointer dereferencing are used

Searching
We use the first character of the search key to find the proper subtree of the root
Each subtree of the root then uses the second character as the basis to find the correct subtree
At level N the Nth character of the key is used as a subscript into the node to find the subtree
The depth of the tree is the length of the longest key
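
A minimal sketch of this search for a key made only of decimal digits follows. The names DigitNode, trie_insert and trie_find are assumptions made for illustration and are not taken from the presentation's program; the data is represented by a plain pointer to a string.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node for keys made only of decimal digits.
struct DigitNode {
    std::array<std::unique_ptr<DigitNode>, 10> child;  // one slot per digit
    const std::string* data = nullptr;                  // set where a key ends
};

// Walk down one level per character, creating nodes as needed.
void trie_insert(DigitNode& root, const std::string& key, const std::string* data) {
    DigitNode* node = &root;
    for (char c : key) {
        assert(c >= '0' && c <= '9');
        auto& slot = node->child[c - '0'];  // a subscript, not a comparison
        if (!slot) slot = std::make_unique<DigitNode>();
        node = slot.get();
    }
    node->data = data;
}

// Returns the stored data, or nullptr if the key is absent.
const std::string* trie_find(const DigitNode& root, const std::string& key) {
    const DigitNode* node = &root;
    for (char c : key) {
        if (c < '0' || c > '9') return nullptr;
        node = node->child[c - '0'].get();
        if (!node) return nullptr;  // fell off the tree: key not present
    }
    return node->data;
}
```

The loop in trie_find runs once per character of the key, so the number of steps depends on the key length and not on how many keys are stored.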

Example
Consider a structure that contains English words, of which there are more than a million
Each entry points to a definition and other stuff
  – Perhaps a file ID if the whole structure does not fit in memory
Twenty-six entries in the root
Contains words:
  – a, an, and, am, any

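Four levels of Another Trie
[Figure: a 26-way trie holding a, an, and, am, any. Each node has slots a through z. The root's "a" slot leads to the node for "a"; that node's "m" and "n" slots lead to the nodes for "am" and "an"; the "an" node's "d" and "y" slots lead to the nodes for "and" and "any". Unused slots are NULL.]
Level 0 has zero length items
Level 1 has length one items
Level 2 has length two items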

Notes
One pointer for each descendant
  – The key characters for the pointers are not stored since no comparisons are done
One data item for the word currently constructed
  – Not the word itself (that is the key) but the data corresponding to that key
  – I will not show insertion/deletion etc. because it follows naturally from the structure and previous experience with trees

Searching Trees
How does this trie compare with other tree searches?
Searching a binary tree is a log base 2 operation
  – Each node examination splits the tree into (hopefully) two nearly equal pieces
  – Log base 2 of 1 million is about 20 (19.9)

BTree
The cost of searching a BTree is harder to determine
Suppose that N=4
Each node should hold between 4 and 8 keys
Then the search should be a log base 6 operation
Each node examination splits the tree into approximately six nearly equal pieces
Log base 6 of 1 million is 7.71
However, in traversing from root to leaf we visited 7 or 8 nodes, each of which had approximately 6 items in it and had to be searched
Hence we had some sort of inspection of about 50 items

Trie
Searching a trie using characters is a log base 26 operation
  – Each node examination splits the tree into 26 pieces, but they will certainly not be equal, especially if the keys follow English letter frequency
  – For a product ID with sequential numbers/letters an equal distribution is possible
  – Log base 26 of 1 million is 4.24
  – Hence the tree is much shallower
  – The root to leaf search did four subscript evaluations rather than four searches
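
The three depth figures quoted for one million entries (19.9, 7.71 and 4.24) can be checked with a few lines of C++. The function name levels_needed is an assumption for illustration, and the formula idealizes every node as splitting its subtree into equal pieces, which the slides note is not true of real English words.

```cpp
#include <cassert>
#include <cmath>

// Approximate number of levels needed to separate `entries` keys when
// every node splits its subtree into `fanout` equal pieces:
// log base `fanout` of `entries`.
double levels_needed(double entries, double fanout) {
    assert(entries >= 1.0 && fanout > 1.0);
    return std::log(entries) / std::log(fanout);
}
```

With one million entries, a fanout of 2 models the binary tree, 6 models the BTree with N=4, and 26 models the letter trie.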

Balance
There is none unless the keys accidentally cause balance
Balance is something that can be done when searching but not when subscripting

Space utilization
Tries are preferred only for dense keys
What is a dense key?
  – A key where adjacent keys are relatively close to each other
  – The key space has few holes in it
English words are usually sparse

Example
In a 55,000 entry dictionary "gunfire" and "gunlock" are adjacent
  – These have the first three characters the same, but how many letter combinations lie between them?
  – There are five letters between the f and the l (g, h, i, j, k)
  – There should be 5 * 26 * 26 * 26 = 87,880 pseudo words of length 7 between them, which is greater than 87,000

Example Again
This neglects other gunf and gunl words as well as differing lengths
  – Social security numbers, telephone numbers, product numbers are much more likely to be dense
If a node of 26 items only uses three of them then it is pretty wasteful of space, even if the subscripting is quick
In the dictionary example, the second level has a number of letter combinations that do not exist
  – Most two-consonant pairs do not exist, except as abbreviations
  – bb, bc, bf

Tree Nodes
We seldom want to take a trie to the bitter end unless we have a manageable key
  – Key must be dense
  – Key must be evenly distributed
  – Such a key is often numeric
  – Combinations must occur with equal frequency
A hybrid tree has a mixture of formats

First Example
License plate numbers in ND have three letters and three digits
  – A binary tree would need a height of 24
  – A BTree of N = 4 would still need 9 levels
For a trie the top three levels would have letters and the bottom three digits
  – Would only be six levels deep
  – The quickest possible search
  – Two distinct forms of nodes
  – One form of hybrid
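
As a rough illustration of the two node forms, the six subscripts used on the way from root to leaf can be computed directly from the plate. The function name plate_path is hypothetical, and the sketch assumes plates are exactly three uppercase letters followed by three digits in an ASCII-like character set.

```cpp
#include <array>
#include <cassert>
#include <string>

// Returns the subscript used at each of the six levels.
// Levels 0-2 index a 26-slot letter node, levels 3-5 a 10-slot digit node.
std::array<int, 6> plate_path(const std::string& plate) {
    assert(plate.size() == 6);
    std::array<int, 6> path{};
    for (int level = 0; level < 6; ++level) {
        char c = plate[level];
        if (level < 3) {
            assert(c >= 'A' && c <= 'Z');
            path[level] = c - 'A';  // letter node
        } else {
            assert(c >= '0' && c <= '9');
            path[level] = c - '0';  // digit node
        }
    }
    return path;
}
```

For the plate "ABC123" this gives the subscripts 0, 1, 2, 1, 2, 3: six subscript evaluations and no key comparisons.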

Hybrids again
The more common practice is to use the trie for a small number of levels and then switch to another data structure
The top structure is usually a trie for its high fanout
The bottom structure is usually a binary tree for memory structures and a BTree for disk structures, but other things may also be used
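
One possible shape for such a hybrid, sketched under stated assumptions: a single trie level indexed by the first uppercase letter, with a std::map (an ordered container, commonly implemented as a balanced binary tree) underneath each slot. The class name HybridIndex and the choice of int as the data type are made up for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical hybrid: one trie level on top, ordered trees below.
class HybridIndex {
public:
    void insert(const std::string& key, int value) {
        // Subscript on the first letter, then store the rest of the key
        // in the ordered tree for that slot.
        buckets_[slot(key)][key.substr(1)] = value;
    }

    // Returns a pointer to the stored value, or nullptr if absent.
    const int* find(const std::string& key) const {
        const auto& tree = buckets_[slot(key)];
        auto it = tree.find(key.substr(1));
        return it == tree.end() ? nullptr : &it->second;
    }

private:
    static std::size_t slot(const std::string& key) {
        assert(!key.empty() && key[0] >= 'A' && key[0] <= 'Z');
        return static_cast<std::size_t>(key[0] - 'A');
    }

    std::array<std::map<std::string, int>, 26> buckets_;
};
```

The top level costs one subscript regardless of how many keys are stored, and the sparse remainder of the key is left to a structure that does not waste space on unused letter combinations.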

Demonstration Notes
What follows is a trie program with some unusual features
  – Consider these before observing the code

Preprocessor commands
This code has preprocessor conditional compilations in it
It tailors the Trie to accept a key that is either only digits or only uppercase letters
There is either a definition of TRIE_LETTERS or not
  – The value is not important and is not even given; what matters is whether it is defined or not
Since the preprocessor is finished before the compiler starts, we can generate the compiler input
In this case we end up with two different versions
  – Great amounts of similarity but still different
Notice it also extends to the main C++ file
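
The presentation's actual source is not reproduced in this transcript, so the following is only a sketch of how a TRIE_LETTERS switch might look. Apart from the macro name TRIE_LETTERS, every identifier here (kFanout, kFirstSymbol, TrieNode, subscript_of) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Define TRIE_LETTERS (for example with -DTRIE_LETTERS on the compiler
// command line) to get the uppercase-letter version; leave it undefined
// to get the digit version. Only its presence matters, not its value.
#ifdef TRIE_LETTERS
constexpr std::size_t kFanout = 26;
constexpr char kFirstSymbol = 'A';
#else
constexpr std::size_t kFanout = 10;
constexpr char kFirstSymbol = '0';
#endif

// The node layout is the same text in both versions, but the compiler
// sees a different array size depending on the macro.
struct TrieNode {
    TrieNode* child[kFanout] = {};
    void* data = nullptr;
};

// Adjust the character by the start of the alphabet and use the result
// as a subscript. Assumes the symbols are contiguous, as in ASCII.
inline std::size_t subscript_of(char c) {
    std::size_t i = static_cast<std::size_t>(c - kFirstSymbol);
    assert(i < kFanout);
    return i;
}
```

Any file that includes this code and is compiled with the same setting gets the matching version, which is how the choice can extend to the main C++ file as well.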

Key density
What happens when a word like "AARDVARK" is inserted into an empty trie?
  – This is the problem of a sparse key
What would happen to a binary tree or BTree that had the same situation?
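
To make the cost concrete, here is a small sketch that counts how many nodes an insertion has to create. The names LetterNode and insert_counting are hypothetical, and the node layout (26 child pointers plus a flag) is an assumption rather than the layout of the presentation's program.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node for keys made only of uppercase letters.
struct LetterNode {
    std::array<std::unique_ptr<LetterNode>, 26> child;
    bool has_data = false;
};

// Inserts an uppercase key and returns how many nodes had to be created.
std::size_t insert_counting(LetterNode& root, const std::string& key) {
    std::size_t created = 0;
    LetterNode* node = &root;
    for (char c : key) {
        assert(c >= 'A' && c <= 'Z');
        auto& slot = node->child[c - 'A'];
        if (!slot) {
            slot = std::make_unique<LetterNode>();
            ++created;
        }
        node = slot.get();
    }
    node->has_data = true;
    return created;
}
```

With this layout, inserting "AARDVARK" into an empty trie creates 8 nodes, each carrying 26 pointer slots of which at most one is used. A binary tree in the same situation would allocate a single node for the key.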

Subscripting into the node
In both find and insert the search is trivial
Extract the letter, adjust by the beginning of the alphabet and use as a subscript
This is simple because we only allow a key to be a string of uppercase letters
What would we do if we needed to allow any character that may appear in a word?
  – Such as hyphen, space, apostrophe or digits
  – Not all characters are allowed, just some

The what if
This complicates the lookup and slows the simple lookup
The procedure would be something like this:
  – If the character is a letter, adjust as always
  – If the character is a digit, adjust by subtracting the zero and adding 26
  – Else do a case on the character and merely assign the subscript directly
The more characters the more costly and the less simple this search will be
This lack of simplicity will hinder the speed of the trie and make it less desirable, based on the probability of the characters
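
The procedure above might look like the following. The function name extended_subscript, the sentinel kNoSlot and the particular slot numbers given to hyphen, space and apostrophe are all assumptions chosen for illustration.

```cpp
#include <cassert>

constexpr int kNoSlot = -1;  // returned for characters that are not allowed

// Hypothetical slot layout: 0-25 uppercase letters, 26-35 digits,
// 36 hyphen, 37 space, 38 apostrophe.
int extended_subscript(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';       // adjust as always
    if (c >= '0' && c <= '9') return c - '0' + 26;  // subtract the zero, add 26
    switch (c) {                                     // assign directly
        case '-':  return 36;
        case ' ':  return 37;
        case '\'': return 38;
        default:   return kNoSlot;
    }
}
```

Each lookup now pays for up to two range tests and a switch before it can subscript, and every node needs 39 slots instead of 26, which illustrates both costs mentioned above.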

Iterator
Notice the handshaking that has to go on between the Trie and Trie_iterator classes
  – Both have to tell the other what is going on
This could be a recursive routine, but it uses a stack and a loop instead
Notice the word of a Trie node needs to be displayed first
Every leaf contains an array of NULL pointers and the pointer to the data item
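
The Trie and Trie_iterator classes themselves are not shown in this transcript, so the sketch below only illustrates the stack-and-loop traversal idea with a free function. The names Node, add_word and words_in_order are hypothetical, and a flag stands in for the pointer to the data item.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <stack>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool has_data = false;  // stands in for the pointer to the data item
};

void add_word(Node& root, const std::string& key) {
    Node* node = &root;
    for (char c : key) {
        assert(c >= 'A' && c <= 'Z');
        auto& slot = node->child[c - 'A'];
        if (!slot) slot = std::make_unique<Node>();
        node = slot.get();
    }
    node->has_data = true;
}

// Visits the trie with an explicit stack instead of recursion.
// A node's own word is emitted before any word in its subtrees.
std::vector<std::string> words_in_order(const Node& root) {
    std::vector<std::string> out;
    std::stack<std::pair<const Node*, std::string>> pending;
    pending.emplace(&root, std::string());
    while (!pending.empty()) {
        const Node* node = pending.top().first;
        std::string prefix = pending.top().second;
        pending.pop();
        if (node->has_data) out.push_back(prefix);  // the node's word first
        // Push Z first so that A ends up on top of the stack.
        for (int i = 25; i >= 0; --i)
            if (node->child[i])
                pending.emplace(node->child[i].get(),
                                prefix + static_cast<char>('A' + i));
    }
    return out;
}
```

Because the path from the root is the key, the word is rebuilt from the subscripts on the way down; it is never stored in a node. The words come out in alphabetical order regardless of insertion order.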