Presentation is loading. Please wait.

Presentation is loading. Please wait.

Copyright 2004-2006 Curt Hill Tries An N-Way tree with unusual properties.

Similar presentations


Presentation on theme: "Copyright 2004-2006 Curt Hill Tries An N-Way tree with unusual properties."— Presentation transcript:

1 Copyright 2004-2006 Curt Hill Tries An N-Way tree with unusual properties

2 Copyright 2004-2006 Curt Hill The word Derived from the middle of retrieve Pronounced either like “try” or “tree” Pronouncing as “try” is less confusing, so this is what we will use However, before looking at a trie we must consider a radix sort

3 Copyright 2004-2006 Curt Hill Radix sort In the olden days we had card decks We typically put a sequence number in the last 8 columns If the deck every got shuffled we took it to operations and they could resort it based on that sequence They used a machine called a card sorter It had one input and 12 output bins –A card had 12 rows

4 Copyright 2004-2006 Curt Hill It worked like this: The operator set which single column it would sort The deck was read in and put into one of 12 slots based on the value in that column If two cards were the same then they hit the slot in the same order that they were originally in Most sorts do not preserve the input order in equal keys

5 Copyright 2004-2006 Curt Hill How should we sort a deck? Sort first or sort last? Sort first –Most of us would sort on the first character and then have 10 decks –Then sort each of those decks into 100 decks –Then sort each of those decks into 1000 decks –etc

6 Copyright 2004-2006 Curt Hill Sort Last Sort last – only works because the card sorter maintains order for equal keys Sort on the last digit –Recombine the decks into one but based on their slot order –Next sort on next to last digit Recombine the decks into one but based on their slot order –Keep doing this until you run out of digits

7 Copyright 2004-2006 Curt Hill Example Consider the following data: 434,214,123,432,124,431,223 Three passes based on three digits Sort on last digit –431, 432, 123, 223, 434, 214,124 Sort on middle digit –214, 123, 223, 124, 431, 432, 434 Sort on first digit –123, 124, 214, 223, 431, 432, 434

8 Copyright 2004-2006 Curt Hill Quaint? The importance of the radix sort at this point is that it deals with the key as a sequence of digits rather than a unified whole A trie will do the same It will also combine tree searches and subscripting

9 Copyright 2004-2006 Curt Hill Subscripting as a search Pros –The advantage of a vector or an array is that subscripting is extremely quick –The advantage of a binary or B tree is that the key can have any form –Hashing attempts to make a key from things that are not a key Cons –Vector/arrays only allow integer subscripts –Trees O(log n) search are much slower than O(C) searches of arrays –Hash tables maul the sorted order

10 Copyright 2004-2006 Curt Hill Trie Again The trie is an attempt to bring subscripting back to searching The key concept is to think of a string, not as a single indivisible item, but as a sequence of characters –String is the most general key Works well for dense keys

11 Copyright 2004-2006 Curt Hill The organization of the trie A trie is a multiway tree where no search occurs on the keys –Instead a subscript evaluation Suppose that we have a string of 5 digits for a key –Each node will contain 10 possibilities – one for each digit Therefore the trie is 10-way tree The root node has one subtree for each digit

12 Copyright 2004-2006 Curt Hill Root and three of 10 descendents 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0

13 Copyright 2004-2006 Curt Hill Notes If the key is constant length –Only the leaves have any data –The path to leaf is the key –The digits are not actually stored in the nodes –Just the pointers to subnodes If the key is not constant length then each node has the data corresponding to that key Subscripting and pointer dereferencing is used

14 Copyright 2004-2006 Curt Hill Searching We use the first digit of the search key to find the proper subtree Each subtree of the root, then uses the second letter as the basis to find the correct subtree At level N then the Nth letter of the word is used as a subscript of the node to find the subtree The depth of the tree is the length of the longest key

15 Copyright 2004-2006 Curt Hill Example Consider a structure that contains English words of which there are more than a million Each entry points to a definition and other stuff –Perhaps a file ID if the whole structure does not fit in memory Twenty six entries in root Contains words: –a, an, and, am, any

16 Copyright 2004-2006 Curt Hill Four levels of Another Trie a b c d … x y z a b … z… m n NULL a a b c d … x y z an a b c d … x y z and a b c d … x y z am a b c d … x y z any Level 0 has zero length items Level 1 has length one items Level 2 has length two items

17 Copyright 2004-2006 Curt Hill Notes One pointer for each descendent –The data for the pointers is not needed since no comparisons are done One data item for the word currently constructed –Not the word itself (that is the key) but the data corresponding to that key –I will not show insertion/deletion etc. because it follows naturally from the structure and previous experience on trees

18 Copyright 2004-2006 Curt Hill Searching Trees How does this Trie compare with other tree searches? Searching a binary tree is a log 2 operation –Each node examination splits the tree into (hopefully) two nearly equal pieces –Log 2 of 1 million is about 20 (19.9)

19 Copyright 2004-2006 Curt Hill BTree Searching a BTree is a harder to determine operation Suppose that N=4 Each node should be between 4 and 8 keys Then the search should be log 6 operation Each node examination splits the tree into approximately six nearly equal pieces Log 6 of 1 million is 7.71 However, in the traversing from root to leaf in these 7 searches we also searched 7 or 8 nodes each of which had approximately 6 items in them Hence we had some sort of inspection of about 50 items

20 Copyright 2004-2006 Curt Hill Trie Searching a trie using characters is a log 26 operation –Each node examination splits the tree into 26 pieces but be sure they are not equal, especially if they use English letter frequency If a product id with sequential numbers/letters then an equal distribution is possible –Log 26 of 1 million is 4.24 –Hence the tree is much shallower –The root to leaf search did four subscript evaluations rather than four searches

21 Copyright 2004-2006 Curt Hill Balance There is none unless the keys accidentally cause balance Balance is something that can be done when searching but not subscripting

22 Copyright 2004-2006 Curt Hill Space utilization Tries are preferred only for dense keys What is a dense key? A key where adjacent keys are relatively close to each other The key space has few holes in it English words are usually sparse

23 Copyright 2004-2006 Curt Hill Example In a 55,000 entry dictionary "gunfire" and "gunlock" are adjacent –These have the first three characters the same but how many permutations are between them? –There are five letters between the f and l –There should be 5 * 26 * 26 * 26 pseudo words of length 7 between them which is greater than 87,000

24 Copyright 2004-2006 Curt Hill Example Again This neglects other gunf and gunl words as well as differing lengths –Social security numbers, telephone numbers, product numbers are much more likely to be dense If a node of 26 items only uses three of them then it is pretty wasteful of space even if the subscripting is quick In the dictionary example, the second level has a number of letter combinations that do not exist –Most two consonant pairs do not exist, except as abbreviations –bb, bc, bf

25 Copyright 2004-2006 Curt Hill Tree Nodes We seldom want to take a trie to the bitter end unless we have a manageable key –Key must be dense –Must be evenly distributed key –Such a key is often numeric –Combinations must occur with equal frequency A hybrid tree has a mixture of formats

26 Copyright 2004-2006 Curt Hill First Example License plate numbers in ND have three letters and three digits –A binary tree would need a height of 24 –A BTree of N = 4 would still need 9 levels For a trie the top three levels would have letters and the bottom three digits –Would only be six levels deep –The quickest possible search –Two distinct forms of nodes –One form of hybrid

27 Copyright 2004-2006 Curt Hill Hybrids again The more common practice is to use the trie for a small number of levels and then switch to another data structure The top structure is usually a trie for its high fanout The bottom structure is usually a binary tree for memory structures and a BTree for disk structures, but other things may also be used

28 Copyright 2004-2006 Curt Hill Demonstration Notes What follows is a trie program with some unusual features –Consider these before observing code

29 Copyright 2004-2006 Curt Hill Preprocessor commands This code has preprocessor conditional compilations in it It tailors the Trie to either accept a key that is only digits or only uppercase letters There is either a definition of TRIE_LETTERS or not –The value of this is not important, not even given, but the question is it defined or not Since the preprocessor is finished before the compiler starts we can generate the compiler input In this case we end up with two different versions Great amounts of similarity but still different Notice it also extends to the main C++ file

30 Copyright 2004-2006 Curt Hill Key density What happens when a word like "AARDVARK" is inserted into an empty trie? –This is the problem of a sparse key What would happen to binary or B tree that had the same situation?

31 Copyright 2004-2006 Curt Hill Subscripting into the node In both find and insert the search is trivial Extract the letter, adjust by the beginning of the alphabet and use as a subscript This is simple because we only allow a key to be a string of uppercase letters What would we do if we needed to allow any characters allowed in a word? –Such as hyphen, space, apostrophe or digits –Not all characters are allowed, just some

32 Copyright 2004-2006 Curt Hill The what if This complicates the lookup and slows the simple lookup The procedure would be something like this: –If the character is a letter, adjust as always –If the character is a digit, adjust by subtracting the zero and adding 26 –Else do a case on the character and merely assign the subscript directly The more characters the more costly and the less simple this search will be This lack of simplicity will hinder the speed of the trie and make it less desirable, based on the probability of the characters

33 Copyright 2004-2006 Curt Hill Iterator Notice the handshaking that has to go on between the Trie and Trie_iterator class –Both have to tell the other what is going on This could be a recursive routine, but uses a stack and loop instead Notice the word of a Trie node needs to be displayed first Every leaf contains an array of NULL pointers and the pointer to the data item


Download ppt "Copyright 2004-2006 Curt Hill Tries An N-Way tree with unusual properties."

Similar presentations


Ads by Google