Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick.



Radix-based Keys: A key has multiple parts, and each part is an element of some set (e.g., a character or a numeral). Key parts can be accessed individually (e.g., s[i] for a string s). The size of the set is the radix.
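
To make "key parts can be accessed" concrete, here is a minimal Python sketch of two digit-extraction helpers. The names `bit` and `char_digit`, the default 5-bit width, and the use of 0 as the value past the end of a string are assumptions made for illustration; they are not from the slides.

```python
def bit(key: int, i: int, width: int = 5) -> int:
    """Return bit i of an integer key, counting from the left (radix 2).

    width is the assumed fixed number of bits per key.
    """
    return (key >> (width - 1 - i)) & 1


def char_digit(key: str, i: int) -> int:
    """Return the i-th character code of a string key, or 0 past the end.

    For ASCII keys this is a radix-256 digit; 0 acts as an assumed
    end-of-key value.
    """
    return ord(key[i]) if i < len(key) else 0


# S is represented as 10011 in the 5-bit letter code used in these slides.
s_bits = [bit(0b10011, i) for i in range(5)]
```

Radix search structures only ever touch keys through helpers like these, which is why their performance depends on how cheaply a digit can be extracted.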

Advantages of Radix-based Search: Good worst-case performance. Simpler than balanced trees, etc. Fast access to data. An easy way to handle variable-length keys. Saves space (part of the key is encoded in the structure).

Disadvantages of Radix-based Search: May be space-inefficient. Performance depends on efficient access to the bytes of the keys. Keys must be distinct, or there must be some other way to handle duplicate keys.

Digital Search Trees: Similar to binary search trees. The difference is that we use bits of the key to determine which subtree to search. Path in tree = prefix of key.

Digital Search Trees: Insert A-S-E-R-C-H-I-N-G. Key representations: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111. [Figure: the resulting digital search tree, with A at the root and links labeled 0 and 1.] Note that the binary tree is not sorted in the BST sense.

Digital Search Trees Prop 15.1: A search or insertion into a DST takes about lg N comparisons on average, and about 2 lg N comparisons in the worst case, in a tree built from N keys. The number of comparisons is never more than the number of bits in the search key.
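
The DST insert and search described above can be sketched as follows. This is a minimal illustration, not the book's code: the names `Node`, `dst_insert`, `dst_search`, and `code` are hypothetical, and it assumes distinct fixed-width keys using the 5-bit letter codes from the example (A = 00001, ..., Z = 11010).

```python
WIDTH = 5  # assumed bits per key, matching the 5-bit letter codes


def code(letter):
    """5-bit code of an uppercase letter: A -> 1 (00001), S -> 19 (10011)."""
    return ord(letter) - ord("A") + 1


def bit(key, i):
    """Bit i of key, counting from the left."""
    return (key >> (WIDTH - 1 - i)) & 1


class Node:
    def __init__(self, key):
        self.key = key
        self.links = [None, None]  # links[0] for bit 0, links[1] for bit 1


def dst_insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return Node(key)
    node, depth = root, 0
    while node.key != key:
        b = bit(key, depth)  # the bit at this depth picks the subtree
        if node.links[b] is None:
            node.links[b] = Node(key)
            break
        node = node.links[b]
        depth += 1
    return root


def dst_search(root, key):
    """Return True if key is in the tree."""
    node, depth = root, 0
    while node is not None:
        if node.key == key:  # full key comparison at every node visited
            return True
        node = node.links[bit(key, depth)]
        depth += 1
    return False


root = None
for letter in "ASERCHING":
    root = dst_insert(root, code(letter))
```

Each node on the path costs one full key comparison, and the path can never be longer than the number of bits in the key.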

Tries: Use bits of the key to guide the search, like a DST, but keep keys in order, like a BST. This allows recursive sort, etc. Pronounced “try-ee” or “try”. Keys are kept at the leaves of a binary tree.

Tries Defn. 15.1: A trie is a binary tree that has keys associated with each leaf, defined as follows: a trie for an empty set is a null link; a trie for a single key is a leaf containing that key; a trie for more than one key is an internal node whose left link refers to the trie for the keys that start with 0 and whose right link refers to the trie for the keys that start with 1 (with the leading bit ignored when building the subtries).

Tries: Insert A-S-E-R-C-H-I-N-G. Key representations: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111. [Figures: the trie after successive insertions. Each insertion constructs internal nodes down to the point where the prefixes stop matching.]

Tries Prop. 15.2: The structure of a trie is independent of key insertion order; there is one unique trie for any given set of distinct keys. Prop. 15.3: Insertion or search for a random key in a trie built from N random keys takes about lg N bit comparisons on average. The worst-case number of bit comparisons is bounded only by the number of bits in the key.
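
A binary trie following Defn. 15.1 can be sketched in Python as below. The class and function names (`Leaf`, `Internal`, `trie_insert`, `trie_search`, `shape`, `build`) are hypothetical, and the sketch assumes distinct fixed-width keys in the 5-bit letter code of the running example. The `shape` helper is included only to illustrate Prop. 15.2.

```python
WIDTH = 5  # assumed bits per key


def code(letter):
    return ord(letter) - ord("A") + 1


def bit(key, i):
    return (key >> (WIDTH - 1 - i)) & 1


class Leaf:
    def __init__(self, key):
        self.key = key


class Internal:
    def __init__(self):
        self.links = [None, None]


def split(a, b, depth):
    """Build internal nodes until distinct keys a and b differ in a bit."""
    node = Internal()
    ba, bb = bit(a, depth), bit(b, depth)
    if ba == bb:  # still a common prefix: one-way branching
        node.links[ba] = split(a, b, depth + 1)
    else:
        node.links[ba], node.links[bb] = Leaf(a), Leaf(b)
    return node


def trie_insert(node, key, depth=0):
    if node is None:
        return Leaf(key)
    if isinstance(node, Leaf):
        return node if node.key == key else split(node.key, key, depth)
    b = bit(key, depth)
    node.links[b] = trie_insert(node.links[b], key, depth + 1)
    return node


def trie_search(node, key):
    depth = 0
    while isinstance(node, Internal):  # bits guide the way down
        node = node.links[bit(key, depth)]
        depth += 1
    return node is not None and node.key == key  # one full comparison


def keys_in_order(node):
    """Left-to-right traversal of the leaves."""
    if node is None:
        return []
    if isinstance(node, Leaf):
        return [node.key]
    return keys_in_order(node.links[0]) + keys_in_order(node.links[1])


def shape(node):
    """Nested-tuple picture of the trie, for comparing structures."""
    if node is None or isinstance(node, Leaf):
        return None if node is None else node.key
    return (shape(node.links[0]), shape(node.links[1]))


def build(letters):
    root = None
    for letter in letters:
        root = trie_insert(root, code(letter))
    return root
```

Building from A-S-E-R-C-H-I-N-G and from the reversed sequence gives tries with the same `shape`, and the left-to-right traversal visits the keys in sorted order.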

Tries: An annoying feature of tries is one-way branching when keys have a common prefix. Prop. 15.4: A trie built from N random w-bit keys has about N/ln 2 nodes on average (about 1.44 N).

Patricia Tries: Tries have two annoying features: one-way branching when keys have a common prefix, and two different types of nodes. Patricia tries fix both of these. Patricia = Practical Algorithm To Retrieve Information Coded In Alphanumeric.

Patricia Tries: To avoid one-way branching, keep at each node the index of the next bit to test, and skip over the common prefix. To avoid two types of nodes, store data in internal nodes and replace external links with back links.

Patricia Tries: [Figure: patricia trie with nodes S, R, H, E, C, A, each labeled with the index (0 to 4) of the bit it tests.] Key representations: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111.

Patricia Tries Prop 15.5: Insertion or search in a patricia trie built from N random bitstrings takes about lg N bit comparisons on average, and about 2 lg N in the worst case, but never more than the length of the key.

Map: Radix search; Digital Search Trees; Tries; Patricia Tries; Multiway tries and TSTs; Text string algorithms.

Multiway Tries: Like radix sort, we can benefit from comparing more than one bit at a time. Compare r bits at a time and speed up the search by a factor of r. What could possibly be bad? The number of links per node is now R = 2^r, which can waste a lot of space!

Multiway Tries: The structure is (almost) the same as for binary tries, except that there are R branches. Search: start at the root with the leftmost digit, and follow the ith link if the next R-ary digit is i. If the link is null, the search is a miss. If we reach a leaf, it contains the only key with a prefix matching the path to it, so compare it with the search key.

Existence Tries: Only keys, no records. Operations are insert and search. Defn. 15.2: The existence trie for a set of keys is: for the empty set, a null link; for a non-empty set, an internal node with links for each possible digit to tries built with the leading digit omitted.

Existence Tries: It is convenient to return null on a miss and a dummy record on a hit. It is also convenient to have no duplicate keys and no key that is a prefix of another key: use keys of fixed length, or use a termination character with value NULLdigit that is used only as a sentinel.

Existence Tries: No need to store any data; all keys are captured in the trie structure. Search: if we reach NULLdigit at the same time as we run out of key digits, it is a search hit; otherwise, a search miss. Insert: search until we find a null link, then add nodes for each of the remaining digits in the key.
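
The existence-trie insert and search can be sketched in a few lines of Python. Two assumptions are made here that are not in the slides: each node is a `dict` holding only its non-null links (instead of an array of R links), and the sentinel NULLdigit is represented by the character `"\0"`, which is assumed never to occur inside a key. The function names are hypothetical.

```python
END = "\0"  # stands in for NULLdigit; assumed absent from all keys


def et_insert(root: dict, key: str) -> None:
    """Walk down the trie, adding nodes for digits not yet present."""
    node = root
    for ch in key + END:
        node = node.setdefault(ch, {})


def et_search(root: dict, key: str) -> bool:
    """Hit only if every digit, including the sentinel, has a link."""
    node = root
    for ch in key + END:
        node = node.get(ch)
        if node is None:  # null link: search miss
            return False
    return True


root = {}
for word in "now is the time for".split():
    et_insert(root, word)
```

No key is stored anywhere: a search for "time" hits because the path t-i-m-e ends in a sentinel link, while "tim" misses because no sentinel follows the m.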

Existence Tries: [Figure: existence trie for the words “now is the time for”.]

Multi-way Tries: R-ary branching. Keys are stored at leaves. The path to a leaf defines a prefix of the key stored at that leaf. Only build the tree downward until prefixes become distinct.

Multi-way Tries Defn. 15.3: The multiway trie for a set of keys associated with leaves is: for the empty set, a null link; for a singleton set, a leaf containing the key; for a larger set, an internal node with links for each possible digit to tries built with the leading digit omitted.

Multi-way Tries Prop. 15.6: Search or insertion in a standard R-ary trie built from N random keys takes about log_R N character comparisons on average, bounded by the length of the key; the number of links is about RN/ln R. Classic time-space tradeoff! Larger R = faster but more space.
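
To see the tradeoff in numbers, the two estimates from Prop. 15.6 can simply be evaluated for a few radices. The function name `trie_estimates` and the choice of N = 1,000,000 are illustrative assumptions, and the results are the proposition's approximations for random keys, not measurements.

```python
import math


def trie_estimates(n, radix):
    """Return (about log_R N comparisons, about R*N/ln R links)."""
    comparisons = math.log(n, radix)
    links = radix * n / math.log(radix)
    return comparisons, links


for radix in (2, 16, 256):
    comparisons, links = trie_estimates(1_000_000, radix)
    print(f"R={radix:3d}: about {comparisons:.1f} comparisons, "
          f"about {links:,.0f} links")
```

As R grows, the comparison count shrinks only logarithmically while the link count grows almost linearly in R.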

Ternary Search Trie (TST): Each node has a character (digit) and three links. The left link refers to the subtrie for keys whose current digit is less than the node’s; the middle link, to the subtrie for keys whose current digit is the same; the right link, to the subtrie for keys whose current digit is greater than the node’s.

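Ternary Search Trie (TST): A TST is equivalent to a multiway trie in which each node is replaced by a BST that uses the characters of its non-null links as keys. TSTs correspond to 3-way radix sorting, just as BSTs correspond to QuickSort and M-ary tries to RadixSort.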

Ternary Search Trie (TST): Search: start at the root and, recursively, compare the next character in the key with the character in the node. If less, take the left link; if greater, take the right link; if equal, take the middle link and go to the next character in the key. The search is a miss if we encounter a null link or reach the end of the key before NULLdigit.

Ternary Search Trie (TST): Insert: start at the root and search to find the location where the prefix diverges, then add new nodes for the characters not consumed by the search.
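
The TST search and insert steps above can be sketched as follows. This is an existence TST under stated assumptions: the names `TSTNode`, `tst_insert`, and `tst_search` are hypothetical, and NULLdigit is represented by `"\0"`, which is assumed not to occur in any key.

```python
END = "\0"  # stands in for NULLdigit; assumed absent from all keys


class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None


def tst_insert(node, key, i=0):
    """Insert key[i:] below node and return the (possibly new) node."""
    ch = key[i] if i < len(key) else END
    if node is None:
        node = TSTNode(ch)  # new node for a character not yet consumed
    if ch < node.ch:
        node.left = tst_insert(node.left, key, i)
    elif ch > node.ch:
        node.right = tst_insert(node.right, key, i)
    elif ch != END:
        node.mid = tst_insert(node.mid, key, i + 1)
    return node


def tst_search(node, key):
    i = 0
    while node is not None:
        ch = key[i] if i < len(key) else END
        if ch < node.ch:
            node = node.left
        elif ch > node.ch:
            node = node.right
        elif ch == END:  # sentinel matched: search hit
            return True
        else:  # equal: take the middle link, advance in the key
            node = node.mid
            i += 1
    return False  # null link: search miss


root = None
for word in "now is the time for".split():
    root = tst_insert(root, word)
```

Only the middle link consumes a character of the key; left and right links stay on the same character, which is what makes each node behave like a small BST over the characters in use.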

Existence TST: [Figure: existence TST for the words “now is the time for”.]

Ternary Search Trie (TST) Prop. 15.7: A search or insertion in a full TST requires time proportional to the key length. The number of links in a TST is at most three times the number of characters in all the keys.

Ternary Search Trie (TST): Can be made more space-efficient by putting keys in leaves at the point where the prefix is unique, and by eliminating one-way branching as we did in Patricia tries. Can trade off speed and space by having a large branch at the root (R or R^2 links) with the rest of the trie a regular TST. This works well if the first character(s) are well-distributed.

Ternary Search Trie (TST): Nice for practical use. TSTs adapt to the non-uniformity often seen in keys: though the character set may be large, often only a few characters are used, or are used after a particular prefix, so we don’t make many links we don’t need. Keys with a structured format may use many symbols overall, but only a few at each part of the key.

Ternary Search Trie (TST): Nice for practical use. Search misses are really fast! Can be adapted for partial-match searches (“don’t care” characters in the search key) and for “almost match” searches (all but any one character match). TSTs access bytes or larger symbols rather than bits (as Patricia tries do); byte access is often better supported, more efficient, or more natural to the keys.

Text-String-Index: Recall the string index built with a BST with string pointers into a large text. Consider each position in the text to be the start of a string key that runs to the end of the text, and build a symbol table with these keys. The keys are all different (their lengths alone suffice to distinguish them), and most are very long. Suffix tree = search tree for this.

Text-String-Index: BSTs are simple and work well for suffix trees; a worst-case BST is not likely. Patricia tries were designed to do this! They need bit-level access and are fast on misses. TSTs are simple and take advantage of byte operations; they can solve more complex problems, and == can be changed to mean “prefix”.

Text-String-Index: If the text is static, why not use binary search? It is fast, needs no support for insert/delete, and uses less memory (fewer links/pointers). But TSTs have some advantages: they never retrace steps, and they support other operations. Can also build an FSM, but that is better suited to a linear search of new text.

String Search: The problem is to look for a particular string s in a large text t. Naïve method: search t linearly for s[0]. When a match is found at t[i], match s[j] with t[i+j] for j = 1 to |s|-1. If all |s| characters match, we have a match! Otherwise, go back to searching t at t[i+1]. Time? Proportional to |s| times |t| in the worst case: not good.
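
The naïve method can be written directly as a nested loop. The sketch below is one straightforward rendering (the name `naive_search` is hypothetical); it reports the start index of every occurrence, including overlapping ones, and assumes s is non-empty.

```python
def naive_search(t, s):
    """Return the start indices of all occurrences of s in t."""
    hits = []
    for i in range(len(t) - len(s) + 1):  # candidate start positions
        j = 0
        while j < len(s) and t[i + j] == s[j]:
            j += 1
        if j == len(s):  # all |s| characters matched
            hits.append(i)
        # otherwise fall through and retry at t[i + 1]
    return hits
```

The worst case shows up on inputs such as a long run of one letter searched for a slightly different run: almost every start position matches nearly all of s before failing, so the inner loop does about |s| work at each of about |t| positions.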

FSM-based String Search: A fast way to look for a particular string s in one or more (large) texts: build an FSM for the search string. States represent the prefix matched so far. A transition either extends the match or fails to the longest suffix of what has been seen that is a prefix of s. An FSM can also be built for multiple search strings.

Finite State Machine, a.k.a. Finite State Automaton (FSA): Set of states S, represented as nodes in a graph. Set of input symbols Σ, the labels on directed edges. Transition function δ: for a state and an input, gives the next state. Initial state q0: where to start. Set of final states F: the subset of S for “accept”. [Figure: example FSM with states q0, q1, q2, q3, start state q0, F = {q1, q2}, Σ = {a, b, c, d}; an edge is a transition, e.g. δ(q1, c) = q3.]

FSM-based String Search: Search for abraca. Build the recognizer skeleton (states for the prefixes a, ab, abr, abra, abrac, abraca), add suffix-is-prefix links, then add failure links. [Figure: FSM for abraca, with the start state and final state marked.] Is that all of them?

FSM-based String Search: Linear time in |s| to build the FSM for s. Linear time in |t| to search a large text t for all instances of s. Can’t hope for better than that! What about searching for more than one string? Build an FSM for all the strings! Linear time in the sum of the string lengths to build the FSM; linear time in |t| to search all of t for all the strings.
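
For a single search string, one well-known way to build such a machine is the deterministic automaton used in Knuth-Morris-Pratt matching, sketched below. The slides do not say which construction they use, so treat this as one possible realization. The names `build_fsm` and `fsm_search` are hypothetical; state k means "the last k characters read match the first k characters of s", the alphabet is taken to be the characters of s, and s is assumed non-empty. Building the table this way costs time proportional to |s| times the alphabet size, which is linear in |s| for a fixed alphabet.

```python
def build_fsm(s):
    """Return delta, where delta[state][ch] is the next state."""
    m = len(s)
    alphabet = set(s)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    delta[0][s[0]] = 1
    x = 0  # state the machine would be in after reading s[1:j]
    for j in range(1, m + 1):
        for ch in alphabet:
            delta[j][ch] = delta[x][ch]  # failure: behave like state x
        if j < m:
            delta[j][s[j]] = j + 1  # success: extend the match
            x = delta[x][s[j]]
    return delta


def fsm_search(t, s):
    """Return the start indices of all occurrences of s in t."""
    delta = build_fsm(s)
    state, hits = 0, []
    for i, ch in enumerate(t):
        # characters that never occur in s send the machine to state 0
        state = delta[state].get(ch, 0)
        if state == len(s):  # final state reached
            hits.append(i - len(s) + 1)
    return hits
```

The scan reads each character of t exactly once and never backs up, which is where the linear time in |t| comes from; for example, `fsm_search("abracabraca", "abraca")` finds both overlapping occurrences.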

Summary: Radix search; Digital Search Trees; Tries; Patricia Tries; Multiway tries and TSTs; Text string algorithms; FSMs for fast string matching.