Design a Data Structure

Presentation transcript:

Design a Data Structure
- Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”)
- Index, say, 100,000,000 documents of 1000 words each → 100 billion word occurrences
- Average word length: 8
- Speed determined largely by disk accesses
- May want boolean searches (e.g. “banana” and “slug”)
- Order results by relevance (title, keywords, repetitions…)
- What data structure, algorithms?
- What will the space requirements of your data structure be?
- What will the time requirements be?

Search Engine Ideas
- Binary search tree
  - With a node for each word occurrence, memory needed: 100 billion nodes, bytes each?
  - Insert, delete, find O(log n) – would that be OK?
- Or one node for all occurrences of a word, with a linked list of pointers to documents?
  - perhaps 10 million nodes, each with a 10,000 element list?
  - keep nodes (but not lists) in RAM
  - each element of list has URL, title, excerpt – 8K bytes?
- How about a list of documents with excerpts?
  - 1. Banana Slugs, “Banana slugs are yellow, 8” long…”
  - 8K per document would be 800 GB for the whole list.
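The round numbers on these slides can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope aid: the constant and function names are made up for illustration, and "8K" is read as 8,000 bytes and "GB" as 10^9 bytes, which is an assumption.

```cpp
#include <cstdint>

// All inputs are the slides' round numbers; the names are hypothetical.
constexpr std::uint64_t kDocuments        = 100000000ULL;  // 100 million documents
constexpr std::uint64_t kWordsPerDocument = 1000ULL;
constexpr std::uint64_t kDistinctWords    = 10000000ULL;   // "perhaps 10 million nodes"
constexpr std::uint64_t kBytesPerExcerpt  = 8000ULL;       // "8K", read as 8,000 bytes

// Total word occurrences across the whole collection.
constexpr std::uint64_t totalWordOccurrences() {
    return kDocuments * kWordsPerDocument;
}

// Average list length if there is one node per distinct word.
constexpr std::uint64_t averageHitListLength() {
    return totalWordOccurrences() / kDistinctWords;
}

// Size of a flat list holding one excerpt record per document.
constexpr std::uint64_t excerptListBytes() {
    return kDocuments * kBytesPerExcerpt;
}

// Compile-time checks against the figures quoted on the slides.
static_assert(totalWordOccurrences() == 100000000000ULL, "100 billion occurrences");
static_assert(averageHitListLength() == 10000ULL, "10,000-element lists");
static_assert(excerptListBytes() == 800000000000ULL, "800 GB at 10^9 bytes per GB");
```

Because everything is `constexpr`, the checks run at compile time; changing one input shows immediately which slide figure no longer holds.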

Getting results
- What should we store at the nodes of the BST?
  - A “hit list” for a word? entries?
  - Store a pointer to a hit list instead, to minimize BST size
- For each hit store document number and byte offset
- Order hit list by relevance criteria
- Size of hit list: 8GB?
- How many disk accesses to find the hits in a BST?
  - At 100 million * bytes per node, the BST is large. Can we store it all in RAM?
- How to perform a Boolean search?
  - or: union two lists (merge)
  - and: intersect two lists (merge-like algorithm)
- Total disk accesses needed?
  - search BST + access hit list + access each document’s info
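The two Boolean operations can be sketched as linear merges. This is a minimal illustration, assuming each hit list is a vector of document numbers sorted in ascending order with no duplicates. That sort order is an assumption of the sketch: a list ordered by relevance, as suggested above, would need to be re-sorted or indexed differently before merging. `DocId`, `intersectHits` and `unionHits` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using DocId = unsigned int;  // hypothetical document-number type

// AND: documents that appear in both lists.
// Precondition: both lists are sorted ascending and contain no duplicates.
std::vector<DocId> intersectHits(const std::vector<DocId>& a,
                                 const std::vector<DocId>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<DocId> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;                      // a[i] cannot be in b
        } else if (b[j] < a[i]) {
            ++j;                      // b[j] cannot be in a
        } else {
            out.push_back(a[i]);      // present in both
            ++i;
            ++j;
        }
    }
    return out;
}

// OR: documents that appear in either list, each reported once.
// Same precondition as intersectHits.
std::vector<DocId> unionHits(const std::vector<DocId>& a,
                             const std::vector<DocId>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<DocId> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            out.push_back(a[i++]);
        } else if (b[j] < a[i]) {
            out.push_back(b[j++]);
        } else {
            out.push_back(a[i]);      // same document in both lists
            ++i;
            ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);  // leftovers from a
    while (j < b.size()) out.push_back(b[j++]);  // leftovers from b
    return out;
}
```

Both functions touch each entry at most once, so the cost is proportional to the combined length of the two lists.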

A Better Data Structure
- BSTs waste space. Much duplication in the keys
- BSTs waste comparison time, for the same reason
- Can we use the ideas of Radix Sort? Search by bit? or by letter?
- Build a search tree, but…
  - Go left if first bit is 0, right for 1
  - Or, nodes have 26 children, for a..z
  - Words at the leaves. (Different sort of node.)
  - Each leaf node is a “hit list”
  - Don’t need to store the words!
- How much space is needed?
  - suppose you have all 11.9M 5-letter words.
  - space for tree about 1 pointer per word, 4 bytes, vs. 20(?) in BST
- Space savings possible, but what about wasted pointer space?

Radix Search (Ch. 15)
- Radix-search methods provide reasonable worst-case performance without balanced-tree complexity
- Space savings are also possible.
- They work by comparing pieces (“bytes”) of the key, rather than the whole key as a BST does
- Analogous to Radix Sorting methods
- Called “tries”, from reTRIEval (but, ironically, pronounced like the word tries)

Symbol Tables (Ch. 12 quickie)
- But first, a word about symbol tables and BSTs (review)
- Symbol table: store items, retrieve them by key.
  - e.g. a compiler’s symbol table
  - e.g. a database with primary key
  - e.g. Perl’s hash data structure (essentially an array indexed by a word): $phone{"john"} = "x6789";
  - fundamental to much of computation
- Symbol table ADT (with additional desirable ops):
  - insert, delete, find
  - select (kth largest)
  - sort
  - union (of two symbol tables)
- Extensively studied and still an area of active research (e.g. the web)

BSTs for Symbol Tables
- The Binary Search Tree is a common data structure used to implement symbol tables
- Operations:
  - insert, delete, find – recursive algs, O(n) worst case
    - O(log n) worst case in balanced BSTs
  - sort – inorder traversal, O(n)
  - kth largest? augment tree with number of descendants stored at each node
    - O(log n) time in a balanced BST
  - pred, succ? union?
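The "augment the tree with subtree counts" idea can be sketched as follows. This is an illustration on a plain, unbalanced BST of `int` keys, so the O(log n) bound only holds if the tree happens to be balanced. It selects the kth smallest key (0-based); the kth largest is the same call with `k` replaced by `size - 1 - k`. `SizedNode`, `bstInsert` and `selectKth` are hypothetical names, and nodes are never freed in this sketch.

```cpp
#include <cassert>
#include <cstddef>

// BST node augmented with the size of its subtree (hypothetical layout).
struct SizedNode {
    int key;
    std::size_t size;      // number of nodes in this subtree, including this one
    SizedNode* left;
    SizedNode* right;
    explicit SizedNode(int k) : key(k), size(1), left(nullptr), right(nullptr) {}
};

std::size_t sizeOf(const SizedNode* n) { return n ? n->size : 0; }

// Ordinary recursive BST insert that also keeps the size fields up to date.
// Duplicate keys are ignored. Returns the (possibly new) subtree root.
SizedNode* bstInsert(SizedNode* n, int key) {
    if (!n) return new SizedNode(key);
    if (key < n->key) {
        n->left = bstInsert(n->left, key);
    } else if (key > n->key) {
        n->right = bstInsert(n->right, key);
    }
    n->size = 1 + sizeOf(n->left) + sizeOf(n->right);
    return n;
}

// Return the node holding the kth smallest key (k = 0 is the minimum),
// or nullptr if k is out of range. One root-to-node walk: O(height).
const SizedNode* selectKth(const SizedNode* n, std::size_t k) {
    while (n) {
        assert(n->size == 1 + sizeOf(n->left) + sizeOf(n->right));
        std::size_t leftSize = sizeOf(n->left);
        if (k < leftSize) {
            n = n->left;              // answer is among the smaller keys
        } else if (k == leftSize) {
            return n;                 // exactly leftSize keys are smaller
        } else {
            k -= leftSize + 1;        // skip the left subtree and this node
            n = n->right;
        }
    }
    return nullptr;
}
```

The only extra cost over a plain BST is one counter per node and one addition per node on the insertion path.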

Digital Search Trees (Ch. 15 again)
- Like a BST, but go left for 0, right for 1 in the bit in question
- Store key at node
- The root branches on the most significant bit; ith level -> ith bit from left
- Search: like BST search, but compare appropriate bit
- Insert: ditto
- Note: not inorder! Each key is somewhere along the path specified by its bits…
- Can’t support sort, select
- Search time? O(b), b=# of bits

Digital Search Tree Insertion
- How to insert Z? Z=11010
- Trace down bits until you find an empty spot
- Runtime? O(b), b=number of bits
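A digital search tree with this insertion rule might look like the sketch below. It assumes fixed-width keys of `kBits` = 5 bits stored in an unsigned integer (5 matches the Z = 11010 example); `DstNode`, `bitAt`, `dstInsert` and `dstContains` are hypothetical names, and nodes are never freed.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kBits = 5;  // key width b (assumption: matches Z = 11010)

// Every node stores a full key; children are chosen by one bit of the key.
struct DstNode {
    std::uint32_t key;
    DstNode* child[2];
    explicit DstNode(std::uint32_t k) : key(k), child{nullptr, nullptr} {}
};

// Bit of `key` at position `level`, counting from the most significant bit.
int bitAt(std::uint32_t key, int level) {
    return static_cast<int>((key >> (kBits - 1 - level)) & 1u);
}

// Follow the key's bits until an empty link is found and hang the key there.
// Returns the root. Visits at most kBits + 1 nodes, i.e. O(b).
DstNode* dstInsert(DstNode* root, std::uint32_t key) {
    assert(key < (1u << kBits));
    if (!root) return new DstNode(key);
    DstNode* n = root;
    for (int level = 0; level < kBits; ++level) {
        if (n->key == key) return root;          // already present
        int b = bitAt(key, level);
        if (!n->child[b]) {
            n->child[b] = new DstNode(key);      // the empty spot
            return root;
        }
        n = n->child[b];
    }
    return root;  // all bits consumed: n holds this key already
}

// Same walk as insertion, comparing the stored key at each node on the path.
bool dstContains(const DstNode* n, std::uint32_t key) {
    for (int level = 0; n; ++level) {
        if (n->key == key) return true;
        if (level == kBits) return false;        // no bits left to branch on
        n = n->child[bitAt(key, level)];
    }
    return false;
}
```

Because a key can sit anywhere along its bit path, an inorder walk of this tree does not visit keys in sorted order, which is the limitation the trie addresses.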

Trie
- How can we keep the BST order?
- Trie: a binary tree with keys at the leaves:
  - for an empty set, a null pointer
  - for a single key, a leaf containing it
  - for many keys, a node with the keys starting with bit 0 in its left subtree and the keys starting with bit 1 in its right subtree

Trie Insertion
- Perform search as usual.
  - If search ends at null link, insert there
  - If the search ends on a leaf, we need to add enough nodes on the way down to differentiate the leaf and the inserted node
- Runtime? O(b), or maybe better!
  - Inserting N random bitstrings requires about lg N bit comparisons on average per insertion
- Note that leaf nodes and internal nodes are different.
  - Wasted space if we use only one sort. (This gets especially significant in a large radix!)
  - Even with different node types, there may be wasted space
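The insertion rule above can be sketched for fixed-width keys as follows. For brevity this illustration uses a single node type for both leaves and internal nodes, which is exactly the space-wasting choice warned about above; a leaner version would use two types. The 5-bit key width and the names `BitTrieNode`, `keyBit`, `bitTrieInsert`, `bitTrieContains` and `collectInOrder` are assumptions made for the example, and nodes are never freed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kKeyBits = 5;  // fixed key width (assumption)

// One node type for both roles: leaves carry a key, internal nodes carry links.
struct BitTrieNode {
    bool isLeaf;
    std::uint32_t key;         // meaningful only when isLeaf is true
    BitTrieNode* child[2];     // meaningful only when isLeaf is false
};

BitTrieNode* makeLeaf(std::uint32_t key) {
    return new BitTrieNode{true, key, {nullptr, nullptr}};
}

int keyBit(std::uint32_t key, int level) {
    return static_cast<int>((key >> (kKeyBits - 1 - level)) & 1u);
}

// Build the chain of internal nodes needed to tell two distinct leaves apart,
// starting at bit position `level`.
BitTrieNode* split(BitTrieNode* a, BitTrieNode* b, int level) {
    assert(level < kKeyBits);  // distinct keys must differ in some bit
    BitTrieNode* parent = new BitTrieNode{false, 0, {nullptr, nullptr}};
    int ba = keyBit(a->key, level);
    int bb = keyBit(b->key, level);
    if (ba != bb) {
        parent->child[ba] = a;
        parent->child[bb] = b;
    } else {
        parent->child[ba] = split(a, b, level + 1);  // still indistinguishable
    }
    return parent;
}

// Search as usual; insert at a null link, or split when a leaf is reached.
BitTrieNode* bitTrieInsert(BitTrieNode* n, std::uint32_t key, int level = 0) {
    assert(key < (1u << kKeyBits));
    if (!n) return makeLeaf(key);
    if (n->isLeaf) {
        if (n->key == key) return n;             // already present
        return split(makeLeaf(key), n, level);
    }
    int b = keyBit(key, level);
    n->child[b] = bitTrieInsert(n->child[b], key, level + 1);
    return n;
}

bool bitTrieContains(const BitTrieNode* n, std::uint32_t key) {
    int level = 0;
    while (n && !n->isLeaf) n = n->child[keyBit(key, level++)];
    return n && n->key == key;
}

// Left-to-right walk over the leaves; yields the keys in ascending order.
void collectInOrder(const BitTrieNode* n, std::vector<std::uint32_t>& out) {
    if (!n) return;
    if (n->isLeaf) { out.push_back(n->key); return; }
    collectInOrder(n->child[0], out);
    collectInOrder(n->child[1], out);
}
```

Unlike the digital search tree, the left-to-right order of the leaves is the sorted order of the keys, so sort and select become possible again.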

R-way Tries
- You can save search time by using a larger radix (at the expense of wasted space…)
- For example, have 26 children of each node, one for each letter of the alphabet

Tries for strings
- 26 pointers per internal node, one for each letter of the alphabet
- What if one word is the prefix of another?
  - Example: aardvark and aardvarkish
  - How do you represent that “aardvark” is a word if that node’s ‘i’ pointer points to another internal node?
  - Add a bit per letter which means “this is a word”
- Keys are stored implicitly – by the sequence of links taken to find them.

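A Trie node for strings

    struct node {
        char isword[26];   // isword[i] != 0: the path to this node plus letter i spells a word
        node *links[26];   // links[i]: child reached by letter i, or null
        node() {
            for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = 0; }
        }
    };

But where is the word stored?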

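Insertion
How do you insert a string into a trie?

    // Map a lowercase letter to its slot, 0..25.
    int index(char ch) { return int(ch - 'a'); }

    // Insert word[pos..] below node n. The last letter gets no node of its own;
    // it is recorded in the isword array of the node reached by the rest of the word.
    void insert(const string &word, node *n, int pos) {
        if (pos >= int(word.size())) return;   // empty word: nothing to insert
        if (pos == int(word.size()) - 1) {
            n->isword[index(word[pos])] = 1;
            return;
        }
        if (n->links[index(word[pos])] == NULL)
            n->links[index(word[pos])] = new node;
        insert(word, n->links[index(word[pos])], pos + 1);
    }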
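Lookup mirrors insertion. The sketch below restates the node layout so that it compiles on its own and adds a hypothetical `containsWord`. It follows the same convention as the insert routine above (the final letter is flagged in `isword` rather than given its own node), assumes lowercase a..z input, and uses the names `WordNode`, `letterIndex` and `insertWord` purely for illustration. Nodes are never freed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same layout as the slide's node: one word flag and one link per letter.
struct WordNode {
    bool isword[26];
    WordNode* links[26];
    WordNode() {
        for (int i = 0; i < 26; ++i) { isword[i] = false; links[i] = nullptr; }
    }
};

// Map a lowercase letter to 0..25 (assumption: input is a..z only).
int letterIndex(char ch) {
    assert(ch >= 'a' && ch <= 'z');
    return ch - 'a';
}

// Iterative form of the insertion: walk or create links for all letters but
// the last, then flag the last letter in the node that was reached.
void insertWord(WordNode* root, const std::string& word) {
    if (word.empty()) return;
    WordNode* n = root;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letterIndex(word[pos]);
        if (!n->links[i]) n->links[i] = new WordNode;
        n = n->links[i];
    }
    n->isword[letterIndex(word[word.size() - 1])] = true;
}

// Follow the same path; the word is present only if the flag for its last
// letter is set. Cost: one array lookup per character.
bool containsWord(const WordNode* root, const std::string& word) {
    if (word.empty()) return false;
    const WordNode* n = root;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        n = n->links[letterIndex(word[pos])];
        if (!n) return false;                    // path does not exist
    }
    return n->isword[letterIndex(word[word.size() - 1])];
}
```

With "aardvark" and "aardvarkish" both inserted, the node reached by "aardvar" has both `isword['k']` set and `links['k']` non-null, which is how a word that is a prefix of another is represented.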

Experimental Results
- In my implementation a node used 132 bytes
- words were read in
- nodes were allocated
- Total space 6,038,604 bytes (compared with 200k size of /usr/dict/words)
- Average word length 7.4 characters
- Average comparisons per search: 7.4 one-character comparisons (compared to 15 word comparisons for a balanced BST)
- Much easier to implement than a balanced BST

Using a Trie: Examples
- Spell checker: fast but big
- Symbol table with lots of short symbols
- Boggle-playing program
  - read /usr/dict/words into a trie
  - generate a 4x4 square of random letters
  - DFS (or BFS) starting at each square, not re-using letters, finding all words from trie…
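The Boggle search could be sketched as a depth-first walk that advances through the board and the trie in lockstep, abandoning a path as soon as the trie has no matching link. The sketch assumes moves to any of the 8 neighbouring squares (the usual Boggle rule, not stated above), lowercase letters only, and the per-letter word-flag layout from the string trie. `BoggleNode`, `Board`, `addWord`, `dfs` and `findWords` are hypothetical names; reading the dictionary file and generating the random board are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Trie node with one word flag and one link per letter.
struct BoggleNode {
    bool isword[26] = {};
    BoggleNode* links[26] = {};
};

// Insert a lowercase word; the last letter is flagged rather than linked.
void addWord(BoggleNode* root, const std::string& word) {
    BoggleNode* n = root;
    for (std::size_t pos = 0; pos < word.size(); ++pos) {
        assert(word[pos] >= 'a' && word[pos] <= 'z');
        int i = word[pos] - 'a';
        if (pos + 1 == word.size()) { n->isword[i] = true; break; }
        if (!n->links[i]) n->links[i] = new BoggleNode;
        n = n->links[i];
    }
}

using Board = std::array<std::string, 4>;  // 4 rows of 4 lowercase letters

// Extend the current path with square (r, c). `n` is the trie node reached by
// the letters already in `prefix`.
void dfs(const Board& board, int r, int c, const BoggleNode* n,
         bool used[4][4], std::string& prefix, std::set<std::string>& found) {
    int i = board[r][c] - 'a';
    used[r][c] = true;
    prefix.push_back(board[r][c]);
    if (n->isword[i]) found.insert(prefix);      // path spells a dictionary word
    if (const BoggleNode* next = n->links[i]) {  // longer words may follow
        for (int dr = -1; dr <= 1; ++dr) {
            for (int dc = -1; dc <= 1; ++dc) {
                int nr = r + dr, nc = c + dc;
                if (nr < 0 || nr >= 4 || nc < 0 || nc >= 4) continue;
                if (used[nr][nc]) continue;      // also skips (r, c) itself
                dfs(board, nr, nc, next, used, prefix, found);
            }
        }
    }
    prefix.pop_back();                           // backtrack
    used[r][c] = false;
}

// Start a search from every square and collect the distinct words found.
std::set<std::string> findWords(const Board& board, const BoggleNode* root) {
    std::set<std::string> found;
    bool used[4][4] = {};
    std::string prefix;
    for (int r = 0; r < 4; ++r) {
        assert(board[r].size() == 4);
        for (int c = 0; c < 4; ++c) dfs(board, r, c, root, used, prefix, found);
    }
    return found;
}
```

The trie is what keeps this tractable: a path is extended only while it is still a prefix of some dictionary word, so most of the board's possible paths are never explored.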