Data Structures and Algorithms for Information Processing

Presentation transcript:

Data Structures and Algorithms for Information Processing Lecture 10: Searching II Lecture 10: Searching

Outline
- One more open-addressing (O/A) scheme: ordered hashing (the Tough Schoolboy problem)
- Analysis of hashing algorithms
- Consistent hashing
- Radix searching

Open vs. Chained Hashing
- How big should the table be?
- Open addressing can be inconvenient when the number of insertions and deletions is unpredictable, because the table can overflow.
- Simple solution to overflow: resize (double) the table and rehash everything into the new table.
- Use Knuth’s approach and double hashing to avoid clustering.

Variant: Ordered Hashing
- In linear probing, we stop a search when we find an empty cell or a record whose key equals the search key.
- In ordered hashing (tough schoolboy hashing), we also stop when we find a key less than or equal to the search key. Larger keys displace smaller ones on insertion, so once a smaller key turns up the search key cannot lie further along the probe sequence.

Tough Schoolboy Hashing
- 13 chairs in the classroom
- Each boy has a preferred seat
- Each boy has a jump value
- Boys later in the alphabet are bigger

Class in the morning: inserts
- Don prefers 3, jumps 2
- Bill prefers 5, jumps 4
- Al prefers 3, jumps 6
- Joe prefers 3, jumps 4

[Diagram sequence: a row of chairs numbered up to 12, one slide per step]
- Don sits in chair 3.
- Bill sits in chair 5.
- Al can’t sit in chair 3 (Don is there), so Al ends up in chair 9.
- Joe kicks Don out of chair 3.
- Don moves to chair 5 and kicks out Bill.
- Bill moves to chair 9; Al and Bill argue and Al gets kicked out.
- Final seating: Al in chair 2, Joe in chair 3, Don in chair 5, Bill in chair 9.

Searching the classroom
- Search for Don, Bill, Al, and Joe
- Search for Ken, who prefers 3 and jumps 1
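The classroom example can be replayed in a few lines of Python. This is a minimal sketch, not the lecture's own code: it assumes chairs are numbered 0 to 12 with wrap-around, that a displaced boy continues from the seat he lost using his own jump value, and that "bigger" means later in alphabetical order. The names `insert`, `search`, and `BOYS` are illustrative.

```python
M = 13  # number of chairs (table size)

# name -> (preferred seat, jump value), as listed on the "inserts" slide
BOYS = {"Don": (3, 2), "Bill": (5, 4), "Al": (3, 6), "Joe": (3, 4)}

def insert(table, name, prefs):
    """Ordered double hashing: the bigger (alphabetically later) boy keeps the seat."""
    seat = prefs[name][0]
    while table[seat] is not None:
        if table[seat] < name:
            # The newcomer is bigger: he takes the seat and the evicted boy moves on.
            table[seat], name = name, table[seat]
        # Whoever is still standing jumps by his own jump value.
        seat = (seat + prefs[name][1]) % M
    table[seat] = name

def search(table, name, home, jump):
    """Return (found, probes). Stop at an empty seat or at a key <= name."""
    seat, probes = home, 1
    while table[seat] is not None and table[seat] > name:
        seat = (seat + jump) % M
        probes += 1
    return table[seat] == name, probes

table = [None] * M
for boy in BOYS:                      # insertion order: Don, Bill, Al, Joe
    insert(table, boy, BOYS)

print({i: b for i, b in enumerate(table) if b})
print(search(table, "Ken", 3, 1))     # unsuccessful search
```

Under these assumptions the final seating matches the last diagram, and the search for Ken stops at the first probe because Joe, who is smaller than Ken, is sitting in chair 3.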

Variant: Ordered Hashing
- This reduces the time of an unsuccessful search to about the same as that of a successful search.
- Useful for applications where we expect a large number of unsuccessful searches.

Summary of Basic Searching
- Hashing is generally preferred to binary tree methods, since it is faster on average.
- But binary search trees are truly dynamic (no advance information on size is needed).
- Balanced BSTs also give worst-case guarantees (a hash function could be lousy).
- BSTs support more operations, such as sorting.

Time Analysis
- Open-address hashing methods store N records in a table of size M, with M > N.
- The performance of the operations depends on the load factor alpha = N/M.
- For chained hashing, alpha may be greater than 1.

Linear Probing
Open-address hashing with linear probing requires, on average:
- 1/2 (1 + 1/(1-alpha)^2) probes for an unsuccessful search
- 1/2 (1 + 1/(1-alpha)) probes for a successful search
E.g., for alpha = 2/3 we make 5 probes for an average unsuccessful search and 2 for a successful search.

Double Hashing
Open-address hashing with double hashing requires, on average:
- 1/(1-alpha) probes for an unsuccessful search
- -ln(1-alpha)/alpha probes for a successful search (natural logarithm)
E.g., for alpha = 2/3 we make 3 probes for an average unsuccessful search and 1.65 for a successful search.

Chained Hashing
Chained hashing requires, on average:
- 1 + alpha probes for an unsuccessful search
- 1 + alpha/2 probes for a successful search
E.g., for alpha = 2/3 we make 1.67 probes for an average unsuccessful search and 1.33 for a successful search.
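The worked figures for alpha = 2/3 can be checked by evaluating the three pairs of formulas directly. The sketch below only transcribes the formulas from the slides; the function names are illustrative, and the logarithm in the double-hashing formula is taken to be the natural log.

```python
import math

def linear_probing(alpha):
    """(unsuccessful, successful) expected probes for linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2), 0.5 * (1 + 1 / (1 - alpha))

def double_hashing(alpha):
    """(unsuccessful, successful) expected probes for double hashing."""
    return 1 / (1 - alpha), -math.log(1 - alpha) / alpha

def chaining(alpha):
    """(unsuccessful, successful) expected probes for chained hashing."""
    return 1 + alpha, 1 + alpha / 2

alpha = 2 / 3
for name, formula in [("linear probing", linear_probing),
                      ("double hashing", double_hashing),
                      ("chaining", chaining)]:
    miss, hit = formula(alpha)
    print(f"{name:15s} unsuccessful={miss:.2f} successful={hit:.2f}")
```

Trying a larger load factor such as 0.9 shows how quickly the linear-probing cost of an unsuccessful search grows compared with the other two schemes.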

Time Analysis
Deriving these formulas requires significant mathematical analysis, which we won’t go into.

[Figure: average number of probes, successful search]

Consistent Hashing
- Not covered in older data structures texts.
- Developed in 1997 by Karger et al. at MIT.
- Gave birth to Akamai.
- At the heart of Chord (a P2P DHT).
- Solves problems in peer-to-peer networks.
- Used in Amazon Dynamo and distributed storage.
- A lightweight alternative to databases: data is stored in memory on many machines rather than on a disk controlled by a DBMS.

Consistent Hashing
- Given a machine’s IP address, we can hash that address with a cryptographic hash. There will likely be no collisions.
- Given an object to store, we can hash that object with the same cryptographic hash. Again, collisions are very unlikely.
- SHA1, e.g., generates values between 0 and (2^160) - 1, so we can imagine objects and computers arranged in a circle, all with unique SHA1 values.
- Create a balanced BST organized by the SHA1 hashes of the IPs. Store (SHA1 hash of IP, IP address) pairs in the tree.
- To place an object, we hash it and do a lookup in the tree. No exact match will occur, but we can find the successor node fast.
- The machine at the successor’s IP is responsible for this object.

Consistent Hashing
Machines and keys share the same address space. SHA1 would produce values between 0 and (2^160) - 1.
Insert machine at IP:
  hash(ip)
  add (hash(ip), ip) pair to BST
  take appropriate keys from successor
Insert object:
  hash(object)
  look in BST for successor’s IP
  send object to machine with IP
Delete machine at IP:
  find successor in BST
  move all items to successor
  remove machine IP
Lookup object:
  hash(object)
  look in BST for successor’s IP
  request object from machine with IP
The BST stores (hash(machine IP), IP) pairs. It is accessible globally, perhaps from a central player or distributed.
Why a BST and not a hash table? A BST is ordered, so a successor is easy to find.
The BST could be stored in each node, but that scales poorly. The next slide shows an approach that scales.
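The four operations above can be sketched in Python as a single-process simulation. Several things here are assumptions rather than part of the lecture: a sorted list searched with `bisect` stands in for the balanced BST (it supports the same successor query), each machine's memory is simulated by a dictionary, objects are identified by string keys, and the class and method names (`Ring`, `add_machine`, `put`, and so on) are illustrative.

```python
import bisect
import hashlib

def sha1_int(text):
    """SHA1 digest of a string as an integer in 0 .. 2**160 - 1."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")

class Ring:
    def __init__(self):
        self.nodes = []   # sorted (hash(ip), ip) pairs; stands in for the BST
        self.store = {}   # ip -> {key: value}; simulates each machine's memory

    def successor_ip(self, h):
        """IP of the first machine at or after hash h, wrapping around the circle."""
        i = bisect.bisect_left(self.nodes, (h, ""))
        return self.nodes[i % len(self.nodes)][1]

    def add_machine(self, ip):
        h = sha1_int(ip)
        bisect.insort(self.nodes, (h, ip))
        self.store[ip] = {}
        # Keys that now belong to the new machine can only sit on its successor.
        succ = self.successor_ip(h + 1)
        if succ != ip:
            moving = [k for k in self.store[succ]
                      if self.successor_ip(sha1_int(k)) == ip]
            for key in moving:
                self.store[ip][key] = self.store[succ].pop(key)

    def remove_machine(self, ip):
        self.nodes.remove((sha1_int(ip), ip))
        items = self.store.pop(ip)
        if self.nodes:
            # Move all items to the departing machine's successor.
            self.store[self.successor_ip(sha1_int(ip))].update(items)

    def put(self, key, value):
        self.store[self.successor_ip(sha1_int(key))][key] = value

    def get(self, key):
        return self.store[self.successor_ip(sha1_int(key))].get(key)
```

A short usage example: create a `Ring`, call `add_machine` for a few IP strings, `put` some objects, then add or remove a machine and confirm that `get` still finds every object. Only the keys between a machine and its predecessor ever move, which is the point of the scheme.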

A 16-node Chord Network
[Diagram by Seth Terashima: a ring of 16 nodes, with machines A and B marked]
- Each machine maintains a table of log(n) entries, where n is the size of the ring (16 in this example), so A’s table has four entries.
- If you were using SHA1, you would need 160 entries. This scales well.
- Suppose we ask machine A to find a value stored on B.

A 16-node Chord Network
[Diagram by Seth Terashima]
- In the worst case, each hop cuts the part of the ring still to be searched in half.
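The halving behaviour can be illustrated with a small simulation of Chord-style lookup on a 16-position ring. This is a sketch under stated assumptions, not the protocol itself: finger tables are computed from a global sorted list of node identifiers instead of being built by joins and stabilization, entry i of a node's table points to the first node at or after (n + 2^i) mod 16, and all names (`fingers`, `lookup`, `successor`) are illustrative.

```python
M_BITS = 4
RING = 2 ** M_BITS   # 16 positions on the ring

def in_half_open(x, a, b):
    """True if x lies in (a, b] walking clockwise; a == b means the whole ring."""
    x, b = (x - a) % RING, (b - a) % RING
    return (0 < x <= b) if b else True

def in_open(x, a, b):
    """True if x lies in (a, b) walking clockwise; a == b means everything but a."""
    x, b = (x - a) % RING, (b - a) % RING
    return (0 < x < b) if b else (x != 0)

def successor(nodes, ident):
    """First node at or clockwise after ident; nodes must be sorted."""
    ident %= RING
    return next((n for n in nodes if n >= ident), nodes[0])

def fingers(nodes, n):
    """Finger table of node n: log2(RING) entries at distances 1, 2, 4, 8."""
    return [successor(nodes, n + 2 ** i) for i in range(M_BITS)]

def lookup(nodes, start, key):
    """Return (node responsible for key, number of hops between machines)."""
    n, hops = start, 0
    while True:
        table = fingers(nodes, n)
        if in_half_open(key, n, table[0]):
            return table[0], hops          # n's immediate successor owns the key
        # Forward to the farthest finger that does not overshoot the key.
        n = next((f for f in reversed(table) if in_open(f, n, key)), table[0])
        hops += 1

full_ring = list(range(RING))
print(fingers(full_ring, 0))      # four entries for node 0
print(lookup(full_ring, 0, 15))   # owner of key 15 and hops taken from node 0
```

With all 16 positions occupied, no lookup in this sketch needs more than four hops, which matches the four-entry table described on the slide.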

Radix Searching
- For many applications, keys can be thought of as numbers.
- Searching methods that take advantage of the digital properties of these keys are called radix searches.
- Radix searches treat keys as numbers in base M (the radix) and work with individual digits.

Radix Searching
- Provides reasonable worst-case performance without the complication of balanced trees.
- Provides a way to handle variable-length keys.
- Competes with BSTs and hash tables.

The Simplest Radix Search
- Digital search trees: like BSTs, but branch according to the key’s bits.
- Key comparison for branching is replaced by a function that accesses the key’s next bit.
- Works for variable-length keys.
- Data is not sorted by key.

[Figure: digital search example]

Digital Search Trees
- Requires O(log N) comparisons on average for lookup. Why?
- Requires b comparisons in the worst case for a tree built with N random b-bit keys.
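A digital search tree can be sketched as follows. The example assumes fixed-width 5-bit keys, with each letter encoded as its position in the alphabet (A = 00001, S = 10011), and bits examined from the most significant end; the insertion order and all names are illustrative. Each node holds a full key, and the bit at the current depth decides the branch.

```python
BITS = 5  # assumed key width

def bit(key, depth):
    """Bit number `depth` of key, counting from the most significant of BITS bits."""
    return (key >> (BITS - 1 - depth)) & 1

class Node:
    def __init__(self, key):
        self.key = key
        self.child = [None, None]   # child[0] for a 0 bit, child[1] for a 1 bit

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return Node(key)
    node, depth = root, 0
    while node.key != key:                 # full key comparison at every node
        b = bit(key, depth)
        if node.child[b] is None:
            node.child[b] = Node(key)
            break
        node, depth = node.child[b], depth + 1
    return root

def search(root, key):
    node, depth = root, 0
    while node is not None and node.key != key:
        node, depth = node.child[bit(key, depth)], depth + 1
    return node is not None

def code(letter):
    return ord(letter) - ord("A") + 1      # A -> 1 (00001), S -> 19 (10011)

root = None
for letter in "ASERCH":
    root = insert(root, code(letter))
print(search(root, code("R")), search(root, code("Z")))
```

A node at depth d agrees with every key below it on the first d bits, so with distinct b-bit keys no path is longer than b, which is where the worst-case bound comes from.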

Digital Search
- Problem: at each node we make a full key comparison, which may be expensive (e.g., for very long keys).
- Solution: store keys only at the leaves, and use radix expansion to do intermediate key comparisons.

Radix Tries
- Used for retrieval.
- Internal nodes are used for branching; external nodes are used for the final key comparison and to store data.

Radix Trie Example
[Figure: radix trie with the keys H, E, A, C, S, R at the leaves]
Key codes: A 00001, S 10011, E 00101, R 10010, C 00011

Radix Tries
- The left subtree has all keys with 0 for the leading bit; the right subtree has all keys with 1 for the leading bit.
- An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case.
- Note that the leaves are in key order, so a traversal yields the keys sorted in O(n) time.
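A binary radix trie with keys stored only at the leaves might look like the sketch below. It assumes 5-bit keys (letters encoded by their position in the alphabet) examined from the leading bit; the class and function names are illustrative. Internal nodes carry no key, and one full key comparison happens at the leaf.

```python
BITS = 5  # assumed key width

def bit(key, depth):
    """Bit number `depth` of key, counting from the leading bit."""
    return (key >> (BITS - 1 - depth)) & 1

class Leaf:
    def __init__(self, key):
        self.key = key

class Branch:
    def __init__(self):
        self.child = [None, None]

def insert(node, key, depth=0):
    """Insert key below node and return the node that now roots this subtree."""
    if node is None:
        return Leaf(key)
    if isinstance(node, Leaf):
        if node.key == key:
            return node
        # Split: replace the leaf by a branch and push the old key one level down.
        old, node = node, Branch()
        node.child[bit(old.key, depth)] = old
    b = bit(key, depth)
    node.child[b] = insert(node.child[b], key, depth + 1)
    return node

def search(node, key):
    depth = 0
    while isinstance(node, Branch):          # bit tests only on the way down
        node, depth = node.child[bit(key, depth)], depth + 1
    return node is not None and node.key == key   # single full comparison

def in_order(node):
    """Keys at the leaves, left to right."""
    if node is None:
        return []
    if isinstance(node, Leaf):
        return [node.key]
    return in_order(node.child[0]) + in_order(node.child[1])

trie = None
for letter in "ASERCH":
    trie = insert(trie, ord(letter) - ord("A") + 1)
print(in_order(trie))   # leaves come out in sorted key order
```

Keys that share a long prefix, such as R (10010) and S (10011), force a chain of branch nodes down to the first differing bit.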

Radix Tries
- Problem: lots of extra nodes for keys that differ only in their low-order bits (see the R and S nodes in the example above).
- This is addressed by Patricia trees, which allow “lookahead” to the next relevant bit.
- Patricia: Practical Algorithm To Retrieve Information Coded In Alphanumeric.
- In the slides that follow, the entire alphabet would be included in the indexes.
- Review two radix tries and then a Patricia tree.

Radix Trie
[Figure: an empty radix trie whose node has the index slots # A E I P R; insert “ARA”]

Radix Trie
[Figure: the trie holding ARA; insert “AREA”]

Radix Trie
[Figure: the trie holding ARA and AREA; insert “A”]

Radix Trie
[Figure: a larger radix trie, each node with index slots # A E I P R, holding the keys A, ARA, ARE, AREA, EERIE, EIRE, ERA, ERE, ERIE, IPA, IRE, PEAR, PEER, PER, PIER]
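The insertions shown in these figures can be imitated with a character-indexed trie built from nested dictionaries. This is a sketch based on one reading of the diagrams, and the details are assumptions: `#` is taken to be the slot for a key that ends at that node, a key is stored whole at the first slot where it is the only candidate, and a slot is split one level deeper when a second key arrives. The function names are illustrative.

```python
END = "#"   # assumed meaning: slot for a key that ends at this node

def symbol(key, depth):
    """Index symbol of key at this depth, or END once the key is exhausted."""
    return key[depth] if depth < len(key) else END

def insert(node, key, depth=0):
    """node is a dict mapping a symbol to a stored key (str) or a subtrie (dict)."""
    sym = symbol(key, depth)
    child = node.get(sym)
    if child is None:
        node[sym] = key                      # first key under this slot
    elif isinstance(child, dict):
        insert(child, key, depth + 1)
    elif child != key:
        # Two keys share this slot: split it and push both one level down.
        sub = {}
        node[sym] = sub
        insert(sub, child, depth + 1)
        insert(sub, key, depth + 1)

def contains(node, key, depth=0):
    child = node.get(symbol(key, depth))
    if isinstance(child, dict):
        return contains(child, key, depth + 1)
    return child == key                      # one full comparison at the leaf

trie = {}
for word in ["ARA", "AREA", "A"]:
    insert(trie, word)
print(trie)
```

Inserting two keys with a long common prefix under this scheme creates one node per shared character before the keys finally separate.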

Radix Trie: What’s the problem?
[Figure: radix trie holding ADAM, LOGGIA, LOGGING, LOGGED, LOGGERHEAD; the L branch passes through a chain of nodes for O, G, G before the keys diverge]

Patricia Tree
[Figure: Patricia tree holding ADAM, LOGGIA, LOGGING, LOGGED, LOGGERHEAD; the root branches on A and L, and internal nodes carry the numeric labels 4, 1, 1 in place of the one-way chains]