CSCE350 Algorithms and Data Structure


CSCE350 Algorithms and Data Structure Lecture 16 Jianjun Hu Department of Computer Science and Engineering University of South Carolina 2009.11.

Chapter 7: Space-Time Tradeoffs
For many problems some extra space really pays off:
- extra space in tables ("breathing room"): hashing; non-comparison-based sorting
- input enhancement: indexing schemes (e.g., B-trees); auxiliary tables (shift tables for pattern matching)
- tables of information that do all the work: dynamic programming

String Matching
- pattern: a string of m characters to search for
- text: a (long) string of n characters to search in
Brute-force algorithm:
1. Align the pattern at the beginning of the text.
2. Moving from left to right, compare each character of the pattern to the corresponding character in the text, until either all characters match (successful search) or a mismatch is detected.
3. While the pattern is not found and the text is not yet exhausted, realign the pattern one position to the right and repeat step 2.
What is the complexity of brute-force string matching? In the worst case it is Θ(nm): up to m comparisons at each of about n alignments.
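The brute-force procedure above can be sketched as a short C function; the function name and signature are my own illustration, not from the slides:

```c
#include <string.h>

/* Brute-force string matching: try every alignment of P in T,
 * comparing left to right. Returns the index of the first
 * occurrence of P in T, or -1 if P does not occur. */
int brute_force_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    for (int i = 0; i <= n - m; i++) {        /* step 1/3: each alignment */
        int j = 0;
        while (j < m && P[j] == T[i + j])     /* step 2: compare characters */
            j++;
        if (j == m)                           /* all m characters matched */
            return i;
    }
    return -1;                                /* text exhausted */
}
```

The worst case Θ(nm) arises for inputs like T = "aaa…aab", P = "aaab", where almost every alignment is abandoned only at the last comparison.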

String Searching: History
- 1970: Cook shows (using finite-state machines) that the problem can be solved in time proportional to n + m.
- 1976: Knuth and Pratt find an algorithm based on Cook's idea; Morris independently discovers the same algorithm in an attempt to avoid "backing up" over the text.
- At about the same time, Boyer and Moore find an algorithm that examines only a fraction of the text in most cases (by comparing characters of the pattern and text from right to left instead of left to right).
- 1980: Another algorithm, proposed by Rabin and Karp, virtually always runs in time proportional to n + m, extends easily to two-dimensional pattern matching, and is almost as simple as the brute-force method.

Horspool's Algorithm
A simplified version of the Boyer-Moore algorithm that retains its key insights:
- compare pattern characters to the text from right to left
- given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement)

Consider the Problem
Search for the pattern BARBER in some text:
- Compare the pattern to the current text position from right to left.
- If a whole match is found, done.
- Otherwise, decide the shift distance of the pattern (how far to move it to the right).
There are four cases, depending on the text character 'c' aligned with the last character of the pattern.
(Diagram: pattern BARBER aligned under a text s0 … c … sn-1.)

Shift Distance, Case 1: there is no 'c' anywhere in the pattern. Shift by m, the length of the pattern.
(Diagram: BARBER shifted completely past the mismatched text character.)

Shift Distance, Case 2: 'c' occurs in the pattern, but not in the last position. Shift so that the rightmost occurrence of 'c' in the pattern aligns with the 'c' in the text.
(Diagram: BARBER realigned on its rightmost B.)

Shift Distance, Case 3: 'c' matches the last character of the pattern, but there is no 'c' among the other m−1 characters. As in Case 1, shift by m.
(Diagram: pattern LEADER shifted past the matched R.)

Shift Distance, Case 4: 'c' matches the last character of the pattern, and 'c' also occurs among the other m−1 characters. As in Case 2, shift to align the rightmost of those other occurrences with the 'c' in the text.
(Diagram: pattern ORDER realigned on its earlier R.)

Shift Table for the pattern BARBER
We can precompute the shift distance for every possible character 'c' (given a pattern):

c      A  B  C  D  E  F  …  R  …  Z  _
t(c)   4  2  6  6  1  6  6  3  6  6  6

Every character that does not occur in the first m−1 positions of the pattern (including the space '_') gets t(c) = m = 6.

Example
See Section 7.2 for the pseudocode of the shift-table construction algorithm and of Horspool's algorithm.
Example: find the pattern BARBER in the text JIM_SAW_ME_IN_A_BARBERSHOP.
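Since the textbook's pseudocode is not reproduced here, the following is a sketch of both pieces in C, following Levitin's Section 7.2 description; the function names are my own:

```c
#include <string.h>
#include <limits.h>

/* Shift-table construction: default shift is m (cases 1 and 3);
 * each character in P[0..m-2] gets the distance from its rightmost
 * occurrence to the pattern's last position (cases 2 and 4). */
void build_shift_table(const char *P, int m, int table[UCHAR_MAX + 1]) {
    for (int c = 0; c <= UCHAR_MAX; c++)
        table[c] = m;
    for (int j = 0; j < m - 1; j++)
        table[(unsigned char)P[j]] = m - 1 - j;
}

/* Horspool's algorithm: scan right to left; on a mismatch, shift by
 * the table entry of the TEXT character aligned with the pattern's
 * last position. Returns the start index of a match, or -1. */
int horspool_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    int table[UCHAR_MAX + 1];
    build_shift_table(P, m, table);
    int i = m - 1;                       /* index in T under P's last char */
    while (i < n) {
        int k = 0;
        while (k < m && P[m - 1 - k] == T[i - k])
            k++;
        if (k == m)
            return i - m + 1;            /* full match found */
        i += table[(unsigned char)T[i]];
    }
    return -1;
}
```

For BARBER this reproduces the shift table above (t(A)=4, t(B)=2, t(E)=1, t(R)=3, everything else 6) and finds the pattern at position 16 of JIM_SAW_ME_IN_A_BARBERSHOP.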

Algorithm Efficiency
The worst-case complexity is Θ(nm). On average it is Θ(n), and it is usually much faster than the brute-force algorithm.
A simple exercise: create the shift table over the 26 letters and the space character for the pattern BAOBAB.

Boyer-Moore Algorithm
Based on the same two ideas:
- compare pattern characters to the text from right to left
- given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement)
It uses an additional shift table built with the same idea applied to the number of matched characters. The worst-case efficiency of the Boyer-Moore algorithm is linear. See Section 7.2 of the textbook for details.

Boyer-Moore Algorithm
(Diagram: the bad-symbol shift, realigning the pattern on the mismatched text character.)

Space and Time Tradeoffs: Hashing
A very efficient method for implementing a dictionary, i.e., a set with the operations:
- insert
- find
- delete
Applications: databases, symbol tables.
Two ways of addressing: by index number, or by content.

Example Application
How do we store student records in a data structure?
(Diagram: records keyed by SSN, e.g. xxx-xx-6453 (Jeffrey, Los Angeles), xxx-xx-2038, xxx-xx-0913, xxx-xx-4382, xxx-xx-9084, xxx-xx-2498, placed into numbered buckets.)

Hash Tables and Hash Functions
Hash table: an array with indices that correspond to buckets.
Hash function: determines the bucket for each record.
Example: student records with key = SSN. Hash function: h(k) = k mod m (k is a key and m is the number of buckets). If m = 1000, where is the record with SSN = 315-17-4251 stored? In bucket 315174251 mod 1000 = 251.
A hash function must:
- be easy to compute
- distribute keys evenly throughout the table

Collisions
If h(k1) = h(k2), there is a collision. Good hash functions result in fewer collisions, but collisions can never be completely eliminated. Two types of hashing handle collisions differently:
- Open hashing: each bucket points to a linked list of all keys hashing to it.
- Closed hashing: one key per bucket; in case of collision, find another bucket for one of the keys (this requires a collision-resolution strategy):
  - linear probing: use the next bucket
  - double hashing: use a second hash function to compute the increment
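Open hashing can be sketched in a few lines of C: each bucket heads a linked list of all keys that hash to it. The struct and function names below are illustrative choices, not from the slides:

```c
#include <stdlib.h>

#define M 10                       /* number of buckets */

typedef struct node {
    long key;
    struct node *next;
} node;

static node *buckets[M];           /* all NULL initially */

static int hash(long k) { return (int)(k % M); }

/* Insert by prepending to the bucket's linked list. */
void chain_insert(long k) {
    node *p = malloc(sizeof *p);
    p->key = k;
    p->next = buckets[hash(k)];
    buckets[hash(k)] = p;
}

/* Walk the one list the key can be in. */
int chain_find(long k) {
    for (node *p = buckets[hash(k)]; p != NULL; p = p->next)
        if (p->key == k)
            return 1;
    return 0;
}
```

Note that colliding keys (e.g. 6453 and 0913, both hashing to bucket 3 when m = 10) simply share a list, so the table keeps working even when n > m.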

Example of Open Hashing
Store the student records in 10 buckets using the hash function h(SSN) = SSN mod 10:
bucket 2: 4382
bucket 3: 6453, 0913
bucket 4: 9084
bucket 8: 2038, 2498
(all other buckets are empty)

Open Hashing: Analysis
If the hash function distributes keys uniformly, the average length of a linked list will be α = n/m (the load factor).
Average number of probes ≈ 1 + α/2.
The worst case is still linear!
Open hashing still works if n > m.

Example: Closed Hashing (Linear Probing)
Inserting the same keys (6453, 2038, 0913, 4382, 9084, 2498) with h(SSN) = SSN mod 10 and linear probing:
bucket 2: 4382
bucket 3: 6453
bucket 4: 0913 (collision at bucket 3, probed to 4)
bucket 5: 9084 (collision at bucket 4, probed to 5)
bucket 8: 2038
bucket 9: 2498 (collision at bucket 8, probed to 9)
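A minimal linear-probing sketch, assuming a fixed table of 10 cells and a separate occupancy flag (the names and the "return the bucket index" convention are mine, chosen so the placements above can be checked):

```c
#define TSIZE 10

static long cells[TSIZE];          /* stored keys */
static int  used[TSIZE];           /* 0 = empty bucket */

/* Insert k at h(k) = k mod TSIZE, probing forward on collision.
 * Returns the bucket used, or -1 if the table is full. */
int lp_insert(long k) {
    int i = (int)(k % TSIZE);
    for (int probes = 0; probes < TSIZE; probes++) {
        int j = (i + probes) % TSIZE;
        if (!used[j]) {
            cells[j] = k;
            used[j] = 1;
            return j;
        }
    }
    return -1;                     /* unlike open hashing, fails if n > m */
}

/* Probe forward from h(k); an empty cell means k is absent
 * (valid only because this sketch never deletes). */
int lp_find(long k) {
    int i = (int)(k % TSIZE);
    for (int probes = 0; probes < TSIZE; probes++) {
        int j = (i + probes) % TSIZE;
        if (!used[j])
            return -1;
        if (cells[j] == k)
            return j;
    }
    return -1;
}
```

Inserting the slide's keys in order reproduces the layout above: 0913 lands in bucket 4, 9084 in bucket 5, and 2498 in bucket 9.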

Hash Function for Strings
h(SSN) = SSN mod 10 requires the key to be an integer. What if the key is a string, e.g. "ABADGUYISMOREWELCOME" as the key for Bill's record?

Hash Function for Strings
One answer: hash the characters of the string into an integer first, e.g. with the well-known sdbm hash (shown here in ANSI C rather than the original K&R style):

/* sdbm string hash: fold each character into the running hash */
static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while ((c = *str++) != 0)
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}

The resulting unsigned long can then be reduced with mod m to pick a bucket, exactly as with integer keys.

Closed Hashing (Linear Probing): Analysis
- Avoids pointers.
- Does not work if n > m.
- Deletions are not straightforward.
- The number of probes to insert/find/delete a key depends on the load factor α = n/m (the hash table density):
  successful search: (1/2)(1 + 1/(1 − α))
  unsuccessful search: (1/2)(1 + 1/(1 − α)²)
As the table gets filled (α approaches 1), the number of probes increases dramatically.
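The two probe-count formulas are easy to evaluate directly; this small sketch (function names mine) shows how sharply the cost grows as α approaches 1:

```c
/* Expected probes for a successful search under linear probing:
 * S(a) = (1/2)(1 + 1/(1-a)). */
double probes_success(double a) {
    return 0.5 * (1.0 + 1.0 / (1.0 - a));
}

/* Expected probes for an unsuccessful search:
 * U(a) = (1/2)(1 + 1/(1-a)^2). */
double probes_unsuccess(double a) {
    double d = 1.0 - a;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

At α = 0.5 the estimates are 1.5 and 2.5 probes; at α = 0.9 they jump to about 5.5 and 50.5, which is the "increases dramatically" on the slide.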

B-Trees
B-trees organize data for fast queries: an index for fast search. For large datasets of structured records, B-tree indexing is used.

Motivation (cont.)
Assume that we use an AVL tree to store about 20 million records. We end up with a very deep binary tree requiring many disk accesses: log2 20,000,000 is about 24, so a search takes about 24 disk accesses, roughly 0.2 seconds. We know we can't improve on the log n lower bound on search for a binary tree. The solution is to use more branches and thus reduce the height of the tree: as branching increases, depth decreases.
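The "more branches, less depth" argument in numbers: the height needed to reach n records with branching factor b is about ⌈log_b n⌉. A quick computation (my own illustration, treating one level as one disk access):

```c
/* Smallest number of levels whose combined branching reaches n records:
 * multiply the reach by b once per level until it covers n. */
int levels(long n, long b) {
    int h = 0;
    long reach = 1;
    while (reach < n) {
        reach *= b;
        h++;
    }
    return h;
}
```

For 20 million records, binary branching needs 25 levels, while a node with 1000 children needs only 3; that is the difference between ~25 disk accesses and ~3.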

Constructing a B-tree
(Figure: a B-tree with root [17]; its left child [3 | 8] has leaves [1 2], [6 7], [12 14 16]; its right child [28 | 48] has leaves [25 26], [29 45], [52 53 55 68].)

Definition of a B-tree
A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys of the children in the fashion of a search tree;
2. all leaves are on the same level;
3. all non-leaf nodes except the root have at least ⌈m/2⌉ children;
4. the root is either a leaf node, or it has from two to m children;
5. a leaf node contains no more than m − 1 keys.
The number m should always be odd.

Comparing Trees
Binary trees:
- can become unbalanced and lose their good time complexity (big O)
- AVL trees are strict binary trees that overcome the balance problem
- heaps remain balanced but only prioritise (do not totally order) the keys
Multi-way trees:
- B-trees can be m-way; they can have any (odd) number of children
- one B-tree, the 2-3 (or 3-way) B-tree, approximates a permanently balanced binary tree, exchanging the AVL tree's balancing operations for insertion and (more complex) deletion operations

Announcement
Midterm Exam 2 statistics:
- Points possible: 100
- Class average: 85 (expected)
- High score: 100