CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #14: Text – Part I C. Faloutsos.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

Chapter 7 Space and Time Tradeoffs Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - IV Grid files, dim. curse C. Faloutsos.
Indexing DNA Sequences Using q-Grams
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#1: Introduction Christos Faloutsos CMU
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals - case studies - I C. Faloutsos.
Two-dimensional pattern matching M.G.W.H. van de Rijdt 23 August 2005.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#2: Primary key indexing – B-trees Christos Faloutsos - CMU
Space-for-Time Tradeoffs
String Searching Algorithm
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture#1: Introduction Christos Faloutsos CMU
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals - case studies Part III (regions, quadtrees, knn queries) C. Faloutsos.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Design and Analysis of Algorithms - Chapter 71 Space-time tradeoffs For many problems some extra space really pays off (extra space in tables - breathing.
Multimedia and Text Indexing. Multimedia Data Management The need to query and analyze vast amounts of multimedia data (i.e., images, sound tracks, video.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases Text - part I Slides by C. Faloutsos, CMU.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.
Text and Web Search. Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Modern Information Retrieval Chapter 4 Query Languages.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.
A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#1: Introduction Christos Faloutsos CMU
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
CMU SCS : Multimedia Databases and Data Mining Lecture #12: Fractals - case studies Part III (quadtrees, knn queries) C. Faloutsos.
Design and Analysis of Algorithms - Chapter 71 Space-time tradeoffs For many problems some extra space really pays off: b extra space in tables (breathing.
Information Retrieval CSE 8337 Spring 2005 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CSG523/ Desain dan Analisis Algoritma
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
CS 430: Information Discovery
String Processing.
Indexing and Searching (File Structures)
CSCE350 Algorithms and Data Structure
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Tuesday, 12/3/02 String Matching Algorithms Chapter 32
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Chapter 7 Space and Time Tradeoffs
Knuth-Morris-Pratt Algorithm.
Chap 3 String Matching 3 -.
15-826: Multimedia Databases and Data Mining
String Processing.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Space-for-time tradeoffs
15-826: Multimedia Databases and Data Mining
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Lecture #14: Text – Part I C. Faloutsos

CMU SCS Copyright: C. Faloutsos (2012) Must-read Material MM Textbook, Chapter

CMU SCS Copyright: C. Faloutsos (2012) Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

CMU SCS Copyright: C. Faloutsos (2012) Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text multimedia

CMU SCS Copyright: C. Faloutsos (2012) Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI

CMU SCS Copyright: C. Faloutsos (2012) Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications:

CMU SCS Copyright: C. Faloutsos (2012) Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications: –Web –law + patent offices –digital libraries –information filtering

CMU SCS Copyright: C. Faloutsos (2012) Problem - Motivation Types of queries: –boolean (‘data’ AND ‘retrieval’ AND NOT...)

CMU SCS Copyright: C. Faloutsos (2012) Problem - Motivation Types of queries: –boolean (‘data’ AND ‘retrieval’ AND NOT...) –additional features (‘data’ ADJACENT ‘retrieval’) –keyword queries (‘data’, ‘retrieval’) How to search a large collection of documents?

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning Build a FSA; scan c a t

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning for single term: –(naive: O(N*M)) ABRACADABRAtext CAB pattern

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning for single term: –(naive: O(N*M)) –Knuth Morris and Pratt (‘77) build a small FSA; visit every text letter once only, by carefully shifting more than one step ABRACADABRAtext CAB pattern

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning ABRACADABRAtext CAB pattern CAB

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning for single term: –(naive: O(N*M)) –Knuth Morris and Pratt (‘77) –Boyer and Moore (‘77) preprocess pattern; start from right to left & skip! ABRACADABRAtext CAB pattern

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning ABRACADABRAtext CAB pattern CAB

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning ABRACADABRAtext OMINOUS pattern OMINOUS Boyer+Moore: fastest, in practice Sunday (‘90): some improvements

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75) –again, build a simplified FSA in O(M) time Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92]

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning Approximate matching - string editing distance: d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second SURVEY SURGERY

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning string editing distance - how to compute? A:

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning string editing distance - how to compute? A: dynamic programming cost( i, j ) = cost to match prefix of length i of first string s with prefix of length j of second string t

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1) else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )

CMU SCS Copyright: C. Faloutsos (2012) String editing distance  SURVEY  S1 U2 R3 G4 E5 R6 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S1 U2 R3 G4 E5 R6 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S10 U2 R3 G4 E5 R6 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U210 R32 G43 E54 R65 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R321 G432 E543 R654 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G4321 E5432 R6543 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E54322 R65433 Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y

CMU SCS Copyright: C. Faloutsos (2012) String editing distance SURVEY S U R G E R Y subst. del

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)

CMU SCS Copyright: C. Faloutsos (2012) Full-text scanning Conclusions: Full text scanning needs no space overhead, but is slow for large datasets