Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #14: Text – Part I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #15: Text - part II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
Space-for-Time Tradeoffs
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)
15-826: Multimedia Databases and Data Mining
Inverted Index Hongning Wang
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Modern Information Retrieval Chapter 8 Indexing and Searching.
Information Retrieval in Practice
Modern Information Retrieval
Information Retrieval (IR)
1 Indexing and Searching (File Structures) Modern Information Retrieval (C hapter 8) With G. Navarro.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
Multimedia and Text Indexing. Multimedia Data Management The need to query and analyze vast amounts of multimedia data (i.e., images, sound tracks, video.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Multimedia Databases Text - part I Slides by C. Faloutsos, CMU.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.
ECE 526 – Network Processing Systems Design Network Security: string matching algorithm Chapter 17: George Varghese.
Text and Web Search. Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.
A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
Overview of Search Engines
Fast Set Intersection in Memory Bolin Ding Arnd Christian König UIUC Microsoft Research.
L. Padmasree Vamshi Ambati J. Anand Chandulal J. Anand Chandulal M. Sreenivasa Rao M. Sreenivasa Rao Signature Based Duplicate Detection in Digital Libraries.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
CS 430: Information Discovery
Chapter 6: Information Retrieval and Web Search
E.G.M. PetrakisSearching Signals and Patterns1  Given a query Q and a collection of N objects O 1,O 2,…O N search exactly or approximately  The ideal.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Design and Analysis of Algorithms - Chapter 71 Space-time tradeoffs For many problems some extra space really pays off: b extra space in tables (breathing.
Information Retrieval CSE 8337 Spring 2005 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Data and Applications Security Developments and Directions Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #15 Secure Multimedia Data.
Evidence from Content INST 734 Module 2 Doug Oard.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
Why indexing? For efficient searching of a document
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Implementation Issues & IR Systems
CS 430: Information Discovery
13 Text Processing Hongfei Yan June 1, 2016.
Indexing and Searching (File Structures)
Fast Fourier Transform
Multimedia and Text Indexing
Space-for-time tradeoffs
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Multimedia and Text Indexing
15-826: Multimedia Databases and Data Mining
Chapter 7 Space and Time Tradeoffs
Space-for-time tradeoffs
Space-for-time tradeoffs
Chap 3 String Matching 3 -.
15-826: Multimedia Databases and Data Mining
Presentation transcript:

Multimedia Databases Text I

Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video databases Time Series databases Data Mining

Text - Detailed outline Text databases problem full text scanning inversion signature files clustering information filtering and LSI

Problem - Motivation Given a database of documents, find documents containing “data”, “retrieval” Applications:

…find documents containing “data”, “retrieval” Applications: Web law + patent offices digital libraries information filtering Problem - Motivation

Types of queries: boolean (‘data’ AND ‘retrieval’ AND NOT...) additional features (‘data’ ADJACENT ‘retrieval’) keyword queries (‘data’, ‘retrieval’) How to search a large collection of documents? Problem - Motivation

Full-text scanning Build a FSA; scan c a t

Full-text scanning for single term: (naive: O(N*M)) ABRACADABRAtext CAB pattern

for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77) build a small FSA; visit every text letter once only, by carefully shifting more than one step ABRACADABRAtext CAB pattern Full-text scanning

ABRACADABRAtext CAB pattern CAB... Full-text scanning

for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77) Boyer and Moore (‘77) preprocess pattern; start from right to left & skip! ABRACADABRAtext CAB pattern Full-text scanning

ABRACADABRAtext CAB pattern CAB Full-text scanning

ABRACADABRAtext OMINOUS pattern OMINOUS Boyer+Moore: fastest, in practice Sunday (‘90): some improvements Full-text scanning

For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75) again, build a simplified FSA in O(M) time Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92] Full-text scanning

Approximate matching - string editing distance: d(‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second SURVEY SURGERY Full-text scanning

string editing distance - how to compute? A: dynamic programming Idea cost( i, j ) = cost to match prefix of length i of first string s with prefix of length j of second string t Full-text scanning

if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1) else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion ) Full-text scanning

Complexity: O(M*N) Conclusions: Full text scanning needs no space overhead, but is slow for large datasets Full-text scanning

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Text – Inverted Files

Q: space overhead? Text – Inverted Files A: mainly, the postings lists

how to organize dictionary? stemming – Y/N? Keep only the root of each word ex. inverted, inversion  invert insertions? Text – Inverted Files

how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA trees,... stemming – Y/N? insertions? Text – Inverted Files

Other topics: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+] ‘zipf’ distributions Approximate searching (‘glimpse’ [Wu+]) Text – Inverted Files

postings list – more Zipf distr.: eg., rank- frequency plot of ‘Bible’ log(rank) log(freq) freq ~ 1/rank / ln(1.78V) Text – Inverted Files

postings lists Cutting+Pedersen (keep first 4 in B-tree leaves) how to allocate space: [Faloutsos+92] geometric progression compression (Elias codes) [Zobel+] – down to 2% overhead! Text – Inverted Files

Conclusions: needs space overhead (2%-300%), but it is the fastest Text – Inverted Files

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Signature files idea: ‘quick & dirty’ filter

then, do seq. scan on sign. file and discard ‘false alarms’ Adv.: easy insertions; faster than seq. scan Disadv.: O(N) search (with small constant) Q: how to extract signatures? Signature files

A: superimposed coding!! [Mooers49],... m (=4 bits/word) F (=12 bits sign. size) Signature files

A: superimposed coding!! [Mooers49],... data actual match Signature files

A: superimposed coding!! [Mooers49],... retrieval actual dismissal Signature files

A: superimposed coding!! [Mooers49],... nucleotic false alarm (‘false drop’) Signature files

A: superimposed coding!! [Mooers49],... ‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’ Signature files

Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files? Signature files

Q1: How to choose F and m ? m (=4 bits/word) F (=12 bits sign. size) Signature files

Q1: How to choose F and m ? A: so that doc. signature is 50% full m (=4 bits/word) F (=12 bits sign. size) Signature files

Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files? Signature files

Q2: Why is it called ‘false drop’? Old, but fascinating story [1949] how to find qualifying books (by title word, and/or author, and/or keyword) in O(1) time? Signature files without computers..

Solution: edge-notched cards each title word is mapped to m numbers(how?) and the corresponding holes are cut out: Signature files

Solution: edge-notched cards data ‘data’ -> #1, #39 Signature files

Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards! data ‘data’ -> #1, #39 Signature files

Q3: other apps of signature files? A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document? Signature files

Another name: Bloom Filters UNIX’s early ‘spell’ system [McIlroy] Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99] differential files [Severance+Lohman] Signature files

easy insertions; slower than inversion brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non-qualifying elements, and focus on the rest. Signature files

References Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.

References - cont’d Faloutsos, C. and H. V. Jagadish (Aug , 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia. Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2):

References - cont’d Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan. Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf. McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM- 30(1):

References - cont’d Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR. Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.

References - cont’d Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS. Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.

References - cont’d Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool.". Zobel, J., A. Moffat, et al. (Aug , 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.