CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #14: Text – Part I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #15: Text - part II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
Chapter 5: Introduction to Information Retrieval
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
Packing bag-of-features ICCV 2009 Herv´e J´egou Matthijs Douze Cordelia Schmid INRIA.
Information Retrieval in Practice
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Information Retrieval Review
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Text Databases. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and video.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Multimedia and Text Indexing. Multimedia Data Management The need to query and analyze vast amounts of multimedia data (i.e., images, sound tracks, video.
A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval University of Illinois at Urbana-Champain Zhichen XuYan Chen Northwestern.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Modeling Modern Information Retrieval
Multimedia Databases Text - part I Slides by C. Faloutsos, CMU.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.
Text and Web Search. Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
L. Padmasree Vamshi Ambati J. Anand Chandulal J. Anand Chandulal M. Sreenivasa Rao M. Sreenivasa Rao Signature Based Duplicate Detection in Digital Libraries.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Chapter 6: Information Retrieval and Web Search
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Vector Space Models.
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
1 CS 430: Information Discovery Lecture 5 Ranking.
Natural Language Processing Topics in Information Retrieval August, 2002.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Information Retrieval and Web Search
CS 430: Information Discovery
Multimedia and Text Indexing
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Chapter 5: Information Retrieval and Web Search
15-826: Multimedia Databases and Data Mining
Latent Semantic Analysis
Presentation transcript:

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications:

Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications: Web law + patent offices digital libraries information filtering

Problem - Motivation Types of queries: boolean (‘data’ AND ‘retrieval’ AND NOT...)

Problem - Motivation Types of queries: boolean (‘data’ AND ‘retrieval’ AND NOT...) additional features (‘data’ ADJACENT ‘retrieval’) keyword queries (‘data’, ‘retrieval’) How to search a large collection of documents?

Full-text scanning Build a FSA; scan c a t

Full-text scanning for single term: (naive: O(N*M)) ABRACADABRAtext CAB pattern

Full-text scanning for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77) build a small FSA; visit every text letter once only, by carefully shifting more than one step ABRACADABRAtext CAB pattern

Full-text scanning ABRACADABRAtext CAB pattern CAB...

Full-text scanning for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77) Boyer and Moore (‘77) preprocess pattern; start from right to left & skip! ABRACADABRAtext CAB pattern

Full-text scanning ABRACADABRAtext CAB pattern CAB

Full-text scanning ABRACADABRAtext OMINOUS pattern OMINOUS Boyer+Moore: fastest, in practice Sunday (‘90): some improvements

Full-text scanning For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75) again, build a simplified FSA in O(M) time Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92]

Full-text scanning Approximate matching - string editing distance: d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second SURVEY SURGERY

Full-text scanning string editing distance - how to compute? A:

Full-text scanning string editing distance - how to compute? A: dynamic programming cost( i, j ) = cost to match prefix of length i of first string s with prefix of length j of second string t

Full-text scanning if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1) else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )

Full-text scanning Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)

Full-text scanning Conclusions: Full text scanning needs no space overhead, but is slow for large datasets

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Text - Inversion

Q: space overhead?

Text - Inversion A: mainly, the postings lists

Text - Inversion how to organize dictionary? stemming – Y/N? insertions?

Text - Inversion how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA trees,... stemming – Y/N? insertions?

Text – Inversion newer topics: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+] ‘zipf’ distributions Approximate searching (‘glimpse’ [Wu+])

postings list – more Zipf distr.: eg., rank-frequency plot of ‘Bible’ log(rank) log(freq) Text - Inversion freq ~ 1 / (rank * ln(1.78V))

Text - Inversion postings lists Cutting+Pedersen (keep first 4 in B-tree leaves) how to allocate space: [Faloutsos+92] geometric progression compression (Elias codes) [Zobel+] – down to 2% overhead!

Conclusions Conclusions: needs space overhead (2%-300%), but it is the fastest

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Signature files idea: ‘quick & dirty’ filter

Signature files idea: ‘quick & dirty’ filter then, do seq. scan on sign. file and discard ‘false alarms’ Adv.: easy insertions; faster than seq. scan Disadv.: O(N) search (with small constant) Q: how to extract signatures?

Signature files A: superimposed coding!! [Mooers49],... m (=4 bits/word) ~ (=4 bits set to “1” and the rest left as “0”) F (=12 bits sign. size) the bit patterns are OR-ed to form the document signature

Signature files A: superimposed coding!! [Mooers49],... data actual match

Signature files A: superimposed coding!! [Mooers49],... retrieval actual dismissal

Signature files A: superimposed coding!! [Mooers49],... nucleotic false alarm (‘false drop’)

Signature files A: superimposed coding!! [Mooers49],... ‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’

Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Signature files Q1: How to choose F and m ? m (=4 bits/word) F (=12 bits sign. size)

Signature files Q1: How to choose F and m ? A: so that doc. signature is 50% full m (=4 bits/word) F (=12 bits sign. size)

Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Signature files Q2: Why is it called ‘false drop’? Old, but fascinating story [1949] how to find qualifying books (by title word, and/or author, and/or keyword) in O(1) time? without computers

Signature files Solution: edge-notched cards each title word is mapped to m numbers(how?) and the corresponding holes are cut out:

Signature files Solution: edge-notched cards data ‘data’ -> #1, #39

Signature files Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards! data ‘data’ -> #1, #39

Signature files Also known as ‘zatocoding’, from ‘Zator’ company.

Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Signature files Q3: other apps of signature files? A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document?

Signature files UNIX’s early ‘spell’ system [McIlroy] Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99] differential files [Severance+Lohman]

Signature files - conclusions easy insertions; slower than inversion brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non- qualifying elements, and focus on the rest.

References Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.

References - cont’d Faloutsos, C. and H. V. Jagadish (Aug , 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia. Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2):

References - cont’d Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan. Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf. McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM- 30(1):

References - cont’d Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR. Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.

References - cont’d Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS. Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.

References - cont’d Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool.". Zobel, J., A. Moffat, et al. (Aug , 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

Vector Space Model and Clustering keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors

Vector Space Model and Clustering main idea: document...data... aaron zoo data V (= vocabulary size) ‘indexing’

Vector Space Model and Clustering Then, group nearby vectors together Q1: cluster search? Q2: cluster generation? Two significant contributions ranked output relevance feedback

Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively CS TRs TU TRs

Vector Space Model and Clustering ranked output: easy! CS TRs TU TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] CS TRs TU TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] How? CS TRs TU TRs

Vector Space Model and Clustering How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones CS TRs TU TRs

Outline - detailed main idea cluster search cluster generation evaluation

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity

Cluster generation Q1: document-to-document similarity (recall: ‘bag of words’ representation) D1: {‘data’, ‘retrieval’, ‘system’} D2: {‘lung’, ‘pulmonary’, ‘system’} distance/similarity functions?

Cluster generation A1: # of words in common A2: normalized by the vocabulary sizes A3:.... etc About the same performance - prevailing one: cosine similarity

Cluster generation cosine similarity: similarity(D1, D2) = cos( θ ) = sum(v 1,i * v 2,i ) / len(v 1 )/ len(v 2 ) θ D1 D2

Cluster generation cosine similarity - observations: related to the Euclidean distance weights v i,j : according to tf/idf θ D1 D2

Cluster generation tf (‘term frequency’) high, if the term appears very often in this document. idf (‘inverse document frequency’) penalizes ‘common’ words, that appear in almost every document

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity ?

Cluster generation A1: min distance (‘single-link’) A2: max distance (‘all-link’) A3: avg distance A4: distance to centroid ?

Cluster generation A1: min distance (‘single-link’) leads to elongated clusters A2: max distance (‘all-link’) many, small, tight clusters A3: avg distance in between the above A4: distance to centroid fast to compute

Cluster generation We have document-to-document similarity document-to-cluster similarity Q: How to group documents into ‘natural’ clusters

Cluster generation A: *many-many* algorithms - in two groups [VanRijsbergen]: theoretically sound (O(N^2)) independent of the insertion order iterative (O(N), O(N log(N))

Cluster generation - ‘sound’ methods Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cut cat tigerhorsecow

Cluster generation - ‘sound’ methods Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers) like ‘k-means’ but how to decide ‘k’?

Cluster generation - ‘sound’ methods Approach#3: Graph theoretic [Zahn]: build MST; delete edges longer than 2.5* std of the local average

Cluster generation - ‘sound’ methods Result: variations Complexity?

Cluster generation - ‘iterative’ methods general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed (possibly adjusting cluster centroid) possibly, re-assign some vectors to improve clusters Fast and practical, but ‘unpredictable’

Cluster generation - ‘iterative’ methods general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed (possibly adjusting cluster centroid) possibly, re-assign some vectors to improve clusters Fast and practical, but ‘unpredictable’

Cluster generation one way to estimate # of clusters k: the ‘cover coefficient’ [Can+] ~ SVD

Outline - detailed main idea cluster search cluster generation evaluation

Evaluation Q: how to measure ‘goodness’ of one distance function vs another? A: ground truth (by humans) and ‘precision’ and ‘recall’

Evaluation precision = (retrieved & relevant) / retrieved 100% precision -> no false alarms recall = (retrieved & relevant)/ relevant 100% recall -> no false dismissals

References Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): Noreault, T., M. McGill, et al. (1983). A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. Information Retrieval Research, Butterworths. Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.

References - cont’d Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc. Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill. Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths. Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1):

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

LSI - Detailed outline LSI problem definition main idea experiments

Information Filtering + LSI [Foltz+,’92] Goal: users specify interests (= keywords) system alerts them, on suitable news- documents Major contribution: LSI = Latent Semantic Indexing latent (‘hidden’) concepts

Information Filtering + LSI Main idea map each document into some ‘concepts’ map each term into some ‘concepts’ ‘Concept’:~ a set of terms, with weights, e.g. “data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept

Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI Pictorially: concept-document matrix and...

Information Filtering + LSI... and concept-term matrix

Information Filtering + LSI Q: How to search, eg., for ‘system’?

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

LSI - Detailed outline LSI problem definition main idea experiments

LSI - Experiments 150 Tech Memos (TM) / month 34 users submitted ‘profiles’ (6-66 words per profile) concepts

LSI - Experiments four methods, cross-product of: vector-space or LSI, for similarity scoring keywords or document-sample, for profile specification measured: precision/recall

LSI - Experiments LSI, with document-based profiles, were better precision recall (0.25,0.65) (0.50,0.45) (0.75,0.30)

LSI - Discussion - Conclusions Great idea, to derive ‘concepts’ from documents to build a ‘statistical thesaurus’ automatically to reduce dimensionality Often leads to better precision/recall but: Needs ‘training’ set of documents ‘concept’ vectors are not sparse anymore

LSI - Discussion - Conclusions Observations Bellcore (-> Telcordia) has a patent used for multi-lingual retrieval How exactly SVD works?

Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text SVD: a powerful tool multimedia...

References Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12):