Multimedia and Text Indexing

Midterm Monday, March 20, 2017, in class. Closed books and notes. Topics: Spatial, Temporal, Spatio-temporal; all papers and lecture notes on these topics. I will discuss the midterm on Wednesday.

Multimedia Data Management The need to query and analyze vast amounts of multimedia data (i.e., images, sound tracks, video tracks) has increased in recent years. Joint research from database management, computer vision, signal processing, and pattern recognition aims to solve problems related to multimedia data management.

Multimedia Data There are four major types of multimedia data: images, video sequences, sound tracks, and text. Of these, the easiest type to manage is text, since we can order, index, and search text using string-management techniques. Management of simple sounds is also possible, by representing audio as signal sequences over different channels. Image retrieval has received a lot of attention in the last decade (in both computer vision and databases); the main techniques can also be extended and applied to video retrieval.

Content-based Image Retrieval Images were traditionally managed by first annotating their contents and then using text-retrieval techniques to index them. However, as the amount of information in digital image format grew, drawbacks of this approach became apparent: manual annotation requires a vast amount of labor, and different people may perceive the contents of an image differently, so no objective keywords for search can be defined. A new research field was born in the 90s: Content-based Image Retrieval aims at indexing and retrieving images based on their visual contents.

Feature Extraction The basis of Content-based Image Retrieval is to extract and index visual features of the images. There are general features (e.g., color, texture, shape) and domain-specific features (e.g., objects contained in the image). Domain-specific feature extraction varies with the application domain and is based on pattern recognition; general features, on the other hand, can be used independently of the image domain.

Color Features To represent the color of an image compactly, a color histogram is used. Colors are partitioned into k groups according to their similarity, and the percentage of each group in the image is measured. Images are thus transformed into k-dimensional points, and a distance metric (e.g., the Euclidean distance) is used to measure the similarity between them. (figure: k-bin histogram mapped to a point in k-dimensional space)
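A rough sketch of the idea (my own toy example, not from the slides): bin pixel values into a k-dimensional histogram point and compare two images with the Euclidean distance. For simplicity this quantizes only the red channel; a real system would cluster full RGB colors.

```python
import numpy as np

def color_histogram(image, k=8):
    """Map an RGB image (H x W x 3, values 0-255) to a k-bin histogram.

    Illustrative sketch: colors are grouped by quantizing the red
    channel only; a real system would cluster in full RGB space.
    """
    values = image[..., 0].ravel()          # red channel as a 1-D array
    bins = np.linspace(0, 256, k + 1)       # k equal-width color groups
    hist, _ = np.histogram(values, bins=bins)
    return hist / hist.sum()                # percentages, sums to 1

def histogram_distance(h1, h2):
    """Euclidean distance between two k-dimensional histogram points."""
    return float(np.linalg.norm(h1 - h2))
```

Two images with very different color content then map to distant k-dimensional points.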

Using Transformations to Reduce Dimensionality In many cases the embedded dimensionality of a search problem is much lower than the actual dimensionality. Some methods apply transformations to the data and approximate them with low-dimensional vectors. The aim is to reduce dimensionality while maintaining the data characteristics. If d(a,b) is the distance between two objects a, b in the real (high-dimensional) space and d'(a',b') is their distance in the transformed low-dimensional space, we want d'(a',b') ≤ d(a,b), so that filtering with d' produces no false dismissals.
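A minimal sketch of why the lower-bounding property holds: after any orthonormal transform (e.g., DFT or PCA), keeping only the first k coordinates drops non-negative squared terms from the Euclidean distance, so the reduced distance can only shrink. The helper name below is my own, not from the slides.

```python
import numpy as np

def reduce_dim(x, k):
    """Keep only the first k coordinates of vector x.

    Dropping coordinates removes non-negative squared terms from the
    Euclidean distance, so d'(a', b') <= d(a, b) always holds.
    """
    return x[:k]

rng = np.random.default_rng(0)
a = rng.standard_normal(32)
b = rng.standard_normal(32)
d_full = float(np.linalg.norm(a - b))                            # true distance
d_low = float(np.linalg.norm(reduce_dim(a, 4) - reduce_dim(b, 4)))  # filter distance
```

Because d_low never exceeds d_full, a search can safely discard objects whose low-dimensional distance already exceeds the query radius.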

ADBIS 2003 Multimedia Indexing Every indexing model must follow a retrieval semantics; a multimedia indexing model must support similarity queries. Two types of similarity queries: range queries return the documents whose similarity to the query exceeds a given threshold; k-nearest-neighbour queries return the k most similar documents.

Metric Indexing Feature vectors are indexed according to the distances between each other. As a dissimilarity measure, a distance function d(Oi,Oj) is specified such that the metric axioms are satisfied: d(Oi,Oi) = 0 (reflexivity); d(Oi,Oj) > 0 for Oi ≠ Oj (positivity); d(Oi,Oj) = d(Oj,Oi) (symmetry); d(Oi,Ok) + d(Ok,Oj) ≥ d(Oi,Oj) (triangle inequality). Metric structures: main-memory structures: cover-tree, vp-tree, mvp-tree; disk-based structures: M-tree, Slim-tree
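The four axioms can be spot-checked numerically. The sketch below (check_metric is a hypothetical helper, not from the slides) verifies them for the Euclidean distance on a few sample points.

```python
import itertools
import numpy as np

def check_metric(d, objects):
    """Spot-check the four metric axioms for distance d on a sample set."""
    for oi, oj, ok in itertools.product(objects, repeat=3):
        assert d(oi, oi) == 0                        # reflexivity
        if not np.array_equal(oi, oj):
            assert d(oi, oj) > 0                     # positivity
        assert np.isclose(d(oi, oj), d(oj, oi))      # symmetry
        # triangle inequality (small tolerance for floating point)
        assert d(oi, ok) + d(ok, oj) >= d(oi, oj) - 1e-9
    return True

euclidean = lambda x, y: float(np.linalg.norm(x - y))
points = [np.array(p, dtype=float) for p in [(0, 0), (3, 4), (1, 1)]]
```

A sample check of course does not prove the axioms in general, but it catches broken dissimilarity functions early.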

M-tree at a glance Indexes objects of a general metric space (not only vector spaces); at the time, the only persistent and balanced metric tree. It does not directly use dimensions, just the distances between objects (any vector coordinates are handled by a user-defined metric). The correct M-tree hierarchy is guaranteed by the triangle-inequality axiom of d; the hierarchy consists of nested metric regions. It better resists the "curse of dimensionality" (depending on the metric), and the hierarchy of nodes allows the similarity queries to be implemented natively.

Structure of the M-tree The M-tree nodes contain items of two types: ground objects in leaf nodes, representing the data objects; routing objects in inner nodes, representing the metric regions.

Structure of the M-tree Entry for a routing object: {Or, ptr(T(Or)), r(Or), d(Or,P(Or))}, i.e., the object, a pointer to its subtree, its covering radius, and its distance to its parent routing object P(Or). Entry for a database object: {Oj, oid(Oj), d(Oj,P(Oj))}

Similarity queries in the M-tree A range query is specified by a query object Q and a query radius r(Q). A k-NN query is based on a modified range query (using a dynamically shrinking radius) and a priority queue. During range query evaluation, the M-tree is traversed depth-first (LIFO), and only the relevant metric regions (i.e., those intersecting the query region) and their nodes are further processed.

Similarity queries in the M-tree (figure: a query region around the query object Oq intersecting the nested metric regions)

Correctness This pruning is correct because if d(Or, Q) > r(Q) + r(Or), then every object Oj in the subtree T(Or) satisfies d(Oj, Or) ≤ r(Or), so by the triangle inequality d(Oj, Q) ≥ d(Or, Q) - d(Oj, Or) > r(Q).
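The pruning test above can be sketched as follows. The node layout (dicts with 'router', 'radius', and 'children' or 'objects') is a hypothetical simplification for illustration, not the actual M-tree disk format, and the parent-distance optimization is omitted.

```python
import numpy as np

def range_query(node, Q, rQ, d):
    """Range query over a toy M-tree-like hierarchy of ball regions.

    node: dict with 'router' Or and 'radius' r(Or), plus either
    'children' (inner node) or 'objects' (leaf node).
    """
    results = []
    stack = [node]                      # depth-first (LIFO) traversal
    while stack:
        n = stack.pop()
        # prune: if d(Or,Q) > r(Q) + r(Or), no object in this subtree
        # can lie within r(Q) of Q (triangle inequality)
        if d(n['router'], Q) > rQ + n['radius']:
            continue
        if 'objects' in n:              # leaf: test the ground objects
            results += [o for o in n['objects'] if d(o, Q) <= rQ]
        else:                           # inner node: descend
            stack.extend(n['children'])
    return results
```

On a toy tree, a query ball around (0, 0) with radius 2 visits only the nearby leaf and prunes the far one without ever touching its objects.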

Text Retrieval (Information retrieval) Given a database of documents, find documents containing “data”, “retrieval” Applications: Web law + patent offices digital libraries information filtering

Problem - Motivation Types of queries: boolean (‘data’ AND ‘retrieval’ AND NOT ...) additional features (‘data’ ADJACENT ‘retrieval’) keyword queries (‘data’, ‘retrieval’) How to search a large collection of documents?

Text – Inverted Files

Text – Inverted Files Q: space overhead? A: mainly, the postings lists

Text – Inverted Files how to organize the dictionary? stemming – Y/N? (keep only the root of each word, e.g., inverted, inversion → invert) insertions?

Text – Inverted Files how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA trees, ... stemming – Y/N? insertions?
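A minimal dictionary-plus-postings sketch (my own toy example, assuming whitespace tokenization and no stemming) shows how the postings lists answer the Boolean queries from earlier:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted postings list}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive tokenizer, no stemming
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def boolean_and(index, t1, t2):
    """Evaluate 't1 AND t2' by intersecting the two postings lists."""
    return sorted(set(index.get(t1, [])) & set(index.get(t2, [])))
```

In a real system the dictionary would live in a B-tree, hash table, or trie as listed above, and the postings lists would be compressed on disk.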

Text – Inverted Files postings list sizes follow a Zipf distribution; e.g., the rank-frequency plot of the 'Bible' is roughly a straight line on log-log axes, with freq ~ 1 / (rank · ln(1.78 V)), where V is the vocabulary size. (figure: log(freq) vs. log(rank))

Text – Inverted Files postings lists: Cutting+Pedersen (keep the first 4 postings in the B-tree leaves); how to allocate space: [Faloutsos+92] geometric progression; compression (Elias codes) [Zobel+] – down to 2% overhead! Conclusions: inversion needs space overhead (2%-300%), but it is the fastest

Text - Detailed outline Text databases problem inversion signature files (a.k.a. Bloom Filters) Vector model and clustering information filtering and LSI

Vector Space Model and Clustering Keyword (free-text) queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors

Vector Space Model and Clustering main idea: each document is a vector of size d, where d is the number of different terms in the database (the vocabulary size). (figure: a document containing '...data...' mapped to a d-dimensional vector with one slot per term, from 'aaron' to 'zoo')

Document Vectors Documents are represented as "bags of words", i.e., as vectors. A vector is like an array of floating-point numbers: it has direction and magnitude. Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

Document Vectors One location for each word. (table: term counts for documents A-I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; blank means 0 occurrences) "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A.

Document Vectors One location for each word. (the same term-count table) "Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.

Document Vectors (the same term-count table, with document ids A-I labeling the columns)
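The bag-of-words construction behind these tables can be sketched as follows (a toy helper of my own, assuming whitespace tokenization):

```python
def document_vectors(docs):
    """Represent each document as a raw term-count vector over the
    collection vocabulary; absent terms get 0, so vectors are sparse."""
    vocab = sorted({t for text in docs.values() for t in text.split()})
    vectors = {}
    for doc_id, text in docs.items():
        tokens = text.split()
        vectors[doc_id] = [tokens.count(t) for t in vocab]  # one slot per term
    return vocab, vectors
```

Every document vector has one coordinate per vocabulary term, so documents that share no terms become orthogonal vectors.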

We Can Plot the Vectors (figure: documents plotted in a 2-D space with axes "Star" and "Diet": a doc about movie stars, a doc about astronomy, and a doc about mammal behavior)

Assigning Weights to Terms Binary weights Raw term frequency tf x idf Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

Assigning Weights tf x idf measure: term frequency (tf) inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution Goal: assign a tf * idf weight to each term in each document

tf x idf

Inverse Document Frequency IDF provides high values for rare words and low values for common words. (example: idf values for a collection of 10,000 documents)
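A small sketch of the standard idf form, log(N / df) (the slides' exact variant is not shown, so take this as one common choice): a term appearing in every document gets weight 0, while a rare term gets a high weight.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df), where N is the number of
    documents and df is the number of documents containing the term."""
    N = len(docs)
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(N / df) if df else 0.0
```

Multiplying this by the raw term frequency gives the tf x idf weight for a term in a document.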

Similarity Measures for document vectors (seen as sets) Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient

tf x idf normalization Normalize the term weights so that longer documents are not unfairly given more weight. To normalize usually means to force all values into a certain range, usually between 0 and 1 inclusive.

Vector space similarity (use the weights to compare the documents)

Computing Similarity Scores (figure: document vectors plotted on axes from 0 to 1.0)

Vector Space with Term Weights and Cosine Matching Di = (di1, wdi1; di2, wdi2; ...; dit, wdit), Q = (qi1, wqi1; qi2, wqi2; ...; qit, wqit). Example in two dimensions (Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7). (figure: Q, D1, D2 plotted in the unit square)
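Cosine matching on the slide's example vectors can be sketched directly; the ranking depends only on the angle between the query and each document:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# the 2-D example from the slide
Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
```

D2 points in nearly the same direction as Q and therefore ranks above D1, even though both lie inside the unit square.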

Text - Detailed outline Text databases problem full text scanning inversion signature files (a.k.a. Bloom Filters) Vector model and clustering information filtering and LSI

Information Filtering + LSI [Foltz+,'92] Goal: users specify interests (= keywords); the system alerts them to suitable news documents. Major contribution: LSI = Latent Semantic Indexing latent ('hidden') concepts

Information Filtering + LSI Main idea: map each document into some 'concepts'; map each term into some 'concepts'. A 'concept' is roughly a set of terms with weights, e.g., "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept

Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI Pictorially: concept-document matrix and...

Information Filtering + LSI ... and concept-term matrix

Information Filtering + LSI Q: How to search, eg., for ‘system’?

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents


Information Filtering + LSI Thus LSI works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term 'system' but contain almost everything else ('data', 'retrieval').
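This thesaurus effect can be demonstrated with a truncated SVD on a tiny made-up term-document matrix (the terms, counts, and k are my own toy choices): a document that never mentions 'system' but talks about 'data' and 'retrieval' still gets a positive score for the query 'system'.

```python
import numpy as np

# Toy term-document matrix; rows = terms, columns = documents.
# d0 mentions data + retrieval (but NOT 'system'), d1 mentions
# data + system, d2 mentions moon + astronaut.
terms = ["data", "system", "retrieval", "moon", "astronaut"]
A = np.array([
    [1, 1, 0],   # data
    [0, 1, 0],   # system
    [1, 0, 0],   # retrieval
    [0, 0, 1],   # moon
    [0, 0, 1],   # astronaut
], dtype=float)

# LSI: keep the top-k singular triplets of A = U S Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # rank-k 'concept' approximation

q = np.array([0, 1, 0, 0, 0], dtype=float)  # query: the single term 'system'
raw_scores = q @ A        # plain keyword matching: d0 scores 0
lsi_scores = q @ A_k      # matching through latent concepts: d0 scores > 0
```

The rank-k approximation smears the weight of 'system' over the data/retrieval concept, which is exactly the automatically constructed thesaurus described above.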