Information Retrieval (IR)

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Chapter 5: Introduction to Information Retrieval
Multimedia Database Systems
Modern Information Retrieval Chapter 1: Introduction
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Inverted Index Hongning Wang
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Modern Information Retrieval Chapter 8 Indexing and Searching.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Information Retrieval in Practice
Search Engines and Information Retrieval
Information Retrieval Models
Modern Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
ISP 433/533 Week 2 IR Models.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Modern Information Retrieval Chapter 1: Introduction
1 Indexing and Searching (File Structures) Modern Information Retrieval (C hapter 8) With G. Navarro.
IR Models: Structural Models
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.
Modern Information Retrieval Chapter 1 Introduction.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Indexing and Searching
Modern Information Retrieval Chapter 4 Query Languages.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
L. Padmasree Vamshi Ambati J. Anand Chandulal J. Anand Chandulal M. Sreenivasa Rao M. Sreenivasa Rao Signature Based Duplicate Detection in Digital Libraries.
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
Search Engines and Information Retrieval Chapter 1.
Modern Information Retrieval Computer engineering department Fall 2005.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Introduction Multimedia initial focus
Text Based Information Retrieval
CS 430: Information Discovery
Indexing and Searching (File Structures)
Multimedia Information Retrieval
Information Retrieval
Basic Information Retrieval
Multimedia Information Retrieval
Introduction to Information Retrieval
Recuperação de Informação B
Information Retrieval and Web Design
Chapter 11: Indexing and Hashing
Presentation transcript:

Information Retrieval (IR) Deals with the representation, storage and retrieval of unstructured data Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets Classical IR deals mainly with text The evolution of multimedia databases and of the web have given new interest to IR E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Multimedia IR A multimedia information system that can store and retrieve attributes text 2D grey-scale and color images 1D time series digitized voice or music video E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Applications Financial, marketing: stock prices, sales etc. find companies whose stock prices move similarly Scientific databases: sensor data whether, geological, environmental data Office automation, electronic encyclopedias, electronic books Medical databases X-rays, CT, MRI scans Criminal investigation suspects, fingerprints, Personal archives, text and color images E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Queries Formulation of user’s information need Free-text query By example document (e.g., text, image) e.g., in a collection of color photos find those showing a tree close to a house Retrieval is based on the understanding of the content of documents and of their components E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Goals of Retrieval Accuracy: retrieve documents that the user expects in the answer With as few incorrect answers as possible All relevant answers are retrieved Speed: retrievals has to be fast The system responds in real time E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Accuracy of Retrieval Depends on query criteria, query complexity and specificity Attributes / Content criteria? Depends on what is matched with the query How documents content is represented Typically feature vectors Depends also on matching function Euclidean distance E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Speed of Retrieval The query is compared (matched) with all stored documents The definition of similarity criteria (similarity or distance function between documents) is an important issue: Similarity or distance function between documents Matching has to fast: document matching has to be computationally efficient and sequential searching must be avoided Indexing: search documents that are likely to match the query E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Database database descriptions documents index Doc1-FeactureVector Doc2-FeactureVector DocN-FeatureVector Doc1 query Doc2 DocN E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Main Idea Descriptions are extracted and stored Same or separate storage with documents Queries address the stored descriptions rather the documents themselves Images & video: the majority of stored data Solution: retrieval by text Text retrieval is a well researched area Most systems work with text, attributes E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Problem Text descriptions for images and video contained in documents are not available e.g., the text in a web site is not always descriptive of every particular image contained in the web site Two approaches human annotations feature extraction E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Human Annotations Humans insert attributes, captions for images and videos inconsistent or subjective over time and users (different users do not give the same descriptions) retrievals will fail if queries are formulated using different keywords or descriptions time consuming process, expensive or impossible for large databases E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Feature Extraction Features are extracted from audio, image and video cheaper and replicable approach consistent descriptions but inexact low level feature (patterns, colors etc.) difficult to extract meaning different techniques for different data E.G.M Petrakis Multimedia Information Retrieval

Architecture of a IR System for Images Furht at.al. 96 E.G.M Petrakis Multimedia Information Retrieval

Multimedia IR in Practice Relies on text descriptions Non-text IR there is nothing similar to text-based retrieval systems automatic IR is sometimes impossible requires human intervention at some level significant progress have been made domain specific interpretations E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Ranking of IR Methods Ranking in terms of complexity and accuracy attributes text audio image video Combined IR based on two or more data types for more accurate retrievals Complex data types (e.g.,video) are rich sources of information accuracy complexity E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Similarity Searching IR is not an exact process two documents are never identical! they can be “similar” searching must be “approximate” The effectiveness of IR depends on the types and correctness of descriptions used types of queries allowed user uncertainly as to what he is looking for efficiency of search techniques E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Approximate IR A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity Two common types of similarity queries: range queries: retrieve all documents up to a distance “threshold” T nearest-neighbor queries: retrieve the k best matches E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Distance - Similarity Decide whether two documents are similar distance: the lower the distance, the more similar the documents are similarity: the higher the similarity, the more similar the documents are Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is E.G.M Petrakis Multimedia Information Retrieval

Retrieval Quality Criteria Retrieval with as few errors as possible Two types of errors: false dismissals or misses: qualifying but non retrieved documents false positives or false drops: retrieved but not qualifying documents A good method minimizes both Ranking quality: retrieve qualifying documents before non-qualifying ones E.G.M Petrakis Multimedia Information Retrieval

Evaluation of Retrieval “Given a collection of documents, a set of queries and human expert’s responses to the above queries, the ideal system will retrieve exactly what the human dictated” The deviations from the above are measured E.G.M Petrakis Multimedia Information Retrieval

Precision-Recall Diagram 1 0,5 the ideal method Measure precision-recall for 1, 2 ..N answers High precision means few false alarms High recall mean few false dismissals E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Harmonic Mean A single measure combining precision and recall r: recall, p: precision F takes values in [0,1] F  1 as more retrieved documents are relevant F 0 as few retrieved documents are relevant F is high when both precision and recall are high F expresses a compromise between precision and recall E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Ranking Quality The higher the Rnorm the better the ability of a method to retrieve correct answer before incorrect A human expert evaluates the answer set Then take answers in pairs: (relevant,irrelevant) S+ : pairs ranked correctly (the relevant entry has retrieved before the irrelevant one) S- : pairs ranked incorrectly Smax+: total ranked pairs E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Searching Method Sequential scanning: the query is matched with all stored documents slow if the database is large or if matching is slow The documents must be indexed hashing, inverted files, B-trees, R-trees different data types are indexed and searched separately E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Goals of IR Summarizing, there are two general goals common to all IR systems effectiveness: IR must be accurate (retrieves what the user expects to see in the answer) efficiency: IR must be fast (faster than sequential scanning) E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Approximate IR A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity Two common types of similarity queries: range queries: retrieve all documents up to a distance “threshold” T nearest-neighbor queries: retrieve the k best matches E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Query Formulation Must be flexible and convenient SQL queries constraints on all attributes and data types may become very complex Queries by example e.g., by providing an example document or image Browsing: display headers, summaries, miniatures etc. or for refining the retrieved results E.G.M Petrakis Multimedia Information Retrieval

Access Methods for Text Access methods for text are interesting for at least 3 reasons multimedia documents contain text, e.g., images often have captions text retrieval has several applications in itself (library automation, web search etc.) text retrieval research has led to useful ideas like, vector space model, information filtering, relevance feedback etc. E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Text Queries Single or multiple keyword queries context queries: phrases, word proximity boolean: keywords with and, or, not Natural language: free text queries Structured search takes also text structure into account and can be: flat or hierarchical for searching in titles, paragraphs, sections, chapters hypertext: combines content-connectivity E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Keyword Matching The whole collection is searched no preprocessing no space overhead updates are easy The user specifies a string (regular expression) and the text is parsed using a finite state automaton KMP, BMH algorithms E.G.M Petrakis Multimedia Information Retrieval

Error Tolerant Methods Methods that can tolerate errors Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from query Wu, Manber 92, Baeza-Yates, Connet 92 Extension: regular expressions built-up by strings and operators: “pro (blem | tein) (s | ε) | (0 | 1 | 2)” E.G.M Petrakis Multimedia Information Retrieval

Error Counting Methods Retrieve similar words or phrases e.g., misspelled words or words with different pronunciation editing distance: a numerical estimate of the similarity between 2 strings phonetic coding: search words with similar pronunciation (Soundex, Phonix) N-grams: count common N-length substrings E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Editing Distance Minimum number of edit operations that are needed to transform the first string to the second edit operation: insert, delete, substitute and (sometimes) transposition of characters d(si,ti) : distance between characters usually d(si,ti) = 1 if si < > ti and 0 otherwise edit(cordis,codis) = 1: r is deleted edit(cordis, codris) = 1: transposition of rd E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval 1 2 3 4 5 6 j a b c d 7 8 initialization cost total cost E.G.M. Petrakis Multimedia Information Retrieval

Damerau-Levenstein Algorithm Basic recurrence relation (DP algorithm) edit(0,0) = 0;edit(i,0) = i; edit(0,j) = j; edit(i,j) = min{ edit(i-1,j) + 1, edit(i, j-1) + 1, edit(i-1,j-1) + d(si,ti), edit(i-2,j-2) + d(si,tj-1) + d(si-1,sj) + 1 } E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Matching Algorithm Compute D(A,B) #A, #B lengths of A, B 0: null symbol R: cost of an edit operation D(0,0) = 0 for i = 0 to #A: D(i:0) = D(i-1,0) + R(A[i]0); for j = 0 to #B: D(0:j) = D(0,j-1) + R(0B[j]); for i = 0 to #A for j = 0 to #B { m1 = D(i,j-1) + R(0B[j]); m2 = D(i-1,j) + R(A[i] 0); m3 = D(i-1,j-1) + R(A[i] B[j]); D(i,j) = min{m1, m2, m3}; } E.G.M. Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Classical IR Models Boolean model simple based on set theory queries as Boolean expressions Vector space model queries and documents as vectors in term space Probabilistic model a probabilistic approach E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Text Indexing A document is represented by a set of index terms that summarize document contents adjectives, adverbs, connectives are less useful mainly nouns (lexicon look-up) requires text pre-processing (off-line) E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Indexing index Data Repository E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Text Preprocessing Extract index terms from text Processing stages word separation, sentence splitting change terms to a standard form (e.g., lowercase) eliminate stop-words (e.g. and, is, the, …) reduce terms to their base form (e.g., eliminate prefixes, suffixes) construct mapping between terms and documents (indexing) E.G.M Petrakis Multimedia Information Retrieval

Text Preprocessing Chart from Baeza – Yates & Ribeiro – Neto, 1999 E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Text Indexing Methods Main indexing methods: inverted files signature files bitmaps Size of index may exceed size of actual collection compress index to reduce storage typically 40% of size of collection E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Inverted Files Dictionary: terms stored alphabetically Posting list: pointers to documents containing the term Posting info: document id, frequency of occurrence, location in document, etc. dictionary indexed by B-trees, tries or binary searched Pros: fast and easy to implement Cons: large space overhead (up to 300%) E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Inverted Index άγαλμα αγάπη … δουλειά πρωί ωκεανός index posting list (1,2)(3,4) (4,3)(7,5) (10,3) 1 2 3 4 5 6 7 8 9 10 11 ……… documents E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Query Processing Parse query and extract query terms Dictionary lookup binary search Get postings from inverted file one term at a time Accumulate postings record matching document ids, frequencies, etc. compute weights Rank answers by weights (e.g., frequencies) E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Signature Files A filter for eliminating most of the non-qualifying documents signature: short hash-coded representation of document or query the signatures are stored and searched sequentially search returns all qualifying documents plus some false alarms the answer is searched to eliminate false alarms E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Superimposed Coding Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0 size of F affects the false drop rate bit patterns are OR-ed to form the doc signature find documents having 1 in the same bits as the query word signature data 001 000 110 010 base 000 010 101 001 document signature 001 010 111 011 E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval Bitmaps For every term, a bit vector is stored each bit corresponds to a document set to 1 if term appears in document, 0 otherwise e.g., “word”  10100: the “word” appears in documents 1 and 3 in a collection of 5 documents Very efficient for Boolean queries retrieve bit-vectors for each query term combine bit vectors with Boolean operators E.G.M Petrakis Multimedia Information Retrieval

Comparison of Text Indexing Methods Bitmaps: pros: intuitive, easy to implement and to process cons: space overhead Signature files: pros: easy to implement, compact index, no lexicon cons: many unnecessary accesses to documents Inverted files: pros: used by most search engines cons: large space overhead, index can be compressed E.G.M Petrakis Multimedia Information Retrieval

Multimedia Information Retrieval References “Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996 “Modern Information Retrieval”, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999 “Automatic Text Processing”, Gerard Salton, Addison Wesley, 1989 Information Retrieval Links: http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html E.G.M Petrakis Multimedia Information Retrieval