Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.



Recap: IR System & Tasks Involved
[Figure: overview of an IR system. Labels: information need, user interface, query, query processing (parsing & term processing), logical view of the information need; documents, select data for indexing, parsing & term processing, logical view of the documents (index); searching, ranking, results (docs.), result representation; performance evaluation.]

Data Structure for Search (Index)
Requirements:
a) Represent documents appropriately
b) Enable efficient and effective search
In addition:
c) Limit storage (note: trade-off with speed)
Operations:
- Search
- Generation (add documents)
- Update (remove / replace documents)
- Others (Boolean search, phrases, etc.)

The Index
Example query: capital AND France (Boolean query)
Doc. 1: The capital of France is called Paris.
Doc. 2: Paris is the capital of France.
Doc. 3: The capitals of France and England are called Paris and London, respectively.
Naive approach: scanning, i.e. every document is read in full for each query. The Boolean query (capital AND France) delivers Doc. 1 and Doc. 2 as results; Doc. 3 contains "capitals" but not the exact term "capital".
Question: Can we do this more efficiently?
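The scanning approach can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the names `DOCS`, `tokenize` and `scan_and` are made up here, and the tokenizer assumes exact word matching without case folding or stemming.

```python
import re

# The three example documents from the slide.
DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def tokenize(text):
    """Split text into word tokens (assumption: no case folding, no stemming)."""
    return re.findall(r"[A-Za-z]+", text)

def scan_and(docs, *terms):
    """Return the ids of all documents containing every query term.

    Every document is tokenized again for every query, which is what
    makes scanning expensive on large collections.
    """
    hits = []
    for doc_id, text in docs.items():
        words = set(tokenize(text))
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits

print(scan_and(DOCS, "capital", "France"))  # [1, 2]
```

The cost grows with the total text size for each query, which motivates precomputing an index.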

Term-Document Incidence Matrix
Idea: Build a matrix with
- Columns = documents
- Rows = all appearing words (alphabetically sorted)
Entries: 1 = word appears in doc., 0 = word does not appear
Example: Doc. 1: The capital of France is called Paris.

                DOC. 1  DOC. 2  DOC. 3
and               0       0       1
are               0       0       1
called            1       0       1
capital           1       1       0
capitals          0       0       1
England           0       0       1
France            1       1       1
is                1       1       0
London            0       0       1
of                1       1       1
Paris             1       1       1
respectively      0       0       1
The               1       0       1
the               0       1       0

Term-Document Incidence Matrix: Boolean Queries
Query: capital AND France
Take the matrix rows of the query terms and combine them with a bitwise AND:
capital: ( 1, 1, 0 )
France:  ( 1, 1, 1 )
AND:     ( 1, 1, 0 )
Result: Doc. 1 and Doc. 2
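As a rough sketch of the matrix-based evaluation, the Python snippet below builds the incidence matrix for the three example documents and ANDs two rows. The function names are hypothetical, and the tokenizer (letters only, case-sensitive) is an assumption chosen so that the rows match the slide.

```python
import re

DOCS = [
    "The capital of France is called Paris.",
    "Paris is the capital of France.",
    "The capitals of France and England are called Paris and London, respectively.",
]

def build_incidence_matrix(docs):
    """Map each term to a tuple of 0/1 flags, one flag per document."""
    token_sets = [set(re.findall(r"[A-Za-z]+", d)) for d in docs]
    # Alphabetical order ignoring case, with "The" before "the" as on the slide.
    vocabulary = sorted(set().union(*token_sets), key=lambda t: (t.lower(), t))
    return {t: tuple(int(t in s) for s in token_sets) for t in vocabulary}

def boolean_and(matrix, *terms):
    """Combine the rows of the given terms position by position with AND."""
    rows = [matrix[t] for t in terms]
    return tuple(int(all(bits)) for bits in zip(*rows))

matrix = build_incidence_matrix(DOCS)
print(matrix["capital"])                         # (1, 1, 0)
print(matrix["France"])                          # (1, 1, 1)
print(boolean_and(matrix, "capital", "France"))  # (1, 1, 0)
```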

Term-Document Incidence Matrix
Search: Very easy, but not very efficient (the matrix needs one cell for every combination of document and term)
The good news: This matrix is very sparse (i.e. lots of 0’s, only a few 1’s)
Idea: Just store the ‘hits’ (term incidences)
Data structure: Inverted File

Inverted File
[Figure: term-document incidence matrix for DOC1, DOC2, DOC3 (same entries as above)]
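To make the "store only the hits" idea concrete, here is a minimal Python sketch that maps each term to the sorted list of documents containing it. `build_inverted_file` is an illustrative name, and the case-sensitive tokenizer is an assumption made to match the example matrix.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[A-Za-z]+", text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_file(DOCS)
print(index["capital"])   # [1, 2]
print(index["capitals"])  # [3]
print(index["France"])    # [1, 2, 3]
```

Each list corresponds to the 1-entries of one matrix row; the 0-entries are simply not stored.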

Inverted File
Main advantage: Easy, efficient search
Disadvantages:
- Storage (10%-100% of doc. size)
- Modifications (updates, …)
Often other information is stored as well
- to support advanced queries (e.g. position for phrases)
- to speed up the search process (e.g. frequency for query optimization)
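One common way to answer an AND query on an inverted file is to merge two sorted postings lists in a single pass. The sketch below shows that technique under the assumption that postings are sorted lists of document ids; the slides do not prescribe a particular implementation, and `intersect` is a hypothetical helper.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with one linear merge pass."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document appears in both lists: it is a hit.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Postings for "capital" and "France" in the three-document example.
print(intersect([1, 2], [1, 2, 3]))  # [1, 2]
```

The running time is proportional to the combined length of the two lists, not to the size of the collection.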

Inverted File & Term Frequency
Query: France AND London AND capitals
France → … (203 documents)
capitals → … (163 documents)
London → … (24 documents)
Optimize query to speed up search (i.e. limit number of merging steps). Possible evaluation orders:
- (France AND London) AND capitals
- (capitals AND London) AND France
Starting with the terms that occur in the fewest documents keeps the lists to be merged short.
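A simple heuristic for this optimization is to process AND-ed terms in order of increasing document frequency. The Python sketch below illustrates it; the assignment of the counts 203, 163 and 24 to France, capitals and London follows the order in which they appear on the slide and is an assumption, as are the function names and the tiny `toy_index`.

```python
def plan_and_query(doc_freq, terms):
    """Order AND-ed terms by increasing document frequency."""
    return sorted(terms, key=lambda t: doc_freq[t])

def and_query(index, terms):
    """Evaluate an AND query, intersecting the shortest postings lists first."""
    doc_freq = {t: len(index[t]) for t in terms}
    ordered = plan_and_query(doc_freq, terms)
    result = set(index[ordered[0]])
    for term in ordered[1:]:
        result &= set(index[term])
        if not result:  # nothing left to match, stop early
            break
    return sorted(result)

# Document frequencies as read from the slide (assumed mapping).
doc_freq = {"France": 203, "capitals": 163, "London": 24}
print(plan_and_query(doc_freq, ["France", "London", "capitals"]))
# ['London', 'capitals', 'France']

# Toy index taken from the three-document example.
toy_index = {"France": [1, 2, 3], "capitals": [3], "London": [3]}
print(and_query(toy_index, ["France", "London", "capitals"]))  # [3]
```

Set intersection is used here only to keep the sketch short; a merge over sorted lists would serve the same purpose.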

Implementation
[Figure: dictionary entries (e.g. capitals, France, …), each pointing to its postings list]
Dictionary: Usually kept in memory (fast!)
Postings: Kept on disk, accessed via offset
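The split between an in-memory dictionary and on-disk postings could look roughly like the following Python sketch. The binary layout (unsigned 32-bit little-endian document ids, one list after another) and the names `write_postings` and `read_postings` are assumptions for illustration only; real systems typically add compression.

```python
import struct
import tempfile

def write_postings(index, path):
    """Write all postings lists to one binary file.

    Returns the dictionary to keep in memory: term -> (byte offset, list length).
    """
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            ids = index[term]
            dictionary[term] = (f.tell(), len(ids))
            f.write(struct.pack(f"<{len(ids)}I", *ids))
    return dictionary

def read_postings(dictionary, path, term):
    """Look up the offset in memory, then read only that list from disk."""
    offset, count = dictionary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        return list(struct.unpack(f"<{count}I", f.read(4 * count)))

index = {"capitals": [3], "France": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    path = f"{tmp}/postings.bin"
    dictionary = write_postings(index, path)
    print(read_postings(dictionary, path, "France"))    # [1, 2, 3]
    print(read_postings(dictionary, path, "capitals"))  # [3]
```

A query then costs one dictionary lookup in memory plus one seek and read per query term.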

Dictionary: Size
Dictionary usually kept in memory (speed)
How big does it get? Heaps' law
[Figure: plot with axes "text size N" and "dictionary size"]
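Heaps' law is usually stated as V = K * N^beta, where N is the text size in tokens and V the number of distinct terms. The short Python sketch below evaluates it; the default values K = 50 and beta = 0.5 are illustrative assumptions within the commonly quoted ranges (K roughly 10 to 100, beta roughly 0.4 to 0.6), not figures from the slides.

```python
def heaps_vocabulary_size(n_tokens, k=50.0, beta=0.5):
    """Estimate the dictionary size for a text of n_tokens tokens (Heaps' law).

    k and beta are collection-dependent constants; the defaults are
    placeholders and would have to be fitted to real data.
    """
    return k * n_tokens ** beta

for n in (10**6, 10**8, 10**10):
    print(f"{n:>14,} tokens -> about {heaps_vocabulary_size(n):,.0f} terms")
```

With beta below 1 the dictionary grows sublinearly with the text, which is why keeping it in memory is usually feasible.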

Entries in the Dictionary
What terms / tokens should go into the dictionary?
Example entries: and, are, called, capital, capitals, England, France, is, London, of, Paris, respectively, The, the
Bag-of-words approaches: ignore word order
