Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.)


Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University

Organizational Remarks
Exercises: Please register for the exercises by sending me (huerst@informatik.uni-freiburg.de) an email by Friday, May 5th, with
- your name,
- your matriculation number (Matrikelnummer),
- your degree program (Studiengang),
- your plans for the exam.
This is only needed to organize the exercises and has no effect if you decide to drop the course later.

Recap: IR System & Tasks Involved
[Diagram of an IR system, built up over three slides.
Components: INFORMATION NEED, User Interface, QUERY, DOCUMENTS (DOCS.), INDEX, RESULTS.
Tasks: SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, INDEXING, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORM. NEED, SEARCHING, RANKING, RESULT REPRESENTATION, PERFORMANCE EVALUATION.]

Query Languages: Boolean Search
So far:
a) Single terms (unrelated / bag of words)
b) Boolean combinations of terms (AND, OR, NOT)
Boolean search: the main search model before the Web came along (note: used mainly by professional users).
Advantages of Boolean queries:
- Precise (mathematical model)
- Offers great control and transparency
- Good for domains where results are ranked by criteria other than relevance, e.g. chronologically

Boolean Search (Cont.)
Disadvantages of Boolean queries:
- Sometimes hard to specify, even for experts
- Binary decision (relevant or not)
- Bag of words, no position information
Example:
Query: New York City
Doc. 1: This is a nice city.
Doc. 2: This city has a new library.
Query: New AND York AND City
Doc. 1: New York has a new library.
Doc. 2: The city of York has a new library.
(For the AND query, only Doc. 2 contains all three terms, although it is not about New York City.)
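The AND example above can be replayed with a small inverted index. The following is a minimal sketch, not the lecture's implementation; the helper names (tokenize, build_index, boolean_and, boolean_or) are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Return the IDs of documents that contain at least one query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set().union(*postings)


# The two documents from the slide's AND example.
docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)
and_hits = boolean_and(index, ["New", "York", "City"])
or_hits = boolean_or(index, ["New", "York", "City"])
print(and_hits)  # {2}
print(or_hits)   # {1, 2}
```

The index only records which documents contain a term, not where, so the AND query returns Doc. 2 even though "New", "York" and "City" never appear next to each other in it.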

Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)

Phrases
Often used (esp. for web search): quotes, e.g. “New York City”
Advantage: easy to use and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2])
How do we support this?
- We need word positions.
- We need all original words (e.g. no stop word removal in University of Freiburg).
- We need an efficient way to do this.

Approaches to Support Phrases
Biword indexes:
- Idea: Store pairs of consecutive words (in addition to single terms), e.g. New York City is represented by the terms New, York, City, New York, York City
- Might cause problems for phrases with more than 2 words, but often works quite well
Positional indexes:
- Idea: Store the position of each word in the postings list
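A biword index can be sketched in a few lines. This is an illustrative sketch only: the three toy documents and the function names are assumptions, and a longer phrase is answered by intersecting the postings of its consecutive word pairs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_biword_index(docs):
    """Index every single term and every pair of consecutive terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for term in tokens:
            index[term].add(doc_id)
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Return documents that contain every biword of the phrase."""
    tokens = tokenize(phrase)
    if not tokens:
        return set()
    if len(tokens) == 1:
        keys = tokens
    else:
        keys = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    postings = [index.get(key, set()) for key in keys]
    return set.intersection(*postings)


# Hypothetical toy collection.
docs = {
    1: "I love New York City in spring",
    2: "New York is far from York City Hall",
    3: "The city of York has a new library",
}
index = build_biword_index(docs)
candidates = phrase_candidates(index, "New York City")
print(candidates)  # {1, 2}
```

Document 2 is a false positive: it contains the biwords "new york" and "york city", but not the three-word phrase. This is the problem with phrases of more than two words mentioned above; the results are candidates that still need checking.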

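Positional Indexes – Example
[Figure: dictionary entries with document counts (NEW 23535, YORK 9421, CITY 18453) and their postings lists. Each posting has the form docID:frequency[positions].]
  23:4[3,12,46,78]  25:3[43,120,221]  32:6[12,20,57,200,322,481]  …
  NEW   23535   …, 25:6[41,87,136,…], …
  YORK  9421    …, 25:2[42,137], …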
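The way a positional index answers a phrase query can be sketched as follows. This is a minimal sketch under assumptions: the toy documents and function names are made up, positions are counted from 1, and the position lists are checked with a simple membership test rather than an efficient merge.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} for exact phrase occurrences."""
    tokens = tokenize(phrase)
    if not tokens:
        return {}
    hits = {}
    for doc_id, starts in index.get(tokens[0], {}).items():
        # The k-th following word must occur exactly k positions later.
        matches = [
            p for p in starts
            if all(p + k in index.get(term, {}).get(doc_id, ())
                   for k, term in enumerate(tokens[1:], start=1))
        ]
        if matches:
            hits[doc_id] = matches
    return hits


# Hypothetical toy collection.
docs = {
    1: "New York City has a new library",
    2: "The city of York has a new library",
    3: "A new guide to New York City and York",
}
index = build_positional_index(docs)
result = phrase_search(index, "New York City")
print(index["new"][1])  # [1, 6]
print(result)           # {1: [1], 3: [5]}
```

Document 2 contains all three words but never at consecutive positions, so it is not returned.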

Positional Indexes
Also works for queries such as
- University [word]1 Freiburg
- University NEAR Freiburg
Problem: Size
- Need to store additional info (positions) on an already large index (stop words!)
- Approx. size: 2-4 times the original index, 1/2 the size of the uncompressed documents [2]
In practice: Combinations exist, e.g. an index with names as phrases, useful biwords, and stored positions
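A proximity query over a positional index might look like the sketch below. It interprets NEAR as "at most k positions apart, in either order"; this interpretation, the parameter k, the toy documents and the function names are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def near(index, term_a, term_b, k):
    """Return documents where both terms occur at most k positions apart."""
    a_post = index.get(term_a.lower(), {})
    b_post = index.get(term_b.lower(), {})
    hits = set()
    # Only documents that contain both terms can match.
    for doc_id in a_post.keys() & b_post.keys():
        if any(abs(pa - pb) <= k
               for pa in a_post[doc_id] for pb in b_post[doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical toy collection.
docs = {
    1: "University of Freiburg",
    2: "Albert-Ludwigs University Freiburg",
    3: "The University of Basel cooperates with several institutes in Freiburg",
}
index = build_positional_index(docs)
within_two = near(index, "University", "Freiburg", 2)
within_one = near(index, "University", "Freiburg", 1)
print(sorted(within_two))  # [1, 2]
print(sorted(within_one))  # [2]
```

With k = 2 the query finds both "University of Freiburg" and "Albert-Ludwigs University Freiburg", while document 3 is rejected because the two words are too far apart.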

Pattern Matching – Wildcards
Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, …
Trailing wildcard queries, e.g. fußball*
- Matching terms can easily be found if the dictionary is stored as a B-tree
Leading wildcard queries, e.g. *meister
- Matching terms can easily be found if the dictionary is stored as a reverse B-tree (i.e. terms stored backwards)
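The sketch below illustrates why a sorted dictionary makes these lookups easy. It uses a sorted Python list with binary search as a stand-in for the B-tree (an assumption, not what a real system would store): all terms with a common prefix are adjacent in sorted order. The term list and function names are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with prefix; they are adjacent when sorted."""
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


def trailing_wildcard(sorted_terms, pattern):
    """Answer a query such as 'fußball*' on the forward dictionary."""
    return prefix_matches(sorted_terms, pattern.rstrip("*"))


def leading_wildcard(sorted_reversed_terms, pattern):
    """Answer a query such as '*meister' on the dictionary of reversed terms."""
    reversed_prefix = pattern.lstrip("*")[::-1]
    found = prefix_matches(sorted_reversed_terms, reversed_prefix)
    return sorted(term[::-1] for term in found)


# Hypothetical dictionary terms.
terms = ["federball", "fußball", "fußballer", "fußballspiel",
         "fußballweltmeister", "meister", "weltmeister"]
forward = sorted(terms)
backward = sorted(term[::-1] for term in terms)

trailing = trailing_wildcard(forward, "fußball*")
leading = leading_wildcard(backward, "*meister")
print(trailing)  # ['fußball', 'fußballer', 'fußballspiel', 'fußballweltmeister']
print(leading)   # ['fußballweltmeister', 'meister', 'weltmeister']
```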

Wildcards (Cont.)
General wildcards, e.g. f*ball (matches e.g. fußball, federball, …)
Idea: Move the * to the end
Permuterm index: For each word (e.g. fußball) add an end symbol (e.g. fußball$) and create all rotations (e.g. fußball$, ußball$f, ßball$fu, ball$fuß, …, l$fußbal, $fußball)
Permuterm index: dictionary = all rotations (permuterms), postings = dictionary terms that produce this rotation
Query: Rotate the * to the end (e.g. ball$f*) and get the postings from the permuterm index (e.g. ball$fuß, ball$feder, …)
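A permuterm index for queries with a single * can be sketched as follows. This is an illustration under assumptions: the function names and the term list are made up, and the prefix lookup is a linear scan over all rotations, where a real system would use a sorted dictionary or B-tree.

```python
from collections import defaultdict

END = "$"  # end-of-term symbol, as on the slide


def rotations(term):
    """Return all rotations of term + '$', e.g. fußball$, ußball$f, ..."""
    marked = term + END
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm_index(terms):
    """Map every rotation to the dictionary term(s) it was created from."""
    index = defaultdict(set)
    for term in terms:
        for rotation in rotations(term):
            index[rotation].add(term)
    return index


def wildcard_lookup(index, pattern):
    """Answer a pattern with exactly one '*', e.g. 'f*ball'."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch supports exactly one '*'")
    before, after = pattern.split("*")
    # Rotate the query so that the '*' ends up at the end: f*ball$ -> ball$f*
    key_prefix = after + END + before
    hits = set()
    for rotation, terms in index.items():  # linear scan for simplicity
        if rotation.startswith(key_prefix):
            hits |= terms
    return hits


# Hypothetical dictionary terms.
index = build_permuterm_index(["fußball", "federball", "fußballer", "handball"])
general = wildcard_lookup(index, "f*ball")
print(rotations("fußball")[3])  # ball$fuß
print(sorted(general))          # ['federball', 'fußball']
```

Trailing and leading wildcards fall out of the same lookup: fußball* becomes the prefix $fußball, and *ball becomes the prefix ball$.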

Structural Queries
In practice: Often semi-structured documents
Structural queries: Use the available structure to better specify the information need, e.g. AUTHOR = Ottmann AND TEXT CONTAINS search tree
Requires storing structure information, e.g. in a parametric index, encoded either in the dictionary:
  OTTMANN.AUTHOR  9 17 19 28 …
  OTTMANN.TITLE   12 26 44 48 …
  OTTMANN.BODY    8 9 17 23 …
or in the postings:
  OTTMANN  8.BODY  9.AUTHOR, 9.BODY  12.TITLE  …
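Both encodings can be tried out with plain dictionaries. In the sketch below the OTTMANN entries contain only the document IDs visible on the slide (the elided remainder is left out); the postings for "search" and "tree", the mapping of TEXT to the BODY zone, and all function names are hypothetical.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def structural_and(index, conditions):
    """AND together (zone, term) conditions on a dictionary-encoded index."""
    result = None
    for zone, term in conditions:
        postings = index.get(term.lower() + "." + zone.lower(), [])
        result = postings if result is None else intersect(result, postings)
    return result or []


def zone_filter(postings, zone):
    """Select document IDs for one zone from postings-encoded entries."""
    return [doc_id for doc_id, z in postings if z == zone]


# Variant 1: zone encoded in the dictionary (term.zone -> document IDs).
dictionary_encoded = {
    "ottmann.author": [9, 17, 19, 28],
    "ottmann.title": [12, 26, 44, 48],
    "ottmann.body": [8, 9, 17, 23],
    "search.body": [9, 17, 30],  # hypothetical postings
    "tree.body": [9, 17, 23],    # hypothetical postings
}

# Variant 2: zone encoded in the postings (term -> (document ID, zone) pairs).
postings_encoded = {
    "ottmann": [(8, "body"), (9, "author"), (9, "body"), (12, "title")],
}

# AUTHOR = Ottmann AND TEXT CONTAINS search tree
query = [("AUTHOR", "Ottmann"), ("BODY", "search"), ("BODY", "tree")]
answer = structural_and(dictionary_encoded, query)
authors = zone_filter(postings_encoded["ottmann"], "author")
print(answer)   # [9, 17]
print(authors)  # [9]
```

Encoding the zone in the dictionary multiplies the number of dictionary entries per term, while encoding it in the postings keeps one entry per term and makes each posting larger.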

Summary: Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)