Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.


1 Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Organizational Remarks Exercises: Please register for the exercises by sending me (huerst@informatik.uni-freiburg.de) an email by Friday, May 5th, with: your name, Matrikelnummer (student ID number), Studiengang (degree program), and plans for the exam. This is only to organize the exercises and has no effect if you decide to drop the course later.

3 Recap: IR System & Tasks Involved [Diagram: INFORMATION NEED, DOCUMENTS, User Interface, INDEX, RESULTS, DOCS.]

4 Recap: IR System & Tasks Involved [Diagram, extended: INFORMATION NEED, DOCUMENTS, User Interface, QUERY, INDEX, INDEXING, SEARCH, RESULTS, DOCS., RESULT REPRESENTATION]

5 Recap: IR System & Tasks Involved [Diagram, complete: INFORMATION NEED, DOCUMENTS, User Interface, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORM. NEED, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, INDEX, SEARCHING, RANKING, RESULTS, DOCS., RESULT REPRESENTATION, PERFORMANCE EVALUATION]

6 Query Languages: Boolean Search So far: a) single terms (unrelated / bag of words), b) Boolean combinations (AND, OR, NOT). Boolean search: the main search model before the Web came along (note: mainly professional users). Advantages of Boolean queries: precise (mathematical model); offers great control and transparency; good for domains where results are ranked by means other than relevance, e.g. chronologically.

7 Boolean Search (Cont.) Disadvantages of Boolean queries: sometimes hard to specify, even for experts; binary decision (relevant or not); bag of words, no position information. Example: Query: New York City. Doc. 1: This is a nice city. Doc. 2: This city has a new library. Query: New AND York AND City. Doc. 1: New York has a new library. Doc. 2: The city of York has a new library. (Doc. 2 contains all three terms although it is not about New York City.)
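
The AND example on this slide can be reproduced with a tiny inverted index. This is a minimal sketch, not the lecture's implementation: the tokenizer (lowercasing, dropping periods) and the function names `build_index` and `boolean_and` are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, strip periods, split on whitespace.
        for token in text.lower().replace(".", " ").split():
            index[token].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# The two documents from the slide's second example.
docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)
result = boolean_and(index, ["New", "York", "City"])
print(sorted(result))
```

Because the index records only which terms occur in which document, the query returns Doc. 2 (about the city of York) and misses Doc. 1 (about New York), which is the bag-of-words weakness the slide points out.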

8 Further Query Types Phrases, e.g. New York City. Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg). Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree. Natural language vs. keywords. Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …). Spelling corrections, and some more (often application dependent).

9 Phrases Often used (esp. for web search): quotation marks, e.g. “New York City”. Advantage: easy to use and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2]). How do we support this? We need word positions. We need all original words (e.g. no stop word removal in University of Freiburg). We need an efficient way to do this.

10 Approaches to Support Phrases Biword indexes: Idea: store pairs of consecutive words (in addition to single terms), e.g. New York City is represented by the terms New, York, City, New York, York City. Might cause problems (false positives) for phrases with more than 2 words, but often works quite well. Positional indexes: Idea: store the positions of each word in the postings list.
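
The biword idea can be sketched in a few lines. This is an illustrative sketch only: the three sample documents, the tokenizer, and the names `build_biword_index` and `phrase_candidates` are assumptions, not material from the lecture.

```python
from collections import defaultdict

def tokenize(text):
    """Assumed tokenizer: lowercase, strip periods, split on whitespace."""
    return text.lower().replace(".", " ").split()

def build_biword_index(docs):
    """Map each pair of consecutive words to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Documents that contain every biword of the phrase (may over-match)."""
    tokens = tokenize(phrase)
    biwords = list(zip(tokens, tokens[1:]))
    if not biwords:
        return set()
    return set.intersection(*(index.get(b, set()) for b in biwords))

# Hypothetical documents chosen to show a false positive.
docs = {
    1: "New York City has a new library.",
    2: "The city of York has a new library.",
    3: "New York is far from York City.",
}
biword_index = build_biword_index(docs)
print(sorted(phrase_candidates(biword_index, "New York City")))
```

Document 3 contains both "New York" and "York City" but never the three words in a row, so it is returned as a candidate anyway. This is the kind of problem with phrases longer than two words that the slide mentions.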

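11 Positional Indexes – Example Without positions (dictionary term with its count, plus postings lists of document IDs): NEW 23535, YORK 9421, CITY 18453; postings lists such as 23, 25, 32, 47, … and 18, 23, 25, 47, … and 25, 47, 53, 55, …. With positions (docID:count[positions]): 23:4[3,12,46,78]; 25:3[43,120,221]; 32:6[12,20,57,200,322,481]; …. NEW 23535: …, 25:6[41,87,136,…], …. YORK 9421: …, 25:2[42,137], ….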
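
A phrase query over such an index checks for consecutive positions within the same document. The sketch below is an assumption-laden illustration: it uses only the positions that are legible on the slide (the elided entries are simply left out, and the first positional list is assumed to belong to CITY), and the function name `phrase_match` is hypothetical.

```python
def phrase_match(index, terms):
    """Return {doc_id: [start positions]} where the terms occur consecutively."""
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents that contain all terms can contain the phrase.
    common = set.intersection(*(set(index[t]) for t in terms))
    hits = {}
    for doc_id in sorted(common):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        starts = [
            p for p in index[terms[0]][doc_id]
            if all(p + k + 1 in later[k] for k in range(len(later)))
        ]
        if starts:
            hits[doc_id] = starts
    return hits

# Partial data: only the positions shown on the slide; elided entries omitted.
index = {
    "new": {25: [41, 87, 136]},
    "york": {25: [42, 137]},
    "city": {
        23: [3, 12, 46, 78],
        25: [43, 120, 221],
        32: [12, 20, 57, 200, 322, 481],
    },
}
print(phrase_match(index, ["new", "york", "city"]))
print(phrase_match(index, ["new", "york"]))
```

With these numbers, document 25 holds NEW, YORK, CITY at positions 41, 42, 43, so the three-word phrase matches once, while the two-word phrase "new york" matches at 41 and 136.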

12 Positional Indexes Also works for queries such as University [word]1 Freiburg and University NEAR Freiburg. Problem: size. Additional information (positions) must be stored on top of an already large index (stop words!). Approx. size: 2-4 times the original index, about 1/2 the size of the uncompressed documents [2]. In practice: combinations exist, e.g. an index with names as phrases, useful biwords, and stored positions.
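
A NEAR query can be answered from the same positional postings by comparing position differences. This is a rough sketch under stated assumptions: the window size `k`, the third sample document, the whitespace tokenizer, and the names `build_positional_index` and `near` are all made up for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def near(index, term_a, term_b, k):
    """Return {doc_id: [(pos_a, pos_b)]} where the terms are at most k words apart."""
    postings_a = index.get(term_a.lower(), {})
    postings_b = index.get(term_b.lower(), {})
    hits = {}
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        pairs = [
            (pa, pb)
            for pa in postings_a[doc_id]
            for pb in postings_b[doc_id]
            if pa != pb and abs(pa - pb) <= k
        ]
        if pairs:
            hits[doc_id] = pairs
    return hits

docs = {
    1: "University of Freiburg",
    2: "Albert-Ludwigs University Freiburg",
    3: "Freiburg has a large and old university",  # hypothetical non-match
}
pos_index = build_positional_index(docs)
print(near(pos_index, "University", "Freiburg", 2))
```

With `k = 2` the query finds both "University of Freiburg" and "Albert-Ludwigs University Freiburg", as on the slide, and skips the third document where the two words are six positions apart. Note that the stop word "of" has to be indexed for the positions to line up.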

13 Pattern Matching – Wildcards Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, …. Trailing wildcard queries, e.g. fußball*: matching terms can easily be found if the dictionary is stored as a B-tree. Leading wildcard queries, e.g. *meister: matching terms can easily be found if the dictionary is stored as a reverse B-tree (i.e. terms stored backwards).
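
Both lookups rely on the fact that all terms sharing a prefix are adjacent in sorted order. The sketch below swaps in a sorted Python list with binary search as a stand-in for the B-tree (an assumption made to keep the example short); the term list and function names are illustrative.

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """Return all entries of a sorted list that start with the given prefix."""
    start = bisect.bisect_left(sorted_terms, prefix)
    end = start
    # Terms with a common prefix form one contiguous run in sorted order.
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

terms = [
    "fußball", "fußballer", "fußballspiel", "fußballweltmeister",
    "federball", "meister", "weltmeister",
]
forward = sorted(terms)                      # stand-in for the B-tree
backward = sorted(t[::-1] for t in terms)    # stand-in for the reverse B-tree

def trailing_wildcard(prefix):
    """Answer queries of the form prefix*."""
    return prefix_matches(forward, prefix)

def leading_wildcard(suffix):
    """Answer queries of the form *suffix by searching the reversed terms."""
    return sorted(t[::-1] for t in prefix_matches(backward, suffix[::-1]))

print(trailing_wildcard("fußball"))
print(leading_wildcard("meister"))
```

The leading-wildcard case reverses the suffix, runs the same prefix search over the reversed dictionary, and reverses the hits back into readable terms.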

14 Wildcards (Cont.) General wildcards, e.g. f*ball (matches e.g. fußball, federball, …). Idea: move the * to the end. Permuterm index: for each word (e.g. fußball) add an end symbol (e.g. fußball$) and create all rotations (e.g. fußball$, ußball$f, ßball$fu, ball$fuß, …, l$fußbal, $fußball). Permuterm index: dictionary = all permuterms (rotations), postings = dictionary terms containing this rotation. Query: rotate the query until the * is at the end (e.g. f*ball becomes ball$f*) and get the postings from the permuterm index (e.g. ball$fuß, ball$feder, …).
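
The rotation step and the query rewrite can be sketched as follows. This is a simplified illustration: it handles exactly one `*` per query, and it scans all rotations linearly where a real system would do a prefix lookup in a tree; the names `rotations`, `build_permuterm`, and `wildcard_lookup` are assumptions.

```python
def rotations(term):
    """All rotations of term + '$', e.g. fußball$, ußball$f, ..., $fußball."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Map every rotation to the dictionary term it was generated from."""
    return {rot: term for term in terms for rot in rotations(term)}

def wildcard_lookup(permuterm, pattern):
    """Answer a query with exactly one '*' by rotating the '*' to the end."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch supports exactly one '*'")
    head, tail = pattern.split("*")
    key = tail + "$" + head  # f*ball -> ball$f (followed by the trailing *)
    # Linear scan for brevity; a B-tree would do this as a prefix search.
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})

terms = ["fußball", "fußballer", "federball", "meister"]
permuterm = build_permuterm(terms)
print(rotations("fußball"))
print(wildcard_lookup(permuterm, "f*ball"))
```

The query f*ball is rewritten to the prefix ball$f, which matches the rotations ball$fuß and ball$feder and therefore returns fußball and federball, but not fußballer.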

15 Structural Queries In practice: often semi-structured documents. Structural queries: use the available structure to better specify the information need, e.g. AUTHOR = Ottmann AND TEXT CONTAINS search tree. Requires storing structure information, e.g. in a parametric index, encoded in the dictionary: OTTMANN.AUTHOR → 9, 17, 19, …, 28; OTTMANN.BODY → 8, 9, 17, …, 23; OTTMANN.TITLE → 12, 26, 44, …, 48; or in the postings: OTTMANN → 8.BODY, 9.AUTHOR, 9.BODY, 12.TITLE, …
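
A structural query over the postings-encoded variant might look like the sketch below. Only the OTTMANN entries come from the slide; the postings for "search" and "tree" are invented for illustration, the function name `docs_with` is hypothetical, and TEXT CONTAINS is simplified to a conjunction of terms in the BODY field rather than a phrase match.

```python
# Each posting is a (doc_id, field) pair, i.e. the field is stored in the postings.
index = {
    "ottmann": {(8, "BODY"), (9, "AUTHOR"), (9, "BODY"), (12, "TITLE")},
    "search": {(9, "BODY"), (12, "BODY")},  # hypothetical postings
    "tree": {(8, "BODY"), (9, "BODY")},     # hypothetical postings
}

def docs_with(index, term, field):
    """Documents in which the term occurs inside the given field."""
    return {doc for doc, f in index.get(term.lower(), set()) if f == field}

# AUTHOR = Ottmann AND TEXT CONTAINS search tree (terms ANDed, not a phrase)
result = (
    docs_with(index, "Ottmann", "AUTHOR")
    & docs_with(index, "search", "BODY")
    & docs_with(index, "tree", "BODY")
)
print(sorted(result))
```

Document 8 mentions Ottmann only in the body and document 12 only in the title, so neither satisfies the AUTHOR constraint; only document 9 remains.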

16 Summary: Further Query Types Phrases, e.g. New York City. Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg). Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree. Natural language vs. keywords. Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …). Spelling corrections, and some more (often application dependent).

