Introduction to Information Retrieval, Lecture 2: The term vocabulary and postings lists (related to Chapter 2)

Recap of the previous lecture
 Basic inverted indexes:
 Structure: Dictionary and Postings
 Key step in construction: Sorting
 Boolean query processing
 Intersection by linear-time "merging"
 Simple optimizations
(Ch. 1)

Recall the basic indexing pipeline
 Documents to be indexed: Friends, Romans, countrymen.
 Tokenizer → Token stream: Friends Romans Countrymen
 Linguistic modules → Modified tokens: friend roman countryman
 Indexer → Inverted index: friend, roman, countryman
(First project)

Plan for this lecture
Elaborate basic indexing
 Preprocessing to form the term vocabulary
 Documents
 Tokenization
 What terms do we put in the index?
 Postings
 Faster merges: skip lists
 Positional postings and phrase queries

Parsing a document
 What format is it in? pdf/word/excel/html?
 What language is it in?
 What character set is in use?
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically.

Complications: Format/language
 Documents being indexed can include docs from many different languages
 A single index may have to contain terms of several languages.
 Sometimes a document or its components can contain multiple languages/formats
 French with a German pdf attachment.
 Document unit

TOKENS AND TERMS

Tokenization
 Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

Tokenization
 Input: "university of Qom, computer department"
 Output: tokens
 university
 of
 Qom
 computer
 department
 A token is a sequence of characters in a document
 Each such token is now a candidate for an index entry, after further processing (described below)
 But what are valid tokens to emit?
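As a concrete sketch, a minimal tokenizer for the example above might split on runs of non-word characters (the exact splitting rule is an illustrative choice, not what a production system would use):

```python
import re

def tokenize(text):
    # Split on any run of characters that is neither a word character nor
    # an apostrophe; keeping the apostrophe leaves "Iran's" as one token.
    return [t for t in re.split(r"[^\w']+", text) if t]

print(tokenize("university of Qom, computer department"))
# → ['university', 'of', 'Qom', 'computer', 'department']
```

Even this toy version already forces the design decisions discussed in the following slides: what to do with apostrophes, hyphens, and spaces.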

Issues in tokenization
 Apostrophe: Iran's capital → Iran? Irans? Iran's?
 Hyphen
 Hewlett-Packard: Hewlett and Packard as two tokens?
 the hold-him-back-and-drag-him-away maneuver: break up the hyphenated sequence.
 co-education
 lowercase, lower-case, lower case?
 Space
 San Francisco: how do you decide it is one token?

Issues in tokenization
 Numbers
 Older IR systems may not index numbers
 But often very useful: think about things like looking up error codes/stack traces on the web
 (One answer is using n-grams: Lecture 3)
 Will often index "meta-data" separately
 Creation date, format, etc.
 3/12/91, Mar. 12, 1991, 12/3/91
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333

Language issues in tokenization
 French
 L'ensemble: one token or two? L? L'? Le?
 Want l'ensemble to match with un ensemble
 Until at least 2003, it didn't on Google
 German noun compounds are not segmented
 Lebensversicherungsgesellschaftsangestellter ('life insurance company employee')
 German retrieval systems benefit greatly from a compound splitter module
 Can give a 15% performance boost for German

Language issues in tokenization
 Chinese and Japanese have no spaces between words:
 莎拉波娃现在居住在美国东南部的佛罗里达。 ('Sharapova now lives in Florida, in the southeastern United States.')
 A unique tokenization is not always guaranteed
 Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji):
 フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )

Language issues in tokenization
 Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
 With modern Unicode representation concepts, the order of characters in files matches the conceptual order, and the reversal of displayed characters is handled by the rendering system; this may not be true for documents in older encodings.
 Other complexities that you know!

Stop words
 With a stop list, you exclude from the dictionary entirely the commonest words.
 Intuition: they have little semantic content: the, a, and, to, be
 Using a stop list significantly reduces the number of postings that a system has to store, because such words account for a large fraction of all postings.

Stop words
 But you need them for:
 Phrase queries: "President of Iran"
 Various song titles, etc.: "Let it be", "To be or not to be"
 "Relational" queries: "flights to London"
 The general trend in IR systems: from large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list.
 Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
 Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.

Normalization to terms
 We want to match I.R. and IR
 Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
 The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

Normalization to terms
 One way is to use equivalence classes:
 A search for one term will retrieve documents that contain any member of its class.
 We most commonly define equivalence classes of terms implicitly, rather than fully calculating them in advance (hand-constructed), e.g.:
 deleting periods to form a term: U.S.A., USA
 deleting hyphens to form a term: anti-discriminatory, antidiscriminatory

Normalization: other languages
 Accents: e.g., French résumé vs. resume.
 Umlauts: e.g., German Tuebingen vs. Tübingen
 Normalization of things like date forms: 7月30日 vs. 7/30
 Tokenization and normalization may depend on the language and so are intertwined with language detection
 Crucial: need to "normalize" indexed text as well as query terms into the same form

Case folding
 Reduce all letters to lower case
 Exception: upper case in mid-sentence?
 Often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization…
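A minimal sketch of implicit equivalence classing, combining the period/hyphen deletion above with case folding (the exact rule set is illustrative; real systems tune these rules per language):

```python
def normalize(token):
    # Implicit equivalence classing: delete periods and hyphens, then
    # case-fold. Any two tokens that map to the same string are treated
    # as the same term.
    return token.replace(".", "").replace("-", "").lower()

assert normalize("U.S.A.") == normalize("USA") == "usa"
assert normalize("anti-discriminatory") == normalize("antidiscriminatory")
assert normalize("I.R.") == normalize("IR") == "ir"
```

Applying the same function to both indexed text and query terms is what guarantees matches, per the "crucial" point above.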

Normalization to terms
 What is the disadvantage of equivalence classing?
 An alternative to equivalence classing is asymmetric expansion (hand-constructed)
 An example of where this may be useful:
 Enter: window → Search: window, windows
 Enter: windows → Search: Windows, windows, window
 Enter: Windows → Search: Windows
 Potentially more powerful, but less efficient

Thesauri and soundex
 How do we handle synonyms and homonyms?
 E.g., by hand-constructed equivalence classes: car = automobile, color = colour
 What about spelling mistakes?
 One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
 More in lectures 3 and 9

Review
 IR systems:
 Indexing
 Searching
 Indexing:
 Parsing documents
 Tokenization → tokens
 Normalization → terms
 Indexing → index

Review
 Normalization: applied to tokens as well as the query.
 Examples:
 Case
 Hyphens
 Periods
 Synonyms
 Spelling mistakes

Review
 Two methods for normalization:
 Equivalence classing: often implicit.
 Asymmetric expansion:
 At query time: a query-expansion dictionary; more processing at query time
 At indexing time: more space for storing postings.
 Asymmetric expansion is considerably less efficient than equivalence classing but more flexible.

Stemming and lemmatization
 Documents are going to use different forms of a word, such as organize, organizes, and organizing.
 Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
 Reduce terms to their "roots" before indexing. E.g.:
 am, are, is → be
 car, cars, car's, cars' → car
 the boy's cars are different colors → the boy car be different color

Stemming
 "Stemming" suggests crude affix chopping
 Language dependent
 Examples: Porter's algorithm, the Lovins stemmer

Porter's algorithm
 The commonest algorithm for stemming English
 Results suggest it's at least as good as other stemming options
 Conventions + 5 phases of reductions
 Phases are applied sequentially
 Each phase consists of a set of commands
 Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter
 sses → ss: presses → press
 ies → i: bodies → bodi
 ss → ss: press → press
 s → (dropped): cats → cat
 Many other rules are sensitive to the measure of words:
 (m>1) EMENT → (dropped)
 replacement → replac
 cement → cement
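The four s-suffix rules above form one compound command; a minimal sketch in Python, with the longest-suffix convention enforced by trying the rules longest first:

```python
def porter_step1a(word):
    # Sample Porter rules: of the rules in a compound command, the one
    # that applies to the longest suffix wins (hence longest-first order).
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("presses", "bodies", "press", "cats"):
    print(w, "->", porter_step1a(w))
# presses -> press, bodies -> bodi, press -> press, cats -> cat
```

Note the ss → ss rule is not a no-op ornament: it exists to shadow the bare s rule, so "press" does not lose its final s.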

Lemmatization
 Reduce inflectional/variant forms properly to the base form (lemma), with the use of a vocabulary and morphological analysis of words
 Lemmatizer: a tool from natural language processing which does full morphological analysis to accurately identify the lemma for each word.

Stemming vs. lemmatization
 saw:
 Stemming might return just s.
 Lemmatization would attempt to return either:
 see, if the token was used as a verb
 saw, if the token was used as a noun

Helpfulness of normalization
 Do stemming and other normalizations help?
 Definitely useful for Spanish, German, Finnish, …
 30% performance gains for Finnish!
 What about English?

Helpfulness of normalization
 English:
 Not so much help!
 Helps a lot for some queries, hurts performance a lot for others.
 Stemming helps recall but harms precision:
 operative (dentistry) ⇒ oper
 operational (research) ⇒ oper
 operating (systems) ⇒ oper
 For a case like this, moving to a lemmatizer would not completely fix the problem

Project 2
 Find rules for normalizing Farsi documents and implement them.

Exercise
 Are the following statements true or false? Why?
 a. In a Boolean retrieval system, stemming never lowers precision.
 b. In a Boolean retrieval system, stemming never lowers recall.
 c. Stemming increases the size of the vocabulary.
 d. Stemming should be invoked at indexing time but not while processing a query.

Language-specificity
 Many of the above features embody transformations that are
 Language-specific and
 Often, application-specific
 These are "plug-in" addenda to the indexing process
 Both open-source and commercial plug-ins are available for handling these

FASTER POSTINGS MERGES: SKIP LISTS

Recall basic merge
 Walk through the two postings lists simultaneously, in time linear in the total number of postings entries
 Brutus: 2, 4, 8, 16, 32, 64, 128
 Caesar: 1, 2, 3, 5, 8, 13, 21, 34
 If the list lengths are m and n, the merge takes O(m+n) operations.
 Can we do better? Yes
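The linear merge can be sketched as follows, using the example lists above:

```python
def intersect(p1, p2):
    # Walk both sorted postings lists with two pointers: O(m+n) steps.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1       # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # → [2, 8]
```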

Augment postings with skip pointers (at indexing time)
 Why? To skip postings that will not figure in the search results.
 How?
 Where do we place skip pointers?
 The resulting list is a skip list

Query processing with skip pointers
 Suppose we've stepped through the lists until we process 8 on each list. We match it and advance. We then have 41 on the upper list and 11 on the lower. 11 is smaller.
 But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips?
 Tradeoff:
 More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
 Fewer skips → fewer pointer comparisons, but then long skip spans → few successful skips.

Placing skips
 Simple heuristic: for a postings list of length L, use √L evenly spaced skip pointers.
 This ignores the distribution of query terms.
 Easy if the index is relatively static; harder if L keeps changing because of updates.
 This definitely used to help; with modern hardware it may not (Bahle et al. 2002)
 The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002.
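A sketch of intersection with skips, using the √L heuristic above. For simplicity the skip span is computed from the list length rather than stored as pointers in the postings, which is how a real index would do it:

```python
import math

def intersect_with_skips(p1, p2):
    # sqrt(L) evenly spaced skips, simulated by jumping the index by the
    # skip span while the skip target is still <= the other list's value.
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1            # take the skip pointer
            else:
                i += 1                 # no useful skip: plain advance
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips(list(range(2, 100, 2)), [8, 40, 41]))  # → [8, 40]
```

The result is identical to the plain merge; only the number of comparisons changes, which is the whole point of the tradeoff on the previous slide.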

Exercise
 Do exercises 2.5 and 2.6 of your book.

PHRASE QUERIES AND POSITIONAL INDEXES

Phrase queries
 We want to be able to answer queries such as "stanford university" as a phrase
 Thus the sentence "I went to university at Stanford" is not a match.
 Most recent search engines support a double-quotes syntax

Phrase queries
 The phrase query has proven to be very easily understood and successfully used by users.
 As many as 10% of web queries are phrase queries,
 and many more are implicit phrase queries
 For this, it no longer suffices to store only <term : docs> entries
 Solutions?

A first attempt: biword indexes
 Index every consecutive pair of terms in the text as a phrase
 For example, the text "Qom computer department" would generate the biwords:
 Qom computer
 computer department
 Each of these biwords is now a dictionary term
 Two-word phrase query processing is now immediate.
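Generating the biword dictionary terms from a token stream is a one-line sketch:

```python
def biwords(tokens):
    # Every consecutive pair of tokens becomes a single dictionary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["Qom", "computer", "department"]))
# → ['Qom computer', 'computer department']
```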

Longer phrase queries
 The query "modern information retrieval course" can be broken into the Boolean query on biwords: modern information AND information retrieval AND retrieval course
 This works fairly well in practice,
 but there can and will be occasional false positives.

Extended biwords
 Now consider phrases such as "student of the computer"
 Perform part-of-speech tagging (POST).
 POST classifies words as nouns, verbs, etc.
 Group the terms into (say) nouns (N) and articles/prepositions (X).

Extended biwords
 Call any string of terms of the form NX*N an extended biword ("student of the computer" is N X X N)
 Each such extended biword is made a term in the vocabulary
 Segment queries into enhanced biwords

Issues for biword indexes
 False positives, as noted before
 Index blowup due to a bigger dictionary
 Infeasible for more than biwords, big even for them
 Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: positional indexes
 In the postings, store for each term the position(s) at which its tokens appear:
<term, number of docs containing term; doc1: position1, position2, …; doc2: position1, position2, …; etc.>

Positional index example
 For phrase queries, we need to deal with more than just equality
<be: …; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
 Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Processing a phrase query
 Extract inverted index entries for each distinct term: to, be, or, not.
 Merge their doc:position lists:
 to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; …
 be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; …
 The same general method works for proximity searches
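A sketch of the positional merge for exact phrases, using a toy in-memory index; the dict-of-dicts layout is an illustrative simplification of real postings lists:

```python
def phrase_docs(index, phrase):
    # index maps term -> {doc_id: sorted list of positions}.
    # A doc matches if some position p of the first term has the k-th
    # phrase term at position p + k for every later term.
    docs = set(index[phrase[0]])
    for term in phrase[1:]:
        docs &= set(index[term])          # candidates contain all terms
    hits = []
    for d in sorted(docs):
        starts = set(index[phrase[0]][d])
        for k, term in enumerate(phrase[1:], start=1):
            positions = set(index[term][d])
            starts = {p for p in starts if p + k in positions}
        if starts:
            hits.append(d)
    return hits

index = {
    "stanford":   {1: [3], 2: [10]},
    "university": {1: [4], 2: [40]},
}
print(phrase_docs(index, ["stanford", "university"]))  # → [1]
```

Relaxing the `p + k` test to a window (e.g. `abs(q - p) <= k`) gives the proximity-query merge mentioned above.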

Proximity queries
 LIMIT /3 STATUTE /3 FEDERAL /2 TORT
 /k means "within k words of (on either side)".
 Clearly, positional indexes can be used for such queries; biword indexes cannot.
 Figure 2.12: the merge of postings to handle proximity queries.
 This is a little tricky to do correctly and efficiently

Positional index size
 Need an entry for each occurrence, not just once per document
 Index size depends on average document size
 The average web page has <1000 terms
 Books, even some epic poems, easily reach 100,000 terms
 Consider a term with frequency 0.1%:

Document size | Postings | Positional postings
1,000         | 1        | 1
100,000       | 1        | 100

Rules of thumb
 A positional index is 2–4 times as large as a non-positional index
 Positional index size is 35–50% of the volume of the original text
 Caveat: all of this holds for "English-like" languages

Positional index size
 You can compress position values/offsets: we'll talk about that in lecture 5
 Nevertheless, a positional index expands postings storage substantially
 Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.

Combination schemes
 These two approaches can be profitably combined
 For particular phrases ("Hossein Rezazadeh") it is inefficient to keep on merging positional postings lists

Combination schemes
 Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
 A typical web query mixture was executed in ¼ of the time of using just a positional index
 It required 26% more space than a positional index alone
H.E. Williams, J. Zobel, and D. Bahle. "Fast Phrase Querying with Combined Indexes", ACM Transactions on Information Systems, 2004.