
Preparing the Text for Indexing

Collecting the Documents
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend, roman, countryman
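The four-stage pipeline above can be sketched in a few lines. This is a toy, assuming naive helpers (`tokenize`, `normalize`, `index` are illustrative names, and the plural-chopping rule stands in for the real linguistic modules):

```python
from collections import defaultdict

def tokenize(text):
    # Tokenizer: turn raw text into a token stream
    return text.replace(",", " ").replace(".", " ").split()

def normalize(tokens):
    # Linguistic modules: case-fold and crudely reduce plurals
    out = []
    for t in tokens:
        t = t.lower()
        if t.endswith("men"):
            t = t[:-3] + "man"   # countrymen -> countryman
        else:
            t = t.rstrip("s")    # friends -> friend, romans -> roman
        out.append(t)
    return out

def index(docs):
    # Indexer: map each modified token to a sorted list of doc IDs
    inverted = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in normalize(tokenize(text)):
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

print(index(["Friends, Romans, countrymen."]))
# {'friend': [1], 'roman': [1], 'countryman': [1]}
```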

What exactly is "a document"? What format is it in (pdf/word/excel/html)? What language is it in? What character set is in use? How easy is it to "read it in"? Each of these is a classification problem that can be tackled with machine learning, but in practice these tasks are often done heuristically.

Complications: Format/language
What is a unit document? A file? A book? A chapter of a book? An email? An email with 5 attachments? A group of files (PPT or LaTeX rendered as HTML)? Documents being indexed can come from many different languages, so an index may have to contain terms of several languages. Sometimes a document or its components contain multiple languages/formats: a French email with a German pdf attachment.

Getting the Tokens
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend, roman, countryman

Words, Terms, Tokens, Types
Word – a delimited string of characters as it appears in the text.
Term – a "normalized" word (case, morphology, spelling, etc.); an equivalence class of words.
Token – an instance of a word or term occurring in a document.
Type – the same as a term in most cases; an equivalence class of tokens.

Tokenization
Input: "Friends, Romans and Countrymen"
Output: tokens Friends, Romans, Countrymen
Each such token is now a candidate for an index entry, after further processing. But what are valid tokens to emit?
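A minimal tokenizer along these lines can be a single regular expression; this sketch treats maximal runs of letters, digits, apostrophes, and hyphens as candidate tokens (one of many possible policies, not the one the slides prescribe):

```python
import re

def tokens(text):
    # Emit maximal runs of letters/digits/apostrophes/hyphens
    # as candidate tokens; punctuation and spaces are delimiters.
    return re.findall(r"[A-Za-z0-9'-]+", text)

print(tokens("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```

Note that this policy already answers some of the questions on the next slide: it keeps Finland's as one token and Hewlett-Packard as one token, whether or not that is what you want.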

What is a valid token?
Finland's capital → Finland? Finlands? Finland's?
Hewlett-Packard → Hewlett and Packard as two tokens?
State-of-the-art → break up the hyphenated sequence? co-education → ?
San Francisco: one token or two? San Francisco-Los Angeles highway: how many tokens?
Dr. Summer's address is 35 Winter St., RI, USA.

Tokenization: Numbers
Examples: 3/12/91; Mar. 12, 1991; 55 B.C.; B-52; My PGP key is 324a3df234cb23e.
Often we don't index numbers as text, but they are often very useful, e.g., for looking up error codes on the web. Often we index "meta-data" (creation date, format, etc.) separately.

Tokenization: language issues
East Asian languages (e.g., Chinese and Japanese) have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 A unique tokenization is not always guaranteed.
Semitic languages (Arabic, Hebrew) are written right to left, but certain items (e.g., numbers) are written left to right. Words are separated, but letter forms within a word form complex ligatures: استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي. 'Algeria achieved its independence in 1962 after 132 years of French occupation.'

Making sense of the Tokens
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend, roman, countryman

Linguistic Processing
Normalization
Capitalization/Case-folding
Stop words
Stemming
Lemmatization

Linguistic Processing: Normalization
We need to "normalize" terms in the indexed text and in query terms into the same form: we want to match U.S.A. and USA. Most commonly we define equivalence classes of terms, e.g., by deleting periods in a term. The alternative is to do asymmetric expansion:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows
Enter: Windows → Search: Windows
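Both approaches can be sketched in a few lines. The period-deleting rule comes from the slide; the expansion table is a hand-written illustration of the asymmetric scheme, and the names `equivalence_class` / `EXPANSIONS` are not from the slides:

```python
def equivalence_class(term):
    # One simple equivalence-classing rule: delete periods,
    # so "U.S.A." and "USA" map to the same dictionary entry.
    return term.replace(".", "")

# Asymmetric expansion: what we search for depends on exactly
# what the user typed (this table is a toy example).
EXPANSIONS = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],
}

print(equivalence_class("U.S.A."))          # USA
print(EXPANSIONS.get("windows"))            # ['Windows', 'windows']
```

Equivalence classing is done once at index time; asymmetric expansion is done at query time and can be more precise, at the cost of a larger query.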

Normalization: other languages
Accents: résumé vs. resume. The most important criterion: how are your users likely to write their queries for these words? Even in languages that standardly have accents, users often may not type them. German: Tuebingen vs. Tübingen should be equivalent.

Linguistic Processing: Case folding
Reduce all letters to lower case. Exception: upper case in mid-sentence, e.g., General Motors; Fed vs. fed; SAIL vs. sail. It is often best to lower-case everything, since users will use lowercase regardless of 'correct' capitalization.

Linguistic Processing: Stop Words
With a stop list, you exclude the commonest words from the dictionary entirely. Why do it: they have little semantic content (the, a, and, to, be), and they take a lot of space: ~30% of postings for the top 30 words (you will measure this!). But the trend is away from doing this. You need stop words for:
Phrase queries: "King of Denmark"
Various song titles, etc.: "Let it be", "To be or not to be"
"Relational" queries: "flights to London"
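A stop-list filter is a one-liner; the tiny stop set below is illustrative, not a standard list. Note how it destroys exactly the kind of query the slide warns about:

```python
# A toy stop list; real systems use curated lists or corpus statistics.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in"}

def remove_stop_words(tokens):
    # Drop any token whose lowercase form is on the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))
# ['or', 'not']
```

"To be or not to be" collapses to "or not", which is why phrase queries push systems away from stop-word removal.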

Linguistic Processing: Stemming
Reduce terms to their "roots" before indexing. "Stemming" suggests crude affix chopping; it is language dependent. E.g., automate(s), automatic, automation are all reduced to automat.
Porter's Algorithm is the commonest algorithm for stemming English; results suggest it is at least as good as other stemming options. You can find the algorithm and several implementations online.

Typical rules in Porter
Rule            Example
sses → ss       caresses → caress
ies → i         butterflies → butterfli
ing →           meeting → meet
tional → tion   intentional → intention
Weight-of-word sensitive rules (m > 1):
(m>1) EMENT →   replacement → replac, but cement → cement (unchanged)
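The suffix rules in the table can be sketched as a rule list; this is a toy illustration of the rule mechanism only, not the full Porter algorithm (it omits the word-measure m, rule ordering within steps, and all later steps):

```python
# Each rule is (suffix, replacement); first matching rule wins.
RULES = [("sses", "ss"), ("ies", "i"), ("tional", "tion"), ("ing", "")]

def stem(word):
    # Apply the first suffix rule that matches; otherwise leave unchanged.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "butterflies", "meeting", "intentional"]:
    print(w, "->", stem(w))
# caresses -> caress
# butterflies -> butterfli
# meeting -> meet
# intentional -> intention
```

The real algorithm guards rules like EMENT → (null) with the measure m, which is why replacement becomes replac but cement stays cement.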

An Example of Stemming
Before: After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance.
After: after introduc a gener search engin architectur, we examin each engin compon in turn. we cover crawl, local web page storag, index, and the us of link analysi for boost search perform.

Linguistic Processing: Lemmatization
Reduce inflectional/variant forms to the base form. E.g., am, are, is → be; car, cars, car's, cars' → car. Lemmatization implies doing "proper" reduction to the dictionary headword form: the boy's cars are different colors → the boy car be different color.
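The smallest possible lemmatizer is a lookup table; the hand-written mapping below just covers the slide's examples, whereas a real lemmatizer needs a full dictionary plus morphological analysis:

```python
# Toy lookup-based lemmatizer; the mapping is hand-written from
# the examples above, not a real morphological dictionary.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(token):
    # Fall back to the token itself when it is not in the table.
    return LEMMAS.get(token, token)

print([lemmatize(t) for t in ["am", "are", "is", "cars"]])
# ['be', 'be', 'be', 'car']
```

The contrast with the stemmer above: lemmatization returns real dictionary headwords (be, car), while stemming may return non-words (butterfli, automat).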

Phrase queries
We want to answer queries such as "stanford university" – as a phrase. Thus the sentence "I went to university at Stanford" is not a match. The concept of phrase queries has proven easily understood by users – they use quotes, and about 10% of web queries are phrase queries. On average a query is 2.3 words long. (Is that still the case?) It no longer suffices to store only <term: docs> entries.

Phrase queries via Biword Indexes
Biwords: index every consecutive pair of terms in the text as a phrase. For example, the text "Friends, Romans, Countrymen" would generate the biwords friends romans and romans countrymen. Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate (it works exactly like the one-term process).
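Building a biword index is a small extension of the basic indexer; this sketch uses naive whitespace tokenization and case-folding (the function name and the preprocessing choices are illustrative):

```python
from collections import defaultdict

def biword_index(docs):
    # Index every consecutive pair of terms as one dictionary entry.
    idx = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        terms = text.lower().replace(",", "").rstrip(".").split()
        for a, b in zip(terms, terms[1:]):
            idx[f"{a} {b}"].add(doc_id)
    return idx

idx = biword_index(["Friends, Romans, Countrymen."])
print(sorted(idx))
# ['friends romans', 'romans countrymen']
```

A two-word phrase query is then a single dictionary lookup; longer phrases become a conjunction of biword lookups, with the false-positive risk discussed on the next slide.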

Longer phrase queries
stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto. Without examining each of the returned docs, we cannot verify that the docs matching this Boolean query actually contain the phrase. We can have false positives! (Why?)

Issues for biword indexes
False positives, as noted before, and index blowup due to a bigger dictionary. For an extended biword index, longer queries must be parsed into conjunctions: e.g., the query tangerine trees, marmalade skies is parsed into tangerine trees AND trees marmalade AND marmalade skies. This is not a standard solution (for all biwords).

Better solution: Positional indexes
Store, for each term, entries of the form: <number of docs containing term; doc1: number of occurrences {position1, position2, …}; doc2: …; etc.>
Example: <be: ; 1: 6 {7, 18, 33, 72, 86, 231}; 2: 2 {3, 149}; 4: 5 {17, 191, 291, 430, 434}; 5: 2 {363, 367}; …>
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Processing a phrase query
Merge their doc:position lists to enumerate all positions of "to be or not to be".
to: 2: 5 {1, 17, 74, 222, 551}; 4: 5 {8, 16, 190, 429, 433}; 7: 3 {13, 23, 191}; …
be: 1: 2 {17, 19}; 4: 5 {17, 191, 291, 430, 434}; 5: 3 {14, 19, 101}; …
The same general method works for proximity searches.
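The positional merge can be sketched as follows; the index layout `{term: {doc_id: [positions]}}` and the function name are illustrative, and for clarity this version checks adjacency directly rather than doing the two-pointer walk a production system would use:

```python
def phrase_docs(index, phrase):
    # Return the sorted doc IDs where the terms of `phrase`
    # occur at consecutive positions.
    terms = phrase.split()
    # Candidate docs: those containing every term.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc in candidates:
        starts = index[terms[0]][doc]
        # A phrase match starts at p if term i occurs at p + i for all i.
        if any(all(p + i in index[t][doc] for i, t in enumerate(terms))
               for p in starts):
            hits.append(doc)
    return sorted(hits)

# Toy positional index built from the position lists on this slide.
index = {
    "to": {4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(index, "to be"))
# [4]  (e.g., "to" at 16 followed by "be" at 17)
```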

Extra Slides

Combination schemes
A positional index expands postings storage substantially. (Why?) Biword indexes and positional indexes can be profitably combined: for particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep merging positional postings lists – even more so for phrases like "The Who" – but the positional index remains useful for phrases like "miserable failure".

Some Statistics
[Table comparing Google result counts and query times in 2013 vs. 2011 for the queries britney spears, emmy noether, the who, wellesley college, worcester college, fast cars, slow cars, and the nonsense query sfjvrbfb wrla; the numeric values did not survive transcription.]

IR: The not-so-exciting part
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend, roman, countryman

Recap: The Inverted Index
For each term t, we store a sorted list of all documents that contain t.
[Figure: the dictionary of terms on the left, the postings list for each term on the right.]

Intersecting two postings lists
[Figure: the linear merge of two sorted postings lists.]
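The intersection this slide illustrates is the classic linear merge over two sorted postings lists, O(len(p1) + len(p2)); a sketch:

```python
def intersect(p1, p2):
    # Walk both sorted postings lists with two pointers,
    # emitting the doc IDs that appear in both.
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))
# [2, 31]
```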

Does Google use the Boolean model?
The default interpretation of a query [w1 w2 … wn] is w1 AND w2 AND … AND wn. You also get hits that do not contain one of the wi when: the anchor text matches; the page contains a variant of wi (morphology, spelling correction, synonym); the query is long (n > 10?); the Boolean expression generates very few hits.
Boolean vs. ranking of the result set: most well-designed Boolean engines rank the Boolean result set – they rank "good hits" higher than "bad hits" (according to some estimator of relevance).