Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea (Note: Some of the slides in this slide set were adapted from an IR course taught by Prof. Ray Mooney at UT Austin)

IR System Architecture
[Flattened architecture diagram. Recoverable components: User Interface (takes the User Need and Text Query, returns Retrieved Docs); Text Operations (produce the logical view of texts and queries); Query Operations; Searching and Ranking over the Index (yielding Ranked Docs, refined via User Feedback); Indexing and the Text Database Manager, backed by an inverted file.]

Text Processing Pipeline
[Flattened pipeline diagram:]
Documents to be indexed OR user query: “Friends, Romans, countrymen.”
  => Tokenizer => token stream: Friends, Romans, Countrymen
  => Linguistic modules => modified tokens: friend, roman, countryman
  => Indexer => inverted index: friend, roman, countryman

From Text to Tokens to Terms
Tokenization = segmenting text into tokens:
–token = a sequence of characters, in a particular document at a particular position.
–type = the class of all tokens that contain the same character sequence.
  “... to be or not to be...” “... so be it, he said...” — “be” occurs as 3 tokens here, but is only 1 type.
–term = a (normalized) type that is included in the IR dictionary.
Example:
–text = “I slept and then I dreamed”
–tokens = I, slept, and, then, I, dreamed (6 tokens)
–types = I, slept, and, then, dreamed (5 types)
–terms = sleep, dream (after stopword removal and normalization)
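A minimal Python sketch of the token/type distinction from the example above (whitespace tokens, no normalization):

```python
# Count tokens (every occurrence) vs. types (distinct character sequences).
text = "I slept and then I dreamed"
tokens = text.split()           # ['I', 'slept', 'and', 'then', 'I', 'dreamed']
types = set(tokens)             # {'I', 'slept', 'and', 'then', 'dreamed'}
print(len(tokens), len(types))  # -> 6 5  ("I" occurs twice: 6 tokens, 5 types)
```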

Simple Tokenization Analyze text into a sequence of discrete tokens (words). Sometimes punctuation ( ), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. –However, frequently they are not. Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens. More careful approach: –Separate ? ! ; : “ ‘ [ ] ( ) –Care with. - why? when? –Care with … ??

Tokenization
Apostrophes are ambiguous:
–possessive constructions: the book’s cover => the book s cover
–contractions: he’s happy => he is happy; aren’t => are not
–quotations: ‘let it be’ => let it be
Whitespace in proper names or collocations:
–San Francisco => San_Francisco — how do we determine when it should be a single token?

Tokenization
Hyphenation:
–co-education => co-education (keep as one token)
–state-of-the-art => state of the art? state_of_the_art?
–lowercase, lower-case, lower case => lower_case
–Hewlett-Packard => Hewlett_Packard? Hewlett Packard?
Periods:
–Abbreviations: Mr., Dr.
–Acronyms: U.S.A.
–File names: a.out

Tokenization
Numbers:
–3/12/91
–Mar. 12, 1991
–55 B.C.
–B-52
Unusual strings that should be recognized as tokens:
–C++, C#, B-52, C4.5, M*A*S*H

Tokenization
Tokenizing HTML:
–Should text in HTML commands not typically seen by the user be included as tokens? Words appearing in URLs. Words appearing in the “meta text” of images.
–Simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization.
Note: it’s important to use the same tokenization rules for the queries and the documents.
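A rough sketch of the simplest approach using Python's standard html.parser module, dropping everything between “<” and “>” and keeping only the visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes, discarding all tag information."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(strip_tags("<p>Friends, <b>Romans</b>, countrymen.</p>"))
# -> 'Friends,  Romans , countrymen.'
```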

Tokenization is Language Dependent
Need to know the language of the document/query:
–Language identification, based on classifiers trained on short character subsequences as features, is highly effective.
French (reduced definite article, postposed clitic pronouns): l’ensemble, un ensemble, donne-moi.
German (compound nouns), need a compound splitter:
–Computerlinguistik (“computational linguistics”)
–Lebensversicherungsgesellschaftsangestellter (“life insurance company employee”)
Compound splitting for German: usually implemented by finding segments that match against dictionary entries.
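A toy sketch of dictionary-based compound splitting; the function and the two-word lexicon are made up for illustration, and a real splitter must also handle German linking elements (e.g., the -s- in Lebensversicherungs-):

```python
def split_compound(word, lexicon, min_len=3):
    # Try to cover the word with dictionary entries, longest prefix first.
    if word in lexicon:
        return [word]
    for i in range(len(word) - min_len, min_len - 1, -1):
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            tail = split_compound(rest, lexicon, min_len)
            if tail:
                return [prefix] + tail
    return None  # no segmentation found

lexicon = {"computer", "linguistik"}
print(split_compound("computerlinguistik", lexicon))
# -> ['computer', 'linguistik']
```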

Tokenization is Language Dependent
East Asian languages need a word segmenter:
–莎拉波娃现在居住在美国东南部的佛罗里达。 (“Sharapova now lives in Florida, in the southeastern United States.”)
–A unique tokenization is not always guaranteed; complicated in Japanese, with multiple alphabets intermingled.
Word segmentation for Chinese: ML sequence tagging models trained on manually segmented text: logistic regression, HMMs, conditional random fields.
Multiple segmentations are possible: the two characters 和尚 can either be treated as one word (“monk”) or as a sequence of two words (“and” and “still”).
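A simple greedy baseline for word segmentation is maximum matching (much weaker than the ML sequence taggers named above); the tiny dictionary below is illustrative only:

```python
def max_match(text, dictionary, max_word_len=4):
    # Greedy longest-match-first: at each position take the longest
    # dictionary entry, falling back to a single character.
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

dictionary = {"莎拉波娃", "现在", "居住", "在", "美国"}
print(max_match("莎拉波娃现在居住在美国", dictionary))
# -> ['莎拉波娃', '现在', '居住', '在', '美国']
```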

Tokenization is Language Dependent
Agglutinative languages (e.g., Inuktitut):
–Tusaatsiarunnanngittualuujunga = “I can’t hear very well.”
This long word is composed of a root word tusaa- (“to hear”) followed by five suffixes:
–-tsiaq- (“well”)
–-junnaq- (“be able to”)
–-nngit- (“not”)
–-tualuu- (“very much”)
–-junga (1st person singular present indicative non-specific)

Tokenization is Language Dependent
Arabic and Hebrew:
–Written right to left, but with certain items like numbers written left to right.
–Words are separated, but letter forms within a word form complex ligatures.
[Example: an Arabic sentence rendered right to left except for the embedded numbers; translation: “Algeria achieved its independence in 1962 after 132 years of French occupation.”]

Language Identification
–Simplest (and often most effective) approach: calculate the likelihood of each candidate language, based on training data.
–Given a sequence of words (or characters) w_1 ... w_n, for each candidate language calculate:
  P(w_1 ... w_n) ≈ Π_{k=1..n} P(w_k | w_{k-1}),  with w_0 a start-of-text marker
  P(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})
  where C(w_{k-1} w_k) and C(w_{k-1}) are simple frequency counts collected from that language’s training data.
–Choose the language that maximizes the probability.

Language Identification – Avoiding Zero Counts
What if a word/character (or a pair of words/characters) never occurred in the training data? Zero counts mislead your total probability.
Use smoothing: simple techniques that add a (very) small quantity to every count:
  P(w_k | w_{k-1}) = [C(w_{k-1} w_k) + 1] / [C(w_{k-1}) + V]
–where C(w_{k-1} w_k) and C(w_{k-1}) are simple frequency counts collected from that language’s training data
–V is the vocabulary size, i.e., the total number of unique words (or characters) in the training data
A tip: sum log(P) values to avoid very small numbers.

Exercise
French: ainsi de la même manière que l'on peut dire que le Royaume-Uni est un pays, on peut dire que l'Angleterre est un pays.
English: the same way that we can say that United Kingdom is a country, we can also say that England is a country.
Which language is the test string “this” more likely to come from?
Assume a character-level bigram model. Apply add-one smoothing. Assume V for French is 27 and V for English is 23.
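A possible sketch of this exercise in Python, implementing the smoothed character-bigram model from the previous slides; note that it derives V from the training text rather than using the assumed values of 27 and 23:

```python
import math
from collections import Counter

def train(text):
    # Character-bigram counts C(c1 c2) and unigram counts C(c1).
    return Counter(zip(text, text[1:])), Counter(text)

def log_prob(text, model):
    bigrams, unigrams = model
    V = len(unigrams)  # vocabulary size for add-one smoothing
    # Sum of log P(c2 | c1), smoothed, to avoid very small numbers.
    return sum(math.log((bigrams[(c1, c2)] + 1) / (unigrams[c1] + V))
               for c1, c2 in zip(text, text[1:]))

french = train("ainsi de la même manière que l'on peut dire que "
               "le Royaume-Uni est un pays, on peut dire que "
               "l'Angleterre est un pays.")
english = train("the same way that we can say that United Kingdom is "
                "a country, we can also say that England is a country.")
test = "this"
print("French" if log_prob(test, french) > log_prob(test, english) else "English")
```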

Stopwords
It is typical to exclude high-frequency words (e.g., function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
Stopwords are language dependent.
For efficiency, store the stopword strings in a hash table so they are recognized in constant time (e.g., a simple Python dictionary or set).
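For instance, a constant-time membership test with a built-in Python set (hash-based, like a dictionary):

```python
STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    # Set membership is an O(1) hash lookup per token.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["He", "slept", "in", "the", "garden"]))
# -> ['slept', 'garden']
```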

Exercise
How do we determine a list of stopwords?
For English:
–may use existing lists of stopwords
–e.g., SMART’s common-word list (~400 words)
–WordNet stopword list
What about Spanish? Bulgarian?

Stopwords
The trend is away from using them:
–From large stop lists (200-300 words), to small stop lists (7-12), to none.
–Good compression techniques (or cheap hardware) mean the cost of including stopwords in a system is very small.
–Good query optimization techniques mean you pay little at query time for including stopwords.
–You need them for:
  Phrase queries: “King of Denmark”
  Various song titles, etc.: “Let it be”, “To be or not to be”
  Relational queries: “flights to London”

Normalization
Token normalization = reducing multiple tokens to the same canonical term, so that matches occur despite superficial differences. Two strategies:
1. Create equivalence classes, named after one member of the class:
  {anti-discriminatory, antidiscriminatory}, {U.S.A., USA}
2. Maintain relations between unnormalized tokens; can be extended with lists of synonyms (car, automobile). Implemented by either:
  –indexing unnormalized tokens and expanding a query term into a disjunction of the corresponding postings lists, or
  –performing the expansion during index construction.

Normalization
Accents and diacritics in French:
–résumé vs. resume
Umlauts in German:
–Tuebingen vs. Tübingen
Most important criterion:
–How do users like to write their queries for these words? Even in languages that standardly have accents, users often do not type them.
Often best to normalize to a de-accented term:
–Tuebingen, Tübingen, Tubingen => Tubingen
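A common way to de-accent in Python is Unicode decomposition with the standard unicodedata module; a minimal sketch:

```python
import unicodedata

def deaccent(s):
    # NFKD splits accented characters into base letter + combining mark,
    # then the combining marks are dropped.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(deaccent("Tübingen"), deaccent("résumé"))
# -> Tubingen resume
```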

Normalization
Case folding = reduce all letters to lower case:
–allows Automobile at the beginning of a sentence to match automobile.
–allows user-typed ferrari to match Ferrari in documents.
–but may lead to unintended matches: the Fed vs. fed; Bush, Black, General Motors, Associated Press, ...
Heuristic = lowercase only some tokens:
–words at the beginning of sentences.
–all words in a title where most words are capitalized.
Truecasing = use a classifier to decide when to fold:
–trained on many heuristic features.

Normalization
British vs. American spellings:
–colour vs. color
Multiple formats for dates and times:
–09/30/2013 vs. Sep 30, 2013
Asymmetric expansion:
–Enter: window => Search: window, windows
–Enter: windows => Search: Windows, windows, window
–Enter: Windows => Search: Windows

Lemmatization
Reduce inflectional/variant forms to their base form; direct impact on vocabulary size. E.g.:
–am, are, is => be
–car, cars, car's, cars' => car
–the boy's cars are different colors => the boy car be different color
How to do this?
–Need a list of grammatical rules + a list of irregular words:
  children => child, spoken => speak, ...
–Practical implementation: use WordNet’s morphstr function.
  Perl: WordNet::QueryData (first returned value from the validForms function)
  Python: nltk.stem (see the sketch below)
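A quick illustration with nltk.stem's WordNet-based lemmatizer (assumes NLTK is installed and the WordNet data has been downloaded):

```python
# pip install nltk; then: import nltk; nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))             # -> car
print(lemmatizer.lemmatize("children"))         # -> child
print(lemmatizer.lemmatize("are", pos="v"))     # -> be
print(lemmatizer.lemmatize("spoken", pos="v"))  # -> speak
```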

Stemming
Reduce tokens to the “root” form of words to recognize morphological variation:
–“computer”, “computational”, “computation” all reduced to the same token “compute”
Correct morphological analysis is language specific and can be complex.
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.
Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compres and compres are both accept as equival to compres.

Porter Stemmer
A simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words:
–“computer”, “computational”, “computation” all reduced to the same token “comput”
May conflate (reduce to the same token) words that are actually distinct.
Does not recognize all morphological derivations.

Typical rules in Porter
–sses => ss
–ies => i
–ational => ate
–tional => tion
See the class website for a link to the “official” Porter stemmer site, which provides ready-to-use Python implementations.
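For example, NLTK ships one such ready-to-use implementation; the outputs below also show the unusual non-word stems mentioned on the previous slide:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "computer", "computational", "computation"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni,
# computer -> comput, computational -> comput, computation -> comput
```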

Porter Stemmer Errors
Errors of commission:
–organization, organ => organ
–police, policy => polic
–arm, army => arm
Errors of omission:
–cylinder, cylindrical
–create, creation
–Europe, European

Other stemmers
Other stemmers exist, e.g., the Lovins stemmer:
–single-pass, longest-suffix removal (about 250 rules)
Stemming is language- and often application-specific:
–open source and commercial plug-ins are available.
Does it improve IR performance?
–Mixed results for English: improves recall, but hurts precision: operative (dentistry) => oper
–Definitely useful for languages with richer morphology: Spanish, German, Finnish (30% gains)