Dan Jurafsky
Text Normalization
Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text

Basic Text Processing 1 - Word Tokenization

Dan Jurafsky
How many words?
"I do uh main- mainly business data processing" - fragments and filled pauses
"Seuss's cat in the hat is different from other cats!"
Lemma: same stem, part of speech, rough word sense; cat and cats = same lemma
Wordform: the full inflected surface form; cat and cats = different wordforms

Dan Jurafsky
How many words?
Example:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
Input: they lay back on the San Francisco grass and looked at the stars and their
Type: an element of the vocabulary (a lemma type or a wordform type), counted without repetition
Token: an instance of that type in running text, counted with repetition
How many? 15 tokens (or 14):
they lay back on the San Francisco grass and looked at the stars and their = 15 tokens
they lay back on the SanFrancisco grass and looked at the stars and their = 14 tokens
13 types (or 12) (or 11?):
they lay back on the San Francisco grass and looked at the stars and their = 13 types
they lay back on the SanFrancisco grass and looked at the stars and their = 12 types (or 11?)

Dan Jurafsky
How many words?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Corpus                           Tokens = N    Types = |V|
Switchboard phone conversations  2.4 million   20 thousand
Shakespeare                      884,000       31 thousand
Google N-grams                   1 trillion    13 million
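To make the token/type distinction concrete, here is a minimal counting sketch (the function name is illustrative, not from the slides); it uses naive whitespace tokenization, so San and Francisco count as separate tokens:

```python
def count_tokens_and_types(text):
    """Count N (number of tokens) and |V| (number of types) with naive whitespace tokenization."""
    tokens = text.split()      # every occurrence counts as a token
    types = set(tokens)        # each distinct wordform counts once as a type
    return len(tokens), len(types)

n, v = count_tokens_and_types(
    "they lay back on the San Francisco grass and looked at the stars and their")
print(n, v)   # 15 tokens, 13 types ("the" and "and" each appear twice)
```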

Dan Jurafsky
Issues in Tokenization
The major question of the tokenization phase is: what are the correct tokens to use?
In the previous example it looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession, contractions, and hyphenation?
1. Apostrophe for possession: Finland's capital -> Finland? Finlands? Finland's?
2. Contractions: what're, I'm, isn't -> what are, I am, is not
3. Hyphenation: Hewlett-Packard -> Hewlett Packard? state-of-the-art -> state of the art?
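One way to make these choices explicit is a small regular-expression tokenizer. The sketch below is illustrative (the pattern, policy, and names are mine, not from the slides): it keeps abbreviations, hyphenated words, and contractions/possessives as single tokens.

```python
import re

TOKEN_RE = re.compile(r"""
      (?:[A-Za-z]\.)+                   # abbreviations: m.p.h., U.S.A.
    | [A-Za-z]+(?:['’-][A-Za-z]+)*      # words, hyphenations, contractions, possessives
    | \d+(?:[.,]\d+)*%?                 # numbers: 4.3, 1,000, 80%
    | \S                                # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Hewlett-Packard's state-of-the-art printer isn't doing 4.3 m.p.h."))
# ["Hewlett-Packard's", 'state-of-the-art', 'printer', "isn't", 'doing', '4.3', 'm.p.h.']
```

Note that alternation order matters: the abbreviation pattern has to come before the plain-word pattern, otherwise m.p.h. would be cut off after its first letter. Even so, the final period of m.p.h. swallows the sentence-final period, which is exactly the ambiguity the sentence segmentation slides return to later.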

Dan Jurafsky
4. Lowercase -> lower-case? lowercase? lower case?
5. Splitting on whitespace can also split what should be regarded as a single token: San Francisco -> one token or two?
6. Periods inside words: m.p.h., PhD. -> ??

Dan Jurafsky
Tokenization: language issues
French: L'ensemble -> one token or two? L? L'? Le?
German noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter = 'life insurance company employee'
German information retrieval needs a compound splitter.

Dan Jurafsky
Tokenization: language issues
Chinese and Japanese: no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Sharapova now lives in US southeastern Florida
Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji) and dates/amounts in multiple formats:
フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )
End-user can express a query entirely in hiragana!

Dan Jurafsky
Max-match segmentation illustration
Thecatinthehat -> the cat in the hat
Thetabledownthere -> the table down there? or: theta bled own there
Doesn't generally work in English! But works astonishingly well in Chinese:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Modern probabilistic segmentation algorithms are even better.
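Here is a minimal sketch of the greedy maximum-matching (MaxMatch) segmenter being illustrated; the word list is a toy dictionary of my own for the demo. At each position it takes the longest dictionary word that matches, falling back to a single character.

```python
def max_match(text, dictionary):
    """Greedy left-to-right maximum matching: repeatedly take the longest
    dictionary word starting at the current position (or one character)."""
    words, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:                       # no dictionary word starts here
            match = text[i]
        words.append(match)
        i += len(match)
    return words

vocab = {"the", "cat", "in", "hat", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thecatinthehat", vocab))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", vocab))  # ['theta', 'bled', 'own', 'there']
```

The second example shows exactly the failure mode above: the greedy match grabs theta first and never recovers the table down there.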

Basic Text Processing 2 - Word Normalization and Stemming

Dan Jurafsky
Normalization
In information retrieval: having broken up our documents (and also our query) into tokens, the easy case is when tokens in the query simply match tokens in the token list of the document. However, there are many cases where two character sequences are not quite the same but you would like a match to occur. For instance, if you search for USA, you might hope to also match documents containing U.S.A.
Another example: the token computer could be linked to other tokens that are synonyms of the word computer, like PC, laptop, Mac, etc. Then, when a user searches for computer, documents that contain the tokens PC, laptop, or Mac are also returned to the user.
The goal of token normalization is to link together tokens with similar meaning. Token normalization is the process of canonicalizing or standardizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.

Dan Jurafsky
Normalization
How to normalize:
1. The most standard way to normalize is to implicitly create equivalence classes, using mapping rules that remove characters such as hyphens or periods from a token. The terms that happen to become identical as the result of these rules form an equivalence class.
Examples:
By deleting periods, U.S.A and USA fall into one equivalence class, so searching for U.S.A will retrieve documents that contain either U.S.A or USA.
By deleting hyphens, anti-discriminatory and antidiscriminatory fall into one equivalence class, so searching for anti-discriminatory will retrieve documents that contain either form.
Disadvantage: if I put in C.A.T. as my query term, I might be rather upset if it matches every appearance of the word cat in documents.
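A minimal sketch of equivalence classing by mapping rules, as described above (the function name is illustrative): deleting periods and hyphens at index time and at query time makes the variant spellings collide into one class.

```python
def normalize(token):
    """Map a token to its equivalence-class representative by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")

# The same mapping is applied to indexed tokens and to query terms,
# so a query for "U.S.A" matches documents containing "U.S.A" or "USA".
print(normalize("U.S.A"))                  # USA
print(normalize("anti-discriminatory"))    # antidiscriminatory
print(normalize("C.A.T."))                 # CAT -- combined with case folding, this collides with "cat"
```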

Dan Jurafsky
2. Alternative: asymmetric expansion. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term (see Figure 2.6 in the IIR textbook). For example, a query for windows might be expanded to also search Windows and window, while a query for Windows (the operating system) is not expanded.
Potentially more powerful.
Disadvantage: less efficient and more complicated, because it adds a query expansion dictionary and requires more processing at query time. Part of the problem can instead be handled by case folding.
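A minimal sketch of asymmetric expansion with a hand-built query-expansion dictionary (the entries here are purely illustrative): tokens are indexed unnormalized, and expansion happens only at query time.

```python
# Hypothetical query-expansion list: each query term maps to the set of index
# terms that should be searched for it. Note the asymmetry: the lowercase
# query is expanded, the capitalized one is not.
QUERY_EXPANSION = {
    "windows": {"windows", "Windows", "window"},
    "Windows": {"Windows"},
    "computer": {"computer", "PC", "laptop"},
}

def expand_query(term):
    """Return the set of index terms to search for a single query term."""
    return QUERY_EXPANSION.get(term, {term})

print(expand_query("windows"))   # {'windows', 'Windows', 'window'}
print(expand_query("Windows"))   # {'Windows'}
```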

Dan Jurafsky
Case folding
Applications like information retrieval (IR) reduce all letters to lower case in the document, since users tend to type lower-case words when searching.
1. The simplest case folding is to lowercase words at the beginning of a sentence, and all words in a title that is all uppercase or in which most or all words are capitalized. These are usually ordinary words that happen to be capitalized.
2. Avoid case folding mid-sentence capitalized words, because many proper nouns are derived from common nouns and are distinguished only by case, including: companies (General Motors, The Associated Press), government organizations (the Fed vs. fed), person names (Bush, Black).
For sentiment analysis, machine translation, and information extraction, case is helpful (US versus us is important), so we do not case fold.
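A minimal sketch of the selective case folding heuristic described in point 1 above (the function name is illustrative): only the sentence-initial token is lowercased, so mid-sentence capitals such as Fed or General Motors survive.

```python
def fold_case(sentence_tokens):
    """Lowercase only the sentence-initial token; keep mid-sentence capitals,
    which often distinguish proper nouns (the Fed) from common nouns (fed)."""
    return [tok.lower() if i == 0 else tok for i, tok in enumerate(sentence_tokens)]

print(fold_case(["The", "Fed", "raised", "rates"]))   # ['the', 'Fed', 'raised', 'rates']
```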

Dan Jurafsky
Lemmatization
Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings or variant forms and return the base form (the LEMMA, or dictionary form) of a word:
am, are, is -> be
car, cars, car's, cars' -> car
Example: the boy's cars are different colors -> the boy car be different color
Goal of lemmatization: find the correct dictionary headword form (lemma).
Machine translation: Spanish quiero ('I want') and quieres ('you want') have the same lemma as querer 'want'.
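One widely used dictionary-based lemmatizer is NLTK's WordNet lemmatizer; a minimal usage sketch (assumes NLTK is installed and the WordNet data has been downloaded):

```python
# pip install nltk
# python -c "import nltk; nltk.download('wordnet')"
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))    # be
print(lemmatizer.lemmatize("is", pos="v"))     # be
print(lemmatizer.lemmatize("cars", pos="n"))   # car
# The correct lemma depends on the part of speech (e.g. "saw" as a verb vs. a noun),
# which is why lemmatizers usually take a POS tag.
```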

Dan Jurafsky
Morphology
Morphemes: the small meaningful units that make up words
Stems: the core meaning-bearing units
Affixes: bits and pieces that adhere to stems, often with grammatical functions

Dan Jurafsky
Stemming
Stemming is the crude chopping of affixes to reduce terms to their stems in information retrieval, in the hope of doing so correctly most of the time. It is language dependent.
e.g., automate(s), automatic, automation all reduced to automat.
for example compressed and compression are both accepted as equivalent to compress.
for exampl compress and compress ar both accept as equival to compress
(the last line is the previous sentence after stemming)
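For stemming in practice, the Porter stemmer (next slide) is available in NLTK; a minimal usage sketch:

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "compress", "automates", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
# compressed, compression and compress all map to the stem "compress";
# note that the resulting stems need not be real dictionary words.
```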

Dan Jurafsky
Lemmatization and stemming
Example: if confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

Dan Jurafsky
Porter's algorithm
The most common English stemmer. Here are some of the rules of Porter's algorithm:
Step 1a
sses -> ss    caresses -> caress
ies -> i      ponies -> poni
ss -> ss      caress -> caress
s -> ø        cats -> cat
Step 1b
(*v*)ing -> ø   walking -> walk, sing -> sing
(*v*)ed -> ø    plastered -> plaster
…
Step 2 (for long stems)
ational -> ate   relational -> relate
izer -> ize      digitizer -> digitize
ator -> ate      operator -> operate
…
Step 3 (for longer stems)
al -> ø       revival -> reviv
able -> ø     adjustable -> adjust
ate -> ø      activate -> activ
…
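A minimal sketch of a few of these rules in Python (only a fragment, not the full algorithm; the (*v*) condition means "the remaining stem contains a vowel"):

```python
import re

def porter_fragment(word):
    """A tiny fragment of Porter's rules: Step 1a plurals and the
    (*v*)ing / (*v*)ed rules from Step 1b."""
    # Step 1a
    if word.endswith("sses"):
        return word[:-2]                  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]                  # ponies -> poni
    if word.endswith("ss"):
        return word                       # caress -> caress
    if word.endswith("s"):
        return word[:-1]                  # cats -> cat
    # Step 1b: strip -ing / -ed only if the rest of the word contains a vowel
    m = re.match(r"^(.*[aeiou].*)(ing|ed)$", word)
    if m:
        return m.group(1)                 # walking -> walk, plastered -> plaster
    return word                           # sing stays sing (no vowel before -ing)

for w in ["caresses", "ponies", "cats", "walking", "sing", "plastered"]:
    print(w, "->", porter_fragment(w))
```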

Dan Jurafsky
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing -> ø   walking -> walk, sing -> sing

Dan Jurafsky
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
1312 King
548 being
541 nothing
388 king
375 bring
358 thing
307 ring
152 something
145 coming
130 morning
If we strip -ing from every word here, many results are wrong (King, bring, thing, ring).
tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr
548 being
541 nothing
152 something
145 coming
130 morning
122 having
120 living
117 loving
116 Being
102 going
We get mostly correct stems if we strip -ing only when there is a vowel before it.

Basic Text Processing 3 - Sentence Segmentation and Decision Trees

Dan Jurafsky
Sentence Segmentation
Things ending in ! or ? are relatively unambiguous. The period "." is quite ambiguous: it can mark a sentence boundary, but it is also used for abbreviations (Inc., Dr., etc.) and numbers (.02%, 4.3), so we cannot assume that a period is the end of the sentence.
To solve the period problem we build a binary classifier that looks at a "." and decides whether the period is EndOfSentence or NotEndOfSentence.
To build the classifier we can use hand-written rules, regular expressions, or machine learning, as sketched below.
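As an illustration of the hand-written-rules option, here is a minimal regex-based segmenter sketch (the abbreviation list and pattern are my own simplifications, not a full solution): it treats ., ! or ? followed by whitespace and a capital letter as a boundary, unless the period follows a short listed abbreviation.

```python
import re

BOUNDARY = re.compile(
    r"(?<=[.!?])"                                        # previous character ends a candidate sentence
    r"(?<!Dr\.)(?<!Mr\.)(?<!Mrs\.)(?<!Inc\.)(?<!etc\.)"  # ...but not right after these abbreviations
    r"\s+(?=[A-Z])"                                      # split on the whitespace before a capital
)

def split_sentences(text):
    return BOUNDARY.split(text)

print(split_sentences("Dr. Smith arrived at 4.3 p.m. He paid $2.50! Then he left."))
# ['Dr. Smith arrived at 4.3 p.m.', 'He paid $2.50!', 'Then he left.']
```

Cases like "Acme Inc. The next day…" show why such rule lists break down and why the decision-tree or machine-learning route below is attractive.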

Dan Jurafsky
Determining if a word is end-of-sentence: using a decision tree as the classifier

Dan Jurafsky
More sophisticated decision tree features
Simple features:
Case of word with ".": Upper, Lower, Cap, Number
Case of word after ".": Upper, Lower, Cap, Number
Numeric features:
Length of word with "." (abbreviations tend to be short)
Probability(word with "." occurs at end-of-sentence)
Probability(word after "." occurs at beginning-of-sentence); e.g., "The" is very likely to come after "."

Dan Jurafsky
Implementing Decision Trees
A decision tree is just an if-then-else statement; the interesting research is choosing the features.
Setting up the structure is often too hard to do by hand: hand-building is only possible for very simple features and domains, and for numeric features it is too hard. Instead, the structure is usually learned by machine learning from a training corpus.
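As a concrete illustration, a hand-built decision tree over the simple features above is literally nested if-then-else; this toy sketch (feature thresholds are invented for illustration) classifies a single period given the word containing it and the following word.

```python
def classify_period(word_with_period, word_after):
    """Toy hand-built decision tree: EndOfSentence vs. NotEndOfSentence for one '.'"""
    if "." in word_with_period.rstrip("."):   # internal periods: U.S.A., m.p.h., 4.3
        return "NotEndOfSentence"
    if len(word_with_period) <= 3:            # length feature: abbreviations tend to be short (Dr.)
        return "NotEndOfSentence"
    if word_after[:1].isupper():              # case of word after ".": capitalized suggests a boundary
        return "EndOfSentence"
    return "NotEndOfSentence"

print(classify_period("Dr.", "Smith"))     # NotEndOfSentence
print(classify_period("colors.", "The"))   # EndOfSentence
```

In practice, as the slide says, both the thresholds and the tree structure are learned from a training corpus rather than set by hand.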