Basic Text Processing Word tokenization.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Document ingestion.
Advertisements

CES 514 Data Mining Feb 17, 2010 Lecture 2: The term vocabulary and postings lists.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
Morphology and Lexicon Chapter 3
Inverted Index Construction Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend romancountryman Indexer Inverted.
Preparing the Text for Indexing 1. Collecting the Documents Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
CS 430 / INFO 430 Information Retrieval
Spring 2002NLE1 CC 384: Natural Language Engineering Week 2, Lecture 2 - Lemmatization and Stemming; the Porter Stemmer.
CS 430 / INFO 430 Information Retrieval
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
Text Operations: Preprocessing. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis, stopwords elimination,
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 2: The term vocabulary and postings.
CpSc 881: Information Retrieval
WMES3103 : INFORMATION RETRIEVAL
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
Learning Bit by Bit Class 3 – Stemming and Tokenization.
LING 388 Language and Computers Lecture 21 11/13/03 Sandiway FONG.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Information Retrieval Document Parsing. Basic indexing pipeline Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
Text Analysis Everything Data CompSci Spring 2014.
LING 388: Language and Computers Sandiway Fong Lecture 22: 11/10.
Index Construction David Kauchak cs458 Fall 2012 adapted from:
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan The term vocabulary,
Information Retrieval Lecture 2: The term vocabulary and postings lists.
| 1 Gertjan van Noord2014 Zoekmachines Lecture 2: vocabulary, posting lists.
1 Documents  Last lecture: Simple Boolean retrieval system  Our assumptions were:  We know what a document is.  We can “machine-read” each document.
Morphological Analysis Lim Kay Yie Kong Moon Moon Rosaida bt ibrahim Nor hayati bt jamaludin.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Dan Jurafsky Text Normalization Every NLP task needs to do text normalization: 1.Segmenting/tokenizing words in running text 2.Normalizing word formats.
Basic Text Processing Regular Expressions. Regular expressions A formal language for specifying text strings How can we search for any of these? woodchuck.
Natural Language Processing Chapter 2 : Morphology.
Introduction to Information Retrieval Introduction to Information Retrieval Term vocabulary and postings lists – preprocessing steps 1.
Stemming Stemming is crude chopping of Affixes in inflected words. It is used to coalesce terms for effective Information Retrieval. The base version of.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Document Parsing Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Chapter 3 Word Formation I This chapter aims to analyze the morphological structures of words and gain a working knowledge of the different word forming.
Basic Text Processing Word tokenization.
Basic Text Processing Regular Expressions.
Basic Text Processing: Morphology Word Stemming
Information Retrieval in Practice
Information Retrieval
CSC 594 Topics in AI – Natural Language Processing
Search Engine Architecture
Ch 2 Term Vocabulary & Postings List
Overview Recap Documents Terms Skip pointers Phrase queries
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 2: The term vocabulary and postings lists
Many slides from Dan Jurafsky, Stanford
CS 430: Information Discovery
CSC 594 Topics in AI – Natural Language Processing
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Text Processing.
Document ingestion.
CSCI 5832 Natural Language Processing
Token generation - stemming
CS276: Information Retrieval and Web Search
信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
Chapter 7 Lexical Analysis and Stoplists
PageRank GROUP 4.
Lecture 2: The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
String Processing 1 MIS 3406 Department of MIS Fox School of Business
IP notices: slides from D. Jurafsy, C. Manning and S. Batzoglou
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Discussion Class 3 Stemming Algorithms.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CSCI 5832 Natural Language Processing
Information Retrieval and Web Design
Basic Text Processing: Morphology Word Stemming
Presentation transcript:

Basic Text Processing Word tokenization

Text Normalization Every NLP task needs to do text normalization: Segmenting/tokenizing words in running text Normalizing word formats

How many words? Type: an element of the vocabulary. they lay back on the San Francisco grass and looked at the stars and their Type: an element of the vocabulary. Token: an instance of that type in running text. How many? 15 tokens 13 types There are 2 “the”. Both belong to the same type.

How many words? N = number of tokens V = vocabulary = set of types |V| is the size of the vocabulary Tokens = N Types = |V| Switchboard phone conversations 2.4 million 20 thousand Shakespeare 884,000 31 thousand Google N-grams 1 trillion 13 million

Issues in Tokenization Finland’s capital  Finland Finlands Finland’s ? what’re, I’m, isn’t  What are, I am, is not Hewlett-Packard  Hewlett Packard ? state-of-the-art  state of the art ? Lowercase  lower-case lowercase lower case ? San Francisco  one token or two? m.p.h., PhD.  ??

Word Normalization and Stemming Basic Text Processing Word Normalization and Stemming

Normalization Need to “normalize” terms Information Retrieval: indexed text & query terms must have same form. We want to match U.S.A. and USA We implicitly define equivalence classes of terms e.g., deleting periods in a term Alternative: asymmetric expansion: Enter: window Search: window, windows Enter: windows Search: Windows, windows, window Enter: Windows Search: Windows

Case folding Reduce all letters to lower case Since users tend to use lower case Possible exception: upper case in mid-sentence? e.g., General Motors Fed vs. fed SAIL vs. sail For sentiment analysis, MT, Information extraction Case is helpful (US versus us is important)

Dictionary headword Reduce inflections or variant forms to base form am, are, is  be car, cars, car's, cars'  car the boy's cars are different colors  the boy car be different color Lemmatization: have to find correct form

Divide up a word "Unbreakable" comprises three morphemes: un- (a bound morpheme signifying "not") -break- (the root, a free morpheme) -able (a free morpheme signifying "can be done"). Stems: The core meaning-bearing units Affixes: Bits and pieces that adhere to stems

Stemming Reduce terms to their stems in information retrieval Stemming is crude chopping of affixes e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compress and compress ar both accept as equival to compress

Porter’s algorithm The most common English stemmer Step 1a sses  ss caresses  caress ies  i ponies  poni ss  ss caress  caress s  ø cats  cat Step 1b (*v*)ing  ø walking  walk sing  sing (*v*)ed  ø plastered  plaster Step 2 (for long stems) ational ate relational relate izer ize digitizer  digitize ator ate operator  operate Step 3 (for longer stems) al  ø revival  reviv able  ø adjustable  adjust ate  ø activate  activ *v* is a vowel

Lemmatization vs stemming Lemmatization is the task of determining that two words have the same root. E.g., the words sang, sung, and sings are forms of the verb sing. Stemming is crude chopping of affixes. E.g., automates, automatic, automation all reduced to automat.

Summary Main process in lexical analysis phase is tokenization. Convert input text into lower case. Convert words into their stems. Standardize the uses of periods. Tokenization issues can be dealt with by regular expression.