CSC 594 Topics in AI – Natural Language Processing
Spring 2016/17
4. Text Normalization
(Some slides adapted from Jurafsky & Martin)
Text Normalization
Every NLP task needs to do some form of text normalization. Not terribly glamorous, but critical to further processing:
Determining a relevant vocabulary
Segmenting/tokenizing words in running text
Normalizing word formats
Segmenting sentences in running text
How many words?
“They lay back on the San Francisco grass and looked at the stars and their…”
Type: an element of the vocabulary
Token: an instance of that type in running text
How many words are there in this sentence?
15 tokens (or maybe 14)
13 types (or 12)
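A quick way to check these counts is to tokenize naively and count, as in the minimal Python sketch below (splitting on whitespace treats San and Francisco as separate tokens, which is exactly one of the choices at stake):

sentence = ("They lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()        # naive whitespace tokenization
types = set(tokens)              # distinct word forms
print(len(tokens), len(types))   # 15 tokens, 13 types ("the" and "and" repeat)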
Vocabulary
First, what’s our vocabulary? What words are needed for a particular application? All of English? A subset? How many will there be? How do we determine the right list? Is it finite? Etc.
How many words?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary

Corpus                            Tokens (N)    Types (|V|)
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million
Simple Tokenization in UNIX
Given a text file, output all the word types and their associated frequencies in that text corpus. Inspired by Ken Church’s UNIX for Poets. Unix has many commands to deal with basic text processing operations, most of which are line oriented; the original Unix designers cared a lot about text processing.
The first step: tokenizing
So let’s do some dumb tokenization and get each token on a line by itself:

tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
...
Then some counting
Merging upper and lower case:

tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n'

Sorting the counts:

tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d

What happened here?
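For comparison, a rough Python equivalent of that pipeline (a sketch only, assuming the same shakes.txt file as the slides; like tr -sc, the regex below treats apostrophes as separators, which is how the d from contractions like I'd ends up as a token of its own):

import re
from collections import Counter

with open("shakes.txt", encoding="utf-8") as f:
    text = f.read().lower()                  # merge upper and lower case

tokens = re.findall(r"[a-z]+", text)          # keep letter runs only, like tr -sc 'A-Za-z' '\n'
for word, count in Counter(tokens).most_common(10):
    print(count, word)                        # top counts, like uniq -c | sort -n -r | head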
Issues in Tokenization
Finland’s capital → Finland? Finlands? Finland’s?
what’re, I’m, isn’t → what are, I am, is not
Hewlett-Packard → Hewlett Packard?
state-of-the-art → state of the art?
Lowercase → lower-case? lowercase? lower case?
San Francisco → one token or two?
m.p.g., PhD. → ??
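There is no single right answer to these; a tokenizer just has to commit to a policy. A minimal regex-based sketch of one such policy (the pattern below is illustrative, not a standard one) keeps internal hyphens and period-style abbreviations together while splitting clitics off crudely:

import re

pattern = r"""
      (?:[A-Za-z]\.){2,}         # abbreviations written with periods: m.p.g., U.S.A.
    | \$?\d+(?:[.,]\d+)*%?       # numbers, prices, percentages: 4.3, $45.55, 82%
    | \w+(?:-\w+)*               # words, keeping internal hyphens: state-of-the-art
    | \S                         # any other single non-space character
"""
text = "Hewlett-Packard's m.p.g. rating isn't state-of-the-art."
print(re.findall(pattern, text, re.VERBOSE))
# ['Hewlett-Packard', "'", 's', 'm.p.g.', 'rating', 'isn', "'", 't', 'state-of-the-art', '.']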
Tokenization: language issues
French
L'ensemble: one token or two? L? L’? Le?
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs a compound splitter.
Tokenization: language issues
Chinese has no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Sharapova now lives in US southeastern Florida
Japanese allows multiple alphabets intermingled:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana, Hiragana, Kanji, Romaji
Word Tokenization in Chinese
Also called Word Segmentation.
Chinese words are composed of characters. Characters are generally 1 syllable and 1 morpheme, and the average word is 2.4 characters long.
Standard baseline segmentation algorithm: Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese, and a string:
1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer over the word in the string
4. Go to 2
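A minimal Python sketch of this greedy procedure (the function name and the toy dictionary below are mine, not from the slides):

def max_match(text, wordlist):
    """Greedy longest-match-first segmentation (MaxMatch)."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try the longest candidate first
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:                                      # no dictionary word starts here:
            words.append(text[i])                  # emit a single character and advance
            i += 1
    return words

# On the English examples from the next slide, with a toy dictionary:
dictionary = {"the", "cat", "in", "hat", "table", "down", "there", "theta", "bled", "own"}
print(max_match("thecatinthehat", dictionary))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", dictionary))  # ['theta', 'bled', 'own', 'there']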
Max-match English Example
Thecatinthehat → the cat in the hat
Thetabledownthere → theta bled own there (not “the table down there”)
Max-match doesn’t generally work in English!
But it works astonishingly well in Chinese:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Modern probabilistic segmentation algorithms are even better.
Case folding
Applications like web search: reduce all letters to lower case, since users tend to use lower case.
For sentiment analysis, MT, information extraction: case is helpful (US versus us; IRE vs. ire).
Lemmatization
Reduce inflections or variant forms to base form:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization has to find the correct dictionary headword form.
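As a sketch, NLTK’s WordNet-based lemmatizer does this headword lookup (this assumes NLTK is installed and the WordNet data has been downloaded, e.g. via nltk.download('wordnet'); verbs need the right part-of-speech tag):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # car   (default part of speech is noun)
print(lemmatizer.lemmatize("colors"))          # color
print(lemmatizer.lemmatize("is", pos="v"))     # be    (via WordNet's verb exception list)
print(lemmatizer.lemmatize("are", pos="v"))    # be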
Stemming
Reduce terms to their stems. Stemming is crude chopping of affixes, and it is language dependent.
e.g., automate(s), automatic, automation all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress:
for exampl compress and compress ar both accept as equival to compress
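NLTK ships a Porter stemmer, so the example above can be reproduced roughly as follows (a sketch; exact outputs depend on the stemmer variant NLTK implements):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in text.split()))
# roughly: for exampl compress and compress ar both accept as equival to compress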
Porter’s Algorithm
Step 1a
  sses → ss    caresses  → caress
  ies  → i     ponies    → poni
  ss   → ss    caress    → caress
  s    → ø     cats      → cat
Step 1b
  (*v*)ing → ø   walking   → walk
                 sing      → sing
  (*v*)ed  → ø   plastered → plaster
Step 2 (for long stems)
  ational → ate   relational → relate
  izer    → ize   digitizer  → digitize
  ator    → ate   operator   → operate
Step 3 (for longer stems)
  al   → ø   revival    → reviv
  able → ø   adjustable → adjust
  ate  → ø   activate   → activ
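Just to make the rule format concrete, here is a toy sketch of Step 1a as ordered rewrite rules (it deliberately omits the vowel and measure conditions that the real algorithm attaches to the later steps):

import re

STEP_1A = [("sses$", "ss"), ("ies$", "i"), ("ss$", "ss"), ("s$", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:      # apply the first rule whose suffix matches
        if re.search(suffix, word):
            return re.sub(suffix, replacement, word)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))               # caress, poni, caress, cat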
Complex Morphology
Some languages require complex morpheme segmentation.
Turkish: Uygarlastiramadiklarimizdanmissinizcasina
‘(behaving) as if you are among those whom we could not civilize’
Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
In English, punctuation is used to mark sentence boundaries.
! and ? are relatively unambiguous.
Period “.” is quite ambiguous:
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Machine learning approach: build a binary classifier that looks at each possible EOS punctuation and decides EndOfSentence/NotEndOfSentence.
Decision Tree Example
More sophisticated decision tree features
Case of word with “.”: Upper, Lower, Cap, Number
Case of word after “.”: Upper, Lower, Cap, Number
Numeric features:
Length of word with “.”
Probability(word with “.” occurs at end-of-sentence)
Probability(word after “.” occurs at beginning-of-sentence)
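A sketch of how such features might be computed for a candidate “.” (the helper name and feature names below are mine, not from the slides; the two probability features would be filled in from counts over a training corpus):

def eos_features(tokens, i):
    """Features for deciding whether the '.' ending tokens[i] is an end of sentence."""
    word = tokens[i].rstrip(".")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""

    def case_of(w):
        if not w:
            return "none"
        if w.replace(".", "").replace(",", "").isdigit():
            return "number"
        if w.isupper():
            return "upper"
        if w[0].isupper():
            return "cap"
        return "lower"

    return {
        "case_with_dot": case_of(word),
        "case_after_dot": case_of(nxt),
        "len_with_dot": len(word),
        # plus P(word-with-'.' is sentence-final) and
        # P(word-after-'.' is sentence-initial), estimated from training data
    }

print(eos_features("He lives on Brown Ave. near Dr. Smith .".split(), 4))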
Decision Trees and other classifiers
We can think of the questions in a decision tree as features that could be exploited by any kind of classifier:
Logistic regression
SVM
Neural nets
etc.
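For instance, with scikit-learn the same feature dictionaries could feed a logistic regression instead of a decision tree (a toy sketch with made-up training rows, just to show the plumbing):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# toy training data: feature dicts (as from eos_features above) with gold labels,
# 1 = EndOfSentence, 0 = NotEndOfSentence
X_dicts = [
    {"case_with_dot": "lower", "case_after_dot": "cap", "len_with_dot": 5},  # "...works. The..."
    {"case_with_dot": "cap", "case_after_dot": "cap", "len_with_dot": 2},    # "...Dr. Smith..."
]
y = [1, 0]

vectorizer = DictVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(X_dicts), y)

test = {"case_with_dot": "lower", "case_after_dot": "cap", "len_with_dot": 4}
print(classifier.predict(vectorizer.transform([test])))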
Maximum Matching
themartian
Testing, Improvement and Evaluation
There are two tasks:
Make sure that you have implemented MaxMatch correctly (testing)
Figure out a way to improve performance on this task; this means you need a way to detect improvement
Testing
Given a particular dictionary and a correct implementation, there is a “right” answer that MaxMatch should come up with.
So for “themartian” that might be: them art i an
That’s the “right” answer even though it’s clearly not the right answer…
Improvement and Evaluation
So given a test set of outputs like: them art i an
How do we know if things are getting better? What do we need to know to say things are getting better?
Reference Answers
We need to know what the “right” answer is. Here “right” means the answer we would expect a human to produce, in this case “the martian”.
So we have:
them art i an
the martian
And a whole bunch of examples like this, some right, some wrong. How do we assess how well we’re doing?
Evaluation
Strict accuracy: given a test set, how many things are right and how many are wrong?
Too pessimistic; might get you fired.
Not fine-grained enough; may not show you are making progress when you are in fact making progress.
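One finer-grained option (my choice here, not prescribed by the slides) is to score each output at the word level, giving partial credit when some words match the reference; a sketch:

def word_prf(hypothesis, reference):
    """Word-level precision/recall/F1, scoring words by their character spans."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    hyp, ref = spans(hypothesis), spans(reference)
    correct = len(hyp & ref)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(word_prf("them art i an".split(), "the martian".split()))  # (0.0, 0.0, 0.0)
print(word_prf("the mart i an".split(), "the martian".split()))  # "the" is right, so F1 > 0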
Progress?
Start with: them art I an (reference: the martian)
Move to: The mart I an (reference: The martian)
They’re both still wrong. So does it make sense to say one is better?