CSC 594 Topics in AI – Natural Language Processing
Spring 2016/17
4. Text Normalization
(Some slides adapted from Jurafsky & Martin)
Text Normalization
Every NLP task needs to do some form of text normalization. Not terribly glamorous, but critical to further processing:
Determining a relevant vocabulary
Segmenting/tokenizing words in running text
Normalizing word formats
Segmenting sentences in running text
How many words?
“They lay back on the San Francisco grass and looked at the stars and their…”
Type: an element of the vocabulary
Token: an instance of that type in running text
How many words are there in this sentence?
15 tokens (or maybe 14)
13 types (or 12)
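A quick way to check these counts is to tokenize naively and count, as in the minimal Python sketch below (splitting on whitespace treats San and Francisco as separate tokens, which is exactly one of the choices at stake):

sentence = ("They lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()        # naive whitespace tokenization
types = set(tokens)              # distinct word forms
print(len(tokens), len(types))   # 15 tokens, 13 types ("the" and "and" repeat)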
Vocabulary
First, what’s our vocabulary? What words are needed for a particular application? All of English? A subset? How many will there be? How do we determine the right list? Is it finite? Etc.
How many words?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary

Corpus                            Tokens (N)    Types (|V|)
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million
Simple Tokenization in UNIX
Given a text file, output all the word types and their associated frequencies in that text corpus. Inspired by Ken Church’s UNIX for Poets. Unix has many commands to deal with basic text processing operations, most of which are line oriented; the original Unix designers cared a lot about text processing.
The first step: tokenizing
So let’s do some dumb tokenization and get each token on a line by itself:

tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
...
Then some counting
Merging upper and lower case:

tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n'

Sorting the counts:

tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d

What happened here?
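For comparison, a rough Python equivalent of that pipeline (a sketch only, assuming the same shakes.txt file as the slides; like tr -sc, the regex below treats apostrophes as separators, which is how the d from contractions like I'd ends up as a token of its own):

import re
from collections import Counter

with open("shakes.txt", encoding="utf-8") as f:
    text = f.read().lower()                  # merge upper and lower case

tokens = re.findall(r"[a-z]+", text)          # keep letter runs only, like tr -sc 'A-Za-z' '\n'
for word, count in Counter(tokens).most_common(10):
    print(count, word)                        # top counts, like uniq -c | sort -n -r | head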
Issues in Tokenization
Finland’s capital → Finland? Finlands? Finland’s?
what’re, I’m, isn’t → what are, I am, is not
Hewlett-Packard → Hewlett Packard?
state-of-the-art → state of the art?
Lowercase → lower-case? lowercase? lower case?
San Francisco → one token or two?
m.p.g., PhD. → ??
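There is no single right answer to these; a tokenizer just has to commit to a policy. A minimal regex-based sketch of one such policy (the pattern below is illustrative, not a standard one) keeps internal hyphens and period-style abbreviations together while splitting clitics off crudely:

import re

pattern = r"""
      (?:[A-Za-z]\.){2,}         # abbreviations written with periods: m.p.g., U.S.A.
    | \$?\d+(?:[.,]\d+)*%?       # numbers, prices, percentages: 4.3, $45.55, 82%
    | \w+(?:-\w+)*               # words, keeping internal hyphens: state-of-the-art
    | \S                         # any other single non-space character
"""
text = "Hewlett-Packard's m.p.g. rating isn't state-of-the-art."
print(re.findall(pattern, text, re.VERBOSE))
# ['Hewlett-Packard', "'", 's', 'm.p.g.', 'rating', 'isn', "'", 't', 'state-of-the-art', '.']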
Tokenization: language issues
French
L'ensemble: one token or two? L? L’? Le?
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs a compound splitter.
Tokenization: language issues
Chinese has no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Sharapova now lives in US southeastern Florida
Japanese allows multiple alphabets intermingled:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana, Hiragana, Kanji, Romaji
Word Tokenization in Chinese
Also called Word Segmentation.
Chinese words are composed of characters. Characters are generally 1 syllable and 1 morpheme, and the average word is 2.4 characters long.
Standard baseline segmentation algorithm: Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese, and a string:
1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer over the word in the string
4. Go to 2
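A minimal Python sketch of this greedy procedure (the function name and the toy dictionary below are mine, not from the slides):

def max_match(text, wordlist):
    """Greedy longest-match-first segmentation (MaxMatch)."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try the longest candidate first
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:                                      # no dictionary word starts here:
            words.append(text[i])                  # emit a single character and advance
            i += 1
    return words

# On the English examples from the next slide, with a toy dictionary:
dictionary = {"the", "cat", "in", "hat", "table", "down", "there", "theta", "bled", "own"}
print(max_match("thecatinthehat", dictionary))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", dictionary))  # ['theta', 'bled', 'own', 'there']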
Max-match English Example
Thecatinthehat → the cat in the hat
Thetabledownthere → theta bled own there (not “the table down there”)
Max-match doesn’t generally work in English!
But it works astonishingly well in Chinese:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Modern probabilistic segmentation algorithms are even better.
Case folding
Applications like web search: reduce all letters to lower case, since users tend to use lower case.
For sentiment analysis, MT, information extraction: case is helpful (US versus us; IRE vs. ire).
Lemmatization
Reduce inflections or variant forms to base form:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization has to find the correct dictionary headword form.
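As a sketch, NLTK’s WordNet-based lemmatizer does this headword lookup (this assumes NLTK is installed and the WordNet data has been downloaded, e.g. via nltk.download('wordnet'); verbs need the right part-of-speech tag):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # car   (default part of speech is noun)
print(lemmatizer.lemmatize("colors"))          # color
print(lemmatizer.lemmatize("is", pos="v"))     # be    (via WordNet's verb exception list)
print(lemmatizer.lemmatize("are", pos="v"))    # be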
Stemming
Reduce terms to their stems. Stemming is crude chopping of affixes, and it is language dependent.
e.g., automate(s), automatic, automation all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress:
for exampl compress and compress ar both accept as equival to compress
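NLTK ships a Porter stemmer, so the example above can be reproduced roughly as follows (a sketch; exact outputs depend on the stemmer variant NLTK implements):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in text.split()))
# roughly: for exampl compress and compress ar both accept as equival to compress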
Porter’s Algorithm
Step 1a
  sses → ss    caresses  → caress
  ies  → i     ponies    → poni
  ss   → ss    caress    → caress
  s    → ø     cats      → cat
Step 1b
  (*v*)ing → ø   walking   → walk
                 sing      → sing
  (*v*)ed  → ø   plastered → plaster
Step 2 (for long stems)
  ational → ate   relational → relate
  izer    → ize   digitizer  → digitize
  ator    → ate   operator   → operate
Step 3 (for longer stems)
  al   → ø   revival    → reviv
  able → ø   adjustable → adjust
  ate  → ø   activate   → activ
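Just to make the rule format concrete, here is a toy sketch of Step 1a as ordered rewrite rules (it deliberately omits the vowel and measure conditions that the real algorithm attaches to the later steps):

import re

STEP_1A = [("sses$", "ss"), ("ies$", "i"), ("ss$", "ss"), ("s$", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:      # apply the first rule whose suffix matches
        if re.search(suffix, word):
            return re.sub(suffix, replacement, word)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))               # caress, poni, caress, cat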
Complex Morphology
Some languages require complex morpheme segmentation.
Turkish: Uygarlastiramadiklarimizdanmissinizcasina
‘(behaving) as if you are among those whom we could not civilize’
Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
In English, punctuation is used to mark sentence boundaries.
! and ? are relatively unambiguous.
Period “.” is quite ambiguous:
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Machine learning approach: build a binary classifier that looks at each possible EOS punctuation and decides EndOfSentence/NotEndOfSentence.
Decision Tree Example
More sophisticated decision tree features
Case of word with “.”: Upper, Lower, Cap, Number
Case of word after “.”: Upper, Lower, Cap, Number
Numeric features:
Length of word with “.”
Probability(word with “.” occurs at end-of-sentence)
Probability(word after “.” occurs at beginning-of-sentence)
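A sketch of how such features might be computed for a candidate “.” (the helper name and feature names below are mine, not from the slides; the two probability features would be filled in from counts over a training corpus):

def eos_features(tokens, i):
    """Features for deciding whether the '.' ending tokens[i] is an end of sentence."""
    word = tokens[i].rstrip(".")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""

    def case_of(w):
        if not w:
            return "none"
        if w.replace(".", "").replace(",", "").isdigit():
            return "number"
        if w.isupper():
            return "upper"
        if w[0].isupper():
            return "cap"
        return "lower"

    return {
        "case_with_dot": case_of(word),
        "case_after_dot": case_of(nxt),
        "len_with_dot": len(word),
        # plus P(word-with-'.' is sentence-final) and
        # P(word-after-'.' is sentence-initial), estimated from training data
    }

print(eos_features("He lives on Brown Ave. near Dr. Smith .".split(), 4))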
Decision Trees and other classifiers
We can think of the questions in a decision tree as features that could be exploited by any kind of classifier:
Logistic regression
SVM
Neural nets
etc.
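For instance, with scikit-learn the same feature dictionaries could feed a logistic regression instead of a decision tree (a toy sketch with made-up training rows, just to show the plumbing):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# toy training data: feature dicts (as from eos_features above) with gold labels,
# 1 = EndOfSentence, 0 = NotEndOfSentence
X_dicts = [
    {"case_with_dot": "lower", "case_after_dot": "cap", "len_with_dot": 5},  # "...works. The..."
    {"case_with_dot": "cap", "case_after_dot": "cap", "len_with_dot": 2},    # "...Dr. Smith..."
]
y = [1, 0]

vectorizer = DictVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(X_dicts), y)

test = {"case_with_dot": "lower", "case_after_dot": "cap", "len_with_dot": 4}
print(classifier.predict(vectorizer.transform([test])))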
Maximum Matching
themartian
Testing, Improvement and Evaluation
There are two tasks:
Make sure that you have implemented MaxMatch correctly (testing)
Figure out a way to improve performance on this task; this means you need a way to detect improvement
Testing
Given a particular dictionary and a correct implementation, there is a “right” answer that MaxMatch should come up with.
So for “themartian” that might be: them art i an
That’s the “right” answer even though it’s clearly not the right answer…
Improvement and Evaluation
So given a test set of outputs like: them art i an
How do we know if things are getting better? What do we need to know to say things are getting better?
Reference Answers
We need to know what the “right” answer is. Here “right” means the answer we would expect a human to produce, in this case “the martian”.
So we have:
them art i an
the martian
And a whole bunch of examples like this, some right, some wrong. How do we assess how well we’re doing?
Evaluation
Strict accuracy: given a test set, how many things are right and how many are wrong?
Too pessimistic; might get you fired.
Not fine-grained enough; may not show you are making progress when you are in fact making progress.
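One finer-grained option (my choice here, not prescribed by the slides) is to score each output at the word level, giving partial credit when some words match the reference; a sketch:

def word_prf(hypothesis, reference):
    """Word-level precision/recall/F1, scoring words by their character spans."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    hyp, ref = spans(hypothesis), spans(reference)
    correct = len(hyp & ref)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(word_prf("them art i an".split(), "the martian".split()))  # (0.0, 0.0, 0.0)
print(word_prf("the mart i an".split(), "the martian".split()))  # "the" is right, so F1 > 0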
Progress?
Start with: them art I an (reference: the martian)
Move to: The mart I an (reference: The martian)
They’re both still wrong. So does it make sense to say one is better?