Natural Language Processing


Natural Language Processing Lecture 7—2/3/2015 Susan W. Brown

Today
- Problem set 1
- Last bit of n-grams
- Programming assignment 1
- Parts of speech

Problem set 1
- What is a string?
  - Any sequence of characters, including whitespace
  - Not necessarily a word
  - A substring is also a string
  - Example strings: "This class is chock full of fun." "Is it?" "Yes!"
- Are ^ and $ needed?

Problem set 1
- Set of strings ending in b: {b, ab, gb, aaaab, Roger is glib, …}
- Set of strings over the alphabet {a, b} such that each a is both preceded and followed by a b; the empty string is part of this set
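
As a sanity check on these two sets, here is a small Python sketch; the pattern names are illustrative, and the second pattern is one of several equivalent regexes for the language described:

    import re

    # Set 1: strings ending in b.
    ends_in_b = re.compile(r'b$')

    # Set 2: strings over {a, b} where every a is immediately preceded
    # and followed by a b; the empty string belongs to the language.
    a_between_bs = re.compile(r'^(b+(ab+)*)?$')

    assert ends_in_b.search("Roger is glib")
    assert not ends_in_b.search("aaa")

    for s in ["", "b", "bab", "bbabb"]:
        assert a_between_bs.match(s), s       # all in the language
    for s in ["a", "ab", "ba", "baab"]:
        assert not a_between_bs.match(s), s   # all excluded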

Back to N-grams
- Using probabilities of n-grams to predict (or generate) the next word
- What to do with zero counts?
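
As a concrete reference point, a bigram predictor is just a table of counts; this toy sketch (corpus and names invented for illustration) shows both the prediction and the zero-count problem:

    from collections import Counter, defaultdict

    corpus = "the koala put the keys on the table".split()

    # Count every adjacent word pair.
    bigram_counts = defaultdict(Counter)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram_counts[w1][w2] += 1

    def p_next(w1, w2):
        """MLE bigram probability P(w2 | w1); zero if the bigram was never seen."""
        total = sum(bigram_counts[w1].values())
        return bigram_counts[w1][w2] / total if total else 0.0

    print(p_next("the", "keys"))   # 1/3: "the" is followed by koala, keys, table
    print(p_next("the", "door"))   # 0.0 -- the zero-count problem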

Zero Counts
- Some of those zeros are really zeros: things that really aren't ever going to happen
  - Fewer of these than you might think
- On the other hand, some of them are just rare events: if the training corpus had been a little bigger, they would have had a count
  - What would that count be, in all likelihood?

Zero Counts
- Zipf's Law (the long-tail phenomenon)
  - A small number of events occur with high frequency
  - A large number of events occur with low frequency
  - You can quickly collect statistics on the high-frequency events
  - You might have to wait an arbitrarily long time to get good statistics on low-frequency events
- Result: our estimates are necessarily sparse! We have no counts at all for the vast number of events we want to estimate.
- Answer: estimate the likelihood of unseen (zero-count) N-grams!

Laplace Smoothing
- Also called add-one smoothing: just add one to all the counts! Very simple.
- MLE estimate: $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$
- Laplace estimate: $P_{Laplace}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$, where $V$ is the vocabulary size
- Reconstructed counts: $c^*(w_{i-1} w_i) = \frac{(c(w_{i-1} w_i) + 1)\, c(w_{i-1})}{c(w_{i-1}) + V}$
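
A minimal sketch of these formulas, assuming the counts from the textbook's Berkeley Restaurant example (c(want) = 927, c(want to) = 608, V = 1446); these reproduce the changes quoted on the next slide:

    from collections import Counter

    unigrams = Counter({"want": 927})
    bigrams = Counter({("want", "to"): 608})
    V = 1446  # vocabulary size

    def p_laplace(w1, w2):
        """P_Laplace(w2 | w1) = (c(w1 w2) + 1) / (c(w1) + V)."""
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

    def c_star(w1, w2):
        """Reconstructed count c* = (c(w1 w2) + 1) * c(w1) / (c(w1) + V)."""
        return (bigrams[(w1, w2)] + 1) * unigrams[w1] / (unigrams[w1] + V)

    print(p_laplace("want", "to"))   # ~0.26 (was 608/927 = .66 unsmoothed)
    print(c_star("want", "to"))      # ~238 (was 608)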

Big Change to the Counts!
- C(want to) went from 608 to 238!
- P(to|want) went from .66 to .26!
- Discount d = c*/c; d for "chinese food" = .10, a 10x reduction!
- So in general, Laplace is a blunt instrument
- Could use a more fine-grained method (add-k)
- Laplace smoothing is not generally used for N-grams, as we have much better methods
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially:
  - for pilot studies
  - in document classification
  - in information retrieval
  - in domains where the number of zeros isn't so huge

Better Smoothing
- An intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell):
- Use the count of things we've seen once to help estimate the count of things we've never seen

One Fish Two Fish
- Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass. (Not sure where this fishing hole is...)
- So far you have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
- How likely is it that the next fish caught is an eel?
- How likely is it that the next fish caught is a member of a newly seen species?
- Now how likely is it that the next fish caught is an eel?
(Slide adapted from Josh Goodman)

Good-Turing
- Notation: $N_c$ is the frequency of frequency c, i.e., the number of types seen exactly c times
  - $N_{10} = 1$: the number of fish species seen 10 times is 1 (carp)
  - $N_1 = 3$: the number of fish species seen once is 3 (trout, salmon, eel)
- To estimate the probability of an unseen species, use the number of species (words) we've seen once:
  $p_0 = N_1 / N$ = 3/18
- All other estimates are adjusted downward to account for the unseen probability mass:
  $c^* = (c+1)\,\frac{N_{c+1}}{N_c}$
  e.g., c*(eel) = c*(1) = (1+1) × N_2/N_1 = 2 × 1/3 = 2/3
(Slide from Josh Goodman)
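
Here is a minimal sketch of these estimates on the fishing example (variable names are illustrative):

    from collections import Counter

    catches = ["carp"]*10 + ["perch"]*3 + ["whitefish"]*2 + ["trout", "salmon", "eel"]
    species_counts = Counter(catches)
    N = sum(species_counts.values())                 # 18 fish caught so far
    freq_of_freq = Counter(species_counts.values())  # N_c: species seen exactly c times

    def p_unseen():
        """P*(unseen) = N_1 / N."""
        return freq_of_freq[1] / N

    def c_star(c):
        """Good-Turing re-estimate c* = (c+1) * N_{c+1} / N_c (needs N_c > 0)."""
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

    print(p_unseen())       # 3/18 ~ 0.167
    print(c_star(1))        # 2 * N_2/N_1 = 2 * 1/3 ~ 0.667
    print(c_star(1) / N)    # smoothed P(eel) ~ 0.037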

GT Fish Example
(table of Good-Turing re-estimates for the fishing example; not preserved in the transcript)


Bigram Frequencies of Frequencies and GT Re-estimates
(table of bigram N_c values and re-estimates; not preserved in the transcript)
- Example re-estimate: 3* = 4 × (381/642) = 4 × .593 = 2.37

GT Smoothed Bigram Probabilities
(table of smoothed probabilities; not preserved in the transcript)

GT Complications
- In practice, we assume large counts (c > k for some k) are reliable and leave them unsmoothed
- That complicates c*; for 1 ≤ c ≤ k it becomes:
  $c^* = \dfrac{(c+1)\frac{N_{c+1}}{N_c} \;-\; c\,\frac{(k+1)N_{k+1}}{N_1}}{1 \;-\; \frac{(k+1)N_{k+1}}{N_1}}$
- Also: we assume singleton counts (c = 1) are unreliable, so treat N-grams with a count of 1 as if they had count 0
- Also, c* needs the $N_c$ to be non-zero, so we must smooth (interpolate) the $N_c$ counts before computing c* from them
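
A sketch of that corrected estimate, assuming the formula above, with freq_of_freq as the N_c table from the earlier Good-Turing sketch:

    def c_star_katz(c, freq_of_freq, k=5):
        """Leave counts c > k unsmoothed; discount 1 <= c <= k per the formula above.
        Requires the N_c involved to be non-zero, as the last bullet notes."""
        if c > k:
            return c
        ratio = (k + 1) * freq_of_freq[k + 1] / freq_of_freq[1]
        return ((c + 1) * freq_of_freq[c + 1] / freq_of_freq[c] - c * ratio) / (1 - ratio)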

More Zero Counts
- What if a frequency-of-frequency count $N_c$ is itself zero?
- Remember Zipf's law
- Fix: a linear regression mapping $N_c$ to c in log space
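
A minimal sketch of that regression, assuming a freq_of_freq table like the one in the Good-Turing sketch above; function names are illustrative:

    import math

    def fit_loglog(freq_of_freq):
        """Fit log(N_c) = a + b*log(c) by ordinary least squares."""
        points = [(math.log(c), math.log(n)) for c, n in freq_of_freq.items() if n > 0]
        m = len(points)
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxx = sum(x * x for x, _ in points)
        sxy = sum(x * y for x, y in points)
        b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
        a = (sy - b * sx) / m
        return a, b

    def smoothed_nc(a, b, c):
        """Read N_c = exp(a) * c^b off the fitted line; defined even where raw N_c = 0."""
        return math.exp(a) * c ** b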

Toolkits
- For FSAs/FSTs: OpenFst (openfst.org)
- For language modeling: SRILM, the SRI Language Modeling Toolkit
  - All the bells and whistles you can imagine

End of Chapter 4
- Not covering 4.9 in lecture; read it for the problems presented and the intuition of the solutions
- 4.10 and 4.11 will not come up in the exam

Programming Assignment 1
- Due Thursday, Feb. 5, at midnight
- import re
- What is a word?
- What is the context at the beginning and end of a sentence, and how does that differ from abbreviations?
- Perfect counts are not the ultimate test of a good script. Generalize; do not over-fit to your example.
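
To hint at the kind of edge cases involved, here is a small sketch; the abbreviation list and token pattern are illustrative choices, not the assignment's required definitions:

    import re

    text = "Dr. Smith arrived. He was late."

    # Naive sentence split on period + whitespace: wrongly splits after "Dr."
    naive = re.split(r'\.\s+', text)

    # One (imperfect) refinement: don't split after a few known abbreviations.
    sentences = re.split(r'(?<!Dr)(?<!Mr)(?<!Ms)\.\s+', text)
    # -> ['Dr. Smith arrived', 'He was late.']

    # A simple word tokenizer: letters, digits, and apostrophes count as word characters.
    words = re.findall(r"[A-Za-z0-9']+", text)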

Back to Some Linguistics

Word Classes: Parts of Speech
- 8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- Also known as parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
- There is lots of debate within the linguistics and cognitive science communities about the number, nature, and universality of these
  - We'll completely ignore this debate

POS Examples
  N    noun         chair, bandwidth, pacing
  V    verb         chew, debate, believe
  ADJ  adjective    purple, tall, ridiculous
  ADV  adverb       unfortunately, slowly
  P    preposition  of, by, to
  PRO  pronoun      I, me, mine
  DET  determiner   the, a, that, those

POS Tagging
- The process of assigning a part-of-speech marker to each word in a text:

  WORD    tag
  the     DET
  koala   N
  put     V
  the     DET
  keys    N
  on      P
  the     DET
  table   N

Why POS Tagging is Useful
- First step of a vast number of practical tasks
- Speech synthesis: how to pronounce "lead"?
  - INsult vs. inSULT, OBject vs. obJECT, OVERflow vs. overFLOW, DIScount vs. disCOUNT, CONtent vs. conTENT
- Parsing: helpful to know parts of speech before you start parsing
- Information extraction: finding names, relations, etc.
- Machine translation

Open and Closed Classes
- Closed class: a small(ish), fixed membership
  - Usually function words (short, common words which play a role in grammar)
- Open class: new ones can be created all the time
  - English has 4: nouns, verbs, adjectives, adverbs
  - Many languages have these 4, but not all!
  - Nouns are typically where the bulk of the action is with respect to new items

Open Class Words
- Nouns
  - Proper nouns (Boulder, Microsoft, Beyoncé, Cairo); English capitalizes these
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don't get counted (snow, salt, communism) (*two snows)
- Adverbs: tend to modify things
  - Unfortunately, John walked home extremely slowly yesterday
  - Directional/locative adverbs (here, home, downhill)
  - Degree adverbs (extremely, very, somewhat)
  - Manner adverbs (slowly, slinkily, delicately)
- Verbs
  - In English, have morphological affixes (eat/eats/eaten), with differing patterns of regularity

Closed Class Words
Examples:
- prepositions: on, under, over, …
- particles: up, down, on, off, …
- determiners: a, an, the, …
- pronouns: she, who, I, …
- conjunctions: and, but, or, …
- auxiliary verbs: can, may, should, …
- numerals: one, two, three, third, …

Prepositions from CELEX
(frequency table from the CELEX lexical database; not preserved in the transcript)

English Particles
(table from the slide; not preserved in the transcript)

Conjunctions
(table from the slide; not preserved in the transcript)

POS Tagging: Choosing a Tagset
- There are many potential distinctions we can draw, leading to potentially large tagsets
- To do POS tagging, we need to choose a standard set of tags to work with
- Could pick a very coarse tagset: N, V, Adj, Adv
- The more commonly used set is the finer-grained, 45-tag Penn Treebank tagset: PRP$, WRB, WP$, VBG, ...
- Even more fine-grained tagsets exist

Penn Treebank POS Tagset
(tagset table; not preserved in the transcript)

POS Tagging
- Words often have more than one POS: back
  - The back door → JJ
  - On my back → NN
  - Win the voters back → RB
  - Promised to back the bill → VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
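
For a quick feel for this in practice, an off-the-shelf tagger can be tried on the "back" examples; this sketch uses NLTK's perceptron tagger (a different method from the HMMs developed next), and the tags noted are what the examples above suggest rather than guaranteed output:

    # Requires: pip install nltk, plus the tokenizer and tagger data:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    import nltk

    for sent in ["The back door was open.", "Promised to back the bill."]:
        print(nltk.pos_tag(nltk.word_tokenize(sent)))
    # Expect "back" to come out JJ in the first sentence and VB in the second.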

How Hard is POS Tagging? Measuring Ambiguity
(ambiguity table; not preserved in the transcript)

Two Methods for POS Tagging
- Rule-based tagging (see the text)
- Stochastic (probabilistic) sequence models
  - HMM (Hidden Markov Model) tagging
  - MEMMs (Maximum Entropy Markov Models)

POS Tagging as Sequence Classification
- We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow
- What is the best sequence of tags that corresponds to this sequence of observations?
- Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn

Getting to HMMs
- We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest:
  $\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$
- The hat ^ means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"

Getting to HMMs
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian inference: use Bayes' rule to transform this equation into a set of other probabilities that are easier to compute

Using Bayes Rule
$P(t_1^n \mid w_1^n) = \dfrac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}$
Since $P(w_1^n)$ is the same for every candidate tag sequence, we can drop it:
$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$
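
To make the argmax concrete, here is a minimal Viterbi sketch for the bigram HMM version of this objective. All transition and emission probabilities are toy numbers invented for illustration, not estimates from any corpus:

    import math

    trans = {  # P(tag | previous tag), with "<s>" as the start state
        ("<s>", "DET"): 0.6, ("<s>", "N"): 0.3, ("<s>", "V"): 0.1,
        ("DET", "N"): 0.9, ("DET", "V"): 0.1,
        ("N", "V"): 0.5, ("N", "N"): 0.3, ("N", "DET"): 0.2,
        ("V", "DET"): 0.6, ("V", "N"): 0.4,
    }
    emit = {  # P(word | tag)
        ("DET", "the"): 0.7, ("N", "koala"): 0.1, ("N", "keys"): 0.1,
        ("V", "put"): 0.2, ("N", "put"): 0.01,
    }
    TAGS = ["DET", "N", "V"]

    def viterbi(words):
        # best[i][t] = (log-prob of best path ending in tag t at position i, backpointer)
        best = [{t: (math.log(trans.get(("<s>", t), 1e-12))
                     + math.log(emit.get((t, words[0]), 1e-12)), None)
                 for t in TAGS}]
        for i in range(1, len(words)):
            row = {}
            for t in TAGS:
                e = math.log(emit.get((t, words[i]), 1e-12))
                row[t] = max(((best[i-1][p][0]
                               + math.log(trans.get((p, t), 1e-12)) + e), p)
                             for p in TAGS)
            best.append(row)
        # Trace back from the best final tag.
        tag = max(TAGS, key=lambda t: best[-1][t][0])
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi("the koala put the keys".split()))
    # Expected: ['DET', 'N', 'V', 'DET', 'N']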

Thursday
- More POS tagging
- Bayesian inference
- Finish chapter 5