600.465 - Intro to NLP - J. Eisner: Grouping Words

Presentation transcript:

Intro to NLP - J. Eisner, slide 1: Grouping Words

Intro to NLP - J. Eisner, slide 2: Linguistic Objects in this Course
- Trees (with strings at the nodes)
  - Syntax, semantics
  - Algorithms: generation, parsing, inside-outside, building semantics
- Sequences (of strings)
  - n-grams, tag sequences; morpheme sequences, phoneme sequences
  - Algorithms: finite-state, best-paths, forward-backward
- "Atoms" (unanalyzed strings)
  - Words, morphemes
  - Represent by contexts: the other words they occur with
  - Algorithms: grouping similar words, splitting words into senses

Intro to NLP - J. Eisner, slide 3: A Concordance for "party"

Intro to NLP - J. Eisner, slide 4: A Concordance for "party"
- thing. She was talking at a party thrown at Daphne's restaurant in
- have turned it into the hot dinner-party topic. The comedy is the
- selection for the World Cup party, which will be announced on May 1
- in the 1983 general election for a party which, when it could not bear to
- to attack the Scottish National Party, who look set to seize Perth and
- that had been passed to a second party who made a financial decision
- the by-pass there will be a street party. "Then," he says, "we are going
- number-crunchers within the Labour party, there now seems little doubt
- political tradition and the same party. They are both relatively Anglophilic
- he told Tony Blair's modernised party they must not retreat into "warm
- "Oh no, I'm just here for the party," they said. "I think it's terrible
- A future obliges each party to the contract to fulfil it by
- be signed by or on behalf of each party to the contract." Mr David N

Intro to NLP - J. Eisner, slide 5: What Good are Word Senses?
- thing. She was talking at a party thrown at Daphne's restaurant in
- have turned it into the hot dinner-party topic. The comedy is the
- selection for the World Cup party, which will be announced on May 1
- in the 1983 general election for a party which, when it could not bear to
- to attack the Scottish National Party, who look set to seize Perth and
- that had been passed to a second party who made a financial decision
- the by-pass there will be a street party. "Then," he says, "we are going
- number-crunchers within the Labour party, there now seems little doubt
- political tradition and the same party. They are both relatively Anglophilic
- he told Tony Blair's modernised party they must not retreat into "warm
- "Oh no, I'm just here for the party," they said. "I think it's terrible
- A future obliges each party to the contract to fulfil it by
- be signed by or on behalf of each party to the contract." Mr David N

Intro to NLP - J. Eisner, slide 6: What Good are Word Senses?
- thing. She was talking at a party thrown at Daphne's restaurant in
- have turned it into the hot dinner-party topic. The comedy is the
- selection for the World Cup party, which will be announced on May 1
- the by-pass there will be a street party. "Then," he says, "we are going
- "Oh no, I'm just here for the party," they said. "I think it's terrible

- in the 1983 general election for a party which, when it could not bear to
- to attack the Scottish National Party, who look set to seize Perth and
- number-crunchers within the Labour party, there now seems little doubt
- political tradition and the same party. They are both relatively Anglophilic
- he told Tony Blair's modernised party they must not retreat into "warm

- that had been passed to a second party who made a financial decision
- A future obliges each party to the contract to fulfil it by
- be signed by or on behalf of each party to the contract." Mr David N

Intro to NLP - J. Eisner, slide 7: What Good are Word Senses?
- John threw a "rain forest" party last December. His living room was full of plants and his box was playing Brazilian music …

Intro to NLP - J. Eisner, slide 8: What Good are Word Senses?
- Replace word w with sense s
  - Splits w into senses: distinguishes this token of w from tokens with sense t
  - Groups w with other words: groups this token of w with tokens of x that also have sense s

Intro to NLP - J. Eisner, slide 9: What Good are Word Senses?
- number-crunchers within the Labour party, there now seems little doubt
- political tradition and the same party. They are both relatively Anglophilic
- he told Tony Blair's modernised party they must not retreat into "warm
- thing. She was talking at a party thrown at Daphne's restaurant in
- have turned it into the hot dinner-party topic. The comedy is the
- selection for the World Cup party, which will be announced on May 1
- the by-pass there will be a street party. "Then," he says, "we are going
- "Oh no, I'm just here for the party," they said. "I think it's terrible
- an appearance at the annual awards bash, but feels in no fit state to
- -known families at a fundraising bash on Thursday night for Learning
- Who was paying for the bash? The only clue was the name Asprey,
- Mail, always hosted the annual bash for the Scottish Labour front-
- popular. Their method is to bash sense into criminals with a short,
- just cut off people's heads and bash their brains out over the floor,

Intro to NLP - J. Eisner, slide 10: What Good are Word Senses?
- number-crunchers within the Labour party, there now seems little doubt
- political tradition and the same party. They are both relatively Anglophilic
- he told Tony Blair's modernised party they must not retreat into "warm
- thing. She was talking at a party thrown at Daphne's restaurant in
- have turned it into the hot dinner-party topic. The comedy is the
- selection for the World Cup party, which will be announced on May 1
- the by-pass there will be a street party. "Then," he says, "we are going
- "Oh no, I'm just here for the party," they said. "I think it's terrible
- an appearance at the annual awards bash, but feels in no fit state to
- -known families at a fundraising bash on Thursday night for Learning
- Who was paying for the bash? The only clue was the name Asprey,
- Mail, always hosted the annual bash for the Scottish Labour front-
- popular. Their method is to bash sense into criminals with a short,
- just cut off people's heads and bash their brains out over the floor,

Intro to NLP - J. Eisner, slide 11: What Good are Word Senses?
- Semantics / text understanding
  - Axioms about TRANSFER apply to (some tokens of) throw
  - Axioms about BUILDING apply to (some tokens of) bank
- Machine translation
- Info retrieval / question answering / text categorization
  - Query or pattern might not match the document exactly
- Backoff for just about anything
  - what word comes next? (speech recognition, language ID, …)
  - trigrams are sparse, but tri-meanings might not be
  - bilexical PCFGs: p(S[devour] → NP[lion] VP[devour] | S[devour]), approximated by p(S[EAT] → NP[lion] VP[EAT] | S[EAT])
- Speaker's real intention is senses; words are a noisy channel
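
To make the backoff idea above concrete, here is a minimal sketch (not from the course) that interpolates a sparse lexicalized PCFG rule estimate with a pooled sense-class estimate. The counts, the word-to-sense mapping, and the interpolation weight are all invented for illustration.

```python
# Sketch: back off a lexicalized rule probability to a sense-class rule probability.
# All counts, the sense dictionary, and lam are hypothetical.
from collections import Counter

sense = {"devour": "EAT", "eat": "EAT", "gobble": "EAT", "lion": "ANIMAL", "tiger": "ANIMAL"}

# Hypothetical counts of lexicalized rules observed in a treebank.
rule_counts = Counter({
    ("S[eat]",    "NP[lion]",  "VP[eat]"):    3,
    ("S[gobble]", "NP[tiger]", "VP[gobble]"): 1,
})
head_counts = Counter({"S[eat]": 5, "S[gobble]": 2, "S[devour]": 0})

def to_class(symbol):
    """Map a lexicalized symbol like 'S[devour]' to a sense-class symbol like 'S[EAT]'."""
    cat, word = symbol[:-1].split("[")
    return f"{cat}[{sense.get(word, word)}]"

# Class-level counts obtained by pooling the lexical counts.
class_rule_counts, class_head_counts = Counter(), Counter()
for (head, *kids), c in rule_counts.items():
    class_rule_counts[(to_class(head), *map(to_class, kids))] += c
for head, c in head_counts.items():
    class_head_counts[to_class(head)] += c

def p_rule(head, kids, lam=0.3):
    """Interpolate the sparse lexical estimate with the sense-class estimate."""
    lex = rule_counts[(head, *kids)] / head_counts[head] if head_counts[head] else 0.0
    cls_key = (to_class(head), *(to_class(k) for k in kids))
    denom = class_head_counts[cls_key[0]]
    cls = class_rule_counts[cls_key] / denom if denom else 0.0
    return lam * lex + (1 - lam) * cls

print(p_rule("S[devour]", ("NP[lion]", "VP[devour]")))  # nonzero thanks to the EAT class
```

Even though the rule headed by "devour" was never observed, its probability is nonzero because other EAT-sense verbs were.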

Intro to NLP - J. Eisner, slide 12: Cues to Word Sense
- Adjacent words (or their senses)
- Grammatically related words (subject, object, …)
- Other nearby words
- Topic of document
- Sense of other tokens of the word in the same document
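
A hedged sketch (not from the course) of how two of these cues, adjacent words and other nearby words, might be turned into features for a word-sense classifier. The feature names, window size, and example sentence are hypothetical; the grammatical-relation and topic cues are omitted.

```python
# Sketch: extract simple context features for the token at position i.
def wsd_features(tokens, i, window=3):
    feats = {}
    feats["prev_word=" + (tokens[i - 1].lower() if i > 0 else "<s>")] = 1
    feats["next_word=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>")] = 1
    # Bag of nearby words (unordered), excluding the target token itself.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            feats["nearby=" + tokens[j].lower()] = 1
    return feats

tokens = "She was talking at a party thrown at Daphne 's restaurant".split()
print(wsd_features(tokens, tokens.index("party")))
```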

Intro to NLP - J. Eisner, slide 13: Word Classes by Tagging
- Every tag is a kind of class
- Tagger assigns a class to each word token

  Tags:  Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
  Words:       Bill directed a cortege of autos through the dunes
  (tag-to-tag probabilities come from a tag bigram model; tag-to-word probabilities from unigram replacement)
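
The scoring implied by this slide can be written out explicitly. Below is a minimal sketch (not from the course) of the joint probability of the tag and word sequences, with transitions from a tag bigram model and emissions from unigram replacement; every numerical value is invented for illustration.

```python
# Sketch: HMM-style score of the tagged example sentence above.
p_tag = {           # p(tag_i | tag_{i-1}) -- tag bigram model (illustrative values)
    ("Start", "PN"): 0.1, ("PN", "Verb"): 0.4, ("Verb", "Det"): 0.3,
    ("Det", "Noun"): 0.8, ("Noun", "Prep"): 0.2, ("Prep", "Noun"): 0.3,
    ("Prep", "Det"): 0.4, ("Noun", "Stop"): 0.1,
}
p_word = {          # p(word | tag) -- unigram replacement (illustrative values)
    ("PN", "Bill"): 0.002, ("Verb", "directed"): 0.001, ("Det", "a"): 0.4,
    ("Noun", "cortege"): 1e-5, ("Prep", "of"): 0.3, ("Noun", "autos"): 1e-4,
    ("Prep", "through"): 0.05, ("Det", "the"): 0.5, ("Noun", "dunes"): 1e-4,
}

def joint_prob(tags, words):
    """p(tags, words) = prod_i p(tag_i | tag_{i-1}) * p(word_i | tag_i)."""
    p = 1.0
    prev = "Start"
    for tag, word in zip(tags, words):
        p *= p_tag[(prev, tag)] * p_word[(tag, word)]
        prev = tag
    return p * p_tag[(prev, "Stop")]

tags  = ["PN", "Verb", "Det", "Noun", "Prep", "Noun", "Prep", "Det", "Noun"]
words = ["Bill", "directed", "a", "cortege", "of", "autos", "through", "the", "dunes"]
print(joint_prob(tags, words))
```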

Intro to NLP - J. Eisner, slide 14: Word Classes by Tagging
- Every tag is a kind of class
- Tagger assigns a class to each word token
- Simultaneously groups and splits words
  - "party" gets split into N and V senses
  - "bash" gets split into N and V senses
  - {party/N, bash/N} vs. {party/V, bash/V}
- What good are these groupings?

Intro to NLP - J. Eisner, slide 15: Learning Word Classes
- Every tag is a kind of class
- Tagger assigns a class to each word token
- {party/N, bash/N} vs. {party/V, bash/V}
- What good are these groupings?
  - Good for predicting the next word or its class!
- Role of the forward-backward algorithm?
  - It adjusts the classes etc. in order to predict the sequence of words better (with lower perplexity)
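
To make "good for predicting the next word or its class" concrete, here is a minimal sketch (not from the course) of a class-based bigram model in the spirit of this slide: p(w_i | w_{i-1}) ≈ p(c_i | c_{i-1}) · p(w_i | c_i). The class assignments and probabilities are invented; in practice forward-backward / EM would adjust them to lower perplexity on real text.

```python
# Sketch: score a word sequence with a class-based bigram model.
# word_class, p_class_bigram, and p_word_given_class are hypothetical.
import math

word_class = {"party": "N", "bash": "N", "throw": "V", "host": "V", "a": "Det", "the": "Det"}
p_class_bigram = {("Det", "N"): 0.6, ("V", "Det"): 0.5, ("N", "V"): 0.2}   # p(c_i | c_{i-1})
p_word_given_class = {"party": 0.3, "bash": 0.1, "throw": 0.4, "host": 0.2, "a": 0.5, "the": 0.5}

def sentence_logprob(words):
    lp = 0.0
    for prev, w in zip(words, words[1:]):
        c_prev, c = word_class[prev], word_class[w]
        lp += math.log(p_class_bigram[(c_prev, c)] * p_word_given_class[w])
    return lp

words = ["throw", "a", "party"]          # scored as V Det N
lp = sentence_logprob(words)
print(lp, "perplexity:", math.exp(-lp / (len(words) - 1)))
```

Because "party" and "bash" share the class N, a model trained on "throw a party" also gives "throw a bash" reasonable probability.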

Intro to NLP - J. Eisner, slide 16: Words as Vectors
- Represent each word type w by a point in k-dimensional space
  - e.g., k is the size of the vocabulary
  - the 17th coordinate of w represents the strength of w's association with vocabulary word 17

  From the corpus:
    Jim Jeffords abandoned the Republican party.
    There were lots of abbots and nuns dancing at that party.
    The party above the art gallery was, above all, a laboratory for synthesizing zygotes and beer.

  party = (0, 0, 3, 1, 0, 7, …, 1, 0)
  (coordinates indexed by vocabulary words: aardvark, abacus, abbot, abduct, above, …, zygote, zymurgy, abandoned; some raw counts come out too high and thus too influential, e.g., for the common word "above", others too low)

Intro to NLP - J. Eisner, slide 17: Words as Vectors
- Represent each word type w by a point in k-dimensional space
  - e.g., k is the size of the vocabulary
  - the 17th coordinate of w represents the strength of w's association with vocabulary word 17; how might you measure this?
    - how often the words appear next to each other
    - how often the words appear near each other
    - how often the words are syntactically linked
    - should correct for commonness of word (e.g., "above")

  party = (0, 0, 3, 1, 0, 7, …, 1, 0)
  (coordinates indexed by vocabulary words: aardvark, abacus, abbot, abduct, above, …, zygote, zymurgy, abandoned)
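
The slide asks how association strength might be measured while correcting for word commonness; pointwise mutual information is one standard answer (not prescribed by the slides). Below is a minimal sketch computing one PMI-weighted coordinate from a tiny, made-up corpus; the window size is arbitrary.

```python
# Sketch: co-occurrence counts within a window, reweighted by PMI.
import math
from collections import Counter

corpus = [
    "jim jeffords abandoned the republican party".split(),
    "there were lots of abbots and nuns dancing at that party".split(),
    "the party above the art gallery was above all a laboratory".split(),
]

window = 2
word_counts, pair_counts = Counter(), Counter()
total = 0
for sent in corpus:
    for i, w in enumerate(sent):
        word_counts[w] += 1
        total += 1
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pair_counts[(w, sent[j])] += 1

def pmi(w, c):
    """log p(w, c) / (p(w) p(c)); 0 if the pair was never seen."""
    if pair_counts[(w, c)] == 0:
        return 0.0
    p_wc = pair_counts[(w, c)] / sum(pair_counts.values())
    p_w, p_c = word_counts[w] / total, word_counts[c] / total
    return math.log(p_wc / (p_w * p_c))

# One coordinate of the vector for "party": its association with "republican".
print(pmi("party", "republican"))
```

Rare-but-telling context words get boosted and common words like "the" get discounted, which is exactly the correction the slide asks for.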

Intro to NLP - J. Eisner, slide 18: Words as Vectors
- Represent each word type w by a point in k-dimensional space
  - e.g., k is the size of the vocabulary
  - the 17th coordinate of w represents the strength of w's association with vocabulary word 17

  party = (0, 0, 3, 1, 0, 7, …, 1, 0)
  (coordinates indexed by vocabulary words: aardvark, abacus, abbot, abduct, above, …, zygote, zymurgy, abandoned)

- Plot all word types in k-dimensional space
- Look for clusters of close-together types

Intro to NLP - J. Eisner, slide 19: Learning Classes by Clustering
- Plot all word types in k-dimensional space
- Look for clusters of close-together types
  [Figure: word types plotted in k dimensions (here k = 3)]

Intro to NLP - J. Eisner, slide 20: Bottom-Up Clustering
- Start with one cluster per point
- Repeatedly merge the 2 closest clusters
  - Single-link: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
  - Complete-link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B

Intro to NLP - J. Eisner, slide 21: Bottom-Up Clustering – Single-Link
- Each word type starts as a single-point cluster; again and again, merge the closest pair of clusters
- Single-link: clusters are close if any of their points are: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
  [Figure: successive single-link merges; example from Manning & Schütze]

Intro to NLP - J. Eisner, slide 22: Bottom-Up Clustering – Single-Link
- Again, merge the closest pair of clusters
- Single-link: clusters are close if any of their points are: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
- Fast, but tends to produce long, stringy, meandering clusters...
  [Figure: example from Manning & Schütze]

Intro to NLP - J. Eisner, slide 23: Bottom-Up Clustering – Complete-Link
- Again, merge the closest pair of clusters
- Complete-link: clusters are close only if all of their points are: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B (the distance between clusters)
  [Figure: example from Manning & Schütze]

Intro to NLP - J. Eisner, slide 24: Bottom-Up Clustering – Complete-Link
- Again, merge the closest pair of clusters
- Complete-link: clusters are close only if all of their points are: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B
- Slow to find the closest pair: needs quadratically many distances
  [Figure: example from Manning & Schütze]

Intro to NLP - J. Eisner, slide 25: Bottom-Up Clustering
- Start with one cluster per point
- Repeatedly merge the 2 closest clusters
  - Single-link: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
  - Complete-link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B (too slow to update cluster distances after each merge; but alternatives exist!)
  - Average-link: dist(A,B) = mean dist(a,b) for a ∈ A, b ∈ B
  - Centroid-link: dist(A,B) = dist(mean(A), mean(B))
- Stop when clusters are "big enough"
  - e.g., provide adequate support for backoff (on a development corpus)
- Some flexibility in defining dist(a,b)
  - Might not be Euclidean distance; e.g., use vector angle
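
The preceding slides describe bottom-up (agglomerative) clustering with several linkage choices. Here is a minimal sketch (not from the course) of a naive implementation with a pluggable linkage, using cosine distance (vector angle) rather than Euclidean distance; the vectors and the stopping cluster count k are made up.

```python
# Sketch: naive bottom-up clustering with single-, complete-, or average-link.
import math

def dist(a, b):
    """Cosine distance between two vectors (1 - cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster_dist(A, B, linkage="single"):
    ds = [dist(a, b) for a in A for b in B]
    if linkage == "single":   return min(ds)     # close if any pair of points is close
    if linkage == "complete": return max(ds)     # close only if all pairs are close
    return sum(ds) / len(ds)                     # average-link

def bottom_up(points, k, linkage="single"):
    clusters = [[p] for p in points]             # start with one cluster per point
    while len(clusters) > k:                     # stop when clusters are "big enough"
        # find and merge the closest pair (quadratically many distances per merge)
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0, 1, 7), (0, 2, 6), (5, 0, 1), (4, 1, 0), (1, 0, 7)]
print(bottom_up(points, k=2, linkage="complete"))
```

Real implementations avoid recomputing all pairwise cluster distances after each merge; this sketch just follows the definitions on the slides.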

Intro to NLP - J. Eisner, slide 26: EM Clustering (for k clusters)
- EM algorithm
  - Viterbi version: called "k-means clustering"
  - Full EM version: called "Gaussian mixtures"
- Expectation step: use current parameters (and observations) to reconstruct hidden structure
- Maximization step: use that hidden structure (and observations) to re-estimate parameters
- Parameters: k points representing cluster centers
- Hidden structure: for each data point (word type), which center generated it?
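
Here is a minimal sketch (not from the course) of the Viterbi ("hard EM") version, i.e., k-means: the E-step assigns each point to its nearest center, the M-step re-estimates each center as the mean of its points. The data and initial centers are invented; the full-EM version (Gaussian mixtures) would use soft, probability-weighted assignments instead.

```python
# Sketch: k-means as hard EM over cluster centers.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # E-step: hidden structure = which center "generated" each point
        assign = [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # M-step: parameters = k cluster centers, re-estimated as cluster means
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assign

points = [(0, 1, 7), (0, 2, 6), (1, 0, 7), (5, 0, 1), (4, 1, 0)]
centers, assign = kmeans(points, centers=[(0, 0, 0), (5, 5, 5)])
print(centers, assign)
```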

Intro to NLP - J. Eisner, slide 27: EM Clustering (for k clusters)
- [see spreadsheet animation]