NLP Linguistics 101 David Kauchak CS159 – Fall 2014

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING PoS-Tagging theory and terminology COMP3310 Natural Language Processing.
Advertisements

Morphology.
Syntax and Context-Free Grammars Julia Hirschberg CS 4705 Slides with contributions from Owen Rambow, Kathy McKeown, Dan Jurafsky and James Martin.
Syntactic analysis using Context Free Grammars. Analysis of language Morphological analysis – Chairs, Part Of Speech (POS) tagging – The/DT man/NN left/VBD.
Statistical NLP: Lecture 3
Fall 2008Programming Development Techniques 1 Topic 9 Symbol Manipulation Generating English Sentences Section This is an additional example to symbolic.
Part-of-speech tagging. Parts of Speech Perhaps starting with Aristotle in the West (384–322 BCE) the idea of having parts of speech lexical categories,
Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.
LING 388 Language and Computers Lecture 22 11/25/03 Sandiway FONG.
Morphology Chapter 7 Prepared by Alaa Al Mohammadi.
Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Word Classes and English Grammar.
Introduction to Syntax Owen Rambow October
Session 6 Morphology 1 Matakuliah : G0922/Introduction to Linguistics
NLP and Speech 2004 English Grammar
Introduction to Syntax, with Part-of-Speech Tagging Owen Rambow September 17 & 19.
BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen.
CS224N Interactive Session Competitive Grammar Writing Chris Manning Sida, Rush, Ankur, Frank, Kai Sheng.
Phrases and Sentences: Grammar
Raymond J. Mooney University of Texas at Austin
Embedded Clauses in TAG
Context Free Grammars Reading: Chap 12-13, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
PARSING David Kauchak CS457 – Fall 2011 some slides adapted from Ray Mooney.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
NLP LINGUISTICS 101 David Kauchak CS457 – Fall 2011 some slides adapted from Ray Mooney.
Phonemes A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning. These units are identified within.
Natural Language Processing Lecture 6 : Revision.
GRAMMARS David Kauchak CS159 – Fall 2014 some slides adapted from Ray Mooney.
CS : Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Constituent Parsing and Algorithms (with.
NLP. Introduction to NLP Is language more than just a “bag of words”? Grammatical rules apply to categories and groups of words, not individual words.
PARSING David Kauchak CS159 – Spring 2011 some slides adapted from Ray Mooney.
Context Free Grammars Reading: Chap 9, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Rada Mihalcea.
11 Chapter 14 Part 1 Statistical Parsing Based on slides by Ray Mooney.
10/30/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 7 Giuseppe Carenini.
For Wednesday Read chapter 23 Homework: –Chapter 22, exercises 1,4, 7, and 14.
Linguistic Essentials
Parsing with Context-Free Grammars for ASR Julia Hirschberg CS 4706 Slides with contributions from Owen Rambow, Kathy McKeown, Dan Jurafsky and James Martin.
CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.
Rules, Movement, Ambiguity
CSA2050 Introduction to Computational Linguistics Parsing I.
PARSING 2 David Kauchak CS159 – Spring 2011 some slides adapted from Ray Mooney.
1 Context Free Grammars October Syntactic Grammaticality Doesn’t depend on Having heard the sentence before The sentence being true –I saw a unicorn.
CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-14: Probabilistic parsing; sequence labeling, PCFG.
Part-of-speech tagging
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
SYNTAX.
3 Phonology: Speech Sounds as a System No language has all the speech sounds possible in human languages; each language contains a selection of the possible.
◦ Process of describing the structure of phrases and sentences Chapter 8 - Phrases and sentences: grammar1.
GRAMMARS David Kauchak CS457 – Spring 2011 some slides adapted from Ray Mooney.
1 Some English Constructions Transformational Framework October 2, 2012 Lecture 7.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
NLP. Introduction to NLP #include int main() { int n, reverse = 0; printf("Enter a number to reverse\n"); scanf("%d",&n); while (n != 0) { reverse =
Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.
PARSING David Kauchak CS159 – Fall Admin Assignment 3 Quiz #1  High: 36  Average: 33 (92%)  Median: 33.5 (93%)
MORPHOLOGY : THE STRUCTURE OF WORDS. MORPHOLOGY Morphology deals with the syntax of complex words and parts of words, also called morphemes, as well as.
Part-of-Speech Tagging CSE 628 Niranjan Balasubramanian Many slides and material from: Ray Mooney (UT Austin) Mausam (IIT Delhi) * * Mausam’s excellent.
Week 3. Clauses and Trees English Syntax. Trees and constituency A sentence has a hierarchical structure Constituents can have constituents of their own.
Lecture 9: Part of Speech
Morphology Morphology Morphology Dr. Amal AlSaikhan Morphology.
Introduction to Machine Learning and Text Mining
SYNTAX.
Part I: Basics and Constituency
Introduction to Linguistics
Linguistic Essentials
Natural Language Processing
David Kauchak CS159 – Spring 2019
David Kauchak CS159 – Spring 2019
Introduction to Linguistics
Presentation transcript:

NLP Linguistics 101 David Kauchak CS159 – Fall 2014 some slides adapted from Ray Mooney

Admin Assignment 2: How’d it go? Quiz #1 CS server issues Thursday First 30 minutes of class (show up on time!) Everything up to today (but not including today)

Simplified View of Linguistics Phonology/Phonetics /waddyasai/ Morphology /waddyasai/ what did you say say Syntax what did you say Nikolai Trubetzkoy in Grundzüge der Phonologie (1939) defines phonology as "the study of sound pertaining to the system of language," as opposed to phonetics, which is "the study of sound pertaining to the act of speech." obj subj you what say Semantics obj subj P[ x. say(you, x) ] you what Discourse what did you say what did you say

Morphology What is morphology? Why might this be useful for NLP? study of the internal structure of words morph-ology word-s jump-ing Why might this be useful for NLP? generalization (runs, running, runner are related) additional information (it’s plural, past tense, etc) allows us to handle words we’ve never seen before smoothing?

New words AP newswire stories from Feb 1988 – Dec 30, 1988 300K unique words New words seen on Dec 31 compounds: prenatal-care, publicly-funded, channel- switching, … New words: dumbbells, groveled, fuzzier, oxidized, ex-presidency, puppetry, boulderlike, over-emphasized, antiprejudice new technology: - googled, googling - tweeted

Morphology basics Words are built up from morphemes stems (base/main part of the word) affixes prefixes precedes the stem suffixes follows the stem infixes inserted inside the stem circumfixes surrounds the stem Examples?

Morpheme examples prefix suffix circum- (circumnavigate) dis- (dislike) mis- (misunderstood) com-, de-, dis-, in-, re-, post-, trans-, … suffix -able (movable) -ance (resistance) -ly (quickly) -tion, -ness, -ate, -ful, …

Morpheme examples infix circumfix -fucking- (cinder-fucking-rella) more common in other languages circumfix doesn’t really happen in English a- -ing a-running a-jumping

Agglutinative: Finnish talo 'the-house’ kaup-pa 'the-shop' talo-ni 'my house' kaup-pa-ni 'my shop' talo-ssa 'in the-house' kaup-a-ssa 'in the-shop' talo-ssa-ni 'in my house’ kaup-a-ssa-ni 'in my shop' talo-i-ssa 'in the-houses’ kaup-o-i-ssa 'in the-shops' talo-i-ssa-ni 'in my houses’ kaup-o-i-ssa-ni 'in my shops'

Stemming (baby lemmatization) Reduce a word to the main morpheme automate automates automatic automation automat run runs running run

Stemming example This is a poorly constructed example using the Porter stemmer. This is a poorli construct example us the Porter stemmer. http://maya.cs.depaul.edu/~classes/ds575/porter.html (or you can download versions online)

Porter’s algorithm (1980) Most common algorithm for stemming English Results suggest it’s at least as good as other stemming options Multiple sequential phases of reductions using rules, e.g. sses  ss ies  i ational  ate tional  tion http://tartarus.org/~martin/PorterStemmer/

What is Syntax? Study of structure of language Examine the rules of how words interact and go together Rules governing grammaticality I will give you one perspective no single correct theory of syntax still an active field of research in linguistics we will often use it as a tool/stepping stone for other applications

Structure in language The man all the way home. what are some examples of words that can/can’t go here?

Structure in language The man all the way home. why can’t some words go here?

Structure in language The man flew all the way home. Language is bound by a set of rules It’s not clear exactly the form of these rules, however, people can generally recognize them This is syntax!

Syntax != Semantics Colorless green ideas sleep furiously. Syntax is only concerned with how words interact from a grammatical standpoint, not semantically (i.e. meaning)

Parts of speech What are parts of speech (think 3rd grade)?

Parts of speech The man all the way home. Parts of speech are constructed by grouping words that function similarly: - with respect to the words that can occur nearby - and by their morphological properties The man all the way home. ran forgave ate drove drank hid learned hurt integrated programmed shot shouted sat slept understood voted washed warned walked spoke succeeded survived read recorded

Parts of speech What are the English parts of speech? Noun (person, place or thing) Verb (actions and processes) Adjective (modify nouns) Adverb (modify verbs) Preposition (on, in, by, to, with) Determiners (a, an, the, what, which, that) Conjunctions (and, but, or) Particle (off, up)

English parts of speech Brown corpus: 87 POS tags Penn Treebank: ~45 POS tags Derived from the Brown tagset Most common in NLP Many of the examples we’ll show us this one British National Corpus (C5 tagset): 61 tags C6 tagset: 148 C7 tagset: 146 C8 tagset: 171 tag sets also include tags for punctuation

Tagsets Brown tagset: http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html C8 tagset: http://ucrel.lancs.ac.uk/claws8tags.pdf

English Parts of Speech Noun (person, place or thing) Singular (NN): dog, fork Plural (NNS): dogs, forks Proper (NNP, NNPS): John, Springfields Personal pronoun (PRP): I, you, he, she, it Wh-pronoun (WP): who, what Verb (actions and processes) Base, infinitive (VB): eat Past tense (VBD): ate Gerund (VBG): eating Past participle (VBN): eaten Non 3rd person singular present tense (VBP): eat 3rd person singular present tense: (VBZ): eats Modal (MD): should, can To (TO): to (to eat)

English Parts of Speech (cont.) Adjective (modify nouns) Basic (JJ): red, tall Comparative (JJR): redder, taller Superlative (JJS): reddest, tallest Adverb (modify verbs) Basic (RB): quickly Comparative (RBR): quicker Superlative (RBS): quickest Preposition (IN): on, in, by, to, with Determiner: Basic (DT) a, an, the WH-determiner (WDT): which, that Coordinating Conjunction (CC): and, but, or, Particle (RP): off (took off), up (put up)

Closed vs. Open Class Closed class categories are composed of a small, fixed set of grammatical function words for a given language. Pronouns, Prepositions, Modals, Determiners, Particles, Conjunctions Open class categories have large number of words and new ones are easily invented. Nouns (Googler, futon, iPad), Verbs (Google, futoning), Adjectives (geeky), Abverb (chompingly)

Part of speech tagging Annotate each word in a sentence with a part-of- speech marker Lowest level of syntactic analysis John saw the saw and decided to take it to the table. NNP VBD DT NN CC VBD TO VB PRP IN DT NN

Ambiguity in POS Tagging I like candy. Time flies like an arrow. VBP (verb, non-3rd person, singular, present) IN (preposition) Does “like” play the same role (POS) in these sentences?

Ambiguity in POS Tagging I bought it at the shop around the corner. I never got around to getting the car. The cost of a new Prius is around $25K. IN (preposition) RP (particle… on, off) RB (adverb) Does “around” play the same role (POS) in these sentences?

Ambiguity in POS tagging Like most language components, the challenge with POS tagging is ambiguity Brown corpus analysis 11.5% of word types are ambiguous (this sounds promising!), but… 40% of word appearances are ambiguous Unfortunately, the ambiguous words tend to be the more frequently used words

How hard is it? If I told you I had a POS tagger that achieved 90% accuracy would you be impressed? Shouldn’t be… just picking the most frequent POS for a word gets you this What about a POS tagger that achieves 93.7%? Still probably shouldn’t be… only need to add a basic module for handling unknown words What about a POS tagger that achieves 100%? Should be suspicious… humans only achieve ~97% Probably overfitting (or cheating!)

POS Tagging Approaches Rule-Based: Human crafted rules based on lexical and other linguistic knowledge Learning-Based: Trained on human annotated corpora like the Penn Treebank Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF), log-linear models, support vector machines Rule learning: Transformation Based Learning (TBL) The book discusses some of the more common approaches Many publicly available: http://nlp.stanford.edu/links/statnlp.html (list 15 different ones mostly publicly available!) http://www.coli.uni-saarland.de/~thorsten/tnt/

Constituency likes to eat candy. What can/can’t go here? Parts of speech can be thought of as the lowest level of syntactic information Groups words together into categories likes to eat candy. What can/can’t go here?

Constituency likes to eat candy. nouns determiner nouns Dave Professor Kauchak Dr. Suess The man The boy The cat determiner nouns + pronouns The man that I saw The boy with the blue pants The cat in the hat He She

Constituency Words in languages tend to form into functional groups (parts of speech) Groups of words (aka phrases) can also be grouped into functional groups often some relation to parts of speech though, more complex interactions These phrase groups are called constituents

Common constituents He likes to eat candy. noun phrase verb phrase The man in the hat ran to the park. noun phrase verb phrase

Common constituents The man in the hat ran to the park. noun phrase prepositional phrase prepositional phrase noun phrase verb phrase

Common constituents The man in the hat ran to the park. noun phrase prepositional phrase noun phrase prepositional phrase noun phrase verb phrase

Syntactic structure Hierarchical: syntactic trees NP VP PP non-terminals PP NP NP NP DT NN IN DT NN VBD IN DT NN parts of speech The man in the hat ran to the park. terminals (words)

Syntactic structure The man in the hat ran to the park. S NP VP PP PP (S (NP (NP (DT the) (NN man)) (PP (IN in) (NP (DT the) (NN hat)))) (VP (VBD ran) (PP (TO to (NP (DT the) (NN park)))))) S NP VP PP PP NP NP NP DT NN IN DT NN VBD IN DT NN The man in the hat ran to the park.

Syntactic structure (S (NP (NP (DT the) (NN man)) (PP (IN in) (S (NP (NP (DT the) (NN man)) (PP (IN in) (NP (DT the) (NN hat)))) (VP (VBD ran) (PP (TO to (NP (DT the) (NN park)))))) (S (NP (NP (DT the) (NN man)) (PP (IN in) (NP (DT the) (NN hat)))) (VP (VBD ran) (PP (TO to) (NP (DT the) (NN park))))))

Syntactic structure A number of related problems: Given a sentence, can we determine the syntactic structure? Can we determine if a sentence is grammatical? Can we determine how likely a sentence is to be grammatical? to be an English sentence? Can we generate candidate, grammatical sentences?

Grammars What is a grammar (3rd grade again…)?

Grammars Grammar is a set of structural rules that govern the composition of sentences, phrases and words Lots of different kinds of grammars: regular context-free context-sensitive recursively enumerable transformation grammars

States What is the capitol of this state? Jefferson City (Missouri)

Context free grammar How many people have heard of them? Look like: S  NP VP left hand side (single symbol) right hand side (one or more symbols)

Formally… G = (NT, T, P, S) NT: finite set of nonterminal symbols T: finite set of terminal symbols, NT and T are disjoint P: finite set of productions of the form A  , A  NT and   (T  NT)* S  NT: start symbol

CFG: Example Many possible CFGs for English, here is an example (fragment): S  NP VP VP  V NP NP  DetP N | AdjP NP AdjP  Adj | Adv AdjP N  boy | girl V  sees | likes Adj  big | small Adv  very DetP  a | the | is shorthand

Grammar questions Can we determine if a sentence is grammatical? Given a sentence, can we determine the syntactic structure? Can we determine how likely a sentence is to be grammatical? to be an English sentence? Can we generate candidate, grammatical sentences? Which of these can we answer with a CFG? How?

Grammar questions Can we determine if a sentence is grammatical? Is it accepted/recognized by the grammar Applying rules right to left, do we get the start symbol? Given a sentence, can we determine the syntactic structure? Keep track of the rules applied… Can we determine how likely a sentence is to be grammatical? to be an English sentence? Not yet… no notion of “likelihood” (probability) Can we generate candidate, grammatical sentences? Start from the start symbol, randomly pick rules that apply (i.e. left hand side matches)