I256 Applied Natural Language Processing Fall 2009 Lecture 9 Review Barbara Rosario
2 Why NLP is difficult Fundamental goal: deep understanding of broad language –Not just string processing or keyword matching Language is ambiguous –At all levels: lexical, phrase, semantic Language is flexible –New words, new meanings –Different meanings in different contexts Language is subtle Language is about human communication Problem of scale –Many (infinite?) possible words, meanings, contexts Problem of sparsity –Very difficult to do statistical analysis; most things (words, concepts) are never seen before Long range correlations Representation of meaning
3 Linguistics essentials Important distinction: –study of language structure (grammar) –study of meaning (semantics) Grammar –Phonology (the study of sound systems and abstract sound units). –Morphology (the formation and composition of words) –Syntax (the rules that determine how words combine into sentences) Semantics –The study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences
4 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Lexical semantics –Word sense disambiguation (WSD) –Lexical acquisition Corpus-based statistical approaches to tackle NLP problems Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing What I hope we achieved Overall idea of linguistic problems Overall understanding of “lower level” NLP tasks –POS, WSD, language models, segmentation, etc –NOTE: Will be used for preprocessing and as features for higher level tasks Initial understanding of Stat NLP –Corpora & annotation –probability theory, GM –Sparsity problem Familiarity with Python and NLTK
5 Morphology Morphology is the study of the internal structure of words, of the way words are built up from smaller meaning units. Morpheme: –The smallest meaningful unit in the grammar of a language. Two classes of morphemes –Stems: “main” morpheme of the word, supplying the main meaning (e.g. establish in the example below) –Affixes: add additional meaning Prefixes: Antidisestablishmentarianism (anti-, dis-) Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism) Infixes: hingi (borrow) – humingi (borrower) in Tagalog Circumfixes: sagen (say) – gesagt (said) in German –Examples: unladylike, dogs, technique
6 Types of morphological processes Inflection: –Systematic modification of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. –Doesn’t change the word class –New grammatical role –Usually produces a predictable, non-idiosyncratic change of meaning. run → runs | running | ran; hope+ing → hoping; hop+ing → hopping Derivation: –Ex: compute → computer → computerization –Less systematic than inflection –It can involve a change of meaning Compounding: –Merging of two or more words into a new word: downmarket, (to) overtake
7 Stemming & Lemmatization Stemming: the removal of the inflectional ending from words (strip off any affixes) Laughing, laugh, laughs, laughed → laugh –Problems Can conflate semantically different words –Gallery and gall may both be stemmed to gall –Regular Expressions for Stemming –Porter Stemmer –nltk.wordnet.morphy A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
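The regular-expression approach to stemming can be sketched in a few lines of plain Python. This is only an illustration: the suffix list (`ing`, `ed`, `s`) is an assumption chosen to cover the "laugh" example, and it is not the Porter algorithm or the NLTK implementation.

```python
import re

# Hypothetical, deliberately tiny suffix list; a real stemmer handles many more cases.
SUFFIX_PATTERN = re.compile(r"^(.*?)(ing|ed|s)?$")

def regex_stem(word):
    """Strip at most one inflectional suffix from a lowercased word."""
    stem, _suffix = SUFFIX_PATTERN.match(word.lower()).groups()
    return stem

words = ["Laughing", "laugh", "laughs", "laughed"]
stems = [regex_stem(w) for w in words]

# Over-stemming: the same rule damages words that merely end in a suffix string.
overstemmed = regex_stem("wing")
```

All four forms map to `laugh`, but `wing` is reduced to `w`, which is the kind of error that motivates checking the result against a dictionary (lemmatization).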
8 Grammar: words: POS Words of a language are grouped into classes to reflect similar syntactic behaviors Syntactical or grammatical categories (aka part-of-speech) –Nouns (people, animals, concepts) –Verbs (actions, states) –Adjectives –Prepositions –Determiners Open or lexical categories (nouns, verbs, adjectives) –Large number of members, new words are commonly added Closed or functional categories (prepositions, determiners) –Few members, clear grammatical use
9 Part-of-speech (English) From Dan Klein’s cs 288 slides
10 Terminology Tagging –The process of associating labels with each token in a text Tags –The labels –Syntactic word classes Tag Set –The collection of tags used
11 Example Typically a tagged text is a sequence of white-space separated base/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
12 Part-of-speech (English) From Dan Klein’s cs 288 slides
13 Part-of-Speech Ambiguity Words that are highly ambiguous as to their part of speech tag
14 Sources of information Syntagmatic: tags of the other words –AT JJ NN is common –AT JJ VBP is impossible (or unlikely) Lexical: look at the words –The → AT –Flour is more likely to be a noun than a verb –A tagger that always chooses the most common tag is 90% correct (often used as baseline) Most taggers use both
15 What does Tagging do? 1.Collapses Distinctions Lexical identity may be discarded e.g., all personal pronouns tagged with PRP 2.Introduces Distinctions Ambiguities may be resolved e.g. deal tagged with NN or VB 3.Helps in classification and prediction
16 Why POS? A word’s POS tells us a lot about the word and its neighbors: –Limits the range of meanings (deal), pronunciation (text to speech) (object vs object, record) or both (wind) –Helps in stemming: saw[v] → see, saw[n] → saw –Limits the range of following words –Can help select nouns from a document for summarization –Basis for partial parsing (chunked parsing)
17 Choosing a tagset The choice of tagset greatly affects the difficulty of the problem Need to strike a balance between –Getting better information about context –Making it possible for classifiers to do their job
18 Tagging methods Hand-coded Statistical taggers –N-Gram Tagging –HMM –(Maximum Entropy) Brill (transformation-based) tagger
19 Unigram Tagger Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. –For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).
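A minimal sketch of the unigram idea, written in plain Python rather than with NLTK's tagger classes. The tiny training list and the default tag `NN` for unseen words are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag(model, tokens, default="NN"):
    """Tag tokens with their most likely tag; unseen words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

# Toy training data (hypothetical): "frequent" is JJ twice and VB once.
training = [("a", "AT"), ("frequent", "JJ"), ("word", "NN"),
            ("frequent", "JJ"), ("I", "PPSS"), ("frequent", "VB"),
            ("this", "DT"), ("cafe", "NN")]
model = train_unigram_tagger(training)
tagged = tag(model, ["I", "frequent", "this", "cafe"])
```

Because the tagger ignores context, `frequent` is tagged JJ even in "I frequent this cafe", where it is a verb. This is the limitation that n-gram taggers address.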
20 N-Gram Tagging An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
21 N-Gram Tagging Why not 10-gram taggers? As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off)
22 Markov Model Tagger Bigram tagger Assumptions: –Words are independent of each other –A word identity depends only on its tag –A tag depends only on the previous tag
23 Markov Model Tagger [Figure: chain of tag nodes t_1, t_2, …, t_n, each tag linked to the previous tag and to its word w_1, w_2, …, w_n]
24 Rule-Based Tagger The Linguistic Complaint –Where is the linguistic knowledge of a tagger? –Just massive tables of numbers –Aren’t there any linguistic insights that could emerge from the data? –Could thus use handcrafted sets of rules to tag input sentences, for example, if input follows a determiner tag it as a noun.
25 The Brill tagger (transformation-based tagger) An example of Transformation-Based Learning –Basic idea: do a quick job first (using frequency), then revise it using contextual rules. Very popular (freely available, works fairly well) –Probably the most widely used tagger (esp. outside NLP) –…. but not the most accurate: 96.6% / 82.0 % A supervised method: requires a tagged corpus
26 Brill Tagging: In more detail Start with simple (less accurate) rules…learn better ones from tagged corpus –Tag each word initially with most likely POS –Examine set of transformations to see which improves tagging decisions compared to tagged corpus –Re-tag corpus using best transformation –Repeat until, e.g., performance doesn’t improve –Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text
27 An example Examples: –They are expected to race tomorrow. –The race for outer space. Tagging algorithm: 1.Tag all uses of “race” as NN (most likely tag in the Brown corpus) They are expected to race/NN tomorrow the race/NN for outer space 2.Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO: They are expected to race/VB tomorrow the race/NN for outer space
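The two steps of this example can be sketched as follows. Only the rule itself (NN → VB after TO) and the initial tag `race/NN` come from the slide; the small lexicon with its other tags and the function names are hypothetical, and a real Brill tagger learns an ordered list of such rules from a tagged corpus.

```python
def initial_tag(tokens, lexicon, default="NN"):
    """Step 1: tag every token with its most likely tag."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Step 2: change from_tag to to_tag when the preceding tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        tok, current = out[i]
        if current == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (tok, to_tag)
    return out

# Hypothetical most-likely-tag lexicon for the two example sentences.
lexicon = {"they": "PPSS", "are": "BER", "expected": "VBN", "to": "TO",
           "race": "NN", "tomorrow": "NR", "the": "AT", "for": "IN",
           "outer": "JJ", "space": "NN"}

s1 = initial_tag("They are expected to race tomorrow".split(), lexicon)
s2 = initial_tag("The race for outer space".split(), lexicon)

# Transformation: NN -> VB when the previous tag is TO.
s1_fixed = apply_rule(s1, "NN", "VB", "TO")
s2_fixed = apply_rule(s2, "NN", "VB", "TO")
```

The rule retags `race` only in the first sentence, where it follows `to/TO`; in the second it follows `The/AT` and stays a noun.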
28 What gets learned? [from Brill 95] Tag-triggered transformations; morphology-triggered transformations Rules are linguistically interpretable
29 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Word sense disambiguation (WSD) –Lexical semantics –Lexical acquisition Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing
30 Phrase structure Words are organized in phrases Phrases: grouping of words that are clumped as a unit Syntax: study of the regularities and constraints of word order and phrase structure
31 Major phrase types Sentence (S) (whole grammatical unit). Normally rewrites as a subject noun phrase and a verb phrase Noun phrase (NP): phrase whose head is a noun or a pronoun, optionally accompanied by a set of modifiers –The smart student of physics with long hair
32 Major phrase types Prepositional phrases (PP) –Headed by a preposition and containing a NP She is [on the computer] They walked [to their school] Verb phrases (VP) –Phrase whose head is a verb [Getting to school on time] was a struggle He [was trying to keep his temper] That woman [quickly showed me the way to hide]
33 Phrase structure grammar Syntactic analysis of sentences –(Ultimately) to extract meaning: Mary gave Peter a book Peter gave Mary a book
34 Phrase structure parsing Parsing: the process of reconstructing the derivation(s) or phrase structure trees that give rise to a particular sequence of words Parse is a phrase structure tree –New art critics write reviews with computers
35 Phrase structure parsing & ambiguity The children ate the cake with a spoon PP Attachment Ambiguity Why is it important for NLP?
36 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing –Text normalization –Segmentation Semantics Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing
37 Text Normalization Stemming Convert to lower case Identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. –For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks. Lemmatization –Make sure that the resulting form is a known word in a dictionary –WordNet lemmatizer only removes affixes if the resulting word is in its dictionary
38 Segmentation Word segmentation –For languages that do not put spaces between words Chinese, Japanese, Korean, Thai, German (for compound nouns) Tokenization Sentence segmentation –Divide text into sentences
39 Tokenization Divide text into units called tokens (words, numbers, punctuation) –Pages 124–136 Manning What is a word? –Graphic word: string of contiguous alphanumeric characters surrounded by white space; cf. $22.50 –Main clue (in English) is the occurrence of whitespace –Problems Periods: usually remove punctuation but sometimes it’s useful to keep periods (Wash. vs. wash) Single apostrophes, contractions (isn’t, didn’t, dog’s): for meaning extraction it could be useful to have 2 separate forms (is + n’t or not) Hyphenation: –Sometimes best as a single word: co-operate –Sometimes best as 2 separate words: 26-year-old, aluminum-export ban (RE for tokenization)
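A regular-expression tokenizer along these lines might look like the sketch below. The pattern is an assumption written for this example (it is not the pattern from Manning or the NLTK book): it keeps currency amounts and hyphenated words together and splits off every other punctuation mark.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \$?\d+(?:\.\d+)?%?(?![\w-])   # numbers, currency, percentages: $22.50, 82%
  | \w+(?:-\w+)*                  # words, keeping internal hyphens: co-operate
  | [^\w\s]                       # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched left to right."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("The 26-year-old paid $22.50.")
contraction = tokenize("isn't")
```

Here `26-year-old` and `$22.50` each survive as one token, while `isn't` is split naively at the apostrophe. This shows why contractions and hyphens need task-specific decisions.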
40 Sentence Segmentation Sentence: –Something ending with ., ? or ! (and sometimes also :) –“You reminded me,” she remarked, “of your mother.” Nested sentences; note the .” Sentence boundary detection algorithms –Heuristic (see figure 4.1 page 135 Manning) –Statistical classification trees (Riley 1989) Probability of a word to occur before or after a boundary, case and length of words –Neural network (Palmer and Hearst 1997) Part of speech distribution of preceding and following words –Maximum Entropy (Mikheev 1998) For reference see Manning
41 Sentence Segmentation Sentence: –Something ending with ., ? or ! (and sometimes also :) –“You reminded me,” she remarked, “of your mother.” Nested sentences; note the .” Sentence boundary detection algorithms –Heuristic (see figure 4.1 page 135 Manning) –Statistical classification trees (Riley 1989) Probability of a word to occur before or after a boundary, case and length of words –Neural network (Palmer and Hearst 1997) Part of speech distribution of preceding and following words –Maximum Entropy (Mikheev 1998) Note: MODELS and Features
42 Segmentation as classification Sentence segmentation can be viewed as a classification task for punctuation: –Whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence. –We’ll return to this when we cover classification See Section 6.2 of the NLTK book For word segmentation see Section 3.8 of the NLTK book –Also page 180 of Speech and Language Processing (Jurafsky and Martin)
43 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Lexical semantics –Word sense disambiguation (WSD) –Lexical acquisition Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing
44 Semantics Semantics is the study of the meaning of words, construction and utterances 1.Study of the meaning of individual words (lexical semantics) 2.Study of how meanings of individual words are combined into the meaning of sentences (or larger units)
45 Lexical semantics How words are related to each other Hyponymy –scarlet, vermilion, carmine, and crimson are all hyponyms of red Hypernymy Antonymy (opposite) –Male, female Meronymy (part of) –Tire is a meronym of car Etc.
46 Word Senses Words have multiple distinct meanings, or senses: –Plant: living plant, manufacturing plant, … –Title: name of a work, ownership document, form of address, material at the start of a film, … Many levels of sense distinctions –Homonymy: totally unrelated meanings (river bank, money bank) –Polysemy: related meanings (star in sky, star on tv, title) –Systematic polysemy: productive meaning extensions (metonymy such as organizations to their buildings) or metaphor –Sense distinctions can be extremely subtle (or not) Granularity of senses needed depends a lot on the task Taken from Dan Klein’s cs 288 slides
47 Word Sense Disambiguation Determine which of the senses of an ambiguous word is invoked in a particular use of the word Example: living plant vs. manufacturing plant How do we tell these senses apart? –“Context” The manufacturing plant which had previously sustained the town’s economy shut down after an extended labor strike. –Maybe it’s just text categorization –Each word sense represents a topic Why is it important to model and disambiguate word senses? –Translation: bank → banca or riva –Parsing: for PP attachment, for example –Information retrieval: to return documents with the right sense of bank Adapted from Dan Klein’s cs 288 slides
48 Features Bag-of-words (use words around with no order) –The manufacturing plant which had previously sustained the town’s economy shut down after an extended labor strike. –Bags of words = {after, manufacturing, which, labor,..} Bag-of-words classification works ok for noun senses –90% on classic, shockingly easy examples (line, interest, star) –80% on senseval-1 nouns –70% on senseval-1 verbs
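Extracting a bag of words for the example sentence can be sketched as below. The crude regular-expression tokenizer and the choice to drop the target word itself are assumptions made for this illustration.

```python
import re
from collections import Counter

def bag_of_words(sentence, target):
    """Count the context words around the target, ignoring word order."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return Counter(tok for tok in tokens if tok != target)

sentence = ("The manufacturing plant which had previously sustained the "
            "town's economy shut down after an extended labor strike.")
bag = bag_of_words(sentence, target="plant")
```

The resulting counts (`manufacturing`, `labor`, `strike`, …) are the features a classifier would use to choose between the senses of `plant`.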
49 Verb WSD Why are verbs harder? –Verbal senses less topical –More sensitive to structure, argument choice –Better disambiguated by their argument (subject-object): importance of local information –For nouns, a wider context likely to be useful Verb Example: “Serve” –[function] The tree stump serves as a table –[enable] The scandal served to increase his popularity –[dish] We serve meals for the homeless –[enlist] She served her country –[jail] He served six years for embezzlement –[tennis] It was Agassi's turn to serve –[legal] He was served by the sheriff Different types of information may be appropriate for different part of speech Adapted from Dan Klein’s cs 288 slides
50 Better features There are smarter features: –Argument selectional preference: serve NP[meals] vs. serve NP[papers] vs. serve NP[country] Subcategorization: –[function] serve PP[as] –[enable] serve VP[to] –[tennis] serve –[food] serve NP {PP[to]} – Can capture poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly Taken from Dan Klein’s cs 288 slides
51 Various Approaches to WSD Unsupervised learning –We don’t know/have the labels –More than disambiguation, it is discrimination Cluster into groups and discriminate between these groups without giving labels Clustering –Example: EM (expectation-maximization), Bootstrapping (seeded with some labeled data) Supervised learning Adapted from Dan Klein’s cs 288 slides
52 Supervised learning –When we know the truth (true senses) (not always true or easy) –Classification task –Most systems do some kind of supervised learning –Many competing classification technologies perform about the same (it’s all about the knowledge sources you tap) –Problem: training data available for only a few words –Examples: Bayesian classification Naïve Bayes (simplest example of Graphical models) Adapted from Dan Klein’s cs 288 slides
53 Semantics: beyond individual words Once we have the meaning of the individual words, we need to assemble them to get the meaning of the whole sentence Hard because natural language does not obey the principle of compositionality by which the meaning of the whole can be predicted by the meanings of the parts
54 Semantics: beyond individual words Collocations –White skin, white wine, white hair Idioms: meaning is opaque –Kick the bucket
55 Lexical acquisition Develop algorithms and statistical techniques for filling the holes in existing dictionaries and lexical resources by looking at the occurrences of patterns of words in large text corpora –Collocations –Semantic similarity –(Logical metonymy) –Selectional preferences
56 Collocations A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things –Noun phrases: weapons of mass destruction, stiff breeze (but why not *stiff wind?) –Verbal phrases: to make up –Not necessarily contiguous: knock … door Limited compositionality –Compositional if meaning of expression can be predicted by the meaning of the parts –Idioms are the most extreme examples of non-compositionality: kick the bucket
57 Collocations Non Substitutability –Cannot substitute words in a collocation *yellow wine Non modifiability –To get a frog in one’s throat *To get an ugly frog in one’s throat Useful for –Language generation *Powerful tea, *take a decision –Machine translation Easy way to test if a combination is a collocation is to translate it into another language –Make a decision *faire une decision (prendre), *fare una decisione (prendere)
58 Finding collocations Frequency –If two words occur together a lot, that may be evidence that they have a special function –Filter by POS patterns –A N (linear function), N N (regression coefficients) etc. Mean and variance of the distance between the words –For non-contiguous collocations Mutual information measure
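As a rough illustration of the frequency and mutual-information ideas, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one common form of the measure. The toy corpus, the `min_count` threshold and the simple relative-frequency estimates are assumptions. Real use would need a large corpus and usually a POS filter.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score each adjacent pair seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # frequency filter: rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "strong tea and powerful computers but strong tea again".split()
scores = pmi_scores(tokens)
```

Only `("strong", "tea")` passes the frequency filter here, and its positive score indicates that the pair co-occurs more often than independence would predict.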
59 Lexical acquisition Examples: –“insulin” and “progesterone” are in WordNet 2.1 but “leptin” and “pregnenolone” are not. –“HTML” and “SGML”, but not “XML” or “XHTML”. –“Google” and “Yahoo”, but not “Microsoft” or “IBM”. We need some notion of word similarity to know where to locate a new word in a lexical resource
60 Semantic similarity Similar if contextually interchangeable –The degree to which one word can be substituted for another in a given context Suit is similar to litigation (but only in the legal context) Measures of similarity –WordNet-based –Vector-based Detecting hyponymy and other relations with patterns
61 Lexical acquisition Lexical acquisition problems –Collocations –Semantic similarity –(Logical metonymy) –Selectional preferences
62 Selectional preferences Most verbs prefer arguments of a particular type: selectional preferences or restrictions –Objects of eat tend to be food, subjects of think tend to be people etc. –“Preferences” to allow for metaphors: Fear eats the soul Why is it important for NLP?
63 Selectional preferences Why important? –To infer meaning from selectional restrictions Suppose we don’t know the word durian (not in the vocabulary) Susan ate a very fresh durian Infer that durian is a type of food –Ranking the possible parses of a sentence Give higher scores to parses where the verb has “natural” arguments
64 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Lexical semantics –Word sense disambiguation (WSD) –Lexical acquisition Corpus-based statistical approaches to tackle NLP problems Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing
65 Corpus-based statistical approaches to tackle NLP problems Data (corpora, labels, linguistic resources) Feature extraction (usually linguistically motivated) Statistical models
66 The NLP Pipeline For a given problem to be tackled 1.Choose corpus (or build your own) –Low level processing done to the text before the ‘real work’ begins Important but often neglected –Low-level formatting issues Junk formatting/content (HTML tags, tables) Case change (e.g. everything to lower case) Tokenization, sentence segmentation 2.Choose annotation to use (or choose the label set and label it yourself) –Check labeling (inconsistencies etc.) 3.Extract features 4.Choose or implement new NLP algorithms 5.Evaluate 6.(Eventually) re-iterate
67 Corpora Text Corpora & Annotated Text Corpora –NLTK corpora –Use/create your own Lexical resources –WordNet –VerbNet –FrameNet –Domain specific lexical resources Corpus Creation Annotation
68 Annotated Text Corpora Many text corpora contain linguistic annotations, representing genres, POS tags, named entities, syntactic structures, semantic roles, and so forth. Not part of the text in the file; it explains something of the structure and/or semantics of text
69 Annotated Text Corpora Grammar annotation –POS, parses, chunks Semantic annotation –Topics, Named Entities, sentiment, Author, Language, Word senses, co-reference … Lower level annotation –Word tokenization, Sentence Segmentation, Paragraph Segmentation
70 Processing Search Engine Results The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this text
71 Lexical Resources A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts –A vocabulary (list of words in a text) is the simplest lexical resource WordNet VerbNet FrameNet Medline
72 Annotation: main issues Deciding Which Layers of Annotation to Include –Grammar annotation –Semantic annotation –Lower level annotation Markup schemes How to do the annotation Design of a tag set
73 Annotation: design of a tag set Tag set: the set of the annotation classes: genres, POS etc. The tags should reflect distinctive text properties, i.e. ideally we would want to give distinctive tags to words (or documents) that have distinctive distributions –That: complementizer and preposition: 2 very different distributions: Two tags or only one? If two: more predictive If one: automatic classification easier (fewer classes) Tension: splitting tags/classes to capture useful distinctions gives improved information for prediction but can make the classification task harder
74 How to do the annotation By hand –Can be difficult and time consuming; domain knowledge and/or training may be required –Amazon’s Mechanical Turk (MTurk) allows you to create and post a task that requires human intervention (offering a reward for the completion of the task) Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment) We obtained labels for 3627 text segments for under $70. Each HIT was completed (by all 3 “workers”) within a few minutes to half an hour –[Yakhnenko and Rosario 07] Unsupervised methods do not use labeled data and try to learn a task from the “properties” of the data. Automatic (e.g. using some other metadata available) Bootstrapping –Bootstrapping is an iterative process where, given (usually) a small amount of labeled data (seed-data), the labels for the unlabeled data are estimated at each round of the process, and the (accepted) labels are then incorporated as training data. Co-training –Co-training is a semi-supervised learning technique that requires two views of the data. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. –“the description of each example can be partitioned into two distinct views” and for which both (a small amount of) labeled data and (much more) unlabeled data are available. –co-training is essentially the one-iteration, probabilistic version of bootstrapping Non-linguistic (e.g. clicks for IR relevance)
75 Why Probability? Statistical NLP aims to do statistical inference for the field of NLP Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.
76 Why Probability? Examples of statistical inference are WSD, the task of language modeling (e.g. how to predict the next word given the previous words), topic classification, etc. In order to do this, we need a model of the language. Probability theory helps us find such a model
77 Probability Theory How likely it is that something will happen Sample space Ω is the list of all possible outcomes of an experiment –Sample space can be continuous or discrete –For language applications it’s discrete (e.g. words) Event A is a subset of Ω Probability function (or distribution)
80 Prior Probability Prior probability: the probability before we consider any additional knowledge
81 Conditional probability Sometimes we have partial knowledge about the outcome of an experiment Conditional (or Posterior) Probability Suppose we know that event B is true The probability that A is true given the knowledge about B is expressed by P(A|B) = P(A, B) / P(B)
83 Conditional probability (cont) Note: P(A,B) = P(A ∩ B) Chain Rule P(A, B) = P(A|B) P(B) = The probability that A and B both happen is the probability that B happens times the probability that A happens, given B has occurred. P(A, B) = P(B|A) P(A) = The probability that A and B both happen is the probability that A happens times the probability that B happens, given A has occurred. Multi-dimensional table with a value in every cell giving the probability of that specific state occurring
84 Chain Rule P(A,B) = P(A|B)P(B) = P(B|A)P(A) P(A,B,C,D…) = P(A)P(B|A)P(C|A,B)P(D|A,B,C..)
85 Chain Rule → Bayes' rule P(A,B) = P(A|B)P(B) = P(B|A)P(A) Bayes' rule: P(A|B) = P(B|A)P(A) / P(B) Useful when one quantity is easier to calculate; a trivial consequence of the definitions we saw, but it’s extremely useful
86 Bayes' rule Bayes' rule translates causal knowledge into diagnostic knowledge. For example, if A is the event that a patient has a disease, and B is the event that she displays a symptom, then P(B | A) describes a causal relationship, and P(A | B) describes a diagnostic one (that is usually hard to assess). If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free.
87 Example S: stiff neck, M: meningitis P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20 I have a stiff neck, should I worry?
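Plugging the slide's numbers into Bayes' rule, P(M|S) = P(S|M) P(M) / P(S), answers the question. The helper name `bayes_posterior` is introduced here for illustration only.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_s_given_m = 0.5        # P(S|M): stiff neck given meningitis
p_m = 1 / 50_000         # P(M): prior probability of meningitis
p_s = 1 / 20             # P(S): probability of a stiff neck

p_m_given_s = bayes_posterior(p_s_given_m, p_m, p_s)
```

The posterior works out to 0.5 × (1/50,000) × 20 = 1/5,000. A stiff neck raises the probability of meningitis tenfold over the prior, but it stays very small.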
88 (Conditional) independence Two events A and B are independent of each other if P(A) = P(A|B) Two events A and B are conditionally independent of each other given C if P(A|C) = P(A|B,C)
89 Back to language Statistical NLP aims to do statistical inference for the field of NLP –Topic classification P(topic | document) –Language models P(word | previous word(s)) –WSD P(sense | word) Two main problems –Estimation: P is unknown: estimate P –Inference: We estimated P; now we want to find (infer) the topic of a document, or the sense of a word
90 Language Models (Estimation) In general, for language events, P is unknown We need to estimate P, (or model M of the language) We’ll do this by looking at evidence about what P must be based on a sample of data
91 Inference The central problem of computational Probability Theory is the inference problem: Given a set of random variables X_1, …, X_k and their joint density P(X_1, …, X_k), compute one or more conditional densities given observations. –Compute P(X_1 | X_2, …, X_k), P(X_3 | X_1), P(X_1, X_2 | X_3, X_4), etc. Many problems can be formulated in these terms.
92 Bayes decision rule w: ambiguous word S = {s_1, s_2, …, s_n}: senses for w C = {c_1, c_2, …, c_n}: contexts of w in a corpus V = {v_1, v_2, …, v_j}: words used as contextual features for disambiguation Bayes decision rule –Decide s_j if P(s_j | c) > P(s_k | c) for s_j ≠ s_k We want to assign w to the sense s’ where s’ = argmax_{s_k} P(s_k | c)
93 Graphical Models Within the Machine Learning framework Probability theory plus graph theory Widely used –NLP –Speech recognition –Error correcting codes –Systems diagnosis –Computer vision –Filtering (Kalman filters) –Bioinformatics
94 (Quick intro to) Graphical Models Nodes are random variables [Figure: directed graph over nodes A, B, C, D with edges A→B, A→C, D→C, annotated with P(A), P(D), P(B|A), P(C|A,D)] Edges are annotated with conditional probabilities Absence of an edge between nodes implies conditional independence “Probabilistic database”
95 Graphical Models Define a joint probability distribution: P(X_1, …, X_N) = ∏_i P(X_i | Par(X_i)) P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) Learning –Given data, estimate the parameters P(A), P(D), P(B|A), P(C|A,D)
96 Graphical Models Define a joint probability distribution: P(X_1, …, X_N) = ∏_i P(X_i | Par(X_i)) P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) Learning –Given data, estimate P(A), P(B|A), P(D), P(C|A,D) Inference: compute conditional probabilities, e.g., P(A|B,D) or P(C|D) Inference = Probabilistic queries General inference algorithms (e.g. Junction Tree)
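The factorization and a probabilistic query can be checked numerically with a small sketch. All probability values below are made up for illustration, and inference is done by brute-force enumeration rather than by a general algorithm such as Junction Tree.

```python
from itertools import product

# Hypothetical parameters for binary variables A, B, C, D.
p_a = {True: 0.3, False: 0.7}
p_d = {True: 0.6, False: 0.4}
p_b_given_a = {True: {True: 0.9, False: 0.1},      # p_b_given_a[a][b]
               False: {True: 0.2, False: 0.8}}
p_c_given_ad = {(True, True): {True: 0.5, False: 0.5},   # p_c_given_ad[(a, d)][c]
                (True, False): {True: 0.7, False: 0.3},
                (False, True): {True: 0.1, False: 0.9},
                (False, False): {True: 0.4, False: 0.6}}

def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return p_a[a] * p_d[d] * p_b_given_a[a][b] * p_c_given_ad[(a, d)][c]

values = [True, False]
total = sum(joint(a, b, c, d) for a, b, c, d in product(values, repeat=4))

# Inference by enumeration: P(A=True | B=True, D=True), summing out C.
numerator = sum(joint(True, True, c, True) for c in values)
denominator = sum(joint(a, True, c, True) for a in values for c in values)
p_a_given_bd = numerator / denominator
```

The joint sums to 1 over all sixteen assignments, and the query reduces to 0.27 / 0.41 because C sums out and D is independent of A and B in this graph.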
97 Naïve Bayes models Simple graphical model: the x_i depend on Y Naïve Bayes assumption: all x_i are independent given Y Currently used for text classification and spam detection [Figure: node Y with edges to x_1, x_2, x_3]
98 Naïve Bayes models Naïve Bayes for document classification [Figure: node topic with edges to w_1, w_2, …, w_n] Inference task: P(topic | w_1, w_2, …, w_n)
99 Naïve Bayes for WSD [Figure: node s_k with edges to v_1, v_2, v_3] Recall the general joint probability distribution: P(X_1, …, X_N) = ∏_i P(X_i | Par(X_i)) P(s_k, v_1, …, v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k)
100 Naïve Bayes for WSD P(s_k, v_1, …, v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k) Estimation (Training): Given data, estimate: P(s_k), P(v_1 | s_k), P(v_2 | s_k), P(v_3 | s_k)
101 Naïve Bayes for WSD P(s_k, v_1, …, v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k) Estimation (Training): Given data, estimate: P(s_k), P(v_1 | s_k), P(v_2 | s_k), P(v_3 | s_k) Inference (Testing): Compute conditional probabilities of interest: P(s_k | v_1, v_2, v_3)
102 Graphical Models Given Graphical model –Do estimation (find parameters from data) –Do inference (compute conditional probabilities) How do I choose the model structure (i.e. the edges)?
103 How to choose the model structure? [Figure: four alternative graph structures over $s_k$, $v_1$, $v_2$, $v_3$]
104 Model structure Learn it: structure learning –Difficult & needs a lot of data Knowledge of the domain and of the relationships between the variables –Heuristics –The fewer dependencies (edges) we can have, the “better” Sparsity: more edges, need more data –Direction of arrows [Figure: graph over $s_k$, $v_1$, $v_2$, $v_3$; example parameter: $P(v_3 \mid s_k, v_1, v_2)$]
105 Generative vs. discriminative
Generative [Figure: $s_k$ → $v_1$, $v_2$, $v_3$] $P(s_k, v_1, v_2, v_3) = P(s_k) \prod_i P(v_i \mid \mathrm{Par}(v_i)) = P(s_k)\,P(v_1 \mid s_k)\,P(v_2 \mid s_k)\,P(v_3 \mid s_k)$ Estimation (Training): Given data, estimate: $P(s_k)$, $P(v_1 \mid s_k)$, $P(v_2 \mid s_k)$ and $P(v_3 \mid s_k)$ Inference (Testing): Compute: $P(s_k \mid v_1, v_2, v_3)$ (there are algorithms to find these conditional probabilities, not covered here) –Do inference to find the probability of interest
Discriminative [Figure: $v_1$, $v_2$, $v_3$ → $s_k$] $P(s_k, v_1, v_2, v_3) = P(v_1)\,P(v_2)\,P(v_3)\,P(s_k \mid v_1, v_2, v_3)$ The conditional probability of interest is “ready”: $P(s_k \mid v_1, v_2, v_3)$, i.e. modeled directly Estimation (Training): Given data, estimate: $P(v_1)$, $P(v_2)$, $P(v_3)$ and $P(s_k \mid v_1, v_2, v_3)$ –The probability of interest is modeled directly
106 Naïve Bayes for topic classification [Figure: T → $w_1$, $w_2$, …, $w_n$] Recall the general joint probability distribution: $P(X_1, \dots, X_N) = \prod_i P(X_i \mid \mathrm{Par}(X_i))$ $P(T, w_1, \dots, w_n) = P(T)\,P(w_1 \mid T)\,P(w_2 \mid T) \cdots P(w_n \mid T) = P(T) \prod_i P(w_i \mid T)$ Inference (Testing): Compute conditional probabilities: $P(T \mid w_1, w_2, \dots, w_n)$ Estimation (Training): Given data, estimate: $P(T)$, $P(w_i \mid T)$
107 Exercise
Topic = sport (num words = 15) D1: 2009 open season D2: against Maryland Sept D3: play six games D4: schedule games weekends D5: games games games
Topic = politics (num words = 19) D1: Obama hoping rally support D2: billion stimulus package D3: House Republicans tax D4: cuts spending GOP games D5: Republicans obama open D6: political season
Estimate P(w_i | T_j) for each w_i, T_j:
P(obama | T = politics) = P(w = obama, T = politics) / P(T = politics) = (c(w = obama, T = politics) / 34) / (19/34) = 2/19
P(obama | T = sport) = P(w = obama, T = sport) / P(T = sport) = (c(w = obama, T = sport) / 34) / (15/34) = 0
P(season | T = politics) = P(w = season, T = politics) / P(T = politics) = (c(w = season, T = politics) / 34) / (19/34) = 1/19
P(season | T = sport) = P(w = season, T = sport) / P(T = sport) = (c(w = season, T = sport) / 34) / (15/34) = 1/15
P(republicans | T = politics) = P(w = republicans, T = politics) / P(T = politics) = c(w = republicans, T = politics) / 19 = 2/19
P(republicans | T = sport) = P(w = republicans, T = sport) / P(T = sport) = c(w = republicans, T = sport) / 15 = 0/15 = 0
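The estimates in this exercise are relative frequencies, so they can be checked mechanically. The sketch below is one way to do it in plain Python; it assumes case-insensitive matching and whitespace tokenization, follows the slide in taking $P(T)$ as the topic's share of the 34 word tokens, and uses function names of my own choosing.

```python
from collections import Counter
from fractions import Fraction

# The training documents from the exercise, grouped by topic.
corpus = {
    "sport": ["2009 open season", "against Maryland Sept", "play six games",
              "schedule games weekends", "games games games"],
    "politics": ["Obama hoping rally support", "billion stimulus package",
                 "House Republicans tax", "cuts spending GOP games",
                 "Republicans obama open", "political season"],
}

# c(w, T): how often word w occurs in documents of topic T.
counts = {topic: Counter(word.lower() for doc in docs for word in doc.split())
          for topic, docs in corpus.items()}
totals = {topic: sum(c.values()) for topic, c in counts.items()}
grand_total = sum(totals.values())

def prior(topic):
    """P(T): share of all word tokens that belong to the topic."""
    return Fraction(totals[topic], grand_total)

def likelihood(word, topic):
    """Maximum likelihood estimate P(w | T) = c(w, T) / (number of words in T)."""
    return Fraction(counts[topic][word.lower()], totals[topic])

for word in ("obama", "season", "republicans"):
    for topic in ("politics", "sport"):
        print(f"P({word} | {topic}) = {likelihood(word, topic)}")
```

Exact fractions are used instead of floats so the printed values can be compared directly with the hand-computed ones.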
108 Exercise: inference What is the topic of new documents: –Republicans obama season –games season open –democrats kennedy house
109 Exercise: inference Recall: Bayes decision rule Decide $T_j$ if $P(T_j \mid c) > P(T_k \mid c)$ for $T_j \neq T_k$ c is the context, here the words of the document We want to assign the topic $T'$ where $T' = \arg\max_{T_j} P(T_j \mid c)$
110 Exercise: Bayes classification We compute $P(T_j \mid c)$ with Bayes rule: $P(T_j \mid c) = \frac{P(c \mid T_j)\,P(T_j)}{P(c)}$ Because of the dependencies encoded in this GM, $P(c \mid T_j) = \prod_i P(w_i \mid T_j)$, so $P(T_j \mid c) \propto P(T_j) \prod_i P(w_i \mid T_j)$
111 Exercise: Bayes classification That is, for each $T_j$ we calculate $P(T_j) \prod_i P(w_i \mid T_j)$ and see which one is higher New sentence: republicans obama season T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 × 2/19 × 2/19 × 1/19 > 0 T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 × 0 × 0 × 1/15 = 0 Choose T = politics
112 Exercise: Bayes classification That is, for each $T_j$ we calculate $P(T_j) \prod_i P(w_i \mid T_j)$ and see which one is higher New sentence: democrats kennedy house T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 × 0 × 0 × 1/19 = 0 democrats, kennedy: unseen words → data sparsity How can we address this?
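The three inference questions of the exercise can be worked through with a short script. This is a sketch under the same assumptions as the hand calculation (unsmoothed relative frequencies, $P(T)$ taken as the topic's share of the 34 word tokens); the function names are my own, and returning `None` when every topic scores zero is a choice made here to expose the sparsity problem, not something the slides prescribe.

```python
from collections import Counter
from fractions import Fraction

# The training documents from the exercise, grouped by topic.
corpus = {
    "sport": ["2009 open season", "against Maryland Sept", "play six games",
              "schedule games weekends", "games games games"],
    "politics": ["Obama hoping rally support", "billion stimulus package",
                 "House Republicans tax", "cuts spending GOP games",
                 "Republicans obama open", "political season"],
}

counts = {topic: Counter(word.lower() for doc in docs for word in doc.split())
          for topic, docs in corpus.items()}
totals = {topic: sum(c.values()) for topic, c in counts.items()}
grand_total = sum(totals.values())

def score(document, topic):
    """Unnormalized posterior: P(T) times the product of P(w_i | T)."""
    result = Fraction(totals[topic], grand_total)
    for word in document.lower().split():
        result *= Fraction(counts[topic][word], totals[topic])
    return result

def classify(document):
    """Topic with the highest score, or None if every topic scores zero."""
    scores = {topic: score(document, topic) for topic in corpus}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

for document in ("Republicans obama season", "games season open",
                 "democrats kennedy house"):
    print(document, "->", classify(document))
```

The third document gets a zero score for both topics because "democrats" and "kennedy" never occur in the training data, which is exactly the situation smoothing is meant to repair.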
113 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Lexical semantics –Word sense disambiguation (WSD) –Lexical acquisition Corpus-based statistical approaches to tackle NLP problems Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing
114 Language Models Model to assign scores to sentences Probabilities should broadly indicate likelihood of sentences –P( I saw a van) >> P( eyes awe of an) Not grammaticality –P(artichokes intimidate zippers) ≈ 0 In principle, “likely” depends on the domain, context, speaker… Adapted from Dan Klein’s CS 288 slides
115 Language models Related: the task of predicting the next word Can be useful for –Spelling correction, e.g. “I need to notified the bank” –Machine translation –Speech recognition –OCR (optical character recognition) –Handwriting recognition –Augmentative communication: computer systems to help the disabled communicate, for example systems that let users choose words with hand movements
116 Language Models Model to assign scores to sentences –Sentence: $w_1, w_2, \dots, w_n$ –Break sentence probability down with chain rule (no loss of generality): $P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$ –Too many histories!
117 Markov assumption: n-gram solution Markov assumption: only the prior local context (the last “few” words) affects the next word N-gram models: assume each word depends only on a short linear history –Use N-1 words to predict the next one [Figure: history of $w_i$ reaching back to $w_1$ versus reaching back only to $w_{i-3}$]
118 Markov assumption: n-gram solution Unigrams (n = 1): $P(w_1, \dots, w_n) = \prod_i P(w_i)$ Bigrams (n = 2): $P(w_1, \dots, w_n) = \prod_i P(w_i \mid w_{i-1})$ Trigrams (n = 3): $P(w_1, \dots, w_n) = \prod_i P(w_i \mid w_{i-2}, w_{i-1})$
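As an illustration of the bigram case, the sketch below estimates $P(w_i \mid w_{i-1})$ by relative frequency and multiplies the factors to score a sentence. The three training sentences are a hypothetical toy corpus, and the `<s>` / `</s>` boundary markers are a common convention assumed here rather than something taken from the slides.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy training corpus.
sentences = ["i saw a van", "i saw a dog", "a dog saw a van"]

history_counts = Counter()   # c(w_{i-1})
bigram_counts = Counter()    # c(w_{i-1}, w_i)
for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Relative-frequency estimate P(word | prev) = c(prev, word) / c(prev)."""
    if history_counts[prev] == 0:
        return Fraction(0)
    return Fraction(bigram_counts[(prev, word)], history_counts[prev])

def sentence_probability(sentence):
    """Product of bigram probabilities over the sentence, boundaries included."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        probability *= p_bigram(word, prev)
    return probability

print(sentence_probability("i saw a van"))
print(sentence_probability("a van saw i"))
```

The second sentence contains the bigram "van saw", which never occurs in the toy corpus, so a single zero factor makes the whole product zero.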
119 Choice of n In principle we would like the n of the n-gram to be large –green –large green –the large green –swallowed the large green –swallowed should influence the choice of the next word (mountain is unlikely, pea more likely) –The crocodile swallowed the large green.. –Mary swallowed the large green.. –And so on…
120 Discrimination vs. reliability Looking at longer histories (large n) should allow us to make better predictions (better discrimination) But it’s much harder to get reliable statistics since the number of parameters to estimate becomes too large –The larger n, the larger the number of parameters to estimate, and the more data needed for statistically reliable estimates
121 Language Models N: size of vocabulary Unigrams –For each $w_i$ calculate $P(w_i)$: N such numbers, i.e. N parameters Bi-grams –For each $w_i, w_j$ calculate $P(w_i \mid w_j)$: N×N parameters Tri-grams –For each $w_i, w_j, w_k$ calculate $P(w_i \mid w_j, w_k)$: N×N×N parameters
122 N-grams and parameters Assume we have a vocabulary of 20,000 words Growth in number of parameters for n-gram models:
Model            Parameters
Bigram model     20,000^2 = 400 million
Trigram model    20,000^3 = 8 trillion
Four-gram model  20,000^4 = 1.6 × 10^17
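The growth can be reproduced with one line of arithmetic: an n-gram model over a vocabulary of N words has N^n conditional probabilities to estimate. A minimal sketch (the function name is my own):

```python
def ngram_parameter_count(vocab_size, n):
    """Number of probabilities P(w_i | previous n-1 words): vocab_size ** n."""
    return vocab_size ** n

for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram"), (4, "four-gram")]:
    print(f"{name:>9}: {ngram_parameter_count(20_000, n):.1e}")
```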
123 Sparsity Zipf’s law: most words are rare –This makes frequency-based approaches to language hard New words appear all the time, new bigrams more often, trigrams or more, still worse! These relative frequency estimates are the MLE (maximum likelihood estimates): choice of parameters that give the highest probability to the training corpus
124 Sparsity The larger the number of parameters, the more likely it is to get 0 probabilities Note also the product: if we have one 0 for an unseen event, the 0 propagates and gives us 0 probability for the whole sentence
125 Tackling data sparsity Discounting or smoothing methods –Change the probabilities to avoid zeros –Remember that probability distributions have to sum to 1 –Decrease the non-zero probabilities (seen events) and give the freed probability mass to the zero probabilities (unseen events)
126 Smoothing From Dan Klein’s CS 288 slides
127 Smoothing Put probability mass on “unseen events” Add one /delta (uniform prior) Add one /delta (unigram prior) Linear interpolation ….
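As one concrete instance of the first item, the sketch below applies add-delta smoothing with a uniform prior to the topic-classification exercise (sport vs. politics). Several details are assumptions made for illustration: δ = 1 by default, the vocabulary is the set of training word types plus a single hypothetical `<unk>` type that stands for every unseen word, and $P(T)$ is the topic's share of the word tokens as in the exercise.

```python
from collections import Counter
from fractions import Fraction

# The training documents from the exercise, grouped by topic.
corpus = {
    "sport": ["2009 open season", "against Maryland Sept", "play six games",
              "schedule games weekends", "games games games"],
    "politics": ["Obama hoping rally support", "billion stimulus package",
                 "House Republicans tax", "cuts spending GOP games",
                 "Republicans obama open", "political season"],
}
UNKNOWN = "<unk>"  # hypothetical placeholder type for all unseen words

counts = {topic: Counter(word.lower() for doc in docs for word in doc.split())
          for topic, docs in corpus.items()}
totals = {topic: sum(c.values()) for topic, c in counts.items()}
vocabulary = {word for c in counts.values() for word in c} | {UNKNOWN}

def smoothed_likelihood(word, topic, delta=Fraction(1)):
    """Add-delta estimate: (c(w, T) + delta) / (words in T + delta * |V|)."""
    count = counts[topic][word.lower()]  # 0 for unseen words
    return (count + delta) / (totals[topic] + delta * len(vocabulary))

def score(document, topic):
    """Unnormalized posterior P(T) * prod_i P(w_i | T) with smoothed factors."""
    result = Fraction(totals[topic], sum(totals.values()))
    for word in document.split():
        result *= smoothed_likelihood(word, topic)
    return result

def classify(document):
    return max(corpus, key=lambda topic: score(document, topic))

print(classify("democrats kennedy house"))
```

With smoothing, no factor is zero any more, so the single seen word "house" is enough to separate the two topics for a document that the unsmoothed model could not score.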
128 Smoothing: Combining estimators Make linear combination of multiple probability estimates –(Providing that we weight the contribution of each of them so that the result is another probability function) Linear interpolation or mixture models
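A small sketch of linear interpolation for the bigram case: the interpolated estimate is $\lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1})$ with $\lambda_1 + \lambda_2 = 1$, so the result is again a probability distribution. The toy corpus and the weights (1/4 and 3/4) are hypothetical; in practice the weights would be tuned on held-out data.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy training corpus.
sentences = ["i saw a van", "i saw a dog", "a dog saw a van"]

history_counts = Counter()   # c(w_{i-1})
bigram_counts = Counter()    # c(w_{i-1}, w_i)
word_counts = Counter()      # c(w_i), counted over predicted positions
for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))
    word_counts.update(tokens[1:])
total_words = sum(word_counts.values())

def p_unigram(word):
    return Fraction(word_counts[word], total_words)

def p_bigram(word, prev):
    if history_counts[prev] == 0:
        return Fraction(0)
    return Fraction(bigram_counts[(prev, word)], history_counts[prev])

# Hypothetical interpolation weights; they must sum to 1.
LAMBDA_UNIGRAM = Fraction(1, 4)
LAMBDA_BIGRAM = Fraction(3, 4)

def p_interpolated(word, prev):
    """Weighted mixture of the unigram and bigram estimates."""
    return LAMBDA_UNIGRAM * p_unigram(word) + LAMBDA_BIGRAM * p_bigram(word, prev)

print(p_bigram("saw", "van"), p_interpolated("saw", "van"))
```

The bigram "van saw" is unseen, so its bigram estimate is zero, but the interpolated estimate stays positive because the unigram term contributes.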
129 Smoothing: Combining estimators Back-off models –Special case of linear interpolation
130 Smoothing: Combining estimators Back-off models: trigram version
131 Today’s review Grammar –Morphology –Part-of-speech (POS) –Phrase level syntax Lower level text processing Semantics –Lexical semantics –Word sense disambiguation (WSD) –Lexical acquisition Corpus-based statistical approaches to tackle NLP problems Corpora Intro to probability theory and graphical models (GM) –Example for WSD –Language Models (LM) and smoothing What I hope we achieved Overall idea of linguistic problems Overall understanding of “lower level” NLP tasks –POS, WSD, language models, segmentation, etc –NOTE: Will be used for preprocessing and as features for higher level tasks Initial understanding of Stat NLP –Corpora & annotation –probability theory, GM –Sparsity problem Familiarity with Python and NLTK
132 Next classes How do we now tackle ‘higher level’ NLP problems? NLP applications Text Categorization –Classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative) Spelling & Grammar Corrections Information Extraction Speech Recognition Information Retrieval –Synonym Generation Summarization Machine Translation Question Answering Dialog Systems –Language generation
133 The NLP Pipeline For a given problem to be tackled
1. Choose corpus (or build your own) –Low level processing done to the text before the ‘real work’ begins; important but often neglected –Low-level formatting issues: junk formatting/content (HTML tags, tables); case change (e.g. everything to lower case); tokenization, sentence segmentation
2. Choose annotation to use (or choose the label set and label it yourself) –Check labeling (inconsistencies etc.)
3. Extract features
4. Choose or implement new NLP algorithms
5. Evaluate
6. (Possibly) iterate
134 Next classes Classification –Important: many NLP applications can be framed as classification –Text Categorization (topics, language, author, spam filtering, sentiment classification (positive, negative)) –Information Extraction –Information Retrieval Feature extraction Projects NLP applications
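The low-level processing in step 1 can be sketched with the standard library alone. The regular expressions below are deliberately naive assumptions (real HTML and real sentence boundaries need more care, and NLTK provides more robust tokenizers); the function names and the sample input are made up for illustration.

```python
import re

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def split_sentences(text):
    """Naive segmentation: split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def preprocess(raw):
    """Strip tags, normalize whitespace and case, then segment and tokenize."""
    text = strip_html(raw)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return [tokenize(sentence) for sentence in split_sentences(text)]

raw = "<p>I saw a van. It was <b>green</b>!</p>"
for tokens in preprocess(raw):
    print(tokens)
```

Lower-casing everything is only one possible choice; it loses information that some tasks need, for example capitalization as a cue for proper names.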