Slide 1: CPSC 503 Computational Linguistics, Lecture 6
Giuseppe Carenini. CPSC 503, Winter 2008.
Slide 2: Knowledge-Formalisms Map
State Machines (and probabilistic versions): Finite State Automata, Finite State Transducers, Markov Models
Rule systems (and probabilistic versions): e.g., (Probabilistic) Context-Free Grammars
Logical formalisms (First-Order Logics); AI planners
Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Markov Models: Markov Chains -> n-grams; Hidden Markov Models (HMM); Maximum Entropy Markov Models (MEMM)
Slide 3: Today (24/9)
- N-gram evaluation
- Markov Chains
- Hidden Markov Models: definition; the three key problems (only one in detail)
- Part-of-speech tagging: what it is, why we need it, how to do it
Slide 4: Model Evaluation: Goal
On a given corpus, you may want to compare:
- 2-grams with 3-grams
- two different smoothing techniques (given the same n-grams)
Slide 5: Model Evaluation: Key Ideas
A: split the corpus into a training set and a testing set.
B: train the models Q1 and Q2 on the training set (counting frequencies, smoothing).
C: apply the models to the testing set and compare the results.
Slide 6: Entropy
Def. 1: a measure of uncertainty.
Def. 2: a measure of the information that we need to resolve an uncertain situation.
Let p(x) = P(X = x), where x ∈ X. Then H(p) = H(X) = -Σ_{x∈X} p(x) log2 p(x).
It is normally measured in bits.
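The definition above can be sketched directly in a few lines of Python (a minimal illustration; the coin distributions are my own toy data, not from the slides):

```python
import math

def entropy(p):
    """H(p) = -sum over x of p(x) * log2 p(x), in bits.
    Outcomes with p(x) = 0 contribute 0 by convention."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin is maximally uncertain: exactly 1 bit.
fair = {"H": 0.5, "T": 0.5}
# A biased coin is less uncertain, so its entropy is below 1 bit.
biased = {"H": 0.9, "T": 0.1}
```

A certain outcome (probability 1) has entropy 0: there is nothing left to resolve.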
Slide 7: Model Evaluation
How different is our approximation q from the actual distribution p?
Relative entropy (KL divergence): D(p||q) = Σ_{x∈X} p(x) log2 (p(x)/q(x)).
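The KL divergence formula can be sketched the same way (toy distributions are mine; q must give nonzero probability wherever p does):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum over x of p(x) * log2(p(x)/q(x)).
    Terms with p(x) = 0 contribute 0; requires q(x) > 0 where p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
```

D(p||p) = 0, and D(p||q) > 0 whenever q differs from p, which is why it measures how bad an approximation is. Note it is not symmetric: D(p||q) ≠ D(q||p) in general.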
Slide 8: Entropy of a Language (Entropy Rate)
Assumptions: the language is ergodic and stationary.
By the Shannon-McMillan-Breiman theorem, the entropy rate can then be computed by taking the average log probability of a very long sample.
Does natural language satisfy these assumptions?
Slide 9: Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P): H(P, Q) = -Σ_{x∈X} P(x) log2 Q(x).
Between two models Q1 and Q2, the more accurate one is the one with the lower cross-entropy.
Applied to language: estimate the per-word cross-entropy of a model on a long sample of text.
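A minimal sketch of the cross-entropy comparison between two models of the same distribution (the coin distributions are my own toy data):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum over x of p(x) * log2 q(x).
    Always >= H(p), with equality exactly when q = p."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"H": 0.5, "T": 0.5}       # the actual distribution
q_good = {"H": 0.5, "T": 0.5}  # a perfect model of p
q_bad = {"H": 0.9, "T": 0.1}   # a poor model of p
```

The perfect model achieves H(p, q_good) = H(p) = 1 bit; the poor model pays extra bits for its wrong beliefs, so its cross-entropy is higher.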
Slide 10: Model Evaluation: In Practice
A: split the corpus into a training set and a testing set.
B: train the models Q1 and Q2 on the training set (counting frequencies, smoothing).
C: apply the models to the testing set and compare their cross-entropies/perplexities.
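In practice the per-word cross-entropy is estimated as the average negative log probability the model assigns to the held-out tokens, and perplexity is 2 raised to that number. A sketch with a unigram model (my own simplification; it assumes every test token is in the model's vocabulary):

```python
import math

def avg_neg_log_prob(model, test_tokens):
    """Per-word cross-entropy estimate: average -log2 q(w) over the test set.
    `model` maps each word to its probability (a unigram model, for simplicity)."""
    return sum(-math.log2(model[w]) for w in test_tokens) / len(test_tokens)

def perplexity(model, test_tokens):
    """Perplexity = 2 ** cross-entropy; lower is better."""
    return 2 ** avg_neg_log_prob(model, test_tokens)
```

A uniform model over N equally likely words has perplexity exactly N, which is why perplexity is read as the model's effective branching factor.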
Slide 11: k-Fold Cross Validation and the t-Test
- Randomly divide the corpus into k subsets of equal size.
- Use each subset for testing (and all the others for training).
- In practice you do k times what we saw on the previous slide.
- Now for each model you have k perplexities.
- Compare the models' average perplexities with a t-test.
Slide 12: Knowledge-Formalisms Map (including probabilistic formalisms)
State Machines (and probabilistic versions): Finite State Automata, Finite State Transducers, Markov Models
Rule systems (and probabilistic versions): e.g., (Probabilistic) Context-Free Grammars
Logical formalisms (First-Order Logics); AI planners
Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Slide 13: Today (24/9)
- N-gram evaluation
- Markov Chains
- Hidden Markov Models: definition; the three key problems (only one in detail)
- Part-of-speech tagging: what it is, why we need it, how to do it
Slide 14: Example of a Markov Chain
[Diagram: a Markov chain over the states t, e, h, a, p, i, with two start arrows (probabilities .6 and .4) and transition probabilities .4, 1, .3, .4, .6, 1, .4 on the arcs.]
Slide 15: Markov Chain, Formal Description (Manning/Schütze, 2000: 318)
Probabilities of the initial states: t = .6, i = .4.
Stochastic transition matrix A over the states t, i, p, a, h, e. [The matrix entries are shown in the slide's table; each row sums to 1.]
Slide 16: Markov Assumptions
Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sn}, the state space. The Markov properties are:
(a) Limited Horizon: for all t, P(Xt+1 | X1, ..., Xt) = P(Xt+1 | Xt).
(b) Time Invariant: for all t, P(Xt+1 | Xt) = P(X2 | X1), i.e., the dependency does not change over time.
Slide 17: Markov Chain
Probability of a sequence of states X1 ... XT (Manning/Schütze, 2000: 320): P(X1, ..., XT) = π(X1) Π_{t=1}^{T-1} a(Xt, Xt+1).
Example. Similar to .......?
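The product formula can be sketched in a few lines (the chain and its numbers are my own toy example, not the slide's chain):

```python
# Toy chain: start in t (prob .6) or i (prob .4), with deterministic hops t->i->p.
pi = {"t": 0.6, "i": 0.4}
a = {"t": {"i": 1.0}, "i": {"p": 1.0}, "p": {}}

def sequence_prob(states, pi, a):
    """P(X1, ..., XT) = pi(X1) * product over t of a(Xt, Xt+1).
    Missing transitions count as probability 0."""
    prob = pi.get(states[0], 0.0)
    for prev, cur in zip(states, states[1:]):
        prob *= a.get(prev, {}).get(cur, 0.0)
    return prob
```

Each factor conditions only on the previous state, which is exactly the limited-horizon assumption; an n-gram language model is this same product over words.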
Slide 18: Today (24/9)
- N-gram evaluation
- Markov Chains
- Hidden Markov Models: definition; the three key problems (only one in detail)
- Part-of-speech tagging: what it is, why we need it, how to do it
Slide 19: HMMs (and MEMMs): Intro
They are probabilistic sequence classifiers / sequence labelers: they assign a class/label to each unit in a sequence. We have already seen a non-probabilistic version...
Used extensively in NLP: part-of-speech tagging, partial parsing, named entity recognition, information extraction.
Slide 20: Hidden Markov Model (State Emission)
[Diagram: an HMM with states s1-s4 emitting the symbols a, b, i; start arrows with probabilities .6 and .4; transition probabilities .7, .3, .4, .6, 1, .4; emission probabilities .5, .1, .9, 1, .1, .4, .5.]
Slide 21: Hidden Markov Model
Formal specification as a five-tuple:
- a set of states
- an output alphabet
- initial state probabilities
- state transition probabilities
- symbol emission probabilities
Slide 22: Three Fundamental Questions for HMMs
- Finding the probability of an observation sequence: brute force, or the Forward/Backward algorithm (Manning/Schütze, 2000: 325).
- Decoding: finding the best state sequence (the Viterbi algorithm).
- Training: finding the model parameters which best explain the observations.
Slide 23: Computing the Probability of an Observation Sequence
For O = o1 ... oT, sum over X, the set of all sequences of T states: P(O) = Σ_X P(O | X) P(X).
E.g., P(b, i | sample HMM).
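The brute-force sum over all state sequences can be sketched directly (the two-state weather-style HMM below is my own toy model, not the slide's sample HMM):

```python
from itertools import product

# Toy HMM: hidden states r/s, observable symbols u/n (illustrative numbers).
states = ["r", "s"]
pi = {"r": 0.5, "s": 0.5}
a = {"r": {"r": 0.7, "s": 0.3}, "s": {"r": 0.4, "s": 0.6}}
b = {"r": {"u": 0.9, "n": 0.1}, "s": {"u": 0.2, "n": 0.8}}

def observation_prob_brute_force(obs, states, pi, a, b):
    """P(O) = sum over every state sequence X of P(X) * P(O | X).
    Enumerates all N**T sequences, so this is exponential in T."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = pi[seq[0]] * b[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[seq[t - 1]][seq[t]] * b[seq[t]][obs[t]]
        total += p
    return total
```

Summed over every possible observation sequence of a fixed length, these probabilities add up to 1, which is a quick sanity check on the model.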
Slide 24: Decoding Example (Manning/Schütze, 2000: 327)
Sample trellis terms: s1, s1 = 0?; s1, s4 = 1 × .5 × .6 × .7; s2, s4 = 0?; s1, s2 = 1 × .1 × .6 × .3; ...
What is the complexity?
Slide 25: The Forward Procedure
1. Initialization
2. Induction
3. Total
Complexity: O(N²T), versus the O(N^T) state sequences of brute force.
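The three steps above map directly onto code (same toy two-state HMM as before; the numbers are mine, not from the slides):

```python
# Toy HMM: hidden states r/s, observable symbols u/n (illustrative numbers).
states = ["r", "s"]
pi = {"r": 0.5, "s": 0.5}
a = {"r": {"r": 0.7, "s": 0.3}, "s": {"r": 0.4, "s": 0.6}}
b = {"r": {"u": 0.9, "n": 0.1}, "s": {"u": 0.2, "n": 0.8}}

def forward(obs, states, pi, a, b):
    """Forward procedure: alpha[t][s] = P(o1..ot, X_t = s).
    O(N^2 * T) work instead of enumerating N**T state sequences."""
    # 1. Initialization
    alpha = [{s: pi[s] * b[s][obs[0]] for s in states}]
    # 2. Induction: fold the sum over predecessors into each new cell
    for t in range(1, len(obs)):
        alpha.append({s: b[s][obs[t]] * sum(alpha[-1][r] * a[r][s] for r in states)
                      for s in states})
    # 3. Total: sum the final column of the trellis
    return sum(alpha[-1].values())
```

Because each trellis cell reuses the previous column's values, the exponential sum of the brute-force method collapses into a dynamic program.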
Slide 26: Three Fundamental Questions for HMMs
- Finding the probability of an observation sequence: brute force, or the Forward algorithm.
- Decoding: finding the best state sequence (the Viterbi algorithm).
- Training: finding the model parameters which best explain the observations.
If you are interested in the details of the last two questions, read Sections 6.4-6.5.
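The Viterbi algorithm mentioned above is the forward recursion with max in place of sum, plus backpointers to recover the argmax path. A sketch on the same toy two-state HMM (my own illustrative numbers):

```python
# Toy HMM: hidden states r/s, observable symbols u/n (illustrative numbers).
states = ["r", "s"]
pi = {"r": 0.5, "s": 0.5}
a = {"r": {"r": 0.7, "s": 0.3}, "s": {"r": 0.4, "s": 0.6}}
b = {"r": {"u": 0.9, "n": 0.1}, "s": {"u": 0.2, "n": 0.8}}

def viterbi(obs, states, pi, a, b):
    """Most likely state sequence for the observations."""
    delta = [{s: pi[s] * b[s][obs[0]] for s in states}]
    back = []
    for t in range(1, len(obs)):
        step, ptrs = {}, {}
        for s in states:
            # best predecessor of s at time t, and the score through it
            best = max(states, key=lambda r: delta[-1][r] * a[r][s])
            ptrs[s] = best
            step[s] = delta[-1][best] * a[best][s] * b[s][obs[t]]
        delta.append(step)
        back.append(ptrs)
    # follow the backpointers from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```

Like the forward procedure it runs in O(N²T); the only differences are max instead of sum and the stored backpointers.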
Slide 27: Today (24/9)
- N-gram evaluation
- Markov Chains
- Hidden Markov Models: definition; the three key problems (only one in detail)
- Part-of-speech tagging: what it is, why we need it, how to do it
Slide 28: Parts of Speech Tagging
- What is it?
- Why do we need it?
- Word classes (tags): distribution, tagsets
- How to do it: rule-based, stochastic, transformation-based
Slide 29: Parts of Speech Tagging: What
Input: Brainpower, not physical plant, is now a firm's chief asset.
Output: Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.
Tag meanings: NNP (proper noun, singular), RB (adverb), JJ (adjective), NN (noun, singular or mass), VBZ (verb, 3sg present), DT (determiner), POS (possessive ending), . (sentence-final punctuation).
Slide 30: Parts of Speech Tagging: Why?
Part of speech (word class, morphological class, syntactic category) gives a significant amount of information about a word and its neighbors. It is useful in the following NLP tasks:
- As a basis for (partial) parsing
- Information retrieval
- Word-sense disambiguation
- Speech synthesis
- Improving language models (spelling/speech)
Slide 31: Parts of Speech
Eight basic categories: noun, verb, pronoun, preposition, adjective, adverb, article, conjunction.
These categories are based on:
- morphological properties (the affixes they take)
- distributional properties (what other words can occur nearby), e.g., green: "It is so...", "both...", "The ... is"
Not semantics!
Slide 32: Parts of Speech
Two kinds of category:
- Closed class (generally function words): prepositions, articles, conjunctions, pronouns, determiners, auxiliaries, numerals. Very short, frequent and important.
- Open class: nouns (proper/common; mass/count), verbs, adjectives, adverbs. Objects, actions, events, properties.
If you run across an unknown word....??
Slide 33: PoS Distribution
Parts of speech follow a typical distribution in language:
- Words with 1 PoS: ~35k
- Words with 2 PoS: ~4k (unfortunately these are very frequent)
- Words with >2 PoS: ...
...but luckily the different tags associated with a word are not equally likely.
Slide 34: Sets of Parts of Speech: Tagsets
Most commonly used: the 45-tag Penn Treebank, the 61-tag C5, the 146-tag C7.
The choice of tagset is based on the application (do you care about distinguishing between "to" as a preposition and "to" as an infinitive marker?).
Accurate tagging can be done even with large tagsets.
Slide 35: PoS Tagging
Dictionary: word_i -> set of tags from the tagset.
Input text: Brainpower, not physical plant, is now a firm's chief asset. ...
Tagger
Output: Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. ...
Slide 36: Tagger Types
- Rule-based (~'95)
- Stochastic:
  - HMM tagger (~>= '92)
  - Transformation-based tagger (Brill) (~>= '95)
  - Maximum Entropy Models (~>= '97)
Slide 37: Rule-Based (ENGTWOL, '95)
1. A lexicon transducer returns, for each word, all possible morphological parses.
2. A set of ~1,000 constraints is applied to rule out inappropriate PoS tags.
Step 1, sample I/O, for "Pavlov had shown that salivation....":
Pavlov: N SG PROPER
had: HAVE V PAST SVO; HAVE PCP2 SVO
shown: SHOW PCP2 SVOO ...
that: ADV; PRON DEM SG; CS
...
Sample constraint, the adverbial "that" rule. Given input "that":
if (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
then eliminate non-ADV tags
else eliminate the ADV tag
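The adverbial-"that" constraint can be sketched as a function over ambiguity sets. This is a hypothetical toy rendering, not ENGTWOL's actual implementation: the function name, the tag-set representation, and the sentence-limit check are all my own assumptions.

```python
def adverbial_that_constraint(tokens, tags, i):
    """Hypothetical sketch of the slide's rule: if 'that' is followed by an
    adjective/adverb/quantifier and then the end of the sentence, and the
    previous word is not a verb of the SVOC/A class, keep only the ADV
    reading; otherwise eliminate the ADV reading."""
    if tokens[i].lower() != "that":
        return tags[i]
    nxt = tags[i + 1] if i + 1 < len(tags) else set()          # +1 position
    after = tokens[i + 2] if i + 2 < len(tokens) else None      # +2 position
    prev = tags[i - 1] if i > 0 else set()                      # -1 position
    if (nxt & {"A", "ADV", "QUANT"}) and after in {".", "!", "?"} \
            and "SVOC/A" not in prev:
        return tags[i] & {"ADV"}      # eliminate non-ADV tags
    return tags[i] - {"ADV"}          # eliminate the ADV tag
```

Applied to "it is that odd ." the rule keeps only ADV for "that"; in "I know that he left" it removes ADV and leaves the pronoun/complementizer readings for later constraints.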
Slide 38: HMM Stochastic Tagging
Tags correspond to HMM states; words correspond to the HMM alphabet symbols.
Tagging: given a sequence of words (observations), find the most likely sequence of tags (states). But this is.....!
We need the state transition and symbol emission probabilities:
1) estimated from a hand-tagged corpus, or
2) with no tagged corpus: parameter estimation (Baum-Welch).
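Option 1, estimation from a hand-tagged corpus, is plain relative-frequency counting. A minimal sketch (my own function and variable names; real taggers would add smoothing for unseen words and tag pairs):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood estimates from a hand-tagged corpus:
    transition probabilities P(tag_t | tag_{t-1}) and emission
    probabilities P(word | tag), with no smoothing."""
    trans, emit = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    a = {k: v / prev_count[k[0]] for k, v in trans.items()}
    b = {k: v / tag_count[k[0]] for k, v in emit.items()}
    return a, b
```

With these tables in hand, tagging a new sentence is exactly the best-state-sequence problem from the HMM slides, solved with Viterbi.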
Slide 39: Evaluating Taggers
- Accuracy: percent correct (most current taggers: 96-97%). Test on unseen data!
- Human ceiling: the agreement rate of humans on the classification (96-97%).
- Unigram baseline: assign each token the class it occurred in most frequently in the training set (race -> NN): 91%.
What is causing the errors? Build a confusion matrix...
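The unigram baseline is a few lines of counting. A sketch (names are mine; the fallback tag for unseen words is an arbitrary choice):

```python
from collections import Counter, defaultdict

def train_unigram_baseline(tagged_tokens):
    """Map each word to the tag it occurred with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_tokens, default="NN"):
    """Percent correct on a tagged test set; unseen words get `default`."""
    hits = sum(model.get(w, default) == t for w, t in tagged_tokens)
    return hits / len(tagged_tokens)
```

Any serious tagger has to beat this baseline, and a confusion matrix over its errors (gold tag vs. predicted tag) shows which distinctions, such as NN vs. VB for "race", are responsible.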
Slide 40: Knowledge-Formalisms Map (next three lectures)
State Machines (and probabilistic versions): Finite State Automata, Finite State Transducers, Markov Models
Rule systems (and probabilistic versions): e.g., (Probabilistic) Context-Free Grammars
Logical formalisms (First-Order Logics); AI planners
Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Slide 41: Next Time
Read Chapter 12 (Syntax and Context-Free Grammars).