Statistical Methods: Allen's Chapter 7, J&M's Chapters 8 and 12

Presentation transcript:

1 Statistical Methods Allen's Chapter 7, J&M's Chapters 8 and 12

2 Statistical Methods Large data sets (corpora) of natural language make it possible to use statistical methods that were not feasible before. The Brown Corpus includes about a million words annotated with POS tags; the Penn Treebank contains full syntactic annotations.

3 Basic Probability Theory A random variable ranges over a predefined set of values, e.g. TOSS = {h, t}. If E is a random variable with possible values {e1 … en}, then (1) P(ei) ≥ 0 for all i, (2) P(ei) ≤ 1 for all i, and (3) Σ i=1..n P(ei) = 1.

4 An Example R is a random variable with values {Win, Lose}. Harry the horse ran 100 races and won 20, so P(Win) = 0.2 and P(Lose) = 0.8. In 30 races it was raining, and Harry won 15 of those races; so in the rain P(Win) = 0.5. This is captured by conditional probability: P(A | B) = P(A & B) / P(B), and P(Win | Rain) = 0.15 / 0.3 = 0.5.

5 Bayes Rule P(A | B) = P(A) * P(B | A) / P(B), so P(Rain | Win) = P(Rain) * P(Win | Rain) / P(Win) = 0.3 * 0.5 / 0.2 = 0.75. The same value follows from the definition of conditional probability: P(Rain | Win) = P(Rain & Win) / P(Win) = 0.15 / 0.2 = 0.75.
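The arithmetic on these two slides can be checked directly. A minimal Python sketch using only the counts stated above (100 races, 20 wins, 30 rainy races, 15 wins in the rain):

# Verify conditional probability and Bayes' rule with the horse-racing counts.
races, wins, rainy, rainy_wins = 100, 20, 30, 15

p_win = wins / races                  # P(Win) = 0.2
p_rain = rainy / races                # P(Rain) = 0.3
p_win_and_rain = rainy_wins / races   # P(Win & Rain) = 0.15

p_win_given_rain = p_win_and_rain / p_rain                   # 0.5
p_rain_given_win_bayes = p_rain * p_win_given_rain / p_win   # Bayes' rule: 0.75
p_rain_given_win_direct = p_win_and_rain / p_win             # definition: 0.75

print(p_win_given_rain, p_rain_given_win_bayes, p_rain_given_win_direct)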

6 Independent Events A and B are independent if P(A | B) = P(A), or equivalently P(A & B) = P(A) * P(B). Assume L is a random variable with values {F, E}, P(F) = 0.6, and P(Win & F) = 0.12. Then P(Win | F) = 0.12 / 0.6 = 0.2 = P(Win), so Win and F are independent. But Win and Rain are not independent: P(Win & Rain) = 0.15 ≠ P(Win) * P(Rain) = 0.06.

7 Part of Speech Tagging Determining the most likely category of each word in a sentence that contains ambiguous words, e.g. finding the POS of words that can be either nouns or verbs. We need two random variables: C, which ranges over POS tags {N, V}, and W, which ranges over all possible words.

8 Part of Speech Tagging (Cont.) Example: W = flies. Problem: which is greater, P(C=N | W=flies) or P(C=V | W=flies), i.e. P(N | flies) or P(V | flies)? Since P(N | flies) = P(N & flies) / P(flies) and P(V | flies) = P(V & flies) / P(flies), it is enough to compare P(N & flies) with P(V & flies).

9 Part of Speech Tagging (Cont.) We don't have the true probabilities, but we can estimate them from large data sets. Suppose the corpus contains 1000 uses of flies: 400 with a noun sense and 600 with a verb sense. Then P(flies) = 1000 / corpus size, P(flies & N) = 400 / corpus size, P(flies & V) = 600 / corpus size, and P(V | flies) = P(V & flies) / P(flies) = 600 / 1000 = 0.6. So on 60% of occasions flies is a verb.

10 Estimating Probabilities We want to use probability to predict future events, e.g. using the information P(V | flies) = 0.6 to predict that the next occurrence of flies is more likely to be a verb. This is called Maximum Likelihood Estimation (MLE). Generally, the larger the data set we use, the more accuracy we get.

11 Estimating Probabilities (Cont.) Estimating the outcome probability of tossing a fair coin (i.e., 0.5), with an acceptable margin of error of 0.25 to 0.75: the more tests performed, the more accurate the estimation. 2 trials: 50% chance of reaching an acceptable result; 3 trials: 75% chance; 4 trials: 87.5% chance; 8 trials: 93% chance; 12 trials: 95% chance; …
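These percentages follow from the binomial distribution: after n fair tosses, the estimate heads/n falls inside the acceptable range 0.25 to 0.75 whenever the number of heads lies between n/4 and 3n/4. A small sketch that reproduces the figures above (with this inclusive definition of the range, the 12-trial case comes out near 96%):

from math import comb

def prob_acceptable(n, low=0.25, high=0.75):
    """Probability that the estimate heads/n falls within [low, high]
    after n tosses of a fair coin."""
    return sum(comb(n, k) for k in range(n + 1) if low <= k / n <= high) / 2 ** n

for n in (2, 3, 4, 8, 12):
    print(n, round(prob_acceptable(n), 3))
# 2 -> 0.5, 3 -> 0.75, 4 -> 0.875, 8 -> 0.93, 12 -> 0.961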

12 Estimating the outcome of tossing a coin

13 Estimating Probabilities (Cont.) So the larger the data set the better, but there is the problem of sparse data: the Brown Corpus contains about a million words but only around 50,000 different words, so one would expect each word to occur about 20 times on average; yet a large number of the words occur fewer than 5 times.

14 Estimating Probabilities (Cont.) For a random variable X with possible values x1 … xn, let Vi be a count derived from the number of times X = xi; then P(X = xi) ≈ Vi / Σi Vi. Maximum Likelihood Estimation (MLE) uses Vi = |xi|, the raw count; Expected Likelihood Estimation (ELE) uses Vi = |xi| + 0.5.

15 MLE vs ELE Suppose a word w doesn't occur in the corpus, and we want to estimate the probability of w occurring in one of 40 classes L1 … L40. We have a random variable X with X = xi only if w appears in word category Li. By MLE, P(Li | w) is undefined because the divisor is zero; by ELE, P(Li | w) ≈ 0.5 / 20 = 0.025. Now suppose w occurs 5 times (4 times as a noun and once as a verb): by MLE, P(N | w) = 4/5 = 0.8; by ELE, P(N | w) = 4.5/25 = 0.18.
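A minimal sketch of the two estimators for the 40-category example above, with the counts stated on the slide (an unseen word, and a word seen 4 times as a noun and once as a verb):

NUM_CATEGORIES = 40

def mle(count_in_cat, total_count):
    """Maximum Likelihood Estimation: relative frequency."""
    if total_count == 0:
        return None                      # undefined when the word never occurs
    return count_in_cat / total_count

def ele(count_in_cat, total_count, num_categories=NUM_CATEGORIES):
    """Expected Likelihood Estimation: add 0.5 to every category count."""
    return (count_in_cat + 0.5) / (total_count + 0.5 * num_categories)

# Unseen word: MLE is undefined, ELE gives 0.5 / 20 = 0.025 for every category.
print(mle(0, 0), ele(0, 0))
# Word seen 5 times, 4 of them as a noun: MLE 0.8 vs ELE 4.5 / 25 = 0.18.
print(mle(4, 5), ele(4, 5))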

16 Evaluation The data set is divided into a training set (80-90% of the data) and a test set (10-20%). Cross-validation: repeatedly remove a different part of the corpus as the test set, train on the remainder of the corpus, then evaluate on the new test set.

17 Noisy Channel A noisy channel maps the real language X to the noisy language Y. We model p(X) and p(Y | X), and p(X | Y) is proportional to p(X) * p(Y | X). We want to recover x ∈ X from y ∈ Y by choosing the x that maximizes p(x | y).

18 Part of speech tagging Simplest algorithm: choose the interpretation that occurs most frequently; flies in the sample corpus was a verb 60% of the time. This algorithm's success rate is about 90%, since over 50% of the words appearing in most corpora are unambiguous. To improve the success rate, use the tags before or after the word under examination: if flies is preceded by the word the, it is definitely a noun.
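A sketch of this baseline tagger, assuming a hypothetical training corpus given as (word, tag) pairs; it simply remembers each word's most frequent tag:

from collections import Counter, defaultdict

def train_most_frequent_tagger(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs (hypothetical format)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [("the", "ART"), ("flies", "V"), ("flies", "V"), ("flies", "N"),
          ("like", "V"), ("a", "ART"), ("flower", "N")]
tagger = train_most_frequent_tagger(corpus)
# Unknown words default to "N" here, which is an arbitrary choice.
print([tagger.get(w, "N") for w in "the flies like a flower".split()])
# ['ART', 'V', 'V', 'ART', 'N']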

19 Part of speech tagging (Cont.) General form of the POS problem: given a sequence of words w1 … wt, find the sequence of lexical categories C1 … Ct such that (1) P(C1 … Ct | w1 … wt) is maximized. By Bayes' rule this equals (2) P(C1 … Ct) * P(w1 … wt | C1 … Ct) / P(w1 … wt), so the problem reduces to finding C1 … Ct such that (3) P(C1 … Ct) * P(w1 … wt | C1 … Ct) is maximized. No effective method exists for calculating the probability of such long sequences accurately, as it would require too much data, but the probabilities can be estimated by making some independence assumptions.

20 Part of speech tagging (Cont.) Use information about the previous word category (bigram), the two previous categories (trigram), or the n-1 previous categories (n-gram). Using the bigram model, P(C1 … Ct) ≈ ∏ i=1..t P(Ci | Ci-1), e.g. P(ART N V N) = P(ART | ∅) * P(N | ART) * P(V | N) * P(N | V), and P(w1 … wt | C1 … Ct) ≈ ∏ i=1..t P(wi | Ci). Therefore we are looking for a sequence C1 … Ct such that ∏ i=1..t P(Ci | Ci-1) * P(wi | Ci) is maximized.

21 Part of speech tagging (Cont.) The information needed by this new formula can be extracted from the corpus: P(Ci = V | Ci-1 = N) = Count(N at position i-1 and V at position i) / Count(N at position i-1) (Fig. 7-4), and P(the | ART) = Count(# times the is an ART) / Count(# times an ART occurs) (Fig. 7-6).
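A sketch of how those two tables might be estimated from a POS-tagged corpus, assuming a hypothetical input format of one (word, tag) list per sentence; a start marker plays the role of the empty category ∅, and dividing by the total count of the preceding tag is a close approximation of the count in the formula above:

from collections import Counter

START = "<start>"   # stands in for the empty category at the start of a sentence

def estimate_probs(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    bigram_counts, tag_counts, word_tag_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = START
        tag_counts[START] += 1
        for word, tag in sent:
            bigram_counts[(prev, tag)] += 1
            tag_counts[tag] += 1
            word_tag_counts[(word, tag)] += 1
            prev = tag
    bigram = {bg: c / tag_counts[bg[0]] for bg, c in bigram_counts.items()}
    lexical = {wt: c / tag_counts[wt[1]] for wt, c in word_tag_counts.items()}
    return bigram, lexical   # approximations of P(Ci | Ci-1) and P(wi | Ci)

bigram, lexical = estimate_probs(
    [[("the", "ART"), ("flies", "N"), ("like", "V"), ("flowers", "N")]])
print(bigram[("ART", "N")], lexical[("flies", "N")])   # 1.0 0.5 on this tiny corpus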

22 Using an Artificial Corpus An artificial corpus of 300 sentences over the categories ART, N, V, and P was generated, containing 1998 words: 833 nouns, 300 verbs, 558 articles, and 307 prepositions. To deal with the problem of sparse data, a small minimum probability is assumed for any bigram not observed in the corpus.

23 Bigram probabilities from the generated corpus

24 Word counts in the generated corpus: a table giving, for each word (flies, fruit, like, a, the, flower, flowers, birds, others), its count in each category (N, V, ART, P) and its total; for example, the occurs 300 times as an ART out of 303 occurrences in total, and flies occurs 21 times as an N out of 44 (figures used in the later slides).

25 Lexical-generation probabilities (Fig. 7-6):
PROB(the | ART) = 0.54      PROB(a | ART) = 0.360
PROB(flies | N) = 0.025     PROB(a | N) = 0.001
PROB(flies | V) = 0.076     PROB(flower | N) = 0.063
PROB(like | V) = 0.1        PROB(flower | V) = 0.05
PROB(like | P) = 0.068      PROB(birds | N) = 0.076
PROB(like | N) = 0.012

26 Part of speech tagging (Cont.) How do we find the sequence C1 … Ct that maximizes ∏ i=1..t P(Ci | Ci-1) * P(wi | Ci)? A brute-force search would enumerate all possible sequences: with N categories and T words there are N^T of them. Using the independence assumption and bigram probabilities, the probability of wi being in category Ci depends only on Ci-1, so the process can be modeled by a special form of probabilistic finite state machine (Fig. 7-7).

27 Markov Chain The probability of a sequence of 4 words having the categories ART N V N is 0.71 * 1 * 0.43 * 0.35 ≈ 0.107. The representation is accurate only if the probability of a category occurring depends only on the category immediately before it. This is called the Markov assumption, and the network is called a Markov chain.

28 Hidden Markov Model (HMM) The Markov network can be extended to include the lexical-generation probabilities too: each node has an output probability for every possible word. For instance, N is associated with a probability table indicating, for each word, how likely that word is to be selected if we randomly select a noun. The output probabilities are exactly the lexical-generation probabilities shown in Fig. 7-6. A Markov network with output probabilities is called a Hidden Markov Model (HMM).

29 Hidden Markov Model (HMM) The word hidden indicates that, for a specific sequence of words, it is not clear what state the Markov model is in. For instance, the word flies could be generated from state N with probability 0.025 or from state V with probability 0.076, so it is not trivial to compute the probability of a sequence of words from the network. But if you are given a particular state sequence, the probability that it generates a particular output is easily computed by multiplying the probabilities along the path by the probabilities of each output.

30 Hidden Markov Model (HMM) The probability that the sequence N V ART N generates the output flies like a flower is computed as follows: the probability of the path N V ART N is 0.29 * 0.43 * 0.65 * 1 ≈ 0.081, and the probability of the output being flies like a flower is P(flies | N) * P(like | V) * P(a | ART) * P(flower | N) = 0.025 * 0.1 * 0.36 * 0.06 = 5.4 * 10^-5. The likelihood that the HMM would generate the sentence is therefore 0.081 * 5.4 * 10^-5 ≈ 4.4 * 10^-6. In general, the probability of a sentence w1 … wt together with a category sequence C1 … Ct is ∏ i=1..t P(Ci | Ci-1) * P(wi | Ci).
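A sketch that reproduces this computation with the numbers given on these slides (transition probabilities from Fig. 7-4, lexical-generation probabilities from Fig. 7-6); only the values actually used in the example are included:

# P(Ci | Ci-1): the transitions used in this example (from Fig. 7-4 of the slides)
bigram = {("<start>", "N"): 0.29, ("N", "V"): 0.43,
          ("V", "ART"): 0.65, ("ART", "N"): 1.0}
# P(wi | Ci): lexical-generation probabilities (from Fig. 7-6 of the slides)
lexical = {("flies", "N"): 0.025, ("like", "V"): 0.1,
           ("a", "ART"): 0.36, ("flower", "N"): 0.06}

def sequence_probability(words, tags):
    """Joint probability of a tag path and the words it outputs:
    the product over i of P(Ci | Ci-1) * P(wi | Ci)."""
    p, prev = 1.0, "<start>"
    for w, t in zip(words, tags):
        p *= bigram[(prev, t)] * lexical[(w, t)]
        prev = t
    return p

print(sequence_probability("flies like a flower".split(), ["N", "V", "ART", "N"]))
# ~4.4e-06, i.e. the path probability (~0.081) times the output probability (5.4e-05)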

31 Markov Chain

32 Viterbi Algorithm

33 Flies like a flower
SEQSCORE(i, 1) = P(flies | Li) * P(Li | ∅)
P(flies/V) = 0.076 * 0.0001 = 7.6 * 10^-6
P(flies/N) = 0.025 * 0.29 = 0.00725
P(like/V) = max(P(flies/N) * P(V | N), P(flies/V) * P(V | V)) * P(like | V) = max(0.00725 * 0.43, 7.6 * 10^-6 * P(V | V)) * 0.1 ≈ 3.1 * 10^-4
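A compact Viterbi implementation following the SEQSCORE recurrence above. It is a sketch, assuming the probability tables are dictionaries like those in the earlier sketches; bigrams and word/tag pairs not in the tables fall back to a small minimum probability, in the spirit of the sparse-data handling in the slides:

def viterbi(words, categories, bigram, lexical, min_prob=0.0001):
    """Return the most probable tag sequence under the bigram/lexical model."""
    def b(prev, cat):                      # P(cat | prev), with back-off
        return bigram.get((prev, cat), min_prob)
    def lex(word, cat):                    # P(word | cat), with back-off
        return lexical.get((word, cat), min_prob)

    # Initialization: SEQSCORE(i, 1) = P(w1 | Li) * P(Li | <start>)
    score = [{c: lex(words[0], c) * b("<start>", c) for c in categories}]
    back = [{}]
    # Iteration: SEQSCORE(i, t) = max_j SEQSCORE(j, t-1) * P(Li | Lj) * P(wt | Li)
    for t in range(1, len(words)):
        score.append({})
        back.append({})
        for c in categories:
            best = max(categories, key=lambda p: score[t - 1][p] * b(p, c))
            score[t][c] = score[t - 1][best] * b(best, c) * lex(words[t], c)
            back[t][c] = best
    # Read out the best path by following the back pointers
    last = max(categories, key=lambda c: score[-1][c])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

With the small tables from the previous sketch, viterbi("flies like a flower".split(), ["N", "V", "ART", "P"], bigram, lexical) returns the path N V ART N.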

34 Flies like a flower

35 Flies like a flower A brute-force search takes on the order of N^T steps, whereas the Viterbi algorithm takes on the order of K * T * N^2 steps.

36 Getting Reliable Statistics (smoothing) Suppose we have 40 categories. To collect unigrams, at least 40 samples (one per category) are needed; for bigrams, 1600 samples; for trigrams, 64,000 samples; for 4-grams, 2,560,000 samples. A smoothed estimate interpolates the lower-order models: P(Ci | C1 … Ci-1) ≈ λ1 P(Ci) + λ2 P(Ci | Ci-1) + λ3 P(Ci | Ci-2 Ci-1), where λ1 + λ2 + λ3 = 1.
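A minimal sketch of that linear interpolation; the λ weights are illustrative assumptions (in practice they are tuned on held-out data), and the probability tables are dictionaries as in the earlier sketches:

def interpolated_prob(c, c1, c2, unigram, bigram, trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(Ci | Ci-2 Ci-1) estimated as l1*P(Ci) + l2*P(Ci | Ci-1) + l3*P(Ci | Ci-2 Ci-1).
    The lambda weights sum to 1; the values used here are only an example."""
    l1, l2, l3 = lambdas
    return (l1 * unigram.get(c, 0.0)
            + l2 * bigram.get((c1, c), 0.0)
            + l3 * trigram.get((c2, c1, c), 0.0))

# Even if a particular trigram was never observed, the estimate backs off to the
# bigram and unigram terms instead of collapsing to zero.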

37 Statistical Parsing Corpus-based methods offer new ways to control parsers: we can use statistical methods to identify the common structures of a language and choose the most likely interpretation when a sentence is ambiguous. This might lead to much more efficient parsers that are almost deterministic.

38 Statistical Parsing What is the input of a statistical parser? The output of a POS tagging algorithm. If the POS tags are accurate, lexical ambiguity is removed; but if the tagging is wrong, the parser cannot find the correct interpretation, or may find a valid but implausible one. With 95% tagging accuracy, the chance of correctly tagging every word of an 8-word sentence is about 0.67, and of a 15-word sentence about 0.46.

39 Obtaining Lexical Probabilities A better approach is (1) to compute the probability that each word appears in each of its possible lexical categories, and (2) to combine these probabilities with some method of assigning probabilities to rule use in the grammar. The context-independent probability that the lexical category of a word w is Lj can be estimated by P(Lj | w) = count(Lj & w) / Σ i=1..N count(Li & w).

40 Context-independent lexical categories P(Lj | w) = count(Lj & w) / Σ i=1..N count(Li & w), e.g. P(ART | the) = 300 / 303 = 0.99 and P(N | flies) = 21 / 44 = 0.48.

41 Context-dependent lexical probabilities A better estimate can be obtained by computing how likely it is that category Li occurs at position t over all sequences of the input w1 … wt. Instead of just finding the sequence with the maximum probability, we add up the probabilities of all sequences that end in wt/Li. For example, the probability that flies is a noun in the sentence The flies like flowers is calculated by adding up the probabilities of all sequences that end with flies as a noun.

42 Context-dependent lexical probabilities Using the probabilities of Figs. 7-4 and 7-6, the sequences with nonzero values are:
P(The/ART flies/N) = P(the | ART) * P(ART | ∅) * P(N | ART) * P(flies | N) = 0.54 * 0.71 * 1.0 * 0.025 = 9.58 * 10^-3
P(The/N flies/N) = P(the | N) * P(N | ∅) * P(N | N) * P(flies | N) = 1/833 * 0.29 * 0.13 * 0.025 = 1.13 * 10^-6
P(The/P flies/N) = P(the | P) * P(P | ∅) * P(N | P) * P(flies | N) = 2/307 * P(P | ∅) * 0.26 * 0.025 ≈ 4.55 * 10^-9
which adds up to about 9.58 * 10^-3.

43 Context-dependent lexical probabilities Similarly, there are three nonzero sequences ending with flies as a V, with a total value of 1.13 * 10^-5. Then P(The flies) = 9.58 * 10^-3 + 1.13 * 10^-5 ≈ 9.59 * 10^-3, so P(flies/N | The flies) = P(flies/N & The flies) / P(The flies) = 9.58 * 10^-3 / 9.59 * 10^-3 ≈ 0.999, and P(flies/V | The flies) = P(flies/V & The flies) / P(The flies) = 1.13 * 10^-5 / 9.59 * 10^-3 ≈ 0.0012.

44 Forward Probabilities The probability of producing the words w1 … wt and ending in state wt/Li is called the forward probability αi(t), defined as αi(t) = P(wt/Li & w1 … wt). In The flies like flowers, α2(3) is the sum of the values computed for all sequences ending in a V (the second category) at position 3, for the input the flies like. Then P(wt/Li | w1 … wt) = P(wt/Li & w1 … wt) / P(w1 … wt) ≈ αi(t) / Σ j=1..N αj(t).
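A sketch of the forward computation; it reuses the dictionary-style tables and minimum-probability back-off of the earlier sketches, and returns, for each position t, the normalized value αi(t) / Σj αj(t), i.e. the context-dependent probability of each category:

def forward(words, categories, bigram, lexical, min_prob=0.0001):
    """alpha[t][c] = P(words[0..t] and the tag at t is c), summed over all tag paths."""
    b = lambda p, c: bigram.get((p, c), min_prob)
    lex = lambda w, c: lexical.get((w, c), min_prob)

    alpha = [{c: b("<start>", c) * lex(words[0], c) for c in categories}]
    for t in range(1, len(words)):
        alpha.append({c: sum(alpha[t - 1][p] * b(p, c) for p in categories)
                         * lex(words[t], c)
                      for c in categories})
    # Normalize each position to get P(tag at t is c | words so far)
    return [{c: a[c] / sum(a.values()) for c in a} for a in alpha]

Run on the words the flies with full probability tables, this is the computation that yields values like P(flies/N | The flies) and P(flies/V | The flies) on the previous slides.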

45 Forward Probabilities

46 Context dependent lexical Probabilities

47 Context dependent lexical Probabilities

48 Backward Probability The backward probability βj(t) is the probability of producing the sequence wt … wT beginning from the state wt/Lj. Combining the two, P(wt/Li) ≈ (αi(t) * βi(t)) / Σ j=1..N (αj(t) * βj(t)).
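A matching sketch for the backward pass, using the common convention in which the emission at position t is counted only in the forward term; combining α and β then gives the smoothed per-position tag probabilities of the formula above:

def backward(words, categories, bigram, lexical, min_prob=0.0001):
    """beta[t][c] = P(words[t+1..T] | the tag at t is c)."""
    b = lambda p, c: bigram.get((p, c), min_prob)
    lex = lambda w, c: lexical.get((w, c), min_prob)

    T = len(words)
    beta = [dict() for _ in range(T)]
    beta[T - 1] = {c: 1.0 for c in categories}
    for t in range(T - 2, -1, -1):
        beta[t] = {c: sum(b(c, n) * lex(words[t + 1], n) * beta[t + 1][n]
                          for n in categories)
                   for c in categories}
    return beta

def tag_probabilities(alpha, beta):
    """P(tag at t is c) proportional to alpha[t][c] * beta[t][c], renormalized."""
    result = []
    for a, bt in zip(alpha, beta):
        z = sum(a[c] * bt[c] for c in a)
        result.append({c: a[c] * bt[c] / z for c in a})
    return result

The per-position normalization done in forward() is a constant factor at each position, so it cancels out here and the two sketches can be combined directly.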

49 Probabilistic Context-free Grammars CFGs can be generalized to PCFGs; we just need some statistics on rule use. The simplest approach is to count the number of times each rule is used in a corpus of parsed sentences. If category C has rules R1 … Rm, then P(Rj | C) = count(# times Rj used) / Σ i=1..m count(# times Ri used).
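A sketch of that counting step over a hypothetical treebank, represented simply as a list of observed rule applications (left-hand side, right-hand side):

from collections import Counter, defaultdict

def estimate_rule_probs(rule_uses):
    """rule_uses: iterable of (lhs, rhs) pairs, one per rule application
    observed in a parsed corpus.  Returns P(rule | lhs category)."""
    by_lhs = defaultdict(Counter)
    for lhs, rhs in rule_uses:
        by_lhs[lhs][rhs] += 1
    return {lhs: {rhs: c / sum(rules.values()) for rhs, c in rules.items()}
            for lhs, rules in by_lhs.items()}

uses = [("NP", ("ART", "N")), ("NP", ("ART", "N")), ("NP", ("N",)),
        ("S", ("NP", "VP"))]
print(estimate_rule_probs(uses)["NP"])
# {('ART', 'N'): 0.666..., ('N',): 0.333...}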

50 Probabilistic Context-free Grammars

51 Independence assumption One can develop an algorithm similar to the Viterbi algorithm that finds the most probable parse tree for an input, but certain independence assumptions must be made: the probability of a constituent being derived by a rule Rj is assumed to be independent of how the constituent is used as a subconstituent. For example, the probabilities of the NP rules are taken to be the same whether the NP is the subject, the object of a verb, or the object of a preposition. This assumption is not valid: a subject NP is much more likely to be a pronoun than an object NP is.

52 Inside Probability The probability that a constituent C generates a sequence of words wi, wi+1, …, wj (written wi,j) is called the inside probability and is denoted P(wi,j | C). It is called the inside probability because it assigns a probability to the word sequence inside the constituent.

53 Inside Probabilities How are inside probabilities derived? For lexical categories, they are the same as the lexical-generation probabilities: P(flower | N) is the inside probability that the constituent N is realized as the word flower (0.06 in Fig. 7-6). Using the lexical-generation probabilities, the inside probabilities of non-lexical constituents can then be computed.

54 Inside probability of an NP generating a flower The probability that an NP generates a flower is estimated as P(a flower | NP) = P(Rule 8 | NP) * P(a | ART) * P(flower | N) + P(Rule 6 | NP) * P(a | N) * P(flower | N) = 0.55 * 0.36 * 0.06 + P(Rule 6 | NP) * 0.001 * 0.06 ≈ 0.012.
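A sketch of that calculation. The value 0.55 for P(Rule 8 | NP) and the lexical-generation probabilities are taken from the slides; P(Rule 6 | NP) is not given there, so a placeholder value is used only to show the shape of the computation:

# Lexical-generation probabilities from Fig. 7-6 of the slides
lex = {("a", "ART"): 0.36, ("a", "N"): 0.001, ("flower", "N"): 0.06}

P_RULE8_NP = 0.55   # NP -> ART N, from the slides
P_RULE6_NP = 0.09   # NP -> N N; placeholder value, not given in the slides

inside_a_flower = (P_RULE8_NP * lex[("a", "ART")] * lex[("flower", "N")]
                   + P_RULE6_NP * lex[("a", "N")] * lex[("flower", "N")])
print(round(inside_a_flower, 3))   # 0.012, dominated by the first term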

55 Inside probability of an S generating A flower wilted These probabilities can then be used to compute the probabilities of larger constituents P(a flower wilted | S) = P(Rule 1 | S) * P(a flower | NP) * P(wilted | VP) + P(Rule 1 | S) * P(a | NP) * P(flower wilted | VP)

56 Probabilistic chart parsing In parsing we are interested in finding the most likely parse rather than the overall probability of a given sentence, and we can use a chart parser for this purpose. When entering an entry E of category C using rule i with n subconstituents corresponding to entries E1 … En, P(E) = P(Rule i | C) * P(E1) * … * P(En). For lexical categories, it is better to use forward probabilities rather than the lexical-generation probabilities.

57 A flower

58 Probabilistic Parsing This technique identifies the correct parse only about 50% of the time. The reason is that the independence assumption is too radical; one of the crucial issues is the handling of lexical items. A context-free model does not consider lexical preferences: the parser prefers to attach a PP to the V rather than to the NP, and fails to find the correct structure in the cases where the PP should be attached to the NP.

59 Best-First Parsing Explore higher-probability constituents first, so that much of the search space, containing lower-rated probabilities, is never explored. The chart parser's agenda is organized as a priority queue, and the arc extension algorithm needs to be modified.
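A sketch of the agenda discipline only (not a full chart parser): constituents are pushed with their probabilities and popped best-first. Python's heapq is a min-heap, so the probability is negated, and a counter breaks ties so the heap never compares constituents directly:

import heapq

class BestFirstAgenda:
    """Chart-parser agenda ordered by constituent probability, highest first."""
    def __init__(self):
        self._heap = []
        self._counter = 0

    def push(self, constituent, probability):
        heapq.heappush(self._heap, (-probability, self._counter, constituent))
        self._counter += 1

    def pop(self):
        neg_p, _, constituent = heapq.heappop(self._heap)
        return constituent, -neg_p

agenda = BestFirstAgenda()
agenda.push(("NP", 0, 2), 0.012)
agenda.push(("VP", 2, 3), 0.25)
print(agenda.pop())   # (('VP', 2, 3), 0.25): the higher-probability constituent first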

60 New arc extension for Prob. Chart Parser

61 The man put a bird in the house The best-first parser finds the correct parse after generating 65 constituents; the standard bottom-up parser generates 158 constituents in total, and 106 constituents before finding its first answer. So best-first parsing is a significant improvement.

62 Best First Parsing It finds the most probable interpretation first. The probability of a constituent is always lower than or equal to the probability of any of its subconstituents. If S2 with probability p2 is found after S1 with probability p1, then p2 cannot be higher than p1; otherwise the subconstituents of S2 would have higher probabilities than p1, would be found sooner than S1, and thus S2 would be found sooner too.

63 Problem of multiplication In practice, with large grammars, the probabilities drop quickly because of the repeated multiplications, so other scoring functions can be used, e.g. Score(C) = MIN(Score(C → C1 … Cn), Score(C1), …, Score(Cn)). But MIN leads to only 39% correct results.

64 Context-dependent probabilistic parsing The best-first algorithm improves efficiency but has no effect on accuracy. Computing rule probabilities based on some context-dependent lexical information can improve accuracy. The first word of a constituent is often its head word, so we can compute the probability of each rule based on the first word of the constituent: P(R | C, w).

65 Context-dependent probabilistic parsing P(R | C, w) = Count(# times R is used for category C starting with w) / Count(# times category C starts with w). Singular nouns rarely occur alone as a noun phrase (NP → N); plural nouns rarely act as a modifying noun (NP → N N). Context-dependent rules also encode verb preferences for subcategorizations.
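A sketch of estimating these lexicalized rule probabilities, assuming the parsed data is available as hypothetical (rule, category, first word) triples:

from collections import Counter, defaultdict

def estimate_lexicalized_rule_probs(observations):
    """observations: iterable of (rule, category, first_word) triples.
    Returns P(rule | category, first_word)."""
    by_context = defaultdict(Counter)
    for rule, cat, word in observations:
        by_context[(cat, word)][rule] += 1
    return {ctx: {r: c / sum(rules.values()) for r, c in rules.items()}
            for ctx, rules in by_context.items()}

obs = [("VP -> V NP PP", "VP", "put"), ("VP -> V NP PP", "VP", "put"),
       ("VP -> V NP", "VP", "like"), ("VP -> V NP PP", "VP", "like")]
print(estimate_lexicalized_rule_probs(obs)[("VP", "like")])
# {'VP -> V NP': 0.5, 'VP -> V NP PP': 0.5}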

66 Rule probabilities based on the first word of constituents

67 Context-Dependent Parser Accuracy

68 The man put the bird in the house P(VP → V NP PP | VP, put) = 0.93 * 0.99 * 0.76 * 0.76 ≈ 0.54, while P(VP → V NP | VP, put) is far lower, since put strongly prefers the V NP PP subcategorization.

69 The man likes the bird in the house P(VP → V NP PP | VP, like) = 0.1, P(VP → V NP | VP, like) = 0.054.

70 Context-dependent rules The accuracy of the parser is still only 66%. Possible improvements: make the rule probabilities relative to a larger fragment of the input (bigram, trigram, …) and use other important words, such as prepositions. The more selective the lexical categories, the more predictive the estimates can be (provided there is enough data). Other closed-class words such as articles, quantifiers, and conjunctions can also be treated individually, while open-class words such as verbs and nouns can be clustered into groups of similar words.

71 Handling Unknown Words An unknown word will disrupt the parse. Suppose we have a trigram model of the data: if w3 in the sequence of words w1 w2 w3 is unknown, and w1 and w2 are of categories C1 and C2, pick the category C for w3 such that P(C | C1 C2) is maximized. For instance, if C2 is ART, then C will probably be a NOUN (or an ADJECTIVE). Morphology can also help: unknown words ending in -ing are likely VERBs, and those ending in -ly are likely ADVERBs.
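A sketch combining the two ideas on this slide: prefer the category C that maximizes P(C | C1 C2), and fall back on the suffix heuristics mentioned above when the trigram model has no evidence. The example trigram table and the default category are assumptions for illustration:

def guess_unknown_category(word, c1, c2, trigram_probs, categories, default="N"):
    """Pick the category C maximizing P(C | C1 C2); fall back to simple
    suffix heuristics when the model has no evidence for this context."""
    best_p, best_c = max((trigram_probs.get((c1, c2, c), 0.0), c)
                         for c in categories)
    if best_p > 0:
        return best_c
    if word.endswith("ing"):
        return "V"
    if word.endswith("ly"):
        return "ADV"
    return default

trigrams = {("V", "ART", "N"): 0.8, ("V", "ART", "ADJ"): 0.2}
print(guess_unknown_category("zorp", "V", "ART", trigrams, ["N", "V", "ADJ", "ADV"]))
# 'N': after an ART, the unknown word is probably a noun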

72 Human preference in Parsing Allen’s Chapter 6

73 Human preference in Parsing The parsing techniques seen so far have depended on search, but humans seem to parse more deterministically. However, they may fall into a garden path: The raft floated down the river sank.

74 Human preference in Parsing Some of the principles that appear to be used by people to choose the correct interpretation are: Minimal Attachment, Right Association, and Lexical Preferences.

75 Minimal Attachment

76 Minimal Attachment The M.A. principle may cause misparsing. (1) We painted all the walls with cracks: the PP tends to attach to the VP rather than to the NP. (2) The horse [that was] raced past the barn fell: the reduced relative clause introduces more nodes, so raced is taken as the main verb, an analysis that is rejected only when fell is seen.

77 Right Association (or Late Closure) George said that Henry left in his car I thought it would rain yesterday

78 Right Association (or Late Closure)

79 Lexical Preference The M.A. and R.A. principles may conflict. In The man keeps the dog in the house, R.A. suggests attaching in the house to the most recent constituent, the dog, while M.A. suggests attaching it to the verb keeps, since that creates fewer nodes. Should M.A. be given more priority?

80 Lexical Preference (1) I wanted the dog in the house: the R.A. reading (the PP modifies the dog) is preferred. (2) I kept the dog in the house: the M.A. reading (the PP attaches to the verb) is preferred. (3) I put the dog in the house: only attachment to the verb is possible, so the R.A. reading is wrong. So lexical preference overrides both M.A. and R.A.