Statistical Methods in NLP, Course 5, Diana Trandabăț
The sentence as a string of words. E.g. "I saw the lady with the binoculars" = string a b c d e b f
The relations of parts of a string to each other may be different. "I saw the lady with the binoculars" is structurally ambiguous: who has the binoculars?
[ I ] saw the lady [ with the binoculars ] = [a] b c d [e b f] I saw [ the lady with the binoculars] = a b [c d e b f]
How can we represent the difference? By assigning them different structures. We can represent structures with 'trees'. I read the book
a. I saw the lady with the binoculars: (S (NP I) (VP (V saw) (NP (NP the lady) (PP with the binoculars)))) = I saw [the lady with the binoculars]
b. I saw the lady with the binoculars: (S (NP I) (VP (VP (V saw) (NP the lady)) (PP with the binoculars))) = I [saw the lady] with the binoculars
"birds fly": (S (NP (N birds)) (VP (V fly))). Syntactic rules: S → NP VP, NP → N, VP → V
Tree S → NP VP over "birds fly"; abstractly: a b, ab = string
(S (A a) (B b)); rules: S → A B, A → a, B → b
Rules. Assumption: natural language grammars are rule-based systems. What kind of grammars describe natural language phenomena? What are the formal properties of grammatical rules?
The Chomsky Hierarchy
Chomsky, N. (1957) Syntactic Structures. The Hague: Mouton. Chomsky, N. and G.A. Miller (1958) Finite-state languages. Information and Control 1. Chomsky, N. (1959) On certain formal properties of grammars. Information and Control 2. The Chomsky Hierarchy
SYNTAX (phrase/sentence formation). SENTENCE: The boy kissed the girl. SUBJECT + PREDICATE; NOUN PHRASE + VERB PHRASE; ART + NOUN, VERB + NOUN PHRASE. Rules: S → NP VP, VP → V NP, NP → ART N
Chomsky Hierarchy. 0. Type 0 (recursively enumerable) languages - the only restriction on rules: the left-hand side cannot be the empty string (*Ø → ...). 1. Context-Sensitive languages - Context-Sensitive (CS) rules. 2. Context-Free languages - Context-Free (CF) rules. 3. Regular languages - regular (right- or left-linear) rules. 0 ⊃ 1 ⊃ 2 ⊃ 3, where a ⊃ b means a properly includes b (a is a superset of b), i.e. b is a proper subset of a.
[Venn diagram of two sets S1 ⊂ S2] Superset/subset relation: S1 is a subset of S2; S2 is a superset of S1.
Generative power: Type 0 (recursively enumerable) languages - the most powerful system; Type 3 (regular languages) - the least powerful.
Rule Type – 3. Name: Regular. Example: Finite State Automata (Markov-process Grammar). Rule type: a) right-linear: A → x B, A → x, or b) left-linear: A → B x, A → x, with A, B = auxiliary nodes and x = terminal node. Generates: a^m b^n with m, n ≥ 1. Cannot guarantee that there are as many a's as b's; no embedding.
Example of regular grammar: S → the A; A → cat B; A → mouse B; A → duck B; B → bites C; B → sees C; B → eats C; C → the D; D → boy; D → girl; D → monkey. Generates: "the cat bites the boy", "the mouse eats the monkey", "the duck sees the girl", ...
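A minimal Python sketch (not part of the original slides) that enumerates every sentence this regular grammar generates, by expanding non-terminals left to right; the rule table below simply transcribes the rules above.

```python
# Sketch: enumerate every sentence generated by the regular grammar above,
# expanding non-terminals left to right.
RULES = {
    "S": [["the", "A"]],
    "A": [["cat", "B"], ["mouse", "B"], ["duck", "B"]],
    "B": [["bites", "C"], ["sees", "C"], ["eats", "C"]],
    "C": [["the", "D"]],
    "D": [["boy"], ["girl"], ["monkey"]],
}

def expand(symbols):
    """Yield all terminal strings derivable from a list of grammar symbols."""
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    if first in RULES:                       # non-terminal: try every rule for it
        for rhs in RULES[first]:
            yield from expand(rhs + rest)
    else:                                    # terminal: keep it and expand the rest
        for tail in expand(rest):
            yield [first] + tail

for sentence in expand(["S"]):
    print(" ".join(sentence))
# 27 sentences, e.g. "the cat bites the boy", "the duck sees the girl", ...
```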
More regular grammars. Grammar 1 (right-linear): A → a; A → a B; B → b A. Grammar 2 (left-linear): A → B a; B → A b. Grammar 3 (right-linear): S → a A; S → b B; A → a S; B → b b S; S → ε. Grammar 4 (left-linear): A → A a; A → B a; B → b; B → A b; A → a.
Example of non-regular grammars. Grammar 5: S → A B; S → b B; A → a S; B → b b S; S → ε. Grammar 6: A → a; A → B a; B → b; B → b A.
[Finite-state transition diagram for NP] Rules: NP → article NP1; NP1 → adjective NP1; NP1 → noun NP2
A parse tree: S = root node; NP, VP, NP = non-terminal nodes; n, v, det, n = terminal nodes.
Rule Type – 2. Name: Context Free. Example: Phrase Structure Grammars / Push-Down Automata. Rule type: A → γ, with A = auxiliary node and γ = any number of terminal or auxiliary nodes. Recursiveness (centre embedding) allowed: A → α A β.
Rule Type – 1. The following language cannot be generated by a CF grammar (by the pumping lemma): a^n b^m c^n d^m. Swiss German: a string of dative nouns (e.g. aa), followed by a string of accusative nouns (e.g. bbb), followed by a string of dative-taking verbs (cc), followed by a string of accusative-taking verbs (ddd) = aabbbccddd, i.e. a^n b^m c^n d^m
More on Context Free Grammars (CFGs) Sets of rules expressing how symbols of the language fit together, e.g. S -> NP VP NP -> Det N Det -> the N -> dog
What Does Context Free Mean? LHS of rule is just one symbol. Can have NP -> Det N Cannot have X NP Y -> X Det N Y
Grammar Symbols Non Terminal Symbols Terminal Symbols – Words – Preterminals
Non Terminal Symbols Symbols which have definitions Symbols which appear on the LHS of rules S -> NP VP NP -> Det N Det -> the N -> dog
Non Terminal Symbols Same Non Terminals can have several definitions S -> NP VP NP -> Det N NP -> N Det -> the N -> dog
Terminal Symbols Symbols which appear in final string Correspond to words Are not defined by the grammar S -> NP VP NP -> Det N Det -> the N -> dog
Parts of Speech (POS) NT Symbols which produce terminal symbols are sometimes called pre-terminals S -> NP VP NP -> Det N Det -> the N -> dog Sometimes we are interested in the shape of sentences formed from pre-terminals Det N V Aux N V D N
CFG - formal definition. A CFG is a tuple (N, Σ, R, S). N is a set of non-terminal symbols. Σ is a set of terminal symbols disjoint from N. R is a set of rules, each of the form A → β, where A is a non-terminal and β is a string of terminals and non-terminals. S is a designated start symbol.
CFG - Example. Grammar: S → NP VP; NP → N; VP → V NP. Lexicon: V → kicks; N → John; N → Bill. N = {S, NP, VP, N, V}; Σ = {kicks, John, Bill}; R = the rules and lexicon above; S = "S"
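As a sketch, the same grammar and lexicon can be written down and parsed with NLTK (assuming the nltk package is available); the rule strings below transcribe the slide's rules.

```python
# Sketch (assumes nltk): the slide's CFG in NLTK notation, parsed with a chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> N
VP -> V NP
V  -> 'kicks'
N  -> 'John' | 'Bill'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John kicks Bill".split()):
    print(tree)   # (S (NP (N John)) (VP (V kicks) (NP (N Bill))))
```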
Exercise: write grammars that generate the following languages, for m, n > 0: (ab)^m, a^n b^m, a^n b^n. Which of these are Regular? Which of these are Context Free?
(ab)^m for m > 0: S -> a b; S -> a b S
(ab)^m for m > 0: S -> a b; S -> a b S. Alternatively: S -> a X; X -> b Y; Y -> a b; Y -> S
a^n b^m. One grammar: S -> A B; A -> a; A -> a A; B -> b; B -> b B. Another (right-linear, with "AB" a single non-terminal): S -> a AB; AB -> a AB; AB -> B; B -> b; B -> b B
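A quick sketch that sanity-checks the first grammar above: generate strings by leftmost expansion up to a bounded depth and confirm that each one matches a+b+, i.e. has the shape a^n b^m.

```python
# Sketch: bounded leftmost derivations from S -> A B, A -> a | a A, B -> b | b B,
# checking that every generated string has the shape a^n b^m.
import re

RULES = {"S": [["A", "B"]], "A": [["a"], ["a", "A"]], "B": [["b"], ["b", "B"]]}

def generate(symbols, depth):
    """Yield terminal strings derivable from `symbols` in at most `depth` expansions."""
    if all(s not in RULES for s in symbols):
        yield "".join(symbols)
        return
    if depth == 0:
        return
    for i, sym in enumerate(symbols):
        if sym in RULES:                     # expand the leftmost non-terminal only
            for rhs in RULES[sym]:
                yield from generate(symbols[:i] + rhs + symbols[i + 1:], depth - 1)
            return

strings = sorted(set(generate(["S"], depth=6)), key=len)
assert all(re.fullmatch(r"a+b+", s) for s in strings)
print(strings)   # e.g. ['ab', 'aab', 'abb', 'aabb', ...]
```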
Grammar Defines a Structure. Grammar: S → NP VP; NP → N; VP → V NP. Lexicon: V → kicks; N → John; N → Bill. Parse: (S (NP (N John)) (VP (V kicks) (NP (N Bill))))
Different Grammar, Different Structure. Grammar: S → NP NP; NP → N V; NP → N. Lexicon: V → kicks; N → John; N → Bill. Parse: (S (NP (N John) (V kicks)) (NP (N Bill)))
Which Grammar is Best? The structure assigned by the grammar should be appropriate. The structure should – Be understandable – Allow us to make generalisations. – Reflect the underlying meaning of the sentence.
Ambiguity. A grammar is ambiguous if it assigns two or more structures to the same sentence. NP → NP CONJ NP; NP → N. Lexicon: CONJ → and; N → John; N → Bill. The grammar should not generate too many possible structures for the same sentence.
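A small sketch (assuming nltk) that makes the ambiguity concrete: the conjunction grammar above assigns two structures to a three-name coordination such as "John and Bill and John".

```python
# Sketch (assumes nltk): count the parses the conjunction grammar assigns to a
# three-name coordination; the start symbol is NP (the first LHS in the string).
import nltk

grammar = nltk.CFG.fromstring("""
NP   -> NP CONJ NP
NP   -> N
CONJ -> 'and'
N    -> 'John' | 'Bill'
""")

parses = list(nltk.ChartParser(grammar).parse("John and Bill and John".split()))
print(len(parses))   # 2: [John and Bill] and John  vs.  John and [Bill and John]
for tree in parses:
    print(tree)
```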
Criteria for Evaluating Grammars Does it undergenerate? Does it overgenerate? Does it assign appropriate structures to sentences it generates? Is it simple to understand? How many rules are there? Does it contain just a few generalisations or is it full of special cases? How ambiguous is it? How many structures does it assign for a given sentence?
Probabilistic Context Free Grammar (PCFG). A PCFG is a probabilistic version of a CFG where each production has a probability. String generation is now probabilistic: production probabilities are used to non-deterministically select a production for rewriting a given non-terminal.
Characteristics of PCFGs. In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β. – e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or …) can be interpreted as a conditional probability: – probability of the expansion, given the LHS non-terminal – P(A → β) = P(A → β | A). Therefore, for any non-terminal A, the probabilities of every rule of the form A → β must sum to 1 – if this is the case, we say the PCFG is consistent.
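A minimal consistency check, sketched in Python: for every LHS non-terminal, the probabilities of its rules must sum to 1. The phrasal rules used here are the ones from the example PCFG that appears later in these slides.

```python
# Sketch: check PCFG consistency -- for each LHS, rule probabilities must sum to 1.
from collections import defaultdict

rules = [                     # (LHS, RHS, probability)
    ("S",  "NP VP",  1.0),
    ("NP", "DT NN",  0.5), ("NP", "NNS", 0.3), ("NP", "NP PP", 0.2),
    ("PP", "P NP",   1.0),
    ("VP", "VP PP",  0.6), ("VP", "VBD NP", 0.4),
]

totals = defaultdict(float)
for lhs, _, prob in rules:
    totals[lhs] += prob

for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}, not 1"
print("this PCFG is consistent")
```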
Simple PCFG for English. Grammar: S → NP VP; S → Aux NP VP; S → VP; NP → Pronoun; NP → Proper-Noun; NP → Det Nominal; Nominal → Noun; Nominal → Nominal Noun; Nominal → Nominal PP; VP → Verb; VP → Verb NP; VP → VP PP; PP → Prep NP. Lexicon: Det → the | a | that | this; Noun → book | flight | meal | money; Verb → book | include | prefer; Pronoun → I | he | she | me; Proper-Noun → Houston | NWA; Aux → does (1.0); Prep → from | to | on | near | through
Parse tree and Sentence Probability Assume productions for each node are chosen independently. Probability of a parse tree (derivation) is the product of the probabilities of its productions. Resolve ambiguity by picking most probable parse tree. Probability of a sentence is the sum of the probabilities of all of its derivations.
Probability of a tree vs. a sentence: P(t) is simply the multiplication of the probability of every rule (node) that gives rise to t (i.e. the derivation of t). This is both the joint probability of t and s, and the probability of t alone. Why?
P(t,s) = P(t) P(s|t) = P(t), because P(s|t) must be 1: the tree t is a parse of exactly the words of s (its yield is s).
Picking the best parse in a PCFG. A sentence will usually have several parses – we usually want them ranked, or only want the n-best parses – we need to focus on P(t|s,G), the probability of a parse tree given our sentence and our grammar – definition of the best parse for s: t̂(s) = argmax over trees t with yield(t) = s of P(t|s,G).
Probability of a sentence: simply the sum of the probabilities of all parses of that sentence, P(s|G) = Σ over t with yield(t) = s of P(t|G) – since s is only a sentence if it is recognised by G, i.e. if there is some t for s under G: all those trees which "yield" s.
Example PCFG Rules & Probabilities: S → NP VP 1.0; NP → DT NN 0.5; NP → NNS 0.3; NP → NP PP 0.2; PP → P NP 1.0; VP → VP PP 0.6; VP → VBD NP 0.4; DT → the 1.0; NN → gunman 0.5; NN → building 0.5; VBD → sprayed 1.0; NNS → bullets 1.0; P → with 1.0
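The same grammar can be written in NLTK's PCFG format (a sketch, assuming nltk is installed); ViterbiParser then returns the most probable parse of the example sentence together with its probability.

```python
# Sketch (assumes nltk): the example PCFG in NLTK notation; ViterbiParser returns
# the most probable parse of the sentence with its probability.
import nltk

pcfg = nltk.PCFG.fromstring("""
S   -> NP VP      [1.0]
NP  -> DT NN      [0.5]
NP  -> NNS        [0.3]
NP  -> NP PP      [0.2]
PP  -> P NP       [1.0]
VP  -> VP PP      [0.6]
VP  -> VBD NP     [0.4]
DT  -> 'the'      [1.0]
NN  -> 'gunman'   [0.5]
NN  -> 'building' [0.5]
VBD -> 'sprayed'  [1.0]
NNS -> 'bullets'  [1.0]
P   -> 'with'     [1.0]
""")

tokens = "the gunman sprayed the building with bullets".split()   # lower-cased to match the lexicon
for tree in nltk.ViterbiParser(pcfg).parse(tokens):
    print(tree.prob())   # 0.0045 -- the parse where the PP attaches to the VP
    print(tree)
```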
Example Parse t1: "The gunman sprayed the building with bullets." (PP attached to the VP): (S 1.0 (NP 0.5 (DT 1.0 The) (NN 0.5 gunman)) (VP 0.6 (VP 0.4 (VBD 1.0 sprayed) (NP 0.5 (DT 1.0 the) (NN 0.5 building))) (PP 1.0 (P 1.0 with) (NP 0.3 (NNS 1.0 bullets))))). P(t1) = 1.0 * 0.5 * 1.0 * 0.5 * 0.6 * 0.4 * 1.0 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0045
Another Parse t2 (PP attached to the object NP): (S 1.0 (NP 0.5 (DT 1.0 The) (NN 0.5 gunman)) (VP 0.4 (VBD 1.0 sprayed) (NP 0.2 (NP 0.5 (DT 1.0 the) (NN 0.5 building)) (PP 1.0 (P 1.0 with) (NP 0.3 (NNS 1.0 bullets)))))). P(t2) = 1.0 * 0.5 * 1.0 * 0.5 * 0.4 * 1.0 * 0.2 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0015. The gunman sprayed the building with bullets.
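A short sketch that reproduces the arithmetic above: multiply the rule probabilities of each tree, then sum over both trees to obtain the probability of the sentence.

```python
# Sketch: redo the arithmetic above -- multiply the rule probabilities of each
# tree, then sum over both trees to get the probability of the sentence.
from math import prod

t1_rule_probs = [1.0, 0.5, 1.0, 0.5, 0.6, 0.4, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.3, 1.0]
t2_rule_probs = [1.0, 0.5, 1.0, 0.5, 0.4, 1.0, 0.2, 0.5, 1.0, 0.5, 1.0, 1.0, 0.3, 1.0]

p_t1 = prod(t1_rule_probs)       # 0.0045  (PP attached to the VP)
p_t2 = prod(t2_rule_probs)       # 0.0015  (PP attached to the object NP)
print(p_t1, p_t2, p_t1 + p_t2)   # sentence probability = 0.006
```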
Some Features of PCFGs. A PCFG gives some idea of the plausibility of different parses. However, the probabilities are based on structural factors and not lexical ones. PCFGs are good for grammar induction. PCFGs are robust. PCFGs give a probabilistic language model for English. The predictive power of a PCFG (measured by entropy) tends to be greater than for an HMM. PCFGs are not good models alone but they can be combined with a tri-gram model. PCFGs have certain biases which may not be appropriate.
Restrictions: for PCFG parsing we consider only the case of Chomsky Normal Form grammars, i.e. Context-Free Grammars in which: every rule LHS is a non-terminal; every rule RHS consists of either a single terminal or two non-terminals. Examples: » A → B C » NP → Nominal PP » A → a » Noun → man. But not: » NP → the Nominal » S → VP
Converting a CFG to CNF. 1. Rules that mix terminals and non-terminals on the RHS – E.g. NP → the Nominal – Solution: introduce a dummy non-terminal to cover the original terminal – E.g. Det → the. Re-write the original rule: – NP → Det Nominal – Det → the
Converting a CFG to CNF. 2. Rules with a single non-terminal on the RHS (called unit productions), such as NP → Nominal – Solution: find all rules that have the form Nominal → ... – Nominal → Noun PP – Nominal → Det Noun. Re-write the above rule several times to eliminate the intermediate non-terminal: – NP → Noun PP – NP → Det Noun – Note that this makes our grammar "flatter"
Converting a CFG to CNF. 3. Rules which have more than two items on the RHS – E.g. NP → Det Noun PP. Solution: – introduce new non-terminals to spread the sequence on the RHS over more than 1 rule. Nominal → Noun PP; NP → Det Nominal
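A rough Python sketch of steps 1 and 3 above (the helper name is hypothetical, and step 2, removing unit productions, is not handled): terminals mixed into longer right-hand sides get dummy pre-terminals, and right-hand sides longer than two symbols are binarised with fresh non-terminals.

```python
# Rough sketch of CNF steps 1 and 3 (hypothetical helper; unit productions not handled).
def to_cnf(rules, terminals):
    """rules: list of (lhs, rhs_tuple) pairs; returns an equivalent rule list."""
    out, fresh = [], 0
    for lhs, rhs in rules:
        # step 1: replace a terminal mixed into a longer RHS by a dummy non-terminal
        if len(rhs) > 1:
            new_rhs = []
            for sym in rhs:
                if sym in terminals:
                    out.append((f"T_{sym}", (sym,)))
                    new_rhs.append(f"T_{sym}")
                else:
                    new_rhs.append(sym)
            rhs = tuple(new_rhs)
        # step 3: binarise an RHS with more than two symbols
        while len(rhs) > 2:
            fresh += 1
            out.append((lhs, (rhs[0], f"X{fresh}")))
            lhs, rhs = f"X{fresh}", rhs[1:]
        out.append((lhs, rhs))
    return out

print(to_cnf([("NP", ("the", "Nominal")),
              ("NP", ("Det", "Noun", "PP"))], terminals={"the"}))
# [('T_the', ('the',)), ('NP', ('T_the', 'Nominal')),
#  ('NP', ('Det', 'X1')), ('X1', ('Noun', 'PP'))]
```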
The outcome. If we parse a sentence with a CNF grammar, we know that: – every phrase-level non-terminal (above the part-of-speech level) will have exactly 2 daughters: NP → Det N – every part-of-speech level non-terminal will have exactly 1 daughter, and that daughter is a terminal: N → lady
Problems with Probabilistic CFG Models. Main problem with the Probabilistic CFG model: it does not take contextual effects into account. Example: pronouns are much more likely to appear in the subject position of a sentence than in object position. But in a PCFG, the rule NP → Pronoun has only one probability. One simple possible extension: make probabilities dependent on the first word of the constituent. Instead of P(C_i | C), use P(C_i | C, w) where w is the first word in C. Example: the rule VP → V NP PP is used 93% of the time with the verb put, but only 10% of the time for like. Requires estimating a much larger set of probabilities, and can significantly improve disambiguation performance.
Probabilistic Lexicalized CFGs A solution to some of the problems with Probabilistic CFGs is to use Probabilistic Lexicalized CFGs. Use the probabilities of particular words in the computation of the probabilities in the derivation
Lexicalised PCFGs Attempt to weaken the lexical independence assumption. Most common technique: –mark each phrasal head (N,V, etc) with the lexical material –this is based on the idea that the most crucial lexical dependencies are between head and dependent –E.g.: Charniak 1997, Collins 1999
Lexicalised PCFGs: "Matt walks". Makes probabilities partly dependent on lexical content. P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walk). NB: normally, we can't assume that all heads of a phrase of category C are equally probable. Tree: (S(walks) (NP(Matt) (NNP(Matt) Matt)) (VP(walk) (VBD(walk) walks)))
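A toy sketch of head lexicalisation (the head-child table is hypothetical, not Collins' or Charniak's actual head rules, and the word form is used instead of the lemma): the head word of each phrase is percolated up into its label, as in the tree above.

```python
# Toy sketch of head lexicalisation (hypothetical head table, word form instead of lemma).
HEAD_CHILD = {"S": "VP", "NP": "NNP", "VP": "VBD"}   # which daughter supplies the head

def lexicalise(node):
    """node = (label, children); children is either a word string or a list of nodes."""
    label, children = node
    if isinstance(children, str):                    # pre-terminal: the word is the head
        return (f"{label}({children})", children), children
    new_children, heads = [], {}
    for child in children:
        new_child, head = lexicalise(child)
        new_children.append(new_child)
        heads[child[0]] = head                       # head word keyed by daughter label
    head = heads.get(HEAD_CHILD.get(label), next(iter(heads.values())))
    return (f"{label}({head})", new_children), head

tree = ("S", [("NP", [("NNP", "Matt")]), ("VP", [("VBD", "walks")])])
print(lexicalise(tree)[0])
# ('S(walks)', [('NP(Matt)', [('NNP(Matt)', 'Matt')]), ('VP(walks)', [('VBD(walks)', 'walks')])])
```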
Example
Great! See you upstairs!