Learning Random Walk Models for Inducing Word Dependency Distributions Kristina Toutanova Christopher D. Manning Andrew Y. Ng
Kristina Toutanova, Chris Manning and Andrew Y. Ng Word-specific information is important for many NLP disambiguation tasks Language modeling for speech recognition I would like to make a collect …. ? Need a good estimate of P(coal|collect) P(cow|collect) P(call|collect) coal cow call
Kristina Toutanova, Chris Manning and Andrew Y. Ng Sparseness: 1 million words is very little data The number of parameters to be estimated is very large – for a vocabulary size of 30,000 words, we have 900 million pair-wise parameters Pair-wise statistics involving two words are very sparse, even on topics central to the domain of the corpus. Examples from WSJ (1million words): stocks plummeted 2 occurrences stocks stabilized1 occurrence stocks rose50 occurrences stocks skyrocketed0 occurrences stocks laughed 0 occurrences
Kristina Toutanova, Chris Manning and Andrew Y. Ng It must be possible to use related words instead The hypothesis is that similar words (according to various measures) have similar distributions. Many ways of defining related words. Similarity with respect to some features or distributions extracted from a corpus (examples later) Co-occurrence in a huge corpus (Web) stabilizestabilizedstabilizing riseclimb morphology synonyms skyrocket rise “is-a” relationship s
Kristina Toutanova, Chris Manning and Andrew Y. Ng Using multiple similarity measures and chaining inferences stocksrose rise skyrocketskyrocketed
Kristina Toutanova, Chris Manning and Andrew Y. Ng Random walks in the space of words We propose to use a Markov Chain model in the state space of words, connected by different links according to their relatedness. The target word dependency distribution is estimated as the stationary distribution of the Markov Chain. This allows us to make inference based on different relatedness concepts and to chain inferences together. We automatically learn weights for the different information sources from data.
Kristina Toutanova, Chris Manning and Andrew Y. Ng Related work: PageRank PageRank A Markov Chain with transition probabilities given by the Web graph and a manually varied probability of reset to uniform With probability 1- choose a link uniformly at random from the out-links of the current page With probability pick a web page from the entire collection uniformly at random
Kristina Toutanova, Chris Manning and Andrew Y. Ng This work In our work, the Markov Chain transition matrix is defined using different “link types”. The probability of following each link type is learned from data. This allows for a richer class of models incorporating multiple knowledge sources.
Kristina Toutanova, Chris Manning and Andrew Y. Ng Word dependency probabilities for structural ambiguities Structural ambiguities: He broke [the window] [with a hammer] He broke [the window] [with the white curtains] Good probability estimates of P(hammer | broke, with) and P(curtains| window, with) will help with disambiguation
Kristina Toutanova, Chris Manning and Andrew Y. Ng The problem : prepositional phrase attachment Attachment of a phrase of a preposition, such as with,of, in, for, etc. Binary decision – attach to the preceding noun or verb phrase We represent examples by using only the head words (the main words of each phrase) She [broke] [the window] [with [a hammer]] V=broke, N 1 =window, P=with, N 2 =hammer The class variable is attachment Att=va or Att=na We learn a generative model P(V, N 1, P, N 2, Att)
Kristina Toutanova, Chris Manning and Andrew Y. Ng Sparseness of distributions Factor% Non-Zero Verb Attachment P(p|va)99.8 P(v | va,p)64.8 P(n 1 | va,p,v)15.7 P(n 2 | va,p,v)13.8 Noun Attachment P(p|na)99.5 P(v | na,p)69.4 P(n 1 | na,p,v)20.9 P(n 2 | na,p, n 1 )17.4 returning rates to normal (va) pushing rate of inflation (na) We will learn Markov Chains whose stationary distributions are good estimates of these distributions over words.
Kristina Toutanova, Chris Manning and Andrew Y. Ng Basic form of learned Markov Chains Links among states represent different kinds of relatedness hang hook fasten hooks Example: v:hang n 1 :painting p:with n 2 :peg P(n 2 =peg|v=hang,p=with,va) P with,va (n 2 =peg|v=hang) Colored states showing whether the word is a head or a dependent. e.g hang is a head and peg is a dependent in a with- type dependency The initial state distribution places probability 1 on hang
Kristina Toutanova, Chris Manning and Andrew Y. Ng An example Markov Chain P with (n 2 :peg|v:hang) hang fasten pegs peg rivet hook hooks We include the empirical P(head | dependent) and P(dependent | head) distributions as link types empirical distribution link
Kristina Toutanova, Chris Manning and Andrew Y. Ng An example Markov Chain P with (n 2 :peg|v:hang) hang fasten pegs peg rivet hook hooks we have seen pegs as dependent of hang in the training corpus peg and pegs have the same root form the probability of peg given hang will again be high
Kristina Toutanova, Chris Manning and Andrew Y. Ng An example Markov Chain P with (n 2 :peg|v:hang) hang fasten pegs peg rivet hook hooks The empirical probability of rivet given hang is high rivet and peg are semantically related The probability of peg given hang will again be high
Kristina Toutanova, Chris Manning and Andrew Y. Ng An example Markov Chain P with (n 2 :peg|v:hang) hang fasten pegs peg rivet hook hooks The probability of peg given hang will again be high
Kristina Toutanova, Chris Manning and Andrew Y. Ng Form of learned transition distributions A list of links is given. Each link is a transition distribution from states to states and is given by a matrix T i. The transition probabilities of the learned MC are given by
Kristina Toutanova, Chris Manning and Andrew Y. Ng Training the model The model is trained by maximizing the conditional log-likelihood of a development set samples. We fix the maximum number of steps in the Markov Chain to a small number d called the maximum degree of the walk, for computational reasons The stopping probability is another trainable parameter of the MC.
Kristina Toutanova, Chris Manning and Andrew Y. Ng Evaluation data Four-tuples of words and correct attachment extracted from the WSJ part of the Penn Treebank 20,801 samples training, 4,039 development, and 3,097 test set. Average human accuracy of 88.2% given four words (Ratnaparkhi 94) Many cases where the meaning is pretty much the same They built a house in London Cases where it is difficult to decide from the head words only (93% human accuracy if the whole sentence is seen) Accuracy 59% noun attachment 72.2% most common for each preposition
Kristina Toutanova, Chris Manning and Andrew Y. Ng Baseline MC model Only includes empirical distributions from the training corpus. Links from head to dependent states with the following transition probabilities back-off link Empirical distribution link A random walk with these links is the same as the familiar linear interpolation models
Kristina Toutanova, Chris Manning and Andrew Y. Ng Accuracy of Baseline Collins & Brooks 95 (1 NN) 84.15% Abney et al. 99 (Boosting) 84.40% Vanschoenwinkel et al. 03 (SVM) 84.80% Baseline 85.85% Human 88.20%
Kristina Toutanova, Chris Manning and Andrew Y. Ng Adding morphology and WordNet synonyms Morphology links for verbs and nouns – probability of transitioning to a word with the same root form is proportional to the smoothed empirical count of this word Synonym links for verbs and nouns – uniform probability of transitioning to any synonym with respect to the top three senses of a word in WordNet nails nail hangs hang fasten rivet nail rivet
Kristina Toutanova, Chris Manning and Andrew Y. Ng Accuracy using morphology and synonyms Error reduction from baseline wrt human is 28.6% but significant only at level 0.1 maximum degree d=3
Kristina Toutanova, Chris Manning and Andrew Y. Ng Using unannotated data Unannotated data is potentially useful for extracting information about words. We use the BLLIP corpus consisting of about 30 million words of automatically parsed text (Charniak 00). We extract 567,582 noisy examples of the same form as training set (called BLLIP-PP). We use this data in two ways: Create links using the empirical distribution in BLLIP-PP Create distributional similarity links
Kristina Toutanova, Chris Manning and Andrew Y. Ng Distributional similarity links from BLLIP-PP Distributional similarity of heads wrt dependents and vice versa (Dagan 99, Lee 99) Based on Jensen-Shannon divergence of empirical distributions Let Distributional similarity of heads wrt distribution over another variable, not involved in the dependency relation of interest
Kristina Toutanova, Chris Manning and Andrew Y. Ng Accuracy using BLLIP-PP Error reduction from baseline 12%, wrt human accuracy 72% maximum degree d=3
Kristina Toutanova, Chris Manning and Andrew Y. Ng Summary We built Markov Chain models defined using a base set of links. This allowed inference based on different relatedness concepts and chaining inferences together. We automatically learned weights for the different information sources. Large improvements were possible in the application of Prepositional Phrase attachment. Methods may be applicable to learning random walks on the web (cf. PageRank/HITS).