
1 Using eigenvectors of a bigram-induced matrix to represent and infer syntactic behavior. Mikhail Belkin and John Goldsmith, The University of Chicago, July 2002.

2 Dual motivation: unsupervised learning of the syntactic behavior of words, and solving a problem in the unsupervised learning of morphology: disambiguating morphs.

3 Disambiguating morphs? Automatic learning of morphology can provide us with a signature associated with a given stem. Signature = alphabetized list of affixes associated with a given stem in a corpus.

4 For example: Signature NULL.ed.ing.s: aid, ask, call, claim, help, kick. Signature NULL.ed.ing: add, assist, attend, consider. Signature NULL.s: achievement, acre, action, administrator, affair.
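As a toy illustration of the data structure (not the authors' Linguistica implementation), the sketch below groups a few hypothetical stem-to-affix mappings into signatures:

```python
# Minimal sketch: group stems by their alphabetized affix sets ("signatures").
# The stem_affixes data is hypothetical, standing in for a morphology learner's output.
from collections import defaultdict

stem_affixes = {
    "aid":  {"", "ed", "ing", "s"},
    "ask":  {"", "ed", "ing", "s"},
    "add":  {"", "ed", "ing"},
    "acre": {"", "s"},
}

signatures = defaultdict(list)
for stem, affixes in stem_affixes.items():
    # A signature is the alphabetized list of affixes; "" is the NULL suffix.
    key = ".".join(sorted(a if a else "NULL" for a in affixes))
    signatures[key].append(stem)

for sig, stems in sorted(signatures.items()):
    print(sig, "->", sorted(stems))
# NULL.ed.ing -> ['add'] ; NULL.ed.ing.s -> ['aid', 'ask'] ; NULL.s -> ['acre']
```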

5 The signature NULL.ed.ing is much more a subsignature of NULL.ed.ing.s than NULL.s is, because of s's ambiguity (noun, verb).

6 How can we determine whether a given morph ("ed", "s") represents more than one morpheme? I don't think that we can do this on the basis of morphological information.

7 Goal: find a way of describing syntactic behavior that depends only on a corpus. That is, a fashion that is language-independent but corpus-dependent, though the global structure induced from two corpora from the same language will be very similar.

8 French: [figure: 2-D eigenvector plot with clusters of fem. sg. nouns, plural nouns, and finite verbs]

9 With such a method we can look at words formed with the "same" suffix, putting words into buckets based on the signature their stem is in: Bucket 1 (NULL.ed.ing.s): aided, asked, called. Bucket 2 (NULL.ed.ing): added, assisted, attended. Q: does the average position of each bucket form a tight cluster?

10 If the average locations of each bucket of -ed words form a tight cluster, then -ed is not ambiguous. If the average locations of the buckets (from distinct signatures) do not form a tight cluster, the morpheme is not the same across signatures.

11 Method. Not a clustering method; neither top-down nor bottom-up. A two-step procedure: 1. Construct a nearest-neighbor graph. 2. Reduce the graph to two dimensions by means of eigenvector decomposition.

12 Nearest neighbors. Following a long list of researchers, we begin by assuming that a word W's distribution can be described by a vector L describing all of its left-hand neighbors and a vector R describing all of its right-hand neighbors.

13 Let V = size of the corpus's vocabulary. L_w and R_w are vectors that live in R^V. If the vocabulary is ordered alphabetically, then L_w = (4, 0, 0, 0, …), where the first entry is the number of occurrences of "a" before w, the second the number of occurrences of "abandoned" before w, the third the number of occurrences of "abatuna" before w, and so on.
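A minimal sketch of building the left-neighbor count vectors from bigrams; the toy corpus is a hypothetical stand-in for real data:

```python
# Build L_w for every word w: L[w][v] = number of times vocabulary word v
# occurs immediately to the left of w in the corpus.
import numpy as np

corpus = "the dog ran and the cat ran and the dog slept".split()
vocab = sorted(set(corpus))            # alphabetized vocabulary
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

L = np.zeros((V, V), dtype=np.int64)   # row i is L-vector of word i, in R^V
for left, right in zip(corpus, corpus[1:]):
    L[index[right], index[left]] += 1
```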

14 Similarity of syntactic behavior is modeled as closeness of L-vectors, where the "closeness" of two vectors is measured by the angle between them.
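A sketch of that angle measure; the function name and the zero-vector guard are our additions:

```python
import numpy as np

def angle(u, v):
    """Angle between two count vectors; smaller angle = more similar."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return np.pi / 2            # no observed neighbors: treat as dissimilar
    cos = np.clip(np.dot(u, v) / (nu * nv), -1.0, 1.0)
    return np.arccos(cos)
```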

15 Construct a (non-directed) graph whose vertices are the words W in V. For each word W: pick the K most-similar words (K = 20, 50) by angle of L-vector, and add an edge to the graph connecting W to each of those words.
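Continuing the sketch above (with L, V, and angle from the previous blocks), the K-nearest-neighbor edge set can be built like this; K is far below the paper's 20 or 50 because the toy vocabulary is tiny:

```python
K = 2
edges = set()
for i in range(V):
    # Rank all other words by angle to word i's L-vector, ascending.
    order = sorted((j for j in range(V) if j != i),
                   key=lambda j: angle(L[i], L[j]))
    for j in order[:K]:
        edges.add((min(i, j), max(i, j)))   # undirected: store each edge once
```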

16 Canonical matrix representation of a graph: M(i,j) = 1 iff there is an edge connecting w_i and w_j, that is, iff w_i and w_j are similar words as regards how they interact with the word immediately to the left.
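Continuing, the adjacency matrix M follows directly from the edge set:

```python
import numpy as np

M = np.zeros((V, V))
for i, j in edges:
    M[i, j] = M[j, i] = 1.0   # symmetric, since the graph is undirected
```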

17 Where is this matrix M? It's a point in a space of dimension V(V-1)/2. Not very helpful, really. How can we optimally reduce it to a space of small dimension? Find the eigenvectors of the normalized Laplacian of the graph. See Chung, Malik and Shi, Belkin and Niyogi (references in written version).

18 A graph and its matrix M. The degree of a vertex (= word) is the number of edges adjacent (linked) to it. Notice that this is not fixed across words. The degree of vertex v_i is the sum of the entries of the i-th row.

19 The Laplacian of the graph. Let D = the VxV diagonal matrix s.t. diagonal entry D(i,i) = degree of v_i. D - M is the Laplacian of the graph. Its rows sum to 0.
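Continuing the sketch (M from above), the unnormalized Laplacian and a check of the zero row sums:

```python
import numpy as np

d = M.sum(axis=1)        # degree of each vertex = sum of its row of M
D = np.diag(d)
Lap = D - M              # graph Laplacian
assert np.allclose(Lap.sum(axis=1), 0)   # rows sum to zero by construction
```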

20 Normalized Laplacian: For each i, divide all entries in the i-th row by √d(i); for each i, divide all entries in the i-th column by √d(i). Result: diagonal elements are all 1. Generally: the normalized Laplacian is D^(-1/2)(D - M)D^(-1/2); its (i,j) entry is 1 if i = j, -1/√(d(i)d(j)) if v_i and v_j are connected by an edge, and 0 otherwise.
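Continuing (Lap and d from above), the row-and-column rescaling in code, assuming as in the toy graph that no vertex is isolated:

```python
import numpy as np

inv_sqrt_d = 1.0 / np.sqrt(d)                       # requires every d(i) > 0
Lnorm = Lap * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]   # scale rows, then columns
assert np.allclose(np.diag(Lnorm), 1.0)             # diagonal is all 1s
```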

21 Eigenvector decomposition. The eigenvectors form a spectrum, ranked by the value of their eigenvalues. Eigenvalues run from 0 to 2 (the normalized Laplacian is positive semi-definite). The eigenvector with 0 eigenvalue reflects word frequency (the "zeroth"). But the next smallest (the "first") gives us a good representation of the words…
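Continuing the sketch, numpy's symmetric eigensolver returns the spectrum in ascending order, so the "zeroth", "first", and "second" eigenvectors are simply the first columns:

```python
import numpy as np

eigvals, eigvecs = np.linalg.eigh(Lnorm)   # eigenvalues ascending, in [0, 2]
# eigvecs[:, 0] has the smallest eigenvalue (0); per the slides it reflects frequency.
# The next two give each word its 2-D coordinates.
first, second = eigvecs[:, 1], eigvecs[:, 2]
```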

22 …in the sense that the values associated with each word show how close the words are in the original graph. We can graph the first two eigenvectors of the Left (or Right) graph: each word is located at the coordinates corresponding to it in the eigenvector(s):
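A sketch of the plot itself (first and second from the block above; matplotlib assumed available):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(first, second, s=10)
for w, x, y in zip(vocab, first, second):
    ax.annotate(w, (x, y), fontsize=8)   # label each point with its word
ax.set_xlabel("first eigenvector")
ax.set_ylabel("second eigenvector")
plt.show()
```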

23 Spanish (left): [figure with clusters of masculine plurals, fem. plurals, finite verbs, feminine sg. nouns, masc. sg. nouns, and past participles]

24 German (left): [figure with clusters of neuter sg. nouns, names of places, fem. sg. nouns, and numbers/centuries]

25 English (right): [figure with clusters of prepositions + "to" + "of", nouns, and modals]

26 English (left): [figure with clusters of infinitives, "the" + modals, and past verbs]

27 Results of experiment. If we define the size of the minimal box that includes all of the vocabulary as being 1 by 1, then we find a small (< 0.10) average distance to mean for unambiguous suffixes (e.g., -ed (English), -ait (French)), and only for them.

28 Measure. To repeat: we find the "virtual" location of the conflation of all of the stems of a given signature, plus the suffix in question (e.g., NULL.ed.ing_ed). We do this for all signatures containing "ed". We compute the average distance to the mean, as sketched below.
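A sketch of this measure; the inputs coords (2-D word locations from the eigenvector plot) and buckets (one word list per signature containing the suffix) are hypothetical names:

```python
import numpy as np

def coherence(coords, buckets):
    """Average distance of per-signature centroids to their overall mean,
    after rescaling the vocabulary's bounding box to 1 x 1.
    coords: dict word -> (x, y); buckets: list of word lists."""
    pts = np.array(list(coords.values()))
    span = pts.max(axis=0) - pts.min(axis=0)   # assumes both dimensions have spread
    centroids = np.array([
        np.mean([coords[w] for w in bucket], axis=0) / span
        for bucket in buckets
    ])
    mean = centroids.mean(axis=0)
    return np.mean(np.linalg.norm(centroids - mean, axis=1))

# A value below roughly 0.10 indicates a tight cluster (unambiguous suffix).
```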

29 Average distance to mean, Left graph vs. Right graph:

   Expect coherence:            Left     Right
     -ed                        0.050    0.054
     -ly                        0.032    0.100
     's                         0.000    0.012
     -al                        0.002
     -ate                       0.069    0.080
     -ment                      0.012    0.009
     -ait                       0.068    0.034
     -er                        0.055    0.071
     -a                         0.023    0.029
     -ant                       0.063    0.088
   Average <= 0.10

   Expect little/no coherence:
     -s                         0.265    0.145
     -ing                       0.096    0.143
     NULL                       0.312    0.192
     -e                         0.290    0.130
   Average > 0.10

30 Conclusion. The technique appears to work appropriately for the task. But we suspect that the actual use of the technique is much more interesting and open-ended than this (simple) application suggests.

