Expressing Implicit Semantic Relations without Supervision ACL 2006
2 Abstract
Input: a word pair X:Y with unspecified semantic relations
–Output: a list of patterns, ranked according to how well each pattern $P_i$ expresses the relations between X and Y
–For example, for X = ostrich and Y = bird, the output includes "X is the largest Y" and "Y such as X"
An unsupervised learning algorithm:
–Mines a large text corpus for patterns
–Sorts the patterns by pertinence
3 Introduction
Hearst (1992): "Y such as the X"
–X is a hyponym (type) of Y
–For building a thesaurus
Berland and Charniak (1999): "Y's X" and "X of the Y"
–X is a meronym (part) of Y
–For building a lexicon or ontology, like WordNet
This paper addresses the inverse of this problem:
–Given a word pair X:Y with some unspecified semantic relations
–Mine a large text corpus for lexico-syntactic patterns that express the implicit relations between X and Y
4 Introduction
A corpus of web pages: $5 \times 10^{10}$ English words
–From co-occurrences of the pair ostrich:bird in this corpus:
  516 patterns of the form "X … Y"
  452 patterns of the form "Y … X"
Main challenges:
–To find a way of ranking the patterns
–To find a way to empirically evaluate the performance
5 Pertinence - 1/3
mason:stone vs. carpenter:wood
–A high degree of relational similarity
Assumption:
–There is a measure of the relational similarity between pairs of words, $\mathrm{sim}_r(X_1{:}Y_1, X_2{:}Y_2)$
–Let $W = \{X_1{:}Y_1, \ldots, X_n{:}Y_n\}$ be a set of word pairs
–Let $P = \{P_1, \ldots, P_m\}$ be a set of patterns
The pertinence of pattern $P_i$ to a word pair $X_j{:}Y_j$ is the expected relational similarity between $X_j{:}Y_j$ and a word pair $X_k{:}Y_k$ that is randomly selected from $W$ according to the probability distribution $p(X_k{:}Y_k \mid P_i)$
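6 Pertinence - 2/3
Let $f_{k,i}$ be the number of occurrences of the word pair $X_k{:}Y_k$ with the pattern $P_i$
$$\mathrm{pertinence}(X_j{:}Y_j, P_i) = \sum_{k=1}^{n} p(X_k{:}Y_k \mid P_i) \cdot \mathrm{sim}_r(X_j{:}Y_j, X_k{:}Y_k)$$
–$p(X_k{:}Y_k \mid P_i)$: conditional probability
–$\mathrm{sim}_r(X_j{:}Y_j, X_k{:}Y_k)$: relational similarity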
7 Pertinence - 3/3
Assume $p(X_j{:}Y_j) = 1/n$ for all pairs in $W$
–$p(X_j{:}Y_j) = 1/n$: Laplace smoothing
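The pertinence definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pertinence` and the argument layout are hypothetical, and the estimate $p(P_i \mid X_k{:}Y_k) = f_{k,i} / \sum_{i'} f_{k,i'}$ is an assumption, since the slides name Bayes' theorem and the uniform prior $1/n$ but do not show the estimate itself.

```python
import numpy as np

def pertinence(freq, sim):
    """Sketch of pertinence under a uniform prior p(pair) = 1/n.

    freq: (n pairs x m patterns) array of counts f[k, i].
    sim:  (n x n) array of relational similarities sim_r between pairs.
    Returns an (n x m) array whose entry [j, i] is the pertinence of
    pattern i to pair j.
    """
    freq = np.asarray(freq, dtype=float)
    sim = np.asarray(sim, dtype=float)

    # ASSUMPTION: estimate p(P_i | pair k) by normalising each row of counts.
    row_sums = freq.sum(axis=1, keepdims=True)
    p_pat_given_pair = np.divide(
        freq, row_sums, out=np.zeros_like(freq), where=row_sums > 0
    )

    # Bayes' theorem with the uniform prior 1/n: the prior cancels, leaving
    # p(pair k | P_i) = p(P_i | pair k) / sum over k' of p(P_i | pair k').
    col_sums = p_pat_given_pair.sum(axis=0, keepdims=True)
    p_pair_given_pat = np.divide(
        p_pat_given_pair, col_sums, out=np.zeros_like(freq), where=col_sums > 0
    )

    # pertinence[j, i] = sum over k of p(pair k | P_i) * sim[j, k]
    return sim @ p_pair_given_pat
```

Because the result is a single matrix product, sorting row j in descending order gives the ranked pattern list for pair j.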
8 The Algorithm
Goal:
–Input: a set of word pairs $W = \{X_1{:}Y_1, \ldots, X_n{:}Y_n\}$
–Output: a ranked list of patterns for each input pair
1. Find phrases:
–Corpus: $5 \times 10^{10}$ English words
–List the phrases that begin with $X_i$ and end with $Y_i$
–Also list the phrases for the opposite order
–Allow one to three intervening words between $X_i$ and $Y_i$
9 The Algorithm
–The first and last words in the phrase do not need to exactly match $X_i$ and $Y_i$ (different suffixes are allowed)
2. Generate patterns:
–For example, the phrase "carpenter nails the wood" yields:
  X nails the Y
  X nails * Y
  X * the Y
  X * * Y
–$X_i$ first and $Y_i$ last, or vice versa
–Do not allow duplicate patterns in a list; instead, count how often each pattern is generated for the pair
–This count is the pattern frequency (analogous to term frequency in IR)
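Pattern generation in step 2 can be sketched as follows. The function names `generate_patterns` and `pattern_frequency` are hypothetical, and the sketch assumes the phrase is already known to begin with X and end with Y; it covers only the "X … Y" order and does not model the suffix matching from step 1.

```python
from collections import Counter
from itertools import product

def generate_patterns(phrase):
    """Replace the end words with X and Y, and every subset of the
    intervening words with the wildcard '*'."""
    words = phrase.split()
    middle = words[1:-1]
    if not 1 <= len(middle) <= 3:
        raise ValueError("expected one to three intervening words")
    patterns = []
    # Each mask says, per intervening word, whether it becomes a wildcard.
    for mask in product([False, True], repeat=len(middle)):
        filled = ["*" if wild else word for word, wild in zip(middle, mask)]
        patterns.append(" ".join(["X", *filled, "Y"]))
    return patterns

def pattern_frequency(phrases):
    """Count how often each pattern is generated across one pair's phrases."""
    counts = Counter()
    for phrase in phrases:
        counts.update(generate_patterns(phrase))
    return counts
```

A phrase with k intervening words yields 2^k patterns, so the four patterns listed for "carpenter nails the wood" are the complete set for that phrase.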
10 The Algorithm
3. Count pair frequency:
–The pair frequency of a pattern (analogous to document frequency in IR) is the number of lists that contain the given pattern
4. Map pairs to rows:
–For each pair $X_i{:}Y_i$, create a row for $X_i{:}Y_i$ and another row for $Y_i{:}X_i$
5. Map patterns to columns:
–For each unique pattern of the form "X … Y" (from step 2), create one column for the pattern and another column for the pattern with X and Y swapped, "Y … X"
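Steps 3 to 5 amount to a document-frequency count plus two index maps. The sketch below is illustrative only; `pair_frequency`, `swap_xy`, and `build_index` are hypothetical names, and the input is assumed to be a dictionary from each word pair to the patterns generated for it.

```python
from collections import Counter

def pair_frequency(pattern_lists):
    """Number of pairs whose pattern list contains each pattern."""
    counts = Counter()
    for patterns in pattern_lists.values():
        counts.update(set(patterns))  # count each pattern once per pair
    return counts

def swap_xy(pattern):
    """Turn 'X ... Y' into 'Y ... X' by exchanging the two end tokens."""
    words = pattern.split()
    words[0], words[-1] = words[-1], words[0]
    return " ".join(words)

def build_index(pairs, patterns):
    """Assign a row to each pair and its reverse, and a column to each
    pattern and its X/Y-swapped form."""
    rows, cols = {}, {}
    for x, y in pairs:
        rows.setdefault((x, y), len(rows))
        rows.setdefault((y, x), len(rows))
    for pattern in patterns:
        cols.setdefault(pattern, len(cols))
        cols.setdefault(swap_xy(pattern), len(cols))
    return rows, cols
```

With this layout, n pairs give 2n rows and m unique patterns give 2m columns, which matches the doubling of patterns to columns reported in the experiments.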
11 The Algorithm
6. Build a sparse matrix:
–Build a matrix $\mathbf{X}$ in which the value $x_{ij}$ is the pattern frequency of the j-th pattern for the i-th word pair
7. Calculate entropy:
–Weight each cell: $\log(x_{ij}) \cdot H(P)$, with $H(P) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$
8. Apply SVD (singular value decomposition):
–SVD is used to reduce noise and compensate for sparseness
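12 The Algorithm
–$\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are in column orthonormal form and $\boldsymbol{\Sigma}$ is a diagonal matrix of singular values
–If $\mathbf{X}$ is of rank $r$, then $\boldsymbol{\Sigma}$ is also of rank $r$
–Let $\boldsymbol{\Sigma}_k$ ($k < r$) be the diagonal matrix formed from the top $k$ singular values
–Let $\mathbf{U}_k$ and $\mathbf{V}_k$ be the matrices produced by selecting the corresponding columns from $\mathbf{U}$ and $\mathbf{V}$
–$k = 300$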
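Steps 7 and 8 can be sketched with dense NumPy arrays. Two points here are assumptions rather than the slides' method: the weighting uses the common log-entropy form $\log(x_{ij} + 1) \cdot (1 - H_j / \log_2 n)$, where $H_j$ is the entropy of column j and n is the number of rows, because the slides give only the shorthand $\log(x_{ij}) \cdot H(P)$; and a dense SVD stands in for whatever sparse solver was used on the full matrix. The function names are hypothetical.

```python
import numpy as np

def log_entropy(X):
    """ASSUMED weighting: log(x + 1) scaled by 1 - H_j / log2(n_rows)."""
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    col_sums = X.sum(axis=0, keepdims=True)
    # p[i, j]: share of column j's total mass that falls in row i.
    p = np.divide(X, col_sums, out=np.zeros_like(X), where=col_sums > 0)
    # Shannon entropy per column, treating 0 * log2(0) as 0.
    log_p = np.log2(p, out=np.zeros_like(p), where=p > 0)
    entropy = -(p * log_p).sum(axis=0)
    weight = 1.0 - entropy / np.log2(n_rows)
    return np.log(X + 1.0) * weight

def truncated_svd(X, k):
    """Keep the top k singular values and the matching columns of U and V."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]
```

Under this weighting, a pattern spread evenly over all pairs gets weight 0 and a pattern concentrated in one pair gets weight 1. For a matrix of the size described in the experiments, a sparse routine such as `scipy.sparse.linalg.svds` would be the more practical choice.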
13 The Algorithm
9. Calculate cosines:
–$\mathrm{sim}_r(X_1{:}Y_1, X_2{:}Y_2)$ is given by the cosine of the angle between their corresponding row vectors in the matrix $\mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T$
10. Calculate conditional probabilities:
–Using Bayes' theorem and the raw frequency data
11. Calculate pertinence:
–Combine the cosines from step 9 with the conditional probabilities from step 10
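Step 9 can be sketched without forming the full reconstructed matrix. Because the columns of $\mathbf{V}_k$ are orthonormal, the cosines between rows of $\mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T$ equal the cosines between rows of $\mathbf{U}_k \boldsymbol{\Sigma}_k$. The function name `relational_similarity` is hypothetical, and the inputs are assumed to be the truncated factors from the SVD step.

```python
import numpy as np

def relational_similarity(U_k, s_k):
    """Cosine between every pair of row vectors of U_k * Sigma_k.

    U_k: (n pairs x k) left singular vectors.
    s_k: (k,) top singular values.
    Returns an (n x n) matrix of cosines.
    """
    rows = np.asarray(U_k, dtype=float) * np.asarray(s_k, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    # Rows with zero norm stay zero instead of producing NaN.
    unit = np.divide(rows, norms, out=np.zeros_like(rows), where=norms > 0)
    return unit @ unit.T
```

The resulting matrix can be passed directly as the similarity input of a pertinence computation.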
14 Experiments with Word Analogies
374 college-level SAT analogy questions
–Example stem word pair: ostrich:bird, with choices (a) lion:cat (b) goose:flock (c) ewe:sheep (d) cub:bear (e) primate:monkey
–Rows: 374 × 6 × 2 = 4,488
  Some pairs are dropped because they do not co-occur in the corpus, which reduces the number of rows
–Columns: 1,706,845 patterns (3,413,690 columns)
  Drop all patterns with a frequency less than ten, leaving 42,032 patterns (84,064 columns)
–Density is 0.91%
17 Skip 15 SAT questions
f: pattern frequency
F: maximum f
n: pair frequency
N: total number of word pairs
18 Experiments with Noun-Modifiers-1/3
A set of 600 noun-modifier pairs
5 general classes of labels with 30 subclasses
–flu virus: causality relation (the flu is caused by a virus)
–causality (storm cloud), temporality (daily exercise), spatial (desert storm), participant (student protest), and quality (expensive book)
Matrix:
–1,184 rows and 33,698 columns
–Density is 2.57%
19 Experiments with Noun-Modifiers-2/3
Leave-one-out cross-validation
–The testing set consists of a single noun-modifier pair, and the training set consists of the 599 remaining noun-modifier pairs
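The leave-one-out protocol can be sketched as below. The slide does not say which classifier is applied to each held-out pair, so the single-nearest-neighbour rule over a similarity matrix is an assumption made for illustration, and `leave_one_out_accuracy` is a hypothetical name.

```python
import numpy as np

def leave_one_out_accuracy(sim, labels):
    """Hold out each pair in turn and label it by its most similar
    remaining pair (ASSUMED 1-nearest-neighbour rule).

    sim:    (n x n) similarity matrix between noun-modifier pairs.
    labels: length-n sequence of class labels.
    """
    sim = np.array(sim, dtype=float)       # copy, so the caller's data is untouched
    np.fill_diagonal(sim, -np.inf)         # a pair may not be its own neighbour
    nearest = sim.argmax(axis=1)           # best match among the other n - 1 pairs
    labels = np.asarray(labels)
    return float(np.mean(labels[nearest] == labels))
```

With 600 pairs, each of the 600 rounds uses the other 599 pairs as the training set, which is what masking the diagonal achieves here.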
20 Experiments with Noun-Modifiers-3/3
21 Conclusion
–The ranked patterns express how word pairs are similar
–The main contribution of this paper is the idea of pertinence
–Although the performance on the SAT analogy questions (54.6%) is near the level of the average senior high school student (57%), there is room for improvement