Published by Elvin Allen. Modified over 9 years ago.
1
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words
Presenter: Guan-Yu Chen
IEEE Trans. on Knowledge & Data Engineering, 23(7), 2011.
Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka
2
2 Outline
1. Introduction
2. Related Work
3. Method
4. Experiments
5. Conclusion
3
3 1. Introduction (1/5)
Semantic Similarity
–Web mining: community extraction, relation detection, & entity disambiguation.
–Information retrieval: retrieving a set of documents that is semantically related to a given user query.
–Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.
4
4 1. Introduction (2/5)
Web search engines
–Page count: the number of pages that contain the query words.
–Snippets: a brief window of text extracted by a search engine around the query term in a document.
5
5 1. Introduction (3/5)
Page count
–In Google, the page count for “apple” AND “computer” is 288,000,000; for “banana” AND “computer” it is 3,590,000.
–By page count, “apple” is much more similar to “computer” than “banana” is.
6
6 1. Introduction (4/5)
Snippets
–Query: “Jaguar” AND “cat”
–Snippet: Jaguar is the largest cat
–Pattern: X is the largest Y
7
7 1. Introduction (5/5)
Web search engine (Google) + page counts + snippets → semantic similarity
8
8 2. Related Work (1/2)
Normalized Google Distance (NGD)
–Cilibrasi & Vitanyi, 2007.
–P and Q: the two words.
–NGD(P,Q): the distance between P and Q.
–H(P), H(Q): the page counts for the words P and Q.
–H(P,Q): the page count for the query “P AND Q”.
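For reference, the NGD definition published by Cilibrasi & Vitanyi can be written in this slide's notation as follows. N is not defined on this slide; it is assumed here to be the number of pages indexed by the search engine.

```latex
\mathrm{NGD}(P,Q) =
  \frac{\max\{\log H(P), \log H(Q)\} - \log H(P,Q)}
       {\log N - \min\{\log H(P), \log H(Q)\}}
```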
9
9 2. Related Work (2/2)
Co-occurrence Double-Checking (CODC)
–Chen et al., 2006.
–F(P@Q): the number of occurrences of P in the top-ranking snippets for the query Q in Google.
–H(P): the page count for the query P.
–α: a constant in this model, experimentally set to 0.15.
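A hedged restatement of the CODC measure in the notation above, as it is commonly quoted in the literature. The exact form should be checked against Chen et al. (2006) before use.

```latex
\mathrm{CODC}(P,Q) =
\begin{cases}
  0 & \text{if } F(P@Q) = 0 \\[4pt]
  \exp\!\left( \log\!\left[ \dfrac{F(P@Q)}{H(P)} \times \dfrac{F(Q@P)}{H(Q)} \right]^{\alpha} \right) & \text{otherwise}
\end{cases}
```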
10
10 3. Method
1. Outline
2. Page Count-Based Co-Occurrence Measures
3. Lexical Pattern Extraction
4. Lexical Pattern Clustering
5. Measuring Semantic Similarity
6. Training
11
11 3.1 Outline
12
12 3.2 Page Count-Based Co-Occurrence Measures (1/2) P∩Q denotes the conjunction query “P AND Q”.
13
13 3.2 Page Count-Based Co-Occurrence Measures (2/2) N: the number of documents indexed by the search engine.
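A minimal Python sketch of four page count-based co-occurrence measures (Jaccard, Overlap, Dice, and PMI), assuming their usual set-overlap definitions adapted to page counts. The function names, the noise threshold `c`, and the default value of `n` are illustrative assumptions, not values taken from these slides.

```python
import math

def web_jaccard(h_p, h_q, h_pq, c=5):
    """Jaccard coefficient on page counts; tiny co-occurrence counts are treated as noise."""
    if h_pq <= c:  # c is an assumed noise threshold
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq, c=5):
    """Overlap (Simpson) coefficient on page counts."""
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq, c=5):
    """Dice coefficient on page counts."""
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq, n=10**10, c=5):
    """Pointwise mutual information; n is the assumed number of indexed documents (N)."""
    if h_pq <= c:
        return 0.0
    # (h_pq / n) / ((h_p / n) * (h_q / n)) simplifies to h_pq * n / (h_p * h_q)
    return math.log2(h_pq * n / (h_p * h_q))
```

Here `h_p`, `h_q`, and `h_pq` stand for H(P), H(Q), and H(P∩Q), the page counts returned for the queries P, Q, and “P AND Q”.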
14
14 3.3 Lexical Pattern Extraction (1/2)
Conditions:
1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence is allowed to skip one or more words. However, we do not skip more than g words consecutively. Moreover, the total number of words skipped in a subsequence should not exceed G.
4. We expand all negation contractions in a context. For example, didn’t is expanded to did not. We do not skip the word not when generating subsequences. For example, this condition ensures that from the snippet X is not a Y, we do not produce the subsequence X is a Y.
15
15 3.3 Lexical Pattern Extraction (2/2)
A snippet retrieved for the query “ostrich*******bird”.
Extracted patterns:
–X, a large Y
–X a flightless Y
–X, large Y lives
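The extraction conditions can be sketched as a brute-force subsequence enumeration. This is an illustrative sketch, not the authors' implementation: it assumes the snippet is already tokenized, that the two query words have been replaced by X and Y, and that negation contractions have been expanded. The function and parameter names are hypothetical.

```python
from itertools import combinations

def extract_patterns(tokens, max_len=5, max_gap=2, max_total_skip=4):
    """Enumerate subsequences of `tokens` that satisfy conditions 1-4.

    max_len, max_gap, and max_total_skip play the roles of L, g, and G.
    """
    patterns = set()
    for size in range(2, max_len + 1):
        for idx in combinations(range(len(tokens)), size):
            chosen = [tokens[i] for i in idx]
            # Condition 1: exactly one X and exactly one Y.
            if chosen.count("X") != 1 or chosen.count("Y") != 1:
                continue
            # Condition 3: at most g consecutive skips, at most G skips in total.
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if max(gaps) > max_gap or sum(gaps) > max_total_skip:
                continue
            # Condition 4: never skip the word "not".
            skipped = set(range(idx[0], idx[-1] + 1)) - set(idx)
            if any(tokens[i] == "not" for i in skipped):
                continue
            patterns.add(" ".join(chosen))
    return patterns

# Hypothetical tokenized snippet fragment, with "ostrich" -> X and "bird" -> Y.
tokens = "X , a large , flightless Y that lives".split()
patterns = extract_patterns(tokens)
```

With this tokenization the result includes the three patterns on the slide, printed with a space before each comma ("X , a large Y"). Enumerating all combinations is exponential in snippet length, which is acceptable only because snippets are short.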
16
16 3.4 Lexical Pattern Clustering (1/2)
–word-pair frequency:
–total occurrence:
–a_j: a pattern in pattern vector a.
–(P_i, Q_j): a word pair.
17
17 3.4 Lexical Pattern Clustering (2/2)
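A minimal sketch of a sequential (single-pass) pattern clustering procedure of the kind summarized in the conclusion. It assumes each pattern is represented by a vector of its word-pair frequencies, that patterns are processed in descending order of total occurrence, and that a pattern joins the most similar existing cluster when the cosine similarity to that cluster's centroid exceeds the threshold θ. The data layout and names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_patterns(pattern_vectors, theta):
    """pattern_vectors maps a pattern string to its word-pair frequency vector."""
    # Process frequent patterns first (total occurrence = sum of the vector).
    ordered = sorted(pattern_vectors, key=lambda p: sum(pattern_vectors[p]), reverse=True)
    clusters = []  # each cluster: {"members": [patterns], "centroid": summed vector}
    for pattern in ordered:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -math.inf
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > theta:
            best["members"].append(pattern)
            best["centroid"] = [a + b for a, b in zip(best["centroid"], vec)]
        else:
            clusters.append({"members": [pattern], "centroid": list(vec)})
    return clusters
```

Each pattern is compared only against the current cluster centroids, so the cost grows with the number of clusters rather than the number of pattern pairs.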
18
18 3.5 Measuring Semantic Similarity (1/5)
–Weight of a pattern a_i in a cluster c_j:
–The jth feature for a word pair (P, Q):
19
19 3.5 Measuring Semantic Similarity (2/5) Feature vector for a word pair (P, Q) :
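A sketch of how the cluster-based features might be computed. It assumes the weight of a pattern within a cluster is its total occurrence divided by the summed total occurrence of all patterns in that cluster, and that the jth feature is the weighted sum of the word pair's frequencies over the patterns in cluster c_j. The names and data layout are hypothetical.

```python
def cluster_features(clusters, total_occurrence, pair_freq):
    """Return one feature per cluster for a single word pair (P, Q).

    clusters: list of lists of patterns.
    total_occurrence: pattern -> total occurrence over all word pairs.
    pair_freq: pattern -> frequency of that pattern for the pair (P, Q).
    """
    features = []
    for members in clusters:
        cluster_total = sum(total_occurrence[a] for a in members)
        feature = sum(
            (total_occurrence[a] / cluster_total) * pair_freq.get(a, 0)
            for a in members
        )
        features.append(feature)
    return features

# Toy example with made-up counts.
clusters = [["p1", "p2"], ["p3"]]
total_occurrence = {"p1": 6, "p2": 3, "p3": 3}
features = cluster_features(clusters, total_occurrence, {"p1": 3, "p3": 2})
```

The full feature vector for a word pair would then concatenate these cluster features with the page count-based co-occurrence measures.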
20
20 3.5 Measuring Semantic Similarity (3/5) Train a two-class SVM: ( synonymous / nonsynonymous ) Semantic similarity:
21
21 3.5 Measuring Semantic Similarity (4/5)
Distance:
–b: the bias term of the hyperplane.
–α_k: the Lagrange multiplier.
–f_k: a support vector.
–K(f_k, f): the value of the kernel function.
–f: the instance to classify.
22
22 3.5 Measuring Semantic Similarity (5/5) The probability: Log likelihood:
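A sketch of how an SVM output can be turned into a similarity score, assuming the standard kernel expansion for the signed distance and a sigmoid (Platt-style) mapping to a probability. The class labels `labels` (+1 or -1), the sigmoid parameters `lam` and `mu`, the sign convention, and the linear kernel are assumptions made for illustration.

```python
import math

def linear_kernel(u, v):
    """Dot product; stands in for whatever kernel K the SVM was trained with."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(f, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """d(f) = sum_k y_k * alpha_k * K(f_k, f) + b."""
    return sum(
        y * a * kernel(sv, f)
        for sv, a, y in zip(support_vectors, alphas, labels)
    ) + b

def platt_probability(d, lam, mu):
    """Map a decision value to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(lam * d + mu))

def negative_log_likelihood(decision_values, targets, lam, mu):
    """Objective minimized over (lam, mu); targets are 1 (synonymous) or 0."""
    total = 0.0
    for d, t in zip(decision_values, targets):
        p = platt_probability(d, lam, mu)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total
```

With a negative `lam`, larger decision values map to probabilities closer to 1, so the probability of the synonymous class can be read as the semantic similarity.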
23
23 3.6 Training (1/5)
Number of Patterns Extracted for Training Data
–Synonymous: (A, B), (C, D)
–Nonsynonymous: (A, D), (C, B)
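The pairing shown above suggests that nonsynonymous examples are built by recombining words from different synonymous pairs. A minimal sketch under that assumption follows; the rotation scheme and the `is_synonymous` check are hypothetical stand-ins, and a real pipeline would presumably shuffle randomly and verify candidates against WordNet.

```python
def make_nonsynonymous(pairs, is_synonymous=lambda p, q: False):
    """Pair each first word with the second word of the next synonymous pair."""
    seconds = [q for _, q in pairs]
    rotated = seconds[1:] + seconds[:1]  # shift second words by one position
    return [
        (p, q)
        for (p, _), q in zip(pairs, rotated)
        if not is_synonymous(p, q)  # drop accidental synonyms
    ]

negatives = make_nonsynonymous([("A", "B"), ("C", "D")])
```

For the two pairs on the slide this yields (A, D) and (C, B).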
24
24 3.6 Training (2/5)
–L = 5, g = 2, G = 4, & T = 5 for the lexical pattern extraction conditions.
–Distribution of patterns extracted from synonymous word pairs.
25
25 3.6 Training (3/5) Average similarity versus clustering threshold θ.
26
26 3.6 Training (4/5)
–The centroid vector of all feature vectors:
–The average Mahalanobis distance:
–|W|: the number of word pairs in W.
–C^{-1}: the inverse of the intercluster correlation matrix.
27
27 3.6 Training (5/5) Distribution of patterns extracted from nonsynonymous word pairs.
28
28 4. Experiments
1. Benchmark Data Sets
2. Semantic Similarity
3. Community Mining
29
29 5. Conclusion (1/3)
1. A semantic similarity measure for two words was proposed, using both page counts and snippets retrieved from a web search engine.
2. Four word co-occurrence measures were computed using page counts.
3. A lexical pattern extraction algorithm was proposed to extract numerous semantic relations that exist between two words.
30
30 5. Conclusion (2/3)
4. A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation.
5. Both page count-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.
6. A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.
31
31 5. Conclusion (3/3)
–Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings.
–The proposed method improved the F-score in a community mining task, underlining its usefulness in real-world tasks that include named entities not adequately covered by manually created resources.
32
32 The End~ Thanks for your attention!!