Slide 1

Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
WWW 2007 Paper Presentation, Zheshen Wang, May 8th, 2007

Topic: semantic similarity measures between two words

Why is this interesting?
- In information retrieval:
  - query expansion
  - automatic annotation of Web pages
  - community mining
- In natural language processing:
  - word-sense disambiguation
  - synonym extraction
  - language modeling
  - ...
Slide 2: Solution proposed

Use the information available on the Web: page counts + text snippets, with an SVM for an optimal combination.

Page counts (see the first sketch after this slide)
- Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI
- Modification to suppress random co-occurrences: score = 0 if H(P∩Q) < c, where H(x) is the page count for query x

Text snippets (context- and statistics-based)
- Lexico-syntactic pattern extraction, keeping the top 200 patterns by frequency,
  e.g. "Toyota and Nissan are two major Japanese car manufacturers."
- If a pattern occurs far more often in snippets for synonymous word pairs than in snippets for non-synonymous pairs, it is a reliable indicator of synonymy.

Combination (see the second sketch after this slide)
- 204-D feature vector F = [200 pattern frequencies, 4 co-occurrence measures]
- Two-class SVM trained on synonymous word pairs (positive) and non-synonymous word pairs (negative)
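The slide names the four page-count measures and the suppression rule but not their formulas. The sketch below uses the standard definitions of Jaccard, Overlap (Simpson), Dice, and PMI over page counts; get_page_count is a hypothetical helper wrapping a search engine API, and the default values of N (assumed index size) and c are placeholders rather than values given on the slide.

```python
import math

def co_occurrence_features(p, q, get_page_count, N=1e10, c=5):
    """Compute the four page-count-based similarity scores for words p and q.

    get_page_count(query) is a hypothetical helper returning the hit count a
    search engine reports for the query. N is the assumed size of the engine's
    index and c is the threshold below which co-occurrences are treated as
    random noise (all scores forced to 0); both defaults are placeholders.
    """
    h_p = get_page_count(p)
    h_q = get_page_count(q)
    h_pq = get_page_count(f'"{p}" "{q}"')   # conjunctive query P AND Q

    # Suppress random co-occurrences: H(P AND Q) < c  =>  all scores are 0
    if h_pq < c or h_p == 0 or h_q == 0:
        return {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}

    jaccard = h_pq / (h_p + h_q - h_pq)
    overlap = h_pq / min(h_p, h_q)          # Simpson coefficient
    dice = 2 * h_pq / (h_p + h_q)
    pmi = math.log2((h_pq / N) / ((h_p / N) * (h_q / N)))
    return {"jaccard": jaccard, "overlap": overlap, "dice": dice, "pmi": pmi}
```

These four scores form the last four entries of the 204-D feature vector described on the slide.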
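For the combination step, a second sketch shows how the 204-D feature vector could be assembled and fed to a two-class SVM. pattern_frequencies is a hypothetical helper returning the frequencies of the 200 selected lexico-syntactic patterns for a pair, co_occurrence_features and get_page_count are as in the previous sketch, and scikit-learn's SVC stands in for whatever SVM implementation the authors actually used.

```python
import numpy as np
from sklearn.svm import SVC

def pair_feature_vector(p, q, pattern_frequencies, get_page_count):
    """Assemble F = [200 pattern frequencies, 4 co-occurrence measures] for a word pair."""
    freqs = list(pattern_frequencies(p, q))                  # hypothetical: 200 values
    cooc = co_occurrence_features(p, q, get_page_count)      # 4 values, previous sketch
    return np.array(freqs + [cooc["jaccard"], cooc["overlap"], cooc["dice"], cooc["pmi"]])

def train_similarity_svm(train_pairs, pattern_frequencies, get_page_count):
    """Train the two-class SVM.

    train_pairs is a hypothetical list of ((word1, word2), label) tuples, with
    label 1 for synonymous (positive) and 0 for non-synonymous (negative) pairs.
    """
    X = np.vstack([pair_feature_vector(p, q, pattern_frequencies, get_page_count)
                   for (p, q), _ in train_pairs])
    y = np.array([label for _, label in train_pairs])
    clf = SVC(probability=True)   # kernel choice and probability output are assumptions
    clf.fit(X, y)
    return clf
```

At test time the classifier's positive-class score for a new pair can be read as the combined similarity; the slide itself only states that the SVM provides an optimal combination of the 204 features.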
Slide 3: My criticisms of the solution

Statistics- and context-based pattern selection is not reliable (no ontology or syntax templates):
- Sparse distribution
- Noise (meaningless patterns)
- Correlated patterns (e.g. "X and Y", "X and Y are", "X and Y are two")
- Meaningful patterns are missed because of the limited n-gram range (n = 2, 3, 4, 5) when X and Y are far apart, e.g. "Rose is a very popular flower in the US." (see the sketch after this slide)

Feature vector F = [200 pattern frequencies, 4 co-occurrence measures]:
- Error prone for uncommon words, e.g. rarely used professional terms
- The base set retrieved from the Web is too small to be reliable; as in the case of CBioC, user voting would be better.
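To make the n-gram-range criticism concrete, here is a rough sketch of lexico-syntactic pattern extraction as described on slide 2: the two query words in a snippet are replaced by the placeholders X and Y, and word n-grams (n = 2..5) containing both placeholders are collected. The exact extraction procedure in the paper may differ; this only illustrates why a pair separated by more than the n-gram span, as in the rose/flower example, yields no pattern at all.

```python
import re

def extract_patterns(snippet, word_a, word_b, n_range=range(2, 6)):
    """Collect word n-grams (n = 2..5) that contain both query words,
    with the query words replaced by the placeholders X and Y."""
    tokens = re.findall(r"\w+", snippet.lower())
    tokens = ["X" if t == word_a.lower() else "Y" if t == word_b.lower() else t
              for t in tokens]
    patterns = set()
    for n in n_range:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if "X" in gram and "Y" in gram:
                patterns.add(" ".join(gram))
    return patterns

# Patterns are found when the words are close together
# (here: "X and Y", "X and Y are", "X and Y are two" -- the correlated patterns above) ...
print(extract_patterns("Toyota and Nissan are two major Japanese car manufacturers.",
                       "Toyota", "Nissan"))
# ... but none when they are far apart, beyond the n-gram range:
print(extract_patterns("Rose is a very popular flower in the US.", "Rose", "flower"))
```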
Slide 4: How is it related to our course?

Web-based information extraction (knowledge extraction)
- Extracts base-level knowledge ("facts") directly from the Web via page counts (hits), e.g. KnowItAll
- Inevitable drawback: error prone for words that are uncommon on the Web, e.g. CBioC
- Makes use of the Collective Unconscious (Big Idea 3)

Analyzing term co-occurrences to capture semantic information
- Co-occurrence measures: similarity measured in terms of co-occurrence (Jaccard, Overlap (Simpson), PMI, ...)

Making use of context based on statistics
- Patterns come from context rather than from an ontology ("SemTag & Seeker").
- Patterns are decided by statistics rather than by templates from a syntax tree (generic extraction patterns, Hearst '92).
- n-grams around a word, somewhat like the "20-word window" of spot(l, c) in "SemTag & Seeker".
Slide 5

Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka