1
Comparing Word Relatedness Measures Based on Google n-grams
Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ
Faculty of Computer Science, Dalhousie University, Halifax, Canada
islam@cs.dal.ca, eem@cs.dal.ca, vlado@cs.dal.ca
COLING 2012
2
Introduction ● Word relatedness has a wide range of applications – IR: image retrieval, query expansion… – Paraphrase recognition – Malapropism detection and correction – Automatic creation of thesauri – Speech recognition – …
3
Introduction ● Methods can be categorized into three groups: – Corpus-based ● Supervised ● Unsupervised – Knowledge-based ● Semantic resources (e.g., WordNet) are used – Hybrid
4
Introduction ● This paper focuses on unsupervised corpus-based measures ● Six measures are compared
5
Problem ● Unsupervised corpus-based measures usually rely on co-occurrence statistics, mostly word n-grams and their frequencies – The co-occurrence statistics are corpus-specific – Most corpora do not come with co-occurrence statistics, so they cannot be used on-line – Some measures use web search results, but those results vary over time
6
Motivation ● How can different measures be compared fairly? ● Observation – All of these measures use co-occurrence statistics – A corpus with co-occurrence information, e.g. Google n-grams, is probably a good common resource
7
Google N-Grams ● A publicly available corpus with – Co-occurrence statistics (uni-grams to 5-grams) – A large volume of web text ● Digitized text from over 5.2 million books published since 1500 – Data format: ● ngram year match_count volume_count ● e.g.: – analysis is often described as 1991 1 1 1
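A minimal parsing sketch (not from the slides): assuming the corpus rows are tab-separated in the format shown above, the per-year match_count values can be summed to obtain the total frequency C(w1 … wn) of an n-gram. The function and variable names are illustrative.

from collections import defaultdict

def aggregate_ngram_counts(lines):
    """Sum match_count over all years for each n-gram.

    Each input line is assumed to follow the format shown on the slide:
    ngram TAB year TAB match_count TAB volume_count
    """
    counts = defaultdict(int)
    for line in lines:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        counts[ngram] += int(match_count)
    return counts

# The slide's example row contributes one occurrence of the 5-gram:
sample = ["analysis is often described as\t1991\t1\t1"]
print(dict(aggregate_ngram_counts(sample)))  # {'analysis is often described as': 1}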
8
Another Motivation ● To find an indirect mapping between Google n-grams and web search results – Thus, the measures might be usable on-line
9
How About WordNet? ● In 2006, Budanitsky and Hirst evaluated 5 knowledge-based measures using WordNet – Creating a resource like WordNet requires a lot of effort – Its word coverage is not sufficient for many NLP tasks – The resource is language-specific, while the Google n-gram corpus covers more than 10 languages
10
Notations ● C(w1 … wn) – frequency of the n-gram w1 … wn ● D(w1 … wn) – number of web documents containing the n-gram (n-grams up to 5-grams) ● M(w1, w2) – C(w1 wi w2), the frequency of tri-grams with w1 as the first word, w2 as the third word, and any word wi in between
11
Notations ● μ(w1, w2) – ½ [ C(w1 wi w2) + C(w2 wi w1) ], the symmetric counterpart of M(w1, w2) ● N – number of documents indexed in the Google n-gram corpus ● |V| – number of uni-grams in the Google n-gram corpus ● Cmax – maximum n-gram frequency in the Google n-gram corpus
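An illustrative sketch of the notation in code, assuming tri-gram counts are stored as a dictionary keyed by (first, middle, last) word triples. Summing C(w1 wi w2) over all middle words wi is an assumption made for this sketch; the slide only shows the expressions themselves.

def M(tri_counts, w1, w2):
    """M(w1, w2): total frequency C(w1 wi w2) over all middle words wi.
    Summing over wi is an assumption; the slide only shows C(w1 wi w2)."""
    return sum(c for (a, _, b), c in tri_counts.items() if a == w1 and b == w2)

def mu(tri_counts, w1, w2):
    """mu(w1, w2) = 1/2 [ C(w1 wi w2) + C(w2 wi w1) ], the symmetric version."""
    return 0.5 * (M(tri_counts, w1, w2) + M(tri_counts, w2, w1))

# tri_counts maps (first, middle, last) word triples to frequencies, e.g.:
tri_counts = {("car", "and", "automobile"): 120, ("automobile", "or", "car"): 80}
print(mu(tri_counts, "car", "automobile"))  # 100.0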
12
Assumptions ● Some measures use web search results and co-occurrence information that the Google n-gram corpus does not provide, but – C(w1) ≥ D(w1) – C(w1 w2) ≥ D(w1 w2) ● This is because uni-grams and bi-grams may occur multiple times in one document
13
Assumptions ● Considering the lower limits – C(w1) ≈ D(w1) – C(w1 w2) ≈ D(w1 w2)
14
Measures ● Jaccard Coefficient ● Simpson Coefficient
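The formulas on this slide were images and are missing from the transcript. Sketched below are the standard Jaccard and Simpson definitions, written over document counts D(·) as in the notation above; under the assumption C ≈ D they can be computed directly from Google n-gram frequencies. The exact variants compared in the paper may differ slightly.

def jaccard(d1, d2, d12):
    """Standard Jaccard coefficient: D(w1 w2) / (D(w1) + D(w2) - D(w1 w2))."""
    denom = d1 + d2 - d12
    return d12 / denom if denom else 0.0

def simpson(d1, d2, d12):
    """Standard Simpson (overlap) coefficient: D(w1 w2) / min(D(w1), D(w2))."""
    denom = min(d1, d2)
    return d12 / denom if denom else 0.0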
15
Measures ● Dice Coefficient ● Pointwise Mutual Information
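As with the previous slide, the formulas were images. For reference, the standard Dice and pointwise mutual information definitions are sketched below over document counts D(·) and corpus size N; the paper's exact formulations may differ.

import math

def dice(d1, d2, d12):
    """Standard Dice coefficient: 2 * D(w1 w2) / (D(w1) + D(w2))."""
    denom = d1 + d2
    return 2.0 * d12 / denom if denom else 0.0

def pmi(d1, d2, d12, n):
    """Standard pointwise mutual information:
    log2( (D(w1 w2) / N) / ((D(w1) / N) * (D(w2) / N)) )."""
    if not (d1 and d2 and d12):
        return 0.0
    return math.log2((d12 / n) / ((d1 / n) * (d2 / n)))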
16
Measures ● Normalized Google Distance (NGD) variation
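The NGD formula was also shown as an image. Below is the standard Normalized Google Distance of Cilibrasi and Vitányi, written over document counts, plus one common way of turning the distance into a relatedness score; the specific variation used in the paper is not visible in this transcript.

import math

def ngd(d1, d2, d12, n):
    """Normalized Google Distance (Cilibrasi and Vitanyi), written over
    document counts D(.) and corpus size N. Assumes all counts are > 0."""
    lo, hi = sorted((math.log(d1), math.log(d2)))
    return (hi - math.log(d12)) / (math.log(n) - lo)

def ngd_relatedness(d1, d2, d12, n):
    """One common conversion of the distance into a relatedness score;
    the exact variation used in the paper is not shown on the slide."""
    return math.exp(-2.0 * ngd(d1, d2, d12, n))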
17
Measures ● Relatedness based on Tri-grams (RT)
18
Evaluation ● Compare with human judgments – Human judgment is considered to be the upper limit ● Evaluate the measures with respect to a particular application – Relatedness of word pairs ● Text similarity
19
Compare With Human Judgments ● Rubenstein and Goodenough's 65 word pairs – 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0 ● Miller and Charles' 28 noun pairs – A subset of 30 R&G pairs, rated by 38 human judges – Most researchers use 28 of the pairs because 2 were omitted from an early version of WordNet
20
Result
22
Application-based Evaluation ● TOEFL's 80 synonym questions – Given a problem word, e.g. infinite, and four alternative words, limitless, relative, unusual, structural, choose the most related word ● ESL's 50 synonym questions – Same task as TOEFL's 80 synonym questions – Except that the questions come from English as a Second Language tests
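A small sketch of how a synonym question can be answered with any relatedness measure: score each alternative against the problem word and pick the highest. The function name and structure are illustrative, not taken from the paper.

def answer_synonym_question(problem_word, alternatives, relatedness):
    """Pick the alternative most related to the problem word.
    `relatedness` is any word relatedness function of two words."""
    return max(alternatives, key=lambda alt: relatedness(problem_word, alt))

# The TOEFL example from the slide (relatedness function left abstract here):
# answer_synonym_question("infinite",
#                         ["limitless", "relative", "unusual", "structural"],
#                         some_relatedness_measure)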
23
Result
25
Text Similarity ● Find the similarity between two text items ● Plug the different word relatedness measures into a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set ● 30 sentence pairs from one of the most widely used data sets were used
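A simplified illustration of plugging a word relatedness measure into a sentence-level similarity score: average each word's best relatedness to the other sentence and symmetrize. This is only a sketch under that assumption; it is not the text similarity measure of Islam and Inkpen (2008) used in the paper.

def text_similarity(words1, words2, relatedness):
    """Average each word's best relatedness to the other sentence,
    then symmetrize. Assumes both word lists are non-empty."""
    def directed(src, dst):
        return sum(max(relatedness(w, v) for v in dst) for w in src) / len(src)
    return 0.5 * (directed(words1, words2) + directed(words2, words1))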
26
Result
27
● Pearson correlation coefficient with mean human similarity ratings: – Ho et al. (2010), who used a WordNet-based measure and applied those scores in the method of Islam and Inkpen (2008), achieved 0.895 – Tsatsaronis et al. (2010) achieved 0.856 – Islam et al. (2012) achieved 0.916 ● The improvement over Ho et al. (2010) is statistically significant at the 0.05 level
28
Conclusion ● Any measure that uses n-gram statistics can easily be applied to the Google n-gram corpus and fairly evaluated against existing work on standard data sets for different tasks ● An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found using some assumptions
29
Conclusion ● Measures based on n-grams are language-independent – They can be implemented for other languages that have a sufficiently large n-gram corpus