Sentence Similarity Based on Semantic Nets and Corpus Statistics Authors: Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crockett Reporter: Tze Ho-Lin 2006/1/3 TKDE, 2006
Outline Motivation Objectives Methodology Evaluation Conclusion Personal Comments
Motivation Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains.
Objectives Use a lexical database so that the method can model human common-sense knowledge. Incorporate corpus statistics so that the method can adapt to different domains.
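Methodology The method draws on a lexical database (WordNet) and a corpus (Brown Corpus). In the diagram, the input on the far left is the two sentences, and the output on the far right is their similarity. How is that similarity computed? As the diagram shows, it combines semantic similarity with word order similarity.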
Evaluation No suitable benchmark data set currently exists for evaluating sentence similarity, so the authors compared their method's output with similarity ratings given by human participants. Agreement is reported as a similarity correlation (Pearson).
R&G No. | Word Pair | Human Mean | Algorithm Measure | Sentence Pair
59 | Cock, rooster | 0.86 | 0.999885 | A cock is an adult male chicken. / A rooster is an adult male chicken.
29 | Bird, woodland | 0.01 | 0.334691 | A bird is a creature with feathers and wings, female birds lay eggs and most can fly. / Woodland is land with a lot of trees.
56 | Coast, shore | 0.59 | 0.759004 | The coast is an area of land that is next to the sea. / The shores or shore of a sea, lake or wide river is the land along the edge of it.
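As a rough illustration of the Pearson comparison, the sketch below correlates the human means with the algorithm measures for the three example pairs on this slide. Three rows are far too few for a meaningful statistic, so the value it prints is not the correlation reported in the paper; the function name `pearson` is just an illustrative helper.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the three rows shown on the slide (R&G No. 59, 29, 56).
human = [0.86, 0.01, 0.59]
algorithm = [0.999885, 0.334691, 0.759004]

r = pearson(human, algorithm)
print(f"Pearson r over the three example pairs: {r:.3f}")
```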
Conclusion A method for measuring the semantic similarity between sentences, based on semantic and word order information. The proposed method provides similarity measures that are fairly consistent with human knowledge.
Personal Comments Application: categorical data similarity measure; knowledge retrieval. Advantage: the function has a simple, easy-to-implement formula. Disadvantage: the technique compares words on a word-by-word basis, and the proposed method does not currently conduct word sense disambiguation for polysemous words.
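Method: Semantic Similarity between Words The similarity between two words is computed from two quantities: path length and depth. The figure on the slide gives an example.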
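The slide's formula did not survive as text, so the following is only a sketch of how path length and depth might be combined. It assumes the form given in the published paper, an exponential decay in path length multiplied by a hyperbolic tangent of depth, with `alpha=0.2` and `beta=0.45` as assumed default parameters. In practice the path length and depth would come from WordNet; here they are passed in as plain numbers.

```python
import math


def word_similarity(path_length, depth, alpha=0.2, beta=0.45):
    """Combine path length and depth into a similarity in [0, 1).

    Assumed form: exp(-alpha * l) * tanh(beta * h), where l is the shortest
    path length between the two words and h is the depth of their common
    subsumer in the hierarchy. alpha and beta are assumed defaults.
    """
    length_term = math.exp(-alpha * path_length)  # shorter path -> closer to 1
    depth_term = math.tanh(beta * depth)          # deeper subsumer -> closer to 1
    return length_term * depth_term


# Hypothetical inputs: same concept deep in the hierarchy vs. distant, shallow pair.
print(word_similarity(path_length=0, depth=10))
print(word_similarity(path_length=6, depth=2))
```

Longer paths lower the score and deeper common subsumers raise it, which matches the two components named on the slide.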
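Method: Semantic Similarity between Sentences The semantic similarity of two sentences is the cosine of the angle between their semantic vectors. How are the entries of the vector s1 obtained? Each entry s_i is computed for one word in the joint word set and is weighted by the information content of that word and of its associated word in the sentence. The information content function I gives the corpus-based weight, where n is the frequency of the word w in the corpus and N is the total number of words in the corpus.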
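The formulas on this slide were images and are not recoverable from the text, so the sketch below assumes the definitions from the published paper: I(w) = 1 - log(n + 1) / log(N + 1), each vector entry equal to the raw similarity times the information content of the joint-set word and of its associated word, and the cosine of the two weighted vectors as the final score. The function names and the example counts are illustrative assumptions.

```python
import math


def information_content(n, total):
    """Assumed form I(w) = 1 - log(n + 1) / log(N + 1).

    n is the frequency of the word in the corpus, total is N, the total
    number of words in the corpus (must be positive).
    """
    return 1.0 - math.log(n + 1) / math.log(total + 1)


def weighted_entry(raw_similarity, n_word, n_associated, total):
    """One semantic-vector entry: raw similarity weighted by both words' I values."""
    return (raw_similarity
            * information_content(n_word, total)
            * information_content(n_associated, total))


def cosine_similarity(s1, s2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm1 * norm2)


# Hypothetical counts: a rare word carries more weight than a frequent one.
N = 1_000_000
print(information_content(5, N), information_content(50_000, N))
print(cosine_similarity([1.0, 0.8, 0.0], [1.0, 0.0, 0.6]))
```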
Method: Word Order Similarity between Sentences T1: RAM keeps things being worked with. T2: The CPU uses RAM as a short-term memory store. Joint word set T={RAM keeps things being worked with The CPU uses as a short-term memory store}. If a word in T does not appear in the sentence, the index of the most similar word in the sentence is used in its place.
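A minimal sketch of the word order step, using the two sentences from the slide. It assumes the paper's measure 1 - ||r1 - r2|| / ||r1 + r2|| over the word order vectors. The word similarity function is replaced here by a plain exact-match stand-in, and the `threshold=0.4` default is an assumption, so words missing from a sentence get index 0 in this example instead of the index of a similar word.

```python
import math


def joint_word_set(t1, t2):
    """Distinct words of both sentences, in order of first appearance."""
    return list(dict.fromkeys(t1 + t2))


def order_vector(sentence, joint, sim=None, threshold=0.4):
    """For each joint-set word, the 1-based index of the matching word in the sentence.

    If the word is absent, use the index of the most similar word when that
    similarity exceeds the threshold, otherwise 0. `sim` defaults to an
    exact-match stand-in (an assumption, not a WordNet-based measure).
    """
    sim = sim or (lambda a, b: 1.0 if a == b else 0.0)
    vector = []
    for word in joint:
        if word in sentence:
            vector.append(sentence.index(word) + 1)
            continue
        best_index, best_score = 0, 0.0
        for position, candidate in enumerate(sentence, start=1):
            score = sim(word, candidate)
            if score > best_score:
                best_index, best_score = position, score
        vector.append(best_index if best_score > threshold else 0)
    return vector


def order_similarity(r1, r2):
    """Assumed measure: 1 - ||r1 - r2|| / ||r1 + r2||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0


t1 = "RAM keeps things being worked with".split()
t2 = "The CPU uses RAM as a short-term memory store".split()
joint = joint_word_set(t1, t2)
r1 = order_vector(t1, joint)
r2 = order_vector(t2, joint)
print(r1)
print(r2)
print(order_similarity(r1, r2))
```

Passing a WordNet-based similarity function as `sim` would fill some of the zeros with the index of the closest word, as the slide note describes.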
Process for Deriving the Semantic Vector (preset threshold: 0.2) T1: RAM keeps things being worked with. T2: The CPU uses RAM as a short-term memory store. T={RAM keeps things being worked with The CPU uses as a short-term memory store}.
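The derivation can be sketched as follows for the two example sentences, before any corpus weighting is applied. A word of the joint set that appears in the sentence gets 1; otherwise it gets its highest similarity to any word in the sentence, or 0 if that value does not exceed the preset threshold of 0.2. The similarity scores in `TOY_SIM` are invented for illustration only; a real implementation would use the WordNet-based word similarity.

```python
def semantic_vector(sentence, joint, sim, threshold=0.2):
    """Raw (unweighted) semantic vector of a sentence over the joint word set."""
    vector = []
    for word in joint:
        if word in sentence:
            vector.append(1.0)  # the word itself occurs in the sentence
        else:
            # Best similarity to any word in the sentence, kept only above threshold.
            best = max((sim(word, candidate) for candidate in sentence), default=0.0)
            vector.append(best if best > threshold else 0.0)
    return vector


# Hypothetical similarity scores, made up for this example.
TOY_SIM = {
    frozenset(("uses", "keeps")): 0.3,
    frozenset(("memory", "ram")): 0.8,
    frozenset(("cpu", "ram")): 0.1,  # below the 0.2 threshold, so it is dropped
}


def toy_sim(a, b):
    """Stand-in word similarity: 1 for identical words, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


t1 = "ram keeps things being worked with".split()
t2 = "the cpu uses ram as a short-term memory store".split()
joint = list(dict.fromkeys(t1 + t2))  # distinct words, first-appearance order

s1 = semantic_vector(t1, joint, toy_sim)
s2 = semantic_vector(t2, joint, toy_sim)
print(s1)
print(s2)
```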
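Overall sentence similarity The formula contains a weighting parameter (δ in the paper) that controls the relative contributions of semantic similarity and word order similarity.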
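A small sketch of the combination step, assuming the convex form delta * S_s + (1 - delta) * S_r from the published paper. The default `delta=0.85` is an assumption, and the input scores in the usage lines are made-up numbers.

```python
def overall_similarity(semantic_sim, order_sim, delta=0.85):
    """Weighted combination of semantic and word order similarity.

    delta in [0, 1] sets how much the semantic part counts; the remainder
    (1 - delta) goes to word order. The default value is an assumption.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * semantic_sim + (1.0 - delta) * order_sim


# Hypothetical component scores for one sentence pair.
print(overall_similarity(semantic_sim=0.70, order_sim=0.20))
print(overall_similarity(semantic_sim=0.70, order_sim=0.20, delta=0.5))
```

With a larger delta, sentences that share meaning but differ in word order still score high; a smaller delta makes the ordering matter more.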