Sentence Similarity Based on Semantic Nets and Corpus Statistics


1 Sentence Similarity Based on Semantic Nets and Corpus Statistics
Authors: Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crockett
Reporter: Tze Ho-Lin, 2006/1/3
TKDE, 2006

2 Outline
Motivation, Objectives, Methodology, Evaluation, Conclusion, Personal Comments

3 Motivation
Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains.

4 Objectives
Use a lexical database so that the method can model human common-sense knowledge. Incorporate corpus statistics so that the method is adaptable to different domains.

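5 Methodology
Figure labels: WordNet, Brown Corpus.
The far left of the diagram is the input, namely the two sentences, and the far right is the output, the similarity of the two sentences. How is the similarity computed? As the figure shows, it combines semantic similarity with word order similarity.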

6 Evaluation
Similarity correlation (Pearson)
Table columns on the slide: R&G No., Word Pair, Human Means, Algorithm Measure, Sentence Pair. One similarity value per row appears in this transcript.
R&G No. 59, cock / rooster, 0.86: "A cock is an adult male chicken." / "A rooster is an adult male chicken."
R&G No. 29, bird / woodland, 0.01: "A bird is a creature with feathers and wings, female birds lay eggs and most can fly." / "Woodland is land with a lot of trees."
R&G No. 56, coast / shore, 0.59: "The coast is an area of land that is next to the sea." / "The shores or shore of a sea, lake or wide river is the land along the edge of it."
Because no suitable benchmark data set currently exists for evaluating sentence similarity, the authors compared the method's output with similarity ratings given by human participants.
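The comparison on this slide relies on the Pearson correlation between two lists of scores (human ratings and algorithm output). A minimal sketch of that statistic follows; the function name `pearson` is hypothetical and no figures from the paper are used.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would mean the algorithm ranks and spaces sentence pairs much as the human raters do.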

7 Conclusion
A method for measuring the semantic similarity between sentences, based on semantic and word order information. The proposed method provides similarity measures that are fairly consistent with human knowledge.

8 Personal Comments
Application: Categorical data similarity measure. Knowledge retrieval.
Advantage: This function has a simple, easy-to-implement formula.
Disadvantage: This technique compares words on a word-by-word basis. The proposed method does not currently conduct word sense disambiguation for polysemous words.

9 Method: Semantic Similarity between Words
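The similarity between two words is computed from two factors, path length and depth; the slide illustrates this with an example figure.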
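The formula itself did not survive in this transcript. As a minimal sketch, assuming the form given in the published paper, where similarity decays exponentially with the path length l between two words in the semantic net and grows with the depth h of their common subsumer through a tanh term, the computation might look like this. The function name is hypothetical, and `alpha` and `beta` are tuning constants whose values the slide does not give.

```python
import math


def word_similarity(path_length: int, depth: int,
                    alpha: float, beta: float) -> float:
    """Assumed form: s(w1, w2) = exp(-alpha * l) * tanh(beta * h)."""
    # Shorter paths between the two words give a larger length term.
    length_term = math.exp(-alpha * path_length)
    # Deeper (more specific) common subsumers give a larger depth term.
    depth_term = math.tanh(beta * depth)
    return length_term * depth_term
```

Both terms lie in [0, 1] for non-negative inputs, so the product does too.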

10 Method: Semantic Similarity between Sentences
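Sentence semantic similarity is the cosine of the angle between the two sentences' semantic vectors. How are the numbers in s1 obtained? Each is computed by the Si formula, in which the I function supplies a weight derived from the corpus. In the formula for one element, one symbol denotes a word in the joint word set and the other denotes its associated word in the sentence. n is the frequency of the word w in the corpus. N is the total number of words in the corpus.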
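The pieces named on this slide (the corpus weight I, the vector entries Si, and the cosine between the two vectors) can be sketched as below. This assumes the information-content form I(w) = 1 - log(n + 1) / log(N + 1) from the published paper, since the formula image is not part of this transcript; all function names are hypothetical.

```python
import math
from typing import Sequence


def information_content(n: int, total: int) -> float:
    """Assumed form: I(w) = 1 - log(n + 1) / log(N + 1).

    n is the frequency of the word in the corpus, total is N,
    the total number of words in the corpus.
    """
    return 1.0 - math.log(n + 1) / math.log(total + 1)


def weighted_entry(raw: float, ic_joint_word: float,
                   ic_sentence_word: float) -> float:
    """One entry Si: the raw score weighted by both words' information content."""
    return raw * ic_joint_word * ic_sentence_word


def cosine_similarity(s1: Sequence[float], s2: Sequence[float]) -> float:
    """Cosine of the angle between two semantic vectors of equal length."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat an all-zero vector as no similarity
    return dot / (norm1 * norm2)
```

Under this weighting, rare words (small n) contribute more to the vector than very frequent ones.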

11 Method: Word Order Similarity between Sentences
T1: RAM keeps things being worked with.
T2: The CPU uses RAM as a short-term memory store.
T = {RAM keeps things being worked with The CPU uses as a short-term memory store}
If a word in T does not appear in the sentence, the index of the most similar word in the sentence is used in its place.
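The word order vectors for T1 and T2 can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, the word-to-word similarity function `sim` is passed in by the caller (an exact-match stand-in is provided), the `threshold` for accepting a substitute word is a parameter whose value this slide does not give, and the final score assumes the paper's form Sr = 1 - ||r1 - r2|| / ||r1 + r2||.

```python
import math
from typing import Callable, List, Sequence


def joint_word_set(t1: Sequence[str], t2: Sequence[str]) -> List[str]:
    """Distinct words of T1 followed by the new words of T2, in order."""
    joint: List[str] = []
    for word in list(t1) + list(t2):
        if word not in joint:
            joint.append(word)
    return joint


def exact_match(a: str, b: str) -> float:
    """Stand-in similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def order_vector(joint: Sequence[str], sentence: Sequence[str],
                 sim: Callable[[str, str], float],
                 threshold: float) -> List[int]:
    """1-based position in `sentence` for each word of the joint set."""
    vector: List[int] = []
    for word in joint:
        if word in sentence:
            vector.append(sentence.index(word) + 1)
            continue
        # Word is absent: fall back to the index of the most similar word.
        best_index, best_score = 0, 0.0
        for position, candidate in enumerate(sentence, start=1):
            score = sim(word, candidate)
            if score > best_score:
                best_index, best_score = position, score
        vector.append(best_index if best_score > threshold else 0)
    return vector


def word_order_similarity(r1: Sequence[int], r2: Sequence[int]) -> float:
    """Assumed form: 1 - ||r1 - r2|| / ||r1 + r2||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    if total == 0:
        return 0.0  # assumption: undefined case reported as 0.0
    return 1.0 - diff / total
```

With a real word similarity measure in place of `exact_match`, words of T that are missing from a sentence would receive the position of their closest counterpart rather than 0.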

12 Process for Deriving the Semantic Vector
(preset threshold: 0.2)
T1: RAM keeps things being worked with.
T2: The CPU uses RAM as a short-term memory store.
T = {RAM keeps things being worked with The CPU uses as a short-term memory store}
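The derivation process can be sketched as below, under the assumption (taken from the published paper, since the slide's flowchart is not in this transcript) that a word of T present in the sentence scores 1, and an absent word scores its highest similarity to any word of the sentence if that exceeds the preset threshold, else 0. The function name and the caller-supplied `sim` function are hypothetical.

```python
from typing import Callable, List, Sequence


def semantic_vector(joint: Sequence[str], sentence: Sequence[str],
                    sim: Callable[[str, str], float],
                    threshold: float = 0.2) -> List[float]:
    """Raw semantic vector of `sentence` over the joint word set."""
    vector: List[float] = []
    for word in joint:
        if word in sentence:
            # The word itself occurs in the sentence.
            vector.append(1.0)
            continue
        # Otherwise take the best similarity to any word in the sentence.
        best = max((sim(word, other) for other in sentence), default=0.0)
        vector.append(best if best > threshold else 0.0)
    return vector
```

The threshold keeps weak, noisy word matches from contributing to the vector.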

13 Overall sentence similarity
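The formula contains a parameter (written "thida" in the slide notes, likely theta) that controls the relative weight of semantic similarity and word order similarity.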
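As a minimal sketch, assuming the overall score is a weighted sum of the two component similarities (the form used in the published paper; the formula image is missing from this transcript), the combination might be written as follows. The names `overall_similarity` and `weight` are hypothetical, and the slide gives no value for the weight.

```python
def overall_similarity(semantic_sim: float, order_sim: float,
                       weight: float) -> float:
    """Assumed form: weight * semantic_sim + (1 - weight) * order_sim."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # A weight of 1.0 ignores word order; 0.0 ignores semantics.
    return weight * semantic_sim + (1.0 - weight) * order_sim
```

A weight above 0.5 makes semantic information count for more than word order.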

