Measuring the Semantic Similarity of Texts. Authors: Courtney Corley and Rada Mihalcea. Source: ACL-2005. Reporter: Yong-Xiang Chen.


1. Measuring the Semantic Similarity of Texts
- Authors: Courtney Corley and Rada Mihalcea
- Source: ACL-2005
- Reporter: Yong-Xiang Chen

2. Outline
- Introduction
- Semantic Similarity of Words
- Semantic Similarity of Texts
- A Walk-Through Example
- Evaluation
- Conclusion

3. Introduction
- Measures of text similarity have been used for information retrieval (IR), text classification, word sense disambiguation (WSD), automatic evaluation of machine translation, and text summarization.
- The typical approach is to use a simple lexical matching method and produce a similarity score.
- But most lexical-matching metrics fail on pairs of texts such as "I own a dog" and "I have an animal", which are close in meaning yet share almost no words.

4. Introduction (cont.)
- Latent semantic analysis (LSA) measures similarity between texts by also including similar terms found in large text collections.
- In this paper, we explore a knowledge-based method for measuring the semantic similarity of texts.
- There are several methods for finding the semantic similarity of words.
- We combine these methods into a text-to-text semantic similarity method.

5. Semantic Similarity of Words
- The Leacock & Chodorow similarity (Leacock and Chodorow, 1998)
  - length: the length of the shortest path between two concepts
  - D: the maximum depth of the taxonomy
- The Wu and Palmer similarity (Wu and Palmer, 1994)
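The two path-based measures named on this slide can be sketched as plain functions. This is a minimal sketch that assumes the standard published forms of the measures; the function and parameter names are illustrative, and the path length and depths are passed in directly rather than looked up in WordNet.

```python
import math


def leacock_chodorow(path_length: int, max_depth: int) -> float:
    """Leacock & Chodorow: -log(length / (2 * D)).

    path_length: length of the shortest path between the two concepts
    max_depth:   maximum depth D of the taxonomy
    """
    if path_length <= 0 or max_depth <= 0:
        raise ValueError("path_length and max_depth must be positive")
    return -math.log(path_length / (2.0 * max_depth))


def wu_palmer(depth_lcs: int, depth_c1: int, depth_c2: int) -> float:
    """Wu & Palmer: 2 * depth(LCS) / (depth(c1) + depth(c2)).

    depth_lcs is the depth of the least common subsumer of the two concepts.
    """
    if depth_c1 + depth_c2 <= 0:
        raise ValueError("concept depths must sum to a positive value")
    return 2.0 * depth_lcs / (depth_c1 + depth_c2)
```

Shorter paths give a higher Leacock & Chodorow score, and the Wu and Palmer score reaches 1 when both concepts coincide with their common subsumer.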

6. Semantic Similarity of Words (cont.)
- The information content (IC) of the least common subsumer (LCS)
  - P(c): the probability of encountering an instance of concept c in a large corpus
- Lin's metric (Lin, 1998)
- Jiang & Conrath (Jiang and Conrath, 1997)
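The information-content measures on this slide can be sketched in the same way. This is a minimal sketch assuming the standard published definitions, with IC(c) = -log P(c); the concept probabilities are passed in as plain numbers, and all function names are illustrative.

```python
import math


def information_content(probability: float) -> float:
    """IC(c) = -log P(c), where P(c) is the corpus probability of concept c."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


def ic_of_lcs(p_lcs: float) -> float:
    """Similarity as the information content of the LCS (Resnik's measure)."""
    return information_content(p_lcs)


def lin(p_lcs: float, p_c1: float, p_c2: float) -> float:
    """Lin: 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    denominator = information_content(p_c1) + information_content(p_c2)
    if denominator == 0.0:
        return 0.0  # both concepts carry no information (P = 1)
    return 2.0 * information_content(p_lcs) / denominator


def jiang_conrath(p_lcs: float, p_c1: float, p_c2: float) -> float:
    """Jiang & Conrath: 1 / (IC(c1) + IC(c2) - 2 * IC(LCS))."""
    distance = (information_content(p_c1) + information_content(p_c2)
                - 2.0 * information_content(p_lcs))
    # Identical concepts have zero distance; report that as infinite similarity.
    return math.inf if distance == 0.0 else 1.0 / distance
```

Rarer concepts have higher information content, so two words that share a specific subsumer score higher than two words that only share a very general one.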

7. Language Models
- Language models are used to account for the distribution of words in language.
- We take into account the specificity of words.
  - For example, specific words such as collie and sheepdog receive a higher weight, while generic words such as go and be are given less importance.
- Term frequency (TF) does not always constitute a good measure of word importance.
- The distribution of words across an entire collection can be a good indicator of the specificity of the words: inverse document frequency (IDF).
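The idea of weighting words by their specificity can be illustrated with a small IDF computation. This sketch assumes one common formulation, idf(w) = log(N / df(w)); the slides do not state the exact formula or corpus, and the toy documents below are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def inverse_document_frequency(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Return idf(w) = log(N / df(w)) for every word seen in the collection."""
    docs = list(documents)
    document_frequency: Counter = Counter()
    for doc in docs:
        # Count each word once per document, however often it occurs there.
        document_frequency.update(set(doc))
    n_docs = len(docs)
    return {word: math.log(n_docs / df) for word, df in document_frequency.items()}


# Toy collection (hypothetical): "go" occurs everywhere, "collie" only once.
toy_docs = [["collie", "go"], ["go", "be"], ["be", "go", "sheepdog"]]
toy_idf = inverse_document_frequency(toy_docs)
```

In this toy collection the generic word "go" gets a weight of zero, while "collie" and "sheepdog" get the highest weight.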

8. Semantic Similarity of Texts
- A directional measure of semantic similarity indicates the semantic similarity of a text segment T_i with respect to a text segment T_j.
- Build sets of open-class words for each text: nouns (N), verbs (V), adjectives (Adj), and adverbs (Adv).
- Determine pairs of similar words across the sets corresponding to the same open class in the two texts.
- For nouns and verbs, we use a measure based on WordNet.
- Apply lexical matching to the other word classes.

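9. Semantic Similarity of Texts (cont.)
- maxSim: for each word, the highest semantic similarity to a word of the same class in the other text, according to one of the six word similarity methods.
- The resulting score, computed with respect to T_i, is between 0 and 1.
- If this similarity measure results in a score greater than 0, then the word is added to the set of similar words for the corresponding word class, WS_pos.
- A bidirectional similarity combines the two directional scores.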
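The directional and bidirectional measures described above can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes each word's maxSim is weighted by its idf and normalised by the total idf of the words in T_i, and that the bidirectional score is the average of the two directions. The word similarity function is supplied by the caller, and `TOY_SCORES` is an invented lookup table used only for illustration.

```python
from typing import Callable, Dict, List

WordSets = Dict[str, List[str]]  # word class (e.g. "N", "V") -> words in that class


def directional_similarity(text_i: WordSets, text_j: WordSets,
                           word_sim: Callable[[str, str], float],
                           idf: Dict[str, float]) -> float:
    """Similarity of text_i with respect to text_j (assumed idf-weighted form)."""
    weighted_sum = 0.0
    idf_total = 0.0
    for pos, words in text_i.items():
        candidates = text_j.get(pos, [])  # only compare within the same word class
        for word in words:
            weight = idf.get(word, 1.0)  # assumption: unknown words get weight 1
            idf_total += weight
            max_sim = max((word_sim(word, other) for other in candidates), default=0.0)
            if max_sim > 0:  # the word joins WS_pos only if maxSim > 0
                weighted_sum += max_sim * weight
    return weighted_sum / idf_total if idf_total > 0 else 0.0


def bidirectional_similarity(text_i: WordSets, text_j: WordSets,
                             word_sim: Callable[[str, str], float],
                             idf: Dict[str, float]) -> float:
    """Average of the two directional scores."""
    return 0.5 * (directional_similarity(text_i, text_j, word_sim, idf)
                  + directional_similarity(text_j, text_i, word_sim, idf))


# Hypothetical word-to-word scores, standing in for a WordNet-based measure.
TOY_SCORES = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.6}


def toy_word_sim(a: str, b: str) -> float:
    return 1.0 if a == b else TOY_SCORES.get(frozenset((a, b)), 0.0)


text_1 = {"N": ["dog"], "V": ["own"]}
text_2 = {"N": ["animal"], "V": ["have"]}
score = bidirectional_similarity(text_1, text_2, toy_word_sim, idf={})
```

With the invented scores above, "I own a dog" and "I have an animal" receive a clearly non-zero similarity even though no open-class word is shared.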

10. A Walk-Through Example
- First, the text segments are tokenized and POS-tagged.
- The words are inserted into word-class sets.

11. A Walk-Through Example (cont.)
- We seek a WordNet-based semantic similarity for nouns and verbs.
- Only lexical matching is used for adjectives, adverbs, and cardinals.

12. A Walk-Through Example (cont.)
- The semantic similarity with respect to text 1 is 0.6702.
- The semantic similarity with respect to text 2 is 0.7202.
- The bidirectional measure of similarity is 0.6952.
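The bidirectional score reported on this slide is consistent with taking the average of the two directional scores, as the following arithmetic check shows (the symbols are illustrative):

```latex
\[
\mathrm{sim}(T_1, T_2)
  = \frac{\mathrm{sim}(T_1, T_2)_{T_1} + \mathrm{sim}(T_1, T_2)_{T_2}}{2}
  = \frac{0.6702 + 0.7202}{2}
  = \frac{1.3904}{2}
  = 0.6952
\]
```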

13. Evaluation
- Goal: test the effectiveness of the text semantic similarity metric.
- Task: automatically identify whether two text segments are paraphrases of each other.
- Corpora:
  - The Microsoft paraphrase corpus: 4,076 training pairs and 1,725 test pairs
  - PASCAL corpus: 580 development pairs and 800 test pairs
- Two settings:
  - An unsupervised setting: a threshold of 0.5
  - A supervised setting: the optimal threshold and the weights associated with the various similarity methods are determined through learning on training data
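The unsupervised setting amounts to a single threshold decision per text pair. A minimal sketch, assuming a pair is labelled a paraphrase when its score is strictly greater than the threshold (the slides do not say whether the comparison is strict); the scores and gold labels below are invented for illustration.

```python
from typing import Sequence


def is_paraphrase(score: float, threshold: float = 0.5) -> bool:
    """Label a pair as a paraphrase when its similarity exceeds the threshold."""
    return score > threshold


def accuracy(scores: Sequence[float], gold: Sequence[bool],
             threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded prediction matches the gold label."""
    if len(scores) != len(gold):
        raise ValueError("scores and gold labels must have the same length")
    if not scores:
        raise ValueError("at least one pair is required")
    correct = sum(is_paraphrase(s, threshold) == g for s, g in zip(scores, gold))
    return correct / len(scores)


# Hypothetical similarity scores and gold labels for four text pairs.
toy_scores = [0.6952, 0.30, 0.51, 0.20]
toy_gold = [True, False, False, False]
toy_accuracy = accuracy(toy_scores, toy_gold)
```

In the supervised setting the same decision rule would apply, with the threshold learned from the training pairs instead of being fixed at 0.5.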

14. Evaluation (cont.)
- Three baselines:
  - Randomly choosing a true or false value for each text pair
  - Lexical matching, which counts the number of matching words
  - Using tf * idf
- Paraphrase identification: "The dog is eating the bone" -> "The bone is being eaten by the dog"
- Entailment identification: "I can see a dog" -> "I can see an animal"
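The lexical matching baseline can be illustrated with a simple word-overlap count. This is a sketch only: it assumes whitespace tokenisation and case-insensitive matching, and it does not apply any normalisation, none of which is specified on the slide.

```python
def matching_word_count(text_a: str, text_b: str) -> int:
    """Count the distinct words that occur in both texts (case-insensitive)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    return len(tokens_a & tokens_b)


# The pair from the introduction shares only the word "I".
intro_overlap = matching_word_count("I own a dog", "I have an animal")

# The paraphrase pair shares most of its words.
paraphrase_overlap = matching_word_count(
    "The dog is eating the bone",
    "The bone is being eaten by the dog",
)
```

The first pair is close in meaning but overlaps in a single word, which is the weakness of pure lexical matching that the semantic measure is meant to address.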

15. Evaluation (cont.)

16. Evaluation (cont.)

17. Conclusion
- The accuracy of text semantic similarity for paraphrase identification (68.8%, 71.5%).
- For the entailment data set, the accuracy of 58.3% is better than the PASCAL entailment evaluation (Dagan et al., 2005).
- Our method relies on a bag-of-words approach.
  - It improves significantly over the traditional methods.
  - But it ignores many important relationships in sentence structure.

