Published by Tamsyn Wood. Modified over 8 years ago.
1 Measuring the Semantic Similarity of Texts
Authors: Courtney Corley and Rada Mihalcea
Source: ACL-2005
Reporter: Yong-Xiang Chen
2 Outline
Introduction
Semantic Similarity of Words
Semantic Similarity of Texts
A Walk-Through Example
Evaluation
Conclusion
3 Introduction
Measures of text similarity have been used for information retrieval (IR), text classification, word sense disambiguation (WSD), automatic evaluation of machine translation, and text summarization.
The typical approach is to use a simple lexical matching method and produce a similarity score.
But most text similarity metrics fail on text pairs such as:
  I own a dog
  I have an animal
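The failure on this pair can be seen with a minimal sketch of lexical matching. The function name `lexical_overlap` and the choice of Jaccard overlap on lowercased whitespace tokens are assumptions for illustration; the slides do not say which lexical matching score is meant.

```python
def lexical_overlap(text_a, text_b):
    """Jaccard overlap between the word sets of two texts (illustrative only)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    # Shared words divided by all distinct words in either text.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# The two sentences share only the word "i", so the score stays low
# even though "own a dog" and "have an animal" are closely related.
score = lexical_overlap("I own a dog", "I have an animal")
print(round(score, 4))
```

Only one of the seven distinct words is shared, so a purely lexical score treats the two sentences as nearly unrelated.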
4 Introduction (cont.)
Latent semantic analysis (LSA) measures the similarity between texts by including similar terms from large text collections.
In this paper, we explore a knowledge-based method for measuring the semantic similarity of texts.
There are several methods for finding the semantic similarity of words.
We combine these methods into a text-to-text semantic similarity method.
5 Semantic Similarity of Words
The Leacock & Chodorow similarity (Leacock and Chodorow, 1998)
  length: the length of the shortest path between two concepts
  D: the maximum depth of the taxonomy
The Wu and Palmer similarity (Wu and Palmer, 1994)
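The formulas on this slide did not survive in the transcript. The standard definitions of the two measures are sketched below; the names sim_lch, sim_wup, c_1, c_2 and the use of LCS for the least common subsumer of the two concepts are notational assumptions, not taken from the slide.

```latex
\mathrm{sim}_{lch}(c_1, c_2) = -\log \frac{\mathit{length}(c_1, c_2)}{2D}

\mathrm{sim}_{wup}(c_1, c_2) =
  \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}
       {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
```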
6 Semantic Similarity of Words (cont.)
The information content (IC) of the least common subsumer (LCS)
  P(c): the probability of encountering an instance of concept c in a large corpus
Lin’s metric (Lin, 1998)
Jiang & Conrath (Jiang and Conrath, 1997)
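The information-content measures named on this slide can be sketched as follows, using their standard definitions with IC(c) = -log P(c). The function names and the concept probabilities in the example are made up for illustration; real values would come from corpus counts over a taxonomy such as WordNet.

```python
import math


def information_content(probability):
    """IC(c) = -log P(c) for a concept probability in (0, 1]."""
    return -math.log(probability)


def sim_ic_of_lcs(ic_lcs):
    """Similarity taken directly as the IC of the least common subsumer."""
    return ic_lcs


def sim_lin(ic_c1, ic_c2, ic_lcs):
    """Lin: 2 * IC(LCS) / (IC(c1) + IC(c2)). Undefined if both ICs are 0."""
    return 2.0 * ic_lcs / (ic_c1 + ic_c2)


def sim_jiang_conrath(ic_c1, ic_c2, ic_lcs):
    """Jiang & Conrath: 1 / (IC(c1) + IC(c2) - 2 * IC(LCS)).

    Undefined when the distance in the denominator is 0 (identical concepts).
    """
    return 1.0 / (ic_c1 + ic_c2 - 2.0 * ic_lcs)


# Hypothetical probabilities: two specific concepts and a more general
# concept that subsumes both of them.
ic_dog = information_content(0.001)
ic_cat = information_content(0.002)
ic_carnivore = information_content(0.01)

lin_score = sim_lin(ic_dog, ic_cat, ic_carnivore)
jcn_score = sim_jiang_conrath(ic_dog, ic_cat, ic_carnivore)
```

A rarer (more specific) subsumer has higher information content, so two concepts that meet low in the taxonomy score as more similar than two that only meet near the root.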
7 Language Models
Language models are used to account for the distribution of words in language.
We take into account the specificity of words.
  For example, collie and sheepdog receive a higher weight, while go and be are given less importance.
Term frequency (TF) does not always constitute a good measure of word importance.
The distribution of words across an entire collection can be a good indicator of the specificity of the words: inverse document frequency (IDF).
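As a rough illustration of how IDF captures specificity, the sketch below uses one common formulation, log(N / df(w)). The slide does not state which IDF variant the paper uses, so the formula, the function name, and the toy collection are all assumptions.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df counts documents containing it.

    `documents` is a list of token lists; N is the number of documents.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Toy collection, invented for this example.
docs = [
    ["the", "collie", "is", "a", "dog"],
    ["we", "go", "to", "the", "park"],
    ["dogs", "go", "to", "the", "vet"],
    ["the", "sheepdog", "can", "go", "far"],
]
idf = inverse_document_frequency(docs)

# Rare, specific words get a higher weight than frequent, general ones.
print(idf["collie"] > idf["go"] > idf["the"])
```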
8 Semantic Similarity of Texts
A directional measure of semantic similarity indicates the semantic similarity of a text segment T_i with respect to a text segment T_j.
Build sets of open-class words for each text: nouns (N), verbs (V), adjectives (Adj), and adverbs (Adv).
Determine pairs of similar words across the sets corresponding to the same open class in the two texts.
For nouns and verbs, we use a measure based on WordNet.
For the other word classes, we apply lexical matching.
9 Semantic Similarity of Texts (cont.)
maxSim: for each word, the highest semantic similarity to a word of the same class in the other text, according to one of the six word similarity measures.
The score is between 0 and 1 with respect to T_i.
If maxSim is greater than 0, the word is added to the set of similar words for the corresponding word class, WS_pos.
A bidirectional similarity is obtained by averaging the two directional scores.
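The scoring formula on this slide did not survive in the transcript. The sketch below follows the description as an idf-weighted average of maxSim over the words of T_i, and takes the bidirectional score as the mean of the two directions, which agrees with the walk-through numbers. The function names, the default weight of 1.0 for words without an idf value, and the toy word similarity table are assumptions for illustration.

```python
def directional_similarity(text_i, text_j, word_sim, idf):
    """Similarity of text_i and text_j with respect to text_i.

    text_i, text_j: dict mapping a word class ("N", "V", ...) to a word list.
    word_sim: function (word_a, word_b, word_class) -> score in [0, 1].
    idf: dict of word weights; words missing from it get weight 1.0.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for word_class, words in text_i.items():
        candidates = text_j.get(word_class, [])
        for word in words:
            weight = idf.get(word, 1.0)
            total_weight += weight
            # maxSim: best match among same-class words of the other text.
            max_sim = max(
                (word_sim(word, c, word_class) for c in candidates),
                default=0.0,
            )
            # Words with maxSim == 0 contribute nothing to the numerator.
            weighted_sum += max_sim * weight
    return weighted_sum / total_weight if total_weight else 0.0


def bidirectional_similarity(text_i, text_j, word_sim, idf):
    """Mean of the two directional scores."""
    forward = directional_similarity(text_i, text_j, word_sim, idf)
    backward = directional_similarity(text_j, text_i, word_sim, idf)
    return 0.5 * (forward + backward)


def toy_word_sim(a, b, word_class):
    """Stand-in for a WordNet-based measure; the values are invented."""
    if a == b:
        return 1.0
    related = {
        frozenset(("dog", "animal")): 0.8,
        frozenset(("own", "have")): 0.7,
    }
    return related.get(frozenset((a, b)), 0.0)


text_1 = {"N": ["dog"], "V": ["own"]}
text_2 = {"N": ["animal"], "V": ["have"]}
score = bidirectional_similarity(text_1, text_2, toy_word_sim, idf={})
```

Passing the word similarity as a function keeps the text-level scoring independent of which of the word measures is plugged in.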
10 A Walk-Through Example
First, the text segments are tokenized and POS tagged.
The words are then inserted into word class sets.
11 A Walk-Through Example (cont.)
We seek a WordNet-based semantic similarity for nouns and verbs.
Only lexical matching is used for adjectives, adverbs, and cardinals.
12 A Walk-Through Example (cont.)
Applying the directional measure:
  The semantic similarity with respect to text 1 is 0.6702.
  The semantic similarity with respect to text 2 is 0.7202.
The bidirectional measure of similarity is 0.6952.
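The bidirectional score is consistent with a simple average of the two directional scores. The symbols T_1 and T_2 for the two example texts are notational assumptions.

```latex
\mathrm{sim}(T_1, T_2)
  = \frac{\mathrm{sim}(T_1, T_2)_{T_1} + \mathrm{sim}(T_1, T_2)_{T_2}}{2}
  = \frac{0.6702 + 0.7202}{2}
  = \frac{1.3904}{2}
  = 0.6952
```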
13 Evaluation
Goal: test the effectiveness of the text semantic similarity metric.
Task: automatically identify whether two text segments are paraphrases of each other.
Corpora:
  The Microsoft paraphrase corpus: 4,076 training pairs and 1,725 test pairs
  PASCAL corpus: 580 development pairs and 800 test pairs
Two settings:
  An unsupervised setting: a threshold of 0.5
  A supervised setting: the optimal threshold and the weights associated with the various similarity methods are determined through learning on training data
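The two settings can be sketched as follows. This is only an illustration under stated assumptions: the slide does not name the learning algorithm, so the supervised part here is a plain exhaustive search over candidate thresholds and does not learn per-method weights. Treating a score exactly at the threshold as a paraphrase is also an assumption, and the training scores and labels are invented.

```python
def classify(score, threshold=0.5):
    """Label a pair as a paraphrase if its similarity reaches the threshold."""
    return score >= threshold  # boundary handling is an assumption


def accuracy(scores, labels, threshold):
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(classify(s, threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Pick the observed score that maximizes accuracy on the training data."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


# Invented training data: similarity scores and gold paraphrase labels.
train_scores = [0.92, 0.81, 0.66, 0.58, 0.47, 0.30]
train_labels = [True, True, True, False, False, False]

unsupervised_acc = accuracy(train_scores, train_labels, threshold=0.5)
learned = best_threshold(train_scores, train_labels)
supervised_acc = accuracy(train_scores, train_labels, learned)
```

On this toy data the fixed 0.5 threshold mislabels the pair scored 0.58, while the tuned threshold separates the two classes.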
14 Evaluation (cont.)
Three baselines:
  Randomly choosing a true or false value for each text pair
  Lexical matching, which counts the number of matching words
  Using tf * idf weighting
Paraphrase identification:
  The dog is eating the bone -> The bone is being eaten by the dog
Entailment identification:
  I can see a dog -> I can see an animal
15 Evaluation (cont.)
16 Evaluation (cont.)
17 Conclusion
The accuracy of text semantic similarity for paraphrase identification is 68.8% and 71.5%.
For the entailment data set, the accuracy of 58.3% is better than that of the PASCAL entailment evaluation (Dagan et al., 2005).
Our method relies on a bag-of-words approach.
  It improves significantly over the traditional methods.
  But it ignores many important relationships in sentence structure.