Measures of Text Similarity
Presented by: Ehsan Asgarian, Ferdowsi University of Mashhad
Agenda
- Introduction
- Syntactical (String-Based) Similarity
  - Character-Based (word level)
  - Term-Based (sentence or document level)
- Semantic (Knowledge-Based) Similarity
  - Path Length
  - Information Content
  - Relatedness (dictionary-based methods)
- Statistical (Corpus-Based) Similarity
Why text similarity? It is used everywhere in NLP:
- Information retrieval (query vs. document)
- Text classification (document vs. category)
- Word-sense disambiguation (context vs. context)
- Automatic evaluation:
  - Machine translation (gold standard vs. generated output)
  - Text summarization (summary vs. original)
Distance Function
A metric on a set X is a function d : X × X → R (where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:
- d(x, y) ≥ 0 (non-negativity, or separation axiom)
- d(x, y) = 0 if and only if x = y (coincidence axiom)
- d(x, y) = d(y, x) (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality)
Agenda recap: next, Syntactical (String-Based) Similarity.
Syntactical (String-Based) Similarity
Character-Based: LCS, Levenshtein, N-gram, Jaro, Jaro-Winkler, Soundex (phonetic algorithms), …
Term-Based: Block Distance, Euclidean Distance, Cosine Similarity, Jaccard Similarity, Dice's Coefficient, Tanimoto, Tversky, Matching Coefficient, Overlap Coefficient
Agenda recap: next, Character-Based (word level) measures.
LCS & Levenshtein Distance
LCS (Longest Common Substring): measures the similarity of two strings by the length of the longest contiguous chain of characters that occurs in both strings.
Levenshtein: defines the distance between two strings as the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. (The variant that additionally allows transposition of two adjacent characters is the Damerau-Levenshtein distance.)
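As a concrete illustration, here is a minimal dynamic-programming sketch of the Levenshtein distance (the classic Wagner-Fischer table, kept to two rows; the function name is our own):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s1 into s2."""
    # prev[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```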
N-gram Distance
An n-gram is a sub-sequence of n items from a given sequence of text. N-gram similarity algorithms compare the n-grams (over characters or words) extracted from the two strings; similarity is computed by dividing the number of shared n-grams by the maximal number of n-grams.
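A minimal sketch of this measure over character bigrams (variants differ in padding and in whether n-grams are treated as sets or multisets; this version uses unpadded sets):

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of s (no padding)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    """Shared n-grams divided by the maximal number of n-grams."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / max(len(g1), len(g2))

print(ngram_similarity("night", "nacht"))  # 0.25: only 'ht' is shared
```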
Jaro Distance
Jaro similarity is based on the number and order of the common characters between two strings; it takes into account typical spelling deviations and is mainly used in the area of record linkage:
sim_j = 0 if m = 0, otherwise sim_j = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)
where:
- m is the number of matching characters (two characters from s1 and s2, respectively, are considered matching only if they are the same and not farther apart than ⌊max(|s1|, |s2|)/2⌋ − 1 positions);
- t is half the number of transpositions (matching characters that appear in a different order in the two strings).
Jaro-Winkler Distance
Jaro-Winkler is an extension of Jaro; it uses a prefix scale which gives more favorable ratings to strings that match from the beginning, for a set prefix length:
sim_w = sim_j + |Pref| · p · (1 − sim_j)
where:
- |Pref| is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;
- p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes; it should not exceed 0.25, otherwise the score can become larger than 1. The standard value in Winkler's work is p = 0.1;
- bt is the boost threshold: in Winkler's implementation the prefix bonus is applied only when sim_j exceeds bt = 0.7.
Example
Given the strings Ehsan and Eihhoos we find:
- |s1| = 5, |s2| = 7, so the match window is ⌊max(5, 7)/2⌋ − 1 = 2.
- m = 2: E and h match; the two s's are not considered matches because they are 4 positions apart, outside the match window.
- The matched characters (E, h) occur in the same order in both strings, so t = 0.
- sim_j = (1/3) · (2/5 + 2/7 + 2/2) ≈ 0.562.
To find the Jaro-Winkler score using the standard weight p = 0.1, we find |Pref| = 1 (the shared E), giving sim_w = 0.562 + 1 · 0.1 · (1 − 0.562) ≈ 0.606. With the boost threshold bt = 0.7, however, the prefix bonus is not applied (since sim_j < 0.7) and the score remains ≈ 0.562.
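To make this computation reproducible, here is a minimal sketch of Jaro and Jaro-Winkler following the definitions on the previous slides (function names are our own):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: (1/3)(m/|s1| + m/|s2| + (m-t)/m)."""
    if s1 == s2:
        return 1.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    f1, f2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):  # count matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not f2[j] and s2[j] == c:
                f1[i] = f2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    matches1 = [c for c, f in zip(s1, f1) if f]
    matches2 = [c for c, f in zip(s2, f2) if f]
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, bt: float = 0.7) -> float:
    """Add Winkler's prefix bonus when the Jaro score exceeds bt."""
    sim = jaro(s1, s2)
    if sim <= bt:
        return sim
    pref = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        pref += 1
    return sim + pref * p * (1 - sim)

print(round(jaro("Ehsan", "Eihhoos"), 3))  # 0.562, matching the slide
```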
Character-Based Similarity (cont.)
- Needleman-Wunsch (used in bioinformatics to align protein or nucleotide sequences)
- Smith-Waterman (performs local sequence alignment)
- Soundex (phonetic algorithms): a phonetic algorithm for indexing names by sound, as pronounced in English
- Keyboard-key distance
- …
The Needleman-Wunsch algorithm is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison. It performs a global alignment to find the best alignment over the entirety of two sequences; it is suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Smith-Waterman is another example of dynamic programming. It performs a local alignment to find the best alignment over the conserved domain of two sequences; it is useful for dissimilar sequences that are suspected to contain regions of similarity, or similar sequence motifs, within their larger sequence context. (A Needleman-Wunsch sketch follows below.)
Common uses: spell checkers, search.
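A minimal Needleman-Wunsch sketch computing the global alignment score (toy scoring scheme: match +1, mismatch −1, gap −1; it returns the score only, without the traceback that recovers the alignment itself):

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best global-alignment score of a and b, by dynamic programming
    over a (len(a)+1) x (len(b)+1) table (kept to two rows here)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 under this scoring
```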
Agenda recap: next, Term-Based (sentence or document level) measures.
Syntactical (String-Based) Similarity
Character-Based: LCS, Levenshtein, N-gram, Jaro, Jaro-Winkler, Soundex (phonetic algorithms), …
Term-Based (syntactic): Block Distance, Euclidean Distance, Cosine Similarity, Jaccard Similarity, Dice's Coefficient, Tanimoto, Tversky, Matching Coefficient, Overlap Coefficient
String similarity measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. This survey covers the most popular string similarity measures, as implemented in the SimMetrics package [1].
[1] Chapman, S. (2006). SimMetrics: a Java & C# .NET library of similarity metrics.
Document Representation: Term vector space
Each document is represented as a vector of term weights, with one dimension per vocabulary term.
Example: Term vector space
Preprocessing steps: Normalize, Tokenize, Stemming, Stop-Word Removal.
Local Term Weighting
1. Binary: l_{i,j} = 1 if term i occurs in document j, else 0
2. Term Frequency: l_{i,j} = tf_{i,j}, the number of occurrences of term i in document j
3. Log: l_{i,j} = log(tf_{i,j} + 1)
4. Normal: l_{i,j} = tf_{i,j} / max_i(tf_{i,j})
5. Augnorm: l_{i,j} = (tf_{i,j} / max_i(tf_{i,j}) + 1) / 2
Other local features: position of the word in the phrase, sentence, paragraph, or document; part-of-speech tag (noun, verb, adjective, adverb, …); named-entity type (person, time, location, …).
Global Term Weighting
1. Binary: g_i = 1
2. Normal: g_i = 1 / sqrt(Σ_j tf_{i,j}²)
3. IDF: g_i = log(N / df_i), where df_i is the number of documents in which term i occurs and N is the total number of documents
4. GfIdf: g_i = gf_i / df_i, where gf_i is the total number of times term i occurs in the whole collection
5. Entropy: g_i = 1 + Σ_j (p_{i,j} · log p_{i,j}) / log N, with p_{i,j} = tf_{i,j} / gf_i
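To see local and global weights combined, here is a minimal tf-idf sketch (local weight: raw term frequency; global weight: log(N / df); the helper name and toy documents are our own):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse vector per document, weighted by tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # each document counts once per term
    return [{term: count * math.log(n_docs / df[term])
             for term, count in Counter(tokens).items()}
            for tokens in tokenized]

docs = ["the cat sat on the mat", "the dog sat on the log"]
for vec in tfidf_vectors(docs):
    print(vec)  # terms occurring in every document get weight 0
```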
Hamming & Euclidean Distance
Block distance (L1 distance, city-block distance, Manhattan distance): the sum of the absolute differences of the corresponding components of two vectors. The closely related Hamming distance, from information theory, is defined between two strings of equal length as the number of positions at which the corresponding symbols differ.
Euclidean distance (L2 distance): the square root of the sum of squared differences between corresponding elements of the two vectors. (Unlike Hamming, the magnitude of the difference between corresponding elements also contributes to the final distance.)
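A minimal sketch of the three distances (function names are our own):

```python
def manhattan(u, v):
    """L1 / block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """L2 distance: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def hamming(s1, s2):
    """Number of positions where two equal-length strings differ."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2))

print(manhattan([1, 2, 3], [4, 0, 3]))  # 5
print(euclidean([0, 0], [3, 4]))        # 5.0
print(hamming("karolin", "kathrin"))    # 3
```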
Cosine Similarity Measure
cos(d1, d2) = (d1 · d2) / (‖d1‖ · ‖d2‖), i.e., the cosine of the angle between the two term vectors; for non-negative term weights it ranges from 0 to 1 and is independent of document length.
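A minimal sketch over sparse term-weight vectors (dicts), as produced by the tf-idf sketch above:

```python
import math

def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = {"cat": 1, "sat": 1, "mat": 1}
d2 = {"dog": 1, "sat": 1, "mat": 1}
print(round(cosine_similarity(d1, d2), 3))  # 0.667
```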
Matching & Overlap Coefficient
Simple matching: the number of terms (variables) on which documents (objects) s1 and s2 mismatch, divided by the total number of terms; the corresponding similarity (the simple matching coefficient) is one minus this value.
Overlap coefficient: a similarity measure related to the Jaccard index that measures the overlap between two sets; it is defined as the size of the intersection divided by the size of the smaller of the two sets: overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|).
Jaccard & Sørensen–Dice Distance
Jaccard: the Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: J(X, Y) = |X ∩ Y| / |X ∪ Y|.
Sørensen–Dice (Dice's coefficient): Sørensen's original formula was intended to be applied to presence/absence data, and is: Dice(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|).
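A minimal sketch of the set-based coefficients from this slide and the previous one:

```python
def jaccard(x: set, y: set) -> float:
    """Intersection over union."""
    return len(x & y) / len(x | y)

def dice(x: set, y: set) -> float:
    """Twice the intersection over the summed set sizes."""
    return 2 * len(x & y) / (len(x) + len(y))

def overlap(x: set, y: set) -> float:
    """Intersection over the smaller set."""
    return len(x & y) / min(len(x), len(y))

a, b = set("night"), set("nacht")  # character sets as toy term sets
print(jaccard(a, b))  # 3/7 ≈ 0.429
print(dice(a, b))     # 6/10 = 0.6
print(overlap(a, b))  # 3/5 = 0.6
```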
Tanimoto & Tversky Distance
Tanimoto: the Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance (the two coincide on binary vectors but differ in general).
Tversky: the Tversky index can be seen as a generalization of Dice's coefficient and the Tanimoto coefficient: S(X, Y) = |X ∩ Y| / (|X ∩ Y| + a·|X − Y| + b·|Y − X|). Setting a = b = 1 produces the Tanimoto coefficient; setting a = b = 0.5 produces Dice's coefficient.
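A minimal sketch showing the Tversky index and its two special cases:

```python
def tversky(x: set, y: set, a: float, b: float) -> float:
    """Tversky index: |X∩Y| / (|X∩Y| + a*|X-Y| + b*|Y-X|)."""
    inter = len(x & y)
    return inter / (inter + a * len(x - y) + b * len(y - x))

x, y = set("night"), set("nacht")
print(tversky(x, y, 1.0, 1.0))  # 3/7: Tanimoto/Jaccard special case
print(tversky(x, y, 0.5, 0.5))  # 0.6: Dice special case
```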
Agenda recap: next, Semantic (Knowledge-Based) Similarity.
Semantic (Knowledge-Based) Similarity
Path Length: Simple Path Length, Wu & Palmer, Leacock & Chodorow
Information Content: Resnik, Lin, Jiang & Conrath
Relatedness (dictionary-based methods): Hirst-St.Onge (HSO), Lesk, Vector pairs
Agenda recap: next, Path Length measures.
WordNet Similarity
The following measures operate on the WordNet concept hierarchy, in which synsets are linked by hypernym/hyponym relations.
Definitions and Notation
pathlen(c1, c2): the number of edges in the shortest path between concepts c1 and c2.
depth(ci): the depth of a node is the length of the path to it from the global root, i.e., depth(ci) = pathlen(root, ci).
LCS(c1, c2): the Least Common Subsumer is the lowest node in the hierarchy that is a hypernym of both c1 and c2.
rel(c1, c2): the semantic relatedness between two concepts c1 and c2. The relatedness rel(w1, w2) between two words w1 and w2 can then be calculated as
rel(w1, w2) = max_{c1 ∈ s(w1), c2 ∈ s(w2)} rel(c1, c2) (or the average),
where s(wi) is "the set of concepts in the taxonomy that are senses of word wi" (Resnik, 1995). That is, the relatedness of two words is equal to that of the most-related pair of concepts that they denote (e.g., for "bank": river bank vs. financial bank).
Path Length
Path-length based similarity: sim_path(c1, c2) = 1 / (pathlen(c1, c2) + 1), so that identical concepts score 1 and the score decays as the concepts lie farther apart in the hierarchy (some formulations use −log pathlen(c1, c2) instead).
Leacock & Chodorow
The L&Ch measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth D of the taxonomy in which the senses occur: sim_LCh(c1, c2) = −log( pathlen(c1, c2) / (2D) ).
Wu and Palmer
The W&P measure returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer: sim_WP(c1, c2) = 2 · depth(LCS(c1, c2)) / (depth(c1) + depth(c2)).
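These three path-based measures are available in NLTK's WordNet interface, which makes for a quick check (assumes NLTK and its WordNet data are installed; 'car.n.01' and 'boat.n.01' are WordNet synset identifiers):

```python
# pip install nltk; then: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')
boat = wn.synset('boat.n.01')

print(car.path_similarity(boat))  # simple path: 1 / (pathlen + 1)
print(car.lch_similarity(boat))   # Leacock & Chodorow
print(car.wup_similarity(boat))   # Wu & Palmer
```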
Agenda recap: next, Information Content measures.
Information Content
P(c) is the probability of seeing a concept of type c in a large corpus, i.e., the probability of seeing instances of that concept:
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N,
where words(c) is the set of words subsumed by concept c, and N is the number of words that appear both in the corpus and in the thesaurus. P(root) = 1, since all words are subsumed by the root concept; the lower a concept sits in the hierarchy, the lower its probability. The information content of a concept is then IC(c) = −log P(c).
Information Content (cont.)
Train probabilities by counting in a corpus: each word counts as an occurrence of all concepts “containing” it
Information Content (cont.)
Information content indicates how specific a concept is within its subject domain. A concept with high information content is very specific; concepts with low information content have broad, general meanings and a lower degree of specificity. For example, the concept "carving fork" has high information content, while the concept "entity" has very low information content.
Information Content (cont.)
E.g., with a corpus of N = 10,000 words (base-10 logarithms):
IC(vehicle) = −log(75/10000) ≈ 2.12
IC(caboose) = −log(10/10000) = 3
IC(freight car) = −log(1/10000) = 4
IC(coupe) = −log(14/10000) ≈ 2.85
IC(sedan) = −log(16/10000) ≈ 2.80
IC(taxi) = −log(34/10000) ≈ 2.47
…
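These values are easy to reproduce (the counts below are the slide's toy numbers, not real corpus statistics):

```python
import math

counts = {"vehicle": 75, "caboose": 10, "freight car": 1,
          "coupe": 14, "sedan": 16, "taxi": 34}
N = 10_000  # toy corpus size

for concept, count in counts.items():
    ic = -math.log10(count / N)  # IC(c) = -log P(c)
    print(f"IC({concept}) = {ic:.2f}")
```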
Resnik
The Resnik measure is equal to the information content (IC) of the Least Common Subsumer (the most informative subsumer): sim_Resnik(c1, c2) = IC(LCS(c1, c2)). This means that the value will always be greater than or equal to zero. The upper bound on the value is generally quite large and varies depending upon the size of the corpus used to determine information content values. In other words, the similarity of two concepts is the amount of information they share, namely the information content of the lowest concept in the hierarchy under which both concepts fall.
Lin & Jiang-Conrath
Lin: the Lin measure scales the information content of the Least Common Subsumer by the sum of the information content of the concepts c1 and c2 themselves: sim_Lin(c1, c2) = 2 · IC(LCS(c1, c2)) / (IC(c1) + IC(c2)).
Jiang-Conrath: takes the difference between this sum and the information content of the Least Common Subsumer: dist_JC(c1, c2) = IC(c1) + IC(c2) − 2 · IC(LCS(c1, c2)); similarity is the inverse of this distance.
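The information-content measures are also available in NLTK, given an IC file computed from a corpus (here the pre-packaged Brown-corpus counts; assumes the 'wordnet' and 'wordnet_ic' data are downloaded):

```python
# import nltk; nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC counts from the Brown corpus
car, boat = wn.synset('car.n.01'), wn.synset('boat.n.01')

print(car.res_similarity(boat, brown_ic))  # Resnik: IC of the LCS
print(car.lin_similarity(boat, brown_ic))  # Lin
print(car.jcn_similarity(boat, brown_ic))  # Jiang-Conrath
```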
Agenda recap: next, Relatedness (dictionary-based methods).
Lesk
The Lesk algorithm is used for word-sense disambiguation. It is a dictionary-based method: it makes use of glosses, a property of dictionaries. Under Lesk, two concepts/senses are similar if their glosses contain overlapping words.
Extended Lesk
The Extended Lesk measure uses extended gloss overlap: two concepts/senses are similar if their glosses, and the glosses of concepts related to them, contain overlapping words. Let RELs be the set of possible WordNet relations whose glosses we compare. For each n-word phrase seen in both glosses, eLesk adds n²: longer overlaps are rare, and should be weighted more heavily.
Extended Lesk (Example)
drawing paper: paper that is specially prepared for use in drafting
decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
The glosses share the one-word overlap "paper" and the two-word overlap "specially prepared", so overlap(decal, drawing paper) = 1² + 2² = 5 (if considering hyponyms only).
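A simplified sketch of this overlap scoring (real implementations also filter function words and prevent a word from participating in more than one overlap; here we only scan greedily left to right):

```python
def overlap_score(gloss1: str, gloss2: str) -> int:
    """Sum n^2 over maximal shared n-word phrases, scanning gloss1."""
    w1, w2 = gloss1.lower().split(), gloss2.lower().split()
    phrases2 = {tuple(w2[i:i + n])            # all phrases of gloss2
                for n in range(1, len(w2) + 1)
                for i in range(len(w2) - n + 1)}
    score, i = 0, 0
    while i < len(w1):
        n = 0  # longest phrase starting at w1[i] also present in gloss2
        while i + n < len(w1) and tuple(w1[i:i + n + 1]) in phrases2:
            n += 1
        score += n * n
        i += max(n, 1)
    return score

g1 = "paper that is specially prepared for use in drafting"
g2 = ("the art of transferring designs from specially prepared paper "
      "to a wood or glass or metal surface")
print(overlap_score(g1, g2))  # 1^2 ('paper') + 2^2 ('specially prepared') = 5
```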
Hirst-St.Onge (HSO)
The HSO measure works by finding lexical chains linking the two word senses: rel_HSO(c1, c2) = C − pathlen(c1, c2) − k · turns(c1, c2), where C and k are constants (in practice, C = 8 and k = 1), and turns(c1, c2) is the number of times the path between c1 and c2 changes direction.
Vector Pairs
The vector measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors: rel_vector(c1, c2) = cos(angle(v1, v2)), where c1 and c2 are the two given concepts, v1 and v2 are the gloss vectors corresponding to the concepts, and angle returns the angle between the vectors.
Agenda recap: next, Statistical (Corpus-Based) Similarity.
Statistical (Corpus-Based) Similarity
Corpus-based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora:
- PMI (Pointwise Mutual Information)
- LSA (Latent Semantic Analysis)
- ESA (Explicit Semantic Analysis)
- pLSI (probabilistic Latent Semantic Indexing)
- NMF (Non-negative Matrix Factorization)
- LDA (Latent Dirichlet Allocation)
- DISCO (DIStributionally similar words using CO-occurrences)
- …
Thanks For Your Attention