Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
Seminarium IPIPAN, 24 April 2006
String-to-string correction
Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
CONTEXT:
– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): minimum cost of all edit sequences transforming A into B
INPUT:
– Two words A and B
OUTPUT:
– Distance between A and B
Examples of elementary edit operations
Insertion of a letter: monter → montaer, monter → montrer
Deletion of a letter: monter → montr, monter → monte
Replacement of a letter by another: monter → ponter, monter → conter
Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
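The four operations above can be sketched in Python as plain string functions. The function names and the 0-based position argument are illustrative assumptions, not notation from the talk.

```python
def insert(word: str, pos: int, letter: str) -> str:
    """Insert `letter` so that it ends up at index `pos`."""
    return word[:pos] + letter + word[pos:]


def delete(word: str, pos: int) -> str:
    """Delete the letter at index `pos`."""
    return word[:pos] + word[pos + 1:]


def replace(word: str, pos: int, letter: str) -> str:
    """Replace the letter at index `pos` by `letter`."""
    return word[:pos] + letter + word[pos + 1:]


def transpose(word: str, pos: int) -> str:
    """Swap the adjacent letters at indices `pos` and `pos + 1`."""
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


print(insert("monter", 4, "a"))     # montaer
print(delete("monter", 5))          # monte
print(replace("monter", 0, "p"))    # ponter
print(transpose("monter", 4))       # montre
```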
Edit sequence
Edit sequence = sequence of elementary edit operations
For each pair of words X and Y, many edit sequences exist that transform X into Y.
Example 1: transforming sorting into string:
– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– …
Example 2: transforming abc into ca:
– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations; linear sequence)
From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no further operation may alter the result of a previous operation.
Edit (error) distance
Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
– sorting → srting → sting → string (3 operations): cost = 3
– sorting → sotring → string (2 operations): cost = 2
– sorting → srting → sting → sing → sring → string (5 operations): cost = 5
Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost of all edit sequences transforming X into Y:
– ed(sorting, string) = 2
– ed(abc, ca) = 2, if all edit sequences are taken into account
– ed(abc, ca) = 3, if only the linear edit sequences are taken into account
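The definition "minimal cost of all edit sequences" can be checked by brute force on short words. The sketch below is a breadth-first search over words, with cost 1 per operation; it is only an illustration of the definition, not a method from the talk. Restricting the alphabet to the letters of the two words is an assumption that does not change the minimum.

```python
from collections import deque


def neighbours(word, alphabet):
    """All words reachable from `word` by one elementary operation."""
    result = set()
    for pos in range(len(word) + 1):
        for letter in alphabet:
            result.add(word[:pos] + letter + word[pos:])       # insertion
    for pos in range(len(word)):
        result.add(word[:pos] + word[pos + 1:])                # deletion
        for letter in alphabet:
            result.add(word[:pos] + letter + word[pos + 1:])   # replacement
    for pos in range(len(word) - 1):                           # transposition
        result.add(word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:])
    result.discard(word)
    return result


def unrestricted_distance(source, target):
    """Minimum number of operations over ALL edit sequences (not only linear ones)."""
    if source == target:
        return 0
    alphabet = sorted(set(source) | set(target))
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, cost = queue.popleft()
        for nxt in neighbours(word, alphabet):
            if nxt == target:
                return cost + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))


print(unrestricted_distance("abc", "ca"))          # 2
print(unrestricted_distance("sorting", "string"))  # 2
```

The search is exponential in the distance, so it is usable only as a sanity check on very short words.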
Calculating the edit distance (1/4)
Notation: word $X = x_1 x_2 \ldots x_i \ldots x_n$; the prefix of length $i$ of $X$: $X[i] = x_1 x_2 \ldots x_i$
It is possible to calculate the distance between two prefixes $X[i+1]$ and $Y[j+1]$ on the basis of the distances between shorter prefixes: 3 cases
Case 1: if $x_{i+1} = y_{j+1}$ then $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])$
Calculating the edit distance (2/4)
Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be inverted) then 4 sub-cases are possible:
– Transposition’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains a transposition of $x_i$ and $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1$
– Replacement’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$
– Insertion’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$
– Deletion’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$
Calculating the edit distance (3/4)
Case 3: OTHERWISE (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)) then 3 sub-cases are possible:
– Replacement’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$
– Insertion’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$
– Deletion’s cost: the cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$
Calculating the edit distance (4/4)
Edit distance between $X[i]$ and $Y[j]$, recursive definition. For $i = 0,\ldots,m$ and $j = 0,\ldots,n$:
1° $ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = \max(m,n)$
2° $ed(X[0],Y[j]) = j$, $ed(X[i],Y[0]) = i$
3° $$ed(X[i+1],Y[j+1]) = \begin{cases} ed(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\ 1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]),\ ed(X[i-1],Y[j-1]) \,\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\ 1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]) \,\} & \text{otherwise} \end{cases}$$
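A minimal Python sketch of the recursive definition, filled in bottom-up. The function names are illustrative. Instead of the sentinel value max(m,n) for the prefix index -1, the sketch checks explicitly that both prefixes have at least two letters before trying a transposition, which is an equivalent guard under unit costs.

```python
def edit_distance_matrix(x, y):
    """d[i][j] = edit distance between the prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all letters of x[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all letters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:     # case 1: last letters are equal
                d[i][j] = d[i - 1][j - 1]
                continue
            # case 3: replacement, insertion, deletion
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            # case 2: the last two letters may be inverted
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d


def edit_distance(x, y):
    return edit_distance_matrix(x, y)[len(x)][len(y)]


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3
```

The value 3 for abc and ca agrees with the distance restricted to linear edit sequences.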
Calculating the edit distance: dynamic programming
Cell [i,j] contains the edit distance between the prefix [1,…,i] of one word and the prefix [1,…,j] of the other word.
Cell [n,m] contains the edit distance between the 2 words.
Rows are indexed by i, columns by j:
       s  o  r  t  i  n  g
    0  1  2  3  4  5  6  7
 s  1  0  1  2  3  4  5  6
 t  2  1  1  2  2  3  4  5
 r  3  2  2  1  2  3  4  5
 i  4  3  3  2  2  2  3  4
 n  5  4  4  3  3  3  2  3
 g  6  5  5  4  4  4  3  2
Dynamic programming: case 1
$x_{i+1} = y_{j+1}$ (cell [i+1, j+1] is row t, column t)
       s  o  r  t  i  n  g
    0  1  2  3  4  ?  ?  ?
 s  1  0  1  2  3  ?  ?  ?
 t  2  1  1  2  2  ?  ?  ?
 r  ?  ?  ?  ?  ?  ?  ?  ?
 i  ?  ?  ?  ?  ?  ?  ?  ?
 n  ?  ?  ?  ?  ?  ?  ?  ?
 g  ?  ?  ?  ?  ?  ?  ?  ?
Dynamic programming: case 2
$x_i = y_{j+1}$ and $x_{i+1} = y_j$ (cell [i+1, j+1] is row r, column t)
       s  o  r  t  i  n  g
    0  1  2  3  4  ?  ?  ?
 s  1  0  1  2  3  ?  ?  ?
 t  2  1  1  2  2  ?  ?  ?
 r  3  2  2  1  2  ?  ?  ?
 i  ?  ?  ?  ?  ?  ?  ?  ?
 n  ?  ?  ?  ?  ?  ?  ?  ?
 g  ?  ?  ?  ?  ?  ?  ?  ?
Dynamic programming: case 3
$x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$) (cell [i+1, j+1] is row i, column t)
       s  o  r  t  i  n  g
    0  1  2  3  4  ?  ?  ?
 s  1  0  1  2  3  ?  ?  ?
 t  2  1  1  2  2  ?  ?  ?
 r  3  2  2  1  2  ?  ?  ?
 i  4  3  3  2  2  ?  ?  ?
 n  ?  ?  ?  ?  ?  ?  ?  ?
 g  ?  ?  ?  ?  ?  ?  ?  ?
String-to-language correction
String-to-language correction: problem definition
CONTEXT:
– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before
INPUT:
– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (unrecognizable by the grammar)
– Threshold t
OUTPUT:
– A set of correct words $B_1, B_2, \ldots, B_n$ whose distance from A stays within t (the nearest neighbors of A)
String-to-language correction: simplistic approach
METHOD:
– For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
– Propose candidates whose distance from A does not exceed the threshold t ($ed(A,B) \le t$).
FEASIBILITY:
– Impossible in case of infinite languages
COMPLEXITY: O(n * m * |D|)
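For a finite word list, the simplistic method amounts to one full distance computation per dictionary entry. The sketch below assumes the dictionary is given as a plain Python list; `ed` is a memoized top-down version of the recursive definition, and both function names are illustrative.

```python
from functools import lru_cache


def ed(x, y):
    """Edit distance with unit costs, including adjacent transpositions."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        if x[i - 1] == y[j - 1]:
            return go(i - 1, j - 1)
        best = min(go(i - 1, j - 1), go(i, j - 1), go(i - 1, j))
        if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
            best = min(best, go(i - 2, j - 2))
        return 1 + best

    return go(len(x), len(y))


def naive_correct(word, dictionary, t):
    """Compare `word` with every dictionary entry and keep those within t."""
    return sorted(b for b in dictionary if ed(word, b) <= t)


print(naive_correct("aply", ["apple", "apples", "apply", "banana"], 2))
# ['apple', 'apply']
```

Every entry is examined in full, even when it shares a long prefix with an entry already processed.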
String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)
String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, p, l, y, e, s, p, l, y, e, a]
– Part of the matrix is calculated only once for all valid words sharing the same prefix appl
– Each time a transition is followed, a new column is calculated in the edit distance matrix
– If we get to a final state and the edit distance remains within the threshold, a new candidate has been found
       a  p  p  l  e
    0  1  2  3  4  5
 a  1  0  1  2  3  4
 p  2  1  0  1  2  3
 l  3  2  1  1  1  2
 y  4  3  2  2  2  2
Candidate found: apple
String correction with respect to a deterministic FSA (2/4)
       a  p  p  l  e  s
    0  1  2  3  4  5  6
 a  1  0  1  2  3  4  5
 p  2  1  0  1  2  3  4
 l  3  2  1  1  1  2  3
 y  4  3  2  2  2  2  3
Candidate found so far: apple
String correction with respect to a deterministic FSA (3/4)
Backtracking results in deleting the current column.
Candidate found so far: apple
String correction with respect to a deterministic FSA (4/4)
       a  p  p  l  y
    0  1  2  3  4  5
 a  1  0  1  2  3  4
 p  2  1  0  1  2  3
 l  3  2  1  1  1  2
 y  4  3  2  2  2  1
Candidates found: apple, apply
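Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
[Figure: FSA with states 1, 2, 8, 9 and transitions labelled a, b, c, d; the path explored spells abbbbbb]
Rows follow the word to be corrected, columns follow the path in the FSA; ++ marks the boundary cells:
            a   b   b   b   b   b   b
   ++  ++  ++  ++  ++  ++  ++  ++  ++
   ++   0   1   2   3   4   5   6   7
a  ++   1   0   1   2   3   4   5   6
b  ++   2   1   0   1   2   3   4   5
c  ++   3   2   1   1   2   3   4   5
b  ++   4   3   2   1   1   2   3   4
b  ++   5   4   3   2   1   1   2   3
If the current column exceeds the threshold, the whole path is cut off.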
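The exploration described on the preceding slides can be sketched in Python as follows. Two simplifying assumptions are made: the deterministic FSA is a trie stored as nested dictionaries (the key "" marks a final state, a hypothetical encoding), and a path is cut off when every entry of the current column exceeds t. All names are illustrative; this is not Oflazer's original implementation.

```python
def build_trie(words):
    """Trie as nested dicts; the key "" marks a final state (hypothetical encoding)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def correct(word, trie, t):
    """Return (candidate, distance) pairs whose distance from `word` is <= t."""
    candidates = []

    def explore(node, prefix, cols):
        for ch, child in node.items():
            if ch == "":
                continue  # final-state marker, not a transition
            prev = cols[-1]
            col = [prev[0] + 1]  # new column for the transition labelled ch
            for i in range(1, len(word) + 1):
                if word[i - 1] == ch:
                    col.append(prev[i - 1])
                    continue
                best = min(prev[i - 1], prev[i], col[i - 1])
                if (i > 1 and prefix
                        and word[i - 2] == ch and word[i - 1] == prefix[-1]):
                    best = min(best, cols[-2][i - 2])  # transposition
                col.append(1 + best)
            if min(col) > t:
                continue  # the whole path is cut off
            cols.append(col)
            if "" in child and col[-1] <= t:
                candidates.append((prefix + ch, col[-1]))
            explore(child, prefix + ch, cols)
            cols.pop()  # backtracking deletes the current column

    explore(trie, "", [list(range(len(word) + 1))])
    return sorted(candidates)


trie = build_trie(["apple", "apples", "apply"])
print(correct("aply", trie, 2))  # [('apple', 2), ('apply', 1)]
```

The columns for the shared prefix appl are computed once and reused for apple, apples and apply, which is the saving over the simplistic approach.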
Tree-to-tree correction
Tree-to-tree correction (Selkow 1977, …)
CONTEXT:
– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:
  – Insertion of a leaf
  – Deletion of a leaf
  – Renaming of a node (leaf or internal node)
– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
– Edit distance between two trees A and B: minimum cost of all edit sequences transforming A into B
INPUT:
– Two trees A and B
OUTPUT:
– Distance between A and B
Comparing two trees (Selkow 1977, …)
A partial tree $A_{0:i}$ is the root of A and its subtrees $A_0, \ldots, A_i$.
The comparison is based on comparing roots, and then recursively comparing the roots’ subtrees.
[Figure: tree A with root(A) and subtrees $A_0$, $A_1$, $A_2$; tree B with root(B) and subtrees $B_0$, $B_1$, $B_2$, $B_3$; the partial trees $A_{0:1}$ and $B_{0:2}$ are marked]
Edit distance matrix between two trees (Selkow 1977, …)
Cell [-1,-1] contains the cost of renaming root(A) into root(B).
Cell [i,j] contains the edit distance between the partial trees $A_{0:i}$ and $B_{0:j}$.
Cell [n,m] contains the edit distance between the 2 trees.
Rows are indexed by i, columns by j:
       -1   0   1   2   3
 -1     1   4  14  15  16
  0     4   2  12  13  14
  1    15  13   3   4   5
  2    16  14   4   4   4
Calculation of the tree matrix (Selkow 1977, …)
       -1   0   1   2   3
 -1     1   4  14  15  16
  0     4   2  12  13  14
  1    15  13   3   4   5
  2    16  14   4   4   ?
The value of a cell is obtained from its three neighbours by:
– Adding the cost of inserting $B_j$ (here +1)
– Adding the edit distance between $A_i$ and $B_j$ (here +0)
– Adding the cost of deleting $A_i$ (here +1)
– Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
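A minimal Python sketch of this matrix computation, under stated assumptions: a tree is a pair (label, list of subtrees), every elementary operation costs 1, so inserting or deleting a whole subtree costs its number of nodes, and index 0 of the list `d` plays the role of index -1 in the matrix above. The names are illustrative.

```python
def size(tree):
    """Number of nodes, i.e. the cost of inserting or deleting the whole subtree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def tree_distance(a, b):
    """Edit distance between trees a and b (leaf insertion/deletion, renaming)."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0 if label_a == label_b else 1          # renaming root(A) into root(B)
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])   # delete subtree A_i
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])   # insert subtree B_j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
                d[i][j - 1] + size(kids_b[j - 1]),    # insert B_j
                d[i - 1][j] + size(kids_a[i - 1]),    # delete A_i
            )
    return d[n][m]


A = ("a", [("b", []), ("c", [("d", [])])])
B = ("a", [("b", []), ("c", [("e", [])]), ("f", [])])
print(tree_distance(A, B))  # 2: rename d into e, insert the leaf f
```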
Extension to the correction of XML documents
The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
The "horizontal" correction on a siblings’ level is similar to the string-to-language correction (Oflazer 1996)
The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
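As an illustration of the "horizontal" validity check only, the example expression E can be tested against the sequence of a node's child labels with Python's `re` module. The sketch assumes single-letter labels; the "+" of the slide's notation denotes union and is written "|" in Python. The function name is illustrative.

```python
import re

# E = ab*c + db*, written in Python's regular expression syntax
E = re.compile(r"ab*c|db*")


def children_valid(labels):
    """True if the sequence of child labels forms a word of E."""
    return E.fullmatch("".join(labels)) is not None


print(children_valid(["a", "b", "b", "c"]))  # True
print(children_valid(["a", "b"]))            # False
```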
Main idea
– String-to-string (Wagner & Fischer 1974)
– String-to-(regular) language (Oflazer 1996)
– Tree-to-tree (Selkow 1977)
– Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)
Edit distance matrix with edit sequences
Cell [i,j] contains the edit distance between the partial trees $A_{0:i}$ and $B_{0:j}$, and the edit sequence necessary to transform $A_{0:i}$ into $B_{0:j}$.
[Figure: matrix indexed by i and j; an example cell holds [3, …], a distance followed by its edit sequence]
Bibliography
Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario.
Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302.
Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402.
Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183.
Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477.
Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89.
Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186.
Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268.
Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173.
Some details of the state of the art
Wagner & Fischer (1974):
– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method
Lowrance & Wagner (1975):
– Additional elementary operation: inversion of two adjacent letters
– Restriction of the cost function
Du & Chang (1992):
– Cost 1 for each elementary operation
– Restriction to linear editing sequences
– Application to the nearest-neighbor search in a dictionary, with a threshold
Oflazer (1996):
– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries
Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
– Tree-to-tree correction problem
Mihov & Schulz (2004):
– Levenshtein automaton
– Backward dictionary
Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
– Incremental validation of XML documents resulting from updates: human-computer interaction