Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
IPIPAN Seminar, 24 April 2006

String-to-string correction
Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
CONTEXT:
– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (the sum of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): minimum cost over all edit sequences transforming A into B
INPUT:
– Two words A and B
OUTPUT:
– Distance between A and B

Examples of elementary edit operations
– Insertion of a letter: monter → montaer, monter → montrer
– Deletion of a letter: monter → montr, monter → monte
– Replacement of a letter by another: monter → ponter, monter → conter
– Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.

Edit sequence
Edit sequence = sequence of elementary edit operations
For each pair of words X and Y, many edit sequences exist that transform X into Y.
Example 1: transforming sorting into string:
– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– …
Example 2: transforming abc into ca:
– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations; linear sequence)
From now on, we'll be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no later operation may alter the result of a previous operation.

Edit (error) distance
Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
– sorting → srting → sting → string (3 operations): cost = 3
– sorting → sotring → string (2 operations): cost = 2
– sorting → srting → sting → sing → sring → string (5 operations): cost = 5
Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
– ed(sorting, string) = 2
– ed(abc, ca) = 2, if all edit sequences are taken into account
– ed(abc, ca) = 3, if only the linear edit sequences are taken into account

Calculating the edit distance (1/4)
Notation: word $X = x_1 x_2 \ldots x_i \ldots x_n$; the prefix of length $i$ of $X$: $X[i] = x_1 x_2 \ldots x_i$
The distance between two prefixes $X[i+1]$ and $Y[j+1]$ can be calculated from the distances between shorter prefixes: 3 cases
Case 1: if $x_{i+1} = y_{j+1}$ then $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])$

Calculating the edit distance (2/4)
Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed), then 4 sub-cases are possible:
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains a transposition of $x_i$ and $x_{i+1}$ (transposition's cost): $ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1$
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$ (replacement's cost): $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$ (insertion's cost): $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$ (deletion's cost): $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$

Calculating the edit distance (3/4)
Case 3: OTHERWISE (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)), 3 sub-cases are possible:
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$ (replacement's cost): $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$ (insertion's cost): $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$
– The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$ (deletion's cost): $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$

Calculating the edit distance (4/4)
Edit distance between $X[i]$ and $Y[j]$, recursive definition. For $i = 0, \ldots, m$ and $j = 0, \ldots, n$:
1° $ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = \max(m,n)$
2° $ed(X[0],Y[j]) = j$, $ed(X[i],Y[0]) = i$
3°
\[
ed(X[i+1],Y[j+1]) =
\begin{cases}
ed(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]),\ ed(X[i-1],Y[j-1]) \,\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]) \,\} & \text{otherwise}
\end{cases}
\]

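The recursive definition can be turned into a table-filling routine. The following is a minimal Python sketch, assuming unit costs as in the slides; the function names `edit_matrix` and `edit_distance` are illustrative and do not come from the talk. Instead of the sentinel row and column with index -1, the sketch guards the transposition case with an explicit bounds check.

```python
def edit_matrix(x: str, y: str) -> list[list[int]]:
    """Return d where d[i][j] = ed(X[i], Y[j]) for prefixes of x and y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete i letters
    for j in range(n + 1):
        d[0][j] = j                      # insert j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Case 1: last letters match, nothing to pay.
                d[i][j] = d[i - 1][j - 1]
                continue
            # Case 3: replacement, insertion, deletion.
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            # Case 2: the last two letters are swapped (bounds check
            # replaces the -1 sentinel of the recursive definition).
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d


def edit_distance(x: str, y: str) -> int:
    return edit_matrix(x, y)[len(x)][len(y)]


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3 (linear edit sequences only)
```

Because a transposed pair is never edited again, this routine yields the distance over linear edit sequences, which is why `abc` and `ca` are at distance 3 rather than 2.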
Calculating the edit distance: dynamic programming
– Cell [n,m] contains the edit distance between the 2 words
– Cell [i,j] contains the edit distance between the prefix [1,…,i] of one word and the prefix [1,…,j] of the other word
[Figure: empty edit distance matrix with columns labelled s o r t i n g and rows labelled s t r i n g; cells are indexed by i and j, the final cell is [n,m]]

Dynamic programming: case 1
         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  ?  ?  ?  ?  ?  ?  ?  ?
   i  ?  ?  ?  ?  ?  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?
Current cell [i+1, j+1] (row t, column t): $x_{i+1} = y_{j+1}$

Dynamic programming: case 2
         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  3  2  2  1  2  ?  ?  ?
   i  ?  ?  ?  ?  ?  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?
Current cell [i+1, j+1] (row r, column t): $x_i = y_{j+1}$ and $x_{i+1} = y_j$

Dynamic programming: case 3
         s  o  r  t  i  n  g
      0  1  2  3  4  ?  ?  ?
   s  1  0  1  2  3  ?  ?  ?
   t  2  1  1  2  2  ?  ?  ?
   r  3  2  2  1  2  ?  ?  ?
   i  4  3  3  2  2  ?  ?  ?
   n  ?  ?  ?  ?  ?  ?  ?  ?
   g  ?  ?  ?  ?  ?  ?  ?  ?
Current cell [i+1, j+1] (row i, column t): $x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)

String-to-language correction
String-to-language correction: problem definition
CONTEXT:
– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before
INPUT:
– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (not recognized by the grammar)
– Threshold t
OUTPUT:
– A set of correct words B_1, B_2, …, B_n whose distance from A stays within t (the nearest neighbors of A)

String-to-language correction: simplistic approach
METHOD:
– For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
– Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
FEASIBILITY:
– Impossible in the case of infinite languages
COMPLEXITY: O(n * m * |D|)

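For a finite word list, the simplistic method amounts to one full distance computation per entry. The sketch below illustrates this under the assumption that the language is given as a plain Python list; `naive_candidates` and the sample lexicon are hypothetical names and data chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance with adjacent transpositions, row by row."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur.append(prev[j - 1])
                continue
            best = min(prev[j - 1], prev[j], cur[j - 1])
            if i > 1 and j > 1 and a[i - 2] == b[j - 1] and a[i - 1] == b[j - 2]:
                best = min(best, prev2[j - 2])
            cur.append(1 + best)
        prev2, prev = prev, cur
    return prev[len(b)]


def naive_candidates(word: str, lexicon: list[str], t: int) -> dict[str, int]:
    """Compare `word` with every lexicon entry; keep those within threshold t."""
    result = {}
    for entry in lexicon:
        dist = edit_distance(word, entry)
        if dist <= t:
            result[entry] = dist
    return result


lexicon = ["apple", "apples", "apply", "banana"]  # illustrative data
print(naive_candidates("aply", lexicon, 2))       # {'apple': 2, 'apply': 1}
```

Every entry costs a complete matrix, even when it shares a long prefix with the previous one, which is the redundancy the automaton-based exploration avoids.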
String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA recognizing the valid words, among them apple and apply]
– Each time a transition is followed, a new column is calculated in the edit distance matrix
– The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix
– If we reach a final state and the edit distance remains within the threshold, a new candidate has been found
         a  p  p  l  e
      0  1  2  3  4  5
   a  1  0  1  2  3  4
   p  2  1  0  1  2  3
   l  3  2  1  1  1  2
   y  4  3  2  2  2  2
Candidate found: apple

String correction with respect to a deterministic FSA (2/4)
Word to be corrected: *aply, threshold 2
Following the transition s adds a new column:
         a  p  p  l  e  s
      0  1  2  3  4  5  6
   a  1  0  1  2  3  4  5
   p  2  1  0  1  2  3  4
   l  3  2  1  1  1  2  3
   y  4  3  2  2  2  2  3
Candidate found so far: apple

String correction with respect to a deterministic FSA (3/4)
Word to be corrected: *aply, threshold 2
Backtracking results in deleting the current column (here the column s, then the column e):
         a  p  p  l  e  s
      0  1  2  3  4  5  6
   a  1  0  1  2  3  4  5
   p  2  1  0  1  2  3  4
   l  3  2  1  1  1  2  3
   y  4  3  2  2  2  2  3
Candidate found so far: apple

String correction with respect to a deterministic FSA (4/4)
Word to be corrected: *aply, threshold 2
After backtracking to the prefix appl, the transition y adds a new column; the columns for appl are reused:
         a  p  p  l  y
      0  1  2  3  4  5
   a  1  0  1  2  3  4
   p  2  1  0  1  2  3
   l  3  2  1  1  1  2
   y  4  3  2  2  2  1
Candidates found: apple, apply

Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
[Figure: FSA with transitions labelled a, b, c, d and the partially filled edit distance matrix for one of its paths]
If all values in the current column exceed the threshold, the whole path is cut off.

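The exploration described on the preceding slides can be sketched in Python as follows. This is an illustration, not Oflazer's published algorithm: a trie built from a word list stands in for the deterministic FSA, the recursion stack plays the role of backtracking (returning from a call discards the current column), and the names `build_trie` and `correct` are hypothetical.

```python
def build_trie(words):
    """Nested dicts as an acyclic deterministic automaton; None marks a final state."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def correct(word, trie, t):
    """Return {candidate: distance} for all trie words within distance t of `word`."""
    n = len(word)
    found = {}

    def explore(node, path, col, prev_col):
        # col[i] = ed(word[:i], path); prev_col is the column for path[:-1].
        if node.get(None) and col[n] <= t:
            found[path] = col[n]
        for ch, child in node.items():
            if ch is None:
                continue
            new = [col[0] + 1]
            for i in range(1, n + 1):
                if word[i - 1] == ch:
                    new.append(col[i - 1])
                    continue
                best = min(col[i - 1], col[i], new[i - 1])
                # Transposition needs the column two transitions back.
                if (prev_col is not None and i > 1
                        and word[i - 2] == ch and word[i - 1] == path[-1]):
                    best = min(best, prev_col[i - 2])
                new.append(1 + best)
            # Cut-off: follow the path only if some cell is within t.
            if min(new) <= t:
                explore(child, path + ch, new, col)

    explore(trie, "", list(range(n + 1)), None)
    return found


trie = build_trie(["apple", "apples", "apply", "banana"])  # illustrative data
print(correct("aply", trie, 2))  # {'apple': 2, 'apply': 1}
```

In this run the column for `apples` is (6, 5, 4, 3, 3), so the path is abandoned there, and the columns for the shared prefix `appl` are computed once and reused for both `apple` and `apply`.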
Tree-to-tree correction
Tree-to-tree correction (Selkow 1977, …)
CONTEXT:
– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:
  – Insertion of a leaf
  – Deletion of a leaf
  – Renaming of a node (leaf or internal node)
– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (the sum of the costs of the editing operations involved)
– Edit distance between two trees A and B: minimum cost over all edit sequences transforming A into B
INPUT:
– Two trees A and B
OUTPUT:
– Distance between A and B

Comparing two trees (Selkow 1977, …)
– A partial tree $A_{0:i}$ consists of the root of A and its subtrees $A_0, \ldots, A_i$
– The comparison is based on comparing the roots, and then recursively comparing the roots' subtrees
[Figure: tree A with root(A) and subtrees $A_0$, $A_1$, $A_2$; tree B with root(B) and subtrees $B_0$, …, $B_3$; the partial trees $A_{0:1}$ and $B_{0:2}$ are indicated]

Edit distance matrix between two trees (Selkow 1977, …)
– Cell [-1,-1] contains the cost of renaming root(A) into root(B)
– Cell [i,j] contains the edit distance between the partial trees $A_{0:i}$ and $B_{0:j}$
– Cell [n,m] contains the edit distance between the 2 trees

Calculation of the tree matrix (Selkow 1977, …)
To fill cell [i,j]:
– Add the cost of inserting $B_j$ (here +1)
– Add the edit distance between $A_i$ and $B_j$ (here +0)
– Add the cost of deleting $A_i$ (here +1)
– Take the minimum (here min(4+0, 5+1, 4+1) = 4)

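A compact way to see the tree matrix at work is the recursive sketch below. It assumes unit costs, so that inserting or deleting a whole subtree leaf by leaf costs its number of nodes; the `Node` class and the function names are illustrative. The slide's cell [-1,-1] corresponds to `d[0][0]` here, so all indices are shifted by one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes, i.e. the unit cost of inserting or deleting t leaf by leaf."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Slide cell [-1,-1]: cost of renaming root(A) into root(B).
    d[0][0] = 0 if a.label == b.label else 1
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])   # delete A_i
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])   # insert B_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ai, bj = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(ai, bj),  # transform A_i into B_j
                d[i][j - 1] + size(bj),                   # insert B_j
                d[i - 1][j] + size(ai),                   # delete A_i
            )
    return d[m][n]


t1 = Node("a", [Node("b"), Node("c")])
t2 = Node("a", [Node("b"), Node("d", [Node("e")])])
print(tree_distance(t1, t2))  # 2: rename c into d, insert leaf e
```

The recursion recomputes subtree distances on demand; a production version would memoize them, but the structure mirrors the three options listed on the slide.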
Extension to the correction of XML documents
– The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
– The "horizontal" correction on the level of siblings is similar to the string-to-language correction (Oflazer 1996)
– The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)

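As a small illustration of the validity condition, the sketch below checks a node's sequence of child labels against the example expression E. It assumes single-character labels and reads the "+" in E as union, written `|` in Python's `re` syntax; `children_valid` is a hypothetical helper, not part of the approach presented in the talk.

```python
import re

# E = ab*c + db*  (the "+" of the slide is a union of two alternatives)
E = re.compile(r"ab*c|db*")


def children_valid(labels: str) -> bool:
    """True if the word formed by the children's labels belongs to L(E)."""
    return E.fullmatch(labels) is not None


print(children_valid("abbc"))  # True
print(children_valid("d"))     # True
print(children_valid("ab"))    # False: a horizontal correction would be needed
```

An invalid word such as `ab` is the kind of input on which the string-to-language correction would operate at the level of siblings.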
Main idea
– String-to-string (Wagner & Fischer 1974)
– String-to-(regular) language (Oflazer 1996)
– Tree-to-tree (Selkow 1977)
– Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)

Edit distance matrix with edit sequences
– Cell [i,j] contains the edit distance between the partial trees $A_{0:i}$ and $B_{0:j}$, and the edit sequence necessary to transform $A_{0:i}$ into $B_{0:j}$
[Figure: matrix indexed by i and j whose cells pair a distance with an edit sequence]

Bibliography
– Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
– Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag.
– Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York.
– Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2).
– Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts.
– Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts.
– Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, 6(6).
– Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, 17(5).
– Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1).

Some details of the state of the art
Wagner & Fischer (1974):
– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method
Lowrance & Wagner (1975):
– Additional elementary operation: transposition of two adjacent letters
– Restriction of the cost function
Du & Chang (1992):
– Cost 1 for each elementary operation
– Restriction to linear edit sequences
– Application to the nearest-neighbor search in a dictionary, with a threshold
Oflazer (1996):
– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries
Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
– Tree-to-tree correction problem
Mihov & Schulz (2004):
– Levenshtein automaton
– Backward dictionary
Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
– Incremental validation of XML documents resulting from updates: human-computer interaction