Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents. Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois


Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d'Informatique
IPIPAN Seminar, 24 April 2006

String-to-string correction

Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
CONTEXT:
- Finite set of symbols (alphabet)
- Elementary operations on symbols (edit operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
- Sequences of edit operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (the sum of the costs of the edit operations involved)
- Measure of similarity between words A and B (edit distance or error distance): the minimum cost over all edit sequences transforming A into B
INPUT:
- Two words A and B
OUTPUT:
- Distance between A and B

Examples of elementary edit operations
- Insertion of a letter: monter → montaer, monter → montrer
- Deletion of a letter: monter → montr, monter → monte
- Replacement of a letter by another: monter → ponter, monter → conter
- Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
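The four elementary operations above can be sketched as a small generator of all words reachable in exactly one operation. This is only an illustration: the function name `single_edits` and the choice of lowercase ASCII as the alphabet are assumptions, not part of the slides.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return all strings obtained from `word` by one elementary operation.

    Hypothetical helper: the alphabet defaults to lowercase ASCII letters.
    """
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:k], word[k:]) for k in range(len(word) + 1)]
    insertions = {left + c + right for left, right in splits for c in alphabet}
    deletions = {left + right[1:] for left, right in splits if right}
    replacements = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1 and right[0] != right[1]
    }
    return insertions | deletions | replacements | transpositions


neighbours = single_edits("monter")
print("montre" in neighbours, "ponter" in neighbours)
```

All eight example words from the slide (montaer, montrer, montr, monte, ponter, conter, mnoter, montre) belong to the returned set, while monter itself does not.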

Edit sequence
Edit sequence = sequence of elementary edit operations.
For each pair of words X and Y, many edit sequences exist that transform X into Y.
Example 1: transforming sorting into string:
- sorting → srting → sting → string (3 operations)
- sorting → sotring → string (2 operations)
- sorting → srting → string (2 operations)
- sorting → strting → string (2 operations)
- sorting → srting → sting → sing → sring → string (5 operations)
Example 2: transforming abc into ca:
- abc → ac → ca (2 operations)
- abc → cabc → cac → ca (3 operations; linear sequence)
From now on, we will be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no later operation may alter the result of a previous operation.

Edit (error) distance
Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence:
- sorting → srting → sting → string (3 operations) → cost = 3
- sorting → sotring → string (2 operations) → cost = 2
- sorting → srting → sting → sing → sring → string (5 operations) → cost = 5
Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
- ed(sorting, string) = 2
- ed(abc, ca) = 2, if all edit sequences are taken into account
- ed(abc, ca) = 3, if only the linear edit sequences are taken into account

Calculating the edit distance (1/4)
Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X is X[i] = x_1 x_2 ... x_i.
The distance between two prefixes X[i+1] and Y[j+1] can be calculated from the distances between shorter prefixes. There are 3 cases.
Case 1: if x_{i+1} = y_{j+1}, then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]).

Calculating the edit distance (2/4)
Case 2: if x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be transposed), then 4 sub-cases are possible:
- The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (the transposition's cost)
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (the replacement's cost)
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (the insertion's cost)
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (the deletion's cost)

Calculating the edit distance (3/4)
Case 3: otherwise (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)), 3 sub-cases are possible:
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (the replacement's cost)
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (the insertion's cost)
- The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (the deletion's cost)

Calculating the edit distance (4/4)
Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,m and j = 0,...,n:
1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n)
2° ed(X[0],Y[j]) = j
   ed(X[i],Y[0]) = i
3° ed(X[i+1],Y[j+1]) =
   ed(X[i],Y[j])                                                              if x_{i+1} = y_{j+1}
   1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]), ed(X[i-1],Y[j-1]) }   if x_i = y_{j+1} and x_{i+1} = y_j
   1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]) }                      otherwise
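The recursive definition above can be sketched as a dynamic programming routine. This is a minimal illustration under unit costs, not the author's implementation: the name `edit_distance` is an assumption, and an explicit index guard stands in for the boundary rule 1° (the max(m,n) values for index -1).

```python
def edit_distance(x, y):
    """Edit distance restricted to linear edit sequences, unit costs.

    Operations: insertion, deletion, replacement, and transposition of
    two adjacent letters.
    """
    m, n = len(x), len(y)
    # ed[i][j] holds the distance between the prefixes x[:i] and y[:j].
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ed[i][0] = i          # delete all i letters
    for j in range(n + 1):
        ed[0][j] = j          # insert all j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Case 1: equal last letters, nothing to pay.
                ed[i][j] = ed[i - 1][j - 1]
                continue
            # Case 3: replacement, insertion, deletion.
            best = min(ed[i - 1][j - 1], ed[i][j - 1], ed[i - 1][j])
            # Case 2: the last two letters are transposed.
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, ed[i - 2][j - 2])
            ed[i][j] = 1 + best
    return ed[m][n]


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3
```

The two printed values agree with the slides: ed(sorting, string) = 2, and ed(abc, ca) = 3 when only linear edit sequences are allowed.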

Calculating the edit distance: dynamic programming
- Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word.
- Cell [n,m] contains the edit distance between the 2 words.
[Figure: empty matrix with columns ε, s, o, r, t, i, n, g (sorting) and rows ε, s, t, r, i, n, g (string)]

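Dynamic programming: case 1 (x_{i+1} = y_{j+1})

     ε  s  o  r  t  i  n  g
  ε  0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  ?  ?  ?  ?  ?  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?

The cell in row t, column t falls under case 1 (both letters are t): it copies its diagonal neighbour, 2.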

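Dynamic programming: case 2 (x_i = y_{j+1} and x_{i+1} = y_j)

     ε  s  o  r  t  i  n  g
  ε  0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?

The cell in row r, column t falls under case 2 (tr against rt): its value is 1 plus the smallest of its three neighbours and the cell two steps up the diagonal, i.e. 1 + 1 = 2.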

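Dynamic programming: case 3 (x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j))

     ε  s  o  r  t  i  n  g
  ε  0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  4  3  3  2  2  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?

The cell in row i, column t falls under case 3: its value is 1 plus the smallest of its three neighbours, i.e. 1 + 1 = 2.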

String-to-language correction

String-to-language correction: problem definition
CONTEXT:
- Finite set of symbols (alphabet)
- Elementary edit operations on symbols (as before) with their costs (1 per operation)
- Edit sequences (as before)
- Edit distance (error distance) between words: as before
INPUT:
- Regular grammar describing words (a finite set of words in particular)
- Incorrect word A (not recognized by the grammar)
- Threshold t
OUTPUT:
- A set of correct words B_1, B_2, …, B_n whose distance from A stays within t (the nearest neighbors of A)

String-to-language correction: simplistic approach
METHOD:
- For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
- Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
FEASIBILITY:
- Impossible in the case of infinite languages
COMPLEXITY: O(n * m * |D|)
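The simplistic method can be sketched as a loop over a finite word list, with one full matrix per word. The names `correct_naively` and `lexicon`, and the three-word lexicon used in the example, are assumptions made for illustration; the distance routine follows the recursive definition given earlier in the talk.

```python
def edit_distance(x, y):
    """Unit-cost edit distance with adjacent transpositions (linear sequences)."""
    m, n = len(x), len(y)
    ed = [[max(i, j) if 0 in (i, j) else 0 for j in range(n + 1)]
          for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
                continue
            best = min(ed[i - 1][j - 1], ed[i][j - 1], ed[i - 1][j])
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, ed[i - 2][j - 2])
            ed[i][j] = 1 + best
    return ed[m][n]


def correct_naively(word, lexicon, t):
    """Return (distance, candidate) pairs with distance <= t, nearest first.

    One complete matrix is computed for every word of the lexicon, so no
    work is shared between words with a common prefix.
    """
    scored = ((edit_distance(word, b), b) for b in lexicon)
    return sorted((d, b) for d, b in scored if d <= t)


# Hypothetical lexicon, chosen to match the *aply example of the talk.
print(correct_naively("aply", ["apple", "apples", "apply"], 2))
```

This only works for a finite lexicon, which is the feasibility limit named on the slide.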

String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA over the letters a, p, l, y, e, s; the path a-p-p-l-e is being followed]
- Each time a transition is followed, a new column is calculated in the edit distance matrix.
- The part of the matrix that corresponds to the prefix appl is calculated only once for all valid words sharing that prefix.
- If we reach a final state and the edit distance remains within the threshold, a new candidate has been found.

     ε  a  p  p  l  e
  ε  0  1  2  3  4  5
  a  1  0  1  2  3  4
  p  2  1  0  1  2  3
  l  3  2  1  1  1  2
  y  4  3  2  2  2  2

Candidate found: apple

String correction with respect to a deterministic FSA (2/4)
Word to be corrected: *aply, threshold 2
Following the transition s adds one more column. Its bottom cell is 3, which exceeds the threshold, so this path yields no new candidate.

     ε  a  p  p  l  e  s
  ε  0  1  2  3  4  5  6
  a  1  0  1  2  3  4  5
  p  2  1  0  1  2  3  4
  l  3  2  1  1  1  2  3
  y  4  3  2  2  2  2  3

Candidates so far: apple

String correction with respect to a deterministic FSA (3/4)
Word to be corrected: *aply, threshold 2
Backtracking results in deleting the current column.
Candidates so far: apple

String correction with respect to a deterministic FSA (4/4)
Word to be corrected: *aply, threshold 2
After backtracking to the prefix appl, the transition y is followed and a new column is calculated. A final state is reached with distance 1, so a new candidate has been found.

     ε  a  p  p  l  y
  ε  0  1  2  3  4  5
  a  1  0  1  2  3  4
  p  2  1  0  1  2  3
  l  3  2  1  1  1  2
  y  4  3  2  2  2  1

Candidates found: apple, apply
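The four steps above can be sketched as a depth-first walk that carries one matrix column per transition. This is a simplified illustration, not Oflazer's algorithm as published: a trie of nested dictionaries stands in for the deterministic FSA, a path is cut off when the smallest value of its column exceeds the threshold, and the names `build_trie` and `correct` are assumptions.

```python
def build_trie(words):
    """Build a trie (a deterministic acyclic FSA) as nested dicts.

    The empty-string key marks a final state.
    """
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[""] = True
    return root


def correct(word, trie, t):
    """Return sorted (candidate, distance) pairs with distance <= t."""
    n = len(word)
    candidates = []

    def explore(node, path, col, prev_col):
        # col[i] = distance between word[:i] and the path followed so far.
        if "" in node and col[n] <= t:
            candidates.append((path, col[n]))
        for letter, child in node.items():
            if letter == "":
                continue
            # Following a transition adds one column to the matrix.
            new_col = [col[0] + 1]
            for i in range(1, n + 1):
                if word[i - 1] == letter:
                    new_col.append(col[i - 1])
                    continue
                best = min(col[i - 1], col[i], new_col[i - 1])
                if i > 1 and path and word[i - 2] == letter and word[i - 1] == path[-1]:
                    best = min(best, prev_col[i - 2])  # transposition
                new_col.append(1 + best)
            # Cut the whole path off once the column exceeds the threshold.
            if min(new_col) <= t:
                explore(child, path + letter, new_col, col)
            # Returning from the call is the backtracking step: new_col is dropped.

    explore(trie, "", list(range(n + 1)), None)
    return sorted(candidates)


trie = build_trie(["apple", "apples", "apply"])
print(correct("aply", trie, 2))  # [('apple', 2), ('apply', 1)]
```

The columns for the shared prefix appl are computed once and reused for apple, apples, and apply, which is the saving over the simplistic approach.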

Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
If the current column exceeds the threshold, the whole path is cut off.
[Figure: FSA fragment and the edit distance matrix for the word abcbb along the path abbbbbb]

Tree-to-tree correction

Tree-to-tree correction (Selkow 1977, …)
CONTEXT:
- Finite set of node symbols (alphabet)
- Elementary edit operations on trees:
  - Insertion of a leaf
  - Deletion of a leaf
  - Renaming of a node (leaf or internal node)
- Non-negative cost for each elementary operation
- Edit sequences (sequences of edit operations) with their costs (the sum of the costs of the edit operations involved)
- Edit distance between two trees A and B: the minimum cost over all edit sequences transforming A into B
INPUT:
- Two trees A and B
OUTPUT:
- Distance between A and B

Comparing two trees (Selkow 1977, …)
- A partial tree A⟨0:i⟩ consists of the root of A and its subtrees A_0, ..., A_i.
- The comparison is based on comparing the roots, and then recursively comparing the roots' subtrees.
[Figure: tree A with root(A) and subtrees A_0, A_1, A_2; tree B with root(B) and subtrees B_0, B_1, B_2, B_3; the partial trees A⟨0:1⟩ and B⟨0:2⟩ are marked]

Edit distance matrix between two trees (Selkow 1977, …)
- Cell [-1,-1] contains the cost of renaming root(A) into root(B).
- Cell [i,j] contains the edit distance between the partial trees A⟨0:i⟩ and B⟨0:j⟩.
- Cell [n,m] contains the edit distance between the 2 trees.

Calculation of the tree matrix (Selkow 1977, …)
The value of cell [i,j] is obtained from three candidates:
- Adding the cost of inserting B_j (here +1)
- Adding the edit distance between A_i and B_j (here +0)
- Adding the cost of deleting A_i (here +1)
Taking the minimum: here min(4+0, 5+1, 4+1) = 4.
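The matrix calculation above can be sketched recursively. This is an illustration under assumptions that are not in the slides: trees are `(label, children)` pairs, every elementary operation costs 1 (so inserting or deleting a whole subtree costs its number of nodes), and matrix indices are shifted by one, so that `d[0][0]` plays the role of cell [-1,-1].

```python
def size(tree):
    """Number of nodes: the cost of inserting or deleting the whole subtree
    leaf by leaf under unit costs."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(a, b):
    """Selkow-style edit distance between two ordered labelled trees."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    m, n = len(kids_a), len(kids_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Cost of renaming root(A) into root(B).
    d[0][0] = 0 if label_a == label_b else 1
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])   # delete A's subtrees
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])   # insert B's subtrees
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
                d[i][j - 1] + size(kids_b[j - 1]),    # insert subtree of B
                d[i - 1][j] + size(kids_a[i - 1]),    # delete subtree of A
            )
    return d[m][n]


a = ("a", [("b", [("d", [])])])
b = ("a", [("c", [])])
print(tree_distance(a, b))  # 2: rename b into c, delete the leaf d
```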

Extension to the correction of XML documents
- The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
- The "horizontal" correction on the level of siblings is similar to the string-to-language correction (Oflazer 1996)
- The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
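As a small illustration of the validity condition, the example expression E = ab*c + db* can be checked against a word of node labels. The sketch assumes that the word being tested is the sequence of labels of a node's children, read left to right, and that each label is a single letter; both are assumptions made for this example, as are the names `E` and `is_valid`.

```python
import re

# E = ab*c + db*, with "+" read as union, written in Python regex syntax.
E = re.compile(r"ab*c|db*")


def is_valid(child_labels):
    """True if the word of child labels belongs to the language of E."""
    return E.fullmatch(child_labels) is not None


for word in ["abbc", "ac", "d", "dbb", "abd"]:
    print(word, is_valid(word))
```

A word such as abd is rejected, and correcting it on the level of siblings amounts to searching the language of E for words within a given edit distance of abd.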

Main idea
- String-to-string (Wagner & Fischer 1974)
- String-to-(regular) language (Oflazer 1996)
- Tree-to-tree (Selkow 1977)
- Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld; to be continued)

Edit distance matrix with edit sequences
Cell [i,j] contains the edit distance between the partial trees A⟨0:i⟩ and B⟨0:j⟩, and the edit sequence necessary to transform A⟨0:i⟩ into B⟨0:j⟩.
[Figure: matrix whose cells hold a distance together with an edit sequence]

Bibliography
- Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
- Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag.
- Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York.
- Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2).
- Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts.
- Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts.
- Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, 6(6).
- Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, 17(5).
- Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1).

Some details of the state of the art
Wagner & Fischer (1974):
- Elegant and solid theoretical definition of the string-to-string correction problem
- 3 elementary operations on single letters admitted (insertion, deletion, replacement)
- Model of a trace describing the edit distance between two strings
- Dynamic programming method
Lowrance & Wagner (1975):
- Additional elementary operation: transposition of two adjacent letters
- Restriction of the cost function
Du & Chang (1992):
- Cost 1 for each elementary operation
- Restriction to linear edit sequences
- Application to the nearest-neighbor search in a dictionary, with a threshold
Oflazer (1996):
- Nearest-neighbor search in finite-state automata
- Application to large natural-language dictionaries
Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
- Tree-to-tree correction problem
Mihov & Schulz (2004):
- Levenshtein automaton
- Backward dictionary
Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
- Incremental validation of XML documents resulting from updates: human-computer interaction