Approximate schemas Michel de Rougemont, LRI, University Paris II
Distances between languages
1. Distance between words (structures), O(1): edit distance with moves
2. Distance between a word (structure) and a class of words (structures), O(1)
3. Distance between two languages (classes), polynomial time
4. Applications: regular languages, DTDs
1. Approximate Satisfiability and Equivalence
1. Satisfiability: Tree |= F
2. Approximate satisfiability: Tree |= F
3. Approximate equivalence between F and G
Image on a class K of trees
Testers on a class K
An ε-tester for a property F is a probabilistic algorithm A such that:
If U |= F, A accepts.
If U is ε-far from F, A rejects with high probability.
Time(A) is independent of n.
A tester usually implies a linear-time corrector.
Self-testers and correctors for linear algebra, Blum & Kannan, 1989
Robust characterizations of polynomials, R. Rubinfeld and M. Sudan, 1994
Testers for graph properties: k-colorability, Goldreich et al.
Graph properties have testers, Alon et al.
Regular languages have testers, Alon et al., 2000s
Testers for regular tree languages, de Rougemont and Magniez, ICALP 2004
2. Equality tester
1. Classical edit distance: insertions, deletions, modifications
2. Edit distance with moves
Edit distance with moves generalizes to trees
Block and uniform statistics
W = ……, length n; subwords of length k; n/k blocks
For k = 2, n/k = 6
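A minimal Python sketch of the two statistics, assuming b-stat is the frequency vector of the n/k consecutive blocks of length k and u-stat is the frequency vector of all n-k+1 overlapping subwords of length k. The function names `b_stat` and `u_stat` and the 12-letter word are hypothetical; the word on the slide is not recoverable.

```python
from collections import Counter

def b_stat(w, k):
    """Block statistics: frequency of each of the len(w)//k consecutive blocks of length k."""
    n_blocks = len(w) // k
    counts = Counter(w[i * k:(i + 1) * k] for i in range(n_blocks))
    return {block: c / n_blocks for block, c in counts.items()}

def u_stat(w, k):
    """Uniform statistics: frequency of each length-k subword over all len(w)-k+1 positions."""
    n_pos = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(n_pos))
    return {sub: c / n_pos for sub, c in counts.items()}

w = "aabbaaccbbaa"    # hypothetical word, n = 12, so k = 2 gives n/k = 6 blocks
print(b_stat(w, 2))   # blocks: aa, bb, aa, cc, bb, aa
print(u_stat(w, 2))   # 11 overlapping subwords of length 2
```

Both vectors sum to 1, so they can be compared with the L1 norm regardless of word length.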
Goal: d_1 approximates the distance
Let ε = 1/k. For n > n_0: dist − ε·n < d_1 < dist + ε·n
Practical application: ε = 10^{-2}, hence k = 100, stat dimension
Words of length n = 10^9: d_1 is approximated from N samples, and the approximation is good after N = O(1/ε^3) trials.
Remarks:
1. Distance with moves: W = 000… …111, W’ = 1111…111000…
2. Robustness to noise: if W, W’ are noisy inputs (but ε-close), the method still works.
3. Random words are close with moves, far without.
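Remark 1 can be checked on a small instance. The sketch below assumes d_1 is the L1 distance between uniform statistics (an assumption about the slide's notation) and uses hypothetical helper names. It compares the classical edit distance, computed by the standard dynamic program, with the statistics distance for W = 0^m 1^m and W’ = 1^m 0^m, which differ by a single move.

```python
from collections import Counter

def u_stat(w, k):
    """Frequency of each length-k subword over all len(w)-k+1 positions."""
    n_pos = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(n_pos))
    return {sub: c / n_pos for sub, c in counts.items()}

def l1(p, q):
    """L1 distance between two sparse frequency vectors."""
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))

def levenshtein(a, b):
    """Classical edit distance (insertions, deletions, substitutions), no moves."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

m, k = 50, 5
w1 = "0" * m + "1" * m
w2 = "1" * m + "0" * m          # w1 after moving one block
d1 = l1(u_stat(w1, k), u_stat(w2, k))
print(d1)                        # small: only subwords crossing the 0/1 boundary differ
print(levenshtein(w1, w2))       # at least m: linear in the length without moves
```

Only the k-1 subwords crossing the boundary differ between the two words, so the statistics distance shrinks as n grows while the classical distance grows linearly.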
Tester for equality of strings
Edit distance with moves: an NP-complete problem, but O(1)-approximable.
Uniform statistics ( ): W =
Theorem 1. |u-stat(w) − u-stat(w’)| approximates dist(w,w’)/n.
Sample N subwords of length k, compute Y(w) and Y(w’):
Theorem 2. Y(w) approximates u-stat(w).
Corollary. |Y(w) − Y(w’)| approximates dist(w,w’)/n.
Tester: if |Y(w) − Y(w’)| < ε, accept; else reject.
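The sampling tester can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: |·| is taken to be the L1 norm, k and N are passed explicitly (the slides suggest k = 1/ε and N = O(1/ε^3)), and the names `sampled_stat` and `equality_tester` are hypothetical.

```python
import random
from collections import Counter

def sampled_stat(w, k, n_samples, rng):
    """Y(w): empirical frequencies of n_samples length-k subwords drawn uniformly from w."""
    n_pos = len(w) - k + 1
    counts = Counter()
    for _ in range(n_samples):
        i = rng.randrange(n_pos)
        counts[w[i:i + k]] += 1
    return {sub: c / n_samples for sub, c in counts.items()}

def l1(p, q):
    """L1 distance between two sparse frequency vectors."""
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))

def equality_tester(w1, w2, eps, k, n_samples, seed=0):
    """Accept (True) if the sampled statistics are within eps in L1 norm, else reject."""
    rng = random.Random(seed)   # fixed seed only to make the demo reproducible
    y1 = sampled_stat(w1, k, n_samples, rng)
    y2 = sampled_stat(w2, k, n_samples, rng)
    return l1(y1, y2) < eps

w1 = "0" * 5000 + "1" * 5000
w2 = "1" * 5000 + "0" * 5000    # w1 after a single move
far = "01" * 5000               # shares no length-4 subword with w1
print(equality_tester(w1, w2, eps=0.1, k=4, n_samples=20000))   # accept expected
print(equality_tester(w1, far, eps=0.1, k=4, n_samples=20000))  # reject expected
```

The running time depends on k and N only, not on the length of the words, which is the point of the O(1) claim.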
3a. Tester for regular words
Definition: L is a regular language and A is an automaton for L. Test whether w is in L.
Admissible Z =
A word W is Z-feasible if there are two states
[Figure: path from init to accept]
Tester for regular words
Tester. Input: W, A, ε
For every admissible path Z: else REJECT.
Theorem: Tester(W, A, ε) is an ε-tester for L(A).
Proof schema of the Tester
Theorem: Regular words are testable.
Robustness lemma: If W is ε-far from L, then for every admissible path Z, there exists such that the number of Z-infeasible subwords
Splitting lemma: If W is far from L, there are many disjoint infeasible subwords.
Amplifying lemma: If there are many infeasible subwords, there are many short ones.
Merging words
Merging lemma: Let Z be an admissible path, and let F be a Z-feasible cut of size h’. Then
[Figure: connected components C]
Take each word and split it along its connected components, removing single letters.
Rearrange all the words of the same component in its Z-order.
Add gluing words to obtain W’ in L:
Splitting
Splitting lemma: If Z is an admissible path and W is a word s.t. dist(W,L) > h, then W has
Proof by contraposition:
3b. Correction in practice: right branch tree 2 moves, dist=2
4. Equivalence testing of Regular Languages
1. Inclusion
2. Equivalence
Equivalence tester
Automata for regular languages
Basic property:
Proposition:
Carathéodory’s theorem: in dimension d, the convex hull of N points can be decomposed into the union of convex hulls of d+1 points.
Large loops can be decomposed. Small loops (of size less than m = |A|) suffice.
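For reference, the standard statement of Carathéodory's theorem can be written as below. The point set P and its elements p_i are notation introduced here, not taken from the slide; the formula assumes the amsmath package.

```latex
\[
  \operatorname{conv}(P) \;=\;
  \bigcup_{\substack{S \subseteq P \\ |S| \le d+1}} \operatorname{conv}(S),
  \qquad P = \{p_1, \dots, p_N\} \subset \mathbb{R}^d .
\]
```

Read for loops: any point in the convex hull of the loop statistics is already in the hull of at most d+1 of them, which is presumably why a bounded family of small loops is enough.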
Approximate Parikh mapping
Lemma: For every X in H, w in L s.t. X. b-stat(w) w
H is a fair representation of L
Construction of H in polynomial time
Enumerate all loops:
The number of b-stat vectors is smaller:
Some loops have the same b-stat: ABBA and BBAA
Number of partitions of a word of length m with "big blocks"
Construct H by matrix iteration:
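A small Python check of the ABBA / BBAA remark, treating A and B as block symbols. The helper name `b_stat_of_blocks` is hypothetical, and the count over all sequences of four blocks is only an illustration of why there are fewer b-stat vectors than loops.

```python
from collections import Counter
from itertools import product

def b_stat_of_blocks(blocks):
    """b-stat of a sequence of blocks: frequency of each block, order ignored."""
    n = len(blocks)
    return {b: c / n for b, c in Counter(blocks).items()}

# Two different loops with the same block statistics.
print(b_stat_of_blocks("ABBA") == b_stat_of_blocks("BBAA"))  # True

# All sequences of 4 blocks over {A, B}: 16 sequences, far fewer distinct b-stats.
sequences = list(product("AB", repeat=4))
distinct = {tuple(sorted(Counter(seq).items())) for seq in sequences}
print(len(sequences), len(distinct))  # 16 5
```

Because b-stat forgets the order of blocks, the number of vectors grows polynomially with the loop length while the number of loops grows exponentially.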
Example
Automaton A:
Blocks, k=2, m=4, |Σ|=4, |Σ|^k + 1 = 17:
Loops: {(aa,ca:1), (bb:2), (cc,ac:3), (dd:4)}
[Figure: automaton A with transitions labeled a, b, c, d; H_A with points aa, ca, ac, cc, bb, dd]
Equivalence tester
Tester for w in L (regular): compute b-stat(w) and H, then decide if dist(w,L) > ε·n.
Time is polynomial in m = |L|. The previous tester was exponential in m.
Tester of
1. Compute H_A and H_B.
2. Reject if H_A and H_B are different.
Time is polynomial in m = |A,B|.
Application: Data Exchange
Source, Target
W = , source. Which structure for the target?
Answer: if the two schemas are close, run a corrector and obtain W’ = , at distance 3.
If the two schemas are not close, there is no guarantee.
This is the general situation for data exchange and query answering.
Conclusion
1. Testers and correctors
2. Constant-time algorithm for edit distance with moves
3a. Testers and correctors for regular words
3b. Tester and corrector for regular trees
4. Equivalence tester for automata: polynomial-time algorithm
Generalization to Büchi automata, context-free and regular tree languages
Generalizations
Büchi automata. Distance on infinite words: two words are ε-close if
A word is ε-close to a language L if there exists w’ in L such that W and w’ are ε-close.
Statistics: set of accumulation points of H: compatible loops of connected components of accepting states
Tester for Büchi automata: compute H_A and H_B; reject if H_A and H_B are different.
Equivalence of CF grammars is undecidable; approximate equivalence is decidable in exponential time.
Soundness and Robustness
Let F be a property on a class K of structures U
F is Equality
Soundness: close structures have close statistics
Robustness: far structures have far statistics
Robustness of b.stat Robustness of b-stat:
Soundness of u-stat:
Simple edit:
Move w = A.B.C.D, w’ = A.C.B.D:
Hence, for ε^2·n operations,
Problem: robustness of u-stat? Harder! It needs an auxiliary distribution and two key lemmas.
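The effect of one move on the uniform statistics can be checked numerically. The bound used below is a direct counting argument and not necessarily the constant on the slide, whose formula is lost: a move creates three cut points, each crossed by at most k-1 windows, so at most 3(k-1) windows disappear and 3(k-1) appear, giving an L1 change of at most 6(k-1)/(n-k+1). The helper names and the random word are hypothetical.

```python
import random
from collections import Counter

def subword_counts(w, k):
    """Number of occurrences of each length-k subword of w."""
    return Counter(w[i:i + k] for i in range(len(w) - k + 1))

def move(w, i, j, l):
    """Move: A.B.C.D -> A.C.B.D with A = w[:i], B = w[i:j], C = w[j:l], D = w[l:]."""
    return w[:i] + w[j:l] + w[i:j] + w[l:]

rng = random.Random(1)                      # fixed seed for a reproducible demo
n, k = 2000, 10
w = "".join(rng.choice("ab") for _ in range(n))
i, j, l = sorted(rng.sample(range(1, n), 3))
w2 = move(w, i, j, l)

c1, c2 = subword_counts(w, k), subword_counts(w2, k)
# Counter returns 0 for missing keys, so this is the L1 distance on raw counts.
diff = sum(abs(c1[u] - c2[u]) for u in set(c1) | set(c2))
print(diff <= 6 * (k - 1))                  # True: only windows crossing a cut change
print(diff / (n - k + 1))                   # change in u-stat, L1 norm
```

Applying the same argument to a sequence of t moves bounds the change by 6t(k-1)/(n-k+1), so words that are close for the distance with moves have close statistics.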
Block Uniform Statistics Lemma 1:
Uniform Statistics A B Lemma 2:
Robustness of the uniform Statistics Robustness of u-stat: By Lemma 1: By Lemma 3:
Tester for the distance with moves
NP-complete problem, but O(1)-approximable.
Approximate u-stat: sample N subwords of length k, compute Y.
Y is a good approximation of u-stat (Chernoff).
The uniform statistics are a good approximation of the distance, by soundness and robustness.
Tester: if Y < ε·n, accept; else reject.