1 Approximate Schemas and Data Exchange Michel de Rougemont University Paris II & LRI Joint work with Adrien Vielleribière, University Paris-South.

Slides:



Advertisements
Similar presentations
1 Property testing and learning on strings and trees Michel de Rougemont University Paris II & LRI Joint work with E. Fischer, Technion, F. Magniez, LRI.
Advertisements

Property Testing of Data Dimensionality Robert Krauthgamer ICSI and UC Berkeley Joint work with Ori Sasson (Hebrew U.)
Approximate List- Decoding and Hardness Amplification Valentine Kabanets (SFU) joint work with Russell Impagliazzo and Ragesh Jaiswal (UCSD)
Deterministic vs. Non-Deterministic Graph Property Testing Asaf Shapira Tel-Aviv University Joint work with Lior Gishboliner.
Gillat Kol joint work with Ran Raz Locally Testable Codes Analogues to the Unique Games Conjecture Do Not Exist.
Property testing of Tree Regular Languages Frédéric Magniez, LRI, CNRS Michel de Rougemont, LRI, University Paris II.
Lecture 24 MAS 714 Hartmut Klauck
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen Department of Computer Science University of Texas-Pan American.
1 Markov Decision Processes: Approximate Equivalence Michel de Rougemont Université Paris II & LRI
1 Polynomial Time Probabilistic Learning of a Subclass of Linear Languages with Queries Yasuhiro TAJIMA, Yoshiyuki KOTANI Tokyo Univ. of Agri. & Tech.
Christian Sohler | Every Property of Hyperfinite Graphs is Testable Ilan Newman and Christian Sohler.
Artur Czumaj Dept of Computer Science & DIMAP University of Warwick Testing Expansion in Bounded Degree Graphs Joint work with Christian Sohler.
Gillat Kol joint work with Ran Raz Locally Testable Codes Analogues to the Unique Games Conjecture Do Not Exist.
Computability and Complexity 5-1 Classifying Problems Computability and Complexity Andrei Bulatov.
Complexity 12-1 Complexity Andrei Bulatov Non-Deterministic Space.
1 Introduction to Computability Theory Lecture12: Decidable Languages Prof. Amos Israeli.
Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas Dana Ron Alex Samorodnitsky.
Testing the Diameter of Graphs Michal Parnas Dana Ron.
Sparse Random Linear Codes are Locally Decodable and Testable Tali Kaufman (MIT) Joint work with Madhu Sudan (MIT)
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005
An Information-Theoretic Approach to Normal Forms for Relational and XML Data Marcelo Arenas Leonid Libkin University of Toronto.
Testing of Clustering Noga Alon, Seannie Dar Michal Parnas, Dana Ron.
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz.
Exact Learning of Boolean Functions with Queries Lisa Hellerstein Polytechnic University Brooklyn, NY AMS Short Course on Statistical Learning Theory,
Michael Bender - SUNY Stony Brook Dana Ron - Tel Aviv University Testing Acyclicity of Directed Graphs in Sublinear Time.
Testing Metric Properties Michal Parnas and Dana Ron.
EXPANDER GRAPHS Properties & Applications. Things to cover ! Definitions Properties Combinatorial, Spectral properties Constructions “Explicit” constructions.
On Testing Convexity and Submodularity Michal Parnas Dana Ron Ronitt Rubinfeld.
Normal forms for Context-Free Grammars
Theory of Computing Lecture 22 MAS 714 Hartmut Klauck.
Some 3CNF Properties are Hard to Test Eli Ben-Sasson Harvard & MIT Prahladh Harsha MIT Sofya Raskhodnikova MIT.
Final Exam Review Cummulative Chapters 0, 1, 2, 3, 4, 5 and 7.
Sublinear time algorithms Ronitt Rubinfeld Computer Science and Artificial Intelligence Laboratory (CSAIL) Electrical Engineering and Computer Science.
Correlation testing for affine invariant properties on Shachar Lovett Institute for Advanced Study Joint with Hamed Hatami (McGill)
Approximating the MST Weight in Sublinear Time Bernard Chazelle (Princeton) Ronitt Rubinfeld (NEC) Luca Trevisan (U.C. Berkeley)
Japanjune The correction of XML data Université Paris II & LRI Michel de Rougemont 1.Approximation and Edit Distance.
Graph Coalition Structure Generation Maria Polukarov University of Southampton Joint work with Tom Voice and Nick Jennings HUJI, 25 th September 2011.
1 Approximate Satisfiability and Equivalence Michel de Rougemont University Paris II & LRI Joint work with E. Fischer, Technion, F. Magniez, LRI, LICS.
1 Sublinear Algorithms Lecture 1 Sofya Raskhodnikova Penn State University TexPoint fonts used in EMF. Read the TexPoint manual before you delete this.
On Learning Regular Expressions and Patterns via Membership and Correction Queries Efim Kinber Sacred Heart University Fairfield, CT, USA.
Approximate schemas Michel de Rougemont, LRI, University Paris II.
Resource bounded dimension and learning Elvira Mayordomo, U. Zaragoza CIRM, 2009 Joint work with Ricard Gavaldà, María López-Valdés, and Vinodchandran.
Lower Bounds for Property Testing Luca Trevisan U.C. Berkeley.
1 Design and Analysis of Algorithms Yoram Moses Lecture 11 June 3, 2010
1/19 Minimizing weighted completion time with precedence constraints Nikhil Bansal (IBM) Subhash Khot (NYU)
Approximate schemas Michel de Rougemont, LRI, University Paris II Joint work with E. Fischer, Technion, F. Magniez, LRI.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
1 Approximate Data Exchange Michel de Rougemont Adrien Vieilleribière University Paris II & LRI University Paris-Sud & LRI ICDT 2007.
An Introduction to Rabin Automata Presented By: Tamar Aizikowitz Spring 2007 Automata Seminar.
Probabilistic Automaton Ashish Srivastava Harshil Pathak.
狄彥吾 (Yen-Wu Ti) 華夏技術學院資訊工程系 Property Testing on Combinatorial Objects.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Moment Problem and Density Questions Akio Arimoto Mini-Workshop on Applied Analysis and Applied Probability March 24-25,2010 at National Taiwan University.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
CS 154 Formal Languages and Computability May 12 Class Meeting Department of Computer Science San Jose State University Spring 2016 Instructor: Ron Mak.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
1 CS 391L: Machine Learning: Computational Learning Theory Raymond J. Mooney University of Texas at Austin.
On Sample Based Testers
Probabilistic Algorithms
Stochastic Streams: Sample Complexity vs. Space Complexity
Approximating the MST Weight in Sublinear Time
Lecture 18: Uniformity Testing Monotonicity Testing
Warren Center for Network and Data Sciences
Lecture 10: Sketching S3: Nearest Neighbor Search
CIS 700: “algorithms for Big Data”
Locally Decodable Codes from Lifting
Approximate Validity of XML Streaming Data
Algebraic Property Testing
CS21 Decidability and Tractability
Every set in P is strongly testable under a suitable encoding
Presentation transcript:

1 Approximate Schemas and Data Exchange Michel de Rougemont University Paris II & LRI Joint work with Adrien Vielleribière, University Paris-South

2 1.Classical Data Exchange on words and trees 2.Approximation based on Property Testing 3.Tester for regular words and regular trees with the Edit Distance with Moves 4.Approximate Data Exchange 5. Composition of Data Exchange setting Plan

3 1. Data Exchange on Words and Trees *1* ? c(ab)*ca* Sources Targets

4 Transducer is an XSLT program: Transducers transform the data *1* cabababcaaaa c(ab)*ca* ….. ….. 0:ab 1:a abababaaaa c(ab)*ca*

5 Data Exchange setting: (K S,τ,K T ): Fagin et al. 2002: τ defined by Source-Target-Dependencies on relations Libkin et al. 2005: τ defined by Tree-Pattern-Formulas on trees 1.Source-Consistency: Given a source structure I in K S, is there a target J in K T s.t. (I,J) in τ ? 2.Typechecking: decide if for all I in K S, there is a target J in K T s.t. (I,J) in τ. 3.Composition of settings ? Main Problems

6 Data Exchange setting: (K S,τ,K T ), where τ is a transducer : ε-Source-Consistency: Given a source structure I, is there a source I’ ε-close to K S s.t. τ(I) ε-close to K T ? ε-Typechecking: decide if for all I in K S, τ(I) is ε-close to K T. ε-Composition of settings. Approximate Data Exchange

7 Let F be a property on a class K of structures U An ε -tester for F is a probabilistic algorithm A such that: If U |= F, A accepts If U is ε far from F, A rejects with high probability A property F is testable if there exists a probabilistic algorithm A s.t. For all ε it is an ε -tester for F Time(A) independent of n. Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994 O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996.Property Testing and its connection to Learning and Approximation Tester usually implies a linear time corrector. (ε 1, ε 2 )- Tolerant Tester. 2. Property Testing

8 1.Satisfiability : T |= F 2.Approximate Satisfiability T |= F 3.Approximate Equivalence Image on a class K of trees Approximate Satisfiability and Equivalence G

9 History of Testers Self-testers and correctors for Linear Algebra,Blum & Kanan 1989 Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994 Testers for graph properties : k-colorability, Goldreich and al Regular languages have testers, Alon et al. 2000s Testers for Regular tree languages, Mdr and Magniez, 2004 Charaterization of testable properties on graphs, Alon et al New areas: Sublinear algorithms, Approximation of decision problems

10 1.Classical Edit Distance: Insertions, Deletions, Modifications 2.Edit Distance with moves Edit Distance with Moves generalizes to Ordered Trees Edit Distances with Moves

11 Uniform Statistics W= length n, n-k+1 blocks of length k=1/ε For k=2, n-k+1=11 Distance between words: NP-complete Testable, O(1): Sample N subwords of length k: Y(W) and Y(W’) If |Y(w)-Y(w’)| <ε. accept, else reject

12 3. Tester for a regular language W: Y: Z: T: ab H A T Y WZWZ Automaton A defines L, and a polytope H for u.stats Tester W in L: Testable, O(1): compute Y(W), If dist(Y(w),H) <ε. accept, else reject Remark: robust to noise.

13 Pair (A,H) Blocks, k=2, m=4, | Σ |=4, | Σ| k +1=17: Boucles de taille 1 bloc: {(aa,ca:1),(bb,2),(cc,ac:3),(dd:4)} a b b c a c d d aa ca H A ac cc bb dd

14 Corrector of a regular language W: is ε –close to L(A) Deterministic Correction: 1.Decomposition in admissible subwords: Decomposition in connected components Recomposition (Moves) distance 3 from W ab A

15 Corrector of an ordered tree 2 moves, dist=2 Automate d’arbre ou DTD: t: l,r r: l,r

16 XML Corrector:

17 Applications Testers: Estimate the distance between two XML files, Décide if an XML F is ε-valid, Décide if two DTDs are close. Correctors: If an XML file F is ε-close from a DTD, Find a valid F’ ε-close to F; Rank XML files for a set of DTD’s (supervised learning) Program Verification: Decide if two automata are ε-close in polynomial time. Approximate Model-Checking: Specification language Model Distance

18 Data Exchange setting: (K S,τ,K T ), where τ is a transducer : ε-Typechecking: decide if for I in K S, τ(I) ε-close to K T. Words: τ(K S ) ε-close to K T ? Apply the Equivalence Tester in polynomial time, as τ(K S ) is regular. Trees: Similar technique, exponential in |DTD|. Open problem: Is DTD ε- Equivalence in P ? 4. Approximate Data Exchange: typechecking

19 Data Exchange setting: (K S,τ,K T ), where τ is a transducer : ε-Source-Consistency: Given a source structure I, is there a source I’ ε-close to K S s.t. τ(I) ε-close to K T ? Words: Case 1: Transducer with one state. Approximate Data Exchange: Source-Consistency Sample I Image by τ, Y(τ(I)) Statistics : test if Y(τ(I)) is ε-close to K T. Case 2: Transducer with many states. Distinguish between compatible paths.

20 Data Exchange setting: (K S,τ,K T ), where τ is a transducer : ε-Source-Consistency on trees: Approximate Data Exchange: Source-Consistency Sampling in T provides Statistics on τ(T). Apply tester on trees.

21 5. Composition of close settings Data Exchange settings: (K S1,τ,K T1 ), (K S2,τ’,K T2 ): Possible when the schemas are ε-close. Apply corrector at every stage to define the new τ’’ for (K S1,τ’’,K T2 ): Apply corrector to τ (I) and obtain C 1. τ (I) in K T1 then the corrector C for K S2 then τ’ then the corrector C 2 for K T2 : τ’’: C 2. τ’. C. C 1. τ (I) I τ (I)

22 Conclusion 1.Data Exchange: Source-Consistency, Typechecking, Composition. 2.Property Testing based Approximation 3.Tester and Corrector of regular languages 4.Equivalence tester for automata Polynomial time approximate algorithm (PSPACE-complete) Generalization to Buchi automata : approximate Model-Checking Context-Free Languages: exponential algorithm (undecidable problem) 5.Approximate Data Exchange 6.Connection to PAC-Learning

23 Application to learning Model: take random words according to a distribution D: U.stat representation: Negative examples could include the distance. Learning algorithm: convex hulls of positive examples.

24 PAC learning The regular language is a polytope for u.stat. Polytopes have a finite VC dimension. Hence they are PAC learnable. Problem: the learnt concept may be ε-far from the language L. For special distributions D, it may be ε-close. Example: D is uniform and the polytopes are « large ».

25 Block and Uniform statistics W= length n, b.stat: consecutive subwords of length k, n/k blocks u.stat: any subwords of length k, n-k+1 blocks For k=2, n/k=6

26 Tester for equality of strings Edit distance with moves. NP-complete problem, but approximable in constant time with additive error. Uniform statistics ( ): W= Theorem 1. |u.stat(w)-ustat(w’)| approximates dist(w,w’). Sample N subwords of length k, compute Y(w) and Y(w’): Lemma (Chernoff). Y(w) approximates u.stat(w). Corollary. |Y(w)-Y(w’)| approximates dist(w,w’). Tester: If |Y(w)-Y(w’)| <ε. accept, else reject.

27 Let F be a property on strings. Soundness: ε-close strings have close statistics Robustness: ε-far strings have far statistics F is Equality on pairs of strings. For theorem 1, we prove: 1.b.stat is robust 2.u.stat is sound 3.u.stat is robust Soundness and Robustness

28 Robustness of b.stat Robustness of b-stat:

29 Soundness of u.stat Soundness of u-stat: Simple edit: Move w=A.B.C.D, w’=A.C.B.D: Hence, for ε 2.n operations, Remark: b.stat is not sound. Problem: robustness of u.stat ? Harder! We need an auxiliary distribution and two key lemmas.

30 Statistics on words k k K t k-t Block statistics: b.stat Uniform statistics: u.stat Block Uniform statistics: bu.stat

31 Uniform Statistics A B Lemma 2:

32 Block Uniform Statistics Lemma 1:

33 Robustness of the uniform Statistics Robustness of u-stat: By Lemma 1: By Lemma 2: Tolerant tester: Theorem: for two words w and w’ large enough, the tester: 1.Accepts if w=w’ with probability 1 2.Accepts if w,w’ are ε 2 -close with probability 2/3 3.Rejects if w,w’ are ε-far with probability 2/3

34 Membership and Equivalence tester Membership Tester for w in L (regular): 1. Construction of the tester: Precompute H ε 2. Tester: Compute Y(w) (approx. b.stat(w)). Accept iff Y(w) is at distance less than ε to H ε Construction: Time is Tester: query complexity in time complexity in Remark 1: Time complexity of previous testers was exponential in m. Remark 2: The same method works for L context-free. Tester of 1.Compute H ε,A and H ε,B 2.Reject if H ε,A and H ε,B are different. Time polynomial in m=Max(|A |, |B |):

35 Generalizations Buchi Automata. Distance on infinite words: Two words are ε-close if A word is ε-close to a language L if there exists w’ in L s. t. W and w’ are ε-close. Statistics: set of accumulation points of H: compatible loops of connected components of accepting states Tester for Buchi Automata: Compute H A and H B Reject if H A and H B are different. Equivalence of CF grammars is undecidable, Approximate equivalence in exponential.