Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart BDA 2011.

Slides:



Advertisements
Similar presentations
Heuristic Search techniques
Advertisements

XDuce Tabuchi Naoshi, M1, Yonelab.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Fast Algorithms For Hierarchical Range Histogram Constructions
Minimizing Seed Set for Viral Marketing Cheng Long & Raymond Chi-Wing Wong Presented by: Cheng Long 20-August-2011.
Dynamic Bayesian Networks (DBNs)
Efficient Query Evaluation on Probabilistic Databases
1 Statistical NLP: Lecture 12 Probabilistic Context Free Grammars.
Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]
Hidden Markov Models Ellen Walker Bioinformatics Hiram College, 2008.
Infinite Horizon Problems
Introduction to Computability Theory
1 Introduction to Computability Theory Lecture12: Decidable Languages Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture2: Non Deterministic Finite Automata Prof. Amos Israeli.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
1 Introduction to Computability Theory Lecture2: Non Deterministic Finite Automata (cont.) Prof. Amos Israeli.
61 Nondeterminism and Nodeterministic Automata. 62 The computational machine models that we learned in the class are deterministic in the sense that the.
Probabilistic IR Models Based on probability theory Basic idea : Given a document d and a query q, Estimate the likelihood of d being relevant for the.
1 Draft of a Matchmaking Service Chuang liu. 2 Matchmaking Service Matchmaking Service is a service to help service providers to advertising their service.
CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz.
Containment and Equivalence for an XPath Fragment By Gerom e Mikla Dan Suciu Presented By Roy Ionas.
25/06/2015Marius Mikucionis, AAU SSE1/22 Principles and Methods of Testing Finite State Machines – A Survey David Lee, Senior Member, IEEE and Mihalis.
Complexity of Mechanism Design Vincent Conitzer and Tuomas Sandholm Carnegie Mellon University Computer Science Department.
BİL744 Derleyici Gerçekleştirimi (Compiler Design)1.
(Some issues in) Text Ranking. Recall General Framework Crawl – Use XML structure – Follow links to get new pages Retrieve relevant documents – Today.
Flavio Lerda 1 LTL Model Checking Flavio Lerda. 2 LTL Model Checking LTL –Subset of CTL* of the form: A f where f is a path formula LTL model checking.
Distributed Constraint Optimization * some slides courtesy of P. Modi
Reverse Engineering State Machines by Interactive Grammar Inference Neil Walkinshaw, Kirill Bogdanov, Mike Holcombe, Sarah Salahuddin.
NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.
XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries on them – Keeping in mind the practical aspects but.
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Querying Structured Text in an XML Database By Xuemei Luo.
Querying Business Processes Under Models of Uncertainty Daniel Deutch, Tova Milo Tel-Aviv University ERP HR System eComm CRM Logistics Customer Bank Supplier.
XML Data Management 10. Deterministic DTDs and Schemas Werner Nutt.
Design of an Evolutionary Algorithm M&F, ch. 7 why I like this textbook and what I don’t like about it!
1 N -Queens via Relaxation Labeling Ilana Koreh ( ) Luba Rashkovsky ( )
Managing XML and Semistructured Data Lecture 13: XDuce and Regular Tree Languages Prof. Dan Suciu Spring 2001.
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
PMIT-6101 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Facilitating Document Annotation using Content and Querying Value.
Introduction to Parsing
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart.
Huffman Codes Juan A. Rodriguez CS 326 5/13/2003.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
ARTIFICIAL INTELLIGENCE (CS 461D) Princess Nora University Faculty of Computer & Information Systems.
When Simulation Meets Antichains Yu-Fang Chen Academia Sinica, Taiwan Joint work with Parosh Aziz Abdulla, Lukas Holik, Richard Mayr, and Tomas Vojunar.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
Exchange Intensional XML Data Tova MiloSerge Abiteboul Tova Milo INRIA & Tel-Aviv U. ; Serge Abiteboul INRIA ; Bernd AmannOmar Benjelloun Bernd Amann Cedric-CNAM.
Classification and Regression Trees
Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Facilitating Document Annotation Using Content and Querying Value.
Tree Automata First: A reminder on Automata on words Typing semistructured data.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
CS416 Compiler Design1. 2 Course Information Instructor : Dr. Ilyas Cicekli –Office: EA504, –Phone: , – Course Web.
Decision Tree Learning DA514 - Lecture Slides 2 Modified and expanded from: E. Alpaydin-ML (chapter 9) T. Mitchell-ML.
Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.
Dynamic Graph Partitioning Algorithm
Spatial Online Sampling and Aggregation
Lec00-outline May 18, 2019 Compiler Design CS416 Compiler Design.
Presentation transcript:

Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart BDA 2011

Adding probabilities to an XML Schema The Web is full of semi-structured documents Given a collection of XML documents, we would like to have a schema the documents conform to. – E.g., DTD or XSD – Restricts the structure, mostly parent-child node relations (using regular expressions) The schema may be very general (e.g., xhtml, RSS) We want to add probabilities to 'guide' the schema – Optimal probabilities – maximize the likelihood of a corpus Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Implementation Idea: XML Editor Auto-Completion Based on previous document versions / corpus of user documents / corpus of example documents – Suggest nodes / sub-trees / node values to the user For example: Challenges: – Allow editing in every part of the document – What kind of completion to suggest? – Finding the top-k best completions Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Provenance Minimization Yael A. Daniel D. Tova M. Val T. Provenance for Aggregates Yael A. Daniel D./author> Val T.

Many Other Usages for a Probabilistic Schema Testing – e.g., generating many XML messages to simulate network load and test system performance. Explaining – e.g., the probabilistic schema for DBLP shows which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc. Querying – e.g., finding the probability that a paper has more than 3 authors. – e.g., finding the top-k best completions to a partial document. Schema Evaluation – how well a given schema describes a given corpus Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Our solution - An Outline - 5 -Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Preliminaries – Tree Automata Generators for Schemas without Constraints Restart Generators Continuation-Test Generators Leaf Values Adding Constraints

Using a Tree Automaton for Schema Verification Preliminaries Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer q0q0 q1q1 q2q2 b ac $ An XML document is modeled as an ordered tree. Document d 0 : The children of a-labeled node are accepted by DFA A a Automaton A r : (L( A r ) = a*bc*$) This is done for every inner node in a fixed order (BF-LTR) abcd 532 $ r abc

Using the Schema as a (Probabilistic) Generator Each transition is assigned a probability We assume independent choices, thus the document probability is the product of all t-probs. Thus, PR( d )=p a ∙p a ∙p b ∙p $ The schema and generator ignore leaf values (for now!) Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b a c $ $ papa pcpc pbpb p$p$ q0q0 q1q1 q2q2 r aab

An Algorithm for Probabilities Learning Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b ac $ $ The frequency each transition is chosen during the corpus verification process is recorded. (q 0, a) (q 0, b) (q 1, c) (q 1, $) q0q0 q1q1 q2q2 r abc

An Algorithm for Probabilities Learning (Cont.) This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer (q 0, a)1 (q 0, b)1 (q 1, c)1 (q 1, $)1 /2 These probabilities maximize the likelihood of generating the corpus – optimal generator (similar result in PCFGs)

Our Results Relative frequencies make optimal probabilities. Optimal probability learning and document generation can be done efficiently. Additional non-trivial result: Generation terminates with probability 1. – Guaranteed because of the choice of probabilities according to the corpus Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Integrity Constraints We want to allow the use of the following constraints in the schema: – Key Constraint: the leaves of a-labeled leaves have unique values (unary key) – Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves – Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Restart Generators A naïve idea: – Use a probabilistic generator to generate a document – Check if it has a value assignment valid w.r.t. the constraints – If not, 'restart' and try again until a valid document is generated Good news: Checking the existence of a valid assignment is in PTIME Bad news: number of restarts can be unboundedly large in an optimal generator – A different quality measure for restart generators? Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Continuation-test Generators Never make choices that lead to a 'dead end', thus always generate a valid document. We use a binary test to check if a choice has a continuation. Add to the schema of d 0 the constraints: – c is included in a – c is unique The generation process: Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b a c $ $ papa pcpc pbpb p$p$ q0q0 q1q1 q2q2 r abc Pr( d ) = p a ∙p b ∙p c ∙1 Perform a continuation-test before taking the transition Implies |c|≤|a|

Algorithm for Learning Probabilities with Constraints The probabilities are again relative frequencies, but – only in cases where there was an alternative choice. The learned generator will generate as many c-s as a-s Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer (q 0, a)1 (q 0, b)1 (q 1, c)1 (q 1, $)0 /2 /1 (q 1, $) was chosen only when (q 1, c) was not available

Our Results The algorithm learns an optimal continuation-test generator, for automata with binary choices. – Extensions to non-binary are discussed in the paper Bad News: Continuation-test is NP-Complete – But only in the size of the schema; it is polynomial in the document size (not so bad?) – Based on schema satisfiability test [David et al. 2011] More Bad News: probability of termination may be arbitrarily small! – Even for simple, non-recursive schemas – Can be handled by adding a constraint on the document size. – Sub-classes of schemas that guarantee termination? Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Suggested Algorithm We start with a valid document skeleton Order labels by inclusion constraints (e.g., c, b, a) Choose a leaf from the 'smallest' (most included) label, and including leaves Draw a value (from the domain) according to a given distribution. Use PTIME test to verify validity, if not revert the step Leaf Values Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer $ r abc abcd efg

Possible improvement to the basic algorithm Annotate the leaves with 'old' or 'new' For 'old' a-labeled leaves choose values already chosen for some a-labeled leaf For 'new' choose a value unused by a-labeled leaves yet Annotations can be learned from the corpus, and generated: – Offline – after the document generation, using a PTIME validity test – Online – during document generation, using a continuation test. – Both methods are incomparable in terms of quality Leaf Values Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer newoldnew $ r aab

Ideas for Experimental Study on Probabilistic generators How many schemas in practice may require continuation tests? How many have termination probability < 1? Continuation test is expensive, but how expensive is it in practice? 'Competition' between restart and continuation-test generators More? Implementation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Thank You! Questions, Ideas?...

Using a Tree Automaton for Schema Verification Preliminaries Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer q0q0 q1q1 q2q2 b ac $ r abc $ An XML is modeled as an ordered tree. Document d 0 : The children of a-labeled node are accepted by automaton A a Automaton A r : This is done for every inner node in a fixed order (BF-LTR)

Sentence Generation Example Input: a simple paragraph in an XML format Sam is a student. She goes to school on Weekdays. Marley thinks Sam is nice. Input: a (manually created) schema Output: randomly generated paragraphs a student is nice. a student thinks Sam thinks Sam thinks a student is nice. Sam thinks Sam is nice. Sam thinks She is nice. She is Sam. Sam is Marley. Marley thinks a student goes to school on weekdays. Sam goes to school on weekdays. Sam is nice. Marley thinks Sam is nice. Marley is Sam. Challenges: – Can constraints be useful here? – Creating an elaborate schema (classical NLP problem) Implementation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer