Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]


1 Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]

2 Adding Probabilities to an XML Schema (Motivation). XML schemas (e.g., DTD or XSD) are useful for describing the structure of XML documents. Schemas may be very general (e.g., XHTML, RSS). We want to add probabilities that reflect the likelihood of different parts of the schema. We will use the probabilities to turn the schema into a probabilistic generative model for XML documents. In particular, we want them to maximize the likelihood of a given XML document or document collection.

3 One Application: XML Auto-Completion [SIGMOD 2012] (Motivation). Based on previous document versions or a corpus of example documents, suggest nodes, sub-trees, or node values to the user. Example document content (markup lost in extraction): XML for Beginners; M. Jones; H. Q. David; L. Martin; S. Smith; Advanced XML; M. Jones; J. E. Peterson; G. L. Williams. Challenges: allowing edits to every part of the document; deciding what kind of completion to suggest; finding the top-k best completions.

4 Many Other Usages for a Probabilistic Schema (Motivation). Testing: e.g., generating many XML messages to simulate network load and test system performance. Explaining: e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc. Schema evaluation: how well a given schema describes a given corpus.

5 Our Solution: An Outline. Preliminaries: tree automata. Generators for schemas without constraints. Adding constraints: restart generators and continuation-test generators. Leaf values.

6 Schema as a Deterministic Tree Automaton (Preliminaries). An XML document is modeled as an ordered tree. Schema validation: the children of an a-labeled node are accepted by DFA A_a. Validation is performed for the children of every inner node. [Figure: document d_0, a root r with children a, b, c and leaf values; automaton A_r with states q0, q1, q2 and transitions labeled a, b, c, $, where L(A_r) = a*bc*$.]
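
The validation step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dictionary encoding, the function name `accepts`, and the choice of q2 as the only accepting state are assumptions; the transitions are the ones implied by L(A_r) = a*bc*$.

```python
# Hypothetical encoding of the slide's automaton A_r, with L(A_r) = a*bc*$.
# "$" is the end-of-children marker; q2 is assumed to be the accepting state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def accepts(child_labels, start="q0", accepting=("q2",)):
    """Return True if the child label sequence, followed by '$', is accepted."""
    state = start
    for symbol in list(child_labels) + ["$"]:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined for this symbol: reject
            return False
    return state in accepting


# The children of r in d_0 are a, b, c.
print(accepts(["a", "b", "c"]))
```

Full schema validation would repeat this check for the children of every inner node, each time with the automaton associated with that node's label.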

7 Using the Schema as a Generator (Preliminaries). Recall that we want to turn the schema from an acceptor into a probabilistic generative model. Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly. Adding probabilities: we consider two problem settings. 1. Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus. 2. Additionally, imposing integrity constraints on the documents (e.g., key constraints).

8 Probabilistic Generator (Without Constraints). Each transition is assigned a probability. We assume independent choices (a Markovian process), thus the document probability is the product of the probabilities of the transitions taken. In this case, for a root r with children a, a, b, Pr(d) = p_a ∙ p_a ∙ p_b ∙ p_$. The schema and generator ignore leaf values (for now!). [Figure: automaton with states q0, q1, q2 and transitions a, b, c, $ annotated with probabilities p_a, p_b, p_c, p_$.]
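
As a rough sketch of the product rule above, the following computes the probability of one node's child sequence. The nested-dictionary encoding, the function name `sequence_probability`, and the concrete probabilities (all 1/2) are assumptions made for illustration; the slide only names the symbols p_a, p_b, p_c, p_$.

```python
from fractions import Fraction

# Hypothetical generator: state -> {symbol: (next_state, probability)}.
# The value 1/2 for every transition is an illustrative assumption.
GEN = {
    "q0": {"a": ("q0", Fraction(1, 2)), "b": ("q1", Fraction(1, 2))},
    "q1": {"c": ("q1", Fraction(1, 2)), "$": ("q2", Fraction(1, 2))},
}


def sequence_probability(gen, child_labels, start="q0"):
    """Multiply the probabilities of the transitions used for one child sequence."""
    state = start
    prob = Fraction(1)
    for symbol in list(child_labels) + ["$"]:
        options = gen.get(state, {})
        if symbol not in options:  # the schema does not accept this sequence
            return Fraction(0)
        state, p = options[symbol]
        prob *= p
    return prob


# Children a, a, b give p_a * p_a * p_b * p_$ = (1/2)^4.
print(sequence_probability(GEN, ["a", "a", "b"]))  # 1/16
```

For a whole document, the same product would be taken over the child sequences of all inner nodes.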

9 Formal Problem Definition (Without Constraints). Given a corpus D of documents and a deterministic schema S that accepts every document in D, we want to find an optimal generator based on S: find probabilities for the transitions of S that maximize the probability of generating D, i.e., the maximum likelihood estimator (MLE).

10 A Learning Algorithm (Without Constraints). The frequency of using each transition during the corpus verification process is recorded. For a root r with children a, b, c, the recorded counts are: (q0, a): 1, (q0, b): 1, (q1, c): 1, (q1, $): 1.

11 A Learning Algorithm (Cont.) (Without Constraints). This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency among the transitions leaving the same state: (q0, a): 1/2, (q0, b): 1/2, (q1, c): 1/2, (q1, $): 1/2. Theorem: this efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
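
The counting procedure from the two slides above can be sketched as follows. This is a simplified illustration under stated assumptions: the corpus is given as a list of child-label sequences that all use the single automaton A_r, and the names `TRANSITIONS` and `learn_probabilities` are hypothetical. In the setting of the talk, the same counting would run for every node of every document with that node's own automaton.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical encoding of A_r (L = a*bc*$): (state, symbol) -> next state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def learn_probabilities(corpus, start="q0"):
    """Set each transition's probability to its relative frequency per source state."""
    counts = Counter()
    for child_labels in corpus:
        state = start
        for symbol in list(child_labels) + ["$"]:
            counts[(state, symbol)] += 1  # record the transition used
            state = TRANSITIONS[(state, symbol)]
    totals = Counter()
    for (state, _symbol), n in counts.items():
        totals[state] += n
    return {key: Fraction(n, totals[key[0]]) for key, n in counts.items()}


# One node with children a, b, c, as on the slide: every count is 1,
# and each state has two outgoing uses, so every probability is 1/2.
print(learn_probabilities([["a", "b", "c"]]))
```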

12 Termination (Without Constraints). Theorem: generation terminates with probability 1. This is guaranteed only because the probabilities are chosen according to the corpus.

13 Integrity Constraints (Adding Constraints). We want to support integrity constraints, which are used in XML schema languages. Key constraint: the a-labeled leaves have unique values (unary key). Inclusion constraint: the values of a-labeled leaves are contained in those of b-labeled leaves. Domain constraint: the values of a-labeled leaves belong to some (finite or infinite) domain.

14 New Problem (Adding Constraints). We want to find optimal generators for XML schemas with constraints. Valid generator output: an XML document which 1. is accepted by the schema, and 2. has a valid leaf value assignment, i.e., one that does not violate the constraints. Example: a, b, c are unique and contain each other. [Figure: example documents rooted at r with children labeled a, b, c.]

15 Restart Generators (Adding Constraints). A simple idea: use a probabilistic generator to generate a document; check if it has a value assignment that is valid w.r.t. the constraints; if not, 'restart' and try again until a valid document is generated. Proposition: given a document with no values, checking for the existence of a valid value assignment is in PTIME. Proof: by translating the constraints to bounds on the number of unique values for each leaf label. Bad news: the number of restarts can be unboundedly large in an optimal generator.
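
The restart loop described above might look like the sketch below. Everything concrete here is an assumption for illustration: the two-state generator `GEN` with probability 0.5 on each transition, the helper names, and the validity check `at_most_one_a`, which stands in for the PTIME value-assignment test (it encodes "a is unique and taken from a one-element domain").

```python
import random

# Hypothetical generator: state -> {symbol: (next_state, probability)}.
GEN = {"q1": {"a": ("q1", 0.5), "$": ("q2", 0.5)}}


def sample_children(gen, rng, start, final):
    """Sample one child sequence by following random transitions until `final`."""
    state, labels = start, []
    while state != final:
        symbols = list(gen[state])
        weights = [gen[state][s][1] for s in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        state = gen[state][symbol][0]
        if symbol != "$":  # "$" only marks the end of the children
            labels.append(symbol)
    return labels


def at_most_one_a(labels):
    """Stand-in validity test: a unique value from a one-element domain fits at most one a."""
    return labels.count("a") <= 1


def restart_generate(gen, is_valid, rng, start, final):
    """Generate until the output passes the validity test; also report the restart count."""
    restarts = 0
    while True:
        labels = sample_children(gen, rng, start, final)
        if is_valid(labels):
            return labels, restarts
        restarts += 1


labels, restarts = restart_generate(GEN, at_most_one_a, random.Random(0), "q1", "q2")
print(labels, restarts)
```

The returned restart count makes the "bad news" observable: pushing the probability of `a` toward 1 makes invalid outputs, and hence restarts, more frequent.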

16 Continuation-test Generators (Adding Constraints). Never make choices that lead to a 'dead end', thus always generate a valid document. We use a binary test to check if a choice has a continuation. Example: add to the schema of d_0 the constraints "c is included in a" and "c is unique"; together they imply |c| ≤ |a|. The generation process: perform a continuation test before taking each transition. For a root r with children a, b, c, Pr(d) = p_a ∙ p_b ∙ p_c ∙ 1; the final $ transition has probability 1 because a second c has no valid continuation. [Figure: automaton with states q0, q1, q2 and transitions a, b, c, $ annotated with p_a, p_b, p_c, p_$.]
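
A small sketch of how the continuation test changes the document probability, restricted to automata with binary choices. The oracle `has_continuation` is a hypothetical stand-in that is only correct for this slide's example (it encodes |c| ≤ |a|); the probabilities of 1/2 and all names are assumptions.

```python
from fractions import Fraction

# Hypothetical generator for A_r: state -> {symbol: (next_state, probability)}.
GEN = {
    "q0": {"a": ("q0", Fraction(1, 2)), "b": ("q1", Fraction(1, 2))},
    "q1": {"c": ("q1", Fraction(1, 2)), "$": ("q2", Fraction(1, 2))},
}


def has_continuation(prefix, symbol):
    """Example-specific oracle: another c is allowed only while |c| < |a|."""
    if symbol != "c":
        return True
    return prefix.count("c") + 1 <= prefix.count("a")


def ct_probability(gen, child_labels, oracle, start="q0"):
    """Probability of a child sequence when dead-end choices are never taken."""
    state, prefix, prob = start, [], Fraction(1)
    for symbol in list(child_labels) + ["$"]:
        available = [s for s in gen.get(state, {}) if oracle(prefix, s)]
        if symbol not in available:
            return Fraction(0)  # this choice would be a dead end or is not in the schema
        if len(available) > 1:  # a forced choice contributes a factor of 1
            prob *= gen[state][symbol][1]
        state = gen[state][symbol][0]
        prefix.append(symbol)
    return prob


# Children a, b, c: p_a * p_b * p_c * 1, since the final "$" is forced.
print(ct_probability(GEN, ["a", "b", "c"], has_continuation))  # 1/8
```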

17 Learning Algorithm for Continuation-test Generators (Adding Constraints). The probabilities are again relative frequencies, but counted only in cases where there was an alternative choice. Counts: (q0, a): 1, (q0, b): 1, (q1, c): 1, (q1, $): 0, normalized by 2 for q0 and by 1 for q1. (q1, $) was chosen only when (q1, c) was not available. The learned generator will generate as many c-s as a-s.
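
The modified counting rule can be sketched as below, again for binary choices only. As before, `has_continuation` is a hypothetical oracle specific to the example constraint |c| ≤ |a|, and the corpus is simplified to a list of child-label sequences for the single automaton A_r.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical encoding of A_r (L = a*bc*$): (state, symbol) -> next state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "$"): "q2",
}


def has_continuation(prefix, symbol):
    """Example-specific oracle: another c is allowed only while |c| < |a|."""
    if symbol != "c":
        return True
    return prefix.count("c") + 1 <= prefix.count("a")


def learn_ct_probabilities(corpus, oracle, start="q0"):
    """Relative frequencies, counting a transition only when an alternative existed."""
    counts = Counter()
    for child_labels in corpus:
        state, prefix = start, []
        for symbol in list(child_labels) + ["$"]:
            options = [s for (q, s) in TRANSITIONS if q == state and oracle(prefix, s)]
            if len(options) > 1:  # forced choices carry no information
                counts[(state, symbol)] += 1
            state = TRANSITIONS[(state, symbol)]
            prefix.append(symbol)
    totals = Counter()
    for (state, _symbol), n in counts.items():
        totals[state] += n
    return {
        key: Fraction(counts[key], totals[key[0]])
        for key in TRANSITIONS
        if totals[key[0]] > 0
    }


# Children a, b, c: (q1, $) is never counted, so p_c = 1/1 and p_$ = 0/1.
print(learn_ct_probabilities([["a", "b", "c"]], has_continuation))
```

With p_c = 1, the learned generator keeps emitting c while the oracle allows it, which is why it produces as many c-s as a-s.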

18 Results for Continuation-test Generators (Adding Constraints). Theorem: the algorithm learns an optimal continuation-test generator, for automata with binary choices. Extensions to non-binary choices are discussed in the paper. Theorem: the continuation test is NP-complete, but only in the size of the schema; it is polynomial in the document size. Both generation and finding the optimal generator are polynomial when using a continuation-test oracle. The test is based on the schema satisfiability test [David et al. 2011]. Theorem: the probability of termination for a continuation-test generator may be arbitrarily small! Proof: by construction of a simple, non-recursive schema. This can be handled by adding a constraint on the document size. Open question: are there sub-classes of schemas that guarantee termination?

19 Adding Values to the Structure (Leaf Values). So far our generators were used only for the document structure. Leaf values may also have a distribution according to which they can be generated; the distribution may be learned from the same document collection. We will focus on the interesting case: generating leaf values for a schema with constraints.

20 Suggested Algorithm (Leaf Values). We start with a valid document skeleton. Order the labels by inclusion constraints (e.g., c, b, a). Choose a leaf with the 'smallest' (most included) label, together with leaves of the labels that include it. Draw a value (from the domain) according to a given distribution. Use the PTIME test to verify validity; if the check fails, revert the step. Improvements are presented in the paper. [Figure: document rooted at r with children a, b, c and candidate leaf values.]

21 Related Work (Summary). Schema satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]; probabilistic XML and probabilistic schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]; probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]; schema inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]; AXML [Abiteboul, Benjelloun & Milo 2008]; PCFGs [e.g., Chi & Geman 1998].

22 Conclusion (Summary). A model for probabilistic XML generators. Unconstrained case: generation and learning of optimal generators can be done efficiently, and termination is guaranteed. Constrained case: for restart generators, the number of restarts is unbounded; for continuation-test generators, generation and learning of optimal generators are expensive, and termination is not guaranteed. Leaf value generation. In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled. Future work: more efficient combinations of restart and continuation-test generators.

23 Thank You! Q&A

24 Using a Tree Automaton for Schema Verification (Preliminaries). An XML document is modeled as an ordered tree. The children of an a-labeled node are accepted by automaton A_a. This is done for every inner node in a fixed order (BF-LTR). [Figure: document d_0, a root r with children a, b, c; automaton A_r with states q0, q1, q2 and transitions labeled a, b, c, $.]

25 Sentence Generation Example (Implementation). Input: a simple paragraph in an XML format: "Sam is a student. She goes to school on Weekdays. Marley thinks Sam is nice." Input: a (manually created) schema. Output: randomly generated paragraphs: "a student is nice. a student thinks Sam thinks Sam thinks a student is nice. Sam thinks Sam is nice. Sam thinks She is nice. She is Sam. Sam is Marley. Marley thinks a student goes to school on weekdays. Sam goes to school on weekdays. Sam is nice. Marley thinks Sam is nice. Marley is Sam." Challenges: can constraints be useful here? Creating an elaborate schema (a classical NLP problem).

26 An Algorithm for Learning Probabilities (Without Constraints). The frequency with which each transition is chosen during the corpus verification process is recorded. For a root r with children a, b, c, the recorded counts are: (q0, a): 1, (q0, b): 1, (q1, c): 1, (q1, $): 1.

27 Example for Unboundedly Many Restarts (Adding Constraints). Consider the following schema and corpus. The schema allows 0 or 1 a-labeled leaves. We want to choose α that maximizes the likelihood of d. The probability of d is the probability of generating it on the 1st attempt, plus the probability of restarting once and generating d on the 2nd attempt, and so on: a geometric series. It increases monotonically as α approaches 1, but so does the probability of restarting. [Figure: schema S with states q1, q2, transition a with probability α and transition $ with probability 1-α; corpus document d with root r and an a-labeled child. Constraint: a is unique and taken from {0}.]
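
The geometric series can be worked out explicitly under assumptions that the flattened figure does not fully confirm: that the a-transition loops on q1 with probability α, that the $-transition leads to q2 with probability 1-α, and that d has exactly one a-labeled child. A single attempt then produces k a-labeled leaves with probability α^k(1-α), and only k = 0 or k = 1 is valid.

```latex
\begin{align*}
\Pr[\text{one attempt yields } a^k] &= \alpha^k (1-\alpha), \\
\Pr[\text{restart}] &= 1 - (1-\alpha) - \alpha(1-\alpha) = \alpha^2, \\
\Pr(d) &= \sum_{n \ge 0} (\alpha^2)^n \, \alpha(1-\alpha)
        = \frac{\alpha(1-\alpha)}{1-\alpha^2}
        = \frac{\alpha}{1+\alpha}, \\
\mathbb{E}[\#\text{restarts}] &= \frac{\alpha^2}{1-\alpha^2}.
\end{align*}
```

Under these assumptions, Pr(d) increases with α and approaches 1/2 as α tends to 1, while the expected number of restarts grows without bound, which is consistent with the slide's observation.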

28 Possible Improvement to the Basic Algorithm (Leaf Values). Annotate the leaves with 'old' or 'new'. For 'old' a-labeled leaves, choose values already chosen for some a-labeled leaf. For 'new' leaves, choose a value not yet used by a-labeled leaves. Annotations can be learned from the corpus, and generated either offline (after the document generation, using a PTIME validity test) or online (during document generation, using a continuation test). The two methods are incomparable in terms of quality. [Figure: root r with children a, a, b annotated new, old, new.]

