Download presentation
Presentation is loading. Please wait.
Published byLucinda Amber Griffith Modified over 9 years ago
1
Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart BDA 2011
2
Adding probabilities to an XML Schema The Web is full of semi-structured documents Given a collection of XML documents, we would like to have a schema the documents conform to. – E.g., DTD or XSD – Restricts the structure, mostly parent-child node relations (using regular expressions) The schema may be very general (e.g., xhtml, RSS) We want to add probabilities to 'guide' the schema – Optimal probabilities – maximize the likelihood of a corpus - 2 - Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
3
Implementation Idea: XML Editor Auto-Completion Based on previous document versions / corpus of user documents / corpus of example documents – Suggest nodes / sub-trees / node values to the user For example: Challenges: – Allow editing in every part of the document – What kind of completion to suggest? – Finding the top-k best completions - 3 - Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Provenance Minimization Yael A. Daniel D. Tova M. Val T. Provenance for Aggregates Yael A. Daniel D./author> Val T.
4
Many Other Usages for a Probabilistic Schema Testing – e.g., generating many XML messages to simulate network load and test system performance. Explaining – e.g., the probabilistic schema for DBLP shows which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc. Querying – e.g., finding the probability that a paper has more than 3 authors. – e.g., finding the top-k best completions to a partial document. Schema Evaluation – how well a given schema describes a given corpus. - 4 - Motivation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
5
Our solution - An Outline - 5 -Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Preliminaries – Tree Automata Generators for Schemas without Constraints Restart Generators Continuation-Test Generators Leaf Values Adding Constraints
6
Using a Tree Automaton for Schema Verification - 6 - Preliminaries Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer q0q0 q1q1 q2q2 b ac $ An XML document is modeled as an ordered tree. Document d 0 : The children of a-labeled node are accepted by DFA A a Automaton A r : (L( A r ) = a*bc*$) This is done for every inner node in a fixed order (BF-LTR) abcd 532 $ r abc
7
Using the Schema as a (Probabilistic) Generator Each transition is assigned a probability We assume independent choices, thus the document probability is the product of all t-probs. Thus, PR( d )=p a ∙p a ∙p b ∙p $ The schema and generator ignore leaf values (for now!) - 7 - Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b a c $ $ papa pcpc pbpb p$p$ q0q0 q1q1 q2q2 r aab
8
An Algorithm for Probabilities Learning - 8 - Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b ac $ $ The frequency each transition is chosen during the corpus verification process is recorded. (q 0, a) (q 0, b) (q 1, c) (q 1, $) 1 1 1 1 q0q0 q1q1 q2q2 r abc
9
An Algorithm for Probabilities Learning (Cont.) This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency. - 9 - Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer (q 0, a)1 (q 0, b)1 (q 1, c)1 (q 1, $)1 /2 These probabilities maximize the likelihood of generating the corpus – optimal generator (similar result in PCFGs)
10
Our Results Relative frequencies make optimal probabilities. Optimal probability learning and document generation can be done efficiently. Additional non-trivial result: Generation terminates with probability 1. – Guaranteed because of the choice of probabilities according to the corpus. - 10 - Without Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
11
Integrity Constraints We want to allow the use of the following constraints in the schema: – Key Constraint: the leaves of a-labeled leaves have unique values (unary key) – Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves – Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain - 11 - Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
12
Restart Generators A naïve idea: – Use a probabilistic generator to generate a document – Check if it has a value assignment valid w.r.t. the constraints – If not, 'restart' and try again until a valid document is generated Good news: Checking the existence of a valid assignment is in PTIME Bad news: number of restarts can be unboundedly large in an optimal generator – A different quality measure for restart generators? - 12 - Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
13
Continuation-test Generators Never make choices that lead to a 'dead end', thus always generate a valid document. We use a binary test to check if a choice has a continuation. Add to the schema of d 0 the constraints: – c is included in a – c is unique The generation process: - 13 - Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer b a c $ $ papa pcpc pbpb p$p$ q0q0 q1q1 q2q2 r abc Pr( d ) = p a ∙p b ∙p c ∙1 Perform a continuation-test before taking the transition Implies |c|≤|a|
14
Algorithm for Learning Probabilities with Constraints The probabilities are again relative frequencies, but – only in cases where there was an alternative choice. The learned generator will generate as many c-s as a-s Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer (q 0, a)1 (q 0, b)1 (q 1, c)1 (q 1, $)0 /2 /1 (q 1, $) was chosen only when (q 1, c) was not available. - 14 -
15
Our Results The algorithm learns an optimal continuation-test generator, for automata with binary choices. – Extensions to non-binary are discussed in the paper Bad News: Continuation-test is NP-Complete – But only in the size of the schema; it is polynomial in the document size (not so bad?) – Based on schema satisfiability test [David et al. 2011] More Bad News: probability of termination may be arbitrarily small! – Even for simple, non-recursive schemas – Can be handled by adding a constraint on the document size. – Sub-classes of schemas that guarantee termination? - 15 - Adding Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
16
Suggested Algorithm We start with a valid document skeleton Order labels by inclusion constraints (e.g., c, b, a) Choose a leaf from the 'smallest' (most included) label, and including leaves Draw a value (from the domain) according to a given distribution. Use PTIME test to verify validity, if not revert the step - 16 - Leaf Values Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer $ r abc abcd efg
17
Possible improvement to the basic algorithm Annotate the leaves with 'old' or 'new' For 'old' a-labeled leaves choose values already chosen for some a-labeled leaf For 'new' choose a value unused by a-labeled leaves yet Annotations can be learned from the corpus, and generated: – Offline – after the document generation, using a PTIME validity test – Online – during document generation, using a continuation test. – Both methods are incomparable in terms of quality - 17 - Leaf Values Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer newoldnew $ r aab
18
Ideas for Experimental Study on Probabilistic generators How many schemas in practice may require continuation tests? How many have termination probability < 1? Continuation test is expensive, but how expensive is it in practice? 'Competition' between restart and continuation-test generators More? - 18 - Implementation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
19
Thank You! Questions, Ideas?...
20
Using a Tree Automaton for Schema Verification Preliminaries Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer q0q0 q1q1 q2q2 b ac $ r abc $ An XML is modeled as an ordered tree. Document d 0 : The children of a-labeled node are accepted by automaton A a Automaton A r : This is done for every inner node in a fixed order (BF-LTR)
21
Sentence Generation Example Input: a simple paragraph in an XML format Sam is a student. She goes to school on Weekdays. Marley thinks Sam is nice. Input: a (manually created) schema Output: randomly generated paragraphs a student is nice. a student thinks Sam thinks Sam thinks a student is nice. Sam thinks Sam is nice. Sam thinks She is nice. She is Sam. Sam is Marley. Marley thinks a student goes to school on weekdays. Sam goes to school on weekdays. Sam is nice. Marley thinks Sam is nice. Marley is Sam. Challenges: – Can constraints be useful here? – Creating an elaborate schema (classical NLP problem) - 18 - Implementation Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.