Published by Emerald Small. Modified over 9 years ago.
1
XML Typing and Query Evaluation
2
Plan
We will put a formal model underneath XML trees and the queries on them
– Keeping the practical aspects in mind but abstracting details away
– This lets us solve problems that matter in practice in a foundational way
Important issues
– Types: defining a language of trees
  Useful for verifying the validity of a tree
  And for optimizing queries
– Query evaluation on a document
– Query evaluation on a type (type inference / checking)
3
XML typing Not compulsory Simplify writing software for XML – Improve interoperability between programs Improve storage and performance Simplify data protection – Reject illegal update
4
Improve performance
[Figure: type tree of a bibliography; node labels include Bib, paper, book, year, journal, title, author, address, zip, city, street, last name, first name, with int and string leaf types]
Queries shown: //address//zip[@zip="12345"] and //address/zip[@zip="12345"]
Typing semistructured data
5
Improve storage
[Figure: schema graph with nodes Root, Company, Employee, string and edges company, person, works-for, c.e.o., address, name, managed-by]
“Well-behaved” parts (Company, Employee) can go to relational DBs
Store the rest in XML (file)
6
Type checking
Who checks
– XML editor: check that the data conforms to its type
– XML exchange, e.g., with a Web service
  Server: when delivering the data
  Client/application: when receiving it
Dynamic verification: after the data is produced
Static verification: verification of the program that generates the data
7
Static verification
Input: input type T and code of function f
– f is written in XQuery, XPath, etc.
Verification of T’
– Is it true that for all d ╞ T, f(d) ╞ T’ ?
Type inference
– Find the smallest T’ such that for all d ╞ T, f(d) ╞ T’
A type is a language of trees
8
Example
F = for $p in doc("parts.xml")//part[color="red"] return $p/name/text() $p/desc/node()
Result type: (part (name (string) desc (any)))*
If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*
9
Difficulty
Semantics: for $X in the input, $Y in the input do { output ( <b/> ) }
Can be written in XQuery
Input: n elements. Result: n^2 elements b
Problem: { b^i | i = n^2 for n ≥ 0 } cannot be described in XML Schema
There is no « best » result
– b*
– ε + b^2 b*
– ε + b^2 + b^4 b*
– ε + b^2 + b^4 + b^9 b*
– …
10
Why tree automata? XML = unranked trees No theory for XML Rich theory for strings: Automata Extend to rich theory for ranked trees: Tree automata – Nice algorithms – Nice theorems – Can this carry to unranked trees and XML? Yes!
11
From strings to trees
[Figure: a word, a binary tree, and an unranked tree over the labels a and b]
Word: finite-state automata
Binary tree: ranked tree automata
Unranked tree (no bound on the number of children): unranked tree automata
12
Why not then use unranked tree automata? Missing practical gadgets Complexity of verification – Goal: typing at reasonable cost
13
Automata
Automata on words
14
Finite state automata on words
Alphabet Σ
States Q
Initial state q0
Accepting states F
Transitions δ
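The five components listed on this slide can be sketched in a few lines of Python. The automaton below is a hypothetical example, not the one drawn on the next slide: it accepts the words over {a, b} that end in "ab". The run tracks the set of states reachable so far, which is also the idea behind determinization.

```python
from typing import Dict, Set, Tuple

# Hypothetical NFA (an assumption for illustration): words over {a, b} ending in "ab".
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("q0", "a"): {"q0", "q1"},  # nondeterministic choice: stay, or guess the final "ab" starts here
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
INITIAL = "q0"
ACCEPTING = {"q2"}


def accepts(word: str) -> bool:
    """Simulate all runs at once by tracking the set of reachable states."""
    current = {INITIAL}
    for symbol in word:
        current = {t for s in current for t in DELTA.get((s, symbol), set())}
    return bool(current & ACCEPTING)
```

Each set of states computed inside `accepts` corresponds to one state of the equivalent deterministic automaton.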
15
Nondeterministic automaton: Example
[Figure: two runs of an automaton with states q0, q1, q2 on words over {a, b}; one run fails (KO), the other accepts (OK)]
16
Reminder
Deterministic
– No ε-transition
– No alternative transitions for the same state and letter
Determinization
– It is possible to obtain an equivalent deterministic automaton
– State of the new automaton = set of states of the original one
– Possible exponential blow-up
Minimization
Limitations
– Cannot recognize context-free languages in general
Essential tool
– e.g., lexical analysis
17
Reminder (2)
L(A) = set of words accepted by automaton A
Regular languages
Can be described by regular expressions, e.g. a(b+c)*d
Closed under complement
Closed under union, intersection
– Product automaton with states (s,s’) where s is from A and s’ is from A’
18
Automata on words versus trees
[Figure: a word and a tree over the labels a and b]
Words: left to right or right to left, no difference
Trees: bottom-up or top-down, differences
19
Automata
Automata on ranked trees
20
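Binary tree automata
Parallel, bottom-up evaluation
For leaves: the label determines the state
For other nodes: the label and the states of the two children determine the state
[Figure: a binary tree over the labels a and b evaluated bottom-up, with states q, q’, q”, q1, q2 attached to its nodes]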
21
Bottom-up tree automata
Bottom-up: if a node labeled a has its children in states q, q’, then the node moves nondeterministically to state r or r’
Accepts if the root is in some state in F
Not deterministic if there are alternatives or ε-transitions
22
Example: deterministic bottom-up
23
Boolean circuit evaluation
[Figure: a Boolean circuit drawn as a tree with operator nodes and 0/1 leaves, evaluated bottom-up; the run is accepting (OK)]
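Boolean circuit evaluation is a compact instance of a deterministic bottom-up tree automaton. The sketch below assumes binary "and"/"or" nodes and "0"/"1" leaves encoded as nested tuples; the state names q0 and q1 and the tuple encoding are illustrative choices, not taken from the slides.

```python
from typing import Dict, Tuple, Union

# A tree is either a leaf label ("0" or "1") or a triple (label, left, right).
Tree = Union[str, Tuple[str, "Tree", "Tree"]]

# Leaf rules: the label alone determines the state.
LEAF_STATE: Dict[str, str] = {"0": "q0", "1": "q1"}

# Inner rules: (label, left state, right state) -> state; exactly one target per key,
# so the automaton is deterministic.
TRANSITIONS: Dict[Tuple[str, str, str], str] = {
    (op, left, right): ("q1" if holds else "q0")
    for op in ("and", "or")
    for left in ("q0", "q1")
    for right in ("q0", "q1")
    for holds in [
        (left == "q1" and right == "q1") if op == "and" else (left == "q1" or right == "q1")
    ]
}
ACCEPTING = {"q1"}


def run(tree: Tree) -> str:
    """Compute the state of the root, children first."""
    if isinstance(tree, str):
        return LEAF_STATE[tree]
    label, left, right = tree
    return TRANSITIONS[(label, run(left), run(right))]


def accepts(tree: Tree) -> bool:
    return run(tree) in ACCEPTING
```

For example, `accepts(("or", ("and", "1", "0"), "1"))` runs the automaton on a circuit whose root is a disjunction; the tree is accepted exactly when the circuit evaluates to 1.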
24
Regular tree language = set of trees accepted by a bottom-up tree automaton
25
Regular tree languages Theorem: the following are equivalent – L is a regular tree language – L is accepted by a nondeterministic bottom-up automaton – L is accepted by a deterministic bottom-up automaton – L is accepted by a nondeterministic top-down automaton Deterministic top-down is weaker
26
Top-down tree automata
Top-down: if a node labeled a is in state q”, then its left child moves to state q and its right child to q’
Accepts if all leaves are in states in F
Not deterministic if there are alternative transitions for the same label and state
27
Why is deterministic top-down weaker?
Consider a language L consisting of two trees
It can be accepted by a bottom-up TA
– Exercise: write a BUTA A such that L = L(A)
Suppose that B is a deterministic top-down TA that accepts both trees in L
– Exercise: show that B also accepts a tree that is not in L
– A contradiction
Fact: no deterministic top-down tree automaton accepts exactly L
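The argument can be replayed on a concrete language. The sketch below assumes the textbook example L = { f(a,b), f(b,a) }; this choice is an assumption, since the set on the slide did not survive extraction. A deterministic top-down automaton has a single way to send states to the children of f, called qL and qR here (hypothetical names), so the two trees in L force four accepting leaf pairs, and those pairs also accept f(a,a).

```python
from typing import List, Set, Tuple

Tree = Tuple[str, str, str]  # (root label, left leaf label, right leaf label)

L: List[Tree] = [("f", "a", "b"), ("f", "b", "a")]  # assumed example language


def bottom_up_accepts(tree: Tree) -> bool:
    """A bottom-up automaton sees both child states together, so it can accept exactly L."""
    _root, left, right = tree
    return (left, right) in {("a", "b"), ("b", "a")}


def forced_leaf_pairs(trees: List[Tree]) -> Set[Tuple[str, str]]:
    """Leaf (state, label) pairs that must be accepting for every given tree to be accepted."""
    pairs: Set[Tuple[str, str]] = set()
    for _root, left, right in trees:
        pairs.add(("qL", left))   # the left child always receives state qL
        pairs.add(("qR", right))  # the right child always receives state qR
    return pairs


def top_down_accepts(tree: Tree, accepting_pairs: Set[Tuple[str, str]]) -> bool:
    """Each leaf is checked on its own: the automaton cannot correlate the two leaves."""
    _root, left, right = tree
    return ("qL", left) in accepting_pairs and ("qR", right) in accepting_pairs
```

With `pairs = forced_leaf_pairs(L)`, `top_down_accepts(("f", "a", "a"), pairs)` holds although f(a,a) is not in L, while `bottom_up_accepts(("f", "a", "a"))` does not.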
28
Ranked tree automata: Properties
Like for words
Determinization
Minimization
Closed under
– Complement
– Intersection
– Union
29
But… XML documents are unranked: book (intro,section*,conclusion)
30
Automata
Automata on unranked trees
31
Unranked tree automata Issue: represent an infinite set of transitions Solution: a regular language
32
Unranked tree automata (2)
Rule: a label a, a regular language Q over the states, and target states r_1, …, r_m
Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r_1, …, r_m}
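A minimal sketch of such rules, under the assumption that states are single characters so that Python's `re` module can stand in for the regular language over states. The rule set reuses the book (intro, section*, conclusion) shape from the earlier slide; the state names i, s, c, k are hypothetical.

```python
import re
from itertools import product
from typing import List, Set, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)

# Each rule: (label, regular expression over child states, target state).
RULES: List[Tuple[str, str, str]] = [
    ("intro", "", "i"),        # a leaf: the word of child states is empty
    ("section", "", "s"),
    ("conclusion", "", "c"),
    ("book", "is*c", "k"),     # children states must spell i s* c
]
ACCEPTING = {"k"}


def states(tree: Tree) -> Set[str]:
    """All states the node can reach, computed bottom-up."""
    label, children = tree
    child_sets = [states(child) for child in children]
    reachable: Set[str] = set()
    # Enumerating every combination of child states is exponential in general;
    # it keeps this sketch short and is fine for tiny examples.
    for word in product(*child_sets):
        for rule_label, pattern, target in RULES:
            if rule_label == label and re.fullmatch(pattern, "".join(word)):
                reachable.add(target)
    return reachable


def accepts(tree: Tree) -> bool:
    return bool(states(tree) & ACCEPTING)
```

A production implementation would run a word automaton over the child states instead of enumerating combinations; the regular expression is what keeps the infinite set of transitions finitely represented.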
33
Building on ranked trees
[Figure: an unranked tree over the labels a and b next to its ranked encoding]
Ranked tree: FirstChild-NextSibling
F: encoding into a ranked tree
F is a bijection
F^-1: decoding
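The FirstChild-NextSibling encoding F and its inverse can be sketched as two short recursive functions. The tuple representations below are assumptions made for the example: an unranked tree is (label, list of children), and a binary node is (label, first child, next sibling) with None for a missing branch.

```python
from typing import List, Optional, Tuple

Unranked = Tuple[str, list]                                   # (label, children)
Binary = Optional[Tuple[str, "Binary", "Binary"]]            # (label, first child, next sibling)


def encode(forest: List[Unranked]) -> Binary:
    """F: encode a sequence of sibling trees as one binary tree."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    # Left branch: the first child (and its siblings); right branch: the next sibling.
    return (label, encode(children), encode(rest))


def decode(node: Binary) -> List[Unranked]:
    """F^-1: rebuild the sequence of sibling trees."""
    if node is None:
        return []
    label, first_child, next_sibling = node
    return [(label, decode(first_child))] + decode(next_sibling)
```

A single document is encoded as `encode([tree])`, and `decode(encode([tree]))` returns `[tree]`, which is the bijection property stated on the slide.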
34
Building on bottom-up ranked trees (2)
For each unranked TA A, there is a ranked TA accepting F(L(A))
For each ranked TA A, there is an unranked TA accepting F^-1(L(A))
Both are easy to construct
Consequence: unranked TA are closed under union, intersection, complement
Determinization is also possible, a bit more tricky
35
Top-down?
This is more delicate
Transition δ(a,q) = A(a,q)
– The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node
– Accepts if all the leaves are in accepting states
36
Boolean circuit evaluation
[Figure: top-down runs on a Boolean circuit tree with operator nodes and 0/1 leaves]
A tree is accepted if, for some possible run, the states of all leaves are final
37
Tree Automata and monadic second-order logic
38
Monadic second-order logic
Representation of a tree as a logical structure
[Figure: a tree over the labels a and b with nodes numbered 1 to 9]
E(1,2), E(1,3)… E(3,9)
S(2,3), S(3,4), S(4,5)…S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
39
Monadic second-order logic E(1,2), E(1,3)… E(3,9) S(2,3), S(3,4), S(4,5)…S(8,9) a(1), a(4), a(8) b(2), b(3), b(5), b(6), b(7), b(9) MSO syntax Set variable Quantification over a set variable
40
Example of MSO
Each a node has a b-descendant
This corresponds to the formula: for each node x labeled a, each set X that contains x and is closed under descendant contains some y labeled b
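The formula itself did not survive on this slide, so the following is a reconstruction from the verbal description, under the assumption that E is the child relation of the logical structure shown earlier. Closure under E is enough, because a set closed under children is closed under descendants.

```latex
\[
\forall x\,\Bigl( a(x) \rightarrow
  \forall X\,\bigl( (x \in X \wedge \mathrm{closed}(X)) \rightarrow
    \exists y\,( y \in X \wedge b(y) ) \bigr) \Bigr)
\]
\[
\text{where}\quad
\mathrm{closed}(X) \;\equiv\;
\forall u\,\forall v\,\bigl( (u \in X \wedge E(u,v)) \rightarrow v \in X \bigr)
\]
```

The smallest set satisfying the premise is x together with its descendants, so the formula holds exactly when that set contains a b node. Since x is labeled a and each node carries one label, that b node is a proper descendant of x.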
41
Bridge
Theorem: for a set L of trees, the following are equivalent
1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO
42
XML typing
DTDs
43
DTD
Describes the children of a node labeled a by a regular expression
Bizarre syntax
44
DTD and determinism
Regular expressions in DTD should be deterministic
– Complicated definition
Intuition: the corresponding automaton should be deterministic
– (a+b)*a is not
– When reading an a, one cannot tell whether it is an a from (a+b) or the a at the end
– (b*a)(b*a)* is an equivalent expression that is deterministic
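The claimed equivalence of the two expressions can be sanity-checked by brute force. The sketch below compares them on every word over {a, b} up to a fixed length, using Python's `re` syntax, where union is written | instead of +. The helper names and the length bound are choices made for this example, and a bounded check is evidence, not a proof.

```python
import re
from itertools import product
from typing import Iterator


def words(alphabet: str, max_len: int) -> Iterator[str]:
    """All words over the alphabet of length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def agree(pattern1: str, pattern2: str, alphabet: str = "ab", max_len: int = 8) -> bool:
    """True if both patterns accept exactly the same words up to max_len."""
    return all(
        bool(re.fullmatch(pattern1, w)) == bool(re.fullmatch(pattern2, w))
        for w in words(alphabet, max_len)
    )
```

Calling `agree("(a|b)*a", "(b*a)(b*a)*")` compares the nondeterministic expression with its deterministic rewriting; both describe the words that end in a.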
45
Very efficient validation
It suffices to verify for each node labeled a that the word formed by the labels of its children is accepted by the finite state automaton A_a
Possible to type check the document while scanning it
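A minimal sketch of this per-node check, assuming a hypothetical DTD for the book example and trees given as (label, list of children). Each content model is written as a Python regular expression over child labels, with a comma after every label so that label boundaries stay unambiguous. Python's `re` engine stands in for the word automaton A_a here; it is a backtracking matcher, not the deterministic automaton a real validator would use.

```python
import re
from typing import Dict, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)

# Hypothetical DTD: label -> content model over "label," tokens.
DTD: Dict[str, str] = {
    "book": r"intro,(section,)*conclusion,",
    "intro": r"",
    "section": r"(para,)*",
    "para": r"",
    "conclusion": r"",
}


def valid(tree: Tree) -> bool:
    """Check every node's children word against the content model of its label."""
    label, children = tree
    if label not in DTD:
        return False
    word = "".join(child[0] + "," for child in children)
    if re.fullmatch(DTD[label], word) is None:
        return False
    return all(valid(child) for child in children)
```

Each node is checked independently of its parent, which is why the whole document can be validated in one scan.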
46
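Very efficient validation (2)
[Figure: a document tree over the labels a, b, c, d scanned with the word automata A_a and A_b, whose states are s, t, u, s’, t’; the run ends in Accept]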
47
Warning
The previous example can be checked with a simple automaton on words
But not the following one
A stack is needed for accepting n opening tags followed by n closing tags
48
Some bad news for DTD
Not closed under union
– DTD1… DTD2…
– L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton
– Problem with the type of ad, which depends on its parent
Also not closed under complement
Limited expressive power
49
Car example continued
The best DTD we can choose does not distinguish between ads for used and new cars
[Figure: a Car tree with Used and New children; the used-car ad has Brand and Year (“Renault”, “2008”), the new-car ad has Brand (“BMW”)]
50
Decoupled types in XML schema
Each type corresponds to a label, not conversely
car: [car] (used + new)*
used: [used] (ad1*)
new: [new] (ad2*)
ad1: [ad] (year, brand)
ad2: [ad] (brand)
The tags are in brackets (green on the slide); type names come before the colon (blue on the slide)
Nice closure properties
Many other « gadgets » in XML schemas
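The decoupling can be sketched as a top-down check in which the type of a child is picked from the type of its parent. The table below copies the five rules of this slide; the leaf types year and brand, the "name," token convention, and the helper names are assumptions added so that the example runs. Trees are (tag, list of children).

```python
import re
from typing import Dict, Tuple

Tree = Tuple[str, list]  # (tag, list of child trees)

# type name -> (tag, content model over "typename," tokens)
TYPES: Dict[str, Tuple[str, str]] = {
    "car": ("car", r"(used,|new,)*"),
    "used": ("used", r"(ad1,)*"),
    "new": ("new", r"(ad2,)*"),
    "ad1": ("ad", r"year,brand,"),
    "ad2": ("ad", r"brand,"),
    "year": ("year", r""),    # assumed leaf type
    "brand": ("brand", r""),  # assumed leaf type
}


def conforms(tree: Tree, type_name: str) -> bool:
    """Check a tree against a type; child types are resolved from the parent's type."""
    tag, children = tree
    expected_tag, model = TYPES[type_name]
    if tag != expected_tag:
        return False
    # Within one content model each tag maps to a single type, so the child tag
    # is enough to pick the child type.
    type_of_tag = {TYPES[t][0]: t for t in re.findall(r"\w+", model)}
    child_types = [type_of_tag.get(child[0]) for child in children]
    if None in child_types:
        return False
    word = "".join(t + "," for t in child_types)
    if re.fullmatch(model, word) is None:
        return False
    return all(conforms(child, t) for child, t in zip(children, child_types))
```

The same tag ad is checked against ad1 under used and against ad2 under new, which is the distinction a DTD cannot express.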
51
Xpath Query Evaluation
52
Goal Evaluating an Xpath query against a given document – To find all matches We will also consider Type Checking – Given an Xpath query Q, input DTD T and output DTD T’, Does it hold that for every document D satisfying T, Q(D) satisfies T’ Complexity is important – Huge Documents
53
Data complexity vs. Combined Complexity Two inputs to the query evaluation problem – Data (XML document) of size |D| – Query (Xpath expression) of size |Q| – Usually |Q| << |D| Polynomial data complexity – Complexity that is polynomial in |D|, possibly exponential in |Q| Polynomial combined complexity – Complexity that is polynomial in |D| and |Q| Fixed Parameter Tractable complexity
54
XPath Query Evaluation
Input: XML document D, XPath query Q
Output: a subset of the nodes of D, as defined by Q
We will follow “Efficient Algorithms for Processing XPath Queries” (Gottlob, Koch, Pichler, 2003)
55
Simple algorithm
process-location-step(n, Q) {
  S := apply Q.first to n;
  if |Q| > 1
    for each node n’ in S do
      process-location-step(n’, Q.next)
}
56
Complexity Worst case: in each step of Q the axis is “following” So we apply the query in each step on O(|D|) nodes And we get Time(|Q|)= |D|*Time(|Q|-1) I.e. the complexity is O(|D|^|Q|)
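The blow-up can be observed on a toy model. The sketch below assumes a flat hypothetical document whose nodes are the integers 0..9 in document order, and a query given as a list of axis functions; it also returns the number of recursive calls so that the growth is visible. The recursion mirrors the simple algorithm, with the selected nodes collected at the last step.

```python
from typing import Callable, List, Set, Tuple

SIZE = 10  # hypothetical flat document: nodes 0..9 in document order, no nesting

Axis = Callable[[int], range]


def following(n: int) -> range:
    """Nodes after n in document order (the worst-case axis used on the slide)."""
    return range(n + 1, SIZE)


def process_location_step(n: int, query: List[Axis]) -> Tuple[Set[int], int]:
    """Return (selected nodes, number of calls made), re-evaluating each node naively."""
    selected = query[0](n)
    if len(query) == 1:
        return set(selected), 1
    result: Set[int] = set()
    calls = 1
    for m in selected:
        nodes, sub_calls = process_location_step(m, query[1:])
        result |= nodes
        calls += sub_calls
    return result, calls
```

Running `process_location_step(0, [following] * k)` for growing k shows the call count increasing much faster than the number of distinct (node, remaining query) pairs, which is what the dynamic-programming algorithms avoid.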
57
Polynomial data complexity Sometimes considered good even if exponential in the query size But can we have polynomial combined complexity? Yes!
58
Xpath query parse tree descendant::b/following-sibling::* [position() != last()]
59
Bottom-up vs. Top-down evaluation We will discuss two kinds of query evaluation algorithms: – Bottom-up means that the query parse tree is processed from the leaves up to the root – Top-down means that the parse tree is processed from the root to the leaves When processing we will fill in a Context- value table
60
Bottom-up evaluation Main idea: compute the value for each leaf for every possible context Propagate upwards until the root Dynamic programming algorithm to avoid re- evaluation of queries in the same context
61
An equivalent semantics to XPath
The domain of contexts is C = dom × { ⟨k,n⟩ | 1 ≤ k ≤ n ≤ |dom| }
A context is c = ⟨x,k,n⟩ where
– x is the context node
– k is the context position
– n is the context size
63
Context-value Table Given a query sub-expression e, the context- value table of e specifies all combinations of context c and value v, such that computing e on the context c results in v Bottom-up algorithm follows: compute the context-value table in a bottom-up fashion with respect to the query
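As a small illustration, the sketch below builds context-value tables for the predicate position() != last() from the parse-tree example, bottom-up with respect to the expression: first the two leaves, then their combination. Contexts are triples (x, k, n) with 1 ≤ k ≤ n ≤ |dom|; the dictionary representation and the function names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Context = Tuple[str, int, int]  # (context node, context position, context size)


def contexts(dom: List[str]) -> List[Context]:
    """All contexts (x, k, n) with 1 <= k <= n <= |dom|."""
    size = len(dom)
    return [(x, k, n) for x in dom for n in range(1, size + 1) for k in range(1, n + 1)]


def table_position(dom: List[str]) -> Dict[Context, int]:
    """Table of the leaf position(): the value is the context position."""
    return {c: c[1] for c in contexts(dom)}


def table_last(dom: List[str]) -> Dict[Context, int]:
    """Table of the leaf last(): the value is the context size."""
    return {c: c[2] for c in contexts(dom)}


def table_not_equal(left: Dict[Context, int], right: Dict[Context, int]) -> Dict[Context, bool]:
    """Table of the parent node '!=', combined context by context from its children."""
    return {c: left[c] != right[c] for c in left}
```

Each table has one row per context, so its size is bounded by |dom|^3, which is where the cubic factor in the space bound comes from.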
64
Bottom-up algorithm
65
Example
66
Complexity O(|D|^3*|Q|) space ignoring strings and numbers – O(|Q|) tables, with 3 columns, each including values in 1…|D| thus O(|D|^3*|Q|) – An extra O(|D|*|Q|) multiplicative factor for strings and numbers O(|D|^5*|Q|) time ignoring strings and numbers – It can take O(|D|^2) to combine two nodesets – Extra O(|Q|) in case of strings and numbers
67
Optimization Represent contexts as pairs of current and previous node Allows to get the time complexity down to O(|D|^4* |Q|^2) Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations
68
Top-down evaluation Similar idea But allows to compute only values for contexts that are needed Same worst-case bounds
69
Top-down or bottom-up? General question in processing XML trees The tradeoff: – Usually easier to combine results computed in children to obtain the result at the parent So bottom-up traversal is usually easier to design – On the other hand, some of the computation is redundant since we don’t know if it will become relevant So top-down traversal may be more efficient
70
Linear-time fragment
Core XPath includes only navigation
– / and //
Core XPath can be evaluated in O(|D|*|Q|)
Observation: no need to consider the entire triple, only the current context node
Top-down or bottom-up evaluation with essentially the same algorithm
But smaller tables (for every query node, all document nodes and values of evaluation)
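A minimal sketch of the set-at-a-time idea for navigation-only queries. It assumes a hypothetical six-node document stored as a parent array in which parents are numbered before their children, and queries given as (axis, label test) pairs evaluated from node 0. Each step maps a whole node set to a node set in one pass over the document, so a query with |Q| steps costs O(|D|*|Q|) set operations.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical document: node i has parent PARENT[i]; parents precede children.
PARENT: List[Optional[int]] = [None, 0, 0, 1, 1, 2]
LABEL: List[str] = ["a", "b", "c", "b", "d", "b"]


def child(nodes: Set[int]) -> Set[int]:
    """All children of the given nodes, in one pass over the document."""
    return {n for n in range(len(PARENT)) if PARENT[n] in nodes}


def descendant(nodes: Set[int]) -> Set[int]:
    """All proper descendants, in one pass (relies on parents preceding children)."""
    result: Set[int] = set()
    for n in range(len(PARENT)):
        if PARENT[n] in nodes or PARENT[n] in result:
            result.add(n)
    return result


AXES: Dict[str, Callable[[Set[int]], Set[int]]] = {"child": child, "descendant": descendant}


def evaluate(query: List[Tuple[str, str]]) -> Set[int]:
    """Evaluate the steps left to right on node sets, starting from node 0."""
    current: Set[int] = {0}
    for axis, test in query:
        current = AXES[axis](current)
        if test != "*":
            current = {n for n in current if LABEL[n] == test}
    return current
```

Because every intermediate result is a set of document nodes, no node is processed twice within a step, in contrast with the node-at-a-time recursion of the simple algorithm.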