XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries on them – Keeping in mind the practical aspects but.

XML Typing and Query Evaluation

Plan
We will put a formal model underneath XML trees and the queries on them
– Keeping the practical aspects in mind but abstracting details away
– This will allow us to solve problems that are important in practice, in a foundational way
Important issues
– Types: defining a language of trees
  Useful for verifying the validity of a tree
  And for optimizing queries
– Query evaluation on a document
– Query evaluation on a type (type inference / checking)

XML typing
Not compulsory
Simplifies writing software for XML
– Improves interoperability between programs
Improves storage and performance
Simplifies data protection
– Reject illegal updates

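Improve performance
[Figure: type tree for Bib, with labels paper, book, year, journal, title, int, string, address, author, title, zip, city, street, last name, first name, string]
With the type known, the descendant step in //address//zip[@zip="12345"] can be replaced by a child step: //address/zip[@zip="12345"]
Typing semistructured data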

Improve storage
[Figure: schema graph with nodes Root, Company, Employee and string, and edges company, person, works-for, c.e.o., address, name, managed-by, name; next to it, tables Company and Employee]
“Well-behaved” parts can go to relational DBs
Store the rest in XML (file)

Type checking
Who checks
– XML editor: checks that the data conforms to its type
– XML exchange, e.g., with a Web service
  Server: when delivering the data
  Client/application: when receiving it
Dynamic verification: after the data is produced
Static verification: verification of the program that generates the data

Static verification
Input: input type T and code of function f
– f is XQuery, XPath, etc.
Verification of T’
– Is it true that ∀d ⊨ T, f(d) ⊨ T’ ?
Type inference
– Find the smallest T’ such that ∀d ⊨ T, f(d) ⊨ T’
A type is a language of trees

Example
F = for $p in doc("parts.xml")//part[color="red"]
    return <part> <name>{$p/name/text()}</name> <desc>{$p/desc/node()}</desc> </part>
Result type: ( part ( name (string) desc (any) ) )*
If the type of parts.xml//part/desc is string: ( part ( name (string) desc (string) ) )*

Difficulty
Semantics: for $X in Input, $Y in the input do { output ( <b/> ) }
Can be written in XQuery
Input: a sequence of n elements. Result: n² elements labeled b
Problem: { b^i | i = n² for n ≥ 0 } cannot be described in XML Schema
There is no « best » result
– b*
– ε + b² b*
– ε + b² + b⁴ b*
– ε + b² + b⁴ + b⁹ b*
– …

Why tree automata?
XML = unranked trees
No theory for XML
Rich theory for strings: automata
Extends to a rich theory for ranked trees: tree automata
– Nice algorithms
– Nice theorems
– Can this carry over to unranked trees and XML? Yes!

From strings to trees
[Figure: a word, a binary tree and an unranked tree over the labels a and b]
Word: finite state automata
Binary tree: ranked tree automata
Unranked tree (no bound on the number of children): unranked tree automata

Why not then use unranked tree automata? Missing practical gadgets Complexity of verification – Goal: typing at reasonable cost

Automata
Automata on words

Finite state automata on words
Alphabet
States
Initial state
Accepting states
Transitions

Nondeterministic automaton: Example
[Figure: a nondeterministic automaton with states q0, q1, q2 and its runs on input words over a and b; one run ends in KO (reject), another in OK (accept)]

Reminder
Deterministic
– No ε-transition
– No alternative transitions (two transitions from the same state on the same letter)
Determinization
– It is possible to obtain an equivalent deterministic automaton
– State of the new automaton = set of states of the original one
– Possible exponential blow-up
Minimization
Limitations
– Cannot recognize everything, e.g., context-free languages in general
Essential tool
– e.g., lexical analysis
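The determinization step above can be sketched as the classical subset construction. This is a minimal sketch assuming an automaton without ε-transitions; the function names and the example automaton (words over {a, b} ending in "ab") are hypothetical and not taken from the slides.

```python
from itertools import chain


def determinize(alphabet, delta, start, accepting):
    """Subset construction for an NFA without epsilon-transitions.

    delta maps (state, letter) -> set of successor states.
    Returns (dfa_delta, dfa_start, dfa_accepting); DFA states are frozensets
    of NFA states, and only reachable subsets are built.
    """
    dfa_start = frozenset([start])
    dfa_delta, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        current = todo.pop()
        for letter in alphabet:
            target = frozenset(
                chain.from_iterable(delta.get((q, letter), ()) for q in current)
            )
            dfa_delta[(current, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepting = {s for s in seen if s & set(accepting)}
    return dfa_delta, dfa_start, dfa_accepting


def dfa_accepts(dfa, word):
    """Run the deterministic automaton on a word."""
    dfa_delta, state, dfa_accepting = dfa
    for letter in word:
        state = dfa_delta[(state, letter)]
    return state in dfa_accepting


# Hypothetical NFA: words over {a, b} that end with "ab".
NFA_DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
DFA = determinize("ab", NFA_DELTA, "q0", {"q2"})
```

Only reachable subsets are created, so the exponential blow-up shows up only for automata that actually need it.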

Reminder (2)
L(A) = set of words accepted by automaton A
Regular languages
Can be described by regular expressions, e.g. a(b+c)*d
Closed under complement
Closed under union, intersection
– Product automaton with states (s,s’) where s is from A and s’ is from A’
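The product automaton for intersection can be sketched as follows, assuming both automata are deterministic and complete. The two example automata (an even number of a's, and words ending in b) are hypothetical illustrations.

```python
def product_dfa(dfa1, dfa2, alphabet):
    """Intersection of two complete DFAs, each given as (delta, start, accepting).

    States of the product are pairs (s, s2) with s from the first automaton
    and s2 from the second; a pair accepts when both components accept.
    """
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    start = (s1, s2)
    delta, todo, seen = {}, [start], {start}
    while todo:
        p, q = todo.pop()
        for letter in alphabet:
            target = (d1[(p, letter)], d2[(q, letter)])
            delta[((p, q), letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {(p, q) for (p, q) in seen if p in f1 and q in f2}
    return delta, start, accepting


def accepts(dfa, word):
    delta, state, accepting = dfa
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting


# Hypothetical DFAs over {a, b}.
EVEN_A = ({("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e", ("o", "b"): "o"}, "e", {"e"})
ENDS_B = ({("n", "a"): "n", ("y", "a"): "n", ("n", "b"): "y", ("y", "b"): "y"}, "n", {"y"})
BOTH = product_dfa(EVEN_A, ENDS_B, "ab")
```

For union, the same construction works with "p in f1 or q in f2" as the acceptance condition.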

Automata on words versus trees
[Figure: a word read left to right or right to left; a tree read bottom up or top down]
Words: left to right or right to left, no difference
Trees: bottom up or top down, differences

Automata
Automata on ranked trees

Binary tree automata
Parallel evaluation
[Figure: bottom-up run on a binary tree over a and b; rules for leaves assign a state to a leaf label, and rules for other nodes map a label and the states q, q’ of the two children to a state q”]

Bottom-up tree automata
Bottom-up: if a node labeled a has its children in states q, q’, then the node moves nondeterministically to state r or r’
Accepts if the root is in some state in F
Not deterministic if there are alternatives or ε-transitions
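A bottom-up run on a binary tree can be sketched by computing, for every node, the set of states the automaton may reach there. This is a minimal sketch without ε-transitions; the tuple encoding of trees and the example automaton (trees with at least one leaf labeled b) are assumptions made for illustration.

```python
def reachable_states(tree, leaf_rules, node_rules):
    """Set of states a nondeterministic bottom-up automaton can reach at the root.

    tree is (label,) for a leaf or (label, left, right) for an inner node.
    leaf_rules maps label -> set of states.
    node_rules maps (label, left_state, right_state) -> set of states.
    """
    if len(tree) == 1:
        return set(leaf_rules.get(tree[0], ()))
    label, left, right = tree
    left_states = reachable_states(left, leaf_rules, node_rules)
    right_states = reachable_states(right, leaf_rules, node_rules)
    result = set()
    for q in left_states:
        for q2 in right_states:
            result |= node_rules.get((label, q, q2), set())
    return result


def accepts(tree, leaf_rules, node_rules, final):
    """Accept if the root can be in some final state."""
    return bool(reachable_states(tree, leaf_rules, node_rules) & final)


# Hypothetical automaton: accept trees that contain at least one leaf labeled b.
LEAF = {"a": {"no"}, "b": {"yes"}}
NODE = {
    ("a", l, r): {"yes" if "yes" in (l, r) else "no"}
    for l in ("yes", "no")
    for r in ("yes", "no")
}
FINAL = {"yes"}
```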

Example: deterministic bottom-up

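Boolean circuit evaluation
[Figure: a circuit tree with gate nodes (shown as v) and leaves labeled 0 or 1, evaluated bottom-up; the root evaluates to 1: OK]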
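Circuit evaluation is a deterministic bottom-up automaton with two states, 0 and 1. The sketch below assumes binary "or" and "and" gates and a tuple encoding of circuits; the gate names are illustrative, since the gate symbols on the slide were partly lost.

```python
# Transition table of the automaton: (gate label, left state, right state) -> state,
# written as functions for brevity.
GATES = {"or": lambda x, y: x | y, "and": lambda x, y: x & y}


def evaluate(circuit):
    """State (0 or 1) reached at the root.

    circuit is the leaf 0 or 1, or a triple (gate, left, right).
    """
    if circuit in (0, 1):
        return circuit
    gate, left, right = circuit
    return GATES[gate](evaluate(left), evaluate(right))


def accepts(circuit):
    """The only final state is 1."""
    return evaluate(circuit) == 1
```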

Regular tree language = set of trees accepted by a bottom-up tree automaton

Regular tree languages
Theorem: the following are equivalent
– L is a regular tree language
– L is accepted by a nondeterministic bottom-up automaton
– L is accepted by a deterministic bottom-up automaton
– L is accepted by a nondeterministic top-down automaton
Deterministic top-down is weaker

Top-down tree automata
Top-down: if a node labeled a is in state q”, then its left child moves to state q and its right child to state q’
Accepts if all leaves are in states in F
Not deterministic if there are alternative transitions for the same label and state

Why is deterministic top-down weaker?
Consider the language
– L = { f(a,b), f(b,a) }
It can be accepted by a bottom-up TA
– Exercise: write a BUTA A such that L = L(A)
Suppose that B is a deterministic top-down TA that accepts both trees in L
– Exercise: show that B also accepts f(a,a)
– A contradiction
Fact: no deterministic top-down tree automaton accepts exactly L

Ranked tree automata: Properties
Like for words
Determinization
Minimization
Closed under
– Complement
– Intersection
– Union

But… XML documents are unranked: book (intro,section*,conclusion)

Automata
Automata on unranked trees

Unranked tree automata
Issue: represent an infinite set of transitions
Solution: a regular language

Unranked tree automata (2)
Rule: (a, L(Q)) → {r1,…,rm}
Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
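A rule of this kind can be sketched with a regular expression over state names. This is a minimal sketch, assuming states are single characters (so that a word of states is a plain string) and trees are pairs (label, list of children); the rules shown encode the content model book (intro, section*, conclusion) and are a hypothetical example.

```python
import re
from itertools import product


def states_of(tree, rules):
    """States reachable at the root of tree = (label, [children]).

    rules is a list of (label, pattern, targets): if the children's states,
    read left to right, form a word matching the regular expression pattern,
    a node with that label may move to any state in targets.
    """
    label, children = tree
    child_states = [states_of(child, rules) for child in children]
    reached = set()
    # Try every word of states the children can be in (nondeterminism).
    for word in product(*child_states):
        for rule_label, pattern, targets in rules:
            if rule_label == label and re.fullmatch(pattern, "".join(word)):
                reached |= set(targets)
    return reached


# Hypothetical rules: i, s, c, k are the states for intro, section, conclusion, book.
RULES = [
    ("intro", "", {"i"}),
    ("section", "", {"s"}),
    ("conclusion", "", {"c"}),
    ("book", "is*c", {"k"}),
]
```

Enumerating all words of child states is exponential in the worst case; an actual implementation would run the word automaton for L(Q) on sets of states instead.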

Building on ranked trees
[Figure: an unranked tree over a and b and its encoding as a ranked (binary) tree]
Ranked tree: FirstChild-NextSibling
F: encoding into a ranked tree
F is a bijection
F⁻¹: decoding
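The FirstChild-NextSibling encoding F and its inverse can be sketched as below. The representation is an assumption: an unranked tree is a pair (label, list of children), and a binary node is a triple (label, first_child, next_sibling) with None for a missing child.

```python
def encode(tree, siblings=()):
    """F: left binary child = first child, right binary child = next sibling."""
    label, children = tree
    first = encode(children[0], children[1:]) if children else None
    nxt = encode(siblings[0], siblings[1:]) if siblings else None
    return (label, first, nxt)


def decode_forest(node):
    """Decode a binary node into the list of unranked trees it stands for."""
    if node is None:
        return []
    label, first, nxt = node
    return [(label, decode_forest(first))] + decode_forest(nxt)


def decode(node):
    """F^-1: the root of an encoded tree has no next sibling."""
    return decode_forest(node)[0]
```

With this representation, F is a bijection between unranked trees and binary trees whose root has no right child.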

Building on bottom-up ranked trees (2)
For each unranked TA A, there is a ranked TA accepting F(L(A))
For each ranked TA A, there is an unranked TA accepting F⁻¹(L(A))
Both are easy to construct
Consequence: unranked TA are closed under union, intersection, complement
Determinization is also possible, a bit more tricky

Top-down?
This is more delicate
Transition δ(a,q) = A(a,q)
– The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node
– Accepts if all the leaves are in an accepting state

Boolean circuit evaluation
[Figure: top-down run on a circuit tree whose leaves carry the values 0 and 1]
A tree is accepted if, for some possible run, the states of all leaves are final

Tree automata and monadic second-order logic

Monadic second-order logic
Representation of a tree as a logical structure
[Figure: a tree with nodes numbered 1 to 9, each labeled a or b]
E(1,2), E(1,3)… E(3,9)
S(2,3), S(3,4), S(4,5)… S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)

Monadic second-order logic
E(1,2), E(1,3)… E(3,9)
S(2,3), S(3,4), S(4,5)… S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
MSO syntax
Set variable
Quantification over a set variable

Example of MSO
Each a node has a b-descendant
This corresponds to the formula saying:
For each node x labeled a, each set X that contains x and that is closed under descendants contains some y labeled b
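The sentence above can be written out as an MSO formula. This is a reconstruction under assumptions, not the slide's own formula: E(y,z) is read as "z is a child of y", as in the facts E(1,2), E(1,3), and each node is assumed to carry exactly one label, so the witness y is a proper descendant of x.

```latex
\forall x \,\Big( a(x) \rightarrow
  \forall X \,\big( \big( X(x) \wedge
      \forall y \,\forall z \,( X(y) \wedge E(y,z) \rightarrow X(z) ) \big)
    \rightarrow \exists y \,( X(y) \wedge b(y) ) \big) \Big)
```

The smallest set X satisfying the premise is x together with all its descendants, so the formula holds exactly when one of them is labeled b.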

Bridge
Theorem: for a set L of trees, the following are equivalent
1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO

XML typing
DTDs

DTD
Describes the children of a node labeled a by a regular expression
Bizarre syntax

DTD and determinism
Regular expressions in DTDs should be deterministic
– Complicated definition
Intuition: the corresponding automaton should be deterministic
– (a+b)*a is not
– When reading an a, one cannot tell whether it is an a from (a+b) or the final a
– (b*a)(b*a)* is an equivalent expression that is deterministic
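The claimed equivalence of the two expressions can be checked by brute force on all short words. This is only a sanity check up to a length bound, not a proof; note that Python's `re` writes union as `|` where the slide writes `+`.

```python
import re
from itertools import product

AMBIGUOUS = re.compile(r"(a|b)*a")          # the slide's (a+b)*a
DETERMINISTIC = re.compile(r"(b*a)(b*a)*")  # the deterministic rewriting


def agree_up_to(max_len):
    """True if both expressions accept the same words over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            in_first = AMBIGUOUS.fullmatch(word) is not None
            in_second = DETERMINISTIC.fullmatch(word) is not None
            if in_first != in_second:
                return False
    return True
```

Both expressions describe the words that end in a; the second one commits to a choice as soon as each letter is read.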

Very efficient validation
It suffices to verify, for each node labeled a, that the word formed by the labels of its children is accepted by the finite state automaton A_a
Possible to type check the document while scanning it
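The per-node check can be sketched with one regular expression per label. This is a simplified sketch: it works on a parsed tree rather than while scanning, it ignores text and attributes, and the `DTD` dictionary (with each child label followed by a space so that labels cannot run together) is a hypothetical stand-in for real DTD syntax.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, one per label, as regular expressions over
# child labels; each label is followed by a single space.
DTD = {
    "book": r"intro (section )*conclusion ",
    "intro": r"",
    "section": r"",
    "conclusion": r"",
}


def is_valid(element, dtd):
    """Check every node: the word of its children's labels must match its model."""
    word = "".join(child.tag + " " for child in element)
    model = dtd.get(element.tag)
    if model is None or re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child, dtd) for child in element)
```

A call such as `is_valid(ET.fromstring("<book><intro/><conclusion/></book>"), DTD)` checks the root's children first and then recurses.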

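Very efficient validation (2)
[Figure: a document tree with root a and children b, c, d, d, validated while scanning; the word automata A_a and A_b move through states s, t, u and s’, t’ and end in Accept]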

Warning
The previous example can be checked with a simple automaton on words
But not the following one
The stack is needed for accepting [Figure: n opening tags … matched by n closing tags]

Some bad news for DTD
Not closed under union
– DTD1… DTD2…
– L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton
– Problem with the type of ad, which depends on its parent
Also not closed under complement
Limited expressive power

Car example continued
The best DTD we can choose does not distinguish between ads for used and new cars
[Figure: a Car tree with Used and New subtrees, with Brand and Year leaves such as “Renault”, “2008” and “BMW”]

Decoupled types in XML Schema
Each type corresponds to a label, not conversely
car: [car] ( used + new )*
used: [used] ( ad1* )
new: [new] ( ad2* )
ad1: [ad] ( year, brand )
ad2: [ad] ( brand )
Tags are in square brackets; type names come before the colon (green and blue on the original slide)
Nice closure properties
Many other « gadgets » in XML Schema
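The decoupled types above can be checked bottom-up by computing, for each node, the set of types it can have. This is a minimal sketch: the leaf types `year` and `brand` are assumptions (they are not spelled out on the slide), trees are pairs (label, list of children), and each type name in a content model is followed by a space.

```python
import re
from itertools import product

# type name -> (tag, regular expression over the type names of the children)
TYPES = {
    "car": ("car", r"((used|new) )*"),
    "used": ("used", r"(ad1 )*"),
    "new": ("new", r"(ad2 )*"),
    "ad1": ("ad", r"year brand "),
    "ad2": ("ad", r"brand "),
    "year": ("year", r""),    # assumed leaf type
    "brand": ("brand", r""),  # assumed leaf type
}


def types_of(tree):
    """Set of type names that the node tree = (label, [children]) can take."""
    label, children = tree
    child_types = [types_of(child) for child in children]
    found = set()
    for name, (tag, model) in TYPES.items():
        if tag != label:
            continue
        # The node has type `name` if some choice of child types matches the model.
        for word in product(*child_types):
            if re.fullmatch(model, "".join(t + " " for t in word)):
                found.add(name)
                break
    return found
```

Here the same tag ad gets type ad1 or ad2 depending on its content, which is what lets the parent tell used ads from new ones.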

Xpath Query Evaluation

Goal
Evaluating an XPath query against a given document
– To find all matches
We will also consider type checking
– Given an XPath query Q, input DTD T and output DTD T’, does Q(D) satisfy T’ for every document D satisfying T?
Complexity is important
– Huge documents

Data complexity vs. combined complexity
Two inputs to the query evaluation problem
– Data (XML document) of size |D|
– Query (XPath expression) of size |Q|
– Usually |Q| << |D|
Polynomial data complexity
– Complexity that is polynomial in |D|, possibly exponential in |Q|
Polynomial combined complexity
– Complexity that is polynomial in |D| and |Q|
Fixed-parameter tractable complexity

XPath Query Evaluation
Input: XML document D, XPath query Q
Output: a subset of the nodes of D, as defined by Q
We will follow “Efficient Algorithms for Processing XPath Queries” (Gottlob, Koch, Pichler, 2003)

Simple algorithm
process-location-step(n, Q) {
  S := apply Q.first to n;
  if |Q| > 1
    for each node n’ in S do process-location-step(n’, Q.next)
}

Complexity
Worst case: in each step of Q the axis is “following”
So each step applies the rest of the query to O(|D|) nodes
And we get Time(|Q|) = |D| * Time(|Q|-1)
I.e. the complexity is O(|D|^|Q|)
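The blow-up can be observed by counting recursive calls of the simple algorithm. The sketch below is an illustration under assumptions: it supports only the child and parent axes, and it uses a tiny hypothetical document (an a node with two b children) queried with alternating child/parent steps instead of the “following” axis.

```python
class Node:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self


AXES = {
    "child": lambda n: n.children,
    "parent": lambda n: [n.parent] if n.parent is not None else [],
}


def process_location_step(node, steps, counter):
    """Naive evaluation: recurse on every selected node, with no sharing of work."""
    counter[0] += 1
    selected = AXES[steps[0]](node)
    if len(steps) == 1:
        return set(selected)
    result = set()
    for n in selected:
        result |= process_location_step(n, steps[1:], counter)
    return result


def count_calls(k):
    """Calls needed for child/parent repeated k times, then child, on <a><b/><b/></a>."""
    root = Node("a", [Node("b"), Node("b")])
    counter = [0]
    result = process_location_step(root, ["child", "parent"] * k + ["child"], counter)
    return result, counter[0]
```

Each extra child/parent pair roughly doubles the number of calls although the answer never changes, which is the redundancy the context-value tables remove.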

Polynomial data complexity Sometimes considered good even if exponential in the query size But can we have polynomial combined complexity? Yes!

Xpath query parse tree descendant::b/following-sibling::* [position() != last()]

Bottom-up vs. top-down evaluation
We will discuss two kinds of query evaluation algorithms:
– Bottom-up means that the query parse tree is processed from the leaves up to the root
– Top-down means that the parse tree is processed from the root to the leaves
When processing we will fill in a context-value table

Bottom-up evaluation
Main idea: compute the value of each leaf for every possible context
Propagate upwards until the root
Dynamic programming algorithm, to avoid re-evaluating queries in the same context

An equivalent semantics to XPath
The domain of contexts is C = dom × { ⟨k,n⟩ | 1 ≤ k ≤ n ≤ |dom| }
A context is c = ⟨x,k,n⟩ where
– x is the context node
– k is the context position
– n is the context size

Context-value table
Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v such that computing e on the context c results in v
The bottom-up algorithm follows: compute the context-value tables in a bottom-up fashion with respect to the query

Bottom-up algorithm

Example

Complexity
O(|D|^3*|Q|) space ignoring strings and numbers
– O(|Q|) tables, with 3 columns, each including values in 1…|D|, thus O(|D|^3*|Q|)
– An extra O(|D|*|Q|) multiplicative factor for strings and numbers
O(|D|^5*|Q|) time ignoring strings and numbers
– It can take O(|D|^2) to combine two node sets
– Extra O(|Q|) in case of strings and numbers

Optimization
Represent contexts as pairs of current and previous node
This brings the time complexity down to O(|D|^4*|Q|^2)
Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations

Top-down evaluation
Similar idea
But computes values only for the contexts that are needed
Same worst-case bounds

Top-down or bottom-up?
General question in processing XML trees
The tradeoff:
– It is usually easier to combine results computed in the children to obtain the result at the parent
  So bottom-up traversal is usually easier to design
– On the other hand, some of the computation is redundant since we don’t know if it will become relevant
  So top-down traversal may be more efficient

Linear-time fragment
Core XPath includes only navigation
– / and //
Core XPath can be evaluated in O(|D|*|Q|)
Observation: no need to consider the entire triple, only the current context node
Top-down or bottom-up evaluation with essentially the same algorithm
But smaller tables (for every query node, all document nodes and values of evaluation)
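The linear bound comes from evaluating each step once on a whole set of nodes instead of once per node. The sketch below is a simplification under assumptions: it handles only predicate-free paths with the child and descendant axes, and the `Node` class and the (axis, label) step format are hypothetical.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def child_axis(nodes):
    """All children of a set of nodes, in O(|D|)."""
    return {c for n in nodes for c in n.children}


def descendant_axis(nodes):
    """All proper descendants of a set of nodes; each node is expanded once."""
    found, stack = set(), [c for n in nodes for c in n.children]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(node.children)
    return found


AXES = {"child": child_axis, "descendant": descendant_axis}


def evaluate(start, steps):
    """Apply each (axis, label) step to the current node set: O(|D|) per step."""
    current = set(start)
    for axis, label in steps:
        current = {n for n in AXES[axis](current) if label in ("*", n.label)}
    return current


# Hypothetical document: <a><b><c/></b><c><b/></c></a>
DOC = Node("a", [Node("b", [Node("c")]), Node("c", [Node("b")])])
```

Since every step costs O(|D|) regardless of how many nodes are current, a path with |Q| steps costs O(|D|*|Q|) in total.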
