
XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries on them – Keeping in mind the practical aspects but.


1 XML Typing and Query Evaluation

2 Plan We will put some formal model underlying XML Trees and queries on them – Keeping in mind the practical aspects but abstracting details away – This will allow us to solve problems that are important in practice, in a foundational way Important issues – Types: defining a language of trees Will be useful for verifying validity of a tree And for optimizations of queries – Query evaluation on a document – Query evaluation on a type (type inference / checking)

3 XML typing Not compulsory Simplify writing software for XML – Improve interoperability between programs Improve storage and performance Simplify data protection – Reject illegal update

4 Improve performance [Figure: type graph with nodes Bib, paper, book, year, journal, title, address, author, zip, city, street, last name, first name and leaf types int, string] Queries: //address//zip[@zip="12345"] versus //address/zip[@zip="12345"] Typing semistructured data

5 Improve storage [Figure: schema graph with nodes Root, Company, Employee, edges company, person, works-for, c.e.o., managed-by, address, name, and string leaves] "Well-behaved" parts (Company, Employee) can go to relational DBs Store the rest in XML (file)

6 Type checking Who checks – XML editor: check that the data conforms to its type – XML exchange, e.g., with Web service Server when delivering the data Client/application: when receiving it Dynamic verification: after the data is produced Static verification: verification of the program that generates the data

7 Static verification Input: input type T and code of function f – f is XQuery, XPath, etc. Verification of T′ – Is it true that ∀ d ⊨ T, f(d) ⊨ T′ ? Type inference – Find the smallest T′ such that ∀ d ⊨ T, f(d) ⊨ T′ A type is a language of trees

8 Example F = for $p in doc("parts.xml")//part[color="red"] return <part><name>{$p/name/text()}</name><desc>{$p/desc/node()}</desc></part> Result type: (part (name (string) desc (any)))* If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*

9 Difficulty Semantics: for $X in the input, $Y in the input do { output (…) } Can be written in XQuery Input: … Result: … Problem: { bⁱ | i = n² for n ≥ 0 } cannot be described in XML Schema There is no « best » result – b* – ε + b² b* – ε + b² + b⁴ b* – ε + b² + b⁴ + b⁹ b* – …

10 Why tree automata? XML = unranked trees No theory for XML Rich theory for strings: Automata Extend to rich theory for ranked trees: Tree automata – Nice algorithms – Nice theorems – Can this carry to unranked trees and XML? Yes!

11 From strings to trees [Figure: a word, a binary tree, and an unranked tree over the labels a and b] Word: finite state automata Binary tree: ranked tree automata Unranked tree (no bound on the number of children): unranked tree automata

12 Why not then use unranked tree automata? Missing practical gadgets Complexity of verification – Goal: typing at reasonable cost

13 Automata on words

14 Finite state automata on words Components: alphabet, states, initial state, accepting states, transitions

15 Nondeterministic automaton: Example [Figure: runs of a nondeterministic automaton with states q0, q1, q2 on words over {a, b}; one run fails (KO), another succeeds (OK)]

16 Reminder Deterministic – No ε-transition – No alternative transitions (two transitions from the same state on the same letter) Determinization – It is possible to obtain an equivalent deterministic automaton – State of new automaton = set of states of the original one – Possible exponential blow-up Minimization Limitations – Cannot recognize context-free languages in general Essential tool – e.g., lexical analysis

17 Reminder (2) L(A) = set of words accepted by automaton A Regular languages – Can be described by regular expressions, e.g. a(b+c)*d – Closed under complement – Closed under union, intersection: product automaton with states (s,s′) where s is from A and s′ is from A′
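
The determinization recalled above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the automaton has no ε-transitions, transitions are stored in a dictionary, and the example automaton `DELTA` (words over {a, b} that end with a) is hypothetical, not taken from the slides.

```python
from itertools import chain

def nfa_accepts(delta, start, accepting, word):
    """Run an NFA (no epsilon moves) by tracking the set of reachable states."""
    current = {start}
    for letter in word:
        current = set(chain.from_iterable(delta.get((q, letter), ()) for q in current))
    return bool(current & accepting)

def determinize(delta, start, accepting, alphabet):
    """Subset construction: each state of the new automaton is a set of old states."""
    start_set = frozenset({start})
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        s = todo.pop()
        for letter in alphabet:
            t = frozenset(chain.from_iterable(delta.get((q, letter), ()) for q in s))
            dfa_delta[(s, letter)] = t
            if t not in seen:
                seen.add(t)
                todo.append(t)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa_delta, start_set, dfa_accepting

def dfa_accepts(dfa_delta, start_set, dfa_accepting, word):
    """Run the deterministic automaton: exactly one state at each step."""
    s = start_set
    for letter in word:
        s = dfa_delta[(s, letter)]
    return s in dfa_accepting

# Hypothetical NFA for the words over {a, b} that end with a.
DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
START, ACCEPTING = "q0", {"q1"}
DFA = determinize(DELTA, START, ACCEPTING, "ab")
```

On this small example the subset construction reaches only two state sets; the exponential blow-up mentioned on the slide appears for other automata.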

18 Automata on words versus trees Words: left to right or right to left, no difference Trees: bottom-up or top-down, with differences [Figure: a word and a tree over the labels a and b, with arrows for each reading direction]

19 Automata on ranked trees

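20 Binary tree automata Parallel evaluation, bottom-up For leaves: a rule gives the state of a leaf from its label For other nodes: a rule gives the state of a node from its label and the states q, q′ of its two children [Figure: a binary tree over the labels a and b annotated bottom-up with states]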

21 Bottom-up tree automata Bottom-up: if a node labeled a has its children in states q, q′, then the node moves nondeterministically to state r or r′ Accepts if the root is in some state in F Not deterministic if there are alternatives or ε-transitions

22 Example: deterministic bottom-up

23 Boolean circuit evaluation [Figure: a tree of Boolean gates over 0/1 leaves, evaluated bottom-up; the root evaluates to 1: OK]
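
Boolean circuit evaluation can be read as a run of a deterministic bottom-up automaton on binary trees. The sketch below is an illustration only: the gate labels "and" and "or", the state names q0 and q1, and the tuple encoding of trees are assumptions made for the example.

```python
# A tree is either a leaf label ("0" or "1") or a triple (label, left, right).
LEAF_RULE = {"0": "q0", "1": "q1"}

# Transition table: (label, state of left child, state of right child) -> state.
NODE_RULE = {}
for left in ("q0", "q1"):
    for right in ("q0", "q1"):
        NODE_RULE[("or", left, right)] = "q1" if "q1" in (left, right) else "q0"
        NODE_RULE[("and", left, right)] = "q1" if (left, right) == ("q1", "q1") else "q0"

ACCEPTING = {"q1"}

def run_bottom_up(tree):
    """Compute the state of the root from the states of its children."""
    if not isinstance(tree, tuple):
        return LEAF_RULE[tree]
    label, left, right = tree
    return NODE_RULE[(label, run_bottom_up(left), run_bottom_up(right))]

def accepts(tree):
    """The tree is accepted if the root ends in an accepting state."""
    return run_bottom_up(tree) in ACCEPTING
```

For instance, `accepts(("or", ("and", "1", "0"), "1"))` evaluates the circuit (1 and 0) or 1. The automaton is deterministic because each key of `NODE_RULE` has exactly one target state.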

24 Regular tree language = set of trees accepted by a bottom-up tree automaton

25 Regular tree languages Theorem: the following are equivalent – L is a regular tree language – L is accepted by a nondeterministic bottom-up automaton – L is accepted by a deterministic bottom-up automaton – L is accepted by a nondeterministic top-down automaton Deterministic top-down is weaker

26 Top-down tree automata Top-down: if a node labeled a is in state q″, then its left child moves to state q and its right child to q′ Accepts if all leaves are in states in F Not deterministic if a pair (label, state) has alternative transitions

27 Why is deterministic top-down weaker? Consider a language L of two trees – L can be accepted by a bottom-up TA – Exercise: write a BUTA A such that L = L(A) Suppose that B is a deterministic top-down TA that accepts both trees in L – Exercise: show that B also accepts a tree outside L – A contradiction Fact: no deterministic top-down tree automaton accepts exactly L

28 Ranked tree automata: Properties Like for words: Determinization Minimization Closed under – Complement – Intersection – Union

29 But… XML documents are unranked: book (intro,section*,conclusion)

30 Automata on unranked trees

31 Unranked tree automata Issue: represent an infinite set of transitions Solution: a regular language

32 Unranked tree automata (2) Rule: a label a and a regular language L(Q) over states lead to states r₁, …, r_m Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r₁, …, r_m}
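
A minimal sketch of such rules, assuming a deterministic automaton (one target state per rule), single-character state names so that the children states form a string, and Python's `re` module to represent the regular language. The rule set below is hypothetical and follows the book (intro, section*, conclusion) example.

```python
import re

# Each rule: (label, regular expression over the children's states, target state).
RULES = [
    ("intro", r"", "i"),
    ("section", r"", "s"),
    ("conclusion", r"", "c"),
    ("book", r"is*c", "b"),
]
ACCEPTING = {"b"}

def state_of(tree, rules):
    """Return the state reached at the root of tree, or None if no rule applies.

    A tree is a pair (label, list of child trees).
    """
    label, children = tree
    child_states = []
    for child in children:
        q = state_of(child, rules)
        if q is None:
            return None
        child_states.append(q)
    word = "".join(child_states)  # the word formed by the states of the children
    for rule_label, language, target in rules:
        if rule_label == label and re.fullmatch(language, word):
            return target
    return None

def accepts(tree, rules=RULES, accepting=ACCEPTING):
    return state_of(tree, rules) in accepting
```

The regular expression is what makes the infinite set of transitions finitely representable: `is*c` covers a book with any number of sections.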

33 Building on ranked trees [Figure: an unranked tree over the labels a and b and its encoding as a ranked tree] Ranked tree: FirstChild-NextSibling F: encoding into a ranked tree F is a bijection F⁻¹: decoding
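
The FirstChild-NextSibling encoding F and its inverse can be sketched as follows. The tuple representations are assumptions for the example: an unranked tree is a pair (label, list of children), and a binary tree is a triple (label, first child, next sibling) with `None` for a missing subtree.

```python
def encode_forest(trees):
    """Encode a list of sibling trees: left = first child, right = next sibling."""
    if not trees:
        return None
    (label, children), rest = trees[0], trees[1:]
    return (label, encode_forest(children), encode_forest(rest))

def encode(tree):
    """F: unranked tree -> binary tree (the root never has a next sibling)."""
    return encode_forest([tree])

def decode_forest(binary):
    """Inverse of encode_forest: rebuild the list of sibling trees."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode_forest(first_child))] + decode_forest(next_sibling)

def decode(binary):
    """F^-1: binary tree whose root has no next sibling -> unranked tree."""
    forest = decode_forest(binary)
    if len(forest) != 1:
        raise ValueError("not the encoding of a single tree")
    return forest[0]
```

In this representation, `decode(encode(t)) == t` holds for every unranked tree `t`, which is the bijection property stated on the slide, restricted to binary trees whose root has no right child.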

34 Building on bottom-up ranked trees (2) For each unranked TA A, there is a ranked TA accepting F(L(A)) For each ranked TA A, there is an unranked TA accepting F⁻¹(L(A)) Both are easy to construct Consequence: unranked TA are closed under union, intersection, complement Determinization is also possible, but a bit more tricky

35 Top-down? This is more delicate Transition δ(a,q) = A(a,q) – The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node – Accepts if all the leaves are in accepting states

36 Boolean circuit evaluation [Figure: top-down runs on a tree of Boolean gates with 0/1 labels] A tree is accepted if, for some possible run, the states of all leaves are final

37 Tree automata and monadic second-order logic

38 Monadic second-order logic Representation of a tree as a logical structure E(1,2), E(1,3)… E(3,9) S(2,3), S(3,4), S(4,5)…S(8,9) a(1), a(4), a(8) b(2), b(3), b(5), b(6), b(7), b(9) [Figure: the corresponding tree with nodes numbered 1 to 9 and labeled a or b]

39 Monadic second-order logic E(1,2), E(1,3)… E(3,9) S(2,3), S(3,4), S(4,5)…S(8,9) a(1), a(4), a(8) b(2), b(3), b(5), b(6), b(7), b(9) MSO syntax Set variable Quantification over a set variable

40 Example of MSO Each a node has a b-descendant This corresponds to the formula: for each node x labeled a, each set X that contains x and is closed under descendant contains some y labeled b
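
One possible way to write this property as an MSO formula is given below. It is a reconstruction from the English reading, not the formula of the original slide; it assumes that E is the child relation of the logical structure and that a(·), b(·) are the label predicates. Closure under E, applied repeatedly, gives closure under descendant.

```latex
\forall x \,\Bigl( a(x) \rightarrow
  \forall X \,\bigl(
    \bigl( x \in X \;\wedge\; \forall y\, \forall z\, ( ( y \in X \wedge E(y,z) ) \rightarrow z \in X ) \bigr)
    \rightarrow \exists y\, ( y \in X \wedge b(y) )
  \bigr)
\Bigr)
```

The smallest set X that satisfies the premise is x together with all its descendants, and x itself is labeled a, so the formula holds exactly when some proper descendant of x is labeled b.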

41 Bridge Theorem: for a set L of trees, the following are equivalent 1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata 2. L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO

42 XML typing: DTDs

43 DTD Describes the children of a node labeled a by a regular expression Bizarre syntax

44 DTD and determinism Regular expressions in DTD should be deterministic – Complicated definition Intuition: the corresponding automaton should be deterministic – (a+b)*a is not – When reading an a, one cannot tell whether it is an a from (a+b) or the a at the end – (b*a)(b*a)* is an equivalent expression that is deterministic

45 Very efficient validation It suffices to verify, for each node labeled a, that the word formed by the labels of its children is accepted by the finite state automaton A_a It is possible to type check the document while scanning it
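
Validation during a single scan can be sketched as below. This is a simplified illustration: the document is assumed to arrive as a list of hypothetical ("open", label) and ("close", label) events, the content model of each label is a Python regular expression over space-terminated child labels, and the children word is matched when the element closes rather than by stepping an automaton A_a letter by letter.

```python
import re

# Hypothetical DTD: label -> regular expression over "childlabel " tokens.
DTD = {
    "book": r"(intro )(section )*(conclusion )",
    "intro": r"",
    "section": r"",
    "conclusion": r"",
}

def validate_stream(events, dtd):
    """Check a stream of open/close events against dtd in one pass."""
    stack = []  # one entry per open element: [label, labels of children seen so far]
    for kind, label in events:
        if kind == "open":
            if label not in dtd:
                return False
            if stack:
                stack[-1][1].append(label)
            stack.append([label, []])
        else:  # "close"
            if not stack or stack[-1][0] != label:
                return False
            _, children = stack.pop()
            word = "".join(child + " " for child in children)
            if not re.fullmatch(dtd[label], word):
                return False
    return not stack  # every opened element must have been closed
```

The stack holds one entry per currently open element, so memory is proportional to the depth of the document and to the number of children of the open elements, not to the whole document.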

46 Very efficient validation (2) [Figure: a tree with labels a, b, c, d validated during a scan by the automata A_a and A_b, with states s, t, u, s′, t′; result: Accept]

47 Warning The previous example can be checked with a simple automaton on words, but not the following one: a stack is needed for accepting [Figure: an example with nesting of depth n]

48 Some bad news for DTD Not closed under union: given DTD1 and DTD2, L(DTD1) ∪ L(DTD2) cannot in general be described by a DTD, but can be described easily by a tree automaton – Problem with the type of ad, which depends on its parent Also not closed under complement Limited expressive power

49 Car example continued The best DTD we can choose does not distinguish between ads for used and new cars [Figure: a Car tree with Used and New subtrees; labels Brand, Year, Brand and values "Renault", "2008", "BMW"]

50 Decoupled types in XML Schema Each type corresponds to a label, not conversely car: [car] (used + new)* used: [used] (ad1*) new: [new] (ad2*) ad1: [ad] (year, brand) ad2: [ad] (brand) Tags are in brackets; type names come before the colon Nice closure properties Many other « gadgets » in XML Schema
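
The decoupling of types from tags can be sketched with a small validator. It is an illustration under assumptions: the leaf types `year` and `brand` are added for the example, content models are Python regular expressions over space-terminated type names, and within one content model each tag is assumed to correspond to a single type, so the type of a child is determined by its tag and the type of its parent.

```python
import re

# type name -> (tag, content model over "typename " tokens)
TYPES = {
    "car": ("car", r"((used|new) )*"),
    "used": ("used", r"(ad1 )*"),
    "new": ("new", r"(ad2 )*"),
    "ad1": ("ad", r"year brand "),
    "ad2": ("ad", r"brand "),
    "year": ("year", r""),    # assumed leaf type
    "brand": ("brand", r""),  # assumed leaf type
}

def child_type_map(type_name, types):
    """Map each tag allowed below type_name to the type used in its content model."""
    _, model = types[type_name]
    return {types[name][0]: name for name in re.findall(r"\w+", model)}

def has_type(tree, type_name, types=TYPES):
    """Check that tree, a pair (tag, list of children), conforms to type_name."""
    tag, children = tree
    expected_tag, model = types[type_name]
    if tag != expected_tag:
        return False
    mapping = child_type_map(type_name, types)
    child_types = []
    for child in children:
        t = mapping.get(child[0])
        if t is None or not has_type(child, t, types):
            return False
        child_types.append(t)
    word = "".join(t + " " for t in child_types)
    return re.fullmatch(model, word) is not None
```

Here an `ad` element is checked against `ad1` below `used` and against `ad2` below `new`, which is the distinction a DTD cannot express.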

51 XPath Query Evaluation

52 Goal Evaluating an XPath query against a given document – To find all matches We will also consider Type Checking – Given an XPath query Q, input DTD T and output DTD T′, does it hold that for every document D satisfying T, Q(D) satisfies T′? Complexity is important – Huge documents

53 Data complexity vs. combined complexity Two inputs to the query evaluation problem – Data (XML document) of size |D| – Query (XPath expression) of size |Q| – Usually |Q| << |D| Polynomial data complexity – Complexity that is polynomial in |D|, possibly exponential in |Q| Polynomial combined complexity – Complexity that is polynomial in |D| and |Q| Fixed Parameter Tractable complexity

54 XPath Query Evaluation Input: XML document D, XPath query Q Output: a subset of the nodes of D, as defined by Q We will follow "Efficient Algorithms for Processing XPath Queries" (Gottlob, Koch, Pichler, 2003)

55 Simple algorithm process-location-step(n, Q) { S := apply Q.first to n; if |Q| > 1 then for each node n′ in S do process-location-step(n′, Q.next) }

56 Complexity Worst case: in each step of Q the axis is “following” So we apply the query in each step on O(|D|) nodes And we get Time(|Q|)= |D|*Time(|Q|-1) I.e. the complexity is O(|D|^|Q|)
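
The blow-up can be observed on a tiny document. The sketch below is an illustration, not the algorithm of the cited paper: it uses the child and parent axes instead of "following", a hypothetical `Node` class, and a query of the form child::b/parent::a/child::b/… on a document a(b, b). The naive recursion revisits the same nodes again and again, while a set-based evaluation keeps one duplicate-free node list per step.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def step(node, axis, test):
    """Apply one location step (axis::test) to a single context node."""
    if axis == "child":
        candidates = node.children
    elif axis == "parent":
        candidates = [node.parent] if node.parent is not None else []
    else:
        raise ValueError(axis)
    return [c for c in candidates if test == "*" or c.label == test]

def naive(node, query, counter):
    """process-location-step: recurse on every selected node, one call each."""
    counter[0] += 1
    selected = step(node, *query[0])
    if len(query) == 1:
        return selected
    result = []
    for n in selected:
        result.extend(naive(n, query[1:], counter))
    return result  # may contain the same node many times

def set_based(nodes, query):
    """Evaluate step by step on duplicate-free node lists."""
    current = list(nodes)
    for axis, test in query:
        seen, following = set(), []
        for n in current:
            for m in step(n, axis, test):
                if id(m) not in seen:
                    seen.add(id(m))
                    following.append(m)
        current = following
    return current

def demo(k):
    """Return (naive calls, naive result length, set-based result length)."""
    doc = Node("a", [Node("b"), Node("b")])
    query = [("child", "b")] + [("parent", "a"), ("child", "b")] * k
    counter = [0]
    naive_result = naive(doc, query, counter)
    return counter[0], len(naive_result), len(set_based([doc], query))
```

In `demo(k)` the number of naive calls roughly doubles each time k grows by one, while the set-based evaluation always ends with the two b nodes.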

57 Polynomial data complexity Sometimes considered good even if exponential in the query size But can we have polynomial combined complexity? Yes!

58 XPath query parse tree descendant::b/following-sibling::*[position() != last()]

59 Bottom-up vs. top-down evaluation We will discuss two kinds of query evaluation algorithms: – Bottom-up means that the query parse tree is processed from the leaves up to the root – Top-down means that the parse tree is processed from the root to the leaves When processing we will fill in a context-value table

60 Bottom-up evaluation Main idea: compute the value for each leaf for every possible context Propagate upwards until the root Dynamic programming algorithm to avoid re-evaluation of queries in the same context

61 An equivalent semantics to XPath The domain of contexts is C = dom × { ⟨k,n⟩ | 1 ≤ k ≤ n ≤ |dom| } A context is c = ⟨x,k,n⟩ where x is the context node, k is the context position, and n is the context size

62

63 Context-value table Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v such that computing e on the context c results in v The bottom-up algorithm follows: compute the context-value tables in a bottom-up fashion with respect to the query

64 Bottom-up algorithm

65 Example

66 Complexity O(|D|^3*|Q|) space ignoring strings and numbers – O(|Q|) tables, with 3 columns, each including values in 1…|D| thus O(|D|^3*|Q|) – An extra O(|D|*|Q|) multiplicative factor for strings and numbers O(|D|^5*|Q|) time ignoring strings and numbers – It can take O(|D|^2) to combine two nodesets – Extra O(|Q|) in case of strings and numbers

67 Optimization Represent contexts as pairs of current and previous node This brings the time complexity down to O(|D|^4*|Q|^2) Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations

68 Top-down evaluation Similar idea, but computes only the values for contexts that are needed Same worst-case bounds

69 Top-down or bottom-up? General question in processing XML trees The tradeoff: – Usually easier to combine results computed in children to obtain the result at the parent So bottom-up traversal is usually easier to design – On the other hand, some of the computation is redundant since we don’t know if it will become relevant So top-down traversal may be more efficient

70 Linear-time fragment Core XPath includes only navigation – / and // Core XPath can be evaluated in O(|D|*|Q|) Observation: no need to consider the entire triple, only the current context node Top-down or bottom-up evaluation with essentially the same algorithm, but smaller tables (for every query node, all document nodes and values of evaluation)
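
The O(|D|*|Q|) bound can be illustrated for the two navigation steps / (child) and // (descendant). The sketch below is a simplified assumption-laden example, not the Core XPath algorithm of the cited paper: it handles only forward steps without predicates, uses a hypothetical `Node` class, and applies each step to the whole current node set in one pass over the document.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def child_step(selected, test):
    """Children of the selected nodes; each document node is visited at most once."""
    out = []
    for node in selected:
        out.extend(c for c in node.children if test == "*" or c.label == test)
    return out

def descendant_step(root, selected, test):
    """One traversal of the document: keep nodes that lie below a selected node."""
    marked = {id(n) for n in selected}
    out = []
    stack = [(root, False)]  # (node, does it have a selected proper ancestor?)
    while stack:
        node, below = stack.pop()
        if below and (test == "*" or node.label == test):
            out.append(node)
        below_children = below or id(node) in marked
        for child in reversed(node.children):  # reversed to keep document order
            stack.append((child, below_children))
    return out

def evaluate(root, query):
    """query is a list of (axis, test) pairs evaluated from the root node."""
    current = [root]
    for axis, test in query:
        if axis == "child":
            current = child_step(current, test)
        elif axis == "descendant":
            current = descendant_step(root, current, test)
        else:
            raise ValueError(axis)
    return current
```

Each step costs time linear in the document and there is one step per query node, which gives the O(|D|*|Q|) total. Only the current node set is kept between steps; positions and sizes of contexts are never needed.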

