Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

Similar presentations


Presentation on theme: "A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto."— Presentation transcript:

1 A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto

2 2 Motivating Example courses course info @cno student @snonamegrade name@sno studentname@sno... 123 A+B+ cs100cs225 Fox 123Fox

3 3 Our Goal courses course* course @cno, student* student @sno, name, grade name S grade S DTD:Integrity Constraints:, info* info @sno, name two students with the same @sno value must have the same name. @sno is the key of info.

4 4 A Non-relational Example DBLP conf titleissue article @yeartitle @year ICDT @year author@yeartitleauthor 1999 Dong2001Jarke 2001...

5 5 Problems to Address Functional dependencies for XML. Normal form for XML documents (XNF). Generalizes BCNF. Algorithm for normalizing XML documents. Implication problem for functional dependencies.

6 6 Framework: DTDs We do not consider mixed content, IDs and IDREFs. Paths(D): all paths in a DTD D courses.course courses.course.@cno courses.course.student.name courses.course.student.name.S We distinguish three kinds of elements: attributes (@), strings (S) and element types.

7 7 Framework: XML Trees v1v1 v2v2 v3v3 v4v4 v5v5 v6v6 v7v7 v0v0... courses course @cno cs100 @sno name grade@sno name grade student 123456 FoxB+Smith A- SSSS

8 8 Towards FDs for XML We know how to handle FDs in relational DBs. We need a relational representation of XML. We do it by considering tree tuples: a tree tuple in a DTD D is a mapping t : Paths(D) Vertices Strings { } consistent with D: could be extracted from an XML tree conforming to D.

9 9 Tree Tuples v1v1 v2v2 v0v0 courses course @cnostudent cs100 t(courses) = v 0 t(courses.course) = v 1 t(courses.course.@cno) = cs100 t(courses.course.student) = v 2 t(p) =, for the remaining paths A tree tuple represents an XML tree:

10 10 XML Tree: set of Tree Tuples v1v1 v2v2 v3v3 v4v4 v5v5 v6v6 v7v7 v0v0... courses course @cno cs100 @sno name grade@sno name grade student 123456 FoxB+Smith A- SSSS v1v1 v2v2 courses course @cno cs100 student v0v0 v3v3 v4v4 @sno name grade 123 FoxB+ SS v5v5 v6v6 v7v7 @sno name grade student 456 Smith A- SS... course

11 11 Functional Dependencies Expressions of the form: X Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D). XML tree T can be tested for satisfaction of X Y if: X Y Paths(T) Paths(D) T X Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X implies u.Y = v.Y

12 12 FD: Examples University DTD:courses course* course @cno, student* student @sno, name, grade Two students with the same @sno value must have the same name: courses.course.student.@sno courses.course.student.name.S Every student can have at most one grade in every course: { courses.course, courses.course.student.@sno } courses.course.student.grade.S

13 13 Checking FD Satisfaction v1v1 v2v2 v3v3 v4v4 v6v6 v7v7 v8v8 v0v0 courses course @cno cs100 @sno name grade@sno name grade student 123 FoxB+FoxA+ SSSS v5v5 @cno cs225 student v1v1 v2v2 v3v3 v4v4 v0v0 courses course @cno cs100 @sno name grade student 123 FoxB+ SS v6v6 v7v7 v8v8 course @sno name grade 123 FoxA+ SS v5v5 @cno cs225 student { courses.course, courses.course.student.@sno } courses.course.student.grade.S

14 14 Checking FD Satisfaction v1v1 v2v2 v3v3 v4v4 v5v5 v6v6 v7v7 v0v0 courses course @cno cs100 @sno name grade@sno name grade student 123 FoxB+FoxA+ SSSS student v1v1 v2v2 v3v3 v4v4 v0v0 courses course @cno cs100 @sno name grade student 123 FoxB+ SS v5v5 v6v6 v7v7 @sno name grade 123 FoxA+ SS student { courses.course, courses.course.student.@sno } courses.course.student.grade.S

15 15 Implication Problem for FD Given a DTD D and a set of functional dependencies { }: (D, ) if for any XML tree T conforming to D and satisfying, it is the case that T (D, ) + = { | (D, ) } Functional dependency is trivial if it is implied by the DTD alone: (D, )

16 16 XNF: XML Normal Form XML specification: a DTD D and a set of functional dependencies. A Relational DB is in BCNF if for every non- trivial functional dependency X Y in the specification, X is a key. (D, ) is in XNF if: For each non-trivial FD X p.@l or X p.S in (D, ) +, X p is in (D, ) +.

17 17 Back to DBLP DBLP is not in XNF: DBLP.conf.issue DBLP.conf.issue.article.@year (D, ) + DBLP.conf.issue DBLP.conf.issue.article (D, ) + Proposed solution is in XNF.

18 18 Normalization Algorithm The algorithm applies two transformations until the schema is in XNF. If there is an anomalous FD of the form: DBLP.conf.issue DBLP.conf.issue.article.@year then apply the DBLP example rule. Otherwise: choose a minimal anomalous FD and apply the University example rule.

19 19 Normalizing XML Documents Theorem The decomposition algorithm terminates and outputs a specification in XNF. It does not lose information: UnnormalizedNormalized XML documentXML Document Q 1, Q 2 are XQuery core queries. It works even if we cannot compute (D, ) +. If we know (D, ) +, the output is better. Q1Q1 Q2Q2

20 20 Implication Problem (contd) Typically, regular expressions used in DTDs are rather simple. Trivial regular expression: s 1, s 2, …, s n Each s i is either one of a i, a i ?, a i + or a i * For i j, a i a j D is a simple DTD if all its productions use permutations of trivial regular expressions. Example: (a | b)* is a permutation of a*, b*

21 21 Implication Problem (contd) Theorem For simple DTDs: The implication problem for FDs is solvable in O(n 2 ). Testing if a specification is in XNF can be done in O(n 3 ). Other results: There is a larger class of DTDs for which these problems are tractable. There is an even larger class of DTDs for which they are coNP-complete.

22 22 Final Comments The normalization algorithm can be improved in various ways. Why XNF? Theorem XNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents. A complete classification of the complexity of the implication problem for FDs remains open. Implementation.

23 Backup Slides

24 24 Normalization Algorithm We consider FDs of the form: {q, p 1.@l 1, …, p n.@l n } p where n 0 and q ends with an element type. The algorithm applies two transformations until the schema is in XNF. If there is an anomalous FD with n = 0: DBLP.conf.issue DBLP.conf.issue.article.@year then apply the DBLP example rule. Otherwise: choose a minimal anomalous FD and apply the University example rule.


Download ppt "A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto."

Similar presentations


Ads by Google