A Normal Form for XML Documents

Marcelo Arenas, Leonid Libkin
Department of Computer Science, University of Toronto

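Motivating Example
[Figure: XML tree rooted at courses. Each course has a @cno attribute (“cs100”, . . ., “cs225”) and student children; each student has @sno, name and grade. Student “123” (“Fox”) appears under two courses, with grades “B+” and “A+”. The figure also shows an info child of courses holding @sno “123” and name “Fox”.]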

Our Goal
DTD:
courses → course*
course → student*
student → name, grade
name → S
grade → S
Integrity constraint: two students with the same @sno value must have the same name.
Restructured DTD: courses → course*, info* and info → name, where @sno is the key of info.

A “Non-relational” Example
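[Figure: XML tree rooted at DBLP with conf children. A conf has a title (“ICDT”) and issue children; each issue has article children; each article has author, title and @year. The two articles of the first issue (authors “Dong” and “Jarke”) both have @year “1999”; the article of the second issue has @year “2001”. The figure also shows @year attached directly to each issue.]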

Problems to Address
Functional dependencies for XML.
Normal form for XML documents (XNF). Generalizes BCNF.
Algorithm for normalizing XML documents.
Implication problem for functional dependencies.

Framework: DTDs
We do not consider mixed content, IDs and IDREFs.
Paths(D): all paths in a DTD D, for example:
courses.course
courses.course.student.name
courses.course.student.name.S
We distinguish three kinds of elements: attributes, strings (S) and element types.

Framework: XML Trees
[Figure: XML tree with vertices v0 to v7. v0 (courses) has a course child v1 with @cno “cs100” and two student children: v2 (@sno “123”, name v3 with S “Fox”, grade v4 with S “B+”) and v5 (@sno “456”, name v6 with S “Smith”, grade v7 with S “A-”).]

Towards FDs for XML
We know how to handle FDs in relational DBs.
We need a relational representation of XML. We do it by considering tree tuples: a tree tuple in a DTD D is a mapping
t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
that is consistent with D, that is, one that could be extracted from an XML tree conforming to D.

Tree tuple example
[Figure: one tree tuple of the XML tree from the previous slide: courses is v0, course is v1 with @cno “cs100”, student is v2 with @sno “123”, name is v3 with S “Fox”, grade is v4 with S “B+”.]

XML Tree: set of Tree Tuples
[Figure: the XML tree with course v1 (“cs100”) and students v2 (“123”, “Fox”, “B+”) and v5 (“456”, “Smith”, “A-”), shown as a set of tree tuples.]
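The view of an XML tree as a set of tree tuples can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `student` helper and the dict representation (a path missing from the dict stands for the null value) are hypothetical, and vertex identifiers are plain strings.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional


@dataclass
class Node:
    vid: str                                    # vertex identifier, e.g. "v1"
    label: str                                  # element type
    attrs: dict = field(default_factory=dict)   # "@name" -> string value
    text: Optional[str] = None                  # string content (S), if any
    children: list = field(default_factory=list)


def tree_tuples(node, prefix=""):
    """Return the maximal tree tuples of the subtree rooted at node."""
    path = f"{prefix}.{node.label}" if prefix else node.label
    base = {path: node.vid}
    for name, value in node.attrs.items():
        base[f"{path}.{name}"] = value
    if node.text is not None:
        base[f"{path}.S"] = node.text

    # A tuple picks at most one child per element type, so collect the
    # alternatives for each type and combine one choice per type.
    by_type = {}
    for child in node.children:
        by_type.setdefault(child.label, []).extend(tree_tuples(child, path))

    result = []
    for choice in product(*by_type.values()):
        t = dict(base)
        for part in choice:
            t.update(part)
        result.append(t)
    return result


def student(vid, sno, name_vid, name, grade_vid, grade):
    """Hypothetical helper that builds one student subtree."""
    return Node(vid, "student", {"@sno": sno}, children=[
        Node(name_vid, "name", text=name),
        Node(grade_vid, "grade", text=grade),
    ])


tree = Node("v0", "courses", children=[
    Node("v1", "course", {"@cno": "cs100"}, children=[
        student("v2", "123", "v3", "Fox", "v4", "B+"),
        student("v5", "456", "v6", "Smith", "v7", "A-"),
    ]),
])

tuples = tree_tuples(tree)
```

For this tree the sketch yields two tuples, one per student, which share the values for courses, courses.course and courses.course.@cno.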

Functional Dependencies
Expressions of the form X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
An XML tree T can be tested for satisfaction of X → Y if X ∪ Y ⊆ Paths(T) ⊆ Paths(D).
T |= X → Y (T satisfies X → Y) if for every pair u, v of tree tuples in T: u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y.
Node equality = equality between vertices; value equality = equality between strings.
We say T is compatible with D (T ◄ D) iff Paths(T) is included in Paths(D).
We say T conforms to D: see Definition 3 in the Libkin paper, where the symbol is |=>.
We say T satisfies X → Y: the Libkin paper writes this with |=>; here it is written |=.

FD: Examples
University DTD:
courses → course*
course → @cno, student*
student → name, grade
Two students with the same @sno value must have the same name:
courses.course.student.@sno → courses.course.student.name.S
Every student can have at most one grade in every course:
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: two tree tuples of a tree in which student “123” (“Fox”) appears under course v1 (“cs100”) with grade “B+” and under course v5 (“cs225”) with grade “A+”. The two tuples differ on courses.course, so they do not violate the FD.]

Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: two tree tuples of a tree in which course v1 (“cs100”) has two student children, v2 and v5, both with @sno “123” and name “Fox” but with grades “B+” and “A+”. The two tuples agree on courses.course and on @sno but differ on grade.S, so the FD is violated.]
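The satisfaction test in the two slides above can be sketched directly from the definition. The dict representation of tree tuples, the `enrolment` helper and the use of the student's @sno as the second path on the left-hand side are assumptions of this sketch; only the paths mentioned by the FD are stored.

```python
BOTTOM = None  # stands for the null value


def satisfies(tuples, lhs, rhs):
    """Check lhs -> rhs on a list of tree tuples (dicts from path to value)."""
    for i, u in enumerate(tuples):
        for v in tuples[i + 1:]:
            ux = [u.get(p, BOTTOM) for p in lhs]
            vx = [v.get(p, BOTTOM) for p in lhs]
            # The premise needs u.X = v.X with no null component.
            if BOTTOM in ux or ux != vx:
                continue
            if any(u.get(p, BOTTOM) != v.get(p, BOTTOM) for p in rhs):
                return False
    return True


def enrolment(course_vid, student_vid, grade):
    """Hypothetical helper: the part of a tree tuple that the FD mentions."""
    return {
        "courses.course": course_vid,
        "courses.course.student": student_vid,
        "courses.course.student.@sno": "123",
        "courses.course.student.grade.S": grade,
    }


LHS = ["courses.course", "courses.course.student.@sno"]
RHS = ["courses.course.student.grade.S"]

# Student "123" in two different courses (vertices v1 and v5).
two_courses = [enrolment("v1", "v2", "B+"), enrolment("v5", "v6", "A+")]
# Student "123" listed twice in the same course (vertex v1).
one_course = [enrolment("v1", "v2", "B+"), enrolment("v1", "v5", "A+")]

ok_two_courses = satisfies(two_courses, LHS, RHS)
ok_one_course = satisfies(one_course, LHS, RHS)
```

Here `ok_two_courses` is `True` and `ok_one_course` is `False`. Comparing every pair of tuples is quadratic in the number of tuples; grouping tuples by their left-hand-side values would be the usual refinement.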

Implication Problem for FD
Given a DTD D and a set of functional dependencies Σ ∪ {φ}:
(D, Σ) |- φ (implies) if for any XML tree T conforming to D and satisfying Σ, it is the case that T |= φ.
(D, Σ)+ = { φ | (D, Σ) |- φ }
A functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) |- φ.
Notation: we read (D, Σ) |- φ as “implies”, meaning that T satisfies φ; the Libkin paper's symbol is |=>, here it is |-.

XNF: XML Normal Form
XML specification: a DTD D and a set of functional dependencies Σ.
A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
(D, Σ) is in XNF if: for each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.

Back to DBLP
DBLP is not in XNF:
DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D, Σ)+
DBLP.conf.issue → DBLP.conf.issue.article ∉ (D, Σ)+
Proposed solution is in XNF.

Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD of the form DBLP.conf.issue → DBLP.conf.issue.article.@year, then apply the “DBLP example rule”.
Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Normalizing XML Documents
Theorem: The decomposition algorithm terminates and outputs a specification in XNF.
It does not lose information.
[Diagram: unnormalized XML document and normalized XML document, connected by queries Q1 and Q2.]
Q1, Q2 are XQuery core queries.
It works even if we cannot compute (D, Σ)+. If we know (D, Σ)+, the output is better.

Implication Problem (cont’d)
Typically, regular expressions used in DTDs are rather simple.
Trivial regular expression: s1, s2, …, sn
Each si is one of ai, ai?, ai+ or ai*
For i ≠ j, ai ≠ aj
D is a simple DTD if all its productions use “permutations” of trivial regular expressions.
Example: (a | b)* is a permutation of a*, b*
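A small sketch of the syntactic test for trivial regular expressions, assuming productions are written as comma-separated strings such as "name, grade". The function name `is_trivial` and the string encoding are hypothetical, and the sketch does not attempt the “permutation” check needed for simple DTDs.

```python
import re


def is_trivial(expr):
    """Check that expr has the form s1, ..., sn where each si is one of
    a, a?, a+ or a*, and no element type occurs twice."""
    names = []
    for item in expr.split(","):
        m = re.fullmatch(r"(\w+)([?+*]?)", item.strip())
        if m is None:
            return False          # not a bare element type with one modifier
        names.append(m.group(1))
    return len(names) == len(set(names))


checks = {
    "name, grade": is_trivial("name, grade"),
    "a*, b*": is_trivial("a*, b*"),
    "a, a*": is_trivial("a, a*"),
    "(a | b)*": is_trivial("(a | b)*"),
}
```

The first two expressions pass the test. The third repeats the element type a, and the last is not itself trivial, although it is a permutation of the trivial expression a*, b*.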

Implication Problem (cont’d)
Theorem: For simple DTDs:
The implication problem for FDs is solvable in O(n²).
Testing if a specification is in XNF can be done in O(n³).
Other results:
There is a larger class of DTDs for which these problems are tractable.
There is an even larger class of DTDs for which they are coNP-complete.

Final Comments
The normalization algorithm can be improved in various ways.
Why XNF? Theorem: XNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents.
A complete classification of the complexity of the implication problem for FDs remains open.
Implementation.

Backup Slides

Normalization Algorithm
We consider FDs of the form {q, p1.@l1, …, pn.@ln} → p, where n ≥ 0 and q ends with an element type.
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD with n = 0, such as DBLP.conf.issue → DBLP.conf.issue.article.@year, then apply the “DBLP example rule”.
Otherwise: choose a minimal anomalous FD and apply the “University example rule”.
