A Normal Form for XML Documents

A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.
Presentation transcript:

A Normal Form for XML Documents
Marcelo Arenas, Leonid Libkin
Department of Computer Science, University of Toronto

Motivating Example
[Figure: XML tree rooted at courses. The course elements (@cno "cs100", ..., "cs225") each contain a student with @sno "123" and name "Fox"; the grades are "B+" and "A+". An info element with @sno and name also appears in the figure.]

Our Goal
DTD:
  courses → course*
  course → @cno, student*
  student → @sno, name, grade
  name → S
  grade → S
Integrity constraint: two students with the same @sno value must have the same name.
Changes shown on the slide for the redesigned schema:
  courses → course*, info*
  info → @sno, name
  @sno is the key of info.

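A “Non-relational” Example
[Figure: DBLP tree. DBLP has conf children; a conf has a title ("ICDT") and issue children; each issue has article children; each article has a title, a @year attribute and, where shown, an author. The two articles of the first issue ("Dong", "Jarke") both carry @year "1999"; the article of the second issue carries @year "2001". The figure also shows @year ("1999", "2001") at the issue level.]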

Problems to Address
Functional dependencies for XML.
Normal form for XML documents (XNF), which generalizes BCNF.
Algorithm for normalizing XML documents.
Implication problem for functional dependencies.

Framework: DTDs
We do not consider mixed content, IDs and IDREFs.
Paths(D): all paths in a DTD D, for example:
  courses.course
  courses.course.@cno
  courses.course.student.name
  courses.course.student.name.S
We distinguish three kinds of elements: attributes (@), strings (S) and element types.
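To make Paths(D) concrete, here is a minimal Python sketch that enumerates the paths of the university DTD used in these slides. The dictionary encoding of the DTD and the names DTD, paths and all_paths are assumptions made for illustration; the sketch also assumes a non-recursive DTD, so the enumeration is finite.

```python
# Hypothetical encoding of the university DTD: for each element type,
# its child element types, its attributes, and whether it holds a string (S).
DTD = {
    "courses": {"children": ["course"], "attrs": [], "text": False},
    "course": {"children": ["student"], "attrs": ["@cno"], "text": False},
    "student": {"children": ["name", "grade"], "attrs": ["@sno"], "text": False},
    "name": {"children": [], "attrs": [], "text": True},
    "grade": {"children": [], "attrs": [], "text": True},
}


def paths(dtd, root):
    """Enumerate Paths(D) depth-first for a non-recursive DTD."""
    result = []

    def walk(prefix, element):
        path = f"{prefix}.{element}" if prefix else element
        result.append(path)                  # path ending in an element type
        rule = dtd[element]
        for attr in rule["attrs"]:
            result.append(f"{path}.{attr}")  # path ending in an attribute
        if rule["text"]:
            result.append(f"{path}.S")       # path ending in a string
        for child in rule["children"]:
            walk(path, child)

    walk("", root)
    return result


all_paths = paths(DTD, "courses")
for p in all_paths:
    print(p)
```

The three branches in walk correspond to the three kinds of path endings named on the slide: element types, attributes (@) and strings (S).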

Framework: XML Trees
[Figure: XML tree with vertices v0 to v7. v0 is the courses root; v1 is a course with @cno "cs100"; v2 is a student with @sno "123", name v3 (S = "Fox") and grade v4 (S = "B+"); v5 is a student with @sno "456", name v6 (S = "Smith") and grade v7 (S = "A-"). Further course elements are elided.]

Towards FDs for XML
We know how to handle FDs in relational DBs, so we need a relational representation of XML.
We obtain it by considering tree tuples: a tree tuple in a DTD D is a mapping
  t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
that is consistent with D, i.e. it could be extracted from an XML tree conforming to D.

Tree tuple example
[Figure: one tree tuple: courses = v0, course = v1, @cno = "cs100", student = v2, @sno = "123", name = v3 with S = "Fox", grade = v4 with S = "B+".]

XML Tree: set of Tree Tuples
[Figure: the XML tree from the "Framework: XML Trees" slide viewed as the set of its tree tuples: one tuple through student v2 ("123", "Fox", "B+") and one through student v5 ("456", "Smith", "A-").]
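The view of an XML tree as a set of tree tuples can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the node encoding, the helper names node and tree_tuples, and the choice to represent ⊥ by simply omitting a path from the dictionary are all assumptions. The example tree contains only the one course shown on the "Framework: XML Trees" slide.

```python
from itertools import product


def node(vid, label, attrs=None, text=None, children=()):
    """Hypothetical node record: vertex id, element type, attributes, string, children."""
    return {"id": vid, "label": label, "attrs": attrs or {},
            "text": text, "children": list(children)}


def tree_tuples(n, prefix=""):
    """Return the tree tuples of the subtree rooted at n as dicts path -> vertex or string.

    A tuple picks at most one child per element type; paths that are
    missing from a dict stand for the null value.
    """
    path = f"{prefix}.{n['label']}" if prefix else n["label"]
    base = {path: n["id"]}
    for attr, value in n["attrs"].items():
        base[f"{path}.{attr}"] = value
    if n["text"] is not None:
        base[f"{path}.S"] = n["text"]

    # Group the partial tuples of the children by element type.
    by_label = {}
    for child in n["children"]:
        by_label.setdefault(child["label"], []).extend(tree_tuples(child, path))

    # Choose one partial tuple per element type and merge the choices.
    results = []
    for combo in product(*by_label.values()):
        merged = dict(base)
        for part in combo:
            merged.update(part)
        results.append(merged)
    return results


tree = node("v0", "courses", children=[
    node("v1", "course", {"@cno": "cs100"}, children=[
        node("v2", "student", {"@sno": "123"}, children=[
            node("v3", "name", text="Fox"), node("v4", "grade", text="B+")]),
        node("v5", "student", {"@sno": "456"}, children=[
            node("v6", "name", text="Smith"), node("v7", "grade", text="A-")]),
    ]),
])

tuples = tree_tuples(tree)
for t in tuples:
    print(t)
```

Each student yields one tuple here because a tuple may contain only one vertex per path, so the two students of course v1 cannot share a tuple.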

Functional Dependencies
Expressions of the form X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
An XML tree T can be tested for satisfaction of X → Y if X ∪ Y ⊆ Paths(T) ⊆ Paths(D).
T |= X → Y (T satisfies X → Y) if for every pair u, v of tree tuples in T:
  u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
Here node equality is equality between vertices, and value equality is equality between strings.
Notes:
  T is compatible with D (T ◄ D) iff Paths(T) is included in Paths(D).
  T conforms to D: see Definition 3 in the Libkin paper, where the symbol is |=>.
  T satisfies X → Y: the symbol in the Libkin paper is |=>; here it is |=.

FD: Examples
University DTD:
  courses → course*
  course → @cno, student*
  student → @sno, name, grade
Two students with the same @sno value must have the same name:
  courses.course.student.@sno → courses.course.student.name.S
Every student can have at most one grade in every course:
  { courses.course, courses.course.student.@sno } → courses.course.student.grade.S

Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: tree with two course elements. Course v1 (@cno "cs100") has student v2 with @sno "123", name v3 ("Fox") and grade v4 ("B+"); course v5 (@cno "cs225") has student v6 with @sno "123", name v7 ("Fox") and grade v8 ("A+"). The two tree tuples differ on courses.course, so the FD is satisfied.]

Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: tree with a single course v1 (@cno "cs100") that has two students: v2 with @sno "123", name v3 ("Fox") and grade v4 ("B+"), and v5 with @sno "123", name v6 ("Fox") and grade v7 ("A+"). The two tree tuples agree on courses.course and on @sno but differ on the grade, so the FD is violated.]
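The two checks on the "Checking FD Satisfaction" slides can be replayed with a short Python sketch of the satisfaction test (u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y). The tree tuples are written out by hand as dictionaries from paths to vertices or strings; that encoding, the use of None for ⊥, and the names satisfies, two_courses and one_course are assumptions made for illustration.

```python
BOTTOM = None  # stands for the null value of a tree tuple

COURSE = "courses.course"
SNO = "courses.course.student.@sno"
NAME = "courses.course.student.name.S"
GRADE = "courses.course.student.grade.S"


def satisfies(tuples, lhs, rhs):
    """Check T |= X -> Y over the tree tuples of T (lhs = X, rhs = Y)."""
    for u in tuples:
        ux = [u.get(p, BOTTOM) for p in lhs]
        if BOTTOM in ux:
            continue  # the definition only constrains tuples with u.X non-null
        for v in tuples:
            same_x = [v.get(p, BOTTOM) for p in lhs] == ux
            same_y = all(u.get(p, BOTTOM) == v.get(p, BOTTOM) for p in rhs)
            if same_x and not same_y:
                return False
    return True


# First slide: student "123" appears once in each of two different courses.
two_courses = [
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "B+"},
    {COURSE: "v5", SNO: "123", NAME: "Fox", GRADE: "A+"},
]

# Second slide: student "123" appears twice in the same course.
one_course = [
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "B+"},
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "A+"},
]

print(satisfies(two_courses, [COURSE, SNO], [GRADE]))  # FD holds
print(satisfies(one_course, [COURSE, SNO], [GRADE]))   # FD is violated
print(satisfies(one_course, [SNO], [NAME]))            # the name FD still holds
```

Comparing courses.course uses vertex identifiers (node equality), while @sno and the grade are compared as strings (value equality), matching the distinction drawn in the definition.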

Implication Problem for FD
Given a DTD D and a set of functional dependencies Σ ∪ {φ}:
  (D, Σ) |- φ ((D, Σ) implies φ) if for any XML tree T conforming to D and satisfying Σ, it is the case that T |= φ.
  (D, Σ)+ = { φ | (D, Σ) |- φ }
A functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) |- φ.
Note: (D, Σ) |- φ is read as "implies that T satisfies φ". For "satisfies", the symbol in the Libkin paper is |=>; here it is |-.

XNF: XML Normal Form
XML specification: a DTD D and a set of functional dependencies Σ.
A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
(D, Σ) is in XNF if for each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, the FD X → p is also in (D, Σ)+.
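The XNF condition can be illustrated with a small Python sketch. It does not compute (D, Σ)+; instead it assumes a hand-picked finite fragment of the closure is supplied as a set of (X, path) pairs, together with the FDs to be treated as trivial. The function names and both example fragments are hypothetical and serve only to show the shape of the test.

```python
def parent_path(path):
    """Drop the last step of a path: p.@l -> p and p.S -> p."""
    return path.rsplit(".", 1)[0]


def is_value_path(path):
    """True for paths ending in an attribute (@l) or a string (S)."""
    last = path.rsplit(".", 1)[-1]
    return last.startswith("@") or last == "S"


def xnf_violations(closure, trivial):
    """List the FDs X -> p.@l or X -> p.S in `closure` for which X -> p is missing.

    closure: set of (frozenset of paths, path) pairs, a stand-in for (D, Sigma)+.
    trivial: the subset of `closure` that is implied by the DTD alone.
    """
    violations = []
    for lhs, rhs in sorted(closure, key=lambda fd: (sorted(fd[0]), fd[1])):
        if (lhs, rhs) in trivial or not is_value_path(rhs):
            continue
        if (lhs, parent_path(rhs)) not in closure:
            violations.append((lhs, rhs))
    return violations


ISSUE = "DBLP.conf.issue"

# Fragment for the original DBLP design: the issue determines the year of its
# articles, but it does not determine a single article vertex.
original = {(frozenset({ISSUE}), ISSUE + ".article.@year")}

# Fragment for a redesign in which @year is attached to the issue itself.
redesigned = {
    (frozenset({ISSUE}), ISSUE + ".@year"),
    (frozenset({ISSUE}), ISSUE),
}

violations = xnf_violations(original, trivial=set())
print(violations)
print(xnf_violations(redesigned, trivial=set()))
```

The analogy with BCNF is visible in the test: X → p plays the role of "X is a key", since it says that X determines the vertex that holds the value.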

Back to DBLP
DBLP is not in XNF:
  DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D, Σ)+
  DBLP.conf.issue → DBLP.conf.issue.article ∉ (D, Σ)+
The proposed solution is in XNF.

Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD of the form
  DBLP.conf.issue → DBLP.conf.issue.article.@year
then apply the “DBLP example rule”.
Otherwise, choose a minimal anomalous FD and apply the “University example rule”.
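The slides name the “DBLP example rule” without spelling it out. One plausible reading, suggested by the DBLP figure where @year is repeated on every article of an issue, is that the attribute is moved from article up to issue. The Python sketch below shows only that schema-level step under this assumption; the DTD encoding and the names move_attribute, dblp and normalized are hypothetical.

```python
import copy


def move_attribute(dtd, attr, source, target):
    """Return a copy of the DTD with `attr` removed from `source` and added to `target`."""
    new_dtd = copy.deepcopy(dtd)  # leave the input schema untouched
    new_dtd[source]["attrs"].remove(attr)
    new_dtd[target]["attrs"].append(attr)
    return new_dtd


# Assumed shape of the DBLP DTD, read off the figure in the slides.
dblp = {
    "DBLP": {"children": ["conf"], "attrs": []},
    "conf": {"children": ["title", "issue"], "attrs": []},
    "issue": {"children": ["article"], "attrs": []},
    "article": {"children": ["author", "title"], "attrs": ["@year"]},
}

# Anomalous FD: DBLP.conf.issue -> DBLP.conf.issue.article.@year
normalized = move_attribute(dblp, "@year", source="article", target="issue")
print(normalized["issue"], normalized["article"])
```

A full treatment would also have to rewrite the set of functional dependencies and the documents themselves; this sketch covers the DTD only.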

Normalizing XML Documents
Theorem: The decomposition algorithm terminates and outputs a specification in XNF.
It does not lose information: XQuery core queries Q1 and Q2 relate the unnormalized XML document and the normalized XML document.
It works even if we cannot compute (D, Σ)+. If we know (D, Σ)+, the output is better.

Implication Problem (cont’d)
Typically, regular expressions used in DTDs are rather simple.
Trivial regular expression: s_1, s_2, …, s_n, where
  each s_i is one of a_i, a_i?, a_i+ or a_i*;
  for i ≠ j, a_i ≠ a_j.
D is a simple DTD if all its productions use “permutations” of trivial regular expressions.
Example: (a | b)* is a permutation of a*, b*.
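The definition of a trivial regular expression is syntactic, so it can be checked directly. The Python sketch below tests whether a content model, written as a comma-separated string, has the form s_1, ..., s_n with distinct names; the function name is_trivial_regex and the string format are assumptions. It deliberately does not decide whether an expression is a “permutation” of a trivial one, since the slides do not define that notion precisely.

```python
import re

# One item: a name (optionally an attribute such as @sno), then at most one of ? + *
ITEM = re.compile(r"\s*(@?\w+)([?+*]?)\s*")


def is_trivial_regex(expr):
    """True if expr is s1, ..., sn where each si is a, a?, a+ or a* and all names differ."""
    seen = set()
    for part in expr.split(","):
        match = ITEM.fullmatch(part)
        if match is None:
            return False          # not of the form a, a?, a+ or a*
        name = match.group(1)
        if name in seen:
            return False          # violates a_i != a_j for i != j
        seen.add(name)
    return True


print(is_trivial_regex("@sno, name, grade"))  # production for student
print(is_trivial_regex("@cno, student*"))     # production for course
print(is_trivial_regex("a, a*"))              # repeated name
print(is_trivial_regex("(a | b)*"))           # only a permutation of a*, b*
```

All productions of the university DTD pass this check, which is consistent with the slide's remark that regular expressions used in DTDs are typically simple.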

Implication Problem (cont’d)
Theorem: For simple DTDs:
  The implication problem for FDs is solvable in O(n^2).
  Testing if a specification is in XNF can be done in O(n^3).
Other results:
  There is a larger class of DTDs for which these problems are tractable.
  There is an even larger class of DTDs for which they are coNP-complete.

Final Comments
The normalization algorithm can be improved in various ways.
Why XNF? Theorem: XNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents.
A complete classification of the complexity of the implication problem for FDs remains open.
Implementation.

Backup Slides

Normalization Algorithm
We consider FDs of the form
  { q, p_1.@l_1, …, p_n.@l_n } → p
where n >= 0 and q ends with an element type.
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD with n = 0, such as
  DBLP.conf.issue → DBLP.conf.issue.article.@year
then apply the “DBLP example rule”.
Otherwise, choose a minimal anomalous FD and apply the “University example rule”.