Well-designed XML Data. Marcelo Arenas and Leonid Libkin, University of Toronto.

Outline
Part 1 - Database Normalization from the 1970s and 1980s
Part 2 - Classical theory revisited: normalizing XML documents
Part 3 - Classical theory re-done: new justifications for normalization

Part 1: Classical Normalization Design: decide how to represent the information in a particular data model. Even for simple application domains there is a large number of ways of representing the data of interest. We have to design the schema of the database. Set of relations. Set of attributes for each relation. Set of data dependencies.

Designing a Database: An Example
Attributes: number, title, section, room.
Data dependency: every course number is associated with only one title.
Relational schema:
BAD alternative: R(number, title, section, room), number → title
GOOD alternative: S(number, title), number → title, and T(number, section, room), ∅

Problems with BAD: Update Anomaly
Title of CSC258 is changed to Computer Organization I.

Before:
number  title                    section  room
CSC258  Computer Organization    1        LP266
CSC258  Computer Organization    2        GB258
CSC258  Computer Organization    3        GB248
CSC434  Database Systems         1        GB248

After:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248

The instance stores redundant information.

Deletion Anomaly
CSC434 is not given in this term.

Before:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248

After:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248

Additional effect: all the information about CSC434 was deleted.

Insertion Anomaly
A new course is created: (CSC336, Numerical Methods)

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC336  Numerical Methods        ?        ?

The instance stores attributes that are not directly related.

Avoiding Update Anomalies
Initial instance of the GOOD schema:

S:
number  title
CSC258  Computer Organization
CSC434  Database Systems

T:
number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
CSC434  1        GB248

Title of CSC258 is changed to Computer Organization I. The instance does not store redundant information.
CSC434 is not given in this term. The title of CSC434 is not removed from the instance.
A new course is created: (CSC336, Numerical Methods). No information about sections has to be provided.

Instance after the three updates:

S:
number  title
CSC258  Computer Organization I
CSC434  Database Systems
CSC336  Numerical Methods

T:
number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248

Each relation stores attributes that are directly related.

Normalization Theory Main idea: a normal form defines a condition that a well designed database should satisfy. Normal form: syntactic condition on the database schema. Defined for a class of data dependencies. Main problems: How to test whether a database schema is in a particular normal form. How to transform a database schema into an equivalent one satisfying a particular normal form.

BCNF: a Normal Form for FDs
Functional dependency (FD) over R(A_1, …, A_n): X → Y, where X, Y ⊆ {A_1, …, A_n}.
X → Y: two rows with the same X-values must have the same Y-values.
number → title: two rows with the same course number must have the same title.
Key dependency: X → A_1 ⋯ A_n. X is a key: two distinct rows must have distinct X-values.

BCNF: a Normal Form for FDs
Σ is a set of FDs over R(A_1, …, A_n).
Relation schema (R(A_1, …, A_n), Σ) is in BCNF if for every X → Y in Σ, X is a key.
A relational schema is in BCNF if every relation schema is in BCNF.
Not in BCNF: R(number, title, section, room), number → title
In BCNF: S(number, title), number → title, and T(number, section, room), ∅
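The BCNF condition can be checked mechanically with attribute closures. The following is a minimal sketch, not taken from the slides: the function names `closure` and `is_bcnf` are illustrative, FDs are represented as pairs of attribute sets, and trivial FDs (where Y ⊆ X) are skipped, following the "non-trivial" wording used in the backup slides.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_bcnf(attributes, fds):
    """True if the left-hand side of every non-trivial FD is a key."""
    everything = set(attributes)
    return all(
        closure(lhs, fds) == everything
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs)  # skip trivial FDs (assumption, see lead-in)
    )


number_title = [({"number"}, {"title"})]
print(is_bcnf({"number", "title", "section", "room"}, number_title))  # R
print(is_bcnf({"number", "title"}, number_title))                     # S
print(is_bcnf({"number", "section", "room"}, []))                     # T
```

Under these assumptions R is rejected because the closure of {number} is only {number, title}, while S and T are accepted.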

Normalization Theory Today Normalization theory for relational databases was developed in the 70s and 80s. Why do we need normalization theory today? New data models have emerged: XML. XML documents can contain redundant information. Redundant information in XML documents: Can be discovered if the user provides semantic information. Can be eliminated.

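XML Documents
[Figure: XML tree with root courses. Each course element has an attribute @cno (values "CSC258", "CSC434") and a taken_by child containing student elements with attributes @sno, @name, @grade (sample values "st1", "Fox", "A+", "B+").]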

XML Databases
XML schema: (D, Σ)
D:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @name, @grade
student → ε
Σ: Two students with the same @sno value must have the same name.

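Redundancy in XML
[Figure: XML tree for the courses document, with element labels courses, course, taken_by, student, info, attributes @cno, @sno, @name, @grade, and values "CSC258", "CSC434", "st1", "Fox", "A+", "B+". The pair "st1", "Fox" occurs more than once.]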

XML Database Normalization
DTD:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @name, @grade
student → ε
Data dependency: Two students with the same @sno value must have the same name.

XML Database Normalization
DTD:
courses → course*, info*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @grade
student → ε
info → @sno, @name
@sno is the identifier of info elements.
Data dependency: Two students with the same @sno value must have the same name.
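The transformation between the two DTDs can be illustrated on a small document. The sketch below is only an illustration using Python's standard xml.etree.ElementTree module. The sample document and the function name `move_names_to_info` are assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document conforming to the original DTD.
DOC = """
<courses>
  <course cno="CSC258"><taken_by>
    <student sno="st1" name="Fox" grade="A+"/>
  </taken_by></course>
  <course cno="CSC434"><taken_by>
    <student sno="st1" name="Fox" grade="B+"/>
  </taken_by></course>
</courses>
"""


def move_names_to_info(root):
    """Remove @name from student elements and store it once per @sno in info elements."""
    names = {}
    for student in root.iter("student"):
        sno = student.get("sno")
        name = student.attrib.pop("name")
        # The data dependency must hold, otherwise the move would lose information.
        if names.setdefault(sno, name) != name:
            raise ValueError(f"two names for student {sno}")
    for sno, name in names.items():
        ET.SubElement(root, "info", sno=sno, name=name)
    return root


root = move_names_to_info(ET.fromstring(DOC.strip()))
print(ET.tostring(root, encoding="unicode"))
```

After the call, the name "Fox" is stored in a single info element instead of once per enrolment.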

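A “Non-relational” Example
[Figure: XML tree with root DBLP. conf elements carry an attribute @title (e.g. "ICDT") and have issue children. Each issue contains article elements with attributes @title and @year (sample years "1999", "2001").]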

XNF: XML Normal Form
Proposed in [AL02]. It eliminates two types of anomalies.
It was defined for XML functional dependencies:
DBLP.conf.@title → DBLP.conf
DBLP.conf.issue → DBLP.conf.issue.article.@year

Part 3: What was Missing? Justification! What is a good database design? Well-known solutions: BCNF, 4NF, … But what is it that makes a database design good? Elimination of update anomalies. Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. Previous work was specific for the relational model. Classical problems have to be revisited in the XML context.

Justification of Normal Forms Problematic to evaluate XML normal forms. No XML update language has been standardized. No XML query language yet has the same “yardstick” status as relational algebra. We do not even know if implication of XML FDs is decidable! We need a different approach. It must be based on some intrinsic characteristics of the data. It must be applicable to new data models. It must be independent of query/update/constraint issues. Our approach is based on information theory.

Information Theory
Entropy measures the amount of information provided by a certain event.
Assume that an event can have n different outcomes with probabilities p_1, …, p_n.
Amount of information gained by knowing that event i occurred: $\log \frac{1}{p_i}$
Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Entropy is maximal if each p_i = 1/n: $\log n$
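As a small illustration, entropy can be computed directly from a list of probabilities. This sketch assumes base-2 logarithms, which is consistent with the values quoted later in the talk (log 5 ≈ 2.322). The function name `entropy` is illustrative.

```python
from math import log2


def entropy(probabilities):
    """Average information, in bits, of a distribution given as a list of probabilities."""
    # Outcomes with probability 0 contribute nothing to the sum.
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


print(entropy([1 / 5] * 5))   # uniform over 5 outcomes: log2(5)
print(entropy([1.0]))         # a certain outcome carries no information
```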

Entropy and Redundancies
Database schema: R(A,B,C), A → B
Instance I:

A  B  C
1  2  3
1  2  4

First position: the C value of the first row.
Pick a domain properly containing adom(I): {1, …, 6}
Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
Entropy: log 5 ≈ 2.322

Second position: the B value of the first row.
Pick a domain properly containing adom(I): {1, …, 6}
Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
Entropy: log 1 = 0
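The two entropies on this slide can be reproduced by brute force. The sketch below rests on an assumed reading of the slide: a value is admissible for a position if, after substituting it, the FD A → B still holds and no two rows coincide (which is why 4 is excluded for the C position). All admissible values are taken to be equally likely. The names `admissible` and `position_entropy` are illustrative.

```python
from math import log2


def admissible(rows):
    """Assumed condition: rows are pairwise distinct and satisfy the FD A -> B."""
    distinct = len(set(rows)) == len(rows)
    fd_holds = all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])
    return distinct and fd_holds


def position_entropy(instance, row, col, domain):
    """Entropy of the uniform distribution over admissible values for one position."""
    allowed = []
    for a in domain:
        candidate = [
            tuple(a if (i, j) == (row, col) else v for j, v in enumerate(r))
            for i, r in enumerate(instance)
        ]
        if admissible(candidate):
            allowed.append(a)
    return log2(len(allowed))


I = [(1, 2, 3), (1, 2, 4)]
domain = range(1, 7)
print(position_entropy(I, 0, 2, domain))  # C value of the first row
print(position_entropy(I, 0, 1, domain))  # B value of the first row
```

Under this reading the C position has five admissible values and the B position has exactly one, matching log 5 and log 1 on the slide.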

Entropy and Normal Forms
Let Σ be a set of FDs over a schema S.
Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries non-zero amount of information (entropy > 0).
This is a clean characterization of BCNF, but the measure is not accurate enough...

Problems with the Measure
The measure cannot distinguish between different types of data dependencies.
It cannot distinguish between different instances of the same schema R(A,B,C), A → B. The position being measured is marked with "_":

A  B  C
1  2  3
1  2  4
1  _  5
entropy = 0

A  B  C
1  2  3
1  _  4
entropy = 0

A General Measure
Instance I of schema R(A,B,C), A → B:

A  B  C
1  2  3
1  2  4

Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, p is the B value of the first row and k = 7.

Computation: for every X ⊆ Pos(I) − {p}, compute probability distribution P(a | X), a ∈ {1, …, k}.

Example: X keeps the 3 in the first row and the 1 and 2 in the second row. The remaining positions are left open:

A  B  C
_  p  3
1  2  _

P(2 | X) = 48 / (48 + 6 × 42) = 0.16
For a ≠ 2, P(a | X) = 42 / (48 + 6 × 42) = 0.14
Entropy ≈ 2.8057 (log 7 ≈ 2.8073)

A General Measure
Instance I of schema R(A,B,C), A → B, with p the B value of the first row:

A  B  C
1  p  3
1  2  4

Value: we consider the average over all sets X ⊆ Pos(I) − {p}.
Average: 2.4558 < log 7 (maximal entropy)
It corresponds to conditional entropy.
It depends on the value of k...
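The numbers quoted for this example can be checked by exhaustive enumeration. The sketch below is a reconstruction under stated assumptions, not the authors' code: for a set X of kept positions and a candidate value a for p, it counts the ways of filling every other position with values from {1, …, k} so that A → B holds and the rows stay pairwise distinct. The distinctness requirement is inferred from the count 48 = 49 − 1 on the slide. The function names are illustrative.

```python
from itertools import combinations, product
from math import log2


def admissible(rows):
    """Assumed condition: rows are pairwise distinct and satisfy the FD A -> B."""
    if len(set(rows)) != len(rows):
        return False
    return all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])


def counts_given(instance, p, kept, k):
    """For each a in 1..k, count admissible fillings of the positions outside kept and p."""
    positions = [(i, j) for i, row in enumerate(instance) for j in range(len(row))]
    free = [q for q in positions if q != p and q not in kept]
    counts = []
    for a in range(1, k + 1):
        n = 0
        for values in product(range(1, k + 1), repeat=len(free)):
            grid = [list(row) for row in instance]
            grid[p[0]][p[1]] = a
            for (i, j), v in zip(free, values):
                grid[i][j] = v
            if admissible([tuple(row) for row in grid]):
                n += 1
        counts.append(n)
    return counts


def entropy_of_counts(counts):
    """Entropy of the distribution obtained by normalizing the counts."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c)


def average_entropy(instance, p, k):
    """Average of the entropies over all subsets X of Pos(I) - {p}."""
    positions = [(i, j) for i, row in enumerate(instance) for j in range(len(row))]
    others = [q for q in positions if q != p]
    subsets = [set(x) for n in range(len(others) + 1) for x in combinations(others, n)]
    return sum(entropy_of_counts(counts_given(instance, p, x, k)) for x in subsets) / len(subsets)


I = [(1, 2, 3), (1, 2, 4)]
p = (0, 1)                      # B value of the first row
X = {(0, 2), (1, 0), (1, 1)}    # the kept positions of the slide's example
counts = counts_given(I, p, X, 7)
print(counts)
print(round(entropy_of_counts(counts), 4))
print(average_entropy(I, p, 7))
```

With these assumptions the counts for X are 48 for a = 2 and 42 for every other value, and the average over all 32 subsets comes out close to the 2.4558 quoted on the slide. The enumeration is exponential in the number of positions, so it is only suitable for toy instances like this one.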

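A General Measure
Previous value: $\mathrm{INF}^k_I(p \mid \Sigma)$, the average entropy of position p computed for a given k.
For each k, we consider the ratio: $\mathrm{INF}^k_I(p \mid \Sigma) / \log k$
It measures how close the given position p is to having the maximum possible information content.
General measure: $\mathrm{INF}_I(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}^k_I(p \mid \Sigma) / \log k$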

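Basic Properties
The measure is well defined: for every set of first-order constraints Σ defined over a schema S, every I ∈ inst(S, Σ), and every p ∈ Pos(I), the limit defining the measure, $\mathrm{INF}_I(p \mid \Sigma)$, exists.
Bounds: $0 \le \mathrm{INF}_I(p \mid \Sigma) \le 1$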

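Basic Properties
The measure does not depend on a particular representation of constraints. If Σ_1 and Σ_2 are equivalent, the measure of every position p, written $\mathrm{INF}_I(p \mid \Sigma)$, is the same under both: $\mathrm{INF}_I(p \mid \Sigma_1) = \mathrm{INF}_I(p \mid \Sigma_2)$
It overcomes the limitations of the simple measure. For R(A,B,C), A → B, with the measured position marked "_":

A  B  C
1  2  3
1  2  4
1  _  5
measure = 0.781

A  B  C
1  2  3
1  _  4
measure = 0.875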

Well-Designed Databases
Definition: A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), the measure of p equals 1: $\mathrm{INF}_I(p \mid \Sigma) = 1$.
In other words, every position in every instance carries the maximum possible amount of information.
We would like to test this definition in the relational world...

Relational Databases
Σ is a set of data dependencies over a schema S:
Σ = ∅: (S, Σ) is well-designed.
Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF.
Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF.
Σ is a set of FDs and JDs: if (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true. A syntactic characterization of being well-designed is given in [AL03].

Relational Databases
If (S, Σ) is in DK/NF, then (S, Σ) is well-designed. The converse is not true.
The problem of verifying whether a relational schema is well-designed is undecidable.
If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem is co-NEXPTIME-complete.
If each relation in S has at most m attributes, then the problem is -complete.
Now we would like to apply our definition in the XML world...

XML Databases
XML schema: (D, Σ). D is a DTD. Σ is a set of data dependencies over D.
We would like to evaluate XML normal forms.
The notion of being well-designed extends from relations to XML.
The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

Positions in an XML Tree
[Figure: the DBLP tree (DBLP, conf with @title, issue, article with @title and @year), with the attribute values "ICDT", "1999", "2001", "..." shown as its positions.]

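Well-Designed XML Data
We consider k such that adom(T) ⊆ {1, …, k}.
For each k, the average entropy of position p: $\mathrm{INF}^k_T(p \mid \Sigma)$
We consider the ratio: $\mathrm{INF}^k_T(p \mid \Sigma) / \log k$
General measure: $\mathrm{INF}_T(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}^k_T(p \mid \Sigma) / \log k$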

XNF: XML Normal Form
For arbitrary XML data dependencies:
Definition: An XML specification (D, Σ) is well-designed if for every T ∈ inst(D, Σ) and every p ∈ Pos(T), the measure of p equals 1: $\mathrm{INF}_T(p \mid \Sigma) = 1$.
For functional dependencies:
Theorem: An XML specification (D, Σ) is in XNF if and only if (D, Σ) is well-designed.

Normalization Algorithms: BCNF
Relation schema: R(X,Y,Z), Σ
Not in BCNF: Σ ⊢ X → Y and Σ ⊬ X → A, for every A ∈ Z.
Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z).
Example: R(number, title, section, room), number → title is replaced by S(number, title), number → title and T(number, section, room), ∅.

Normalization Algorithms: BCNF

R:
number  title                  section  room
CSC258  Computer Organization  1        LP266
CSC258  Computer Organization  2        GB258
CSC434  Database Systems       1        GB248

S = π_{number, title}(R):
number  title
CSC258  Computer Organization
CSC434  Database Systems

T = π_{number, section, room}(R):
number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC434  1        GB248

The original relation is recovered as S ⋈ T.
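The projection and join on this slide can be replayed on the example instance. The following is a minimal sketch with relations represented as Python sets of tuples. The variable names are illustrative.

```python
# Instance of R(number, title, section, room) from the slide.
R = {
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC434", "Database Systems", 1, "GB248"),
}

# Projections onto the two schemas produced by the basic decomposition step.
S = {(number, title) for (number, title, section, room) in R}
T = {(number, section, room) for (number, title, section, room) in R}

# Natural join of S and T on the shared attribute number.
joined = {
    (number, title, section, room)
    for (number, title) in S
    for (number2, section, room) in T
    if number == number2
}

print(len(S), len(T))
print(joined == R)
```

Because number → title holds in R, joining the projections returns exactly the original rows, which is the lossless-decomposition property mentioned earlier in the talk.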

Normalization Algorithms: XNF
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD of the form DBLP.conf.issue → DBLP.conf.issue.article.@year, then apply the “DBLP example rule”.
Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Normalization Algorithms The information-theoretic measure can also be used for reasoning about normalization algorithms. For BCNF and XNF decomposition algorithms: Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease.

Future Work We would like to consider more complex XML constraints and characterize good designs they give rise to. We would like to characterize 3NF by using the measure developed in this paper. In general, we would like to characterize “non-perfect” normal forms. We would like to develop better characterizations of normalization algorithms using our measure. Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?

Backup Slides

XNF: XML Normal Form
Given a DTD D and a set of functional dependencies Σ ∪ {φ}:
(D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ.
(D, Σ)+ = {φ | (D, Σ) ⊢ φ}
Functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ

XNF: XML Normal Form
XML specification: a DTD D and a set of functional dependencies Σ.
A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
(D, Σ) is in XNF if: for each non-trivial FD X → p.@l in (D, Σ)+, X → p is in (D, Σ)+.

A Normal Form for FDs and JDs
Let Σ be a set of FDs and JDs over a schema S.
Theorem: (S, Σ) is well-designed if and only if for every R ∈ S and every nontrivial JD: implied by Σ, there exists M ⊆ {1, ..., m} such that: 1. 2. For every i, j ∈ M, Σ implies

A Normal Form for FDs and JDs (cont’d)
Schema: S = { R(A,B,C) } and Σ = { ⋈[AB, AC, BC], AB → C, AC → B }.
(S, Σ) is not in PJ/NF: {AB → ABC, AC → ABC} does not imply ⋈[AB, AC, BC].
(S, Σ) is not in 5NFR: ⋈[AB, AC, BC] is strong-reduced and BC is not a superkey.
(S, Σ) is well-designed.

Tree Tuples
Paths(D): all paths in a DTD D, for example courses.course, courses.course.@cno, courses.course.student.@name.
We distinguish two kinds of elements: attributes (@) and element types.
FDs are defined by means of a relational representation of XML documents.

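XML Trees
[Figure: XML tree with vertices v_0 (courses), v_1 (course, @cno = "cs100"), v_2 (student, @sno = "123", @name = "Fox", @grade = "B+"), v_3 (student, @sno = "456", @name = "Smith", @grade = "A-"), and further children elided.]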

Tree Tuples
Relational representation: tree tuples, that is, mappings t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
A tree tuple represents an XML tree:
t(courses) = v_0
t(courses.course) = v_1
t(courses.course.@cno) = “cs100”
t(courses.course.student) = v_2
t(p) = ⊥, for the remaining paths
[Figure: the tree described by t, with vertices v_0 (courses), v_1 (course, @cno = "cs100") and v_2 (student).]

XML Tree: set of Tree Tuples
[Figure: the XML tree with vertices v_0, v_1, v_2, v_3 shown next to its tree tuples. One tuple follows courses, course (@cno = "cs100") and student v_2 (@sno = "123", @name = "Fox", @grade = "B+"), another follows the same course and student v_3 (@sno = "456", @name = "Smith", @grade = "A-"), and further tuples are elided.]

Functional Dependencies for XML
Expressions of the form X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
XML tree T can be tested for satisfaction of X → Y if: X ∪ Y ⊆ Paths(T) ⊆ Paths(D)
T ⊨ X → Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y

FD: Examples
University DTD:
courses → course*
course → @cno, student*
student → @sno, @name, @grade
Two students with the same @sno value must have the same name:
courses.course.student.@sno → courses.course.student.@name
Every student can have at most one grade in every course:
{ courses.course, courses.course.student.@sno } → courses.course.student.@grade
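The satisfaction condition for XML FDs can be tried out on hand-written tree tuples. The sketch below is illustrative only: tree tuples are modelled as Python dictionaries from paths to vertex identifiers or strings, a missing path stands for ⊥, and the sample tuples and the function name `satisfies_fd` are assumptions.

```python
def satisfies_fd(tuples, lhs, rhs):
    """Check X -> Y: tuples that agree on X, with X fully defined, must agree on Y."""
    for u in tuples:
        for v in tuples:
            lhs_defined = all(u.get(path) is not None for path in lhs)
            same_lhs = all(u.get(path) == v.get(path) for path in lhs)
            if lhs_defined and same_lhs:
                if any(u.get(path) != v.get(path) for path in rhs):
                    return False
    return True


SNO = "courses.course.student.@sno"
NAME = "courses.course.student.@name"
GRADE = "courses.course.student.@grade"
COURSE = "courses.course"

# Hypothetical tree tuples: student st1 is enrolled in two courses (vertices v1 and v4).
tuples = [
    {"courses": "v0", COURSE: "v1", "courses.course.@cno": "CSC258",
     "courses.course.student": "v2", SNO: "st1", NAME: "Fox", GRADE: "A+"},
    {"courses": "v0", COURSE: "v4", "courses.course.@cno": "CSC434",
     "courses.course.student": "v5", SNO: "st1", NAME: "Fox", GRADE: "B+"},
]

print(satisfies_fd(tuples, [SNO], [NAME]))            # same @sno, same name
print(satisfies_fd(tuples, [COURSE, SNO], [GRADE]))   # one grade per course
print(satisfies_fd(tuples, [SNO], [GRADE]))           # grades differ across courses

# Changing one copy of the name violates the first dependency.
broken = [dict(tuples[0]), dict(tuples[1], **{NAME: "Wolf"})]
print(satisfies_fd(broken, [SNO], [NAME]))
```

The third check shows why the grade dependency needs courses.course on its left-hand side: the same student may hold different grades in different courses.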
