Download presentation
1
Well-designed XML Data
Marcelo Arenas and Leonid Libkin University of Toronto
2
Outline Part 1 - Database Normalization from the 1970s and 1980s.
Part 2 - Classical theory revisited: normalizing XML documents. Part 3 - Classical theory re-done: new justifications for normalization. 2
3
Part 1: Classical Normalization
Design: decide how to represent the information in a particular data model. Even for simple application domains there is a large number of ways of representing the data of interest. We have to design the schema of the database. Set of relations. Set of attributes for each relation. Set of data dependencies. 3
4
Designing a Database: An Example
Attributes: number, title, section, room. Data dependency: every course number is associated with only one title. Relational Schema: BAD alternative: R(number, title, section, room), number title GOOD alternative: S(number, title), number title T(number, section, room), 4
5
Problems with BAD: Update Anomaly
number title section room CSC258 Computer Organization 1 LP266 2 GB258 3 GB248 CSC434 Database Systems Title of CSC258 is changed to Computer Organization I. 5
6
Problems with BAD: Update Anomaly
number title section room CSC258 Computer Organization 1 LP266 2 GB258 3 GB248 CSC434 Database Systems Title of CSC258 is changed to Computer Organization I. 5
7
Problems with BAD: Update Anomaly
number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 CSC434 Database Systems Title of CSC258 is changed to Computer Organization I. The instance stores redundant information. 5
8
Computer Organization I
Deletion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 CSC434 Database Systems CSC434 is not given in this term. 6
9
Computer Organization I
Deletion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 CSC434 Database Systems CSC434 is not given in this term. 6
10
Computer Organization I
Deletion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 CSC434 is not given in this term. Additional effect: all the information about CSC434 was deleted. 6
11
Computer Organization I
Insertion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 A new course is created: (CSC336, Numerical Methods) 7
12
Computer Organization I
Insertion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 A new course is created: (CSC336, Numerical Methods) 7
13
Computer Organization I
Insertion Anomaly number title section room CSC258 Computer Organization I 1 LP266 2 GB258 3 GB248 CSC336 Numerical Methods ? A new course is created: (CSC336, Numerical Methods) The instance stores attributes that are not directly related. 7
14
Avoiding Update Anomalies
number title CSC258 Computer Organization CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 CSC434 Title of CSC258 is changed to Computer Organization I. 8
15
Avoiding Update Anomalies
number title CSC258 Computer Organization CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 CSC434 Title of CSC258 is changed to Computer Organization I. 8
16
Avoiding Update Anomalies
number title CSC258 Computer Organization I CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 CSC434 CSC434 is not given in this term. Title of CSC258 is changed to Computer Organization I. The instance does not store redundant information. 8
17
Avoiding Update Anomalies
number title CSC258 Computer Organization I CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 CSC434 CSC434 is not given in this term. 8
18
Avoiding Update Anomalies
number title CSC258 Computer Organization I CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 A new course is created: (CSC336, Numerical Methods) CSC434 is not given in this term. The title of CSC434 is not removed from the instance. 8
19
Avoiding Update Anomalies
number title CSC258 Computer Organization I CSC434 Database Systems number section room CSC258 1 LP266 2 GB258 3 GB248 A new course is created: (CSC336, Numerical Methods) 8
20
Avoiding Update Anomalies
number title CSC258 Computer Organization I CSC434 Database Systems CSC336 Numerical Methods number section room CSC258 1 LP266 2 GB258 3 GB248 A new course is created: (CSC336, Numerical Methods) No information about sections has to be provided. Each relation stores attributes that are directly related. 8
21
Normalization Theory Main idea: a normal form defines a condition that a well designed database should satisfy. Normal form: syntactic condition on the database schema. Defined for a class of data dependencies. Main problems: How to test whether a database schema is in a particular normal form. How to transform a database schema into an equivalent one satisfying a particular normal form. 9
22
Normalization Theory Today
Normalization theory for relational databases was developed in the 70s and 80s. Why do we need normalization theory today? New data models have emerged: XML. XML documents can contain redundant information. Redundant information in XML documents: Can be discovered if the user provides semantic information. Can be eliminated. 10
23
Part 2: XML and Normalization
XML Document: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> 11
24
Part 2: XML and Normalization
XML Document: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> 11
25
Part 2: XML and Normalization
XML Document: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> 11
26
Part 2: XML and Normalization
XML Document: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> 11
27
Part 2: XML and Normalization
XML Document: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> 11
28
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
29
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
30
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> … </course> <course cno=“CSC434”> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
31
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
32
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
33
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
34
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
35
Part 2: XML and Normalization
XML Document: DTD: <courses> <course cno=“CSC258”> <taken_by> <student sno=“st1”> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses> courses course* course @cno taken_by student* student @sno name, grade name #PCDATA grade 11
36
XML Databases XML Schema: (D, ) D : : courses course* course @cno
taken_by student* student @sno name, grade Two students with the value must have the same name. 12
37
Redundancy in XML courses course course info @cno taken_by @cno
@sno name “CSC258” “CSC434” “st1” “Fox” student student student As a first example, let’s see a university database. [Describe the Example] In this case, if we know is an identifier for students, we are storing redundant information. In this case, the student with number “st1” appears twice . . . @sno name grade @sno name grade “st1” “Fox” “B+” “st1” “Fox” “A+” 13
38
XML Database Normalization
DTD: Data dependency: courses course* course @cno taken_by student* student @sno name, grade Two students with the value must have the same name. 14
39
XML Database Normalization
DTD: Data dependency: courses course* course @cno taken_by student* student @sno grade , info* Two students with the value must have the same name. @sno is the identifier of info elements. info @sno name 14
40
A “Non-relational” Example
DBLP conf conf title issue issue “ICDT” article article @year article @year “1999” “2001” author title @year author title @year title @year “Dong” “. . .” “1999” “Jarke” “. . .” “1999” “. . .” “2001” 15
41
XNF: XML Normal Form It eliminates two types of anomalies.
It was defined for XML functional dependencies: DBLP.conf DBLP.conf.issue 16
42
Problems to Address Functional dependencies for XML.
Normal form for XML documents (XNF). Generalizes BCNF. Algorithm for normalizing XML documents. Implication problem for functional dependencies. 17
43
Framework: Paths in DTDs
Paths(D): all paths in a DTD D courses.course courses.course.student.name courses.course.student.name.S We distinguish three kinds of elements: attributes strings (S) and element types. FDs are defined by means of a relational representation of XML documents. 18
44
Framework: XML Trees v0 v1 . . . v2 v5 “cs100” v3 v4 v6 v7 “123” “456”
courses v0 course course v1 . . . @cno student student v2 v5 “cs100” @sno grade @sno grade name name v3 v4 v6 v7 “123” “456” S S S S “Fox” “B+” “Smith” “A-” 19
45
Tree Tuples Relational representation: tree tuples - mappings
t : Paths(D) Vertices Strings {} A tree tuple represents an XML tree: courses t(courses) = v0 t(courses.course) = v1 = “cs100” t(courses.course.student) = v2 t(p) = , for the remaining paths v0 course v1 @cno student v2 “cs100” 20
46
XML Tree: set of Tree Tuples
courses courses v0 v0 course course course course v1 v1 . . . . . . @cno @cno student student student student v2 v2 v5 v5 “cs100” “cs100” @sno @sno grade grade @sno @sno grade grade name name name name v3 v3 v4 v4 v6 v6 v7 v7 “123” “123” “456” “456” S S S S S S S S “Fox” “Fox” “B+” “B+” “Smith” “Smith” “A-” “A-” 21
47
Functional Dependencies for XML
Expressions of the form: X Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D). XML tree T can be tested for satisfaction of X Y if: X Y Paths(T) Paths(D) T X Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X ≠ implies u.Y = v.Y 22
48
FD: Examples University DTD: courses course* course @cno, student*
student name, grade Two students with the value must have the same name: courses.course.student.name.S Every student can have at most one grade in every course: { courses.course, } courses.course.student.grade.S 23
49
Implication Problem for FD
Given a DTD D and a set of functional dependencies {}: (D, ) if for any XML tree T conforming to D and satisfying , it is the case that T (D, )+ = { | (D, ) } Functional dependency is trivial if it is implied by the DTD alone: (D, ) 24
50
XNF: XML Normal Form XML specification: a DTD D and a set of functional dependencies . A Relational DB is in BCNF if for every non-trivial functional dependency X Y in the specification, X is a key. (D, ) is in XNF if: For each non-trivial FD X or X p.S in (D, )+, X p is in (D, )+. 25
51
Back to DBLP DBLP is not in XNF:
DBLP.conf.issue (D,)+ DBLP.conf.issue DBLP.conf.issue.article (D,)+ Proposed solution is in XNF. 26
52
Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF. If there is an anomalous FD of the form: DBLP.conf.issue then apply the “DBLP example rule”. Otherwise: choose a minimal anomalous FD and apply the “University example rule”. 27
53
Normalizing XML Documents
Theorem The decomposition algorithm terminates and outputs a specification in XNF. Furthermore, it does not lose information: Unnormalized Normalized XML document XML Document Q1, Q2 are XQuery core queries. Q1 Q2 28
54
Part 3: What was Missing? Justification!
What is a good database design? Well-known solutions: BCNF, 4NF, … But what is it that makes a database design good? Elimination of update anomalies. Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. Previous work was specific for the relational model. Classical problems have to be revisited in the XML context. 29
55
Justification of Normal Forms
Problematic to evaluate XML normal forms. No XML update language has been standardized. No XML query language yet has the same “yardstick” status as relational algebra. We do not even know if implication of XML FDs is decidable! We need a different approach. It must be based on some intrinsic characteristics of the data. It must be applicable to new data models. It must be independent of query/update/constraint issues. Our approach is based on information theory. 30
56
Information Theory Entropy measures the amount of information provided by a certain event. Assume that an event can have n different outcomes with probabilities p1, …, pn. Amount of information gained by knowing that event i occurred : Average amount of information gained (entropy) : Entropy is maximal if each pi = 1/n : 31
57
Entropy and Redundancies
Database schema: R(A,B,C), A B Instance I: Pick a domain properly containing adom(I) : Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4 Entropy: log 5 ≈ 2.322 A B C 1 2 3 4 A B C 1 3 2 4 A B C 1 2 4 A B C 1 2 3 4 A B C 1 2 3 4 Pick a domain properly containing adom(I) : {1, …, 6} Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2 Entropy: log 1 = 0 {1, …, 6} 32
58
Entropy and Normal Forms
Let be a set of FDs over a schema S. Theorem (S,) is in BCNF if and only if for every instance of (S,) and for every domain properly containing adom(I), each position carries non-zero amount of information (entropy > 0). A similar result holds for 4NF and MVDs. This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ... 33
59
Problems with the Measure
The measure cannot distinguish between different types of data dependencies. It cannot distinguish between different instances of the same schema: R(A,B,C), A B A B C 1 2 3 4 A B C 1 2 3 4 5 entropy = 0 entropy = 0 34
60
A General Measure A B C 1 2 3 4 Instance I of schema R(A,B,C), A B :
35
61
A General Measure A B C 1 2 3 4 Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7. 35
62
A General Measure A B C 1 2 3 4 Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7. 35
63
A General Measure A B C 1 3 2 4 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7. 35
64
A General Measure A B C 1 3 2 4 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. 35
65
A General Measure A B C 3 1 2 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. 35
66
A General Measure A B C 3 1 2 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 35
67
A General Measure A B C 2 3 1 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 35
68
A General Measure A B C 1 2 3 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 35
69
A General Measure A B C 4 2 3 1 7 Instance I of schema R(A,B,C), A B : Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 35
70
A General Measure A B C 1 2 3 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ 35
71
A General Measure A B C 3 1 2 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) = 35
72
A General Measure A B C a 3 1 2 Instance I of schema R(A,B,C), A B :
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) = 35
73
A General Measure A B C 2 a 3 1 7 Instance I of schema R(A,B,C), A B : Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) = 35
74
A General Measure A B C 1 a 3 2 6 Instance I of schema R(A,B,C), A B : Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ ( 42) = 0.16 For a ≠ 2, P(a | X) = 42/ ( 42) = 0.14 Entropy ≈ (log 7 ≈ ) 35
75
A General Measure A B C 1 3 2 4 Instance I of schema R(A,B,C), A B :
Value : we consider the average over all sets X Pos(I) – {p}. Average: < log 7 (maximal entropy) It corresponds to conditional entropy. It depends on the value of k ... 35
76
A General Measure Previous value: For each k, we consider the ratio:
How close the given position p is to having the maximum possible information content. General measure: 36
77
Basic Properties The measure is well defined: Bounds:
For every set of firstorder constraints defined over a schema S, every I inst(S,), and every p Pos(I): exists. Bounds: 37
78
Basic Properties The measure does not depend on a particular representation of constraints. If 1 and 2 are equivalent: It overcomes the limitations of the simple measure: R(A,B,C), A B A B C 1 2 3 4 5 A B C 1 2 3 4 0.875 0.781 38
79
Well-Designed Databases
Definition A database specification (S,) is well-designed if for every I inst(S,) and every p Pos(I), = 1. In other words, every position in every instance carries the maximum possible amount of information. We would like to test this definition in the relational world ... 39
80
Relational Databases is a set of data dependencies over a schema S:
= : (S,) is well-designed. is a set of FDs: (S,) is well-designed if and only if (S,) is in BCNF. is a set of FDs and MVDs: (S,) is well-designed if and only if (S,) is in 4NF. is a set of FDs and JDs: If (S,) is in PJ/NF or in 5NFR, then (S,) is well-designed. The converse is not true. A syntactic characterization of being well-designed is given in [AL03]. 40
81
Relational Databases The problem of verifying whether a relational schema is well-designed is undecidable. If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable. Now we would like to apply our definition in the XML world ... 41
82
XML Databases XML schema: (D,).
D is a DTD. is a set of data dependencies over D. We would like to evaluate XML normal forms. The notion of being well-designed extends from relations to XML. The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T). 42
83
Positions in an XML Tree
DBLP conf conf title issue issue “ICDT” “ICDT” article article article author title @year author title @year title @year “Dong” “Dong” “. . .” “. . .” “1999” “1999” “Jarke” “Jarke” “. . .” “. . .” “1999” “1999” “. . .” “. . .” “2001” “2001” 43
84
Well-Designed XML Data
We consider k such that adom(T) {1, …,k}. For each k : We consider the ratio: General measure: 44
85
XNF: XML Normal Form For arbitrary XML data dependencies:
Definition An XML specification (D,) is well-designed if for every T inst(D,) and every p Pos(T), = 1. For functional dependencies: Theorem An XML specification (D,) is in XNF if and only if (D,) is well-designed. 45
86
Normalization Algorithms
The information-theoretic measure can also be used for reasoning about normalization algorithms. For BCNF and XNF decomposition algorithms: Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease. 46
87
Future Work We would like to consider more complex XML constraints and characterize good designs they give rise to. We would like to characterize 3NF by using the measure developed in this paper. In general, we would like to characterize “non-perfect” normal forms. We would like to develop better characterizations of normalization algorithms using our measure. Why is the “usual” BCNF decomposition algorithm good? Why does it always stop? 47
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.