An Information-Theoretic Approach to Normal Forms for Relational and XML Data Marcelo Arenas Leonid Libkin University of Toronto.


Motivation
What is a good database design? Well-known solutions: BCNF, 4NF, …
But what is it that makes a database design good? Elimination of update anomalies. Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
Previous work was specific to the relational model. Classical problems have to be revisited in the XML context.

Motivation
It is problematic to evaluate XML normal forms. No XML update language has been standardized. No XML query language yet has the same “yardstick” status as relational algebra. We do not even know if implication of XML FDs is decidable!
We need a different approach. It must be based on some intrinsic characteristics of the data. It must be applicable to new data models. It must be independent of query/update/constraint issues.
Our approach is based on information theory.

Outline
Information theory.
A simple information-theoretic measure.
A general information-theoretic measure.
Definition of being well-designed.
Relational databases.
XML databases.

Information Theory
Entropy measures the amount of information provided by a certain event. Assume that an event can have n different outcomes with probabilities $p_1, \ldots, p_n$.
Amount of information gained by knowing that outcome i occurred: $\log \frac{1}{p_i}$.
Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$.
Entropy is maximal if each $p_i = 1/n$: in that case it equals $\log n$.
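The entropy formula above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names and the sample distribution are illustrative, and logarithms are taken base 2 as an assumption.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def uniform(n):
    """Uniform distribution over n outcomes."""
    return [1.0 / n] * n

# A skewed distribution carries less information on average than the uniform one.
skewed = [0.7, 0.1, 0.1, 0.1]
h_uniform = entropy(uniform(4))   # equals log2(4)
h_skewed = entropy(skewed)
```

Comparing `h_uniform` with `h_skewed` illustrates why the uniform case is the maximum: any deviation from equal probabilities lowers the average information gained.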

Entropy and Redundancies
Database schema: R(A,B,C), A → B. Instance I: (table of tuples over A, B, C not preserved in this transcript).
For one position: pick a domain properly containing adom(I): {1, …, 6}. Probability distribution: P(4) = 0 and P(a) = 1/5 for a ≠ 4. Entropy: log 5 ≈ 2.3219.
For another position: pick a domain properly containing adom(I): {1, …, 6}. Probability distribution: P(2) = 1 and P(a) = 0 for a ≠ 2. Entropy: log 1 = 0.
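The two distributions on this slide can be reproduced by brute force. The sketch below is an illustration under stated assumptions: the instance table is missing from the transcript, so the instance {(1,2,3), (1,2,4)} is assumed (it is consistent with the probabilities shown), and a value is taken to be admissible at a position when substituting it keeps the FD A → B satisfied and does not merge two tuples. All names are hypothetical.

```python
import math

DOMAIN = range(1, 7)                 # {1, ..., 6}, properly contains adom(I)
INSTANCE = [(1, 2, 3), (1, 2, 4)]    # ASSUMED instance; not preserved in the slides

def satisfies_fd(rows):
    """FD A -> B: tuples that agree on A must agree on B."""
    return all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])

def allowed_values(rows, row, col):
    """Values that can sit at (row, col) without violating A -> B
    or collapsing two tuples into one (set semantics assumed)."""
    ok = []
    for a in DOMAIN:
        changed = [
            tuple(a if (i, j) == (row, col) else v for j, v in enumerate(r))
            for i, r in enumerate(rows)
        ]
        if len(set(changed)) == len(rows) and satisfies_fd(changed):
            ok.append(a)
    return ok

def simple_entropy(rows, row, col):
    """Entropy of the uniform distribution over the admissible values."""
    return math.log2(len(allowed_values(rows, row, col)))
```

Under these assumptions, the C position of the first tuple admits every value except 4 (entropy log 5), while the B position of the second tuple admits only 2 (entropy 0), which is the redundancy caused by A → B.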

Entropy and Normal Forms
Let Σ be a set of FDs over a schema S.
Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
A similar result holds for 4NF and MVDs. This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough...

Problems with the Measure
The measure cannot distinguish between different types of data dependencies.
It cannot distinguish between different instances of the same schema: for R(A,B,C) with A → B, two different instances (tables not preserved in this transcript) both give entropy = 0.

A General Measure
Instance I of schema R(A,B,C), A → B (table of tuples not preserved in this transcript).
Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.
Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
For the set X shown on the slide: P(2 | X) = 48 / (48 + 6 × 42) = 0.16, and for a ≠ 2, P(a | X) = 42 / (48 + 6 × 42) = 0.14. Entropy ≈ 2.8057 (log 7 ≈ 2.8074).
Value: we consider the average over all sets X ⊆ Pos(I) − {p}. The average is strictly less than log 7 (the maximal entropy). It corresponds to conditional entropy. It depends on the value of k...
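The counts 48 and 42 can be reproduced by enumeration. The sketch below is a reconstruction under assumptions, since the slide's table is missing from the transcript: the instance is taken to be {(1,2,3), (1,2,4)}, p is the B position of the second tuple, and X fixes B = 2 and C = 3 in the first tuple and A = 1 in the second, leaving two positions free. An assignment is counted when the FD A → B holds and the two tuples stay distinct. These choices are hypothetical, but they yield exactly the numbers on the slide.

```python
import math
from itertools import product

K = 7  # values range over {1, ..., 7}

def count_sat(a):
    """Number of assignments to the two free positions that are consistent
    with A -> B when the value a is placed at position p."""
    count = 0
    for v1, v2 in product(range(1, K + 1), repeat=2):
        t1 = (v1, 2, 3)   # A is free; B = 2 and C = 3 are fixed by X (assumed)
        t2 = (1, a, v2)   # A = 1 is fixed by X; B is the position p; C is free
        distinct = t1 != t2
        fd_holds = t1[0] != t2[0] or t1[1] == t2[1]
        if distinct and fd_holds:
            count += 1
    return count

counts = {a: count_sat(a) for a in range(1, K + 1)}
total = sum(counts.values())                      # 48 + 6 * 42
probs = {a: c / total for a, c in counts.items()}
entropy = sum(p * math.log2(1 / p) for p in probs.values())
```

The value 2 is slightly more likely than the others because it is the only one that remains consistent when the free A value of the first tuple happens to equal 1.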

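A General Measure
Previous value: $\mathrm{INF}_I^k(p \mid \Sigma)$.
For each k, we consider the ratio $\mathrm{INF}_I^k(p \mid \Sigma) / \log k$: how close the given position p is to having the maximum possible information content.
General measure: $\mathrm{RIC}_I(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}_I^k(p \mid \Sigma) / \log k$.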
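For a small instance the ratio can be computed directly by averaging the conditional entropies over all sets X. The following sketch is an illustration, not the authors' implementation: it assumes the instance {(1,2,3), (1,2,4)}, the position p at the B column of the second tuple, a uniform average over the 32 subsets X, and the same counting rule as before (FD satisfied, tuples kept distinct). The names `inf_k` and `ratio` are hypothetical.

```python
import math
from itertools import combinations, product

INSTANCE = [(1, 2, 3), (1, 2, 4)]   # ASSUMED instance of R(A,B,C) with A -> B
P = (1, 1)                          # ASSUMED position p: second tuple, column B

def valid(rows):
    """Tuples stay distinct and the FD A -> B holds."""
    distinct = len(set(rows)) == len(rows)
    fd = all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])
    return distinct and fd

def inf_k(k, instance=INSTANCE, p=P):
    """Average over all X of the entropy of P(a | X), a in {1, ..., k}."""
    positions = [(i, j) for i in range(len(instance)) for j in range(3) if (i, j) != p]
    values = range(1, k + 1)
    total_entropy, n_sets = 0.0, 0
    for size in range(len(positions) + 1):
        for X in combinations(positions, size):
            free = [q for q in positions if q not in X]
            counts = []
            for a in values:
                c = 0
                for assignment in product(values, repeat=len(free)):
                    cells = {q: instance[q[0]][q[1]] for q in X}  # kept values
                    cells.update(zip(free, assignment))            # free positions
                    cells[p] = a                                   # candidate value at p
                    rows = [tuple(cells[(i, j)] for j in range(3))
                            for i in range(len(instance))]
                    c += valid(rows)
                counts.append(c)
            s = sum(counts)
            total_entropy += sum(c / s * math.log2(s / c) for c in counts if c)
            n_sets += 1
    return total_entropy / n_sets

def ratio(k):
    """INF^k divided by the maximal entropy log k."""
    return inf_k(k) / math.log2(k)
```

Calling `ratio(k)` for increasing k (k must be at least 4 here so that adom(I) ⊆ {1, …, k}) shows how the ratio behaves as k grows. In this assumed setting, 4 of the 32 sets X force the value at p, so the ratio cannot exceed 28/32 for any k.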

Basic Properties
The measure is well defined: for every set of first-order constraints Σ defined over a schema S, every I ∈ inst(S, Σ), and every p ∈ Pos(I), the limit $\mathrm{RIC}_I(p \mid \Sigma)$ exists.
Bounds: $0 \le \mathrm{RIC}_I(p \mid \Sigma) \le 1$.

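Basic Properties
The measure does not depend on a particular representation of constraints. If Σ₁ and Σ₂ are equivalent: $\mathrm{RIC}_I(p \mid \Sigma_1) = \mathrm{RIC}_I(p \mid \Sigma_2)$.
It overcomes the limitations of the simple measure: for R(A,B,C) with A → B, it distinguishes between different instances (example tables and their values not preserved in this transcript).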

Well-Designed Databases
Definition: a database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), $\mathrm{RIC}_I(p \mid \Sigma) = 1$.
In other words, every position in every instance carries the maximum possible amount of information.
We would like to test this definition in the relational world...

Relational Databases
Σ is a set of data dependencies over a schema S:
Σ = ∅: (S, Σ) is well-designed.
Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF.
Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF.
Σ is a set of FDs and JDs: if (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true. A syntactic characterization of being well-designed is given in the paper.
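For the FD case, well-designedness therefore reduces to the standard BCNF test. A minimal sketch of that textbook test follows, using attribute closure; it is not taken from the slides, and the representation of FDs as pairs of attribute strings is an assumption made for brevity.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under a list of FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """A single relation is in BCNF if every nontrivial given FD has a superkey
    on its left-hand side (checking the given FDs suffices for the whole schema)."""
    for lhs, rhs in fds:
        nontrivial = not set(rhs) <= set(lhs)
        if nontrivial and not set(schema) <= closure(lhs, fds):
            return False
    return True

# The running example R(A,B,C) with A -> B: A is not a superkey.
running_example_ok = is_bcnf("ABC", [("A", "B")])
```

For the running example the test fails, which matches the earlier observation that some positions in its instances carry redundant information.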

Relational Databases
The problem of verifying whether a relational schema is well-designed is undecidable. If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.
Now we would like to apply our definition in the XML world...

XML Databases
XML specification: (D, Σ). D is a DTD. Σ is a set of data dependencies over D.
We would like to evaluate XML normal forms. The notion of being well-designed extends from relations to XML. The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

Positions in an XML Tree
[Figure: an XML tree with root DBLP and a conf element, containing the values “ICDT”, “1999”, “Dong”, “2001”, “Jarke”, “...”, illustrating Pos(T).]

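Well-Designed XML Data
We consider k such that adom(T) ⊆ {1, …, k}.
For each k: $\mathrm{INF}_T^k(p \mid \Sigma)$.
We consider the ratio: $\mathrm{INF}_T^k(p \mid \Sigma) / \log k$.
General measure: $\mathrm{RIC}_T(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}_T^k(p \mid \Sigma) / \log k$.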

XNF: XML Normal Form
XNF was proposed in [AL02]. It was defined for XML FDs, such as … → DBLP.conf and DBLP.conf.issue → … (the missing paths are not preserved in this transcript).
It eliminates two types of anomalies. One of them is inspired by the type of anomalies found in relational databases containing FDs.

XNF: XML Normal Form
[Figure: an XML tree with root DBLP and a conf element, containing the values “1999”, “Dong”, “2001”, “Jarke”, “2001”, “...”.]

XNF: XML Normal Form
For arbitrary XML data dependencies:
Definition: an XML specification (D, Σ) is well-designed if for every T ∈ inst(D, Σ) and every p ∈ Pos(T), $\mathrm{RIC}_T(p \mid \Sigma) = 1$.
For functional dependencies:
Theorem: an XML specification (D, Σ) is in XNF if and only if (D, Σ) is well-designed.

Normalization Algorithms
The information-theoretic measure can also be used for reasoning about normalization algorithms.
For BCNF and XNF decomposition algorithms:
Theorem: after each step of these decomposition algorithms, the amount of information in each position does not decrease.

Future Work
We would like to consider more complex XML constraints and characterize the good designs they give rise to.
We would like to characterize 3NF by using the measure developed in this paper. In general, we would like to characterize “non-perfect” normal forms.
We would like to develop better characterizations of normalization algorithms using our measure. Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?

A Normal Form for FDs and JDs
Let Σ be a set of FDs and JDs over a schema S.
Theorem: (S, Σ) is well-designed if and only if for every R ∈ S and every nontrivial JD ⋈[X₁, …, Xₘ] implied by Σ, there exists M ⊆ {1, ..., m} such that:
1. (condition not preserved in this transcript)
2. For every i, j ∈ M, Σ implies (dependency not preserved in this transcript).

A Normal Form for FDs and JDs (cont’d)
Schema: S = { R(A,B,C) } and Σ = { ⋈[AB, AC, BC], AB → C, AC → B }.
(S, Σ) is not in PJ/NF: {AB → ABC, AC → ABC} does not imply ⋈[AB, AC, BC].
(S, Σ) is not in 5NFR: ⋈[AB, AC, BC] is strong-reduced and BC is not a superkey.
(S, Σ) is well-designed.