Schema Refinement and Normalization

Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 28, 2018 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Administrivia Homework 2 will be due Wednesday

1:Many (1:n) Relationships
Place an arrow in the many → one direction, i.e., towards the entity that is referenced via a foreign key
Suppose profs teach multiple courses, but may not have taught yet:
[ER diagram: PROFESSORS, Teaches, COURSES; partial participation (0 or more…)]
Suppose profs must teach to be on the roster:
[ER diagram: PROFESSORS, Teaches, COURSES; total participation (1 or more…)]

Representing 1:n Relationships in Tables
• Key of relationship set:
CREATE TABLE Teaches(
  fid INTEGER,
  serno CHAR(15),
  semester CHAR(4),
  PRIMARY KEY (serno),
  FOREIGN KEY (fid) REFERENCES PROFESSORS,
  FOREIGN KEY (serno) REFERENCES COURSES)
• Or embed relationship in “many” entity set:
CREATE TABLE Teaches_Course(
  serno INTEGER,
  subj VARCHAR(30),
  cid CHAR(15),
  fid CHAR(15),
  name CHAR(40),
  PRIMARY KEY (serno),
  FOREIGN KEY (fid) REFERENCES PROFESSORS)

1:1 Relationships
If you borrow money or have credit, you might get:
[ER diagram: entity sets CreditReport and Borrower joined by relationship Describes; attributes shown: ssn, rid, delinquent?, debt, name]
What are the table options?

Roles: Labeled Edges
Sometimes a relationship connects the same entity, and the entity has more than one role:
[ER diagram: entity set Parts (id, name) connected twice to relationship Includes (qty), with edges labeled Assembly and Subpart]
This often indicates the need for recursive queries

DDL for Role Example
CREATE TABLE Parts (
  Id INTEGER,
  Name CHAR(15),
  …
  PRIMARY KEY (Id) )
CREATE TABLE Includes (
  Assembly INTEGER,
  Subpart INTEGER,
  Qty INTEGER,
  PRIMARY KEY (Assembly, Subpart),
  FOREIGN KEY (Assembly) REFERENCES Parts,
  FOREIGN KEY (Subpart) REFERENCES Parts)
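A minimal sketch of the kind of recursive query the roles example calls for, assuming SQLite through Python's sqlite3 module. The sample parts, the CTE name Reachable, and the omission of the elided Parts columns are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Parts (Id INTEGER PRIMARY KEY, Name CHAR(15));
CREATE TABLE Includes (
    Assembly INTEGER,
    Subpart INTEGER,
    Qty INTEGER,
    PRIMARY KEY (Assembly, Subpart),
    FOREIGN KEY (Assembly) REFERENCES Parts,
    FOREIGN KEY (Subpart) REFERENCES Parts
);
-- Hypothetical data: a bike has 2 wheels and 1 frame; a wheel has 32 spokes.
INSERT INTO Parts VALUES (1, 'bike'), (2, 'wheel'), (3, 'spoke'), (4, 'frame');
INSERT INTO Includes VALUES (1, 2, 2), (2, 3, 32), (1, 4, 1);
""")

# All direct and indirect subparts of part 1, found by following the
# Assembly -> Subpart role edges until no new parts appear.
rows = conn.execute("""
WITH RECURSIVE Reachable(part) AS (
    SELECT Subpart FROM Includes WHERE Assembly = 1
    UNION
    SELECT i.Subpart
    FROM Includes i JOIN Reachable r ON i.Assembly = r.part
)
SELECT p.Name
FROM Reachable r JOIN Parts p ON p.Id = r.part
ORDER BY p.Name
""").fetchall()

subparts = [name for (name,) in rows]
print(subparts)
```

A non-recursive join over Includes would only return the direct subparts (wheel and frame); the recursion is what reaches spoke.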

Roles vs. Separate Entities
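[ER diagram 1: separate entity sets Husband (id, name) and Wife (id, name) joined by relationship Married]
[ER diagram 2: a single entity set Person (id, name) connected twice to relationship Married, with edges labeled Husband and Wife]
What is the difference between these two representations?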

ISA Relationships: Subclassing (Structurally)
Inheritance states that one entity is a “special kind” of another entity: a member of the “subclass” should also be a member of the “base class”
[ER diagram: People (id, name) with an ISA link to Employees (salary)]

But How Does this Translate into the Relational Model?
Compare these options:
• Two tables, disjoint tuples
• Two tables, disjoint attributes
• One table with NULLs
• Object-relational databases

Weak Entities
A weak entity can only be identified uniquely using the primary key of another (owner) entity.
• Owner and weak entity sets are in a one-to-many relationship set: 1 owner, many weak entities
• Weak entity set must have total participation
[ER diagram: People (ssn, name) linked through identifying relationship Feeds (weeklyCost) to weak entity set Pets (name, species)]

Translating Weak Entity Sets
Weak entity set and identifying relationship set are translated into a single table; when the owner entity is deleted, all owned weak entities must also be deleted
CREATE TABLE Feed_Pets (
  name VARCHAR(20),
  species INTEGER,
  weeklyCost REAL,
  ssn CHAR(11) NOT NULL,
  PRIMARY KEY (name, ssn),
  FOREIGN KEY (ssn) REFERENCES People
    ON DELETE CASCADE)
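A small sketch of the cascading delete in action, assuming SQLite through Python's sqlite3 module. The People columns beyond ssn, the sample rows, and the species codes are made up for illustration; the primary key is written over the declared column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE People (ssn CHAR(11) PRIMARY KEY, name VARCHAR(40));
CREATE TABLE Feed_Pets (
    name VARCHAR(20),
    species INTEGER,
    weeklyCost REAL,
    ssn CHAR(11) NOT NULL,
    PRIMARY KEY (name, ssn),
    FOREIGN KEY (ssn) REFERENCES People ON DELETE CASCADE
);
""")

conn.executemany("INSERT INTO People VALUES (?, ?)", [
    ("111-11-1111", "Sam"),
    ("222-22-2222", "Jill"),
])
# Two owners may each have a pet called Rex: the pet name is only a
# partial key, made unique by the owner's ssn.
conn.executemany("INSERT INTO Feed_Pets VALUES (?, ?, ?, ?)", [
    ("Rex", 1, 12.5, "111-11-1111"),
    ("Tom", 2, 8.0, "111-11-1111"),
    ("Rex", 1, 10.0, "222-22-2222"),
])

# Deleting the owner removes the owned weak entities as well.
conn.execute("DELETE FROM People WHERE ssn = '111-11-1111'")
remaining = conn.execute("SELECT name, ssn FROM Feed_Pets").fetchall()
print(remaining)
```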

N-ary Relationships
Relationship sets can relate an arbitrary number of entity sets:
[ER diagram: relationship Indep Study connecting Student, Project, and Advisor]

Summary of ER Diagrams
• One of the primary ways of designing logical schemas
• CASE tools exist built around ER (e.g. ERWin, PowerBuilder, etc.)
  • Translate the design automatically into DDL, XML, UML, etc.
  • Use a slightly different notation that is better suited to graphical displays
  • Some tools support constraints beyond what ER diagrams can capture
• Can you get different ER diagrams from the same data?

Not All Designs are Equally Good
Why is this a poor schema design?
  Stuff(sid, name, serno, cid, subj, exp-grade)
And why is this one better?
  Student(sid, name)
  Course(serno, cid)
  Subject(cid, subj)
  Takes(sid, cid, exp-grade)

Focus on the Bad Design

sid  name   serno   subj  cid  exp-grade
1    Sam    520109  AI    520  B
23   Nitin  550109  DB    550  A
45   Jill   505109  OS    505  C

• Certain items (e.g., name) get repeated
• Some information requires that a student be enrolled (e.g., courses) due to the key

Functional Dependencies Describe “Key-Like” Relationships
A key is a set of attributes where: if keys match, then the tuples match
A functional dependency (FD) is a generalization: if an attribute set determines another, written X → Y, then if two tuples agree on attribute set X, they must agree on Y:
  sid → name
What other FDs are there in this data?
FDs are independent of our schema design choice

Formal Definition of FD’s
Def. Given a relation schema R and subsets X, Y of R:
An instance r of R satisfies FD X → Y if, for any two tuples t1, t2 ∈ r, t1[X] = t2[X] implies t1[Y] = t2[Y]
For an FD to hold for schema R, it must hold for every possible instance r of R
(Can a DBMS verify this? Can we determine this by looking at an instance?)
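The definition can be checked mechanically against one instance. The sketch below is an illustration only: it assumes tuples are represented as Python dicts, and the function name satisfies is made up here. It also hints at the answer to the second question, since an instance can refute an FD but never prove that it holds for the schema.

```python
def satisfies(rows, lhs, rhs):
    """True if the instance `rows` satisfies lhs -> rhs: any two tuples
    that agree on lhs also agree on rhs."""
    seen = {}  # lhs value -> the rhs value first seen with it
    for t in rows:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True


# The three Stuff tuples shown on the "bad design" slide.
stuff = [
    {"sid": 1, "name": "Sam", "serno": 520109, "subj": "AI", "cid": 520, "exp-grade": "B"},
    {"sid": 23, "name": "Nitin", "serno": 550109, "subj": "DB", "cid": 550, "exp-grade": "A"},
    {"sid": 45, "name": "Jill", "serno": 505109, "subj": "OS", "cid": 505, "exp-grade": "C"},
]

print(satisfies(stuff, ["sid"], ["name"]))        # the intended FD
print(satisfies(stuff, ["exp-grade"], ["sid"]))   # holds here only by accident

# A hypothetical extra tuple that breaks sid -> name.
bad = stuff + [
    {"sid": 1, "name": "Samuel", "serno": 550109, "subj": "DB", "cid": 550, "exp-grade": "A"},
]
print(satisfies(bad, ["sid"], ["name"]))
```

The second call succeeds on this instance even though exp-grade → sid is clearly not a property of the schema, which is why FDs have to come from the designer's knowledge of the domain rather than from sample data.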

General Thoughts on Good Schemas
We want all attributes in every tuple to be determined by the tuple’s key attributes, i.e. part of a superkey (for key X → Y, a superkey is a “non-minimal” X)
What does this say about redundancy?
But:
• What about tuples that don’t have keys (other than the entire value)?
• What about the fact that every attribute determines itself?

Armstrong’s Axioms: Inferring FDs
Some FDs exist due to others; can compute using Armstrong’s axioms:
• Reflexivity: If Y ⊆ X then X → Y (trivial dependencies)
    name, sid → name
• Augmentation: If X → Y then XW → YW
    serno → subj so serno, exp-grade → subj, exp-grade
• Transitivity: If X → Y and Y → Z then X → Z
    serno → cid and cid → subj so serno → subj

Armstrong’s Axioms Lead to…
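• Union: If X → Y and X → Z then X → YZ
• Pseudotransitivity: If X → Y and WY → Z then XW → Z
• Decomposition: If X → Y and Z ⊆ Y then X → Z
Let’s prove these from Armstrong’s Axioms:
Union:
  XX → XY (augment X → Y with X), XY → YZ (augment X → Z with Y)
  XX → YZ (transitivity)
  X → YZ (since XX = X)
Pseudotransitivity:
  X → Y
  XW → WY (augmentation with W)
  WY → Z
  XW → Z (transitivity)
Decomposition:
  X → Y
  Y → Z (reflexivity, since Z ⊆ Y)
  X → Z (transitivity)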

Closure of a Set of FD’s
Defn. Let F be a set of FD’s. Its closure, F+, is the set of all FD’s:
  {X → Y | X → Y is derivable from F by Armstrong’s Axioms}
Which of the following are in the closure of our Student-Course FD’s?
  name → name
  cid → subj
  serno → subj
  cid, sid → subj
  cid → sid

Attribute Closures: Is Something Dependent on X?
Defn. The closure of an attribute set X, X+, is:
  X+ = ∪ {Y | X → Y ∈ F+}
This answers the question “is Y determined (transitively) by X?”; compute X+ by:
  closure := X;
  repeat until no change {
    if there is an FD U → V in F such that U ⊆ closure
      then add V to closure }
Does sid, serno → subj, exp-grade?
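The fixpoint loop above can be sketched in a few lines. This is an illustration under assumptions: FDs are represented as (left-hand side, right-hand side) pairs of Python sets, and the FD set used is the one listed on the later "Testing for Lossless Join" slide.

```python
def closure(attrs, fds):
    """Attribute closure X+ of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:  # repeat until no change
        changed = False
        for lhs, rhs in fds:
            # An FD U -> V fires once all of U is in the closure.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# sid -> name; serno -> cid, exp-grade; cid -> subj
F = [
    ({"sid"}, {"name"}),
    ({"serno"}, {"cid", "exp-grade"}),
    ({"cid"}, {"subj"}),
]

x_plus = closure({"sid", "serno"}, F)
print(sorted(x_plus))
# X -> Y is in F+ exactly when Y is a subset of X+.
print({"subj", "exp-grade"} <= x_plus)
```

Under this FD set the answer to the slide's question is yes: subj arrives transitively through cid, and the closure of {sid, serno} is the whole schema, so the pair is a superkey.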

Equivalence of FD sets
Defn. Two sets of FD’s, F and G, are equivalent if their closures are equal, F+ = G+
e.g., these two sets are equivalent: {XY → Z, X → Y} and {X → Z, X → Y}
F+ contains a huge number of FD’s (exponential in the size of the schema)
Would like to have smallest “representative” FD set

Minimal Cover
Defn. A FD set F is minimal if:
1. Every FD in F is of the form X → A, where A is a single attribute
2. For no X → A in F is F – {X → A} equivalent to F
3. For no X → A in F and Z ⊂ X is (F – {X → A}) ∪ {Z → A} equivalent to F
That is, we express each FD in simplest form and, in a sense, each FD is “essential” to the cover
Defn. F is a minimal cover for G if F is minimal and is equivalent to G.
e.g., {X → Z, X → Y} is a minimal cover for {XY → Z, X → Z, X → Y}
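One common way to compute a minimal cover follows the three conditions in order. The sketch below is an assumption-laden illustration, not the procedure from the slides: FDs are (lhs, rhs) pairs of sets, and the result depends on the order in which attributes and FDs are tried, so other minimal covers may exist.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (list of (lhs, rhs) set pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    # Condition 1: split right-hand sides into single attributes.
    g = [(frozenset(lhs), frozenset({a})) for lhs, rhs in fds for a in sorted(rhs)]
    g = list(dict.fromkeys(g))  # drop duplicates, keep order

    # Condition 3: drop left-hand-side attributes that are not needed.
    for i in range(len(g)):
        lhs, rhs = g[i]
        for b in sorted(lhs):
            smaller = g[i][0] - {b}
            if smaller and rhs <= closure(smaller, g):
                g[i] = (smaller, rhs)
    g = list(dict.fromkeys(g))

    # Condition 2: drop FDs that follow from the remaining ones.
    for fd in list(g):
        others = [f for f in g if f != fd]
        if fd[1] <= closure(fd[0], others):
            g = others
    return g


# The slide's example: {XY -> Z, X -> Z, X -> Y}
G = [({"X", "Y"}, {"Z"}), ({"X"}, {"Z"}), ({"X"}, {"Y"})]
cover = minimal_cover(G)
for lhs, rhs in cover:
    print(sorted(lhs), "->", sorted(rhs))
```

On the slide's example, Y is dropped from XY → Z because X alone already determines Z, which leaves the two FDs X → Z and X → Y.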

More on Closures
If F is a set of FD’s and X → Y ∉ F+, then for some attribute A ∈ Y, X → A ∉ F+
Proof by contradiction. Assume otherwise and let Y = {A1, ..., An}
Since we assume X → A1, ..., X → An are in F+, X → A1 ... An is in F+ by the union rule; hence X → Y is in F+, which is a contradiction

Why Armstrong’s Axioms?
Why are Armstrong’s axioms (or an equivalent rule set) appropriate for FD’s? They are:
• Consistent: any relation satisfying FD’s in F will satisfy those in F+
• Complete: if an FD X → Y cannot be derived by Armstrong’s axioms from F, then there exists some relational instance satisfying F but not X → Y
In other words, Armstrong’s axioms derive all the FD’s that should hold

Proving Consistency
We prove that the axioms’ definitions must be true for any instance, e.g.:
For augmentation (if X → Y then XW → YW):
If an instance satisfies X → Y, then:
• For any tuples t1, t2 ∈ r, if t1[X] = t2[X] then t1[Y] = t2[Y] by defn.
• If, additionally, it is given that t1[W] = t2[W], then t1[YW] = t2[YW]
So t1[XW] = t2[XW] implies t1[YW] = t2[YW], i.e., the instance satisfies XW → YW

Proving Completeness
Suppose X → Y ∉ F+ and define a relational instance r that satisfies F+ but not X → Y:
• Then for some attribute A ∈ Y, X → A ∉ F+
• Let some pair of tuples in r agree on X+ but disagree everywhere else:

X          | A     | X+ – X     | R – X+ – {A}
x1 ... xn  | a1,1  | v1 ... vm  | w1,1 w2,1 ...
x1 ... xn  | a1,2  | v1 ... vm  | w1,2 w2,2 ...

Proof of Completeness cont’d
Clearly this relation fails to satisfy X → A and X → Y. We also have to check that it satisfies any FD in F+.
The tuples agree only on X+.
Thus the only FD’s that might be violated are of the form X’ → Y’ where X’ ⊆ X+ and Y’ contains attributes outside X+ (A, or attributes in R – X+ – {A}).
But if X’ → Y’ ∈ F+ and X’ ⊆ X+ then Y’ ⊆ X+ (by reflexivity and transitivity: X → X+, X+ → X’, X’ → Y’ give X → Y’).
Therefore X’ → Y’ is satisfied.

Decomposition
Consider our original “bad” attribute set
  Stuff(sid, name, serno, subj, cid, exp-grade)
We could decompose it into
  Student(sid, name)
  Course(serno, cid)
  Subject(cid, subj)
But this decomposition loses information about the relationship between students and courses. Why?

Lossless Join Decomposition
R1, …, Rk is a lossless join decomposition of R w.r.t. an FD set F if for every instance r of R that satisfies F,
  πR1(r) ⋈ ... ⋈ πRk(r) = r
Consider:

sid  name   serno   subj  cid  exp-grade
1    Sam    520109  AI    520  B
23   Nitin  550109  DB    550  A

What if we decompose on (sid, name) and (serno, subj, cid, exp-grade)?
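To see what goes wrong, the projections can be joined back together by hand. This is a rough sketch assuming tuples are Python dicts; the helper names project, natural_join, and as_set are made up for the example, and the second decomposition (which keeps sid in both parts) is an added comparison, not one from the slides.

```python
def project(rows, attrs):
    """Projection onto `attrs`, with duplicates removed."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return [dict(zip(attrs, values)) for values in distinct]


def natural_join(r, s):
    """Natural join: pair up tuples that agree on all shared attributes."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in set(t) & set(u)):
                out.append({**t, **u})
    return out


def as_set(rows):
    """Order-insensitive view of a relation, for comparisons."""
    return {frozenset(row.items()) for row in rows}


r = [
    {"sid": 1, "name": "Sam", "serno": 520109, "subj": "AI", "cid": 520, "exp-grade": "B"},
    {"sid": 23, "name": "Nitin", "serno": 550109, "subj": "DB", "cid": 550, "exp-grade": "A"},
]

# No shared attribute: the join degenerates into a cross product.
lossy = natural_join(
    project(r, ["sid", "name"]),
    project(r, ["serno", "subj", "cid", "exp-grade"]),
)
print(len(lossy), as_set(lossy) == as_set(r))

# Keeping sid in both parts lets the join line the tuples up again.
rejoined = natural_join(
    project(r, ["sid", "name"]),
    project(r, ["sid", "serno", "subj", "cid", "exp-grade"]),
)
print(len(rejoined), as_set(rejoined) == as_set(r))
```

The first join returns four tuples, including Sam paired with the DB course, so the original instance cannot be recovered; "lossy" here means spurious tuples appear, not that rows go missing.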

Testing for Lossless Join
R1, R2 is a lossless join decomposition of R with respect to F iff at least one of the following dependencies is in F+
  (R1 ∩ R2) → R1 – R2
  (R1 ∩ R2) → R2 – R1
So for the FD set:
  sid → name
  serno → cid, exp-grade
  cid → subj
Is (sid, name) and (serno, subj, cid, exp-grade) a lossless decomposition?
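The test reduces to one attribute-closure computation, since (R1 ∩ R2) → Z is in F+ exactly when Z is contained in the closure of R1 ∩ R2. A minimal sketch, assuming FDs are (lhs, rhs) pairs of sets; the function name is_lossless_binary and the second decomposition are illustrative additions.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (list of (lhs, rhs) set pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless_binary(r1, r2, fds):
    """True iff (R1 ∩ R2) -> R1 - R2 or (R1 ∩ R2) -> R2 - R1 is in F+."""
    r1, r2 = set(r1), set(r2)
    common_plus = closure(r1 & r2, fds)
    return (r1 - r2) <= common_plus or (r2 - r1) <= common_plus


F = [
    ({"sid"}, {"name"}),
    ({"serno"}, {"cid", "exp-grade"}),
    ({"cid"}, {"subj"}),
]

# The slide's question: the two parts share no attributes.
print(is_lossless_binary(
    {"sid", "name"},
    {"serno", "subj", "cid", "exp-grade"},
    F,
))

# Hypothetical alternative that keeps sid on both sides.
print(is_lossless_binary(
    {"sid", "name"},
    {"sid", "serno", "subj", "cid", "exp-grade"},
    F,
))
```

With an empty intersection the closure is empty too, so neither dependency can be in F+ and the first decomposition is not lossless; in the alternative, sid → name does the job.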

Dependency Preservation
Ensures we can “easily” check whether a FD X → Y is violated during an update to a database:
• The projection of an FD set F onto a set of attributes Z, FZ, is
    {X → Y | X → Y ∈ F+, X ∪ Y ⊆ Z}
  i.e., it is those FDs local to Z’s attributes
• A decomposition R1, …, Rk is dependency preserving if F+ = (FR1 ∪ ... ∪ FRk)+
The decomposition hasn’t “lost” any essential FD’s, so we can check without doing a join

Example of Lossless and Dependency-Preserving Decompositions
Given relation scheme R(name, street, city, st, zip, item, price)
And FD set
  name → street, city
  street, city → st
  street, city → zip
  name, item → price
Consider the decomposition R1(name, street, city, st, zip) and R2(name, item, price)
Is it lossless? Is it dependency preserving?
What if we replaced the first FD by name, street → city?

Another Example
Given scheme: R(sid, fid, subj)
and FD set:
  fid → subj
  sid, subj → fid
Consider the decomposition R1(sid, fid) and R2(fid, subj)
Is it lossless? Is it dependency preserving?
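Dependency preservation can be checked without materializing F+ or the projected FD sets. The sketch below uses a standard textbook technique, which is an assumption here rather than something the slides spell out: for each FD X → Y, grow X using only closures restricted to one Ri at a time, and see whether Y is reached. FDs are (lhs, rhs) pairs of sets and the function name preserves is made up.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (list of (lhs, rhs) set pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves(fds, schemas):
    """True if every FD in `fds` follows from the FDs local to the schemas."""
    for lhs, rhs in fds:
        z = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in schemas:
                # What the FDs projected onto Ri can derive from Z ∩ Ri.
                gained = closure(z & set(ri), fds) & set(ri)
                if not gained <= z:
                    z |= gained
                    changed = True
        if not set(rhs) <= z:
            return False
    return True


# Example with R(name, street, city, st, zip, item, price)
F1 = [
    ({"name"}, {"street", "city"}),
    ({"street", "city"}, {"st"}),
    ({"street", "city"}, {"zip"}),
    ({"name", "item"}, {"price"}),
]
print(preserves(F1, [{"name", "street", "city", "st", "zip"}, {"name", "item", "price"}]))

# Example with R(sid, fid, subj)
F2 = [({"fid"}, {"subj"}), ({"sid", "subj"}, {"fid"})]
print(preserves(F2, [{"sid", "fid"}, {"fid", "subj"}]))
```

In the first example every FD already lives inside one of the two schemas. In the second, sid, subj → fid spans both R1 and R2 and cannot be rebuilt from the local FDs, so it could only be checked after a join.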

FD’s and Keys
Ideally, we want a design s.t. for each nontrivial dependency X → Y, X is a superkey for some relation schema in R
We just saw that this isn’t always possible
Hence we have two kinds of normal forms

Two Important Normal Forms
Boyce-Codd Normal Form (BCNF). For every relation scheme R and for every X → A that holds over R, either
• A ∈ X (it is trivial), or
• X is a superkey for R
Third Normal Form (3NF). For every relation scheme R and for every X → A that holds over R, either
• A ∈ X (it is trivial), or
• X is a superkey for R, or
• A is a member of some key for R

Normal Forms Compared
• BCNF is preferable, but sometimes in conflict with the goal of dependency preservation
• It’s strictly stronger than 3NF
Let’s see algorithms to obtain:
• A BCNF lossless join decomposition
• A 3NF lossless join, dependency preserving decomposition

BCNF Decomposition Algorithm (from Korth et al.; our book gives recursive version)
result := {R}
compute F+
while there is a schema Ri in result that is not in BCNF {
  let A → B be a nontrivial FD on Ri s.t. A → Ri is not in F+ and A and B are disjoint
  result := (result – {Ri}) ∪ {(Ri – B), (A, B)}
}
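A compact sketch of this loop. Rather than computing F+ up front, it looks for a violating FD by trying attribute subsets of each Ri and taking closures, which is exponential and only meant for small classroom examples. The function names and the choice of the largest possible B for each A are assumptions of this sketch, so other valid BCNF decompositions exist.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (list of (lhs, rhs) set pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(ri, fds):
    """Return (A, B) with A -> B nontrivial on Ri and A not a superkey of Ri, else None."""
    attrs = sorted(ri)
    for k in range(1, len(attrs)):
        for combo in combinations(attrs, k):
            a = set(combo)
            a_plus = closure(a, fds)
            b = (a_plus & ri) - a  # attributes of Ri determined by A, disjoint from A
            if b and not ri <= a_plus:
                return a, b
    return None


def bcnf_decompose(r, fds):
    result = [frozenset(r)]
    while True:
        for ri in result:
            violation = find_violation(ri, fds)
            if violation is not None:
                a, b = violation
                result.remove(ri)
                result.extend([ri - b, frozenset(a | b)])
                break  # restart the scan over the updated result
        else:
            return set(result)


# R(sid, fid, subj) with fid -> subj and sid, subj -> fid
F2 = [({"fid"}, {"subj"}), ({"sid", "subj"}, {"fid"})]
for schema in bcnf_decompose({"sid", "fid", "subj"}, F2):
    print(sorted(schema))
```

On R(sid, fid, subj) the violating FD is fid → subj, and the split produces (sid, fid) and (fid, subj): lossless, but sid, subj → fid is no longer checkable within a single table.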

3NF Decomposition Algorithm by Phil Bernstein, now @ MS Research
Let F be a minimal cover
i := 0
for each FD A → B in F {        (build dep.-preserving decomp.)
  if none of the schemas Rj, 1 ≤ j ≤ i, contains AB {
    increment i
    Ri := (A, B)
  }
}
if no schema Rj, 1 ≤ j ≤ i, contains a candidate key for R {        (ensure lossless decomp.)
  increment i
  Ri := any candidate key for R
}
return (R1, …, Ri)
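The synthesis steps can be sketched as follows, assuming the input is already a minimal cover given as (lhs, rhs) pairs of sets. The helper candidate_key, which shrinks the full attribute set until nothing more can be removed, is one simple way to obtain a key and is an assumption of this sketch; the A, B, C schema in the second call is hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (list of (lhs, rhs) set pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_key(r, fds):
    """Shrink R to a minimal superkey by dropping attributes one at a time."""
    key = set(r)
    for a in sorted(r):
        if closure(key - {a}, fds) >= set(r):
            key.discard(a)
    return key


def synthesize_3nf(r, cover):
    schemas = []
    # Build the dependency-preserving part: one schema per FD not yet covered.
    for lhs, rhs in cover:
        ab = set(lhs) | set(rhs)
        if not any(ab <= s for s in schemas):
            schemas.append(ab)
    # Ensure losslessness: some schema must contain a candidate key of R,
    # i.e. some schema must be a superkey of R.
    if not any(closure(s, cover) >= set(r) for s in schemas):
        schemas.append(candidate_key(r, cover))
    return schemas


R = {"name", "street", "city", "st", "zip", "item", "price"}
cover = [
    ({"name"}, {"street"}),
    ({"name"}, {"city"}),
    ({"street", "city"}, {"st"}),
    ({"street", "city"}, {"zip"}),
    ({"name", "item"}, {"price"}),
]
for schema in synthesize_3nf(R, cover):
    print(sorted(schema))

# Hypothetical schema where a key has to be added explicitly.
print(synthesize_3nf({"A", "B", "C"}, [({"A"}, {"B"})]))
```

In the first call (name, item, price) already contains the key name, item, so no extra schema is needed. Many textbook variants first merge FDs that share a left-hand side, which would fold (name, street) and (name, city) into one table; that refinement is left out here to stay close to the steps above.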

Summary
We can always decompose into 3NF and get:
• Lossless join
• Dependency preservation
But with BCNF we are only guaranteed lossless joins
BCNF is stronger than 3NF: every BCNF schema is also in 3NF
The BCNF algorithm is nondeterministic, so there is not a unique decomposition for a given schema R
