CS 564 Database Management Systems: Design and Implementation
Lecture 3: Schema Normalization
Chapter 19 in Cow Book
Slide ACKs: AnHai Doan, Jeff Naughton, and Jignesh Patel
Arun Kumar

Boring Logistics Announcements
1. Email your team information to the TA (Apul Jain: jain37@wisc.edu) by this Wed 09/16 before class.
   Teams of up to 3 persons (all from within Sec 2).
   Email me (arun@cs.wisc.edu) if you are unable to find teammates and do not want to work alone.
2. Project 1 will be released this Thu 09/17; due by Sun 10/04.
3. Discussion this Fri 09/18 here by the TA to give more examples of the concepts taught this week.
4. Discussion next Fri 09/25 by the TA on Project 1.

Recall the Netflix Example
[ER diagram with User (UserID, Name, Age, JoinDate), Movie (MovieID, Name, ReleaseDate, Director), and Rating (RatingID, NumStars, Timestamp)]

Recall the Netflix Example

Ratings
RatingID  NumStars  Timestamp  UserID  MovieID
1         3.5       08/27/15   79      20
…

Users
UserID  Name   Age  JoinDate
79      Alice  23   01/10/13
…

Movies
MovieID  Name       ReleaseDate  Director
20       Inception  07/13/2010   Christopher Nolan
…

Q: Why not this?

AllStuff
UserID  Name   Age  JoinDate  RatingID  NumStars  Timestamp  MovieID  Name       ReleaseYear  Director
79      Alice  23   01/10/13  1         3.5       08/27/15   20       Inception  2010         Christopher Nolan
…

MostStuff
RatingID  Stars  MovieID  Name       Year  Director
1         3.5    20       Inception  2010  Christopher Nolan
2         5.0    20       Inception  2010  Christopher Nolan
3         4.0    16       Avatar     2009  Jim Cameron
4         4.5    16       Avatar     2009  Jim Cameron
…

Data Redundancy!

Why is data redundancy a big deal?

Why Data Redundancy is Awful
1. Storage space wasted; $$$ loss
2. Anomalies in write ops:
   Update Anomalies
   Delete Anomalies
   Insert Anomalies

Update Anomaly Example

MostStuff
RatingID  Stars  MovieID  Name       Year  Director
1         3.5    20       Inception  2010  Christopher Nolan
2         5.0    20       Inception  2010  Christopher Nolan
3         4.0    16       Avatar     2009  Jim Cameron
4         4.5    16       Avatar     2009  Jim Cameron
…

Updating the director to “James Cameron” in only one of the Avatar rows leaves an inconsistency in the data!

Delete Anomaly Example

MostStuff
RatingID  Stars  MovieID  Name       Year  Director
1         3.5    20       Inception  2010  Christopher Nolan
2         5.0    20       Inception  2010  Christopher Nolan
3         4.0    16       Avatar     2009  Jim Cameron
4         4.5    16       Avatar     2009  Jim Cameron
…

If all ratings for Inception (rows 1 and 2) are deleted, data about Inception is permanently lost!

Insert Anomaly Example

MostStuff
RatingID  Stars  MovieID  Name       Year  Director
1         3.5    20       Inception  2010  Christopher Nolan
2         5.0    20       Inception  2010  Christopher Nolan
3         4.0    16       Avatar     2009  Jim Cameron
4         4.5    16       Avatar     2009  Jim Cameron
…
?         ?      91       Spectre    2015  Sam Mendes

A movie with no ratings yet has no values for RatingID and Stars. NULLs?!

Okay, so how to avoid redundancy?

Database Design Process: ~6 steps
1. Requirements Analysis
2. Conceptual Database Design
3. Logical Database Design
4. Schema Refinement (Normalization)
5. Physical Database Design
6. Application and Security Design

Overview of Schema Normalization
1. How to define and detect data redundancy?
   Functional Dependency (FD) theory
   Armstrong’s Axioms of FDs
2. How to remove data redundancy?
   Normal Forms (BCNF, 3NF)
   Decomposition of Schemas

Functional Dependency: Intuition

MostStuff
RatingID  Stars  MovieID  Name          Year  Director
1         3.5    20       Inception     2010  Christopher Nolan
2         5.0    20       Inception     2010  Christopher Nolan
3         4.0    16       Avatar        2009  Jim Cameron
4         4.5    16       Avatar        2009  Jim Cameron
5         5.0    39       Interstellar  2014  Christopher Nolan
…

MovieID “determines” Name, Year, Director

Functional Dependency: Intuition

MostStuff
RID  Stars  MID  Name          Year  Director           DAge
1    3.5    20   Inception     2010  Christopher Nolan  45
2    5.0    20   Inception     2010  Christopher Nolan  45
3    4.0    16   Avatar        2009  Jim Cameron        61
4    4.5    16   Avatar        2009  Jim Cameron        61
5    5.0    39   Interstellar  2014  Christopher Nolan  45
…

Director “determines” DAge

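Functional Dependency: Definition
An FD X → Y holds on a relation if any two tuples that agree on all attributes in X also agree on all attributes in Y.
MostStuff(RID, Stars, MID, Name, Year, Director, DAge)
RID → Stars, MID, Name, Year, Director, DAge
MID → Name, Year, Director, DAge
Director → DAge
FD is also an Integrity Constraint (IC)!
Schema vs. instance: an FD must hold on every legal instance of the schema; a single instance can only show that an FD is violated.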
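
To make the definition concrete, here is a minimal Python sketch that tests whether an FD holds on one given instance. The function name `fd_holds` and the list-of-dicts layout are illustrative assumptions, not part of the lecture; the rows are the MostStuff rows shown on the intuition slides.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on this instance (a list of dicts)."""
    seen = {}  # maps each lhs value combination to the rhs values first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on lhs but differ on rhs
    return True


COLUMNS = ["RID", "Stars", "MID", "Name", "Year", "Director", "DAge"]
most_stuff = [
    dict(zip(COLUMNS, values))
    for values in [
        (1, 3.5, 20, "Inception", 2010, "Christopher Nolan", 45),
        (2, 5.0, 20, "Inception", 2010, "Christopher Nolan", 45),
        (3, 4.0, 16, "Avatar", 2009, "Jim Cameron", 61),
        (4, 4.5, 16, "Avatar", 2009, "Jim Cameron", 61),
        (5, 5.0, 39, "Interstellar", 2014, "Christopher Nolan", 45),
    ]
]

print(fd_holds(most_stuff, ["MID"], ["Name", "Year", "Director", "DAge"]))
print(fd_holds(most_stuff, ["Director"], ["DAge"]))
print(fd_holds(most_stuff, ["Director"], ["Name"]))
```

The first two checks succeed and the third fails, because Christopher Nolan appears with two different movie names. A passing check only says the instance does not contradict the FD; whether the FD belongs in the schema is still an application decision.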

How do we know if a given schema has FDs in the first place?

FDs are a property of the application!
[ER diagram with User (UserID, Name, Age, JoinDate), Movie (MID, Name, Year, Director, DAge), and Rating (RID, Stars)]
“Easy-to-know” FDs: Use Key Constraints!

FDs are a property of the application!
“Easy-to-know” FDs: Use Key Constraints!
MostStuff(RID, Stars, MID, Name, Year, Director, DAge)
RID → Stars, MID, Name, Year, Director, DAge
MID → Name, Year, Director, DAge
But how do we know Director → DAge?
Application-specific! Must specify as part of Database Schema Design (Conceptual and Logical)

Sneak Peek: How do FDs help?
FDs help decompose (“Normalize”) the schema!
MostStuff(RID, Stars, MID, Name, Year, Director, DAge)
RID → Stars, MID, Name, Year, Director, DAge
MID → Name, Year, Director, DAge
Director → DAge
Decompose MostStuff into MostStuffNew(RID, Stars, MID) and Movies(MID, Name, Year, Director, DAge)
Then decompose Movies into MoviesNew(MID, Name, Year, Director) and Directors(Director, DAge)

Inferring more FDs and Closure
MostStuff(RID, Stars, MID, Name, Year, Director, DAge)
MID → Director
Director → DAge
Q: So, does MID → DAge?
Definition: Given a set of FDs S, the set of all FDs logically implied by S is called the closure of S (S⁺)
Q: How do we compute the closure?

Armstrong’s Axioms
Given any 3 sets of attributes X, Y, Z (on some schema):
1. Reflexivity Rule: if Y ⊆ X, then X → Y
2. Augmentation Rule: if X → Y, then XZ → YZ
3. Transitivity Rule: if X → Y and Y → Z, then X → Z
To get S⁺, repeatedly apply 1, 2, and 3 till no new FDs!
Theorem: These axioms are sound and complete

Derived Rules
1. Union Rule: if X → Y and X → Z, then X → YZ
2. Decomposition Rule: if X → YZ, then X → Y and X → Z
3. Pseudo-transitivity Rule: if X → Y and WY → Z, then WX → Z
Derivable from 3 basic rules; used for convenience

Attribute Closure
Definition: Given a set of FDs S and a set of attributes X, the set of all attributes functionally determined by X is called the closure of X (X⁺)
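
A common way to compute X⁺ is a fixpoint loop: start from X and keep adding the right-hand side of any FD whose left-hand side is already covered. The sketch below assumes FDs are represented as (lhs, rhs) pairs of Python sets; the name `attribute_closure` is illustrative.

```python
def attribute_closure(attrs, fds):
    """Return X+ for the attribute set `attrs` under `fds` (pairs of sets)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # An FD fires once its whole left-hand side is in the closure.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


# The two FDs from the "Inferring more FDs" slide.
FDS = [({"MID"}, {"Director"}), ({"Director"}, {"DAge"})]

print(sorted(attribute_closure({"MID"}, FDS)))       # ['DAge', 'Director', 'MID']
print(sorted(attribute_closure({"Director"}, FDS)))  # ['DAge', 'Director']
```

Since DAge is in the closure of MID, this answers the earlier question: MID → DAge follows from the two given FDs.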

Why is Attribute Closure helpful?
1. Test if X is a superkey of relation R
   Check if X⁺ contains all attributes of R
2. Test if a given FD X → Y holds without computing S⁺
   Check if X⁺ contains Y
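
Both tests reduce to one closure computation each. The following sketch shows them on MostStuff; the helper names `is_superkey` and `fd_follows` are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, relation, fds):
    """Test 1: X is a superkey of R iff X+ contains every attribute of R."""
    return closure(attrs, fds) >= set(relation)


def fd_follows(lhs, rhs, fds):
    """Test 2: X -> Y is implied by the FDs iff Y is contained in X+."""
    return set(rhs) <= closure(lhs, fds)


MOST_STUFF = {"RID", "Stars", "MID", "Name", "Year", "Director", "DAge"}
FDS = [
    ({"RID"}, {"Stars", "MID"}),
    ({"MID"}, {"Name", "Year", "Director"}),
    ({"Director"}, {"DAge"}),
]

print(is_superkey({"RID"}, MOST_STUFF, FDS))   # True
print(is_superkey({"MID"}, MOST_STUFF, FDS))   # False
print(fd_follows({"MID"}, {"DAge"}, FDS))      # True
print(fd_follows({"Director"}, {"MID"}, FDS))  # False
```

Note that the FD list here is deliberately smaller than the one on the slides; the closure recovers the longer right-hand sides by transitivity.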

Surprise review question!
Q: Given R(J,K,L) (assume no NULL), primary key J, and a given FD L → J. Is L a key? Why or why not?
Example: Student(StudentID, Name, SSN) with StudentID → StudentID, Name, SSN and SSN → StudentID
Yes it is! Two-line proof: Given J → JKL and L → J. By Transitivity Rule, L → JKL, so L is a superkey; since L is a single attribute, it is also minimal. □

Boyce-Codd Normal Form (BCNF)
BCNF is a desired “form” for a relation schema.
Given a relation schema R and a set of FDs S, we say R is in BCNF if for every FD X → A in S (X is a subset of R’s attributes and A is a single attribute), we have:
X → A is a trivial FD, or
X is a superkey for R

Review of Terminology
Example: Student(StudentID, Name, Age, SSN) with StudentID → Name, Age, SSN and SSN → StudentID
Key: A minimal set of attributes that uniquely identifies an entity
   If you drop any attribute(s), it will NOT be a key anymore!
Superkey: A (possibly strict) superset of some key
   It need NOT be minimal; there might be some attribute(s) you can drop and it will still remain a (super) key
All keys are superkeys, but not vice versa!
There are 2 keys: {StudentID} and {SSN}
Several superkeys possible: {StudentID,Name}, {SSN}, {SSN,StudentID,Age}, etc.

Review of Terminology
Example: Student(StudentID, Name, Age, SSN) with StudentID → Name, Age, SSN and SSN → StudentID
There are different “labels” for a key:
Candidate Key: A key (extra name just to confuse you)
Primary Key: A candidate key chosen to be the “primary” representative of the relation’s keys (underlined in the schema)
Alternate Key: A candidate key rejected as the “primary”!
Q. What are the candidate/primary/alternate keys here?

Review of Terminology
Example: R(ABCDEFGH) with ABC → EF, EG → FH, B → DG, and DH → ABCE
Q. What are the keys? What are the prime attributes?
Prime Attribute: Attribute that is a part of some (candidate) key
Non-Prime Attribute: Attribute that is NOT a prime attribute
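
One way to answer the question above is a brute-force search: try attribute sets from smallest to largest and keep those whose closure covers R and that contain no smaller key. This is a sketch that is exponential in the number of attributes, so it only suits small schemas; the function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(relation, fds):
    """Enumerate all minimal attribute sets whose closure is the whole relation."""
    attrs = sorted(relation)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) >= set(attrs):
                keys.append(candidate)
    return keys


R = "ABCDEFGH"
FDS = [
    (set("ABC"), set("EF")),
    (set("EG"), set("FH")),
    (set("B"), set("DG")),
    (set("DH"), set("ABCE")),
]

keys = candidate_keys(R, FDS)
prime = set().union(*keys)
print(sorted("".join(sorted(k)) for k in keys))  # ['ABC', 'BE', 'BH', 'DEG', 'DH']
print(sorted(set(R) - prime))                    # ['F']
```

For this example the search finds five keys (ABC, BE, BH, DEG, DH), so every attribute except F is prime.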

BCNF Example
MostStuff(RID, Stars, MID, Name, Year, Director, DAge)
F1: Director → DAge
F2: MID → Name, Year, Director, DAge
F3: RID → Stars, MID, Name, Year, Director, DAge
Q: Is MostStuff in BCNF?
X → A is a trivial FD or X is a superkey for R
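
The BCNF condition can be checked mechanically for the FDs stated on a relation. The sketch below returns the listed FDs that violate it; `bcnf_violations` is an illustrative name. It only inspects the given FDs, which is enough for the relation they were stated on but not for a decomposed relation, where implied FDs must also be considered.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Return the listed FDs that are non-trivial and whose LHS is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        trivial = rhs <= lhs
        superkey = closure(lhs, fds) >= set(relation)
        if not trivial and not superkey:
            violations.append((lhs, rhs))
    return violations


MOST_STUFF = {"RID", "Stars", "MID", "Name", "Year", "Director", "DAge"}
F1 = ({"Director"}, {"DAge"})
F2 = ({"MID"}, {"Name", "Year", "Director", "DAge"})
F3 = ({"RID"}, {"Stars", "MID", "Name", "Year", "Director", "DAge"})

for lhs, rhs in bcnf_violations(MOST_STUFF, [F1, F2, F3]):
    print(sorted(lhs), "->", sorted(rhs))
```

F1 and F2 are reported because neither Director nor MID determines RID and Stars; F3 is fine because RID is a key. So MostStuff is not in BCNF.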

BCNF Example
MostStuffNew(RID, Stars, MID)
F1: RID → Stars, MID
Movies(MID, Name, Year, Director, DAge)
G1: Director → DAge
G2: MID → Name, Year, Director, DAge
Q: Is either table in BCNF?
X → A is a trivial FD or X is a superkey for R

BCNF Example
MostStuffNew(RID, Stars, MID)
F1: RID → Stars, MID
MoviesNew(MID, Name, Year, Director)
G1: MID → Name, Year, Director
Directors(Director, DAge)
H1: Director → DAge
X → A is a trivial FD or X is a superkey for R
All relations are in BCNF!

Okay, how to “decompose” a relation schema in general?

Decomposition of Schemas
Two important properties in decomposing a relation schema:
1. Lossless Join property
   Given any instance, can we recover the exact original instance by “joining” the decomposed instances?
2. Dependency Preservation property
   Given a set of FDs and a decomposition, can we verify all FDs using the decomposed relations individually?

What is a “Lossy Join” Decomp.?
Suppose we decompose R(J,K,L) as follows:

R(J,K,L)
J   K  L
10  x  4
20  y  9
30  x  8

Decompose into:

R1(J,K)
J   K
10  x
20  y
30  x

R2(K,L)
K  L
x  4
y  9
x  8

Reconstruct by matching on K (“join”):

R’(J,K,L)
J   K  L
10  x  4
20  y  9
30  x  8
10  x  8
30  x  4

Inconsistent Data! “Lossy”!
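
The lossy reconstruction can be reproduced in a few lines of Python. This sketch stores each relation as a set of tuples and joins R1 and R2 on K; the variable names are illustrative.

```python
# Original instance of R(J, K, L).
R = {(10, "x", 4), (20, "y", 9), (30, "x", 8)}

# Project onto (J, K) and (K, L); sets remove duplicate tuples automatically.
R1 = {(j, k) for (j, k, l) in R}
R2 = {(k, l) for (j, k, l) in R}

# Natural join of R1 and R2 on the shared attribute K.
joined = {(j, k, l) for (j, k) in R1 for (k2, l) in R2 if k == k2}

spurious = joined - R
print(len(joined))       # 5
print(sorted(spurious))  # [(10, 'x', 8), (30, 'x', 4)]
```

No original tuple is missing from the join; the decomposition is "lossy" because the join contains extra tuples, so we can no longer tell which tuples were really in R.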

BCNF Decomposition is Lossless!
BCNF decomposition satisfies the “Lossless Join” property
X → A is a trivial FD or X is a superkey for R
[Figure labels: “FD is X → Y”; “No FDs”]
Q: What about a general set of FDs? “Chase Algorithm”
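
For a decomposition into exactly two relations there is a standard textbook test that avoids the chase: the split of R into R1 and R2 is lossless if the shared attributes functionally determine all of R1 or all of R2. The sketch below applies that test; it is an illustration of the textbook criterion, not a transcription of the slide's figure, and `is_lossless_binary` is a hypothetical name.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_lossless_binary(r1, r2, fds):
    """Lossless iff (R1 ∩ R2)+ contains all of R1 or all of R2."""
    r1, r2 = set(r1), set(r2)
    common_plus = closure(r1 & r2, fds)
    return common_plus >= r1 or common_plus >= r2


# The lossy example: no FD lets K determine either side.
print(is_lossless_binary("JK", "KL", []))                    # False

# Splitting on an FD J -> K, as the BCNF algorithm does, is lossless.
print(is_lossless_binary("JK", "JL", [(set("J"), set("K"))]))  # True
```

Splitting R on a violating FD X → Y always passes this test, because the shared attributes X determine the whole relation that holds X and Y.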

What is “Dependency Preservation”?
Given a decomposition, none of the FDs “fall through the cracks”, i.e., we can still recover all the FDs
S⁺ == (S₁ ∪ S₂)⁺ ? Here Sᵢ is the set of FDs in S⁺ that involve only the attributes of Rᵢ.
Otherwise, information about FDs has been lost!

Example for “Dependency Loss”
Given FDs: J → K and KL → J
Suppose we decompose R(J,K,L) as follows:

R(J,K,L)
J   K  L
10  x  4
20  y  9
30  x  8
10  x  2
20  y  1

R1(J,K)
J   K
10  x
20  y
30  x

R2(K,L)
K  L
x  4
y  9
x  8
x  2
y  1

J → K can still be checked on R1. But, KL → J is lost!
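
Whether a particular FD survives a decomposition can be tested without computing S⁺. The sketch below uses a well-known textbook procedure: grow the left-hand side using only what each decomposed relation can "see", then check whether the right-hand side was reached. The function name `is_preserved` is an assumption for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_preserved(lhs, rhs, decomposition, fds):
    """Check whether lhs -> rhs can be enforced on the decomposed relations alone."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            # Only attributes inside Ri may be used and gained in this step.
            gained = closure(result & ri, fds) & ri
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


FDS = [(set("J"), set("K")), (set("KL"), set("J"))]
DECOMP = [set("JK"), set("KL")]

print(is_preserved(set("J"), set("K"), DECOMP, FDS))   # True
print(is_preserved(set("KL"), set("J"), DECOMP, FDS))  # False
```

Running the same check on the other two binary splits of R(J,K,L), namely {JL, KL} and {JK, JL}, shows that each of them also loses one of the two FDs.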

A bit of a conundrum!
Given FDs on R(J,K,L): J → K and KL → J
So, how do we decompose R(J,K,L)?
R(J,K,L) into R1(J,K) and R2(K,L)
R(J,K,L) into R1(J,L) and R2(K,L)
R(J,K,L) into R1(J,K) and R2(J,L)
No way out! Sometimes, redundancy is unavoidable if we want to satisfy both Lossless Join and Dependency Preservation!
Saving grace: above situation is rare in the “real-world”!
But, BCNF can always ensure Lossless Join Decomposition

Algorithm to Compute BCNF
Input: Relation R and set of FDs S
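
The steps of the slide's algorithm did not survive in this transcript, so the following is a sketch of the standard textbook procedure rather than the lecture's exact wording: find an attribute set X that determines something new but is not a superkey, split the relation into X⁺ and X plus the rest, and recurse. The search over subsets is exponential and meant only for small schemas; all function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def find_violation(relation, fds):
    """Return (X, X+ restricted to relation) for some BCNF violation, else None."""
    attrs = sorted(relation)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = set(combo)
            x_plus = closure(x, fds) & relation
            # Violation: X determines something new but not the whole relation.
            if x_plus != x and x_plus != relation:
                return x, x_plus
    return None


def bcnf_decompose(relation, fds):
    """Recursively split `relation` until every piece is in BCNF."""
    relation = set(relation)
    violation = find_violation(relation, fds)
    if violation is None:
        return [relation]
    x, x_plus = violation
    r1 = x_plus                   # X together with everything it determines
    r2 = x | (relation - x_plus)  # X plus the attributes it does not determine
    return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)


MOST_STUFF = {"RID", "Stars", "MID", "Name", "Year", "Director", "DAge"}
FDS = [
    ({"Director"}, {"DAge"}),
    ({"MID"}, {"Name", "Year", "Director", "DAge"}),
    ({"RID"}, {"Stars", "MID", "Name", "Year", "Director", "DAge"}),
]

for piece in bcnf_decompose(MOST_STUFF, FDS):
    print(sorted(piece))
```

On MostStuff this yields the same three relations as the BCNF example slides: Directors(Director, DAge), MoviesNew(MID, Name, Year, Director), and MostStuffNew(RID, Stars, MID). Other orders of picking violations can give different, equally valid results on other schemas.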

Are other decompositions possible?

Hierarchies of “Normal Forms”
From least to most restrictive: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
• First Normal Form (1NF): Simply means that all attribute values are “atomic”
• Second Normal Form (2NF): Practically no one cares about this one anymore!
• Third Normal Form (3NF): Probably the most widely used one in practice!
• BCNF: Whenever possible, use this over 3NF
• Fourth Normal Form (4NF) and Fifth Normal Form (5NF): see more advanced topics in the book!

Third Normal Form (3NF)
Generally popular when BCNF is not possible
Given a relation schema R and a set of FDs F, we say R is in 3NF if for every FD X → A in F (X is a subset of R’s attributes and A is a single attribute), we have:
X → A is a trivial FD, or
X is a superkey for R, or
A is part of some key for R
The first two conditions alone define BCNF; the third is what 3NF adds.

Example for 3NF vs BCNF
Suppose we have R(J,K,L) as follows:

R(J,K,L)
J   K  L
10  x  4
20  y  9
30  x  8
10  x  2
20  y  1

Given FDs: J → K and KL → J
BCNF: X → A is a trivial FD or X is a superkey for R
3NF: either of the above, or A is part of some key for R
KL is a key
J → K violates BCNF because J is not a superkey, but K is part of the key KL.
Thus, R is NOT in BCNF, but it is in 3NF!
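
The comparison above can be verified with a small checker that first finds the candidate keys and then tests each listed FD against both definitions. This is a sketch with illustrative names; it checks only the given FDs and uses a brute-force key search suitable for small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    attrs = sorted(relation)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) >= set(attrs):
                keys.append(candidate)
    return keys


def normal_form_report(relation, fds):
    """Return (in_bcnf, in_3nf) judged on the listed FDs."""
    prime = set().union(*candidate_keys(relation, fds))
    in_bcnf = in_3nf = True
    for lhs, rhs in fds:
        if closure(lhs, fds) >= set(relation):
            continue  # LHS is a superkey: fine for both forms
        for a in rhs - lhs:  # each non-trivial X -> A
            in_bcnf = False
            if a not in prime:
                in_3nf = False
    return in_bcnf, in_3nf


FDS = [(set("J"), set("K")), (set("KL"), set("J"))]
print(candidate_keys("JKL", FDS))      # keys JL and KL
print(normal_form_report("JKL", FDS))  # (False, True)
```

Besides KL, the search also finds JL as a key, so all three attributes are prime here and no FD can violate 3NF.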

Another Example
MostStuffNew(RID, Stars, MID)
F1: RID → Stars, MID
MoviesNew(MID, Name, Year, Director)
G1: MID → Name, Year, Director
Directors(Director, DAge)
H1: Director → DAge
BCNF: X → A is a trivial FD or X is a superkey for R
3NF: either of the above, or A is part of some key for R
Q: Is this in 3NF?

When is 3NF violated?
X → A is a trivial FD or X is a superkey for R or A is part of some key for R
Given a non-trivial FD X → A on R that violates 3NF:
Case 1: X is a proper subset of some key in R (Partial Dependency)
Case 2: X is NOT a proper subset of any key in R (Transitive Dependency: Key → X → A)
3NF tolerates more redundancy than BCNF!

Algorithm to Compute 3NF
• Use the algorithm for BCNF! (Typically, can stop earlier)
• What if a dependency X → A is not preserved? Simple: just include (X,A) as another relation! 3NF allows for this redundancy!
• An efficiency improvement to this algorithm is possible: instead of using S as such, use its “Canonical Cover”

Canonical Cover (aka Minimal Cover)
• Given set of FDs S, its canonical cover M is s.t.:
  1. S⁺ = M⁺
  2. Deleting any FD, or deleting any attributes from any FD in M changes M⁺
  3. LHS of each FD in M is unique
• Minimal cover enables a more efficient Lossless-Join and Dependency-Preserving decomposition into 3NF

Canonical Cover: Example
1. S⁺ = M⁺
2. Deleting any FD, or deleting any attributes from any FD in M changes M⁺
3. LHS of each FD in M is unique
Example: R(ABCDEFGH)
S is {A → B, ABCD → E, EF → GH, ACDF → EG}
M is {A → B, ACD → E, EF → GH}

A Procedure for Canonical Cover
1. Standardization of RHS
   Rewrite all FDs with 1 attribute on RHS
2. Minimization of LHS
   For each FD, can the LHS be reduced without affecting the closure?
3. Deletion of Redundant FDs
   For each FD, can it be dropped without affecting the closure?
Canonical cover is NOT unique: the order of picking FDs in the above steps affects the output!
Also, we can combine FDs with the same LHS in the end.
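
The three steps above, plus the final merge by LHS, can be sketched in Python as follows. The function name `canonical_cover` and the fixed processing order (FDs in input order, attributes alphabetically) are assumptions of this sketch; a different order can produce a different but equally valid cover.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def canonical_cover(fds):
    """Return a canonical cover as a dict mapping each LHS to its merged RHS."""
    # Step 1: rewrite every FD with a single attribute on the RHS.
    cover = [(frozenset(lhs), frozenset(a)) for lhs, rhs in fds for a in sorted(rhs)]

    # Step 2: drop LHS attributes that are not needed to derive the RHS.
    for i in range(len(cover)):
        lhs, rhs = cover[i]
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            if reduced and rhs <= closure(reduced, cover):
                lhs = reduced
                cover[i] = (lhs, rhs)

    # Step 3: drop FDs that the remaining FDs already imply.
    i = 0
    while i < len(cover):
        lhs, rhs = cover[i]
        rest = cover[:i] + cover[i + 1:]
        if rhs <= closure(lhs, rest):
            cover = rest
        else:
            i += 1

    # Finally, combine FDs that share the same LHS.
    merged = {}
    for lhs, rhs in cover:
        merged[lhs] = merged.get(lhs, frozenset()) | rhs
    return merged


S = [
    (set("A"), set("B")),
    (set("ABCD"), set("E")),
    (set("EF"), set("GH")),
    (set("ACDF"), set("EG")),
]

for lhs, rhs in canonical_cover(S).items():
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

On the example from the previous slide this produces A → B, ACD → E, and EF → GH, matching M.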

More advanced stuff in book
• 4NF, Multi-valued Dependencies (MVDs), and Embedded Multi-valued Dependencies (EMVDs)
• 5NF, and Join Dependencies
• Inclusion Dependencies

Review: Schema Normalization
1. What is data redundancy and why is it harmful?
   Update, Delete, and Insert anomalies
2. How to define and detect data redundancy?
   Functional Dependency (FD) theory
   Armstrong’s Axioms of FDs
3. How to decompose schemas to remove data redundancy?
   Normal Forms (BCNF, 3NF)
   Lossless-Join and Dependency-Preserving Decomposition
   Algorithm to get BCNF, Canonical Cover, and 3NF

How useful is all of this normalization stuff in the “real-world”?

Normalization in the “Real-World”
• Usually, people aim for BCNF and “settle” for 3NF
• Normalization used widely for “write-heavy” workloads
• “Denormalization” popular for “read-only/mostly” workloads
  Sneak peek: “Analytical” SQL queries on old-ish data
  Called a “Data Warehouse”/“Data Lake” environment
AllStuff(UserID, Name, Age, JoinDate, RatingID, NumStars, Timestamp, MovieID, Name, ReleaseYear, Director)
So, Netflix probably uses just AllStuff for its analytics!
