Temple University – CIS Dept. CIS331 – Principles of Database Systems V. Megalooikonomou Database Design and Normalization (based on notes by Silberschatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)
Overview: Relational model; formal query languages; commercial query languages (SQL); Integrity constraints (domain I.C., foreign keys, functional dependencies); Functional Dependencies; DB design and normalization
Overview - detailed DB design and normalization pitfalls of bad design decomposition normal forms
Goal: Design ‘good’ tables. sub-goal #1: define what ‘good’ means; sub-goal #2: fix ‘bad’ tables. In short: “we want tables where the attributes depend on the primary key, on the whole key, and nothing but the key”. Let’s see why, and how:
Pitfalls takes1 (ssn, c-id, grade, name, address) sample row: Ssn = 123, c-id = cs331, Grade = A, Name = smith, Address = Main
Pitfalls ‘Bad’ - why? because: ssn->address, name
Pitfalls Redundancy space (inconsistencies) insertion/deletion anomalies:
Pitfalls insertion anomaly: “jones” registers, but takes no class - no place to store his address!
Pitfalls deletion anomaly: delete the last record of ‘smith’ (we lose his address!)
Solution: decomposition split offending table in two (or more), e.g.: ??
Overview - detailed DB design and normalization pitfalls of bad design decomposition lossless join dependency preserving normal forms
Decompositions there are ‘bad’ decompositions we want: lossless and dependency preserving
Decompositions - lossy: R1(ssn, grade, name, address) R2(c-id, grade), with FDs ssn -> name, address and ssn, c-id -> grade. R2 rows (c-id, Grade): (cs331, A), (cs351, B), (cs211, A)
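Decompositions - lossy: cannot recover the original table with a join! The two tables share only grade, so the join pairs every ‘A’ row of R1 with both cs331 and cs211. FDs: ssn -> name, address; ssn, c-id -> grade. R2 rows (c-id, Grade): (cs331, A), (cs351, B), (cs211, A)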
Decompositions – lossy: Another example Decomposition of R = (A, B) into R1 = (A), R2 = (B) (figure: a relation r, its projections Π_A(r) and Π_B(r), and the join Π_A(r) ⋈ Π_B(r), which is not equal to r)
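The R = (A, B) example above can be replayed on a small instance. This is a minimal sketch: the tuple values are hypothetical (the values from the original figure are not recoverable), and `project` and `natural_join` are illustrative helper names, not part of the original notes.

```python
def project(rows, attrs):
    """Projection: keep only attrs and drop duplicate tuples."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}

def natural_join(left, right):
    """Natural join of two relations given as sets of (attr, value) tuples."""
    out = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            # Tuples match when they agree on every shared attribute.
            # With no shared attributes this degenerates to a cross product.
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                out.add(tuple(sorted({**ld, **rd}.items())))
    return out

# Hypothetical instance of R = (A, B).
r = [{"A": "x", "B": 1}, {"A": "x", "B": 2}, {"A": "y", "B": 1}]
original = {tuple(sorted(row.items())) for row in r}

# Decompose into R1 = (A), R2 = (B) and join back.
joined = natural_join(project(r, ["A"]), project(r, ["B"]))
spurious = joined - original  # tuples that were never in r

print(len(original), len(joined), spurious)
```

The join returns every original tuple plus a spurious one, so the original instance cannot be told apart from other instances with the same projections. That is what "lossy" means here: information is lost even though tuples are gained.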
Decompositions example of non-dependency preserving: original FDs S# -> address, status and address -> status; the decomposed tables carry S# -> address and S# -> status
Decompositions is it lossless? Original FDs S# -> address, status and address -> status; the decomposed tables carry S# -> address and S# -> status
Decompositions - lossless Definition: Consider schema R, with FD ‘F’. R1, R2 is a lossless join decomposition of R if, for every instance r of R that satisfies F, we always have: Π_R1(r) ⋈ Π_R2(r) = r. An easier criterion?
Decomposition - lossless Theorem: a decomposition is a lossless join decomposition if the joining attribute is a superkey in at least one of the new tables. Formally: (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 is in F+
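The superkey criterion above can be checked mechanically with attribute closure. The following is a minimal sketch for a two-table decomposition; the function names `closure` and `is_lossless_binary` and the representation of FDs as (left, right) set pairs are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless_binary(r1, r2, fds):
    """True if the shared attributes determine all of r1 or all of r2."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure

# FDs of the takes1 example.
fds = [({"ssn"}, {"name", "address"}), ({"ssn", "c-id"}, {"grade"})]

# The lossy split: the tables share only grade.
lossy_ok = is_lossless_binary(
    {"ssn", "grade", "name", "address"}, {"c-id", "grade"}, fds)

# The lossless split: the tables share ssn, a key of (ssn, name, address).
good_ok = is_lossless_binary(
    {"ssn", "c-id", "grade"}, {"ssn", "name", "address"}, fds)

print(lossy_ok, good_ok)
```

The check uses closure of the shared attributes instead of computing F+, which keeps it polynomial.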
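Decomposition - lossless example: original table with FDs ssn -> name, address and ssn, c-id -> grade, decomposed into R1 and R2. One table holds (Ssn, c-id, Grade) with rows (123, cs331, A), (123, cs351, B), (234, cs211, A) and carries ssn, c-id -> grade; the other carries ssn -> name, address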
Overview - detailed DB design and normalization pitfalls of bad design decomposition lossless join decomp. dependency preserving normal forms
Decomposition - depend. pres. informally: we don’t want the original FDs to span two tables - counter-example: original FDs S# -> address, status and address -> status; the decomposed tables carry S# -> address and S# -> status
Decomposition - depend. pres. dependency preserving decomposition: original FDs S# -> address, status and address -> status; the decomposed tables carry S# -> address and address -> status (but: S# -> status ?)
Decomposition - depend. pres. informally: we don’t want the original FDs to span two tables. More specifically: … the FDs of the canonical cover. Let Fi be the set of dependencies in F+ that include only attributes in Ri. Preferably the decomposition should be dependency preserving, that is, (F1 ∪ F2 ∪ … ∪ Fn)+ = F+. Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive
Decomposition - depend. pres. why is dependency preservation good? Compare the decomposition carrying S# -> address and address -> status with the one carrying S# -> address and S# -> status (address -> status: ‘lost’)
Decomposition - depend. pres. A: e.g., record that ‘Philly’ has status ‘A’. Compare the decomposition carrying S# -> address and address -> status with the one carrying S# -> address and S# -> status (address -> status: ‘lost’)
Decomposition - depend. pres. To check if a dependency α -> β is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done w.r.t. F):
    result = α
    while (changes to result) do
        for each Ri in the decomposition
            t = (result ∩ Ri)+ ∩ Ri
            result = result ∪ t
If result contains all attributes in β, then the functional dependency α -> β is preserved. We apply the test on all dependencies in F to check if a decomposition is dependency preserving. The test takes polynomial time, whereas computing F+ and (F1 ∪ F2 ∪ … ∪ Fn)+ needs exponential time
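The test above translates almost line for line into code. This is a minimal sketch; `closure`, `is_preserved` and `is_dependency_preserving` are illustrative names, and FDs are represented as (left, right) set pairs. It is applied to the S# example from the earlier slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_preserved(lhs, rhs, decomposition, fds):
    """Run the polynomial-time test for a single dependency lhs -> rhs."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            ri = set(ri)
            t = closure(result & ri, fds) & ri  # (result ∩ Ri)+ ∩ Ri
            if not t <= result:
                result |= t
                changed = True
    return set(rhs) <= result

def is_dependency_preserving(decomposition, fds):
    """A decomposition is dependency preserving if every FD in F passes."""
    return all(is_preserved(l, r, decomposition, fds) for l, r in fds)

fds = [({"S#"}, {"address", "status"}), ({"address"}, {"status"})]

# Split on S# in both tables: address -> status spans two tables.
bad = is_dependency_preserving([{"S#", "address"}, {"S#", "status"}], fds)

# Split along address -> status: S# -> status is still implied transitively.
good = is_dependency_preserving([{"S#", "address"}, {"address", "status"}], fds)

print(bad, good)
```

The second call also settles the question of S# -> status in the dependency-preserving split: it is not stored in either table, but the test recovers it from S# -> address and address -> status.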
Decomposition - conclusions: decompositions should always be lossless (joining attribute -> superkey); whenever possible, we want them to be dependency preserving (occasionally impossible - see ‘STJ’ example later…)
Normalization using FD When decomposing a relation schema R with a set of functional dependencies F into R1, R2, …, Rn we want: Lossless-join decomposition: otherwise … information loss. No redundancy: relations Ri preferably should be in either Boyce-Codd Normal Form or Third Normal Form. Dependency preservation: Let Fi be the set of dependencies in F+ that include only attributes in Ri. Preferably the decomposition should be dependency preserving, i.e., (F1 ∪ F2 ∪ … ∪ Fn)+ = F+. Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive
Normalization using FD - Example R = (A, B, C), F = {A -> B, B -> C}. Option 1: R1 = (A, B), R2 = (B, C). Lossless-join decomposition: R1 ∩ R2 = {B} and B -> BC. Dependency preserving. Option 2: R1 = (A, B), R2 = (A, C). Lossless-join decomposition: R1 ∩ R2 = {A} and A -> AB. Not dependency preserving (cannot check B -> C without computing R1 ⋈ R2)
Overview - detailed DB design and normalization pitfalls of bad design decomposition ( how to fix the problem) normal forms ( how to detect the problem) BCNF, 3NF, (1NF, 2NF)
Normal forms - BCNF We saw how to fix ‘bad’ schemas - but what is a ‘good’ schema? Answer: ‘good’, if it obeys a ‘normal form’, i.e., a set of rules Typically: Boyce-Codd Normal Form (BCNF)
Normal forms - BCNF Defn.: Rel. R is in BCNF w.r.t. F, if informally: everything depends on the full key, and nothing but the key semi-formally: every determinant (of the cover) is a candidate key
Normal forms - BCNF Example and counter-example: ssn->name, address ssn, c-id -> grade
Normal forms - BCNF Formally: for every FD a->b in F+, either a->b is trivial (a is a superset of b), or a is a superkey (or both)
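The formal condition above can be checked with attribute closure. This is a minimal sketch with illustrative names (`closure`, `is_bcnf`). It relies on a standard shortcut that is an addition to the slide: when testing the original schema R (as opposed to a table produced by decomposition), it is enough to check the FDs in F instead of all of F+.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """Every FD in fds must be trivial or have a superkey on its left side."""
    schema = set(schema)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        trivial = rhs <= lhs
        superkey = closure(lhs, fds) >= schema
        if not (trivial or superkey):
            return False
    return True

# takes1: ssn -> name, address violates BCNF because ssn alone is not a key.
takes1_fds = [({"ssn"}, {"name", "address"}), ({"ssn", "c-id"}, {"grade"})]
takes1_ok = is_bcnf({"ssn", "c-id", "grade", "name", "address"}, takes1_fds)

# A table holding only (ssn, name, address) with ssn -> name, address is fine.
student_ok = is_bcnf({"ssn", "name", "address"},
                     [({"ssn"}, {"name", "address"})])

print(takes1_ok, student_ok)
```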
Normal forms - BCNF Theorem: given a schema R and a set of FD ‘F’, we can always decompose it to schemas R1, … Rn, so that R1, … Rn are in BCNF and the decomposition is lossless (…but, some decomp. might lose dependencies)
BCNF Decomposition How? ….essentially, break off FDs of the cover eg. TAKES1(ssn, c-id, grade, name, address) ssn -> name, address ssn, c-id -> grade
Normal forms - BCNF eg. TAKES1(ssn, c-id, grade, name, address) ssn -> name, address ssn, c-id -> grade (figure: dependency diagram over name, address, grade, c-id, ssn)
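Normal forms - BCNF original table with FDs ssn -> name, address and ssn, c-id -> grade, split in two. One table holds (Ssn, c-id, Grade) with rows (123, cs331, A), (123, cs351, B), (234, cs211, A) and carries ssn, c-id -> grade; the other carries ssn -> name, address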
Normal forms - BCNF pictorially: we want a ‘star’ shape (figure: dependency diagram over name, address, grade, c-id, ssn: not in BCNF)
Normal forms - BCNF pictorially: we want a ‘star’ shape (figure: example ‘star’ diagrams; node labels A, B, C, D, E, F, G, H)
Normal forms - BCNF or a star-like: (e.g., 2 cand. keys): STUDENT(ssn, st#, name, address) (figure: two equivalent diagrams over name, address, ssn, st#)
Normal forms - BCNF but not: (figure: counter-example diagrams; node labels A, B, C, D, E, F, G, H)
BCNF Decomposition
    result := {R}; done := false; compute F+;
    while (not done) do
        if (there is a schema Ri in result that is not in BCNF) then begin
            let α -> β be a nontrivial functional dependency that holds on Ri such that α -> Ri is not in F+, and α ∩ β = ∅;
            result := (result – Ri) ∪ (Ri – β) ∪ (α, β);
        end
        else done := true;
Note: each Ri is in BCNF, and decomposition is lossless-join
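The decomposition loop above can be sketched in code as follows. Instead of materializing F+, this version looks for a violating α by trying subsets of each Ri and computing their closures, which is exponential in the size of Ri but fine for classroom-sized schemas. The names `closure`, `find_violation` and `bcnf_decompose` are illustrative assumptions.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(ri, fds):
    """Return (alpha, beta) with alpha -> beta nontrivial on ri and alpha
    not a superkey of ri, or None if ri is in BCNF."""
    ri = set(ri)
    for size in range(1, len(ri)):
        for combo in combinations(sorted(ri), size):
            alpha = set(combo)
            beta = (closure(alpha, fds) & ri) - alpha  # disjoint from alpha
            if beta and not (alpha | beta) >= ri:
                return alpha, beta
    return None

def bcnf_decompose(schema, fds):
    """Split schemas on violating FDs until every schema is in BCNF."""
    result = [frozenset(schema)]
    done = False
    while not done:
        done = True
        for ri in result:
            violation = find_violation(ri, fds)
            if violation:
                alpha, beta = violation
                result.remove(ri)
                result.append(frozenset(ri - beta))       # Ri - beta
                result.append(frozenset(alpha | beta))    # (alpha, beta)
                done = False
                break  # restart the scan over the updated list
    return set(result)

takes1 = bcnf_decompose(
    {"ssn", "c-id", "grade", "name", "address"},
    [({"ssn"}, {"name", "address"}), ({"ssn", "c-id"}, {"grade"})])

stj = bcnf_decompose({"S", "T", "J"},
                     [({"T"}, {"J"}), ({"S", "J"}, {"T"})])

print(takes1, stj)
```

On TAKES1 the sketch breaks off ssn -> name, address, giving (ssn, name, address) and (ssn, c-id, grade). On STJ it yields (T, J) and (S, T), which is lossless but no longer lets S,J -> T be checked inside a single table.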
Normal forms - 3NF consider the ‘classic’ case: STJ(Student, Teacher, subJect) T -> J; S,J -> T. Is it BCNF?
Normal forms - 3NF STJ(Student, Teacher, subJect) T -> J; S,J -> T. How to decompose it to BCNF?
Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T 1) R1(T,J) R2(S,J) (BCNF? - lossless? - dep. pres.? ) 2) R1(T,J) R2(S,T) (BCNF? - lossless? - dep. pres.? )
Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T 1) R1(T,J) R2(S,J) (BCNF? Y+Y - lossless? N - dep. pres.? N ) 2) R1(T,J) R2(S,T) (BCNF? Y+Y - lossless? Y - dep. pres.? N )
Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T in this case: impossible to have both BCNF and dependency preservation Welcome 3NF (…a weaker normal form)!
Normal forms - 3NF STJ(Student, Teacher, subJect) T -> J; S,J -> T. Informally, 3NF ‘forgives’ the red arrow in the can. cover (figure: dependency diagram over S, J, T)
Normal forms - 3NF STJ(Student, Teacher, subJect) T -> J; S,J -> T. Formally, a rel. R with FDs ‘F’ is in 3NF if, for every a->b in F+: it is trivial, or a is a superkey, or each attribute in b - a is part of a cand. key
Normal forms - 3NF R = (J, K, L), F = {JK -> L, L -> K}. Two candidate keys: JK and JL. R is not in BCNF. Any decomposition of R will fail to preserve JK -> L: the BCNF decomposition has (JL) and (LK), and testing for JK -> L requires a join. R is in 3NF: for JK -> L, JK is a superkey; for L -> K, K is contained in a candidate key. There is some redundancy in this schema…
Normal forms - 3NF TESTING FOR 3NF Optimization: need to check only FDs in F, need not check all FDs in F+. Use attribute closure to check, for each dependency α -> β, if α is a superkey. If α is not a superkey, we have to verify if each attribute in β is contained in a candidate key of R; this test is more expensive, since it involves finding candidate keys. Testing for 3NF is NP-hard. Interestingly, decomposition into 3NF (described shortly) can be done in polynomial time
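The 3NF test above can be sketched as follows. Candidate keys are found by brute force over attribute subsets, which is exponential and reflects why the test is expensive in general. The names `closure`, `candidate_keys` and `is_3nf` are illustrative assumptions.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure covers the schema."""
    schema = set(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            cand = set(combo)
            if any(k <= cand for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(cand, fds) >= schema:
                keys.append(cand)
    return keys

def is_3nf(schema, fds):
    """Check each FD in F: trivial, superkey on the left, or every
    attribute of (right - left) contained in some candidate key."""
    schema = set(schema)
    prime = set().union(*candidate_keys(schema, fds))
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        if rhs <= lhs or closure(lhs, fds) >= schema:
            continue
        if not (rhs - lhs) <= prime:
            return False
    return True

jkl_fds = [({"J", "K"}, {"L"}), ({"L"}, {"K"})]
jkl_keys = sorted(sorted(k) for k in candidate_keys({"J", "K", "L"}, jkl_fds))
jkl_3nf = is_3nf({"J", "K", "L"}, jkl_fds)

takes1_3nf = is_3nf(
    {"ssn", "c-id", "grade", "name", "address"},
    [({"ssn"}, {"name", "address"}), ({"ssn", "c-id"}, {"grade"})])

print(jkl_keys, jkl_3nf, takes1_3nf)
```

For R = (J, K, L) the sketch finds the two candidate keys JK and JL and reports 3NF, because K in L -> K belongs to a candidate key. TAKES1 fails, since name and address belong to no candidate key.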
Decomposition into 3NF
    Let Fc be a canonical cover for F;
    i := 0;
    for each functional dependency α -> β in Fc do
        if none of the schemas Rj, 1 ≤ j ≤ i contains α β then begin
            i := i + 1;
            Ri := α β
        end
    if none of the schemas Rj, 1 ≤ j ≤ i contains a candidate key for R then begin
        i := i + 1;
        Ri := any candidate key for R;
    end
    return (R1, R2, ..., Ri)
The dependencies are preserved by building explicitly a schema for each given dependency. A lossless-join decomposition is guaranteed by having at least one schema containing a candidate key for the schema being decomposed
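The synthesis algorithm above can be sketched in a few lines. This sketch assumes its input is already a canonical cover (computing the cover is not shown), and it uses the fact that a schema contains a candidate key exactly when it is a superkey of R. The names `closure`, `candidate_key` and `synthesize_3nf`, and the two small schemas used as examples, are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_key(schema, fds):
    """Shrink the full schema to one candidate key by dropping attributes
    that are not needed to determine everything else."""
    schema = set(schema)
    key = set(schema)
    for a in sorted(schema):
        if closure(key - {a}, fds) >= schema:
            key.discard(a)
    return key

def synthesize_3nf(schema, canonical_cover):
    """One schema per FD of the cover, plus a key schema if none is a superkey."""
    schema = set(schema)
    result = []
    for lhs, rhs in canonical_cover:
        alpha_beta = frozenset(lhs) | frozenset(rhs)
        if not any(alpha_beta <= rj for rj in result):
            result.append(alpha_beta)
    # A schema contains a candidate key of R iff its closure covers R.
    if not any(closure(rj, canonical_cover) >= schema for rj in result):
        result.append(frozenset(candidate_key(schema, canonical_cover)))
    return result

# Hypothetical R = (A, B, C) with cover {A -> B, B -> C}.
chain = synthesize_3nf({"A", "B", "C"}, [({"A"}, {"B"}), ({"B"}, {"C"})])

# Hypothetical R = (A, B, C, D) with cover {A -> B, C -> D}:
# neither FD schema is a superkey, so a key schema (A, C) is added.
split = synthesize_3nf({"A", "B", "C", "D"}, [({"A"}, {"B"}), ({"C"}, {"D"})])

print(chain, split)
```

Like the pseudocode, the sketch does not remove a schema that is later subsumed by a larger one, so the output can contain redundant tables that a cleanup pass would drop.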
Normal forms - 3NF how to bring a schema to 3NF? In short ….for each FD in the cover, put it in a table
Normal forms - 3NF vs BCNF If ‘R’ is in BCNF, it is always in 3NF (but not the reverse). In practice, aim for BCNF, lossless join, and dep. preservation; if that is impossible, we accept 3NF, but insist on lossless join and dep. preservation. 3NF has problems with transitive dependencies
3NF vs BCNF (cont.) Example of problems due to redundancy in 3NF: R = (J, K, L), F = {JK -> L, L -> K}. A schema that is in 3NF but not in BCNF has the problems of repetition of information (e.g., the relationship l1, k1) and the need to use null values (e.g., to represent the relationship l2, k2 where there is no corresponding value for J). Rows (J, L, K): (j1, l1, k1), (j2, l1, k1), (j3, l1, k1), (null, l2, k2)
Normal forms - more details why ‘3’NF? what is 2NF? 1NF? 1NF: attributes are atomic (i.e., no set-valued attr., a.k.a. ‘repeating groups’) (figure: example table that is not 1NF)
Normal forms - more details 2NF: 1NF and non-key attr. fully depend on the key. Counter-example: TAKES1(ssn, c-id, grade, name, address) ssn -> name, address ssn, c-id -> grade (figure: dependency diagram over name, address, grade, c-id, ssn)
Normal forms - more details 3NF: 2NF and no transitive dependencies. Counter-example: (figure: dependency diagram over A, B, C, D) in 2NF, but not in 3NF
Normal forms - more details 4NF, multivalued dependencies etc: later… in practice, E-R diagrams usually lead to tables in BCNF
Overview - conclusions DB design and normalization pitfalls of bad design decompositions (lossless, dep. preserving) normal forms (BCNF or 3NF) “everything should depend on the key, the whole key, and nothing but the key”