Schema Refinement
  What and why
Copyright © 2003-2015 Curt Hill
What is a good schema?
  A good schema should:
    Represent all the data needed
    Group the data into relations that make sense
    Have little or no redundancy
    Make common operations efficient
  This is not just a common-sense notion
    We have some objective ways of determining whether a schema is indeed good
Redundancy
  What is wrong with redundant data?
    Space and access tradeoff
    Update anomaly
      One copy is changed and the others are not
    Insert anomaly
      An insertion requires that unrelated information also be inserted
    Delete anomaly
      Deleting something also deletes unrelated information
Normalization
  Design activities to preclude redundancy and the anomalies it causes
  There is a series of normal forms that are contained within one another
    5thNF (PJ) ⇒ 4thNF ⇒ BCNF ⇒ 3rdNF ⇒ 2ndNF ⇒ 1stNF
    Each ⇒ reads "implies": a table in one form is also in every form to its right
  NF = Normal Form
  PJ = Project Join, a form of 5thNF
  BC = Boyce-Codd
    BCNF is a slight strengthening of 3rdNF
How will we do this?
  We will start with the simplest and work up to the most complicated
    Show how to determine the particular normal form
    Show what problems the next normal form solves
  The literature describes an 18th Normal Form
    We will stop at 5th Normal Form
  Warning: Mathematics ahead
    If there is no math, this is not science
First Normal Form
  Default case in a relational database
    Rectangular tables
    Fixed number of fields
  A file is not in 1stNF if it allows repeating groups
    Such as a variable number of fields
  A relational database may allow variable-length fields, but that is an implementation consideration
    The field is still considered atomic
1st NF and non 1st NF
  Not in 1st Normal Form (repeated groups)
    1013  Joe Smith   Biology  English
    1043  Jon Smith   CIS
    1152  Jane Jones  Math
  1st Normal Form
    1013  Joe Smith   Biology
    1013  Joe Smith   English
    1043  Jon Smith   CIS
    1152  Jane Jones  Math
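The move from repeated groups to 1st Normal Form can be sketched in a few lines of Python. This is only an illustration: the function name and the idea that the repeated values are stored as a list are assumptions, not something the slides specify.

```python
# Records with a repeating group: the third element is a variable-length
# list, so the "table" is not rectangular. (The list representation is an
# assumption made for this sketch.)
students = [
    (1013, "Joe Smith", ["Biology", "English"]),
    (1043, "Jon Smith", ["CIS"]),
    (1152, "Jane Jones", ["Math"]),
]

def to_first_normal_form(records):
    """Emit one fixed-width row per (student, repeated value) pair."""
    return [(sid, name, value)
            for sid, name, values in records
            for value in values]

rows = to_first_normal_form(students)
for row in rows:
    print(row)
```

Every output row has exactly three atomic fields, at the cost of repeating the ID and name for Joe Smith.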
An example in 1st NF
  Attributes
    SID - numeric student ID
    SNAME - student name
    LCODE - location (campus)
    STATUS - numeric status of the location
    CID - course ID (numeric)
    CNAME - course name
    SITE - location of the course
    GRADE - grade this student received
  Key is SID and CID
A picture
  SID SName LCode Status CID CName Site Grade
  21 Jones A1 1 170 C Lit MCF 89 32 Smith 160 C++ RSC 68 91 385 DB I VNG 76 62
What problems exist?
  Two of each:
    Locations, student and course
    Names
    IDs
  Each of these depends on part of the key but not all of it
  Looks like two tables, not one
  Table is in 1stNF but not 2ndNF
Anomalies
  Update anomaly
    Changing a course number requires changing several records
    Changing the LCode requires several updates
  Insert anomaly
    We cannot have a student without their taking at least one class
  Delete anomaly
    Deleting the first record destroys all that we know about course 170
Problem again
  The real problem is that things like CName are not dependent on the entire key
    CName is dependent on CID
    Just part of the key
  We need to consider functional dependencies
Functional Dependencies (FD)
  If field A determines field B then B is functionally dependent on A
    In other words: if we know A we know B
  Notation: A → B
    This is read: A determines B
  A does not have to be an atomic attribute
  Every field is functionally dependent on every candidate key
    This includes every field with the uniqueness property
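A minimal sketch of what "if we know A we know B" means on a concrete table instance follows. The function name `fd_holds` and the sample rows are hypothetical. A True result only means the sample does not refute the dependency; it does not prove that the FD holds in general.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(left, right) != right:
            return False
    return True

# Hypothetical sample rows using attribute names from the student example
rows = [
    {"SID": 21, "CID": 170, "CNAME": "C Lit"},
    {"SID": 32, "CID": 170, "CNAME": "C Lit"},
    {"SID": 21, "CID": 160, "CNAME": "C++"},
]

print(fd_holds(rows, ["CID"], ["CNAME"]))  # True: not refuted by this sample
print(fd_holds(rows, ["SID"], ["CID"]))    # False: student 21 has two CIDs
```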
Full Functional Dependency
  Somewhat stronger than the previous
  B is fully functionally dependent on A iff
    B is functionally dependent on A
    B is not functionally dependent on any proper subset of A
  If A is atomic, FD = FFD
  Notation is A ↠ B
Observations
  We cannot tell FDs by just looking at the data
    We must understand the data relationships
    Small tables may have apparent FDs that are not actually FDs
  If every A → B were projected into its own relation, then A would be the key
  Each FD represents an integrity constraint
Closure of a Set of FDs
  The closure (denoted F+) of a set F of FDs is a set that includes:
    All FDs in F
    Every FD that can be derived from the given FDs
  FDs obey some properties that allow us to find FDs implied by other FDs
  These properties are called Armstrong’s Axioms
Armstrong’s Axioms
  There are three basic rules:
    Reflexivity
    Augmentation
    Transitivity
  Two additional rules may be derived using these three
    Union
    Decomposition
Reflexivity
  If Y is a subset of X then X → Y
    A set of fields determines all of its members
  Examples:
    A → A
    AB → B
  A trivial FD is any FD where the right-hand side is a subset of the left-hand side
Augmentation
  If X determines Y
    Then XZ determines YZ
  It is always possible to add a field to both sides of a functional dependency
  Example:
    If A → B then AC → BC
Transitivity
  If X determines Y and Y determines Z
    Then X determines Z
  We can chain FDs together
  Example:
    If: A → B, B → C, C → D
    then: A → C, A → D
Union
  If a field determines two separate fields it determines both of them together
  If X determines Y and X determines Z
    Then X determines YZ
  Example:
    If: A → B, A → C
    then: A → BC
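One way to see that Union follows from the three basic rules is the short derivation below. It is a sketch of the standard argument, written in LaTeX, starting from X → Y and X → Z.

```latex
\begin{align*}
X \to Y &\implies X \to XY && \text{(augment both sides with } X\text{)}\\
X \to Z &\implies XY \to YZ && \text{(augment both sides with } Y\text{)}\\
X \to XY,\ XY \to YZ &\implies X \to YZ && \text{(transitivity)}
\end{align*}
```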
An Example
  Suppose that a table has six fields: ABCDEF
  The following dependencies exist:
    AC → B
    C → DE
    F → AC
  How many dependencies can be derived?
  What dependencies are contained in the closure?
Closure
  The closure is the union of all dependencies that may be derived from the original set:
    AC → B, C → DE, F → AC
  Reflexivity (AKA trivial)
    A → A, B → B, AB → B, ABC → C, …
  Augmentation
    CA → ADE, ACD → BD, …
  Transitivity
    F → B, F → DE
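Listing all of F+ by hand gets long quickly. A common shortcut is to compute the closure of a set of attributes instead: X → Y is in F+ exactly when Y is contained in the attribute closure of X. The sketch below assumes single-letter attribute names so that a string such as "AC" can stand for the set {A, C}; the function name is an assumption for this illustration.

```python
def attribute_closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left side is already in the closure
            if set(lhs).issubset(closure) and not set(rhs).issubset(closure):
                closure |= set(rhs)
                changed = True
    return closure

# FDs from the example: AC → B, C → DE, F → AC
fds = [("AC", "B"), ("C", "DE"), ("F", "AC")]

print(sorted(attribute_closure("F", fds)))   # ['A', 'B', 'C', 'D', 'E', 'F']
print(sorted(attribute_closure("C", fds)))   # ['C', 'D', 'E']
print(sorted(attribute_closure("AC", fds)))  # ['A', 'B', 'C', 'D', 'E']
```

The first result confirms the transitive dependencies F → B and F → DE listed above.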
Keys and Dependencies
  A key is any set of fields that determines all other fields
    Either directly or transitively
  A candidate key must be minimal
    No field may be removed and still leave a key
  In the above:
    The entire relation is a key by reflexivity but is not minimal
    F is the key: it determines every other field directly or using transitivity
  Super key: set of fields that contains a key
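The claim that F is the key of the ABCDEF example can be checked by brute force. The sketch below tries attribute sets from smallest to largest and keeps those whose closure is the whole relation. The function names are assumptions, and the exhaustive search is only practical for small relations.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(closure) and not set(rhs).issubset(closure):
                closure |= set(rhs)
                changed = True
    return closure

def candidate_keys(relation, fds):
    """Return all minimal attribute sets whose closure is the whole relation."""
    all_attrs = set(relation)
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            # A superset of a known key is a super key, not a candidate key
            if any(key.issubset(candidate) for key in keys):
                continue
            if attribute_closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys

fds = [("AC", "B"), ("C", "DE"), ("F", "AC")]
keys = candidate_keys("ABCDEF", fds)
print(keys)  # [{'F'}]
```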
Decomposition
  If a field determines two combined fields it determines both of them separately
  If X determines YZ
    Then X determines Y and X determines Z
  This is the reverse of Union
  Example:
    If: A → BC
    then: A → B, A → C
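Decomposition can likewise be derived from the basic rules. The following LaTeX sketch starts from X → YZ; the same steps with Z in place of Y give X → Z.

```latex
\begin{align*}
Y \subseteq YZ &\implies YZ \to Y && \text{(reflexivity)}\\
X \to YZ,\ YZ \to Y &\implies X \to Y && \text{(transitivity)}
\end{align*}
```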
Decompositions
  Use projections to subdivide a table into several tables in order to move to a higher normal form
  However, can all projections be done without problems?
    No
    There are both lossless and lossy projections
  The desired kind of projection is called a lossless join decomposition
    This kind allows us to exactly reconstruct the original table
Lossless Join Decomposition
  How may we subdivide one relation into two without losing anything?
  There must be some attributes in common in the two tables
    Otherwise the relationship between a key and attribute is broken
  The decomposition is lossless if the attributes in common contain a key of either table
Lossless Decomposition Again
  Let
    R be a set of fields in a relation
    F be a set of FDs that hold over R
  The decomposition of R into R1 and R2 is lossless if and only if F+ contains either
    R1 ∩ R2 → R1
    or
    R1 ∩ R2 → R2
  The attributes in common must contain the key for R1 or the key for R2
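The test above translates directly into code: take the common attributes, compute their closure, and check whether it covers R1 or R2. The sketch below uses a hypothetical relation with attributes S, P, D and an assumed dependency S → PD purely for illustration; single-letter attribute names let a string stand for a set.

```python
def attribute_closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(closure) and not set(rhs).issubset(closure):
                closure |= set(rhs)
                changed = True
    return closure

def is_lossless(r1, r2, fds):
    """Binary decomposition test: (R1 ∩ R2) → R1 or (R1 ∩ R2) → R2."""
    common = set(r1) & set(r2)
    closure = attribute_closure(common, fds)
    return set(r1).issubset(closure) or set(r2).issubset(closure)

# Assumed FD for illustration only: S → PD (S is the key)
fds = [("S", "PD")]

print(is_lossless("SP", "SD", fds))  # True: common attribute S is a key
print(is_lossless("SP", "PD", fds))  # False: common attribute P determines nothing
```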
Example
  Original
    S   P   D
    S1  P1  D1
    S2  P2  D2
    S3      D3
  Decomposed into two
    S   P
    S1  P1
    S2  P2
    S3

    P   D
    P1  D1
    P2  D2
        D3
  Join is larger than original, some information lost
Why did that not work?
  The common field was P
    P is not the key
  Recall:
    The functional dependencies cannot be determined from looking at the data
    The data may only show what is not an FD
  In this case either S or D or both could be the key
Example Revisited
  Original
    S   P   D
    S1  P1  D1
    S2  P2  D2
    S3      D3
  Decomposed into two better tables
    S   P
    S1  P1
    S2  P2
    S3

    S   D
    S1  D1
    S2  D2
    S3  D3
  Reconstructed the same as original
  This works now, but may not work with more data
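The two decompositions can be compared by projecting and re-joining a small table in Python. In this sketch the helper names are assumptions, and so is the P value of the S3 row: it is set to P1 only so that P repeats, which is what makes the join on P produce extra rows.

```python
def project(rows, attrs):
    """Project rows (dicts) onto attrs, removing duplicates."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}

def natural_join(left, right):
    """Join two projections on every attribute name they share."""
    result = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                merged = {**ld, **rd}
                result.add(tuple(sorted(merged.items())))
    return result

# Assumed data: the P value of the S3 row is hypothetical
original = [
    {"S": "S1", "P": "P1", "D": "D1"},
    {"S": "S2", "P": "P2", "D": "D2"},
    {"S": "S3", "P": "P1", "D": "D3"},
]
as_set = {tuple(sorted(row.items())) for row in original}

# Common field P: not a key, so the join invents rows
lossy = natural_join(project(original, "SP"), project(original, "PD"))
# Common field S: unique in this data, so the join restores the table
lossless = natural_join(project(original, "SP"), project(original, "SD"))

print(len(as_set), len(lossy), len(lossless))  # 3 5 3
print(lossless == as_set)                      # True
```

With this data the join on P yields the spurious rows (S1, P1, D3) and (S3, P1, D1). The join on S succeeds only because S happens to be unique here, which matches the caution that it may not work with more data.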
Other Notes
  This generalizes to decomposing a table into more than two tables
    Decompose R1 into R1A and R1B
    We can then reconstruct R1 if needed
  From the viewpoint of lossless decomposition:
    The common fields must include the key, but may include other fields
  From the viewpoint of decomposing into higher normal forms:
    The common fields are usually only key fields
    Non-key fields are just redundant data
Second Normal Form (2ndNF)
  A table is in Second Normal Form if and only if
    It is in 1st NF and
    Every non-key attribute is fully functionally dependent on the whole key
  No partial dependencies
Partial Dependencies
  X → A
    X is part of the key but not all of it
    A is a non-key attribute
  Violation of 2nd NF
Student Table
  Our previous student table was 1stNF but not 2ndNF
    The key is SID and CID
    LCODE is dependent on SID
    CNAME is dependent on CID
  The fix is projecting it into two (or more) tables
    This must be dependency preserving
What dependencies?
  SID → SNAME
  SID → LCODE
  LCODE → STATUS
  CID → CNAME
  SID,CID → GRADE
  CID → SITE
  SID,CID → Everything
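Given these dependencies, the partial dependencies can be found mechanically: compute the closure of each proper subset of the key and see which non-key attributes fall out. The sketch below is one possible way to do this; the function names are assumptions, and the FD "SID,CID → Everything" is left out because it is implied by the key.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs.issubset(closure) and not rhs.issubset(closure):
                closure |= rhs
                changed = True
    return closure

def partial_dependencies(key, fds):
    """Map each proper subset of the key to the non-key attributes it determines."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = attribute_closure(subset, fds) - key
            if determined:
                found[subset] = determined
    return found

fds = [
    ({"SID"}, {"SNAME", "LCODE"}),
    ({"LCODE"}, {"STATUS"}),
    ({"CID"}, {"CNAME", "SITE"}),
    ({"SID", "CID"}, {"GRADE"}),
]
key = {"SID", "CID"}

partials = partial_dependencies(key, fds)
for subset, attrs in partials.items():
    print(subset, "determines", sorted(attrs))
```

Only GRADE is missing from the result, so it is the one attribute that depends on the whole key.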
Now what?
  The two-piece key implies three tables:
    One where SID is the key
    One where CID is the key
    One with both SID and CID as the key
  Each table has only fields dependent on the whole key
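The three tables can be derived by assigning each non-key attribute to the smallest part of the key that determines it. The sketch below is one hedged way to express that idea; `decompose_2nf` is a hypothetical helper name, and the approach assumes the full key determines every attribute.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs.issubset(closure) and not rhs.issubset(closure):
                closure |= rhs
                changed = True
    return closure

def decompose_2nf(attributes, key, fds):
    """Group each non-key attribute under the smallest key subset determining it."""
    # Subsets are generated smallest first, so the first match is minimal
    subsets = [frozenset(c)
               for n in range(1, len(key) + 1)
               for c in combinations(sorted(key), n)]
    tables = {frozenset(key): set(key)}
    for attr in sorted(attributes - key):
        owner = next(s for s in subsets if attr in attribute_closure(s, fds))
        tables.setdefault(owner, set(owner)).add(attr)
    return tables

fds = [
    ({"SID"}, {"SNAME", "LCODE"}),
    ({"LCODE"}, {"STATUS"}),
    ({"CID"}, {"CNAME", "SITE"}),
    ({"SID", "CID"}, {"GRADE"}),
]
key = {"SID", "CID"}
attributes = {"SID", "SNAME", "LCODE", "STATUS", "CID", "CNAME", "SITE", "GRADE"}

tables = decompose_2nf(attributes, key, fds)
for owner, columns in tables.items():
    print(sorted(owner), "is the key of", sorted(columns))
```

STATUS lands in the SID table because SID determines it through LCODE. That transitive dependency is not a 2nd NF violation, so it remains after this step.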
Original 1NF Table
  SID SName LCode Status CID CName Site Grade
  21 Jones A1 1 170 C Lit MCF 89 32 Smith 160 C++ RSC 68 91 385 DB I VNG 76 62
New Relations
  Student
    SID SName LCode Status
    21 Jones A1 1 32 Smith
  Enroll
    SID CID Grade
    21 170 89 32 160 68 91 385 76 62
  Course
    CID  CName  Site
    170  C Lit  MCF
    160  C++    RSC
    385  DB I   VNG
The new schema is better
  Used a three-way lossless join decomposition
  Now at Second Normal Form
  Lost some anomalies
    The insertion and deletion anomalies
      We may have a student without a class
    The update anomaly
      Changing a course title needs only one update
  One anomaly still exists:
    Changing the LCode in one record requires changing other LCodes as well
  More work to be done
Finally
  Dependencies are a mathematical concept
    Strongly related to the concept of a key
  We can use dependencies to determine a table’s normal form
    Second, third and Boyce-Codd
  First is any rectangular table
  Second has no partial dependencies
    A 1NF table with a single field for a key must be in 2NF