Why Normalization?
To reduce redundancy, in order to
1. avoid modification, insertion, and deletion anomalies
2. save space
Goal: One Fact in One Place

Normalization
A process for assigning attributes to entities... (Rob, Coronel)
A process for transforming some objects into structural form that satisfies some collection of rules… (Riccardi)
Reduces data redundancies…
A formalization of “one fact in one place” (CJ Date)

Normalization
You will probably never have to go through formal normalization of any database (unless you are in school or work for a rigid corporation like IBM). Maybe that is a little unfair, but if you do good design you shouldn’t have to go through the formal normalization steps, which are most useful when you already have tables. Define and extract entities when modeling. If you jump right into making tables without doing a conceptual or logical design you may end up with tables that aren’t in 3NF or even 2NF, but you can’t end up with tables that aren’t in 1NF. We’ll try a little bit of theory while keeping in mind that the goal is to put one fact in one place only.

The Normal Forms
First Normal Form (1NF): a table is in 1NF if each attribute is atomic.
2NF: the table is in 1NF AND contains no partial dependencies. That is, no non-key attribute is dependent on only a portion of the primary key.
3NF: the table is in 2NF AND contains no transitive dependencies. That is, there are no non-key attributes that are determined by other non-key attributes.
BCNF: the table is in 3NF AND every determinant in the table is a candidate key.

Higher Normal Forms (rarely used in business apps)
4NF: the table is in BCNF AND all multi-valued dependencies are also functional dependencies.
5NF: the table cannot have a lossless decomposition into any number of smaller tables (join dependency).

Relational Keys (again)
Superkey: an attribute (or combination of attributes) that uniquely identifies each instance in a table.
Candidate key: a minimal superkey. A superkey that does not contain a proper subset of attributes that is itself a superkey.
Primary key: a candidate key selected to uniquely identify all other attribute values in any given row. Cannot contain null entries.
Alternate key: a candidate key that is not the primary key (might be used as secondary access via an index).
Foreign key: an attribute or combination of attributes of a table that is also the primary key of another table.
Composite key: a key that is composed of more than one attribute.

Functional Dependencies
A functional dependency is a strong connection between two or more attributes in a table.
– One attribute is functionally dependent on another attribute when any two rows of the table that have the same value of the second attribute must have the same value for the first
– The left side is the determinant
Example: movieId determines title, genre, length, rating
– Each row with movieId 123 has the same values for the other attributes
– FD2: movieId → {title, genre, length, rating}

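The definition above can be tested against actual rows. Here is a minimal Python sketch, assuming a table is just a list of dicts; the function name `fd_holds`, the `copyId` column, and the sample rows are made up for illustration.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if no two rows agree on determinant but differ on dependent."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in determinant)
        right = tuple(row[a] for a in dependent)
        # Remember the first right side seen for this left side, then compare.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical sample data: two copies of movie 123, one copy of movie 456.
movies = [
    {"movieId": 123, "title": "Annie Hall", "genre": "comedy", "copyId": 1},
    {"movieId": 123, "title": "Annie Hall", "genre": "comedy", "copyId": 2},
    {"movieId": 456, "title": "Duck Soup", "genre": "comedy", "copyId": 1},
]

print(fd_holds(movies, ["movieId"], ["title", "genre"]))  # True
print(fd_holds(movies, ["genre"], ["title"]))             # False
```

Keep in mind that sample rows can only show that a dependency is violated. Whether a dependency really holds is a statement about the business rules, not about whatever data happens to be in the table today.
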
Street, City, State, Zip
FD4: zipcode → {city, state}
FD5: {street, city, state} → zipcode

Using Functional Dependencies
To determine candidate and primary keys (2NF)
To discover redundancies and transitive dependencies (3NF)

Primary Keys & Superkeys
A primary key constraint is a functional dependency
Example: accountId is the primary key of Customer
– FD6: accountId → {lastName, firstName, street, city, state, zipcode}
A superkey is a set of attributes that determines the rest of the attributes of a table
– FD7: {accountId, lastName} → {firstName, street, city, state, zipcode}

Determining Keys from Functional Dependencies
Start with the closure of the functional dependencies
Any functional dependency that includes all attributes has a superkey as its left side
If no proper subset of the left side is a superkey
– The left side is a candidate key
In other words, a candidate key is a minimal superkey
A set of attributes is a candidate key if and only if the above holds
Choose the primary key from the set of candidate keys

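The key-finding steps above can be sketched in Python. This is a brute-force illustration for small tables, not an efficient algorithm. The function names `closure` and `candidate_keys` and the variable `address` are made up here; the attributes and dependencies are FD4 and FD5 from the street, city, state, zip example.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    """Return all minimal superkeys, trying the smallest attribute sets first."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys


address = {"street", "city", "state", "zipcode"}
fds = [
    ({"zipcode"}, {"city", "state"}),            # FD4
    ({"street", "city", "state"}, {"zipcode"}),  # FD5
]

keys = candidate_keys(address, fds)
print(keys)
```

For these two dependencies the sketch finds two candidate keys, {street, zipcode} and {street, city, state}, so every attribute of the table belongs to some candidate key.
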
2NF is easy
Our tables will be in 2NF if we make sure our primary keys are candidate keys (that is, minimal superkeys) and no non-key attribute depends on only part of a composite key.
Putting tables in 2NF removes partial dependencies
A single-attribute key has no parts to depend on, so you can also get here by using surrogate keys. So they must be good!

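A partial dependency can be found mechanically: take each proper subset of the key and see whether it determines any non-key attribute on its own. The sketch below assumes a hypothetical Rental table and dependencies that are not part of these notes, and it only looks at the one key you give it.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result


def partial_dependencies(all_attrs, key, fds):
    """Map each proper subset of key to the non-key attributes it determines."""
    non_key = all_attrs - key
    found = {}
    for size in range(1, len(key)):
        for part in combinations(sorted(key), size):
            determined = closure(set(part), fds) & non_key
            if determined:
                found[part] = determined
    return found


# Hypothetical table: Rental(accountId, movieId, dateRented, title)
rental = {"accountId", "movieId", "dateRented", "title"}
rental_fds = [
    ({"accountId", "movieId"}, {"dateRented"}),
    ({"movieId"}, {"title"}),  # title depends on only part of the key
]

print(partial_dependencies(rental, {"accountId", "movieId"}, rental_fds))

# With a single-attribute (surrogate) key there are no proper subsets to check.
with_surrogate = rental | {"rentalId"}
print(partial_dependencies(with_surrogate, {"rentalId"}, rental_fds))  # {}
```

The first call reports that movieId alone determines title, which is the partial dependency that moving title into a Movie table would remove.
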
Let’s go to 3NF
A table is in third normal form (3NF) if for every functional dependency
– The left side (determinant) is a superkey, or
– The right side attributes are all key attributes
Putting tables in 3NF removes transitive dependencies. That is, there are no non-key attributes that are determinants for other non-key attributes.

Boyce-Codd Normal Form
A table is in BCNF if every functional dependency has a superkey as its determinant
– No exclusion for key attributes
Important in the context of multi-attribute keys
Consider the example of a table and an FD
– R6: (street, city, state, zipcode)
– FD4: zipcode → {city, state}
R6 is in 3NF because city and state are key attributes (they are part of the candidate key {street, city, state})
R6 has a BCNF violation because zipcode is not a superkey

3NF and BCNF
A table is in third normal form (3NF) if for every functional dependency
– The left side (determinant) is a superkey, OR
– The right side attributes are all key attributes
A table is in BCNF if every functional dependency has a superkey as its determinant

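The two definitions differ only in the second condition, which is easy to see in code. This is a minimal sketch that checks only the dependencies you list and expects you to supply the key attributes yourself. The function names are made up for illustration; R6 and Customer use FD4, FD5, and FD6 from earlier slides, with city, state, and zipcode added to Customer's dependencies via FD4.

```python
def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Non-trivial FDs whose determinant is not a superkey."""
    return [(left, right) for left, right in fds
            if not right <= left and closure(left, fds) != all_attrs]


def third_nf_violations(all_attrs, fds, key_attrs):
    """BCNF violations that 3NF does not forgive: some right-side attribute is non-key."""
    return [(left, right) for left, right in bcnf_violations(all_attrs, fds)
            if not (right - left) <= key_attrs]


fd4 = ({"zipcode"}, {"city", "state"})
fd5 = ({"street", "city", "state"}, {"zipcode"})

# R6: every attribute belongs to a candidate key, so all four are key attributes.
r6 = {"street", "city", "state", "zipcode"}
r6_3nf = third_nf_violations(r6, [fd4, fd5], key_attrs=r6)
r6_bcnf = bcnf_violations(r6, [fd4, fd5])

# Customer: accountId is the only candidate key, so city and state are non-key.
customer = {"accountId", "lastName", "firstName", "street", "city", "state", "zipcode"}
fd6 = ({"accountId"}, customer - {"accountId"})
customer_3nf = third_nf_violations(customer, [fd6, fd4], key_attrs={"accountId"})

print(r6_3nf, r6_bcnf, customer_3nf)
```

R6 passes the 3NF test but fails BCNF on FD4, while Customer fails 3NF on the same FD4 because there city and state are non-key attributes determined by the non-key attribute zipcode.
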
3NF or BCNF?
What is the difference between 3NF and BCNF? Remember that BCNF is just a special form of 3NF. We’re trying to get rid of transitive dependencies. It probably won’t happen very much that you have a table in 3NF that is not in BCNF. If you do, then go back to the original table. Ideally we want all our tables to have keys that determine the remaining attributes with no other dependencies in the remaining attributes. Is BCNF better than 3NF? Why?

How far should we normalize?
What about 4NF and 5NF? Is a higher normal form always better?
Generally 3NF is fine, but BCNF is the goal.
After you’ve done all that fancy normalization you might want to denormalize some of your tables. That is, put them back in a lower normal form.
Consider the expense of all the joins required to get data out of high normal form relations.
You might be willing to accept a little redundancy in exchange for performance.

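To make the trade-off concrete, here is a small sketch using Python's built-in sqlite3 module. The table names Zip and CustomerFlat, the split of Customer, and the sample rows are assumptions made up for illustration. The normalized design needs a join to show a customer's city; the denormalized copy answers the same question from one table but stores the city once per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each zipcode's city and state are stored exactly once.
    CREATE TABLE Zip (zipcode TEXT PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE Customer (
        accountId INTEGER PRIMARY KEY,
        lastName  TEXT,
        street    TEXT,
        zipcode   TEXT REFERENCES Zip(zipcode)
    );
    INSERT INTO Zip VALUES ('32306', 'Tallahassee', 'FL');
    INSERT INTO Customer VALUES (101, 'Jones', '1 Main St', '32306');
    INSERT INTO Customer VALUES (102, 'Smith', '2 Oak Ave', '32306');

    -- Denormalized copy: city and state are repeated on every customer row.
    CREATE TABLE CustomerFlat AS
        SELECT c.accountId, c.lastName, c.street, z.city, z.state, c.zipcode
        FROM Customer c JOIN Zip z ON z.zipcode = c.zipcode;
""")

# Same question, asked with and without a join.
joined = conn.execute(
    "SELECT c.accountId, z.city FROM Customer c "
    "JOIN Zip z ON z.zipcode = c.zipcode ORDER BY c.accountId"
).fetchall()
flat = conn.execute(
    "SELECT accountId, city FROM CustomerFlat ORDER BY accountId"
).fetchall()

print(joined == flat)  # True
```

The price of the flat table is that a change to a zipcode's city now has to be applied to every matching customer row, which is the modification anomaly normalization was meant to prevent.
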
Give me a break?
Why go through all this normalization process if we are just going to denormalize at the end? It will help us better understand our model and our resulting database. The goal is to make good designs, and understanding any trade-offs or limitations our model will impose on our database helps us get there.