1 Data Normalization
Textbook Chapter 3: Jerry Post
Copyright © 2003
2 Introduction
A database is a powerful tool. It provides many advantages over traditional programming. However, you get these advantages only if you design the database correctly.
3 What Is Data Normalization?
Normalization splits your data into several tables that are connected to each other through the data they share.
Before data can be normalized, you must understand the business rules.
Your tables must match the business rules.
4 Primary and Composite Keys
Primary key: a column (or set of columns) that uniquely identifies a row in a table, e.g. Iqama number or Saudi ID.
Composite key: a primary key that is made up of more than one column.
5 Identifying Key Columns
Orders(OrderID, Date, Customer)
  Each order has only one customer, so Customer is not part of the key.
OrderItems(OrderID, Item, Quantity)
  Each order has many items. Each item can appear on many orders. So OrderID and Item are both part of the key.
6 Identifying Key Columns
If you are uncertain about which columns to key, write them down and evaluate the business rules.
Example: (OrderID, CustomerID)
  For a given order, can there ever be more than one customer? If yes, then key CustomerID. In most businesses there is only one customer per order, so do not key it.
  For a given customer, can there ever be more than one order? If yes, then key OrderID; otherwise, do not key it. All businesses hope to get more than one order from a customer, so OrderID must be key.
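The keying rules above can be tried out in a small sketch. It uses Python's built-in sqlite3 module with an in-memory database, which is an assumption of this example and not something the slides prescribe; the sample items and quantities are hypothetical. The composite key on OrderItems accepts many items per order and many orders per item, but rejects the same (OrderID, Item) pair twice.

```python
import sqlite3

def build_order_tables():
    """Create Orders and OrderItems in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Orders (
            OrderID  INTEGER PRIMARY KEY,   -- one customer per order, so Customer is not keyed
            Date     TEXT,
            Customer TEXT
        );
        CREATE TABLE OrderItems (
            OrderID  INTEGER,
            Item     TEXT,
            Quantity INTEGER,
            PRIMARY KEY (OrderID, Item)     -- composite key: both columns are keyed
        );
    """)
    return conn

def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_order_tables()
item_sql = "INSERT INTO OrderItems (OrderID, Item, Quantity) VALUES (?, ?, ?)"
results = [
    try_insert(conn, item_sql, (1, "Widget", 2)),   # first item on order 1
    try_insert(conn, item_sql, (1, "Gadget", 1)),   # second item on the same order
    try_insert(conn, item_sql, (2, "Widget", 5)),   # same item on a different order
    try_insert(conn, item_sql, (1, "Widget", 9)),   # same (OrderID, Item) pair again
]
print(results)
```

Only the last insert is refused, because it repeats a key value that already exists.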
7 Surrogate Keys
Real-world keys sometimes cause problems in a database.
Example: Customer
  Avoid phone numbers: people may not notify you when numbers change.
Often it is best to let the DBMS generate unique values.
  Access: AutoNumber
  SQL Server: Identity
  Oracle: Sequences (but these require additional programming)
Drawback: the numbers are not related to any business data, so the application needs to hide them and provide other lookup mechanisms.
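A minimal sketch of a DBMS-generated key, assuming SQLite (through Python's sqlite3 module) in place of the Access, SQL Server, and Oracle features named above. The phone numbers and the helper names add_customer and find_by_name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key generated by the DBMS
        LastName   TEXT,
        Phone      TEXT                                -- ordinary column, free to change
    )
""")

def add_customer(last_name, phone):
    """Insert a customer and return the generated CustomerID."""
    cur = conn.execute(
        "INSERT INTO Customer (LastName, Phone) VALUES (?, ?)", (last_name, phone)
    )
    return cur.lastrowid

first_id = add_customer("Washington", "555-0100")
second_id = add_customer("Lasater", "555-0101")

# A phone number change touches one column; the key stays the same.
conn.execute("UPDATE Customer SET Phone = ? WHERE CustomerID = ?", ("555-0199", first_id))

def find_by_name(last_name):
    """Look up customers by a business value so users never need to type the key."""
    return conn.execute(
        "SELECT CustomerID, Phone FROM Customer WHERE LastName = ?", (last_name,)
    ).fetchall()
```

The lookup by name is one way an application can hide the generated numbers from its users.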
8 Problems with Repeating Sections
RentalForm(TransID, RentDate, CustomerID, Phone, Name, Address, City, State, ZipCode, (VideoID, Copy#, Title, Rent))

TransID  RentDate  CustomerID  LastName    Address         VideoID  Copy#  Title
1        4/18/04   3           Washington  Easy Street     1        2      2001: A Space Odyssey
1        4/18/04   3           Washington  Easy Street     6        3      Clockwork Orange
2        4/30/04   7           Lasater     S. Ray Drive    8        1      Hopscotch
2        4/30/04   7           Lasater     S. Ray Drive    2        1      Apocalypse Now
2        4/30/04   7           Lasater     S. Ray Drive    6        1      Clockwork Orange
3        4/18/04   8           Jones       Lakeside Drive  9        1      Luggage Of The Gods
3        4/18/04   8           Jones       Lakeside Drive  15       1      Fabulous Baker Boys
3        4/18/04   8           Jones       Lakeside Drive  4        1      Boy And His Dog
4        4/18/04   3           Washington  Easy Street     3        1      Blues Brothers
4        4/18/04   3           Washington  Easy Street     8        1      Hopscotch
4        4/18/04   3           Washington  Easy Street     13       1      Surf Nazis Must Die
4        4/18/04   3           Washington  Easy Street     17       1      Witches of Eastwick

Repeating section: causes duplication.
Storing data in this raw form would not work very well. For example, repeating sections will cause problems. Note the duplication of data. Also, what if a customer has not yet checked out a movie: where do we store that customer's data?
9 First Normal Form Problems (Data)

TransID  RentDate  CustID  LastName    FirstName  Address             City               State
1        4/18/04   3       Washington  Elroy      95 Easy Street      Smith's Grove      KY
2        4/30/04   7       Lasater     Les        67 S. Ray Drive     Portland           TN
3        4/18/04   8       Jones       Charlie    867 Lakeside Drive  Castalian Springs  TN
4        4/18/04   3       Washington  Elroy      95 Easy Street      Smith's Grove      KY

TransID  VideoID  Copy#  Title
1        1        2      2001: A Space Odyssey
1        6        3      Clockwork Orange
2        8        1      Hopscotch
2        2        1      Apocalypse Now
2        6        1      Clockwork Orange
3        9        1      Luggage Of The Gods
3        15       1      Fabulous Baker Boys
3        4        1      Boy And His Dog
4        3        1      Blues Brothers
4        8        1      Hopscotch
4        13       1      Surf Nazis Must Die
4        17       1      Witches of Eastwick

1NF splits repeating groups.
Still have problems:
  Replication
  Hidden dependency: if a video has not been rented yet, then what is its title?
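The 1NF split can be sketched in a few lines of Python. This is an illustration under assumptions: the forms are modeled as nested dictionaries, the function name split_repeating_section is hypothetical, and the sample is abbreviated to a few columns and rows from the rental data.

```python
def split_repeating_section(forms):
    """Split nested rental forms into a parent table and a child table (1NF)."""
    rental_form2 = []   # one row per transaction
    rental_line = []    # one row per rented video
    for form in forms:
        parent = {k: v for k, v in form.items() if k != "Videos"}
        rental_form2.append(parent)
        for video in form["Videos"]:
            # Carry the parent key into the child so the pieces can be rejoined.
            rental_line.append({"TransID": form["TransID"], **video})
    return rental_form2, rental_line

forms = [
    {"TransID": 1, "CustomerID": 3, "LastName": "Washington",
     "Videos": [
         {"VideoID": 1, "Copy#": 2, "Title": "2001: A Space Odyssey"},
         {"VideoID": 6, "Copy#": 3, "Title": "Clockwork Orange"},
     ]},
    {"TransID": 2, "CustomerID": 7, "LastName": "Lasater",
     "Videos": [{"VideoID": 8, "Copy#": 1, "Title": "Hopscotch"}]},
]
rental_form2, rental_line = split_repeating_section(forms)
```

Customer data now appears once per transaction instead of once per video, but the title still travels with every rental line, which is the replication problem noted above.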
10 Second Normal Form
A relation is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key.
11 Second Normal Form Example (Data)
VideosRented(TransID, VideoID, Copy#)
Videos(VideoID, Title, Rent)

VideoID  Title                        Rent
1        2001: A Space Odyssey        $1.50
2        Apocalypse Now               $2.00
3        Blues Brothers               $2.00
4        Boy And His Dog              $2.50
5        Brother From Another Planet  $2.00
6        Clockwork Orange             $1.50
7        Gods Must Be Crazy           $2.00
8        Hopscotch                    $1.50

RentalForm2(TransID, RentDate, CustomerID, Phone, Name, Address, City, State, ZipCode) (unchanged)
12 Second Normal Form Example
RentalLine(TransID, VideoID, Copy#, Title, Rent)
Title depends only on VideoID: each VideoID can have only one title.
Rent depends on VideoID. This statement is actually a business rule, and it might be different at different stores. Some stores might charge a different rent for each video depending on the day (or time).
After the split, each non-key column depends on the whole key:
  VideosRented(TransID, VideoID, Copy#)
  Videos(VideoID, Title, Rent)
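As a rough illustration of the 2NF split, the sketch below breaks RentalLine rows into VideosRented and Videos and then rejoins them. The function names are hypothetical, rows are plain tuples for brevity, and the rent values are the ones listed in the Videos data.

```python
def to_second_normal_form(rental_line):
    """Split RentalLine rows into VideosRented rows and a Videos lookup."""
    videos_rented = [(t, v, c) for (t, v, c, _title, _rent) in rental_line]
    videos = {}
    for (_t, v, _c, title, rent) in rental_line:
        videos[v] = (title, rent)   # one entry per VideoID
    return videos_rented, videos

def rejoin(videos_rented, videos):
    """Recreate the original RentalLine rows by joining on VideoID."""
    return [(t, v, c, *videos[v]) for (t, v, c) in videos_rented]

# (TransID, VideoID, Copy#, Title, Rent)
rental_line = [
    (1, 1, 2, "2001: A Space Odyssey", 1.50),
    (1, 6, 3, "Clockwork Orange", 1.50),
    (2, 8, 1, "Hopscotch", 1.50),
    (2, 2, 1, "Apocalypse Now", 2.00),
    (2, 6, 1, "Clockwork Orange", 1.50),
]
videos_rented, videos = to_second_normal_form(rental_line)
```

Video 6 is rented twice but its title and rent are stored once, and the join returns exactly the rows that were split. This only works because Title and Rent depend on VideoID alone; a store with day-dependent rents would need a different design.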
13 Second Normal Form Problems (Data)
RentalForm2(TransID, RentDate, CustomerID, Phone, Name, Address, City, State, ZipCode)

TransID  RentDate  CustID  LastName    FirstName  Address             City               State
1        4/18/04   3       Washington  Elroy      95 Easy Street      Smith's Grove      KY
2        4/30/04   7       Lasater     Les        67 S. Ray Drive     Portland           TN
3        4/18/04   8       Jones       Charlie    867 Lakeside Drive  Castalian Springs  TN
4        4/18/04   3       Washington  Elroy      95 Easy Street      Smith's Grove      KY

Even in 2NF, problems remain:
  Replication
  Hidden dependency: if a customer has not rented a video yet, where do we store their personal data?
Solution: split the table.
14 Third Normal Form Definition
RentalForm2(TransID, RentDate, CustomerID, Phone, Name, Address, City, State, ZipCode)
  Depend on TransID: RentDate, CustomerID
  Depend only on CustomerID: Phone, Name, Address, City, State, ZipCode
Each non-key column must depend on nothing but the key.
Some columns depend on columns that are not part of the key. Split those into a new table.
  Example: a customer's name does not change for every transaction.
Dependence (definition): if, given a value for the key, you always know the value of the property in question, then that property is said to depend on the key.
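The definition of dependence can be checked against sample rows with a small helper. The function name depends_on is hypothetical, and the rows are an abbreviated version of the rental data. Sample data can only disprove a dependency; whether one truly holds is decided by the business rules.

```python
def depends_on(rows, key_cols, column):
    """Return True if each key value maps to exactly one value of `column`."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in key_cols)
        if key in seen and seen[key] != row[column]:
            return False   # same key, two different values: no dependence
        seen[key] = row[column]
    return True

rental_form2 = [
    {"TransID": 1, "CustomerID": 3, "LastName": "Washington"},
    {"TransID": 2, "CustomerID": 7, "LastName": "Lasater"},
    {"TransID": 3, "CustomerID": 8, "LastName": "Jones"},
    {"TransID": 4, "CustomerID": 3, "LastName": "Washington"},
]

name_on_customer = depends_on(rental_form2, ["CustomerID"], "LastName")
customer_on_trans = depends_on(rental_form2, ["TransID"], "CustomerID")
trans_on_customer = depends_on(rental_form2, ["CustomerID"], "TransID")
```

LastName is determined by CustomerID, a non-key column, which is the kind of dependency that 3NF moves into its own table. TransID is not determined by CustomerID, because customer 3 has two transactions.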
15 Third Normal Form Example Data
Rentals(TransID, RentDate, CustomerID)

TransID  RentDate  CustomerID
1        4/18/04   3
2        4/30/04   7
3        4/18/04   8
4        4/18/04   3

Customers(CustomerID, Phone, Name, Address, City, State, ZipCode)

CustomerID  LastName    FirstName  Address             City               State
1           Johnson     Martha     125 Main Street     Alvaton            KY
2           Smith       Jack       873 Elm Street      Bowling Green      KY
3           Washington  Elroy      95 Easy Street      Smith's Grove      KY
4           Adams       Samuel     746 Brown Drive     Alvaton            KY
5           Rabitz      Victor     645 White Avenue    Bowling Green      KY
6           Steinmetz   Susan      15 Speedway Drive   Portland           TN
7           Lasater     Les        67 S. Ray Drive     Portland           TN
8           Jones       Charlie    867 Lakeside Drive  Castalian Springs  TN
9           Chavez      Juan       673 Industry Blvd.  Caneyville         KY
10          Rojo        Maria      88 Main Street      Cave City          KY
16 Third Normal Form Tables (3NF)
Rentals(TransID, RentDate, CustomerID)
Customers(CustomerID, Phone, Name, Address, City, State, ZipCode)
VideosRented(TransID, VideoID, Copy#)
Videos(VideoID, Title, Rent)
Relationships:
  Customers (1) to Rentals (*) on CustomerID
  Rentals (1) to VideosRented (*) on TransID
  Videos (1) to VideosRented (*) on VideoID
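A sketch of the four 3NF tables, assuming SQLite through Python's sqlite3 module. Columns are abbreviated, Copy# is written as CopyNum because # is not valid in an unquoted SQLite identifier, and only a few sample rows are loaded. The join shows that the original rental form can be recreated from the pieces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Videos (VideoID INTEGER PRIMARY KEY, Title TEXT, Rent REAL);
    CREATE TABLE Rentals (
        TransID    INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customers (CustomerID)
    );
    CREATE TABLE VideosRented (
        TransID INTEGER REFERENCES Rentals (TransID),
        VideoID INTEGER REFERENCES Videos (VideoID),
        CopyNum INTEGER,                      -- Copy# in the slides
        PRIMARY KEY (TransID, VideoID)
    );
    INSERT INTO Customers VALUES (3, 'Washington'), (7, 'Lasater');
    INSERT INTO Videos VALUES (1, '2001: A Space Odyssey', 1.50),
                              (6, 'Clockwork Orange', 1.50),
                              (8, 'Hopscotch', 1.50);
    INSERT INTO Rentals VALUES (1, 3), (2, 7);
    INSERT INTO VideosRented VALUES (1, 1, 2), (1, 6, 3), (2, 8, 1);
""")

# Rejoin the four tables to rebuild the lines of the original rental form.
rental_form = conn.execute("""
    SELECT r.TransID, c.LastName, v.Title
    FROM Rentals r
    JOIN Customers c     ON c.CustomerID = r.CustomerID
    JOIN VideosRented vr ON vr.TransID = r.TransID
    JOIN Videos v        ON v.VideoID = vr.VideoID
    ORDER BY r.TransID, v.VideoID
""").fetchall()
```

Each join follows one of the one-to-many relationships listed above.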
17 3NF Rules/Procedure
Split out repeating sections.
  Be sure to include a key from the parent section in the new piece so the two parts can be recombined.
Verify that the keys are correct.
  Is each row uniquely identified by the primary key?
  Are one-to-many and many-to-many relationships correct?
  Check “many” for keyed columns and “one” for non-key columns.
Make sure that each non-key column depends on the whole key and nothing but the key.
  No hidden dependencies.
18 Fourth Normal Form (Keys)
Problems arise when there are two binary relationships.
In some cases, there are hidden relationships between key properties.
Example: EmployeeTasks(EID, Specialty, ToolID) is in 3NF now.
Business rules:
  Each employee has many specialties.
  Each employee has many tools.
  Tools and specialties are unrelated.
Solution: split EmployeeTasks(EID, Specialty, ToolID) into
  EmployeeSpecialty(EID, Specialty)
  EmployeeTools(EID, ToolID)
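The 4NF split can be illustrated with plain Python sets. The employee, specialties, and tool IDs below are hypothetical sample values, and the function names are illustrative. When tools and specialties are unrelated, the combined table has to hold every combination, while the two smaller tables hold each fact once.

```python
def decompose(employee_tasks):
    """Split EmployeeTasks into EmployeeSpecialty and EmployeeTools."""
    specialties = sorted({(e, s) for (e, s, _t) in employee_tasks})
    tools = sorted({(e, t) for (e, _s, t) in employee_tasks})
    return specialties, tools

def rejoin(specialties, tools):
    """Join the two tables on EID to rebuild every (EID, Specialty, ToolID) row."""
    return sorted(
        {(e, s, t) for (e, s) in specialties for (e2, t) in tools if e == e2}
    )

# (EID, Specialty, ToolID): one employee, two specialties, two tools
employee_tasks = [
    (11, "Welding", "T1"), (11, "Welding", "T2"),
    (11, "Painting", "T1"), (11, "Painting", "T2"),
]
specialties, tools = decompose(employee_tasks)
```

Four rows in the combined table carry the same information as two specialty rows and two tool rows, and the join on EID recreates the original.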
19 Domain-Key Normal Form (DKNF)
DKNF describes the ultimate goal in designing a database.
If a table is in DKNF, it must also be in 4NF, 3NF, and all of the other normal forms.
The catch is that there is no defined method to get a table into DKNF. In fact, it is possible that some tables can never be converted to DKNF.
20 DKNF (Continued)
The goal of DKNF is to have each table represent one topic.
All business rules are explicitly described by table rules. For example, prices cannot be negative.
All other business rules must be expressed in terms of relationships with keys.
In particular, there can be no hidden relationships.
21 No Hidden Dependencies
The simple normalization rules:
  Remove repeating sections.
  Each non-key column must depend on the whole key and nothing but the key.
  There must be no hidden dependencies.
Solution: split the table.
  Make sure you can rejoin the two pieces to recreate the original data relationships.
For some hidden dependencies within keys, double-check the business assumption to be sure that it is realistic. Sometimes you are better off with a more flexible assumption.
22 Create Tables with SQL
CREATE TABLE Customer (
  CustomerID   NUMBER(38),
  LastName     NVARCHAR2(25),
  FirstName    NVARCHAR2(25),
  Phone        NVARCHAR2(25),
  Email        NVARCHAR2(120),   -- column name missing in the source; Email is assumed
  Address      NVARCHAR2(50),
  City         NVARCHAR2(50),
  State        NVARCHAR2(25),
  ZIP          NVARCHAR2(15),
  Gender       NVARCHAR2(15),
  DateOfBirth  DATE,
  -- CustomerID uniquely identifies each customer row
  CONSTRAINT pk_Customer PRIMARY KEY (CustomerID),
  -- Gender must come from a fixed set of values (case-insensitive)
  CONSTRAINT ck_CustGender CHECK (Upper(Gender) IN ('FEMALE', 'MALE', 'UNIDENTIFIED'))
);
23 Data Rules and Integrity
Simple business rules
  Limits on data ranges
    Price > 0
    Salary < 100,000
    DateHired > 1/12/1995
  Choosing from a set
    Gender = M, F, Unknown
    Jurisdiction = City, County, State, Federal
Referential integrity
  Foreign key values in one table must exist in the master table.
  Order(O#, Odate, C#, …): C# must exist in the Customer table.
  Customer(C#, Name, Phone, …): sample rows include C# 321 (Jones), Sanchez, and Carson.
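These rules can be expressed as table constraints. The sketch below assumes SQLite through Python's sqlite3 module; the Item table, the helper accepts, and the rejected sample values are hypothetical. Note that SQLite enforces foreign keys only after the PRAGMA shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Customer (
        CID    INTEGER PRIMARY KEY,
        Name   TEXT,
        Gender TEXT CHECK (Gender IN ('M', 'F', 'Unknown'))   -- choosing from a set
    );
    CREATE TABLE Item (
        ItemID INTEGER PRIMARY KEY,
        Price  REAL CHECK (Price > 0)                          -- limit on a data range
    );
    CREATE TABLE Orders (
        OID INTEGER PRIMARY KEY,
        CID INTEGER REFERENCES Customer (CID)                  -- referential integrity
    );
    INSERT INTO Customer VALUES (321, 'Jones', 'Unknown');
""")

def accepts(sql, params):
    """Return True if the DBMS accepts the row, False if a rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

good_price = accepts("INSERT INTO Item VALUES (?, ?)", (1, 9.99))
bad_price = accepts("INSERT INTO Item VALUES (?, ?)", (2, -1.00))
bad_gender = accepts("INSERT INTO Customer VALUES (?, ?, ?)", (900, "Test", "X"))
good_order = accepts("INSERT INTO Orders VALUES (?, ?)", (1, 321))
orphan_order = accepts("INSERT INTO Orders VALUES (?, ?)", (2, 999))
```

Putting the rules in the table definition means every application that writes to the database is held to them.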
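24 SQL Foreign Key (Oracle, SQL Server)
-- ORDER is a reserved word in SQL, so the table is named Orders here.
CREATE TABLE Orders (
  OID    NUMBER(9) NOT NULL,
  Odate  DATE,
  CID    NUMBER(9),
  CONSTRAINT pk_Order PRIMARY KEY (OID),
  -- Every CID in Orders must exist in Customer;
  -- deleting a customer also deletes that customer's orders.
  CONSTRAINT fk_OrderCustomer FOREIGN KEY (CID)
    REFERENCES Customer (CID)
    ON DELETE CASCADE
);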
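The effect of ON DELETE CASCADE can be seen in a short sketch. It assumes SQLite through Python's sqlite3 module in place of Oracle or SQL Server, uses the table name Orders because ORDER is a reserved word, and loads hypothetical key values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # required for SQLite to enforce the constraint
conn.executescript("""
    CREATE TABLE Customer (CID INTEGER PRIMARY KEY);
    CREATE TABLE Orders (
        OID INTEGER PRIMARY KEY,
        CID INTEGER,
        FOREIGN KEY (CID) REFERENCES Customer (CID) ON DELETE CASCADE
    );
    INSERT INTO Customer VALUES (1), (2);
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Deleting customer 1 also removes orders 10 and 11.
conn.execute("DELETE FROM Customer WHERE CID = 1")
remaining = conn.execute("SELECT OID, CID FROM Orders ORDER BY OID").fetchall()
print(remaining)
```

Without the cascade clause, the delete would be rejected while orders for that customer still exist.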
25 Relationships: Department and Employee
Employee: EmployeeID, TaxpayerID, LastName, FirstName, Address, Phone, City, State, ZIP
Department: Description
Cardinality: 1…1 on the Department side, 1…* on the Employee side
Diagram labels: Foreign Key, Reference Table
26 Estimating Database Size

Column       Data type  Average bytes
CustomerID   Long       4
LastName     Text(50)   30
FirstName    Text(50)   20
Phone        Text(50)   24
(unnamed)    Text(150)  50
Address      Text(50)   50
State        Text(50)   2
ZIP          Text(15)   14
Gender       Text(15)   10
DateOfBirth  Date       8

Average bytes per customer     212
Customers per week (winter)*   200
Weeks (winter)*                25
Bytes added per year           1,060,000
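The arithmetic behind the estimate can be reproduced directly. The byte counts are the ones on the slide; the variable names are illustrative, and one column is listed without a name because its name is missing in the source.

```python
# Average bytes per column, as listed on the slide.
average_bytes = [
    ("CustomerID", 4), ("LastName", 30), ("FirstName", 20), ("Phone", 24),
    ("(unnamed Text(150) column)", 50), ("Address", 50), ("State", 2),
    ("ZIP", 14), ("Gender", 10), ("DateOfBirth", 8),
]
customers_per_week = 200
weeks_per_year = 25

bytes_per_customer = sum(size for _name, size in average_bytes)
customers_per_year = customers_per_week * weeks_per_year
bytes_added_per_year = bytes_per_customer * customers_per_year

print(bytes_per_customer, customers_per_year, bytes_added_per_year)
```

The estimate uses average data lengths, not the declared maximum widths, so it reflects expected rather than worst-case growth.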
27 Data Assumptions
200 customers per week for 25 weeks
2 skills per customer
2 rentals per customer per year
3 items per rental
20 percent of customers buy items
4 items per sale
100 manufacturers
20 models per manufacturer
5 items (sizes) per model
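These assumptions translate into yearly row counts per table. The sketch below is one reading of them: the table names are hypothetical, and it assumes one sale per buying customer, which the slide does not state.

```python
customers = 200 * 25   # customers per week times weeks

row_counts = {
    "Customer": customers,
    "CustomerSkill": customers * 2,      # 2 skills per customer
    "Rental": customers * 2,             # 2 rentals per customer per year
    "RentalItem": customers * 2 * 3,     # 3 items per rental
    "Sale": customers * 20 // 100,       # 20 percent buy; assumes one sale each
    "Manufacturer": 100,
    "Model": 100 * 20,                   # 20 models per manufacturer
    "Item": 100 * 20 * 5,                # 5 items (sizes) per model
}
row_counts["SaleItem"] = row_counts["Sale"] * 4   # 4 items per sale

for table, rows in row_counts.items():
    print(f"{table:14} {rows:>7,}")
```

Multiplying each row count by the table's average row size gives the storage estimate for that table.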
28 Database Table Sizes