Database Processing: Fundamentals, Design, and Implementation
David M. Kroenke and David J. Auer
Chapter Four: Database Design Using Normalization
Chapter Objectives
- To design updatable databases to store data received from another source
- To use SQL to access table structure
- To understand the advantages and disadvantages of normalization
- To understand denormalization
- To design read-only databases to store data from updatable databases
KROENKE AND AUER - DATABASE PROCESSING, 14th Edition © 2016 Pearson Prentice Hall
Chapter Objectives
- To recognize and be able to correct common design problems:
  - The multivalue, multicolumn problem
  - The inconsistent values problem
  - The missing values problem
  - The general-purpose remarks column problem
Chapter Premise
- We have received one or more tables of existing data.
- The data is to be stored in a new database.
- QUESTION: Should the data be stored as received, or should it be transformed for storage?
How Many Tables?
SKU_DATA (SKU, SKU_Description, Buyer)
BUYER (Buyer, Department)
Where SKU_DATA.Buyer must exist in BUYER.Buyer
Should we store these two tables as they are, or should we combine them into one table in our new database?
Normal Forms Review
- 1NF: Eliminate repeating groups. Make a separate table for each set of related attributes, and give each table a primary key.
- 2NF: Eliminate redundant data. Each non-key attribute must be functionally dependent on the whole primary key. If an attribute depends on only part of a composite key, remove it to a separate table.
Normal Forms Review
- 3NF: Eliminate columns not dependent on the key. If attributes do not contribute to a description of the key, remove them to a separate table. Any transitive dependencies are moved into a smaller table.
- BCNF: Every determinant in the table is a candidate key. If there are non-trivial dependencies between candidate key attributes, separate them out into distinct tables.
- The normal forms are cumulative: if a model is in 3NF, it is by definition also in 2NF and 1NF.
Another Example
Putting a Relation into BCNF: EQUIPMENT_REPAIR
Step 1: Is the Table in 1NF?
- A quick scan of the table suggests it is in 1NF.
- Even though a primary key is not identified, one could be determined.
- Remember: since no two rows can be identical in a relation, a composite key made up of all the attributes can always serve as a candidate for the primary key.
Identify Functional Dependencies
EQUIPMENT_REPAIR (ItemNumber, Type, AcquisitionCost, RepairNumber, RepairDate, RepairAmount)
FD:
ItemNumber → (Type, AcquisitionCost)
RepairNumber → (ItemNumber, Type, AcquisitionCost, RepairDate, RepairAmount)
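Whether existing data is consistent with a proposed functional dependency can be checked with a grouping query. The following is a minimal sketch using Python's built-in sqlite3 module; the rows are invented for illustration and are not the slide's data. An empty result means that no row contradicts the dependency. It does not prove that the dependency holds as a business rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE EQUIPMENT_REPAIR (
           ItemNumber INTEGER, Type TEXT, AcquisitionCost REAL,
           RepairNumber INTEGER, RepairDate TEXT, RepairAmount REAL)"""
)
# Hypothetical sample rows: item 1 has been repaired twice.
conn.executemany(
    "INSERT INTO EQUIPMENT_REPAIR VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Type A", 1000.0, 10, "2016-01-05", 50.0),
        (2, "Type B", 2000.0, 11, "2016-01-07", 75.0),
        (1, "Type A", 1000.0, 12, "2016-02-19", 20.0),
    ],
)

# ItemNumber -> (Type, AcquisitionCost): look for an ItemNumber that
# appears with more than one Type or more than one AcquisitionCost.
fd_violations = conn.execute(
    """SELECT ItemNumber FROM EQUIPMENT_REPAIR
       GROUP BY ItemNumber
       HAVING COUNT(DISTINCT Type) > 1
           OR COUNT(DISTINCT AcquisitionCost) > 1"""
).fetchall()

# Counter-check: ItemNumber does NOT determine RepairDate in this data,
# so the same test on RepairDate returns the item repaired twice.
non_fd = conn.execute(
    """SELECT ItemNumber FROM EQUIPMENT_REPAIR
       GROUP BY ItemNumber
       HAVING COUNT(DISTINCT RepairDate) > 1"""
).fetchall()

print(fd_violations, non_fd)
```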
2NF
- Look for a composite primary key (or candidate key).
- The PK for this table could be a composite of all the attributes.
- So the best place to start here is to assess the determinants of the functional dependencies.
- Hint: another way to look at this is to ask whether the table appears to describe more than one entity.
Identify Functional Dependencies
EQUIPMENT_REPAIR (ItemNumber, Type, AcquisitionCost, RepairNumber, RepairDate, RepairAmount)
FD:
ItemNumber → (Type, AcquisitionCost)
RepairNumber → (ItemNumber, Type, AcquisitionCost, RepairDate, RepairAmount)
Is there a determinant that is not a candidate key?
Put into Tables
- ItemNumber is a determinant but not a candidate key, so move it and the attributes it determines to a new table: ITEM (ItemNumber, Type, AcquisitionCost)
- The determinant becomes the primary key of the new table.
- Leave a foreign key in the original table: REPAIR (ItemNumber, RepairNumber, RepairDate, RepairAmount)
Tables
3NF
- Look for transitive dependencies.
- There are no transitive dependencies.
- All functional dependencies have been taken care of.
BCNF
- All determinants are candidate keys.
What Does a Database Do?
- Stores information in a highly organized manner
- Manipulates information in various ways, some of which are not available in other applications or are easier to accomplish with a database
- Models some real-world process or activity through electronic means
  - Often called modeling a business process
  - Often replicates the process only in appearance or end result
The Design Process
1. Identify the purpose of the database
2. Review existing data
3. Make a preliminary list of fields
4. Make a preliminary list of tables and enter fields
5. Identify the key fields
6. Draft the table relationships
7. Enter sample data and normalize the data/tables
8. Review and finalize the design
[HANDOUT: EXERCISE 1]
1. Identify Purpose of the DB
Clients can tell you what information they want, but often have no idea what data they need.
- “We need to keep track of inventory”
- “We need an order entry system”
- “I need monthly sales reports”
- “We need to provide our product catalog on the Web”
Be sure to limit the scope of the database.
1. Continued
- Quite often, the stated intention implies data needs far beyond the client’s knowledge.
- Be sure to offer, or ask about, extending the design to other areas.
- Example: Tracking inventory implies adjusting inventory in stock every time there is a sale, which in turn implies that some method of tracking sales is also needed.
1. Continued
- The client may say “We have a database already for that”, which implies that you, the designer, may need to tap into the existing DB in some manner.
- Or the client may say “We don’t have the budget for that this year; just do the inventory tracking part and we’ll keep track of sales manually”, thus limiting the scope of your design.
2. Review Existing Data
- Electronic
  - Legacy database(s)
  - Spreadsheets
  - Web forms
- Manual
  - Paper forms
  - Receipts and other printed output
3. Make Preliminary Field List
- Make sure fields exist to support needs
  - Ex. If the client wants monthly sales reports, you need a date field for orders.
  - Ex. To group employees by division, you need a division identifier.
- Make sure values are atomic
  - Ex. First and last names stored separately
  - Ex. Addresses broken down to Street, City, State, etc.
- Do not store values that can be calculated from other values
  - Ex. “Age” can be calculated from “Date of Birth”
4. Make Preliminary Tables (and insert the fields into them)
- Each table holds info about one subject
- Don’t worry about the quantity of tables
- Look for logical groupings of information
- Use a consistent naming convention
Naming Conventions
Rules of thumb:
- Table names must be unique in the DB; should be plural
- Field names must be unique within their table
- Clearly identify table subject or field data
- Be as brief as possible
- Avoid abbreviations and acronyms
- Use fewer than 30 characters
- Use letters, numbers, and underscores (_)
- Do not use spaces or other special characters
Uniqueness of field names applies only within the table they are in; fields in different tables can have the same name, and linked fields usually should so they are easily identified.
5. Identify the Key Fields
- Primary Key(s)
  - Can never be Null; must hold unique values
  - Automatically indexed in most RDBMSs
  - Values rarely (if ever) change
  - Try to include as few fields as possible
- Multi-field Primary Key
  - Combination of two or more fields that uniquely identify an individual record
- Candidate Key
  - Field or fields that qualify as a primary key
  - Important in Third and Boyce-Codd Normal Forms
6. Identify Table Relationships
Based on the business rules being modeled. Examples:
- “each customer can place many orders”
- “all employees belong to a department”
- “each TA is assigned to one course”
Historical note: “Relational” as in “Relational Database” has nothing to do with “relationship” as in “table relationships”. Codd was a mathematician, and devised his rules for modern databases based on mathematical set theory. In set theory, a correspondence among the members of two or more sets of values is called a “relation”, and Codd named this type of database “relational” because the database storage structure follows some of the same rules as mathematical sets, not because we relate tables together.
7. Normalization
- Normal Forms (NF): design standards based on database design theory
- Normalization is the process of applying the NFs to table design to eliminate redundancy and create a more efficient organization of DB storage.
- Each successive NF applies an increasingly stringent set of rules.
Much of what we’ll talk about now, and much that you’ve already run into in your own experience, will tell you that common sense can avoid many of these problems. At the very least, some of the earlier steps in the design process will prevent these problems from occurring later in the process. But the normal forms are your safety net. If you aren’t sure whether something belongs in a table, run it through the normal forms to find out. Sometimes the problem isn’t in the table you’re currently analyzing, but in one you’ve already looked at.
8. Finalizing the Design
- Double-check to ensure good, principle-based design
- Evaluate the design in light of the business model and determine desired deviations from design principles
  - Process efficiency
  - Security concerns
Design and Normalization Process Summary
- Watch for repeating values and fields
- Check against the Normal Forms
- Make new tables when necessary
- Re-check all tables against the NFs
- Remember the business rules
- Use common sense, but check anyway!
Assessing Table Structure
Counting Rows in a Table
To count the number of rows in a table, use the SQL COUNT(*) built-in aggregate function.
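As a minimal sketch of the row count, the following uses Python's built-in sqlite3 module with an in-memory copy of SKU_DATA. The rows and the NumRows alias are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SKU_DATA (SKU INTEGER PRIMARY KEY, SKU_Description TEXT, Buyer TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?, ?)",
    [(1, "Item A", "Buyer One"), (2, "Item B", "Buyer Two"), (3, "Item C", "Buyer One")],
)

# COUNT(*) counts every row in the table, including rows that contain NULLs.
row_count = conn.execute("SELECT COUNT(*) AS NumRows FROM SKU_DATA").fetchone()[0]
print(row_count)
```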
Reasons for Counting Rows
There are various reasons why you might need to know the row count of database structures (tables etc.), including:
- Determining whether an application has loaded data
- Estimating how long a query might take to run
- Estimating how long updating statistics might take
- Estimating how long creating an index might take
- Understanding why a query plan has chosen a particular join type
Examining the Columns
- To determine the number and type of columns in a table, use an SQL SELECT statement.
- To limit the number of rows retrieved, use the SQL TOP {NumberOfRows} expression. TOP is Microsoft SQL Server syntax; other DBMS products offer equivalents such as a LIMIT clause.
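The sketch below shows one way to inspect the columns of a table while retrieving only a few rows. It assumes SQLite (through Python's built-in sqlite3 module), so it uses LIMIT in place of TOP, and it uses SQLite's PRAGMA table_info to read the declared column types. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SKU_DATA (SKU INTEGER PRIMARY KEY, SKU_Description TEXT, Buyer TEXT)"
)
# Ten hypothetical rows, so that limiting the result has a visible effect.
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?, ?)",
    [(i, f"Item {i}", "Buyer One") for i in range(1, 11)],
)

# Retrieve only the first five rows; the cursor also exposes the column names.
cursor = conn.execute("SELECT * FROM SKU_DATA LIMIT 5")
column_names = [description[0] for description in cursor.description]
sample_rows = cursor.fetchall()

# SQLite-specific: each table_info row is (cid, name, type, notnull, dflt_value, pk).
column_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(SKU_DATA)")}

print(column_names, len(sample_rows), column_types)
```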
Checking Validity of Assumed Referential Integrity Constraints I
Given two tables with an assumed foreign key constraint:
SKU_DATA (SKU, SKU_Description, Buyer)
BUYER (Buyer, Department)
Where SKU_DATA.Buyer must exist in BUYER.Buyer
Checking Validity of Assumed Referential Integrity Constraints II
- Query for any foreign key values that violate the assumed foreign key constraint, that is, values of SKU_DATA.Buyer that do not appear in BUYER.Buyer.
- An empty set for the query result indicates that no foreign key values violate the foreign key constraint.
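One way to write such a check is an outer join that keeps only the SKU_DATA rows with no match in BUYER. This is a sketch and not necessarily the query shown on the original slide; it uses Python's built-in sqlite3 module, and the rows are invented so that exactly one violation exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BUYER (Buyer TEXT, Department TEXT)")
conn.execute("CREATE TABLE SKU_DATA (SKU INTEGER, SKU_Description TEXT, Buyer TEXT)")
# Hypothetical data: "Buyer Three" is referenced by SKU 3 but missing from BUYER.
conn.executemany(
    "INSERT INTO BUYER VALUES (?, ?)",
    [("Buyer One", "Dept X"), ("Buyer Two", "Dept Y")],
)
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?, ?)",
    [(1, "Item A", "Buyer One"), (2, "Item B", "Buyer Two"), (3, "Item C", "Buyer Three")],
)

# Non-null foreign key values with no matching row in BUYER.
violations = conn.execute(
    """SELECT DISTINCT S.Buyer
       FROM SKU_DATA AS S LEFT JOIN BUYER AS B ON S.Buyer = B.Buyer
       WHERE B.Buyer IS NULL AND S.Buyer IS NOT NULL"""
).fetchall()
print(violations)
```

The outer join is used here instead of NOT IN because NOT IN returns no rows at all when the subquery result contains a NULL.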
Assessing Assumed Constraints
- Placing constraints on how, when, and where data can be entered
- Done after or along with table design
- Part of the design process because many constraints are established at the database and table levels
Referential Integrity
- True relational databases support referential integrity: every non-null foreign key value must match an existing primary key value.
- In other words, every record in a related table that carries a foreign key value must have a matching record in the primary table.
- Preserves the validity of foreign key values.
- Enforced at the database level.
Why is this important?
- Referential integrity helps ensure that the database contains valid and usable values and records by preserving the connection between tables.
- Without it, table relationships quickly become meaningless and queries return unreliable results.
- The most common problem in the absence of referential integrity is the creation of orphan records: a primary key value is changed or its row is deleted, so the related records no longer match anything.
- The default in most RDBMSs is for referential integrity to be off until you declare it, probably because the software can’t tell from the table design whether you want it turned on or not.
- So, what happens when you want to change the value on one side of a set of related records? Referential integrity in its absolute form won’t allow this, so…
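The sketch below illustrates database-level enforcement using SQLite through Python's built-in sqlite3 module. SQLite is a convenient example because, in typical builds, foreign key enforcement is off for each connection until PRAGMA foreign_keys = ON is issued. The rows and the is_rejected helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In typical SQLite builds, enforcement must be switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE BUYER (Buyer TEXT PRIMARY KEY, Department TEXT)")
conn.execute(
    """CREATE TABLE SKU_DATA (
           SKU INTEGER PRIMARY KEY,
           SKU_Description TEXT,
           Buyer TEXT REFERENCES BUYER (Buyer))"""
)
# Hypothetical valid rows.
conn.execute("INSERT INTO BUYER VALUES ('Buyer One', 'Dept X')")
conn.execute("INSERT INTO SKU_DATA VALUES (1, 'Item A', 'Buyer One')")
conn.commit()


def is_rejected(sql):
    """Hypothetical helper: True if the DBMS refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# An orphan row: its Buyer value has no match in BUYER.
orphan_rejected = is_rejected("INSERT INTO SKU_DATA VALUES (2, 'Item B', 'Nobody')")
# Changing a primary key value that is still referenced would orphan SKU 1.
key_change_rejected = is_rejected(
    "UPDATE BUYER SET Buyer = 'Renamed' WHERE Buyer = 'Buyer One'"
)
# A NULL foreign key is allowed: only non-null values must match.
null_rejected = is_rejected("INSERT INTO SKU_DATA VALUES (3, 'Item C', NULL)")

print(orphan_rejected, key_change_rejected, null_rejected)
```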
Levels of Enforcement
- Referential integrity is enforced at the database level because it affects the relationship between two tables.
- Many other business rules are enforced at the field and table level to ensure data integrity.
- Business rule implementation should be documented: how and where each rule is enforced in the design.
- Some rules can’t be enforced at the table or field level; they must be enforced at the application level.
Testing of Business Rules
- Always test business rule implementation
  - What happens when the rule is met?
  - What happens when the rule is violated?
- A data entry constraint is not much good if it doesn’t constrain properly.
- Good application or interface design will provide feedback when the user violates a constraint or rule.
Type of Database
- Updatable database, or read-only database?
- If updatable database, we normally want tables in BCNF.
- If read-only database, we may not use BCNF tables.
Designing Updatable Databases
- Updatable databases are typically the operational databases of a company, such as the online transaction processing (OLTP) system discussed for Cape Codd Outdoor Sports at the beginning of Chapter 2.
- If you are constructing an updatable database, then you need to be concerned about modification anomalies and inconsistent data.
- Consequently, you must carefully consider normalization principles.
Normalization: Advantages and Disadvantages
Why do we say reduce data duplication rather than eliminate data duplication? The answer is that we cannot eliminate all duplicated data because we must duplicate data in foreign keys. We cannot eliminate Buyer, for example, from the SKU_DATA table because we would then not be able to relate BUYER and SKU_DATA rows. Values of Buyer are thus duplicated in the BUYER and SKU_DATA tables.
This observation leads to a second question: If we only reduce data duplication, how can we claim to eliminate inconsistent data values? Data duplication in foreign keys will not cause inconsistencies because referential integrity constraints prohibit them. As long as we enforce such constraints, the duplicate foreign key values will cause no inconsistencies.
Non-Normalized Table: EQUIPMENT_REPAIR
Normalized Tables: ITEM and REPAIR
Copying Data to New Tables
To copy data from one table to another, use the SQL INSERT statement with a SELECT that reads the source table.
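As a sketch of the copy step, the following populates ITEM and REPAIR from EQUIPMENT_REPAIR with INSERT ... SELECT, using Python's built-in sqlite3 module. The column types and the sample rows are assumptions made for illustration. DISTINCT is needed for ITEM because an item repaired more than once appears in several EQUIPMENT_REPAIR rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE EQUIPMENT_REPAIR (
           ItemNumber INTEGER, Type TEXT, AcquisitionCost REAL,
           RepairNumber INTEGER, RepairDate TEXT, RepairAmount REAL)"""
)
# Hypothetical sample rows: item 1 has two repairs.
conn.executemany(
    "INSERT INTO EQUIPMENT_REPAIR VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Type A", 1000.0, 10, "2016-01-05", 50.0),
        (2, "Type B", 2000.0, 11, "2016-01-07", 75.0),
        (1, "Type A", 1000.0, 12, "2016-02-19", 20.0),
    ],
)
conn.execute(
    "CREATE TABLE ITEM (ItemNumber INTEGER PRIMARY KEY, Type TEXT, AcquisitionCost REAL)"
)
conn.execute(
    """CREATE TABLE REPAIR (
           RepairNumber INTEGER PRIMARY KEY,
           ItemNumber INTEGER REFERENCES ITEM (ItemNumber),
           RepairDate TEXT, RepairAmount REAL)"""
)

# One ITEM row per distinct item.
conn.execute(
    """INSERT INTO ITEM (ItemNumber, Type, AcquisitionCost)
       SELECT DISTINCT ItemNumber, Type, AcquisitionCost FROM EQUIPMENT_REPAIR"""
)
# One REPAIR row per repair, keeping ItemNumber as the foreign key.
conn.execute(
    """INSERT INTO REPAIR (RepairNumber, ItemNumber, RepairDate, RepairAmount)
       SELECT RepairNumber, ItemNumber, RepairDate, RepairAmount FROM EQUIPMENT_REPAIR"""
)

item_count = conn.execute("SELECT COUNT(*) FROM ITEM").fetchone()[0]
repair_count = conn.execute("SELECT COUNT(*) FROM REPAIR").fetchone()[0]
print(item_count, repair_count)
```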
Final Steps
In Chapters 7 and 8, you will learn how to:
- Remove unneeded tables after the data is copied, using the SQL DROP TABLE statement.
- Create the referential integrity constraint, using the SQL ALTER TABLE statement.
Choosing Not To Use BCNF
- BCNF is used to control anomalies from functional dependencies.
- There are times when BCNF is not desirable.
- The classic example is ZIP codes:
  - ZIP codes almost never change.
  - Any anomalies are likely to be caught by normal business practices.
  - Not having to use SQL to join data in two tables will speed up application processing.
Multivalued Dependencies
- Anomalies from multivalued dependencies are very problematic.
- Always place the columns of a multivalued dependency into a separate table (4NF).
Designing Read-Only Databases
- The extracted sales data that we used for Cape Codd Outdoor Sports in Chapter 2 is a small but typical example of a read-only database.
- Read-only databases are used in business intelligence (BI) systems for producing information for assessment, analysis, planning, and control, as we discussed for Cape Codd Outdoor Sports in Chapter 2.
- Read-only databases are commonly used in a data warehouse, which we also introduced in Chapter 2.
Read-Only Databases
- Read-only databases are nonoperational databases using data extracted from operational databases.
- They are used for querying, reporting, and data mining applications.
- They are never updated in the operational database sense, although they may have new data imported from time to time.
Denormalization
- For read-only databases, normalization is seldom an advantage. Application processing speed is more important.
- Denormalization is the joining of the data in normalized tables prior to storing the data.
- The data is then stored in nonnormalized tables.
Normalized Tables
Denormalizing the Data
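Joining normalized tables and storing the result can be sketched as follows, using Python's built-in sqlite3 module and the SKU_DATA and BUYER tables from earlier in the chapter. The target table name SKU_BUYER and the sample rows are invented for illustration, and this is not necessarily the example shown on the original slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BUYER (Buyer TEXT PRIMARY KEY, Department TEXT)")
conn.execute(
    """CREATE TABLE SKU_DATA (
           SKU INTEGER PRIMARY KEY,
           SKU_Description TEXT,
           Buyer TEXT REFERENCES BUYER (Buyer))"""
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO BUYER VALUES (?, ?)",
    [("Buyer One", "Dept X"), ("Buyer Two", "Dept Y")],
)
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?, ?)",
    [(1, "Item A", "Buyer One"), (2, "Item B", "Buyer Two"), (3, "Item C", "Buyer One")],
)

# Join once, store the result: readers of SKU_BUYER never need to join.
# Department is now repeated for every SKU handled by the same buyer.
conn.execute(
    """CREATE TABLE SKU_BUYER AS
       SELECT S.SKU, S.SKU_Description, S.Buyer, B.Department
       FROM SKU_DATA AS S JOIN BUYER AS B ON S.Buyer = B.Buyer"""
)

denormalized_rows = conn.execute("SELECT * FROM SKU_BUYER ORDER BY SKU").fetchall()
print(denormalized_rows)
```

The repeated Department values are acceptable here only because the table is read-only; in an updatable database the same redundancy would invite modification anomalies.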
Customized Tables I
- Read-only databases are often designed with many copies of the same data, but with each copy customized for a specific application.
- Consider the PRODUCT table.
Customized Tables II
Common Design Problems
The Multivalue, Multicolumn Problem
- The multivalue, multicolumn problem occurs when multiple values of an attribute are stored in more than one column:
  EMPLOYEE (EmployeeNumber, EmployeeLastName, Auto2_LicenseNumber, Auto3_LicenseNumber)
- This is another form of a multivalued dependency.
- Solution: as with the 4NF solution for multivalued dependencies, use a separate table to store the multiple values.
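The separate-table solution can be sketched as below with Python's built-in sqlite3 module. The sketch assumes that EMPLOYEE also has an Auto1_LicenseNumber column alongside the two listed above; the table name EMPLOYEE_AUTO, the column name LicenseNumber, and all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: Auto1_LicenseNumber exists in addition to the listed columns.
conn.execute(
    """CREATE TABLE EMPLOYEE (
           EmployeeNumber INTEGER PRIMARY KEY,
           EmployeeLastName TEXT,
           Auto1_LicenseNumber TEXT,
           Auto2_LicenseNumber TEXT,
           Auto3_LicenseNumber TEXT)"""
)
# Hypothetical rows: unused auto columns are NULL.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?)",
    [(1, "Lee", "AAA111", "BBB222", None), (2, "Diaz", "CCC333", None, None)],
)

# Hypothetical replacement table: one row per employee per auto.
conn.execute(
    """CREATE TABLE EMPLOYEE_AUTO (
           EmployeeNumber INTEGER REFERENCES EMPLOYEE (EmployeeNumber),
           LicenseNumber TEXT,
           PRIMARY KEY (EmployeeNumber, LicenseNumber))"""
)
# Stack the three columns into rows, skipping the empty slots.
conn.execute(
    """INSERT INTO EMPLOYEE_AUTO (EmployeeNumber, LicenseNumber)
       SELECT EmployeeNumber, Auto1_LicenseNumber FROM EMPLOYEE
           WHERE Auto1_LicenseNumber IS NOT NULL
       UNION ALL
       SELECT EmployeeNumber, Auto2_LicenseNumber FROM EMPLOYEE
           WHERE Auto2_LicenseNumber IS NOT NULL
       UNION ALL
       SELECT EmployeeNumber, Auto3_LicenseNumber FROM EMPLOYEE
           WHERE Auto3_LicenseNumber IS NOT NULL"""
)

auto_rows = conn.execute(
    "SELECT EmployeeNumber, LicenseNumber FROM EMPLOYEE_AUTO "
    "ORDER BY EmployeeNumber, LicenseNumber"
).fetchall()
print(auto_rows)
```

With the separate table, an employee can have any number of autos, and a query for a given license number checks one column instead of three.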
Inconsistent Values I
Inconsistent values occur when different users, or different data sources, use slightly different forms of the same data value:
- Different codings:
  - SKU_Description = 'Corn, Large Can'
  - SKU_Description = 'Can, Corn, Large'
  - SKU_Description = 'Large Can Corn'
- Different spellings:
  - Coffee, Cofee, Coffeee
Inconsistent Values II
- Particularly problematic are primary or foreign key values.
- To detect:
  - Use the referential integrity check already discussed for checking keys.
  - Use the SQL GROUP BY clause on suspected columns.
Inconsistent Values III
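A GROUP BY check on a suspected column can be sketched as follows with Python's built-in sqlite3 module. The rows are invented, reusing the spelling variants mentioned earlier; near-duplicate values with low counts stand out in the grouped result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SKU_DATA (SKU INTEGER PRIMARY KEY, SKU_Description TEXT)")
# Hypothetical rows containing two misspellings of the same value.
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?)",
    [(1, "Coffee"), (2, "Cofee"), (3, "Coffee"), (4, "Coffeee")],
)

# One row per distinct value, with the number of times it occurs.
value_counts = conn.execute(
    """SELECT SKU_Description, COUNT(*) AS NameCount
       FROM SKU_DATA
       GROUP BY SKU_Description
       ORDER BY SKU_Description"""
).fetchall()
print(value_counts)
```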
Missing Values
- A missing value or null value is a value that has never been provided.
- In a database table, a null value appears in uppercase letters as NULL.
Null Values
Null values are ambiguous:
- May indicate that a value is inappropriate: DateOfLastChildbirth is inappropriate for a male.
- May indicate that a value is appropriate but unknown: DateOfLastChildbirth is appropriate for a female, but may be unknown.
- May indicate that a value is appropriate and known, but has never been entered: DateOfLastChildbirth is appropriate for a female, and may be known, but no one has recorded it in the database.
Checking for Null Values
Use the SQL IS NULL operator to check for null values.
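A minimal sketch of the null check, using Python's built-in sqlite3 module and invented rows, follows. It also shows why IS NULL is needed: a comparison written with = NULL never evaluates to true, so it matches no rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SKU_DATA (SKU INTEGER PRIMARY KEY, SKU_Description TEXT, Buyer TEXT)"
)
# Hypothetical rows: SKU 2 has no Buyer recorded.
conn.executemany(
    "INSERT INTO SKU_DATA VALUES (?, ?, ?)",
    [(1, "Item A", "Buyer One"), (2, "Item B", None), (3, "Item C", "Buyer Two")],
)

# Correct test for missing values.
null_count = conn.execute(
    "SELECT COUNT(*) FROM SKU_DATA WHERE Buyer IS NULL"
).fetchone()[0]

# Incorrect test: "= NULL" yields unknown for every row, so nothing is counted.
equals_null_count = conn.execute(
    "SELECT COUNT(*) FROM SKU_DATA WHERE Buyer = NULL"
).fetchone()[0]

print(null_count, equals_null_count)
```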
The General-Purpose Remarks Column
- A general-purpose remarks column is a column with a name such as Remarks, Comments, or Notes.
- It often contains important data stored in an inconsistent, verbal, and verbose way.
- A typical use is to store data on a customer’s interests.
- Such a column may:
  - Be used inconsistently
  - Hold multiple data items
The General-Purpose Remarks Column: Hidden Foreign Key Data
In a typical situation, the data for the foreign key may have been recorded in the Remarks column:
- 'Wants to buy a Piper Seneca II'
- 'Owner of a Piper Seneca II'
- 'Possible buyer for a turbo Seneca'
End of Presentation: Chapter Four
David Kroenke and David Auer
Database Processing: Fundamentals, Design, and Implementation (14th Edition)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.