The Process of Normalisation
Relational Databases and Normalisation (outline): What is a relational database? Database design. Problems with design: modification anomalies and dependencies. Normalisation.
What is a Relational Database? A collection of data organised into two-dimensional tables (also called relations). These tables comprise rows (tuples) and fields/columns (attributes). Tables in a relational database are linked to each other via common fields.
Guidelines for database design: identify all the fields required; group related fields into tables; determine the primary key for each table; make sure related tables have a common field; avoid data redundancy; determine the properties of each field (name, length, description, valid values); develop the user interface.
Relations: Relation is the formal term for a table; attribute is the formal term for a field. Shorthand notation: RelationName(attribute1, attribute2, attribute3, …). The primary key is indicated by underlining, e.g. in Student(studentID, FirstName, LastName) the primary key studentID is underlined.
Candidate keys: The primary key of a table is the field, or combination of fields, used to uniquely identify a record in the table. If the primary key is a combination of fields it is called a composite key. The value of a primary key cannot be null. A candidate key is any field (or minimal combination of fields) that could serve as the primary key; the primary key is the candidate key actually chosen.
Candidate key example: Elements table. Question: which fields are candidate keys? Answer: all of them, since any one of ElementName, ElementSymbol or AtomicNumber uniquely determines a record.
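The candidate key question can be explored with a short Python sketch. The three sample rows and the helper names `is_unique` and `candidate_keys` are illustrative assumptions, not part of the lecture material. Note that uniqueness within sample data only suggests a candidate key; whether a field is truly a key depends on every value the table could ever hold.

```python
from itertools import combinations

# Illustrative rows for an Elements table (assumed sample data).
elements = [
    {"ElementName": "Hydrogen", "ElementSymbol": "H", "AtomicNumber": 1},
    {"ElementName": "Helium", "ElementSymbol": "He", "AtomicNumber": 2},
    {"ElementName": "Lithium", "ElementSymbol": "Li", "AtomicNumber": 3},
]

def is_unique(rows, fields):
    """True if the given fields take a different value combination in every row."""
    seen = {tuple(row[f] for f in fields) for row in rows}
    return len(seen) == len(rows)

def candidate_keys(rows):
    """Return the minimal field combinations that identify every row uniquely."""
    fields = list(rows[0])
    keys = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            # A combination containing a smaller key is not minimal, so skip it.
            if any(set(key) <= set(combo) for key in keys):
                continue
            if is_unique(rows, combo):
                keys.append(combo)
    return keys

print(candidate_keys(elements))
```

For this sample each single field is reported as a candidate key, which matches the answer given on the slide.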
Modification Anomalies: Modification anomalies are among the problems that can occur with poorly structured databases. There are three types, concerning the insertion, deletion and updating of data.
Deletion Anomaly: A deletion anomaly occurs when deliberately deleting one piece of data accidentally causes other data to be lost.
BUSINESS table: e.g. if Baker leaves the company and the record containing data on Baker is deleted from the database, then the information that Cody is the manager of the project ‘Identify New Investments’ is also lost.
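A small Python sketch can make the deletion anomaly concrete. The names Baker, Cody, Yates and the two project names come from the lecture's examples; the remaining rows, the column names and the numbers are assumptions made for illustration, since the BUSINESS table itself is not reproduced here.

```python
# Hypothetical BUSINESS rows: one row per (employee, project) assignment.
business = [
    {"EmpNo": 1, "Employee": "Baker", "ProjNo": 10,
     "Project": "Identify New Investments", "Manager": "Cody"},
    {"EmpNo": 2, "Employee": "Employee A", "ProjNo": 20,
     "Project": "New Billing System", "Manager": "Yates"},
    {"EmpNo": 3, "Employee": "Employee B", "ProjNo": 20,
     "Project": "New Billing System", "Manager": "Yates"},
]

def managers(rows):
    """Map each project that appears in the rows to its manager."""
    return {row["Project"]: row["Manager"] for row in rows}

before = managers(business)

# Deliberately delete Baker's record...
remaining = [row for row in business if row["Employee"] != "Baker"]
after = managers(remaining)

# ...and find which project/manager facts were lost as a side effect.
lost = set(before) - set(after)
print(lost)
```

Because Baker was the only person assigned to that project in this sample, deleting Baker also deletes the only row that records who manages it.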
Insertion Anomaly: An insertion anomaly occurs when new data cannot be inserted into a relation because it is not possible to assemble a complete primary key.
BUSINESS table: e.g. suppose a new project is planned for the organisation and data about its manager must be added. The primary key of this table is the concatenation of Employee Number and Project Number. Key values cannot be null, so the required data cannot be added until at least one person has been assigned to work on the project.
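The insertion anomaly can be reproduced with Python's built-in sqlite3 module. This is a sketch under assumptions: the column names and the new project's details are hypothetical, and the table keeps only the columns needed to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL is stated explicitly because SQLite does not otherwise
# enforce it for the columns of a composite primary key.
conn.execute("""
    CREATE TABLE business (
        emp_no  INTEGER NOT NULL,
        proj_no INTEGER NOT NULL,
        project TEXT,
        manager TEXT,
        PRIMARY KEY (emp_no, proj_no)
    )
""")

try:
    # A new project with a manager but nobody assigned yet: no emp_no exists.
    conn.execute(
        "INSERT INTO business (emp_no, proj_no, project, manager) VALUES (?, ?, ?, ?)",
        (None, 30, "Hypothetical New Project", "Hypothetical Manager"),
    )
    inserted = True
except sqlite3.IntegrityError as exc:
    inserted = False
    print("Insert rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM business").fetchone()[0]
```

The insert is rejected, so the fact about the project's manager cannot be stored until an employee is assigned.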
Update Anomaly: An update anomaly can occur when redundant data has to be updated. Unless all records containing the data to be changed are updated, the resulting database will be inconsistent.
BUSINESS table: e.g. suppose a project gets a new manager, say Yates is to be replaced by Martin as the manager of the project ‘New Billing System’. This requires a change in more than one record in the table; if any of those records is missed, the table becomes inconsistent. This is known as an update anomaly.
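The update anomaly can be sketched in Python as follows. Yates, Martin and the project name come from the example; the two assignment rows and the helper `inconsistent_projects` are assumptions for illustration.

```python
# Hypothetical rows: two employees assigned to the same project.
business = [
    {"EmpNo": 2, "ProjNo": 20, "Project": "New Billing System", "Manager": "Yates"},
    {"EmpNo": 3, "ProjNo": 20, "Project": "New Billing System", "Manager": "Yates"},
]

def inconsistent_projects(rows):
    """Return the projects that are recorded with more than one manager."""
    seen = {}
    for row in rows:
        seen.setdefault(row["Project"], set()).add(row["Manager"])
    return {project for project, names in seen.items() if len(names) > 1}

# Partial update: only the first matching record is changed.
business[0]["Manager"] = "Martin"
after_partial_update = inconsistent_projects(business)

# Complete update: every record for the project is changed.
for row in business:
    if row["Project"] == "New Billing System":
        row["Manager"] = "Martin"
after_full_update = inconsistent_projects(business)

print(after_partial_update, after_full_update)
```

The partial update leaves the project with two different managers on record; only updating every redundant copy restores consistency.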
Terminology: dependency, functional dependency, full functional dependency, partial dependency, transitive dependency. Normalisation is based on the analysis of functional dependence, which describes a particular relationship between two attributes.
Dependency: describes the relationship between attributes in terms of how one value fixes or determines the value of another.
Functional Dependency: exists when a unique value of one attribute can always be determined if we know the value of another. Both attributes can be composite.
Total Dependency: exists between attribute X and attribute Y iff X is functionally dependent on Y and vice versa.
Transitive Dependency: when a non-key attribute in a relation is fully dependent on another non-key attribute, e.g. if Student_No ---> Course and Course ---> Tutor then Student_No ---> Tutor.
Mutual Independence: two or more attributes are mutually independent if none of the attributes concerned is functionally dependent on any of the others.
Functional Dependency: A field is (fully) functionally dependent on the primary key if, given any value of the primary key, the contents of that field are uniquely determined by the whole of the primary key. E.g. in Student(studentID, FirstName, LastName), FirstName and LastName are fully functionally dependent on studentID. The attribute on the LHS of the arrow in a functional dependency is called a determinant. Medicare_No, Reg_No and ISBN are determinants in the above examples. In the previous example, the EMP_ID and COURSE relation, the combination of both EMP_ID and COURSE is a determinant.
Example of an instance where a functional dependency does not exist:
A  B  C  D
X  U  X  Y
Y  X  Z  X
Z  Y  Y  Y
Y  Z  W  Z
Since A does not uniquely determine B (A = Y appears with both B = X and B = Z), B is not functionally dependent on A.
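A functional dependency can be tested against sample data with a few lines of Python. The sketch below reads the four-row A, B, C, D example row by row (an assumption about the original table layout); the function name `determines` is illustrative. A check like this can only disprove a dependency: passing on sample rows does not prove that it holds in general.

```python
# The A, B, C, D example, read row by row.
rows = [
    {"A": "X", "B": "U", "C": "X", "D": "Y"},
    {"A": "Y", "B": "X", "C": "Z", "D": "X"},
    {"A": "Z", "B": "Y", "C": "Y", "D": "Y"},
    {"A": "Y", "B": "Z", "C": "W", "D": "Z"},
]

def determines(rows, lhs, rhs):
    """True if rows that agree on the lhs fields always agree on the rhs fields."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in lhs)
        value = tuple(row[f] for f in rhs)
        # setdefault records the first value seen for this determinant;
        # a later, different value means the dependency is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True

print(determines(rows, ["A"], ["B"]))  # A -> B is violated in this instance
print(determines(rows, ["B"], ["A"]))  # B -> A is not violated in this instance
```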
Partial Dependency: A field is partially dependent on the primary key if its contents are uniquely determined by only part of the primary key. e.g. …..
e.g. TRAVEL CLUB table:
E.g. in the TRAVEL CLUB table it can be seen that Cost is dependent only on Destination and Travel Date. The primary key of the table is the concatenation of Membership Number, Destination and Travel Date. Cost can therefore be determined by a subset of the primary key, so Cost is not fully functionally dependent on the primary key; it is only partially dependent.
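The partial dependency can be checked on sample data in Python. The rows below are hypothetical (the TRAVEL CLUB table itself is not reproduced here), and `determines` is an illustrative helper rather than a standard function.

```python
# Hypothetical TRAVEL CLUB rows; the primary key is
# (MembershipNo, Destination, TravelDate).
travel_club = [
    {"MembershipNo": 101, "Destination": "Rome", "TravelDate": "2024-05-01", "Cost": 900},
    {"MembershipNo": 102, "Destination": "Rome", "TravelDate": "2024-05-01", "Cost": 900},
    {"MembershipNo": 101, "Destination": "Cairo", "TravelDate": "2024-06-10", "Cost": 1200},
    {"MembershipNo": 103, "Destination": "Rome", "TravelDate": "2024-07-15", "Cost": 1000},
]

def determines(rows, lhs, rhs):
    """True if rows that agree on the lhs fields always agree on the rhs fields."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in lhs)
        value = tuple(row[f] for f in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

# Cost is fixed by part of the key: Destination together with TravelDate.
part_of_key_fixes_cost = determines(travel_club, ["Destination", "TravelDate"], ["Cost"])
# Destination alone is not enough in this sample.
destination_fixes_cost = determines(travel_club, ["Destination"], ["Cost"])

print(part_of_key_fixes_cost, destination_fixes_cost)
```

Since a proper subset of the key already fixes Cost, the Cost value is repeated for every member who books the same trip, which is the redundancy that 2NF removes.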
Transitive Dependence: A transitive dependency (or non-key dependency) occurs when the contents of a non-key field are dependent on the contents of other non-key fields as well as, or rather than, the primary key. (Note that if the other non-key field is also a candidate key, the dependency is not considered to be transitive.)
e.g. MEDICAL table:
E.g. in the MEDICAL table, Patient’s Employer is dependent on Medical Record Number, and Patient’s Employer determines Employer’s Address, i.e. Employer’s Address is transitively dependent on Medical Record Number through the non-key field Patient’s Employer.
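Removing the transitive dependency amounts to moving the employer's details into their own table. The Python sketch below uses hypothetical rows and column names (the MEDICAL table is not reproduced here) and checks that the original rows can be rebuilt from the two smaller tables.

```python
# Hypothetical MEDICAL rows; MedicalRecordNo is the primary key.
medical = [
    {"MedicalRecordNo": 1, "Patient": "Patient A",
     "Employer": "Acme", "EmployerAddress": "1 Example St"},
    {"MedicalRecordNo": 2, "Patient": "Patient B",
     "Employer": "Acme", "EmployerAddress": "1 Example St"},
    {"MedicalRecordNo": 3, "Patient": "Patient C",
     "Employer": "Globex", "EmployerAddress": "2 Sample Rd"},
]

# Patients table: only the fields that depend directly on the key.
patients = [
    {k: row[k] for k in ("MedicalRecordNo", "Patient", "Employer")}
    for row in medical
]

# Employers table: one entry per employer, keyed by the employer name
# (this assumes employer names are unique).
employers = {row["Employer"]: row["EmployerAddress"] for row in medical}

# Rejoin the two tables through the common Employer field.
rejoined = [dict(p, EmployerAddress=employers[p["Employer"]]) for p in patients]

print(len(employers), rejoined == medical)
```

Each address is now stored once per employer instead of once per patient, and no information has been lost.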
Normalisation: Normalisation is a method of building a database so that it easily accommodates changes and avoids problems such as redundant data and modification anomalies.
References: Kendall & Kendall, Systems Analysis and Design, Prentice-Hall; Date, C.J., Database Systems, Addison-Wesley.
Goals of Normalisation: the database is easier to understand and simpler to implement; it reflects the meaning of the situation being modelled; it is more amenable to processing new requests for data; it prevents storage of invalid information. When data items are put together in a haphazard way, the above criteria may be compromised. For example, when data items that are logically unrelated are aggregated, users can become confused. Experience has shown that most problems can be traced to improper conceptual database designs. Normalisation is a technique that structures data in ways that help reduce or prevent problems. It results in logically consistent record structures that are easy to understand and simple to maintain. Several levels of normalisation can be obtained.
Normalisation steps: The process of normalisation includes the following steps:
Table with repeating groups -> (remove repeating groups) -> 1NF -> (remove partial dependencies) -> 2NF -> (remove transitive dependencies) -> 3NF -> (remove remaining anomalies) -> Boyce-Codd NF -> (remove multi-valued dependencies) -> 4NF -> (remove remaining anomalies) -> 5NF
Normalisation is a process of converting complex data structures into simple, stable data structures. It is often accomplished in stages, where each stage corresponds to a normal form. It organises data in such a way that we can minimise data redundancy and avoid modification anomalies. Normalisation converts a table into tables of progressively smaller degree until an optimum level of decomposition is reached, i.e. where little or no data redundancy exists. The First, Second and Third Normal Forms label the stages in normalisation, where each form is governed by progressively stricter rules. For most cases, relations in Third Normal Form are sufficient. However, 3NF does not guarantee that all anomalies have been removed, so Extended Forms have been developed to cope with these anomalies. The Extended Forms are: Boyce-Codd Normal Form; Fourth Normal Form; Fifth Normal Form.
Results of a successful normalisation effort: the amount of space needed to store data may be lower; the table can be updated with greater efficiency; no information is lost during deletion; insertion of a row into the table is not affected by unavailable data; the description of the database is straightforward.
Normalisation: STEP 1: Eliminate repeating groups (by splitting into 2 or more tables, explanation shortly) and ensure all tables have a primary key. When this has been done the database is said to be in first normal form (1NF).
STEP 2: The database must first be in 1NF. Remove all partial dependencies (by splitting into 2 or more tables, explanation shortly). When this has been done the database is said to be in second normal form (2NF).
Normalisation: STEP 3: The database must first be in 2NF. Remove all non-key (transitive) dependencies (by splitting into 2 or more tables, explanation shortly). When this has been done the database is said to be in third normal form (3NF).
What is a repeating group? A repeating group is a column (field), or combination of columns (fields), that contains several data values in each row (in general, different numbers of values in different rows). Ref. Date. E.g. a repeating group.
Normalisation Example 1: Recall that a relation is in first normal form (1NF) if it contains no repeating groups. Also, the first property of a relation is that the value at the intersection of each row and column is atomic. Thus the above table, which has a repeating group, is not in 1NF.
Remove Repeating Groups by splitting into 2 or more tables: each resulting table requires a primary key, and the tables require a common field. A table with repeating groups is converted to a relation in 1NF by extending the data in each column to fill cells that are empty because of the repeating group structures. The result is a table in 1NF.
Step 2: are there any partial dependencies? The table is also in 2NF, as there are no partial dependencies.
Step 3: are there transitive dependencies? The table is now also in 3NF, as there are no transitive dependencies (we are assuming the names are unique).
Normalisation Example 2: You would have difficulty retrieving information from this table because too much data is stored in the Items column. Think how difficult it would be to create a report summarising the number of purchases by item. The Items field is known as a repeating group.
You could redesign the Order table so that the Item information is divided into several Item and Quantity columns, but there are still problems with this design:
For example, how would you go about finding the quantity of hammers ordered by all customers in a particular month? Any query would have to search all three Item columns to determine whether a hammer was purchased, then sum over the Quantity columns. Worse still, what if a customer ordered more than three items in a single order? You could add more columns, but where would you stop: 10 items, 20 items? If you decided that a customer would never order more than 25 items then you could include 25 Item and 25 Quantity columns. However, for orders that involve only one or two items this would clearly be a waste of space. Fields such as the Quantity and Item fields above are also known as repeating groups.
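The awkwardness of the hammer query can be seen in a short Python sketch. The column names Item1 to Item3 and Quantity1 to Quantity3, and all the sample orders, are assumptions for illustration.

```python
# Hypothetical orders stored with three Item/Quantity column pairs.
orders_wide = [
    {"OrderID": 1, "OrderDate": "2024-03-05",
     "Item1": "Hammer", "Quantity1": 2,
     "Item2": "Nails", "Quantity2": 100,
     "Item3": None, "Quantity3": None},
    {"OrderID": 2, "OrderDate": "2024-03-18",
     "Item1": "Saw", "Quantity1": 1,
     "Item2": "Hammer", "Quantity2": 1,
     "Item3": "Glue", "Quantity3": 3},
    {"OrderID": 3, "OrderDate": "2024-04-02",
     "Item1": "Hammer", "Quantity1": 5,
     "Item2": None, "Quantity2": None,
     "Item3": None, "Quantity3": None},
]

def quantity_in_month(rows, item, month, slots=3):
    """Total quantity of an item ordered in a month given as 'YYYY-MM'."""
    total = 0
    for row in rows:
        if not row["OrderDate"].startswith(month):
            continue
        # Every Item column has to be inspected, because the hammer
        # could be in any of them.
        for n in range(1, slots + 1):
            if row[f"Item{n}"] == item:
                total += row[f"Quantity{n}"]
    return total

print(quantity_in_month(orders_wide, "Hammer", "2024-03"))
```

The `slots` parameter has to grow with every extra pair of columns, and an order with more items than slots cannot be stored at all.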
Step 1: For a table to be in first normal form we must remove repeating groups. A table design that does this adds another field, OrderItemID, so that each item ordered has its own row. The primary key of this table is a composite key made up of OrderID and OrderItemID.
To make it more realistic we could add a ProductID field and a ProductDescription field. The table is now in first normal form.
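Converting the repeating Item and Quantity columns into one row per item can be sketched in Python. The wide column names, the sample orders and the function name `to_first_normal_form` are assumptions; OrderID, OrderItemID, CustomerID and OrderDate follow the field names used in this example.

```python
# Hypothetical orders with repeating Item/Quantity column pairs.
orders_wide = [
    {"OrderID": 1, "CustomerID": 7, "OrderDate": "2024-03-05",
     "Item1": "Hammer", "Quantity1": 2,
     "Item2": "Nails", "Quantity2": 100,
     "Item3": None, "Quantity3": None},
    {"OrderID": 2, "CustomerID": 9, "OrderDate": "2024-03-18",
     "Item1": "Saw", "Quantity1": 1,
     "Item2": "Hammer", "Quantity2": 1,
     "Item3": "Glue", "Quantity3": 3},
]

def to_first_normal_form(rows, slots=3):
    """Emit one row per ordered item, numbered by OrderItemID within each order."""
    flat = []
    for row in rows:
        item_no = 0
        for n in range(1, slots + 1):
            if row[f"Item{n}"] is None:
                continue  # empty slot: nothing to store
            item_no += 1
            flat.append({
                "OrderID": row["OrderID"],
                "OrderItemID": item_no,
                "CustomerID": row["CustomerID"],
                "OrderDate": row["OrderDate"],
                "Item": row[f"Item{n}"],
                "Quantity": row[f"Quantity{n}"],
            })
    return flat

order_items = to_first_normal_form(orders_wide)

# The hammer question now needs only one column to be inspected.
hammers = sum(r["Quantity"] for r in order_items if r["Item"] == "Hammer")
print(len(order_items), hammers)
```

An order can now hold any number of items, and no space is wasted on empty slots.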
Second Normal Form (2NF): Step 2: For a table to be in second normal form (2NF) it must be in 1NF and every non-key field must be dependent on the entire primary key (i.e. fully dependent).
The 1NF table is in 2NF only if each non-key field is fully dependent on both OrderID and OrderItemID. Is this true? No: given the value of OrderID alone, the date and customer are fully determined. In other words, CustomerID and OrderDate depend on only part of the primary key, not on the entire key. So this table is not in 2NF. Second normal form can be achieved by breaking the table into two:
The two tables share OrderID as their common field.
In this case the original table had a composite key, so we put everything relating to OrderID in one table and everything that applies to the order items in another table.
Note: When normalising, no information is thrown away. Decomposition should be done in such a way that the tables can be put back together again using queries. Thus it is important that the OrderDetails table contains a foreign key to the Orders table.
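The claim that the tables can be put back together with a query can be checked using Python's built-in sqlite3 module. The sample rows and the Quantity column are assumptions; the table and key names follow the Orders and OrderDetails tables described here.

```python
import sqlite3

# Hypothetical 1NF rows: (OrderID, OrderItemID, CustomerID, OrderDate,
#                         ProductID, ProductDescription, Quantity)
first_nf = [
    (1, 1, 7, "2024-03-05", 501, "Hammer", 2),
    (1, 2, 7, "2024-03-05", 502, "Nails", 100),
    (2, 1, 9, "2024-03-18", 501, "Hammer", 1),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL,
        OrderDate  TEXT NOT NULL
    );
    CREATE TABLE OrderDetails (
        OrderID            INTEGER NOT NULL REFERENCES Orders(OrderID),
        OrderItemID        INTEGER NOT NULL,
        ProductID          INTEGER NOT NULL,
        ProductDescription TEXT NOT NULL,
        Quantity           INTEGER NOT NULL,
        PRIMARY KEY (OrderID, OrderItemID)
    );
""")

# Fields that depend on OrderID alone go to Orders, one row per order.
orders = sorted({(o, c, d) for o, _, c, d, _, _, _ in first_nf})
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", orders)

# Fields that depend on the whole key stay in OrderDetails.
conn.executemany(
    "INSERT INTO OrderDetails VALUES (?, ?, ?, ?, ?)",
    [(o, i, p, desc, q) for o, i, _, _, p, desc, q in first_nf],
)

# Put the tables back together through the foreign key.
rejoined = conn.execute("""
    SELECT d.OrderID, d.OrderItemID, o.CustomerID, o.OrderDate,
           d.ProductID, d.ProductDescription, d.Quantity
    FROM OrderDetails AS d
    JOIN Orders AS o ON o.OrderID = d.OrderID
    ORDER BY d.OrderID, d.OrderItemID
""").fetchall()

print(rejoined == first_nf)
```

The join reproduces the original rows exactly, while CustomerID and OrderDate are now stored once per order rather than once per item.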
Step 3: A table is said to be in 3NF if it is in 2NF and all non-key fields are mutually independent. Both the Orders table and the OrderDetails table are in 2NF. The Orders table is in 3NF. However, the OrderDetails table is not in 3NF because it contains a dependency between two of its non-key columns, ProductID and ProductDescription. To achieve 3NF in the OrderDetails table, we can take out ProductID and ProductDescription and put them in a separate Products table. The primary key of the Products table becomes ProductID. The OrderDetails table keeps the ProductID field as a foreign key to the Products table.
The Orders table is in 3NF. The OrderDetails table contains a transitive dependency (ProductID determines ProductDescription).
The two new tables, OrderDetails and Products, are now both in 3NF. So the final tables in 3NF are Orders, OrderDetails and Products, linked by foreign keys.
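The move to 3NF for the OrderDetails table can be sketched in Python. The sample rows and the Quantity field are assumptions; ProductID and ProductDescription follow the field names in this example.

```python
# Hypothetical OrderDetails rows in 2NF.
order_details = [
    {"OrderID": 1, "OrderItemID": 1, "ProductID": 501,
     "ProductDescription": "Hammer", "Quantity": 2},
    {"OrderID": 1, "OrderItemID": 2, "ProductID": 502,
     "ProductDescription": "Nails", "Quantity": 100},
    {"OrderID": 2, "OrderItemID": 1, "ProductID": 501,
     "ProductDescription": "Hammer", "Quantity": 1},
]

# Products table: one description per ProductID.
products = {}
for row in order_details:
    # setdefault keeps the first description seen for each ProductID;
    # a different one for the same ID would mean the data is inconsistent.
    known = products.setdefault(row["ProductID"], row["ProductDescription"])
    if known != row["ProductDescription"]:
        raise ValueError(f"conflicting descriptions for product {row['ProductID']}")

# OrderDetails in 3NF keeps ProductID only, as a foreign key to Products.
details_3nf = [
    {k: v for k, v in row.items() if k != "ProductDescription"}
    for row in order_details
]

# A description now changes in exactly one place...
products[501] = "Claw hammer"
# ...and every order line sees the new value through the foreign key.
descriptions = [products[row["ProductID"]] for row in details_3nf]
print(descriptions)
```

Because the description is stored once per product, the update anomaly described earlier can no longer arise for product descriptions.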
Normalisation Example 3
STEP 1: The SubjectCode and SubjectName fields are an example of a repeating group. The table should be split into 2 tables to eliminate this repeating group. The StudentID field also needs to be included in the STUDENT-SUBJECT table to provide a link to the STUDENT-DEGREE table. This field is known as a foreign key. In the STUDENT-SUBJECT table the obvious choice for the new primary key is the combination of StudentID and SubjectCode.
STEP 2 (remove partial dependencies): Notice that the subject name is dependent on the subject code, but not on the student ID number. In other words, the SubjectName field is only partially dependent on the primary key and hence needs to be moved to a table of its own. The resultant tables are:
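The tables so far: STUDENT-DEGREE and STUDENT-SUBJECT (StudentID is a foreign key in STUDENT-SUBJECT), along with a table of subject codes and names (SubjectCode is a foreign key in STUDENT-SUBJECT).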
STEP 3: There are no non-key dependencies in any of the tables, so the database is now in 3NF.
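The whole of Example 3 can be sketched in Python, starting from records that hold the repeating group as a nested list. The student names, degrees, subject codes and the name of the third table are assumptions for illustration; StudentID, SubjectCode and SubjectName follow the field names in this example.

```python
# Hypothetical unnormalised records: each student carries a repeating
# group of (SubjectCode, SubjectName) pairs.
students = [
    {"StudentID": 1001, "StudentName": "Student A", "Degree": "BSc",
     "Subjects": [("IS101", "Databases"), ("MA101", "Calculus")]},
    {"StudentID": 1002, "StudentName": "Student B", "Degree": "BA",
     "Subjects": [("IS101", "Databases")]},
]

# STEP 1: split off the repeating group, carrying StudentID as a foreign key.
student_degree = [
    {k: s[k] for k in ("StudentID", "StudentName", "Degree")} for s in students
]
student_subject_1nf = [
    (s["StudentID"], code, name) for s in students for code, name in s["Subjects"]
]

# STEP 2: SubjectName depends on SubjectCode only, so it moves to its own table.
subjects = {code: name for _, code, name in student_subject_1nf}
student_subject = [(sid, code) for sid, code, _ in student_subject_1nf]

print(student_subject)
print(subjects)
```

Each subject name is now stored once, and the STUDENT-SUBJECT table holds nothing but its composite key.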
Further Normalisation: In practice normalisation usually stops at 3NF. However, note that there are three further normal forms: Boyce-Codd normal form, fourth normal form and fifth normal form.
Consequences of Normalisation: Advantages: Normalisation solves a number of problems relating to the structuring of data, namely it avoids:
Update anomalies (updating multiple records);
Insertion anomalies (incomplete primary key);
Deletion anomalies (deletion of unnecessary information causes necessary information to be deleted).
Other benefits concern efficiency, consistency and size: less space is needed, retrieval can improve, redundancy is removed and tables are smaller.
Consequences of Normalisation: Disadvantages: Normalisation also creates two new problems. First, decomposition of data structures into smaller structures of higher normal form results in duplication of data item types: the decomposition process requires that an appropriate part of the primary key in the original relation (table) be included as a foreign key in the new relation(s) (tables) formed.
Consequences of Normalisation: Second, the increase in data structures inherent in the normalisation process can adversely affect the retrieval efficiency of the database. Normalisation by decomposition will reduce the overall space required to store data, but increase the time it takes to retrieve information, because numerous relations (tables) need to be rejoined in order to extract that information.
Lecture exercise: convert the following table to third normal form. (Manager is the manager of the project.)
STEP 1: Eliminate repeating groups. STEP 2: The database must first be in 1NF. Remove all partial dependencies. STEP 3: The database must first be in 2NF. Remove all non-key dependencies. (File: Norm Eg.xls)
For Homework: Go over this week's lecture material and make sure that you thoroughly understand the concepts involved in the process of normalisation. It will be on the exam in some form. Complete the lecture exercise for next Tuesday. Next week: Access Tutorial 8; advanced queries; indexes; joins; SQL.