Digital recordkeeping and preservation I: The relational model
ARK2100 Digital recordkeeping and preservation I, 2017
Thomas Sødring, thomas.sodring@hioa.no, P48-R407, 67238287
The relational model
So far we have looked at databases (DBMS) and discussed their properties.
Now we take a closer look at the relational model and its use.
Then we will do some practical work with a DBMS called MySQL.
Later, we look at the relationship between the Noark 5 standard and the standard's implementation in a relational database.
Learning
What do the following concepts mean?
The relational model: schema, relationship, tuple, attribute, primary key, foreign key
Normalization: anomalies, why and how
Referential integrity: what it is and what kinds of anomalies may arise because of it
The relational model
A database is called a schema.
Data is stored in relations* (tables).
Access to data is (usually) with keys.
Two central key types are used in relations: the primary key and the foreign key.
*Formally, duplicate rows are not allowed in a relation, while they are allowed in a table.
DBMS
[Figure: a DBMS containing two schemas, Schema A and Schema B, each holding a set of relations (r1, r2, r3, ...)]
A database management system can contain many schemas. People often call a schema a database and use the terms interchangeably.
Data is stored in a relation (table)
Cars
RegistrationNr  ChassisNr  Colour  Manufacturer  Model
LH12984         10946534   Red     Volkswagen    Golf
DK23491         9648573    Blue    Toyota        Yaris
BP12349         5523840    Green   Skoda         Fabia
ZT97495         2643923    White   Seat          Leon
Relation, Attributes, Tuples
Cars
RegistrationNr  ChassisNr  Colour  Manufacturer  Model
LH12984         10946534   Red     Volkswagen    Golf
DK23491         9648573    Blue    Toyota        Yaris
BP12349         5523840    Green   Skoda         Fabia
ZT97495         2643923    White   Seat          Leon
The relation has 5 attributes (the columns) and 4 tuples (the rows).
Roughly speaking ...
A relation is a table*
An attribute is a column
A tuple is a row
*Again, a table can have duplicate rows; a relation cannot.
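The terminology can be tried out in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module rather than MySQL (an assumption made only so the example runs without a server); the table and column names are taken from the Cars slide. It also illustrates the footnote: a table with no declared key accepts a duplicate row, while a relation, being a set, does not contain duplicates.

```python
import sqlite3

# In-memory SQLite database; names and values are taken from the Cars slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cars ("
    "RegistrationNr TEXT, ChassisNr TEXT, Colour TEXT, Manufacturer TEXT, Model TEXT)"
)
cars = [
    ("LH12984", "10946534", "Red", "Volkswagen", "Golf"),
    ("DK23491", "9648573", "Blue", "Toyota", "Yaris"),
    ("BP12349", "5523840", "Green", "Skoda", "Fabia"),
    ("ZT97495", "2643923", "White", "Seat", "Leon"),
]
conn.executemany("INSERT INTO Cars VALUES (?, ?, ?, ?, ?)", cars)

cursor = conn.execute("SELECT * FROM Cars")
attributes = [column[0] for column in cursor.description]  # column names
tuples = cursor.fetchall()                                 # rows

# No key has been declared, so the table accepts an exact copy of a row.
conn.execute("INSERT INTO Cars VALUES (?, ?, ?, ?, ?)", cars[0])
table_rows = conn.execute("SELECT * FROM Cars").fetchall()
relation = set(table_rows)  # a relation is a set, so the duplicate collapses

print(len(attributes), "attributes,", len(tuples), "tuples")
print(len(table_rows), "rows in the table,", len(relation), "tuples in the relation")
```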
Primary Key
A primary key is a value that uniquely identifies a row (record) in a relation.
The primary key identifies a unique object (row) within a set of objects (rows).
A social security number identifies a person.
An ISBN number identifies a book.
A registration number identifies a car.
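As a rough illustration of what "uniquely identifies" means in practice, the sketch below declares RegistrationNr as the primary key of a reduced Cars table (two columns only, an assumption to keep it short) in SQLite via Python's sqlite3 module. A second row with the same registration number is rejected, and a lookup by key returns exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# RegistrationNr is declared as the primary key, so its values must be unique.
conn.execute("CREATE TABLE Cars (RegistrationNr TEXT PRIMARY KEY, Colour TEXT)")
conn.execute("INSERT INTO Cars VALUES ('LH12984', 'Red')")

try:
    # Same registration number again: the DBMS refuses the row.
    conn.execute("INSERT INTO Cars VALUES ('LH12984', 'Blue')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A lookup by primary key can match at most one row.
row = conn.execute(
    "SELECT Colour FROM Cars WHERE RegistrationNr = ?", ("LH12984",)
).fetchone()
print(duplicate_rejected, row)
```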
Foreign key
A foreign key is a field (attribute) in a table in a relational database that points to a field (attribute) in (usually) another table.
This last field is often the table's primary key, but it does not have to be.
This allows us to connect related information between tables.
The table using the foreign key is often called the child table, while the table in which the value of the key is defined is called the parent table.
Primary and Foreign Keys
Student (parent)
StudentNr  Firstname  Surname
12345      Jan        Karlson
23456      Pål        Solberg
34567      Mette      Johansen
45678      Ingrid     Aleksandersen
StudentTelephoneNr (child)
Attributes: StudentNr, TelephoneNr
TelephoneNr values: 76543829, 90783298, 99456543, 45990234
StudentNr is a primary key in both relations.
StudentNr in the StudentTelephoneNr relation is a foreign key to StudentNr in the Student relation.
Primary and Foreign Keys
Customers (parent)
CustomerNr  Firstname  Surname
1           Pål        Solberg
2           Nils       Nilsen
3           Ari        Hansen
Account (parent)
AccountNr  Balance
06584585   2,000
15486110   8,000
06759425   -3,000
95486110
AccountOwner (child)
Attributes: CustomerNr, AccountNr
CustomerNr values: 1, 2, 3; AccountNr values: 15486110, 06584585, 95486110, 06759425
AccountNr is the primary key in the Account relation.
CustomerNr is the primary key in the Customers relation.
CustomerNr and AccountNr together form the primary key in the AccountOwner relation.
AccountNr in AccountOwner is a foreign key to the Account relation.
CustomerNr in AccountOwner is a foreign key to the Customers relation.
Let's recap
A database is called a schema.
Data is stored in relations* (tables).
Access to data is (usually) with keys.
Two central key types are used in relations: the primary key and the foreign key.
A tuple is a row of information.
An attribute is a column.
*Formally, duplicate rows are not allowed in a relation, while they are allowed in a table.
Another example
Telenor is a provider of mobile telephony, internet, television (Canal Digital), landline and IP telephony.
Try to describe a structure that minimizes duplicated data, showing primary and foreign keys.
Telenor Group - Customers
CustomerNr  Surname   Firstname
1           Hansen    Thomas
2           Lie       Mona
3           Rørvik    Eli
4           Andersen  Børre
Mobil
CustomerNr  Number
1           45764389
2           95794873
3           91265238
Landline
CustomerNr  Subscription
2           1234567
3           2345678
4           3456789
CanalDigital
CustomerNr  Subscription
1           1234567
3           2345678
4           3456789
MobilephoneConversations
MobilFrom  ToNumber  Time            Length
45764389   93473422  1.1.2017 13.45  45
95794873   32793455  1.1.2017 13.49  32
91265238   22109344  1.1.2017 13.52  500
CanalDigitalSubscription
Subscription  Type
1234567       Package 1
2345678       Basic Pkg
3456789       Sport Pkg
Relations
Telenor Group - Customers
Mobilephone
Landline
CanalDigital
MobilephoneConversations
CanalDigitalSubscription
Attributes
Telenor Group - Customers: CustomerNr, Surname, Firstname
Mobil: CustomerNr, Number
Landline: CustomerNr, Subscription
CanalDigital: CustomerNr, Subscription
MobilephoneConversations: MobilFrom, ToNumber, Time, Length
CanalDigitalSubscription: Subscription, Type
Tuples
[Figure: the Telenor tables with each row (tuple) highlighted, e.g. (1, Hansen, Thomas) in Customers and (1, 45764389) in Mobil]
Primary Keys
[Figure: the Telenor tables with the primary key columns highlighted]
Telenor Group - Customers: CustomerNr
Mobil, Landline, CanalDigital: CustomerNr
MobilephoneConversations: MobilFrom, ToNumber, Time
CanalDigitalSubscription: Subscription
Foreign keys
[Figure: the Telenor relations, shown first without and then with data, with the foreign key attributes highlighted]
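To see a foreign key doing its job, the sketch below loads the Customers and Mobil data from the Telenor example into SQLite (Python's sqlite3 module, used here instead of MySQL as an assumption for a self-contained example). The table name Customers and the declaration of Mobil.CustomerNr as a foreign key to Customers.CustomerNr are this sketch's reading of the slides. The join follows the key from child to parent to pair each mobile number with a customer name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only checks foreign keys when asked to
conn.execute(
    "CREATE TABLE Customers (CustomerNr INTEGER PRIMARY KEY, Surname TEXT, Firstname TEXT)"
)
# Child table: CustomerNr is both its primary key and a foreign key to the parent.
conn.execute(
    "CREATE TABLE Mobil ("
    "CustomerNr INTEGER PRIMARY KEY REFERENCES Customers(CustomerNr), Number TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Hansen", "Thomas"), (2, "Lie", "Mona"), (3, "Rørvik", "Eli"), (4, "Andersen", "Børre")],
)
conn.executemany(
    "INSERT INTO Mobil VALUES (?, ?)",
    [(1, "45764389"), (2, "95794873"), (3, "91265238")],
)

# Follow the foreign key from Mobil (child) to Customers (parent).
mobile_owners = conn.execute(
    "SELECT c.Firstname, c.Surname, m.Number "
    "FROM Mobil m JOIN Customers c ON c.CustomerNr = m.CustomerNr "
    "ORDER BY c.CustomerNr"
).fetchall()
for owner in mobile_owners:
    print(owner)
```

Customer 4 has no row in Mobil, so that customer does not appear in the result; the name is stored once in Customers regardless of how many services the customer has.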
Is it as easy as tables?
Can we just store data in tables, or are there other things we have to take into account?
Redundancy and anomalies: insertion, updating, deletion
We are going to work on a fictional scenario.
You have a small rental company where you rent out 3-4 cars and record everything in Excel (a flat file).
We use this example to explore issues relating to data modelling.
Redundancy
Redundancy means that your data repeats itself, which makes your database unnecessarily large.
It can also result in errors in your data.
Insertion anomaly
Every time a customer rents a car, all customer data and vehicle data are reinserted.
If car information is required, we cannot insert data about a customer unless they rent a car.
If customer information is required, we cannot insert data about a car unless it is rented.
Update anomaly
If the colour of a car is entered incorrectly and subsequently has to be updated, you must first find all relevant occurrences.
If not all instances are found and changed, our data will be inconsistent.
[Figure: original data and updated data]
Deletion anomaly
If a customer rents a new car and subsequently cancels the rental, all information about the car disappears.
[Figure: original data and data after deletion]
More about the scenario
It is unlikely that this is a problem for a single person keeping track of a few cars. You could probably work with data in this format.
But if your company grows and you have 30 cars and hundreds of customers,
and you have an agreement with another company that you will help each other,
and you have to employ people,
this data model will quickly result in problems.
Normalisation
A method used to verify whether you have a good database model.
Why do we normalise?
To prevent data anomalies from occurring.
To minimise duplication of data.
During data updates, the system must be consistent and data integrity must be ensured.
When do we normalise?
Early in the database design process.
Normal forms
[Figure: the normal forms 1NF, 2NF and 3NF as steps towards good design]
First Normal form
Atomic values means that each field of a row can contain only one value.
A table is in first normal form if and only if all columns contain atomic values.
1NF
We have to analyse each field in the database and identify whether or not its values are atomic.
To convert the data to 1NF, we have to make the non-atomic fields atomic.
This may result in duplicated rows or the introduction of new columns.
1NF
Sometimes, 1NF will require the creation of new columns. You see this with Name -> fname, sname.
Other times, the change to 1NF will require a duplication of data. You see this with License.
Second normal form (2NF) A table is in the second normal form (2NF) if and only if it is in 1NF and all columns that are not part of the primary key are dependent on the entire primary key, and not just part of it
Second normal form (2NF)
Violation of 2NF: a table with columns A, B, C, D and E, where A and B together form the primary key and E is dependent only on B.
2NF
To get the table to 2NF, we first have to identify primary keys.
Remember that a primary key is a unique key that identifies a unique row in a relation.
A primary key can be made up of one or more columns.
We are trying to find out whether multiple primary keys are present in the table and which data is associated with each of them.
Next we have to see which columns are dependent on the primary key(s).
Identify Primary Keys
Identify dependencies rental customer car telephonenr
Solution 2NF
The solution is to break the table up into four tables: Customer, Car, Rental and TelephoneNr.
Third normal form (3NF) A table is in third normal form if and only if it is in the second normal form and all columns that are not part of the primary key, are mutually independent
Third normal form (3NF)
Violation of 3NF: a table with columns A, B, C, D and E, where A and B together form the primary key and there is a dependency between C and E.
Are all columns mutually independent? Customer Rental Car TelephoneNr
Solution 3NF Customer Rental Car ? TelephoneNr Postnr
Something to think about
Normalisation is both an art and a science.
You can systematically go through the steps and arrive at a decent solution, but intuition and experience will play a big part in solving the problem.
You will very rarely have to work like this. In this scenario the modelling job was so bad that the system was useless.
But you have to understand what problem normalisation solves to understand its importance.
Another example
Gerd Berget's book has another example of anomalies and normalisation.
The scenario this time is a film rental company that records all information in a flat file / Excel spreadsheet.
Again, the point is to understand normalisation by looking at a scenario that is badly modelled.
Simplified film spreadsheet
Cust ID  Surname   Firstname  Address          Postnr  Town   Film ID  Title                 Year        Length    Company
1        Lie       Mona       Storgata 4       0182    Oslo   1,2      Citizen Kane, Psycho  1941, 1960  115, 104  Universal Pictures, Universal Pictures
2        Hansen    Thomas     Bakken 8b        1406    Ski    3        The Godfather         1972        175       Paramount
3        Rørvik    Eli        Saturnringen 47  1808    Askim  2        Psycho                1960        104       Universal Pictures
4        Andersen  Børre      Bekkefaret 5     0348    Oslo            Psycho                1998        109       Universal Pictures
Redundancy and anomalies
Redundancy: data in the database repeats itself, which makes the database unnecessarily large and can also introduce errors in the data.
Insertion anomaly: each time a new customer rents a film, all the data about the customer and the film has to be reinserted.
Update anomaly: if a film has the wrong date and needs updating, we first have to find all relevant instances. If not all instances are found, we will have inconsistent data.
Deletion anomaly: if the first customer that rents a film subsequently cancels the rental, we lose the information about the film.
Redundancy example
[Table: the simplified film spreadsheet]
Reinserting information about films means that the information is duplicated, resulting in a database that is larger than it has to be and increasing the potential for data errors.
Insertion anomaly
[Table: the simplified film spreadsheet]
When customer 3 (Eli Rørvik) wanted to rent a film ('Psycho'), we had to reinsert all data about the film again: title, year, length, company.
Update anomaly
[Table: the simplified film spreadsheet]
If we find out that the wrong data about the movie 'Psycho' was registered, e.g. that the length was 108 min and not 104 min, then we need to find all the rows that contain 'Psycho' and update them. (Simply searching for the title will not work.)
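The remark that searching for the title will not work can be checked with a small sketch in plain Python. It assumes the spreadsheet has already been split into one row per rental and keeps only a few of the columns; the helper with_length is hypothetical and exists only for this illustration. It compares three ways of correcting the length of the 1960 'Psycho' to 108 minutes.

```python
# One row per rental, reduced to the columns needed here (an assumption of this sketch).
rentals = [
    {"CustID": 1, "FilmID": 1, "Title": "Citizen Kane", "Year": 1941, "Length": 115},
    {"CustID": 1, "FilmID": 2, "Title": "Psycho", "Year": 1960, "Length": 104},
    {"CustID": 2, "FilmID": 3, "Title": "The Godfather", "Year": 1972, "Length": 175},
    {"CustID": 3, "FilmID": 2, "Title": "Psycho", "Year": 1960, "Length": 104},
    {"CustID": 4, "FilmID": 4, "Title": "Psycho", "Year": 1998, "Length": 109},
]


def with_length(rows, matches, new_length):
    """Return copies of rows, setting Length on every row where matches(row) is true."""
    return [dict(row, Length=new_length) if matches(row) else dict(row) for row in rows]


# 1. Search by title only: the 1998 film is changed as well, which is wrong.
by_title = with_length(rentals, lambda r: r["Title"] == "Psycho", 108)
length_of_1998_film = [r["Length"] for r in by_title if r["Year"] == 1998]

# 2. Miss one occurrence: the same film now has two different lengths.
partial = with_length(rentals, lambda r: r["FilmID"] == 2 and r["CustID"] == 1, 108)
lengths_of_film_2 = {r["Length"] for r in partial if r["FilmID"] == 2}

# 3. Search by Film ID and update every occurrence: consistent, but easy to get wrong.
by_film_id = with_length(rentals, lambda r: r["FilmID"] == 2, 108)
all_lengths = [r["Length"] for r in by_film_id]

print(length_of_1998_film, lengths_of_film_2, all_lengths)
```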
Deletion anomaly
[Table: the simplified film spreadsheet]
If Customer 1 (Mona Lie) and Customer 3 (Eli Rørvik) cancel their film rentals, we will lose all data about the movie 'Psycho' from 1960.
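A short Python sketch of the same situation, under the assumption that the spreadsheet is held as one row per rental with only four of its columns. Removing the rentals of customers 1 and 3 from the flat data also removes the only record of the 1960 'Psycho'; keeping films in their own structure, separate from bookings, avoids that.

```python
# Flat data: (CustID, FilmID, Title, Year), one row per rental.
rentals = [
    (1, 1, "Citizen Kane", 1941),
    (1, 2, "Psycho", 1960),
    (2, 3, "The Godfather", 1972),
    (3, 2, "Psycho", 1960),
    (4, 4, "Psycho", 1998),
]
cancelling = {1, 3}  # customers 1 and 3 cancel their rentals

# Flat file: deleting the rental rows deletes the film data with them.
remaining = [row for row in rentals if row[0] not in cancelling]
films_still_known = {(film_id, title, year) for _, film_id, title, year in remaining}
psycho_1960_known = (2, "Psycho", 1960) in films_still_known

# Separate structures: films are stored once, bookings only refer to them.
films = {film_id: (title, year) for _, film_id, title, year in rentals}
bookings = {(cust_id, film_id) for cust_id, film_id, _, _ in rentals}
bookings_after = {booking for booking in bookings if booking[0] not in cancelling}
psycho_1960_kept = films.get(2) == ("Psycho", 1960)

print(psycho_1960_known, psycho_1960_kept, sorted(bookings_after))
```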
1NF
A table is in first normal form if and only if all columns contain atomic values.
[Table: the simplified film spreadsheet, in which customer 1 has several values in each of the Film ID, Title, Year, Length and Company fields]
Solution 1NF
Cust ID  Firstname  Surname   Address          Postnr  Town   Film ID  Title          Year  Length  Company
1        Mona       Lie       Storgata 4       0182    Oslo   1        Citizen Kane   1941  115     Universal Pictures
1        Mona       Lie       Storgata 4       0182    Oslo   2        Psycho         1960  104     Universal Pictures
2        Thomas     Hansen    Bakken 8b        1406    Ski    3        The Godfather  1972  175     Paramount
3        Eli        Rørvik    Saturnringen 47  1808    Askim  2        Psycho         1960  104     Universal Pictures
4        Børre      Andersen  Bekkefaret 5     0348    Oslo   4        Psycho         1998  109     Universal Pictures
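The step from the spreadsheet row for customer 1 to the two 1NF rows can be sketched in Python. The function name to_first_normal_form and the choice to hold all values as strings are assumptions of this illustration; it also assumes that a comma never occurs inside a single value, which holds for this data but not for film titles in general.

```python
def to_first_normal_form(row, multi_valued):
    """Split comma-separated fields so that every field holds a single value.

    Returns one new row per value; single-valued fields are copied to each row.
    """
    split = {name: [v.strip() for v in row[name].split(",")] for name in multi_valued}
    counts = {len(values) for values in split.values()}
    if len(counts) != 1:
        raise ValueError("multi-valued fields do not have the same number of values")
    result = []
    for i in range(counts.pop()):
        new_row = dict(row)
        for name in multi_valued:
            new_row[name] = split[name][i]
        result.append(new_row)
    return result


# Customer 1 as recorded in the spreadsheet (all values kept as text).
mona = {
    "CustID": "1", "Firstname": "Mona", "Surname": "Lie",
    "FilmID": "1,2",
    "Title": "Citizen Kane, Psycho",
    "Year": "1941, 1960",
    "Length": "115, 104",
    "Company": "Universal Pictures, Universal Pictures",
}
rows = to_first_normal_form(mona, ["FilmID", "Title", "Year", "Length", "Company"])
for r in rows:
    print(r["CustID"], r["Surname"], r["FilmID"], r["Title"], r["Year"], r["Length"])
```

The customer data is now repeated on both rows, which is exactly the duplication that 2NF then removes.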
Identify primary keys
Cust ID  Firstname  Surname   Address          Postnr  Town   Film ID  Title          Year  Length  Company
1        Mona       Lie       Storgata 4       0182    Oslo   1        Citizen Kane   1941  115     Universal Pictures
1        Mona       Lie       Storgata 4       0182    Oslo   2        Psycho         1960  104     Universal Pictures
2        Thomas     Hansen    Bakken 8b        1406    Ski    3        The Godfather  1972  175     Paramount
3        Eli        Rørvik    Saturnringen 47  1808    Askim  2        Psycho         1960  104     Universal Pictures
4        Børre      Andersen  Bekkefaret 5     0348    Oslo   4        Psycho         1998  109     Universal Pictures
FilmID and CustID stand out as the best primary keys. Why?
First we have to identify dependencies between columns
[Table: the 1NF film table, with the groups of columns that depend on the same key marked, to be separated out into their own relations]
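Identifying dependencies can be supported by a small check against the data. The following Python sketch uses a hypothetical helper, determines, that tests whether equal values in one set of columns always go together with equal values in another; the rows are a reduced copy of the 1NF film table. Sample data can only disprove a dependency, never prove it, so a True result is a hint and not a guarantee.

```python
def determines(rows, determinant, dependent):
    """True if rows that agree on the determinant columns also agree on the dependent ones."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # Remember the first value seen for this key; any later mismatch breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


rows = [
    {"CustID": 1, "Surname": "Lie", "FilmID": 1, "Title": "Citizen Kane", "Year": 1941},
    {"CustID": 1, "Surname": "Lie", "FilmID": 2, "Title": "Psycho", "Year": 1960},
    {"CustID": 2, "Surname": "Hansen", "FilmID": 3, "Title": "The Godfather", "Year": 1972},
    {"CustID": 3, "Surname": "Rørvik", "FilmID": 2, "Title": "Psycho", "Year": 1960},
    {"CustID": 4, "Surname": "Andersen", "FilmID": 4, "Title": "Psycho", "Year": 1998},
]

film_depends_on_film_id = determines(rows, ["FilmID"], ["Title", "Year"])
customer_depends_on_cust_id = determines(rows, ["CustID"], ["Surname"])
title_depends_on_cust_id = determines(rows, ["CustID"], ["Title"])
year_depends_on_title = determines(rows, ["Title"], ["Year"])

print(film_depends_on_film_id, customer_depends_on_cust_id)
print(title_depends_on_cust_id, year_depends_on_title)
```

Film data depends only on FilmID and customer data only on CustID, that is, on just part of the combined key, which is what 2NF forbids.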
Solution 2NF
The solution is to separate the table into three different relations: Customer, Film and Booking.
Customer
Cust ID  Firstname  Surname   Address          Postnr  Town
1        Mona       Lie       Storgata 4       0182    Oslo
2        Thomas     Hansen    Bakken 8b        1406    Ski
3        Eli        Rørvik    Saturnringen 47  1808    Askim
4        Børre      Andersen  Bekkefaret 5     0348    Oslo
Film
Film ID  Title          Year  Length  Company
1        Citizen Kane   1941  115     Universal Pictures
2        Psycho         1960  104     Universal Pictures
3        The Godfather  1972  175     Paramount
4        Psycho         1998  109     Universal Pictures
Booking
CustID  Film ID
1       1
1       2
2       3
3       2
4       4
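The three relations can be written down as a schema. This is a sketch in SQLite through Python's sqlite3 module (an assumption; the course itself uses MySQL, where the syntax differs in details), with column names written without spaces and the column types chosen for this illustration. Booking uses the combination of CustID and FilmID as its primary key, and each half is a foreign key to its parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Customer (CustID INTEGER PRIMARY KEY, Firstname TEXT, Surname TEXT,
                       Address TEXT, Postnr TEXT, Town TEXT);
CREATE TABLE Film (FilmID INTEGER PRIMARY KEY, Title TEXT, Year INTEGER,
                   Length INTEGER, Company TEXT);
CREATE TABLE Booking (
    CustID INTEGER REFERENCES Customer(CustID),
    FilmID INTEGER REFERENCES Film(FilmID),
    PRIMARY KEY (CustID, FilmID)  -- composite primary key
);
""")
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Mona", "Lie", "Storgata 4", "0182", "Oslo"),
    (2, "Thomas", "Hansen", "Bakken 8b", "1406", "Ski"),
    (3, "Eli", "Rørvik", "Saturnringen 47", "1808", "Askim"),
    (4, "Børre", "Andersen", "Bekkefaret 5", "0348", "Oslo"),
])
conn.executemany("INSERT INTO Film VALUES (?, ?, ?, ?, ?)", [
    (1, "Citizen Kane", 1941, 115, "Universal Pictures"),
    (2, "Psycho", 1960, 104, "Universal Pictures"),
    (3, "The Godfather", 1972, 175, "Paramount"),
    (4, "Psycho", 1998, 109, "Universal Pictures"),
])
conn.executemany("INSERT INTO Booking VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3), (3, 2), (4, 4)])

# The 1960 'Psycho' is now stored exactly once ...
psycho_1960_rows = conn.execute(
    "SELECT COUNT(*) FROM Film WHERE Title = 'Psycho' AND Year = 1960"
).fetchone()[0]

# ... and the original flat view can still be rebuilt with joins.
flat_view = conn.execute(
    "SELECT c.Firstname, f.Title, f.Year FROM Booking b "
    "JOIN Customer c ON c.CustID = b.CustID "
    "JOIN Film f ON f.FilmID = b.FilmID "
    "ORDER BY b.CustID, b.FilmID"
).fetchall()
print(psycho_1960_rows, flat_view)
```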
Third normal form (3NF)
A table is in third normal form if and only if it is in second normal form and all columns that are not part of the primary key are mutually independent.
Violation of 3NF: a table with columns A, B, C, D and E, where A and B together form the primary key and there is a dependency between C and E.
Is Film in 3NF?
Film ID  Title          Year  Length  Company
1        Citizen Kane   1941  115     Universal Pictures
2        Psycho         1960  104     Universal Pictures
3        The Godfather  1972  175     Paramount
4        Psycho         1998  109     Paramount
Are there any dependencies between two columns where one of them is not part of the primary key?
Is Customer in 3NF?
Customer
Cust ID  Firstname  Surname   Address          Postnr  Town
1        Mona       Lie       Storgata 4       0182    Oslo
2        Thomas     Hansen    Bakken 8b        1406    Ski
3        Eli        Rørvik    Saturnringen 47  1808    Askim
4        Børre      Andersen  Bekkefaret 5     0348    Oslo
Are there any dependencies between two columns where one of them is not part of the primary key?
Solution 3NF
The solution is to separate post number and town into their own relation.
Customer
Cust ID  Firstname  Surname   Address          Postnr
1        Mona       Lie       Storgata 4       0182
2        Thomas     Hansen    Bakken 8b        1406
3        Eli        Rørvik    Saturnringen 47  1808
4        Børre      Andersen  Bekkefaret 5     0348
Zip
Postnr  Town
0182    Oslo
1406    Ski
1808    Askim
0348    Oslo
Film
Film ID  Title          Year  Length  Company
1        Citizen Kane   1941  115     Universal Pictures
2        Psycho         1960  104     Universal Pictures
3        The Godfather  1972  175     Paramount
4        Psycho         1998  109     Universal Pictures
Booking
CustID  Film ID
1       1
1       2
2       3
3       2
4       4
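The Customer and Zip relations can be sketched as follows, again in SQLite via Python's sqlite3 module as an assumption for a runnable example. Postnr is stored as text in this sketch so that leading zeros such as in 0182 are preserved. The town is no longer stored with the customer; it is looked up through the Postnr foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Zip (Postnr TEXT PRIMARY KEY, Town TEXT);
CREATE TABLE Customer (
    CustID INTEGER PRIMARY KEY, Firstname TEXT, Surname TEXT, Address TEXT,
    Postnr TEXT REFERENCES Zip(Postnr)  -- the town is found via this foreign key
);
""")
conn.executemany("INSERT INTO Zip VALUES (?, ?)", [
    ("0182", "Oslo"), ("1406", "Ski"), ("1808", "Askim"), ("0348", "Oslo"),
])
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", [
    (1, "Mona", "Lie", "Storgata 4", "0182"),
    (2, "Thomas", "Hansen", "Bakken 8b", "1406"),
    (3, "Eli", "Rørvik", "Saturnringen 47", "1808"),
    (4, "Børre", "Andersen", "Bekkefaret 5", "0348"),
])

# Town is derived from Postnr with a join instead of being repeated per customer.
customer_towns = conn.execute(
    "SELECT c.Firstname, z.Town FROM Customer c "
    "JOIN Zip z ON z.Postnr = c.Postnr ORDER BY c.CustID"
).fetchall()
print(customer_towns)
```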
Learning
What do the following concepts mean?
The relational model: schema, relationship, tuple, attribute, primary key, foreign key
Normalization*: anomalies, why and how
Referential integrity: what it is and what kinds of anomalies may arise because of it
*http://en.wikibooks.org/wiki/Relational_Database_Design/Normalization
Referential Integrity
So far we have only looked at intra-relation issues. There can also be inter-relation issues that we have to concern ourselves with.
Referential integrity is an important inter-relation concept.
[Figure: a parent relation with two child relations, Child (1) and Child (2)]
Referential integrity - insertion
What happens if I try to insert a rental with no corresponding customer?
The car would be blocked for the rental period and I could potentially lose money.
[Figure: Customer and Rental relations]
Referential integrity - deletion
What happens if I try to delete a customer, e.g. customer number 2?
The rental table will then have a foreign key that references a missing customer.
What if the renter damaged the car and I later need to find out who rented it?
[Figure: Customer and Rental relations]
Referential integrity - update
What happens if I try to change a customer's primary key, e.g. from 2 to 9?
We suddenly lose the entire rental history of that customer.
[Figure: Customer and Rental relations]
Handling referential integrity
Referential integrity is a database mechanism that can be switched on or off. The consequences of this are really important.
Referential integrity goes from a parent to a child table.
With referential integrity on the Customer (parent) to Rental (child) relationship:
You cannot delete a customer without deleting all of that customer's rentals.
You cannot add a rental if the customer does not exist.
Handling referential integrity
When referential integrity is enabled, certain rules apply.
You cannot add a row to a child table if there is no corresponding row in the parent table (FK). Example: you cannot add a rental without a corresponding car or customer.
You cannot delete a row from a parent table if a corresponding row exists in a child table. Example: you cannot delete a customer that has rentals.
You cannot change the value of the primary key in a parent table if there is a related row in a child table. Example: you cannot change the customer number in the Customer relation if the customer has rented a car.
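The three rules, and the effect of switching the mechanism on and off, can be observed directly. The sketch below uses SQLite through Python's sqlite3 module, where enforcement is controlled per connection with PRAGMA foreign_keys; other DBMSs such as MySQL control this differently. The Customer and Rental tables are reduced to a minimum, and the helper is_rejected is hypothetical. The connection is opened in autocommit mode because SQLite ignores this pragma inside an open transaction.

```python
import sqlite3

# isolation_level=None gives autocommit, so each PRAGMA below takes effect immediately.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Customer (CustomerNr INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Rental (RentalNr INTEGER PRIMARY KEY, "
    "CustomerNr INTEGER NOT NULL REFERENCES Customer(CustomerNr))"
)
conn.execute("INSERT INTO Customer VALUES (2, 'Example customer')")
conn.execute("INSERT INTO Rental VALUES (1, 2)")


def is_rejected(statement):
    """Run a statement and report whether the DBMS refused it on integrity grounds."""
    try:
        conn.execute(statement)
        return False
    except sqlite3.IntegrityError:
        return True


# Mechanism switched off: a rental for a non-existent customer slips through.
conn.execute("PRAGMA foreign_keys = OFF")
orphan_allowed_when_off = not is_rejected("INSERT INTO Rental VALUES (2, 99)")
conn.execute("DELETE FROM Rental WHERE RentalNr = 2")  # clean up the orphan

# Mechanism switched on: all three rules are enforced.
conn.execute("PRAGMA foreign_keys = ON")
insert_rejected = is_rejected("INSERT INTO Rental VALUES (3, 99)")
delete_rejected = is_rejected("DELETE FROM Customer WHERE CustomerNr = 2")
update_rejected = is_rejected("UPDATE Customer SET CustomerNr = 9 WHERE CustomerNr = 2")

print(orphan_allowed_when_off, insert_rejected, delete_rejected, update_rejected)
```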
Handling referential integrity
The referential integrity mechanism is often configurable in a database:
Set the foreign key to null on delete.
Delete all children automatically.
Automatically update the foreign key values in the child relation.
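These options correspond to the standard SQL clauses ON DELETE SET NULL, ON DELETE CASCADE and ON UPDATE CASCADE. A minimal sketch in SQLite via Python's sqlite3 module follows; the reduced Customer, Rental and TelephoneNr tables and their columns are assumptions chosen for the illustration, and which option suits which table is a design decision, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Customer (CustomerNr INTEGER PRIMARY KEY)")
# Rental keeps its rows when the customer is deleted, but loses the reference.
conn.execute(
    "CREATE TABLE Rental (RentalNr INTEGER PRIMARY KEY, "
    "CustomerNr INTEGER REFERENCES Customer(CustomerNr) "
    "ON DELETE SET NULL ON UPDATE CASCADE)"
)
# TelephoneNr rows are deleted together with their customer.
conn.execute(
    "CREATE TABLE TelephoneNr (Number TEXT PRIMARY KEY, "
    "CustomerNr INTEGER REFERENCES Customer(CustomerNr) "
    "ON DELETE CASCADE ON UPDATE CASCADE)"
)
conn.execute("INSERT INTO Customer VALUES (2)")
conn.execute("INSERT INTO Rental VALUES (1, 2)")
conn.execute("INSERT INTO TelephoneNr VALUES ('12345678', 2)")

# Changing the parent's key is propagated to the children (ON UPDATE CASCADE).
conn.execute("UPDATE Customer SET CustomerNr = 9 WHERE CustomerNr = 2")
rental_after_update = conn.execute(
    "SELECT CustomerNr FROM Rental WHERE RentalNr = 1"
).fetchone()

# Deleting the parent: the rental is kept with a NULL reference, the phone row is removed.
conn.execute("DELETE FROM Customer WHERE CustomerNr = 9")
rental_after_delete = conn.execute(
    "SELECT CustomerNr FROM Rental WHERE RentalNr = 1"
).fetchone()
telephone_count = conn.execute("SELECT COUNT(*) FROM TelephoneNr").fetchone()[0]

print(rental_after_update, rental_after_delete, telephone_count)
```

Setting the reference to null keeps the rental row but loses the information about who rented the car, which is the problem described on the deletion slide, so the choice has consequences for recordkeeping.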
Another example
A combination of an ER-diagram and table information.
It shows relation names, attribute names, primary keys, foreign keys and relationships.
http://www.jpmensah.com/ITEC485/images/er_diagram.gif
Tables and ER-diagrams
The previous slide is a good example of what we are going to learn in this course: the relationship between tables in the relational model as defined in an ER-diagram.
An ER-diagram ultimately defines the structure of the schema.
But we first have to understand the basic concepts of schema, relations, attributes, tuples, primary keys, foreign keys and relationships.
Finally
We have explored a lot of the problems associated with relations in a database.
We should now have a good grasp of the terminology and the subject area.
Next we will look at how we model a database using ER-modelling and generate ER-diagrams, before we start developing a model and implementing it in a database.