Chapter 9: Database Systems
9.1 Database Fundamentals
9.2 The Relational Model
9.3 Object-Oriented Databases
9.4 Maintaining Database Integrity
9.5 Traditional File Structures
9.6 Data Mining
9.7 Social Impact of Database Technology
Definition of a Database
A database (DB) is a collection of data with internal links that allow users to access the data from a variety of perspectives.
Contrast: a flat file, which can only be accessed in one way.
Database: multidimensional, since internal links between its entries make the information accessible from a variety of perspectives.
Flat file: the traditional one-dimensional file storage system, which presents its information from a single point of view.
Example of a Flat File
Order no.  Customer name  Customer address  Price  Due date  Product
20         Honeywell      205 North Street  £3400  2.4.06    Panel
43         Univac         15 Retail Road    £2000  5.5.06    Case
37         Sperry         1 The Lane        £1000  1.7.06    Plate
Relational Database
Customer file
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Order file
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Relational Database linkage: the Customer no. column in the Order file links each order to a row in the Customer file.
Figure 9.1 A file versus a database organization
Databases emerged as a means of integrating the information stored and maintained by a particular organisation.
Different departments and users can use the same data.
Benefits:
  Reduce (or remove) duplication of data stored in several places.
  Ensure all departments use the same version of the data.
  Make changes easier: data is changed in one place only, not in several files.
Advantages of a Single Central Database
Easier to control access to information.
Easier to update information.
Easier (overall) to maintain and upgrade as technology improves.
Potential problems of data integration
Security and privacy issues: sensitive data (personal data, salary records, ...) may be accessed by unauthorised personnel.
Example: a payroll user's access must be restricted so that it excludes other financial data.
The ability to control access to information in the database is as important as the ability to share it.
Schemas
To provide different users with different information within a database, database systems often rely on schemas and subschemas.
A schema is a description of the entire database, i.e. the contents of records and the links between them.
Example of a Schema
Customer(Customer ID [PK], Name, Address)
Order(Order No. [PK], Customer ID [FK], Invoice No. [FK], Status)
Invoice(Invoice No. [PK], Customer ID [FK], Order No. [FK], Status)
Product(Product Code [PK], Description)
[PK] = primary key, [FK] = foreign key
http://en.wikipedia.org/wiki/Relational_model
Schemas and subschemas
Schema = a description of the structure of an entire database, used by database software to maintain the database.
Subschema = a description of only that portion of the database pertinent to a particular user's needs, used to prevent sensitive data from being accessed by unauthorised personnel.
Database management systems (DBMS)
Database Management System = a software layer that maintains a database and manipulates it in response to requests from applications.
Distributed database = a database stored on multiple machines. The DBMS masks this organisational detail from its users.
Data independence = the ability to change the organisation of a database without changing the application software that uses it.
Figure 9.2 The conceptual layers of a database implementation
Two software layers:
  application software (user interface)
  DBMS (database manipulation)
Advantages of the Conceptual Layers
Application software is simplified, as it is insulated from database management details.
Easier to control user access through a single DBMS.
Data independence: the database organisation can be changed without changing the application software.
To make a change in the DB for one user, only the overall schema and the subschemas of the users involved in the change need to be altered.
Database models
Database model = conceptual view of a database
  Relational database model
  Object-oriented database model
Abstraction: actual storage details are hidden from the user.
9.2 Relational database model
All data is stored in rectangular tables called relations.
Each row in a relation is a tuple; cardinality = number of tuples.
Each column is an attribute; degree = number of attributes.
Name of relation: Customers (the first row below is the table heading)
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Advantages of this Relational Model
Information is preserved: e.g. the name and address of a customer are recorded even if the customer has no orders.
Information is not duplicated: e.g. the name and address of a customer with two orders are not entered twice.
Figure 9.3 A relation containing employee information
A relation does not specify the order of the attributes or the order of the tuples.
Customers
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
is the same as
Customers
Customer no.  Address           Name
103           15 Retail Road    Univac
54            205 North Street  Honeywell
102           1 The Lane        Sperry
Repeated Tuples
A relation cannot have repeated tuples.
A table in which the same row appears more than once (for example, two copies of the row 54, Honeywell, 205 North Street) is not a relation in a relational database.
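The last three slides (no tuple order, no attribute order, no repeated tuples) all follow if a relation is thought of as a set. The following is a minimal Python sketch of that idea, not a description of how a DBMS stores data. The helper row() and the attribute names are assumptions made for illustration; each tuple is modelled as a frozenset of (attribute, value) pairs so that attribute order is irrelevant too.

```python
def row(**attrs):
    """Model a tuple as an unordered collection of (attribute, value) pairs."""
    return frozenset(attrs.items())

customers_a = {
    row(customer_no=54, name="Honeywell", address="205 North Street"),
    row(customer_no=103, name="Univac", address="15 Retail Road"),
    row(customer_no=102, name="Sperry", address="1 The Lane"),
}

# Same data with a different tuple order, a different attribute order,
# and one row written twice; the set keeps only one copy of it.
customers_b = {
    row(address="15 Retail Road", name="Univac", customer_no=103),
    row(address="205 North Street", name="Honeywell", customer_no=54),
    row(address="1 The Lane", name="Sperry", customer_no=102),
    row(address="1 The Lane", name="Sperry", customer_no=102),
}

same_relation = customers_a == customers_b
cardinality = len(customers_a)          # number of tuples
degree = len(next(iter(customers_a)))   # number of attributes
```

Here same_relation is True, and both the cardinality and the degree of Customers are 3.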
Attributes in Different Relations
An attribute of a given relation is not an attribute of any other relation.
E.g. "Orders.customer no" is a different attribute from "Customers.customer no".
Keys
Primary key: an attribute whose value uniquely identifies a tuple.
Composite key: a minimal set of attributes whose values together uniquely identify a tuple.
Foreign key: a set of attributes pointing to a primary key or a composite key in another table.
Example of a Composite Key
House no.  Street       Town     Post Code
23         High St.     Chester  CH3 8PN
1          Acacia Ave.  Bury     DB21 0TQ
Composite key: House no., Post Code.
Example of a Foreign Key
Customer file
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Order file
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Customer no. in the Order file is a foreign key: it points to the primary key Customer no. of the Customer file.
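A foreign key can also be enforced by the DBMS. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The column names with underscores, the integer prices, and the rejected order (order 99 for a non-existent customer 999) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE Customers (customer_no INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.execute(
    """CREATE TABLE Orders (
           order_no    INTEGER PRIMARY KEY,
           customer_no INTEGER NOT NULL REFERENCES Customers(customer_no),
           price       INTEGER,
           due_date    TEXT,
           product     TEXT)"""
)
conn.execute("INSERT INTO Customers VALUES (54, 'Honeywell', '205 North Street')")
conn.execute("INSERT INTO Orders VALUES (20, 54, 3400, '2.4.06', 'Panel')")

# Hypothetical order that points at a customer who does not exist.
try:
    conn.execute("INSERT INTO Orders VALUES (99, 999, 1, '1.1.06', 'Bolt')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
```

The second insert is refused, so only order 20 is stored.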
Evaluating a relational design
Avoid multiple concepts within one relation:
  Can lead to redundant data.
  Deleting a tuple could also delete necessary but unrelated information.
Table Structure
Each table should correspond to a single concept or task.
Each row of a table should be uniquely identified by a key.
Avoid multiple copies of information.
Improving a relational design
Decomposition = dividing the columns of a relation into two or more relations, duplicating those columns necessary to maintain relationships.
Lossless (or nonloss) decomposition = a "correct" decomposition that does not lose any information.
Aim: to produce a better table structure.
Example of a Lossless Decomposition
Original table
Order no.  Customer no.  Price  Due date  Product  Name       Address
20         54            £3400  2.4.06    Panel    Honeywell  205 North Street
43         103           £2000  5.5.06    Case     Univac     15 Retail Road
37         102           £1000  1.7.06    Plate    Sperry     1 The Lane
Lossless Decomposition
Customer file
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Order file
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Example of a Lossy Decomposition
Empl Id  Job Title  Dept
18       Manager    Planning
17       Manager    Finance
203      Assistant  Finance
Lossy Decomposition
Empl Id  Job Title
18       Manager
17       Manager
203      Assistant
Job Title  Dept
Manager    Planning
Manager    Finance
Assistant  Finance
After decomposition we can no longer tell in which department employees 17 and 18 worked.
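Whether a decomposition is lossless can be checked by joining the pieces back together and comparing the result with the original. The following Python sketch does this for the employee example. The helper functions and attribute names are illustrative, and the data assume that employees 17 and 18 are both managers (in Finance and Planning respectively) and that employee 203 is an assistant in Finance.

```python
def row(**attrs):
    return frozenset(attrs.items())

def project(relation, attrs):
    """Keep only the named attributes; the set removes duplicate tuples."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in relation}

def natural_join(r1, r2):
    """Combine every pair of tuples that agree on all shared attributes."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

staff = {
    row(empl_id=18, job_title="Manager", dept="Planning"),
    row(empl_id=17, job_title="Manager", dept="Finance"),
    row(empl_id=203, job_title="Assistant", dept="Finance"),
}

# Split on job_title, which is not a key: the rejoin invents extra tuples.
lossy = natural_join(project(staff, {"empl_id", "job_title"}),
                     project(staff, {"job_title", "dept"}))

# Split on empl_id, which is a key: the rejoin gives back the original.
lossless = natural_join(project(staff, {"empl_id", "job_title"}),
                        project(staff, {"empl_id", "dept"}))
```

The lossy rejoin contains five tuples instead of three: each manager is paired with both Planning and Finance, which is exactly the information that was lost.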
Figure 9.7 A relation and a proposed decomposition
Figure 9.4 A relation containing redundancy
Problem: more than one concept has been combined in a single relation.
Redesign the system with 3 relations...
Figure 9.5 An employee database consisting of three relations
Figure 9.6 Finding the departments in which employee 23Y34 has worked
Rules for database normalisation
http://www.datamodel.org/NormalizationRules.html
http://en.wikipedia.org/wiki/Database_normalization
Relational operations
Extracting information from a database.
Can the answers to database queries be found in a systematic way?
In the relational model the answers are always relations.
Example database queries
Get:
  1. A list of Order no. and Product.
  2. The tuples of orders for Plate.
  3. A list of Order no. and Name.
Contents of a Relational Database
All data in a relational database are stored in tables.
Customers
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Orders
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Query 1: List of Order Nos. and Products
Project: remove the attributes Customer no., Price and Due date from Orders, leaving Order no. and Product.
Orders
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Ans1
Order no.  Product
20         Panel
43         Case
37         Plate
Query 2: List of Tuples of Orders for Plate
Select: take all tuples of Orders for which the product is Plate.
Orders
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Ans2
Order no.  Customer no.  Price  Due date  Product
37         102           £1000  1.7.06    Plate
Query 3: List of Order No. and Name
Join: combine tuples in Orders and Customers, then use Project to retain the attributes Order no. and Name.
Customers
Customer no.  Name       Address
54            Honeywell  205 North Street
103           Univac     15 Retail Road
102           Sperry     1 The Lane
Orders
Order no.  Customer no.  Price  Due date  Product
20         54            £3400  2.4.06    Panel
43         103           £2000  5.5.06    Case
37         102           £1000  1.7.06    Plate
Query 3: List of Order No. and Name (cont'd)
Join and then Project (names of attributes are abbreviated).
Temp <- Join Orders and Customers where O.Cno = C.Cno
Temp
O.no  O.Cno  O.Price  O.Due   O.Prod  C.Cno  C.Nm       C.add
20    54     £3400    2.4.06  Panel   54     Honeywell  205 North Street
43    103    £2000    5.5.06  Case    103    Univac     15 Retail Road
37    102    £1000    1.7.06  Plate   102    Sperry     1 The Lane
Ans3 <- Project O.no, C.Nm from Temp
Ans3
O.no  C.Nm
20    Honeywell
43    Univac
37    Sperry
Project
The original relation is R1, the new relation created by Project is R2, and the attributes projected to R2 are att1, …, attn.
R2 <- Project att1, …, attn from R1
Example: Ans1 <- Project Orders.Order no, Orders.Product from Orders
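As a rough illustration of Project, the Python sketch below represents a relation as a list of dictionaries. The function name project, the shortened attribute names, and the integer prices are assumptions made for the example; a real DBMS works quite differently internally.

```python
orders = [
    {"order_no": 20, "customer_no": 54, "price": 3400, "due_date": "2.4.06", "product": "Panel"},
    {"order_no": 43, "customer_no": 103, "price": 2000, "due_date": "5.5.06", "product": "Case"},
    {"order_no": 37, "customer_no": 102, "price": 1000, "due_date": "1.7.06", "product": "Plate"},
]

def project(relation, attrs):
    """Return a new relation holding only the named attributes, without duplicates."""
    result = []
    for t in relation:
        reduced = {a: t[a] for a in attrs}
        if reduced not in result:   # a relation has no repeated tuples
            result.append(reduced)
    return result

# Ans1 <- Project Orders.Order no, Orders.Product from Orders
ans1 = project(orders, ["order_no", "product"])
```

The duplicate check matters because dropping attributes can make previously distinct tuples identical.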
Figure 9.9 The PROJECT operation
Select
The original relation is R1, the new relation created by Select is R2, and the conditions on the attribute values in R2 are c1, …, cn.
R2 <- Select from R1 where c1, …, cn
Example: Ans2 <- Select from Orders where Orders.Product = 'Plate'
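Select can be sketched in the same spirit: keep the tuples that satisfy a condition and leave the attributes untouched. The function name select, the attribute names, and the use of a Python function as the condition are assumptions made for the example.

```python
orders = [
    {"order_no": 20, "customer_no": 54, "price": 3400, "due_date": "2.4.06", "product": "Panel"},
    {"order_no": 43, "customer_no": 103, "price": 2000, "due_date": "5.5.06", "product": "Case"},
    {"order_no": 37, "customer_no": 102, "price": 1000, "due_date": "1.7.06", "product": "Plate"},
]

def select(relation, condition):
    """Return the tuples of the relation for which the condition holds."""
    return [t for t in relation if condition(t)]

# Ans2 <- Select from Orders where Orders.Product = 'Plate'
ans2 = select(orders, lambda t: t["product"] == "Plate")
```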
Figure 9.8 The SELECT operation
Join
The original relations are R1 and R2, the new relation created by Join is R3, and the conditions on the attribute values in R3 are c1, …, cn.
R3 <- Join R1 and R2 where c1, …, cn
Example: Temp <- Join Orders and Customers where O.Cno = C.Cno
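A simple way to picture Join is as a nested loop over both relations that keeps every pair of tuples satisfying the condition. The Python sketch below follows that idea; the function name join, the prefixes O and C, and the abbreviated attribute names are assumptions made for the example, and a real DBMS would normally use a more efficient strategy.

```python
orders = [
    {"no": 20, "cno": 54, "price": 3400, "due": "2.4.06", "product": "Panel"},
    {"no": 43, "cno": 103, "price": 2000, "due": "5.5.06", "product": "Case"},
    {"no": 37, "cno": 102, "price": 1000, "due": "1.7.06", "product": "Plate"},
]
customers = [
    {"cno": 54, "name": "Honeywell", "address": "205 North Street"},
    {"cno": 103, "name": "Univac", "address": "15 Retail Road"},
    {"cno": 102, "name": "Sperry", "address": "1 The Lane"},
]

def join(r1, r2, condition, prefix1, prefix2):
    """Pair up tuples of r1 and r2 that satisfy the condition.

    Attribute names are prefixed so that both customer numbers survive.
    """
    result = []
    for t1 in r1:
        for t2 in r2:
            if condition(t1, t2):
                combined = {f"{prefix1}.{a}": v for a, v in t1.items()}
                combined.update({f"{prefix2}.{a}": v for a, v in t2.items()})
                result.append(combined)
    return result

# Temp <- Join Orders and Customers where O.Cno = C.Cno
temp = join(orders, customers, lambda o, c: o["cno"] == c["cno"], "O", "C")

# Project the order number and the customer name from Temp.
ans3 = [{"O.no": t["O.no"], "C.name": t["C.name"]} for t in temp]
```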
Figure 9.10 The JOIN operation
Figure 9.11 Another example of the JOIN operation
Another Example for Join
R3 <- Join Orders and Customers where Orders.Price > £1500 and O.Cno = C.Cno
R3
O.no  O.Cno  O.Price  O.Due   O.Prod  C.Cno  C.Nm       C.add
20    54     £3400    2.4.06  Panel   54     Honeywell  205 North Street
43    103    £2000    5.5.06  Case    103    Univac     15 Retail Road
Join and Keys
C <- Join A and B where A.foreign key = B.key
A
key  foreign key  data
34   16           D1
28   22           D2
B
key  data
16   DB1
22   DB2
C
A.key  A.foreign key  A.data  B.key  B.data
34     16             D1      16     DB1
28     22             D2      22     DB2
Structured Query Language (SQL)
Operations to manipulate tuples:
  insert
  update
  delete
  select
Structured Query Language
SQL is used to retrieve and manipulate data.
SQL is declarative: it describes the required information but does not specify how the information is obtained.
Adopted by ANSI (1986) and ISO (1987).
First commercial release: Oracle Corporation (1979), then IBM (1979).
http://en.wikipedia.org/wiki/SQL
SQL Data Retrieval 1
List of Order no and Product.
SQL command: select "Order no", Product from Orders
Result: the attributes Order no and Product are projected from Orders.
(An attribute name that contains a space must be written in double quotes.)
SQL Data Retrieval 2
List of tuples of orders for Plate.
SQL command: select * from Orders where Orders.Product = 'Plate'
* means all attributes.
Result: the resulting relation contains all tuples in Orders for which Orders.Product = 'Plate'.
SQL Data Retrieval 3
List of Order no and Name.
SQL command: select Orders."Order no", Customers.Name from Orders, Customers where Orders."Customer no" = Customers."Customer no"
Result: Orders and Customers are joined, and the attributes Orders.Order no and Customers.Name are projected.
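The three retrieval queries can be tried out with Python's built-in sqlite3 module. This is a sketch under some assumptions: the column names use underscores instead of spaces, prices are stored as integers without the £ sign, and ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_no INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE Orders (order_no INTEGER PRIMARY KEY, customer_no INTEGER,
                         price INTEGER, due_date TEXT, product TEXT);
    INSERT INTO Customers VALUES (54, 'Honeywell', '205 North Street'),
                                 (103, 'Univac', '15 Retail Road'),
                                 (102, 'Sperry', '1 The Lane');
    INSERT INTO Orders VALUES (20, 54, 3400, '2.4.06', 'Panel'),
                              (43, 103, 2000, '5.5.06', 'Case'),
                              (37, 102, 1000, '1.7.06', 'Plate');
""")

# Query 1: project two attributes.
q1 = conn.execute("SELECT order_no, product FROM Orders ORDER BY order_no").fetchall()

# Query 2: select the tuples for Plate.
q2 = conn.execute("SELECT * FROM Orders WHERE product = 'Plate'").fetchall()

# Query 3: join the two relations, then project.
q3 = conn.execute("""
    SELECT Orders.order_no, Customers.name
    FROM Orders, Customers
    WHERE Orders.customer_no = Customers.customer_no
    ORDER BY Orders.order_no
""").fetchall()
```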
SQL Command for Data Retrieval
General form:
select <list of attributes> from <list of relations> where <list of conditions>
Example:
select * from Orders, Customers where Orders.Price > 1500 and Orders."Customer no" = Customers."Customer no"
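SQL Commands for Data Manipulation
insert into Orders values (56, 54, 1000, '1.8.06', 'Plate')
delete from Orders where Orders."Customer no" = 54
update Customers set Address = '16 High Street' where Customers."Customer no" = 102
(The price is entered as a number so that comparisons such as Orders.Price > 1500 work; the column named after set is written without a table prefix.)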
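The manipulation commands can be run in the same way. The sketch below again uses Python's sqlite3 module with assumed underscore column names and integer prices; the ? placeholders are sqlite3's way of passing values safely and are not part of the slide's SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_no INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE Orders (order_no INTEGER PRIMARY KEY, customer_no INTEGER,
                         price INTEGER, due_date TEXT, product TEXT);
    INSERT INTO Customers VALUES (54, 'Honeywell', '205 North Street'),
                                 (103, 'Univac', '15 Retail Road'),
                                 (102, 'Sperry', '1 The Lane');
    INSERT INTO Orders VALUES (20, 54, 3400, '2.4.06', 'Panel'),
                              (43, 103, 2000, '5.5.06', 'Case'),
                              (37, 102, 1000, '1.7.06', 'Plate');
""")

conn.execute("INSERT INTO Orders VALUES (?, ?, ?, ?, ?)", (56, 54, 1000, "1.8.06", "Plate"))
conn.execute("DELETE FROM Orders WHERE customer_no = ?", (54,))
conn.execute("UPDATE Customers SET address = ? WHERE customer_no = ?", ("16 High Street", 102))
conn.commit()

remaining_orders = [r[0] for r in conn.execute("SELECT order_no FROM Orders ORDER BY order_no")]
new_address = conn.execute(
    "SELECT address FROM Customers WHERE customer_no = 102"
).fetchone()[0]
```

Note that the delete removes every order of customer 54, including order 56 that was inserted one statement earlier, so only orders 37 and 43 remain.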
Advantages of SQL
High-level language with short but powerful commands.
Simple, natural syntax.
Details of information retrieval are left to the DBMS, allowing quick application development.
Multipurpose: querying, modifying, controlling access to data, defining the database.
http://www.craigsmullins.com/idug_sql.htm
Problems with SQL
Large, complicated standard.
The standard does not always fully specify the effects of a command.
There are many commercial extensions of and variations on the standard, so SQL code is usually not easily portable.
http://en.wikipedia.org/wiki/SQL
SQL and Relational Databases
SQL allows many deviations from the relational model, including:
  Duplicate tuples
  Unnamed attributes
  Two or more attributes with the same name
  Specification of the order of attributes
http://en.wikipedia.org/wiki/Relational_model
9.3 Object-oriented databases
Object-oriented database = a database constructed by applying the object-oriented paradigm.
Each data entity is stored as a persistent object.
Relationships are indicated by links between objects.
The DBMS maintains the inter-object links.
Figure 9.13 The associations between objects in an object-oriented database
Advantages of object-oriented databases
Matches the design paradigm of object-oriented applications.
Intelligence can be built into attribute handlers. Example: names of people.
Can handle exotic data types. Example: multimedia.
Can store intelligent entities.
9.4 Maintaining database integrity
Transaction = a sequence of operations that must all happen together. Example: transferring money between bank accounts.
Transaction log = non-volatile record of each transaction's activities, built before the transaction is allowed to happen.
Commit point = point at which the transaction has been recorded in the log.
Roll-back = procedure to undo a failed, partially completed transaction.
Problem of cascading rollback.
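The bank transfer example can be sketched with Python's sqlite3 module, where a connection used as a context manager commits if the block succeeds and rolls back if it raises. The Accounts table, the account names A and B, the amounts, and the rule that a balance may not go negative are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts; both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(conn, "A", "B", 30)    # succeeds
second = transfer(conn, "A", "B", 500)  # debit violates the CHECK, so the credit is undone
balances = dict(conn.execute("SELECT id, balance FROM Accounts"))
```

After the failed second transfer, account B does not keep the 500 that was credited in the first half of the transaction.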
Maintaining database integrity (continued)
Simultaneous access problems:
  Incorrect summary problem
  Lost update problem
Locking = preventing others from accessing data being used by a transaction
  Shared lock: used when reading data
  Exclusive lock: used when altering data
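The lost update problem and the effect of an exclusive lock can be illustrated with a small Python sketch. The first part replays one unlucky interleaving by hand; the second part uses threading.Lock, which behaves like an exclusive lock (shared locks are not shown). The balance and the amounts are made up for the example.

```python
import threading

account = {"balance": 100}

# Lost update: both transactions read the balance before either writes.
read_1 = account["balance"]
read_2 = account["balance"]
account["balance"] = read_1 + 50   # transaction 1 deposits 50
account["balance"] = read_2 - 30   # transaction 2 withdraws 30 and overwrites the deposit
lost_update_result = account["balance"]

# With an exclusive lock, each read-modify-write runs as one unit.
account["balance"] = 100
lock = threading.Lock()

def apply_change(delta):
    with lock:                     # other threads wait here until the lock is released
        current = account["balance"]
        account["balance"] = current + delta

threads = [threading.Thread(target=apply_change, args=(d,)) for d in (50, -30)]
for t in threads:
    t.start()
for t in threads:
    t.join()
locked_result = account["balance"]
```

The interleaved version ends at 70 because the deposit is lost, while the locked version ends at the intended 120.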
9.5 Traditional File Structures
Sequential Files
Indexed Files
Hash Files
Sequential files
Sequential file = file whose contents can only be read in order.
The reader must be able to detect end-of-file (EOF).
Data can be stored in logical records, sorted by a key field; keeping the records sorted greatly increases the speed of batch updates.
Figure 9.14 A procedure for merging two sequential files
Figure 9.15 Applying the merge algorithm (Letters are used to represent entire records. The particular letter indicates the value of the record’s key field.)
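A merge of two sorted sequential files can be sketched in Python as follows. This is an illustrative version rather than a transcription of the procedure in the figure: the function name merge_sequential is an assumption, iterators stand in for the files, a sentinel object stands in for end-of-file, and single letters stand in for records as in the caption above.

```python
def merge_sequential(file_a, file_b):
    """Merge two key-sorted record sequences into one sorted list."""
    a, b = iter(file_a), iter(file_b)
    eof = object()                       # sentinel standing in for end-of-file
    rec_a, rec_b = next(a, eof), next(b, eof)
    merged = []
    # While both inputs still have records, output the smaller key.
    while rec_a is not eof and rec_b is not eof:
        if rec_a <= rec_b:
            merged.append(rec_a)
            rec_a = next(a, eof)
        else:
            merged.append(rec_b)
            rec_b = next(b, eof)
    # One input is exhausted; copy the remainder of the other.
    while rec_a is not eof:
        merged.append(rec_a)
        rec_a = next(a, eof)
    while rec_b is not eof:
        merged.append(rec_b)
        rec_b = next(b, eof)
    return merged

result = merge_sequential("ACDF", "BE")  # hypothetical input files
```

Each input is read strictly in order and only once, which is all a sequential file allows.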
Figure 9.16 The structure of a simple employee file implemented as a text file
Indexed files
Index = list of (key, location) pairs, sorted by key values
Location = where the record is stored
Figure 9.17 Opening an indexed file
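The idea of an index as a sorted list of (key, location) pairs can be sketched in Python. Here an in-memory io.BytesIO object stands in for the data file, the location is a byte offset, and the keys and records are made up for the example.

```python
import bisect
import io

data_file = io.BytesIO()   # stands in for a file on disk
pairs = []

# Write the records in arrival order, noting where each one starts.
for key, payload in [("B20", "Panel"), ("A43", "Case"), ("C37", "Plate")]:
    pairs.append((key, data_file.tell()))
    data_file.write(f"{key},{payload}\n".encode("ascii"))

index = sorted(pairs)              # (key, location) pairs sorted by key
keys = [k for k, _ in index]

def lookup(key):
    """Binary-search the index, then jump straight to the record."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None                # key is not in the index
    data_file.seek(index[i][1])
    return data_file.readline().decode("ascii").rstrip("\n")

found = lookup("A43")
missing = lookup("Z99")
```

Because the index is sorted, a key can be found by binary search without reading the data file from the start.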
Hashing
Each record has a key.
The master file is divided into buckets.
A hash function computes a bucket number for each key value.
Each record is stored in the bucket corresponding to the hash of its key.
Figure 9.18 Hashing the key field value 25X3Z to one of 41 buckets
Figure 9.19 The rudiments of a hashing system
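One possible hash function is sketched below in Python: the ASCII bytes of the key are read as a single large integer, and the remainder after division by the number of buckets is the bucket number. This particular function is an assumption chosen for illustration and may differ from the one used in the figures; the keys other than 25X3Z are made up.

```python
NUM_BUCKETS = 41

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Read the key's ASCII bytes as one integer and take the remainder."""
    return int.from_bytes(key.encode("ascii"), "big") % num_buckets

# The master file is modelled as a list of buckets, each a list of keys.
buckets = [[] for _ in range(NUM_BUCKETS)]
for key in ["25X3Z", "11A05", "34X72"]:
    buckets[bucket_for(key)].append(key)

stored = sum(len(b) for b in buckets)
```

To retrieve a record, the same function is applied to the key and only that one bucket is searched.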
Collisions in Hashing
Collision = when two keys hash to the same bucket
Major problem when the table is over 75% full
Solution: increase the number of buckets and rehash all data
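The rehashing remedy can be sketched as a small Python class. Several details are assumptions made for illustration: the class name HashFile, the simple byte-sum hash function, measuring fullness as records divided by buckets, doubling the number of buckets, and keeping colliding records together in the same bucket.

```python
class HashFile:
    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def _bucket(self, key, n):
        return sum(key.encode("ascii")) % n      # simple illustrative hash

    def insert(self, key, record):
        # Grow before the file would become more than 75% full.
        if (self.count + 1) / len(self.buckets) > 0.75:
            self._grow()
        self.buckets[self._bucket(key, len(self.buckets))].append((key, record))
        self.count += 1

    def _grow(self):
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for key, record in old:                  # rehash all existing data
            self.buckets[self._bucket(key, len(self.buckets))].append((key, record))

    def find(self, key):
        for k, record in self.buckets[self._bucket(key, len(self.buckets))]:
            if k == key:
                return record
        return None

f = HashFile()
for i in range(10):
    f.insert(f"K{i}", i)
```

Every record has to be rehashed after growing because its bucket number depends on the number of buckets.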
9.6 Data mining
Data mining = a set of techniques for discovering patterns in collections of data
Relies heavily on statistical analyses.
Data warehouse = static data collection to be mined
Data cube = data presented from many perspectives to enable mining
Raises significant ethical issues when it involves personal information.
Data mining strategies
  Class description
  Class discrimination
  Cluster analysis
  Association analysis
  Outlier analysis
  Sequential pattern analysis
Data mining strategies
Class description: identifying properties that characterise a given group of data items.
  E.g. what characterises people who buy small economical cars, or expensive mobile phones?
Class discrimination: identifying properties that divide two groups.
  E.g. what distinguishes customers who buy used cars from those who buy new ones, or a reader of Vatan from a reader of Cumhuriyet?
Data mining strategies
Cluster analysis: trying to discover classes, i.e. finding properties of data items that identify groupings.
Association analysis: looking for links between data groups.
  E.g. do people who buy beer also buy potato chips?
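The beer and potato chips question can be turned into a very small calculation: of the purchases that contain beer, what fraction also contain potato chips? The Python sketch below uses four invented shopping baskets purely as toy data; real association analysis works on far larger collections and uses more refined measures.

```python
# Toy data: each basket is the set of items in one purchase.
baskets = [
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips"},
    {"beer", "milk"},
    {"milk", "bread"},
]

with_beer = [b for b in baskets if "beer" in b]
with_both = [b for b in with_beer if "potato chips" in b]

# Fraction of beer purchases that also include potato chips.
share = len(with_both) / len(with_beer)
```

In this toy data two of the three beer purchases also include potato chips.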
Data mining strategies
Outlier analysis: identifying data entries that are out of the ordinary.
  E.g. identifying use of stolen credit cards from altered spending patterns.
Sequential pattern analysis: trying to identify patterns of behaviour over time.
  E.g. economic trends, climate conditions...
Social impact of database technology
Problems:
  Massive amounts of personal data are being collected, often without the knowledge or meaningful consent of the affected people.
  Data merging produces new, more invasive information.
  Errors are widely disseminated and hard to correct.
Remedies:
  Existing legal remedies are largely ineffective.
  Negative publicity may be more effective.