Database: Review Sept. 2004 Yangjun Chen
Database
- Introduction: basic concepts, system architecture
- Data modeling: ER-model
- B+-tree, hashing
- Relational data model, relational algebra
- SQL: DDL, DML
- Normalization
- Lossless join
- Hierarchical databases
Introduction to database systems
- What is a database?
- The main characteristics of a database
- The basic database design method
- The entity-relationship data model for application modeling
The main characteristics of the database approach:
- single repository of data
- sharable by multiple users: concurrency control and transaction concept
- security and integrity constraints
- self-describing: the system catalogue contains meta data
- program-data independence: some changes to the database are transparent to programs/users
- multiple views of data, to support individual needs of programs/users
Concepts and Architecture
- Database schema, schema evolution, database state
- Working process with a database system
- Database system architecture
- Data independence concept
Database schema, relation schema, schema evolution, database state

Student
    Name    StNo    Class   Major
    Smith   17      1       CS
    Brown   8       2       CS
Course
    CName       CNo     CrHrs   Dept
    Database    …       …       CS
    C           …       …       CS
Section
    SId     CNo     Semester    Yr      Instructor
    …       …       Spring      2000    Smith
    …       …       Winter      2000    Smith
    …       …       Spring      2000    Jones
Grades
    StNo    SId     Grade
    …       …       A
    …       …       B
Working process with a database system:
- Definition: record structure, data elements, names, data types, constraints, etc.
- Construction: create database files, populate the database with records
- Manipulation: querying, updating
Database Management System (DBMS)
A collection of software facilitating the definition, construction and manipulation of databases.
Figure (components): users/actors; the DBMS with request manager, query evaluation and storage manager; meta data; stored database.
Three-schema architecture
- External views: a specific user's or group's view of the database
- Conceptual schema: describes the whole database for all users
- Internal schema: physical storage structures and details
Data modeling using ER-model
Entity-relationship model
- Entity types: strong entities, weak entities
- Relationships among entities
- Attributes: attribute classification
- Constraints: cardinality constraints, participation constraints
ER-to-Relation mapping
ER-model (figure):
- Entity types: employee, department, project, dependent
- Relationship types: works for, manages, works on, dependents of, controls, supervision (supervisor, supervisee)
- Attributes shown: ssn, name (fname, minit, lname), bdate, sex, address, salary; dependent name, sex, birthdate, relationship; name, number, location for department and for project; number of employees; startdate; hours
- Cardinality labels: 1, N, M
Hashing technique
- external hashing
- static hashing and dynamic hashing
- hash function: a mathematical function that maps a key to a bucket address
- collisions
- collision resolution schemes: open addressing, chaining, multiple hashing
- linear hashing
External hashing: the data are on the disk.
Static hashing:
- uses a hash function to map keys to bucket addresses
- the primary area cannot be changed
- collision resolution schemes: open addressing, chaining, multiple hashing
Dynamic hashing:
- the primary area can be changed
- linear hashing
Linear hashing:
1. What is a phase?
2. How to split a bucket?
3. When to split a bucket?
4. Which bucket will be chosen to split next?
Linear hashing:
- initially the hash file contains $M$ buckets
- hash functions: $h_i = \text{key} \bmod 2^i M$ $(i = 0, 1, 2, \ldots)$
- the insertion process can be divided into several phases
phase 1:
- insertion using $h_0 = \text{key} \bmod M$
- splitting using $h_1 = \text{key} \bmod 2M$
- splitting rule: overflow of a bucket, or load factor > constant (e.g., 0.70)
- an overflowing record is put in the overflow area or redistributed by splitting a bucket
- buckets are split in order from $n = 0$ to $n = M - 1$ (after each split, $n$ is increased by 1)
- phase 1 finishes when $n = M$ (in this case, the primary area becomes $2M$ buckets long)
phase 2:
- insertion using $h_1 = \text{key} \bmod 2M$
- splitting using $h_2 = \text{key} \bmod 4M$
- splitting rule: overflow of a bucket, or load factor > constant (e.g., 0.70)
- an overflowing record is put in the overflow area or redistributed by splitting a bucket
- buckets are split in order from $n = 0$ to $n = 2M - 1$ (after each split, $n$ is increased by 1)
- phase 2 finishes when $n = 2M$ (in this case, the primary area will contain $4M$ buckets)
phase 3: ...
… $h_2 = …$, $h_3 = …$, ...
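The phases above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the class name `LinearHashFile`, the bucket capacity, and the choice to split only when the load factor exceeds the threshold (ignoring the separate overflow trigger) are assumptions, and overflow chains are modeled simply by letting a bucket list grow.

```python
class LinearHashFile:
    """Toy linear hash file (hypothetical helper, for illustration only)."""

    def __init__(self, m=4, capacity=2, max_load=0.70):
        self.m = m                # M: initial number of buckets
        self.capacity = capacity  # assumed records per primary bucket
        self.max_load = max_load  # load-factor threshold, e.g. 0.70
        self.level = 0            # i: inserts use h_i, splits use h_{i+1}
        self.n = 0                # next bucket to split
        self.buckets = [[] for _ in range(m)]
        self.count = 0

    def _h(self, i, key):
        # h_i = key mod (2^i * M)
        return key % (2 ** i * self.m)

    def _address(self, key):
        b = self._h(self.level, key)
        if b < self.n:            # bucket b was already split in this phase
            b = self._h(self.level + 1, key)
        return b

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self.count / (len(self.buckets) * self.capacity) > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.n]
        self.buckets[self.n] = []
        self.buckets.append([])   # new bucket has address n + 2^i * M
        for key in old:           # redistribute with h_{i+1}
            self.buckets[self._h(self.level + 1, key)].append(key)
        self.n += 1
        if self.n == 2 ** self.level * self.m:  # phase finished
            self.level += 1
            self.n = 0

    def contains(self, key):
        return key in self.buckets[self._address(key)]


hash_file = LinearHashFile(m=4, capacity=2)
for k in range(50):
    hash_file.insert(k)
```

After the 50 inserts the primary area has grown beyond the initial 4 buckets, and every key is still found through `_address`, which chooses between $h_i$ and $h_{i+1}$ depending on whether the bucket has already been split in the current phase.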
Multi-level index
- tree: root, internal, leaf, subtree; parent, child, sibling; balanced, unbalanced
- B+-tree: splits on overflow, merges on underflow; in practice it is usually 3 or 4 levels deep
- search, insert, delete algorithms
B+-tree insertion: leaf node splitting, internal node splitting
Leaf splitting
When a leaf splits, a new leaf is allocated:
- the original leaf is the left sibling, the new one is the right sibling
- key and pointer pairs are redistributed: the left sibling will have smaller keys than the right sibling
- a 'copy' of the largest key in the left sibling is promoted to the parent
Figure: example, insert 31.
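A small Python sketch of the leaf split described above. The function name `split_leaf`, the example keys 25 and 30, and the choice to give the left sibling the larger half are assumptions made for illustration; record pointers are omitted.

```python
def split_leaf(keys, new_key):
    """Insert new_key into a full leaf and split it.

    Returns (left_keys, right_keys, promoted_key). The promoted key is a
    copy of the largest key in the left sibling; it also stays in the leaf.
    """
    entries = sorted(keys + [new_key])
    mid = (len(entries) + 1) // 2   # left sibling keeps the larger half
    left, right = entries[:mid], entries[mid:]
    return left, right, left[-1]


# Hypothetical full leaf [25, 30] with room for two keys; insert 31.
left, right, promoted = split_leaf([25, 30], 31)
```

Here `left` is `[25, 30]`, `right` is `[31]`, and `promoted` is `30`, so 30 appears both in the left leaf and in the parent.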
Internal node splitting
If an internal node splits and it is not the root:
- insert the key and pointer, then determine the middle key
- a new 'right' sibling is allocated
- everything to the left of the middle key stays in the left sibling
- everything to the right of the middle key goes into the right sibling
- the middle key value, along with the pointer to the new right sibling, is promoted to the parent (the middle key value 'moves' to the parent to become the discriminator between the left and right siblings)
Figure: Insert 26, 33.
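The internal split differs from the leaf split in that the middle key moves up instead of being copied. A minimal Python sketch follows; the function name `split_internal` and the sample keys and pointer labels are hypothetical.

```python
import bisect


def split_internal(keys, pointers, new_key, new_pointer):
    """Insert (new_key, new_pointer) into a full internal node and split it.

    keys is a sorted list of k keys, pointers the k + 1 child pointers;
    new_pointer is the child immediately to the right of new_key.
    Returns (left_node, middle_key, right_node), each node as (keys, pointers).
    """
    pos = bisect.bisect_left(keys, new_key)
    keys = keys[:pos] + [new_key] + keys[pos:]
    pointers = pointers[:pos + 1] + [new_pointer] + pointers[pos + 1:]
    mid = len(keys) // 2
    middle_key = keys[mid]                         # moves up, kept in neither sibling
    left = (keys[:mid], pointers[:mid + 1])
    right = (keys[mid + 1:], pointers[mid + 1:])
    return left, middle_key, right


# Hypothetical node with keys [10, 20] and children a, b, c; insert key 15 with child x.
left_node, middle, right_node = split_internal([10, 20], ["a", "b", "c"], 15, "x")
```

In this example the middle key 15 goes to the parent, the left sibling keeps `([10], ["a", "b"])`, and the right sibling gets `([20], ["x", "c"])`.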
Internal node splitting
When a new root is formed, a key value and two pointers must be placed into it.
Figure: insertion example.
Deleting nodes from a B+-tree:
1. When deleting a key from a node A, check whether the number of remaining keys (or pointers) is at least $\lceil p/2 \rceil$.
2. If it is not, redistribute keys with the left sibling B or with the right sibling C, if possible. Otherwise, merge A and B, or merge A and C.
3. When redistributing or merging, change the key values in the parent node so that the following condition is satisfied:
   $K_1 < K_2 < \ldots < K_{q-1}$ (i.e., it is an ordered set), and for every key value $X$ in the subtree pointed to by $P_i$:
   $K_{i-1} < X \le K_i$ for $1 < i < q$
   $X \le K_1$ for $i = 1$
   $K_{q-1} < X$ for $i = q$
Figure: a B+-tree with $p = 3$ and $p_{leaf} = 2$; the leaf entries point to records.
Entry deletion - deletion sequence: 8, 12, 9, 7 (figures)
- Deleting 8 causes a redistribution of the node's entries.
- 12 is removed, then 9 is removed.
- Deleting 7 leaves a pointer in the parent with no use. Therefore, a merge at the level above the leaf level occurs.
- For this merge, 5 will be taken as a key value in A, since any key value in B is less than or equal to 5 but any key value in C is larger than 5.
- The pointer that becomes useless is dropped, and the corresponding node is also removed.
Data modeling using the relational model; relational algebra
Relational data model
- relational schema, relations
- database schema, database state
- integrity constraints and updating
Relational algebra
- select, project, join, cartesian product
- division
- set operations: union, intersection, difference
Integrity Constraints
Any database will have some number of constraints that must be applied to ensure correct data (valid states).
1. domain constraints
- a domain is a restriction on the set of valid values
- domain constraints specify that the value of each attribute A must be an atomic value from the domain dom(A)
2. key constraints
- a superkey is any combination of attributes that uniquely identifies a tuple: $t_1[\text{superkey}] \ne t_2[\text{superkey}]$ for any two distinct tuples $t_1$, $t_2$
  Example: (in Employee)
- a key is a superkey that has a minimal set of attributes
  Example: (in Employee)
Integrity Constraints
If a relation schema has more than one key, each of them is called a candidate key.
One candidate key is chosen as the primary key (PK).
A foreign key (FK) is defined as follows:
i) Consider two relation schemas $R_1$ and $R_2$.
ii) The attributes in FK in $R_1$ have the same domain(s) as the primary key attributes PK in $R_2$; the attributes FK are said to reference or refer to the relation $R_2$.
iii) A value of FK in a tuple $t_1$ of the current state $r(R_1)$ either occurs as a value of PK for some tuple $t_2$ in the current state $r(R_2)$ or is null. In the former case, we have $t_1[FK] = t_2[PK]$, and we say that the tuple $t_1$ references or refers to the tuple $t_2$.
Example: Employee(SSN, …, Dno), Dept(Dno, …); Dno in Employee is a FK referencing Dept.
Integrity Constraints
3. entity integrity
- no part of a PK can be null
4. referential integrity
- the domain of a FK must be the same as the domain of the PK
- a FK must be null or have a value that appears as a PK value
5. semantic integrity
- other rules that the application domain requires
- state constraint: gross salary > net income
- transition constraint: Widowed can only follow Married; the salary of an employee cannot decrease
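As a rough illustration of the key, entity integrity and referential integrity rules, the following Python sketch validates a small database state held in lists of dictionaries. The function name `check_integrity`, the lowercase attribute names, and the sample rows are assumptions; a real DBMS enforces these constraints itself.

```python
def check_integrity(employees, departments):
    """Return a list of violation messages (empty if the state is valid)."""
    problems = []
    dept_keys = {d["dno"] for d in departments}
    seen_ssn = set()
    for e in employees:
        if e["ssn"] is None:
            # entity integrity: no part of a PK can be null
            problems.append("entity integrity: null ssn")
        elif e["ssn"] in seen_ssn:
            # key constraint: two tuples may not share a key value
            problems.append(f"key constraint: duplicate ssn {e['ssn']}")
        seen_ssn.add(e["ssn"])
        # referential integrity: FK is null or matches an existing PK value
        if e["dno"] is not None and e["dno"] not in dept_keys:
            problems.append(f"referential integrity: dno {e['dno']} not in Dept")
    return problems


# Made-up sample state.
dept = [{"dno": 1}, {"dno": 5}]
emp_ok = [{"ssn": 909, "dno": 1}, {"ssn": 910, "dno": None}]
emp_bad = [{"ssn": 909, "dno": 7}, {"ssn": 909, "dno": 5}, {"ssn": None, "dno": 1}]

valid_report = check_integrity(emp_ok, dept)
invalid_report = check_integrity(emp_bad, dept)
```

`valid_report` is empty, while `invalid_report` contains one message for each of the three broken rules.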
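Relational algebra
Retrieve for each female employee a list of the names of her dependents:
$\text{FEMALE\_EMPS} \leftarrow \sigma_{\text{SEX} = \text{'F'}}(\text{EMPLOYEE})$
$\text{EMPNAMES} \leftarrow \pi_{\text{FNAME}, \text{LNAME}, \text{SSN}}(\text{FEMALE\_EMPS})$
$\text{ACTUAL\_DEPENDENTS} \leftarrow \text{EMPNAMES} \bowtie_{\text{SSN} = \text{ESSN}} \text{DEPENDENT}$
$\text{RESULT} \leftarrow \pi_{\text{FNAME}, \text{LNAME}, \text{DEPENDENT\_NAME}}(\text{ACTUAL\_DEPENDENTS})$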
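The same query can be traced step by step in Python, with relations modeled as lists of dictionaries. The helper names `select`, `project` and `join` and the sample rows are made up for illustration; only the attribute and relation names come from the query above.

```python
def select(relation, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """pi: keep the listed attributes and eliminate duplicate tuples."""
    seen, out = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out


def join(r, s, condition):
    """theta join: combine every pair of tuples that satisfies the condition."""
    return [{**a, **b} for a in r for b in s if condition(a, b)]


# Made-up sample state.
EMPLOYEE = [
    {"FNAME": "Ann", "LNAME": "Lee", "SSN": "111", "SEX": "F"},
    {"FNAME": "Bob", "LNAME": "Ray", "SSN": "222", "SEX": "M"},
]
DEPENDENT = [
    {"ESSN": "111", "DEPENDENT_NAME": "Tim"},
    {"ESSN": "222", "DEPENDENT_NAME": "Sue"},
]

FEMALE_EMPS = select(EMPLOYEE, lambda t: t["SEX"] == "F")
EMPNAMES = project(FEMALE_EMPS, ["FNAME", "LNAME", "SSN"])
ACTUAL_DEPENDENTS = join(EMPNAMES, DEPENDENT, lambda e, d: e["SSN"] == d["ESSN"])
RESULT = project(ACTUAL_DEPENDENTS, ["FNAME", "LNAME", "DEPENDENT_NAME"])
```

With this sample state, `RESULT` contains the single tuple for Ann Lee and her dependent Tim.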
SQL
DDL
- creating schemas
- modifying schemas
DML
- select-from-where clause
- group by, having, order by
- update
- view
DDL - Examples:
Create schema:
    CREATE SCHEMA COMPANY AUTHORIZATION JSMITH;
Create table:
    CREATE TABLE EMPLOYEE
      (FNAME     VARCHAR(15)     NOT NULL,
       MINIT     CHAR,
       LNAME     VARCHAR(15)     NOT NULL,
       SSN       CHAR(9)         NOT NULL,
       BDATE     DATE,
       ADDRESS   VARCHAR(30),
       SEX       CHAR,
       SALARY    DECIMAL(10, 2),
       SUPERSSN  CHAR(9),
       DNO       INT             NOT NULL,
       PRIMARY KEY (SSN),
       FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN),
       FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER));
DDL - Examples:
Drop schema:
    DROP SCHEMA COMPANY CASCADE;
    DROP SCHEMA COMPANY RESTRICT;
Drop table:
    DROP TABLE DEPENDENT CASCADE;
    DROP TABLE DEPENDENT RESTRICT;
Alter table:
    ALTER TABLE COMPANY.EMPLOYEE ADD JOB VARCHAR(12);
    ALTER TABLE COMPANY.EMPLOYEE DROP ADDRESS CASCADE;
DML - select-from-where clause
Retrieve a list of employees and the projects they are working on, ordered by department and, within each department, alphabetically by last name, first name:
    SELECT   DNAME, LNAME, FNAME, PNAME
    FROM     DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT
    WHERE    DNUMBER = DNO AND SSN = ESSN AND PNO = PNUMBER
    ORDER BY DNAME, LNAME, FNAME
Other clauses and functions:
- order by clause
- group by clause
- having clause
- aggregation functions: max, min, average, count, sum
DML - Insert, Update, Delete
    INSERT INTO employee ( fname, lname, ssn, dno )
    VALUES ( "Joe", "Smith", 909, 1);
    UPDATE employee SET salary = … WHERE ssn=909;
    DELETE FROM employee WHERE ssn=909;
Note that Access changes the above INSERT to read:
    INSERT INTO employee ( fname, lname, ssn, dno )
    SELECT "Joe", "Smith", 909, 1;
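The three statements can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table definition is a cut-down assumption, and the salary value 30000 is invented because the slide leaves it out. Parameter placeholders (`?`) are used instead of quoted literals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal table; the slides define EMPLOYEE with more columns.
conn.execute(
    "CREATE TABLE employee ("
    "fname TEXT, lname TEXT, ssn INTEGER PRIMARY KEY, salary REAL, dno INTEGER)"
)

# INSERT: the same row as in the slide.
conn.execute(
    "INSERT INTO employee (fname, lname, ssn, dno) VALUES (?, ?, ?, ?)",
    ("Joe", "Smith", 909, 1),
)

# UPDATE: 30000 is a made-up value, not taken from the slide.
conn.execute("UPDATE employee SET salary = ? WHERE ssn = ?", (30000, 909))
after_update = conn.execute(
    "SELECT fname, salary FROM employee WHERE ssn = 909"
).fetchall()

# DELETE: remove the row again.
conn.execute("DELETE FROM employee WHERE ssn = 909")
remaining = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

After the update the row reads `("Joe", 30000.0)`, and after the delete the table is empty again.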
View definition
Use a Create View command: essentially a select specifying the data that makes up the view.
    Create View Enames as select lname, fname from employee
    CREATE VIEW Enames (lname, fname)
    AS SELECT LNAME, FNAME
       FROM   EMPLOYEE
    CREATE VIEW DEPT_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL)
    AS SELECT   DNAME, COUNT(*), SUM(SALARY)
       FROM     DEPARTMENT, EMPLOYEE
       WHERE    DNUMBER = DNO
       GROUP BY DNAME;
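To see what the DEPT_INFO view returns, it can be created over a tiny made-up database using Python's built-in `sqlite3` module. The two table definitions and all rows are assumptions kept to the columns the view needs; the view text itself is the one from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed cut-down tables with made-up rows.
CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, SALARY REAL, DNO INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Headquarters', 1);
INSERT INTO EMPLOYEE VALUES ('1', 30000, 5), ('2', 40000, 5), ('3', 55000, 1);

CREATE VIEW DEPT_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL)
AS SELECT   DNAME, COUNT(*), SUM(SALARY)
   FROM     DEPARTMENT, EMPLOYEE
   WHERE    DNUMBER = DNO
   GROUP BY DNAME;
""")

# The view is queried like a table; it is re-evaluated on each use.
rows = conn.execute("SELECT * FROM DEPT_INFO ORDER BY DEPT_NAME").fetchall()
```

For this sample data the view yields one row per department: Headquarters with 1 employee and a total salary of 55000, and Research with 2 employees and a total of 70000.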
Normalization
functional dependencies
- data redundancy, update anomalies
- what is a functional dependency?
- inference rules, minimal set of FDs
normal forms
- first normal form
- second normal form
- third normal form
- Boyce-Codd normal form
Data redundancy and update anomalies:
EmployeeDepartment(ename, ssn, bdate, address, dnumber, dname)
This is similar to Employee, but we have included dname.
In the two prior cases with EmployeeDepartment and EmployeeProject, we have redundant information in the database:
- if two employees work in the same department, then that department name is replicated
- if more than one employee works on a project, then the project location is replicated
- if an employee works on more than one project, his/her name is replicated
Redundant data leads to
- additional space requirements
- update anomalies
Suppose EmployeeDepartment is the only relation where the department name is recorded.
- insert anomalies: adding a new department is complicated unless there is also an employee for that department
- deletion anomalies: if we delete all employees for some department, what should happen to the department information?
- modification anomalies: if we change the name of a department, then we must change it in all tuples referring to that department
Functional dependencies:
Suppose we have a relation R comprising attributes X, Y, … We say a functional dependency exists between the attributes X and Y if, whenever a tuple exists with the value x for X, it will always have the same value y for Y.
Notation: $X \rightarrow Y$, where X is the left-hand side (LHS) and Y is the right-hand side (RHS).
Student(student_no, student_name, course_no, gender)
Given a specific student number, there is only one value for student name and only one value for gender found with it:
student_no $\rightarrow$ student_name
student_no $\rightarrow$ gender
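A functional dependency can be tested against a given relation state with a short Python function. The name `fd_holds` and the sample Student rows are made up. Note that a state can only refute a dependency: an FD is a statement about all legal states, so passing this check on one state does not prove it.

```python
def fd_holds(relation, lhs, rhs):
    """Return True if no two tuples agree on lhs but differ on rhs."""
    seen = {}
    for t in relation:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        # setdefault stores y for the first tuple with this x value
        if seen.setdefault(x, y) != y:
            return False
    return True


# Made-up Student state.
students = [
    {"student_no": 1, "student_name": "Ann", "course_no": "C1", "gender": "F"},
    {"student_no": 1, "student_name": "Ann", "course_no": "C2", "gender": "F"},
    {"student_no": 2, "student_name": "Bob", "course_no": "C1", "gender": "M"},
]

name_and_gender = fd_holds(students, ["student_no"], ["student_name", "gender"])
course = fd_holds(students, ["student_no"], ["course_no"])
```

In this state student_no determines student_name and gender, but not course_no, because student 1 appears with two different courses.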
Inference Rules for Functional Dependencies
From a set of FDs, we can derive some other FDs.
Example:
F = {ssn $\rightarrow$ {ename, bdate, address, dnumber}, dnumber $\rightarrow$ {dname, dmgrssn}}
By inference: ssn $\rightarrow$ {dname, dmgrssn}, ssn $\rightarrow$ dnumber, dnumber $\rightarrow$ dname.
$F^+$ (closure of F): the set of all FDs that can be deduced from F (together with F itself) is called the closure of F.
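Inference Rules for Functional Dependencies
Inference rules:
- IR1 (reflexive rule): if $X \supseteq Y$, then $X \rightarrow Y$ (in particular, $X \rightarrow X$).
- IR2 (augmentation rule): $\{X \rightarrow Y\} \models ZX \rightarrow ZY$.
- IR3 (transitive rule): $\{X \rightarrow Y, Y \rightarrow Z\} \models X \rightarrow Z$.
- IR4 (decomposition, or projective, rule): $\{X \rightarrow YZ\} \models X \rightarrow Y, X \rightarrow Z$.
- IR5 (union, or additive, rule): $\{X \rightarrow Y, X \rightarrow Z\} \models X \rightarrow YZ$.
- IR6 (pseudotransitive rule): $\{X \rightarrow Y, WY \rightarrow Z\} \models WX \rightarrow Z$.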
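In practice, whether an FD follows from F is usually decided by computing the closure of its left-hand side under F, rather than by applying the rules by hand. A minimal Python sketch follows; the function names `closure` and `implies` are made up, and the FD set is the ssn/dnumber example from the previous slide with attribute names written in lowercase.

```python
def closure(attrs, fds):
    """Attribute closure: all attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole LHS is already determined, so is the RHS.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be inferred from fds."""
    return set(rhs) <= closure(lhs, fds)


F = [
    ({"ssn"}, {"ename", "bdate", "address", "dnumber"}),
    ({"dnumber"}, {"dname", "dmgrssn"}),
]

ssn_closure = closure({"ssn"}, F)
```

`implies(F, {"ssn"}, {"dname", "dmgrssn"})` is true, which matches the inferred dependency ssn $\rightarrow$ {dname, dmgrssn}, while `implies(F, {"dnumber"}, {"ssn"})` is false.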
Equivalence of Sets of FDs
E and F are equivalent if $E^+ = F^+$.
Minimal sets of FDs
- every dependency has a single attribute on the RHS
- the attributes on the LHS of a dependency are minimal
- we cannot remove any dependency from F and still have a set of dependencies that is equivalent to F
Example: EmployeeProject(ssn, pnumber, hours, ename, plocation)
{ssn, pnumber} $\rightarrow$ hours, ssn $\rightarrow$ ename, pnumber $\rightarrow$ plocation.
Normal Forms
A series of normal forms are known that have, successively, better update characteristics. We'll consider 1NF, 2NF, 3NF, and BCNF.
A technique used to improve a relation is decomposition, where one relation is replaced by two or more relations. When we do so, we want to eliminate update anomalies without losing any information.
1NF - First Normal Form
The domain of an attribute must only contain atomic values. This disallows repeating values, sets of values, relations within relations, nested relations, …
In the example database we have a department located in possibly several locations: department 5 is located in Bellaire, Sugarland, and Houston.
If we had the relation
Department
    dnumber   dname      dmgrssn   dlocations
    5         Research   …         Bellaire, Sugarland, Houston
then it would not be 1NF because there are multiple values to be kept in dlocations.
1NF - First Normal Form
If we have a non-1NF relation we can decompose it, or modify it appropriately, to generate 1NF relations. There are 3 options:
option 1: split off the problem attribute into a new relation (create a DepartmentLocation relation).
Department
    dnumber   dname      dmgrssn
    5         Research   …
DepartmentLocation
    dnumber   dlocation
    5         Bellaire
    5         Sugarland
    5         Houston
Generally considered the best solution.
2NF - Second Normal Form
Full functional dependency
$X \rightarrow Y$ is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more.
EmployeeProject(ssn, pnumber, hours, ename, plocation)
{ssn, pnumber} $\rightarrow$ hours is a full dependency (neither ssn $\rightarrow$ hours nor pnumber $\rightarrow$ hours holds).
2NF - Second Normal Form
Partial functional dependency
$X \rightarrow Y$ is a partial functional dependency if removal of some attribute A from X does not affect the dependency.
EmployeeProject(ssn, pnumber, hours, ename, plocation)
{ssn, pnumber} $\rightarrow$ ename is a partial dependency because ssn $\rightarrow$ ename holds.
2NF - Second Normal Form
A relation schema is in 2NF if (1) it is in 1NF and (2) every non-key attribute is fully functionally dependent on the primary key.
If we had the relation
EmployeeProject(ssn, pnumber, hours, ename, plocation)
then this relation would not be 2NF because of two separate violations of the 2NF definition: ename depends on ssn alone, and plocation depends on pnumber alone.
2NF - Second Normal Form
We correct this by decomposing the relation into three relations, splitting off the offending attributes (the partial dependencies on the key).
EmployeeProject(ssn, pnumber, hours, ename, plocation)
is replaced by the 2NF relations
    (ssn, pnumber, hours)
    (ssn, ename)
    (pnumber, plocation)
3NF - Third Normal Form
Transitive dependency
A functional dependency $X \rightarrow Y$ in a relation schema R is a transitive dependency if there is a set of attributes Z that is not a subset of any key of R, and both $X \rightarrow Z$ and $Z \rightarrow Y$ hold.
EmployeeDept(ename, ssn, bdate, address, dnumber, dname)
ssn $\rightarrow$ dnumber and dnumber $\rightarrow$ dname
3NF - Third Normal Form
A relation schema is in 3NF if (1) it is in 2NF and (2) no non-key attribute is functionally dependent on another non-key attribute (there must be no transitive dependency of a non-key attribute on the PK).
If we had the relation
(ename, ssn, bdate, address, dnumber, dname)
then this relation would not be 3NF because dname is functionally dependent on dnumber and neither is a key attribute.
3NF - Third Normal Form
We correct this by decomposing: splitting off the transitive dependencies.
EmployeeDept(ename, ssn, bdate, address, dnumber, dname)
is replaced by the 3NF relations
    (ename, ssn, bdate, address, dnumber)
    (dnumber, dname)
Boyce-Codd Normal Form, BCNF
Consider a different definition of 3NF, which is equivalent to the previous one.
A relation schema R is in 3NF if, whenever a functional dependency $X \rightarrow A$ holds in R, either
(a) X is a superkey of R, or
(b) A is a prime attribute of R.
A superkey of a relation schema $R = \{A_1, A_2, \ldots, A_n\}$ is a set of attributes $S \subseteq R$ with the property that no two distinct tuples $t_1$ and $t_2$ in any legal state r of R will have $t_1[S] = t_2[S]$.
An attribute is called a prime attribute if it is a member of any key.
Boyce-Codd Normal Form, BCNF
If we remove (b) from the previous definition of 3NF, we have the definition of BCNF.
A relation schema is in BCNF if every determinant is a superkey.
Stronger than 3NF:
- no partial dependencies
- no transitive dependencies where a non-key attribute is dependent on another non-key attribute
- no non-key attributes appear in the LHS of a functional dependency
Boyce-Codd Normal Form, BCNF
Consider the relation (student_no, course_no, instr_no):
- An instructor teaches one course only.
- A student takes a course and has one instructor.
FDs: {student_no, course_no} $\rightarrow$ instr_no; instr_no $\rightarrow$ course_no
In 3NF! But not in BCNF, since the determinant instr_no is not a superkey.
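The BCNF condition (every determinant is a superkey) can be checked mechanically with attribute closures. The sketch below is illustrative only: `closure` and `bcnf_violations` are made-up helper names, and only the FDs that are explicitly listed are checked, not every FD in the closure of F.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the FDs (pairs of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Listed non-trivial FDs whose left-hand side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(attributes)
    ]


R = {"student_no", "course_no", "instr_no"}
F = [
    ({"student_no", "course_no"}, {"instr_no"}),
    ({"instr_no"}, {"course_no"}),
]

violations = bcnf_violations(R, F)
```

For this relation the only violation reported is instr_no $\rightarrow$ course_no: the closure of {instr_no} is {instr_no, course_no}, which does not cover R.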
Boyce-Codd Normal Form, BCNF
Decomposition (S# = student_no, C# = course_no, I# = instr_no):
    (course_no, instr_no)
    (student_no, instr_no)
This decomposition preserves all the information.
The only FD is instr_no $\rightarrow$ course_no, but the join preserves {student_no, course_no} $\rightarrow$ instr_no.
Lossless join
Definition of lossless join property
- relation decomposition
- lossless join property
Testing algorithm
- matrix construction
- matrix initialization
- matrix modification
Basic definition of Lossless-join
A decomposition $D = \{R_1, R_2, \ldots, R_m\}$ of R has the lossless join property with respect to the set of dependencies F on R if, for every relation r of R that satisfies F, the following holds:
$\bowtie(\pi_{R_1}(r), \ldots, \pi_{R_m}(r)) = r$,
where $\bowtie$ is the natural join of all the relations in D.
The word loss in lossless refers to loss of information, not to loss of tuples.
Emp_PROJ(SSN, PNUM, hours, ENAME, PNAME, PLOCATION)
F = {SSN $\rightarrow$ ENAME, PNUM $\rightarrow$ {PNAME, PLOCATION}, {SSN, PNUM} $\rightarrow$ hours}
Decomposition:
    R1(SSN, ENAME)
    R2(PNUM, PNAME, PLOCATION)
    R3(SSN, PNUM, hours)
Lossless join
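The definition can be tried on a concrete state: project the relation onto each schema of the decomposition, join the projections again, and compare with the original. The Python sketch below uses made-up rows and helper names (`project`, `natural_join`, `rejoin`, `same_relation`). The second decomposition, which splits off (ENAME, PLOCATION), is included as a contrasting case. A check on one state can expose a lossy decomposition but cannot prove the property for all states.

```python
def project(relation, attrs):
    """Projection with duplicate elimination."""
    distinct = {tuple((a, row[a]) for a in attrs) for row in relation}
    return [dict(t) for t in distinct]


def natural_join(r, s):
    """Combine tuples that agree on all common attributes."""
    return [
        {**a, **b}
        for a in r
        for b in s
        if all(a[k] == b[k] for k in a.keys() & b.keys())
    ]


def rejoin(relation, decomposition):
    """Project onto every schema, then join the projections back together."""
    parts = [project(relation, attrs) for attrs in decomposition]
    joined = parts[0]
    for part in parts[1:]:
        joined = natural_join(joined, part)
    return joined


def same_relation(r, s):
    canon = lambda rel: {tuple(sorted(t.items())) for t in rel}
    return canon(r) == canon(s)


# Made-up Emp_PROJ state that satisfies F.
emp_proj = [
    {"SSN": 1, "ENAME": "Smith", "PNUM": 10, "PNAME": "X", "PLOCATION": "Houston", "hours": 5},
    {"SSN": 2, "ENAME": "Wong", "PNUM": 20, "PNAME": "Y", "PLOCATION": "Houston", "hours": 7},
]

D1 = [["SSN", "ENAME"], ["PNUM", "PNAME", "PLOCATION"], ["SSN", "PNUM", "hours"]]
D2 = [["ENAME", "PLOCATION"], ["SSN", "PNUM", "hours", "PNAME", "PLOCATION"]]

lossless_on_state = same_relation(rejoin(emp_proj, D1), emp_proj)
spurious = rejoin(emp_proj, D2)
```

With D1 the join gives back exactly the two original tuples. With D2 the join on PLOCATION produces four tuples, two of them spurious (for example, Wong paired with SSN 1), which is the loss of information the definition refers to.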
Decomposition-1
Matrix construction:
          A1     A2      A3     A4      A5          A6
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    b11    b12     b13    b14     b15         b16
    R2    b21    b22     b23    b24     b25         b26
    R3    b31    b32     b33    b34     b35         b36
Matrix initialization (entry (i, j) becomes aj if Ri contains Aj):
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    a1     a2      b13    b14     b15         b16
    R2    b21    b22     a3     a4      a5          b26
    R3    a1     b32     a3     b34     b35         a6
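Matrix modification
After applying SSN $\rightarrow$ ENAME (R1 and R3 agree on SSN, so b32 becomes a2):
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    a1     a2      b13    b14     b15         b16
    R2    b21    b22     a3     a4      a5          b26
    R3    a1     a2      a3     b34     b35         a6
After applying PNUM $\rightarrow$ {PNAME, PLOCATION} (R2 and R3 agree on PNUM, so b34 and b35 become a4 and a5):
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    a1     a2      b13    b14     b15         b16
    R2    b21    b22     a3     a4      a5          b26
    R3    a1     a2      a3     a4      a5          a6
The row for R3 now contains only a symbols, so the decomposition has the lossless join property.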
Example: decomposition-2
Emp_PROJ(SSN, PNUM, hours, ENAME, PNAME, PLOCATION)
F = {SSN $\rightarrow$ ENAME, PNUM $\rightarrow$ {PNAME, PLOCATION}, {SSN, PNUM} $\rightarrow$ hours}
Decomposition:
    R1(ENAME, PLOCATION)
    R2(SSN, PNUM, hours, PNAME, PLOCATION)
Not lossless join
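Decomposition-2
Matrix construction:
          A1     A2      A3     A4      A5          A6
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    b11    b12     b13    b14     b15         b16
    R2    b21    b22     b23    b24     b25         b26
Matrix initialization:
          SSN    ENAME   PNUM   PNAME   PLOCATION   hours
    R1    b11    a2      b13    b14     a5          b16
    R2    a1     b22     a3     a4      a5          a6
FDs: SSN $\rightarrow$ ENAME, PNUM $\rightarrow$ {PNAME, PLOCATION}, {SSN, PNUM} $\rightarrow$ hours
The matrix cannot be changed: no two rows agree on the left-hand side of any FD, and no row consists only of a symbols.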
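The matrix test (construction, initialization, modification) can be sketched in Python as follows. The function name `lossless_join` and the symbol encoding are assumptions: the string "a" stands for the symbol $a_j$ of a column and the pair ("b", i) for $b_{ij}$; since symbols are only ever compared within one column, the column index is left implicit.

```python
def lossless_join(attributes, decomposition, fds):
    """Matrix test for the lossless join property of a decomposition."""
    # Construction and initialization: one row per R_i, one column per A_j.
    rows = [
        {A: ("a" if A in Ri else ("b", i)) for A in attributes}
        for i, Ri in enumerate(decomposition)
    ]
    changed = True
    while changed:                       # modification: repeat until stable
        changed = False
        for lhs, rhs in fds:
            for r1 in rows:
                for r2 in rows:
                    if r1 is r2 or any(r1[A] != r2[A] for A in lhs):
                        continue
                    for B in rhs:        # rows agree on lhs: equate rhs symbols
                        if r1[B] != r2[B]:
                            old = {r1[B], r2[B]}
                            new = "a" if "a" in old else min(old)
                            for row in rows:
                                if row[B] in old:
                                    row[B] = new
                            changed = True
    # Lossless if some row consists of a symbols only.
    return any(all(v == "a" for v in row.values()) for row in rows)


ATTRS = ["SSN", "ENAME", "PNUM", "PNAME", "PLOCATION", "hours"]
F = [
    (["SSN"], ["ENAME"]),
    (["PNUM"], ["PNAME", "PLOCATION"]),
    (["SSN", "PNUM"], ["hours"]),
]
D1 = [{"SSN", "ENAME"}, {"PNUM", "PNAME", "PLOCATION"}, {"SSN", "PNUM", "hours"}]
D2 = [{"ENAME", "PLOCATION"}, {"SSN", "PNUM", "hours", "PNAME", "PLOCATION"}]

result_d1 = lossless_join(ATTRS, D1, F)
result_d2 = lossless_join(ATTRS, D2, F)
```

For decomposition-1 the row of R3 ends up with a symbols only, so `result_d1` is true. For decomposition-2 no FD can be applied and the matrix stays unchanged, so `result_d2` is false.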
Hierarchical databases
Hierarchical database schema
- hierarchical schema
- record type, PCR type
- virtual PCR: virtual child, virtual parent
Database languages
- HDDL
- HDML
ERD for Chapter 6 database example (figure): entity types employee, department, project, dependent, Dept_locations; relationship Works on; cardinality labels n, m.
Virtual Parent-child Relationships
Hierarchical schema using VPCR, for a Company database (figure).
Record types and fields as labeled: Department (Dname, Dnum), Project (Pname, …), Dlocation (Location), Demployee (EPTR), Dmanager (MPTR), Pworker (Hours, WPTR), Employee (Ename, Minit, …), Esupervisee (SPTR), Dependent (DEPname, Minit, ...); StartDate.
Record type codes: D, E, L, P, Y, M, W, S, T.