1
Introduction to Database Management
Chapter 1 Introduction to Database Management Welcome to the second edition Textbook: Database Design, Application Development and Administration Chapter 1 objectives: Describe the characteristics of business databases and the features of database management systems Appreciate the advances in database technology and the contribution of database technology to modern society Understand the impact of database management system architectures on distributed processing and software maintenance Perceive career opportunities related to database application development and database administration
2
Welcome! Database technology: crucial to the operation and management of modern organizations Major transformation in computing skills Significant time commitment Exciting journey ahead Welcome to Chapter 1 on introduction to database management Database management is crucial to the operation and management of modern organizations: - infrastructure (plumbing) for daily business operations - raw materials for long range decision making Transformation: as significant as learning computer programming and algebra Time: assignments and projects; lots of practical skills; detailed textbook Database field: - Many employment opportunities with good pay; - Challenging work (sometimes too challenging); - Very dynamic field: much new R & D
3
Book Goals First course in database management Practical textbook
Fundamentals of relational databases Data modeling and normalization Database application development Database administration and database processing environments Detailed material Textbook for first course in database management - Designed for students without a previous course in database management - Beneficial even for those with significant database experience First part of text: Overview of database management and overview of database development Second part: fundamentals of relational databases; Relational data model, SQL, basic query formulation skills; Third part of text: data modeling and conversion; developing a database; skill used by database specialist or functional user developing a database Fourth part: relational database design involving normalization and physical database design Fifth part: application development emphasis; advanced query formulation skills; data requirements for forms/reports; triggers and stored procedures; Sixth part: advanced database development with view integration (linking database design and application development); comprehensive case Seventh part of textbook: background on database administration and specialized processing areas (transaction management, data warehouses, distributed processing, object data management) Detailed material: developing skills requires lots of practice Not a theoretical textbook (does not prove theorems nor present any axioms)
4
Outline Database characteristics DBMS features Architectures
Organizational roles Chapter 1 material: general background for the entire textbook Essential characteristics of databases Features: found in most DBMSs (desktop and enterprise) Architectures: background for application development and distributed processing with databases Organizational roles: how you might be using a DBMS
5
Initial Vocabulary Data: raw facts about things and events
Information: transformed data that has value for decision making Essential to organize data for retrieval and maintenance Most organizations have a flood of data (too much data is the problem); web proliferation has greatly multiplied the amount of data Conventional facts: names, DOBs, salaries, interest rates, codes (major) Unconventional facts: images, engineering drawings, maps, product videos, fingerprints, time series (useful for forecasting), web page Distinction sometimes made between data and information: raw facts need interpretation, combination, formatting, etc. to be useful for decision making
6
Database Characteristics
Persistent Inter-related Shared Database is a generic term; collection of data Databases are ubiquitous; many encounters this week Persistent: - Lasts a long time (not transient) - Lasts longer than the execution of a computer program - Program variables are not stored in a database - Relevance of intended usage: only store potentially relevant data Inter-related: - Entity: cluster of data about a topic (customer, student, loan) - Relationship: connection among entities Shared: - Multiple uses: hundreds to thousands of data entry screens and reports - Multiple users: many people simultaneously use a database
7
University Database To depict these characteristics, let us consider a number of databases. We begin with a simple university database (Figure 1) since you have some familiarity with the workings of a university. A simplified university database contains data about students, faculty, courses, course offerings, and enrollments. The database supports procedures such as registering for classes, assigning faculty to course offerings, recording grades, and scheduling course offerings. Relationships in the university database support answers to questions such as · What offerings are available for a course in a given academic period? · Who is the instructor for an offering of a course? · What students are enrolled in an offering of a course?
8
Water Utility Database
9
Database Management System (DBMS)
Collection of components that support data acquisition, dissemination, storage, maintenance, retrieval, and formatting Enterprise DBMSs Desktop DBMSs Embedded DBMSs Major part of information technology infrastructure DBMS (Database Management System): collection of components (mostly software) Enterprise DBMS: supports mission critical information systems; very large dbs, many users, tight performance requirements Desktop DBMS: end user departments and small databases Embedded DBMS: resides in a larger system, either an application or a device such as a Personal Digital Assistant or smart card. Embedded DBMSs provide limited transaction processing features but have low memory, processing, and storage requirements. Features common to most DBMSs: database definition, non procedural access, application development, procedural language interface, transaction processing
10
Database Definition Define database structure before using a database
Tables and relationships SQL CREATE TABLE statement Graphical tools Fundamental difference to other productivity software: amount of planning before using; defined database before using Table: 2 dimensional arrangement of data; relationship: linking column among tables SQL: industry standard database language
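As a rough sketch of what database definition looks like (the Customer and CustOrder names below are made up for illustration, not from the textbook), defining a database amounts to CREATE TABLE statements, with a foreign key supplying the relationship between tables:
CREATE TABLE Customer
( CustNo   CHAR(8),
  CustName VARCHAR(50),
  CONSTRAINT PKCustomer PRIMARY KEY (CustNo) )
CREATE TABLE CustOrder
( OrderNo CHAR(8),
  OrdDate DATE,
  CustNo  CHAR(8),
  CONSTRAINT PKCustOrder PRIMARY KEY (OrderNo),
  CONSTRAINT FKCustNo FOREIGN KEY (CustNo) REFERENCES Customer )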
11
University Database Access relationship window
5 tables (student, enrollment, course, offering, faculty): faculty_1 is not a real table (details later) Relationships: lines connecting tables (faculty to offering); not all tables are directly connected Must define the tables and relationships before entering data and retrieving data
12
University Database (ERD)
University Database diagram drawn with an external tool (Visio Professional); Learn Entity Relationship Diagrams in second part of course - Entity: similar to a table - Relationship: connection among entities with names and connection symbols Can use third party tools for database definition
13
Nonprocedural Access Query: request for data to answer a question
Indicate what parts of database to retrieve not the procedural details Improve productivity and improve accessibility SQL SELECT statement and graphical tools Specify what not how Loop buster: no loops; major difference between procedural and nonprocedural language Trip planning analogy: specify features of trip (destination, quality of accommodations, dates, …) but not details (route, hotel research, flight research, …) Productivity improvement: 100 times fewer lines of code
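A small hedged example of nonprocedural access using the university Student table described later in the deck (column names taken from the Chapter 3 Student table): the query states what rows and columns are wanted, and the DBMS decides how to retrieve them.
SELECT StdFirstName, StdLastName
FROM Student
WHERE StdMajor = 'IS' AND StdGPA > 3.0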
14
Graphical Tool for Nonprocedural Access
Query Design (Access) - specify tables and columns - Access determines connections among tables
15
Application Development
Form: formatted document for data entry and display Report: formatted document for display Use nonprocedural access to specify data requirements of forms and reports Nonprocedural access by itself is not useful because of default output appearance Nonprocedural access combined with graphical tools for form and report development is very powerful Non-procedural access makes form and report creation possible without extensive coding. As part of creating a form or report, the user indicates the data requirements using a non-procedural language (SQL) or graphical tool. To complete a form or report definition, the user indicates formatting of data, user interaction, and other details.
16
Sample Data Entry Form Faculty assignment form
The form can be used to add new course assignments for a professor and to change existing assignments.
17
Sample Report The report uses indentation to show courses taught by faculty in various departments. The indentation style can be easier to view than the tabular style shown as default output style.
18
Procedural Language Interface
Combine procedural language with nonprocedural access Why Batch processing Customization and automation Performance improvement Combine external languages (COBOL, Java, C, C++, …) with SQL New DBMS-specific languages: PL/SQL (Oracle), Transact-SQL (SQL Server) Batch processing: much business processing is batch (collect loan applications and process together); online processing is becoming more prevalent because of the web; Customization: customize the behavior of a data entry form Automation: rule processing; check quantity on hand (QOH) when an order is placed Performance: more control with a procedural language
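A minimal sketch of the procedural language interface, written here in Oracle PL/SQL (the Product table and its ProdNo, ProdQOH, and ProdReorderFlag columns are made-up illustrations, not textbook tables): the IF logic is procedural, while the embedded SELECT and UPDATE are ordinary nonprocedural SQL.
DECLARE
  vQOH Product.ProdQOH%TYPE;
BEGIN
  -- nonprocedural part: retrieve the quantity on hand for one product
  SELECT ProdQOH INTO vQOH FROM Product WHERE ProdNo = 'P0001';
  -- procedural part: rule processing around the query result
  IF vQOH < 10 THEN
    UPDATE Product SET ProdReorderFlag = 'Y' WHERE ProdNo = 'P0001';
  END IF;
  COMMIT;
END;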
19
Transaction Processing
Transaction: unit of work that should be reliably processed Control simultaneous users Recover from failures Example transactions: ATM, shopping cart Major difference between enterprise and desktop DBMSs: transaction processing ability; major cost difference
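A hedged sketch of the ATM-style transfer mentioned above, assuming a hypothetical Account table with AcctNo and AcctBalance columns; the point is that both updates succeed or fail as one unit of work.
-- move 100 from account A1 to account A2 as a single transaction
UPDATE Account SET AcctBalance = AcctBalance - 100 WHERE AcctNo = 'A1';
UPDATE Account SET AcctBalance = AcctBalance + 100 WHERE AcctNo = 'A2';
COMMIT;
-- if anything fails before the COMMIT, ROLLBACK undoes the partial work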
20
Database Technology Evolution
The first generation supported sequential and random searching, but the user was required to write a computer program to obtain access. The second generation products were the first true DBMSs as they could manage multiple entity types and relationships. However, to obtain access to data, a computer program still had to be written. Second generation systems are referred to as “navigational” because the programmer had to write code to navigate among a network of linked records. Third generation systems are known as relational DBMSs because of the foundation based on mathematical relations and associated operators. Optimization technology was developed so that access using non-procedural languages would be efficient. Fourth generation systems can store and manipulate unconventional data types such as images, videos, maps, sounds, and animations. Because these systems view any kind of data as an object to manage, fourth generation systems are sometimes called “object-oriented” or “object-relational”. In addition to the emphasis on objects, the Internet is pushing DBMSs to develop new forms of distributed processing.
21
DBMS Marketplace Enterprise DBMS Desktop DBMS
Oracle: dominates in Unix; strong in Windows SQL Server: strong in Windows DB2: strong in mainframe environment Significant open source DBMSs: MySQL, Firebird, PostgreSQL Desktop DBMS Access: dominates FoxPro, Paradox, Approach, FileMaker Pro According to the International Data Corporation (IDC), sales (license and maintenance) of enterprise database software reached $13.6 billion in 2003, a 7.6 percent increase over the previous year. Enterprise DBMSs use mainframe servers running IBM’s MVS operating system and mid-range servers running Unix (Linux, Solaris, AIX, and other variations) and Microsoft Windows Server operating systems. Sales of enterprise database software have followed economic conditions with large increases during the Internet boom years followed by slow growth during the dot-com and telecom slowdowns. For future sales, IDC projects sales of enterprise DBMSs to reach $20 billion by 2008. According to IDC, three products dominate the market for enterprise database software as shown in Table 1-3. The IDC rankings include both license and maintenance revenues. When considering only license costs, the Gartner Group ranks IBM with the largest market share at 35.7%, followed by Oracle at 33.4%, and Microsoft at 17.7%. The overall market is very competitive with the major companies and smaller companies introducing many new features with each release. Open source DBMS products have begun to challenge the commercial DBMS products at the low end of the enterprise DBMS market. Although source code for open source DBMS products is available without charge, most organizations purchase support contracts so the open source products are not free. Still, many organizations have reported cost savings using open source DBMS products, mostly for non-mission-critical systems. MySQL, first introduced in 1995, is the leader in the open source DBMS market. PostgreSQL and open source Ingres are mature open source DBMS products. Firebird is a new open source product that is gaining usage.
22
Data Independence Software maintenance is a large part (50%) of information system budgets Reduce impact of changes by separating database description from applications Change database definition with minimal effect on applications that use the database Data Independence: a database should have an identity separate from the applications (computer programs, forms, and reports) that use it. The separate identity allows the database definition to be changed without affecting related applications. The close association between a database and related programs led to problems in software maintenance. Software maintenance encompassing requirement changes, corrections, and enhancements can consume a large fraction of computer budgets. In early DBMSs, most changes to the database definition caused changes to computer programs. In many cases, changes to computer programs involved detailed inspection of the code, a labor-intensive process. This code inspection work is similar to year 2000 compliance where date formats must be changed to four digits. Performance tuning of a database was difficult because sometimes hundreds of computer programs had to be recompiled for every change. Because database definition changes are common, a large fraction of software maintenance resources were devoted to database changes. Some studies have estimated the percentage as high as 50% of software maintenance resources.
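One concrete (hedged) illustration of data independence: a physical tuning change such as adding an index alters only the internal level, so applications and queries that use the Student table are unaffected (the index name below is illustrative).
-- internal-level change for performance; no application code changes
CREATE INDEX StdMajorIndex ON Student (StdMajor)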
23
Three Schema Architecture
- Schema: database description - Reference architecture for compartmentalizing database descriptions Schema levels: - Conceptual level: base tables - External level: views - Internal level: implementation details for base tables (indexes, disk extents, clustering) - Chapter 10 for physical database design Mappings: - Performed by the DBMS: relieve user of much work - External to Conceptual: submit query using a view; DBMS translates to base tables - Conceptual to Internal: SELECT statement implemented with loops, join order, index usage, … Reduce impact of changes: - Use views rather than base tables in applications - DBMS translates queries on a view to query on lower level schema
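A minimal sketch of an external-level view defined over the conceptual-level Student table; the view name echoes the HighGPAView mentioned on the next slide, while the column choice and GPA cutoff are illustrative.
CREATE VIEW HighGPAView AS
  SELECT StdFirstName, StdLastName, StdMajor, StdGPA
  FROM Student
  WHERE StdGPA > 3.5
-- applications query HighGPAView; the DBMS maps it to the base Student table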
24
Differences among Levels
External FacultyAssignmentFormView: data required for the form in Slide 16 (Figure 1.9) FacultyWorkLoadReportView: data required for the report in Slide 17 (Figure 1.10) Conceptual: tables in Slide 11 Internal Files needed to store the tables Extra files to improve performance To make the three schema levels clearer, Table 4 shows differences among database definition at the three schema levels using examples from the features described in Section 1.2. Even in a simplified university database, the differences among the schema levels are clear. With a more complex database, the differences would be even more pronounced with many more views, a much larger conceptual schema, and a more complex internal schema. The schema mappings describe how a schema at a higher level is derived from a schema at a lower level. For example, the external views in Table 3 are derived from the tables in the conceptual schema. The mapping provides the knowledge to convert a request using an external view (for example, HighGPAView) into a request using the tables in the conceptual schema. The mapping between conceptual and internal levels shows how entities are stored in files.
25
Client-Server Architecture
Client-Server Architecture: an arrangement of components (clients and servers) and data among computers connected by a network. The client-server architecture supports efficient processing of messages (requests for service) between clients and servers. To improve performance and availability of data, the client-server architecture supports many ways to distribute software and data in a computer network. The simplest scheme is just to place both software and data on the same computer (Figure 13(a)). To take advantage of a network, both software and data can be distributed. In Figure 13(b), the server software and database are located on a remote computer. In Figure 13(c), the server software and database are located on multiple remote computers.
26
Organizational Roles Because databases are pervasive, there are a variety of ways in which you may interact with databases. The classification in Figure 14 distinguishes between functional users who interact with databases as part of their work and information systems professionals who participate in designing and implementing databases. Each box in the hierarchy represents a role that you may play. You may simultaneously play more than one role. For example, a functional user in a job such as financial analysis may play all three roles in different databases. In some organizations, the distinction between functional users and information systems professionals is blurred. In these organizations, functional users may participate in designing and using databases. Functional users can play a passive or an active role when interacting with databases. Indirect usage of a database is a passive role. An indirect user is given a report or some data extracted from a database. A parametric user is more active than an indirect user. A parametric user requests existing forms or reports using parameters, input values that change from usage to usage. For example, a parameter may indicate a date range, sales territory, or department name. The power user is the most active. Because decision making needs can be difficult to predict, ad hoc or unplanned usage of a database is important. A power user is skilled enough to build a form or report when needed. Power users should have a good understanding of non-procedural access, a skill described in the first part of this book.
27
Database Specialists Database administrator (DBA) Data administrator
More technical DBMS specific skills Data administrator Less technical Planning role DBA: - focused on individual databases and DBMSs - Need strong skills in specific DBMSs Data administrator - Planning: databases and technology - Standards setting - Computerized and non-computerized databases Large organizations have separate positions; small organizations combine roles Both positions require more than 1 db course - 2nd course - certification - lots of experience - management experience for data administration
28
Summary Databases and database technology vital to modern organizations Database technology supports daily operations and decision making Nonprocedural access is a crucial feature Many opportunities to work with databases Working with databases: can be lucrative but very demanding First part of textbook: fundamentals of relational databases Other chapters in Part 1: Chapter 2: overview of the database development process
29
The Relational Data Model
Chapter 3 The Relational Data Model Welcome to Chapter 3 covering the Relational Data Model Careful study of the relational data model Goal of chapter: Understand existing databases so that you can write queries Recognize relational database terminology Understand the meaning of the integrity rules for relational databases Understand the impact of referenced rows on maintaining relational databases Understand the meaning of each relational algebra operator List tables that must be combined to obtain desired results for simple retrieval requests Relational databases are the dominant commercial standard - Simplicity and familiarity with table manipulation - Strong mathematical framework - Lots of research and development
30
Outline Relational model basics Integrity rules
Rules about referenced rows Relational Algebra Relational model basics: - Tables - Columns and data types - Matching values - Alternative terminology - SQL CREATE TABLE statement Integrity rules: primary and foreign keys Referenced rows: actions when referenced rows are modified Relational algebra - Cover simple operators - Provide separate slide shows for join, outer join, and division operators - May want to mix relational algebra coverage with SQL
31
Tables Relational database is a collection of tables
Heading: table name and column names Body: rows, occurrences of data Student Partial Student table: - 5 columns - 3 rows - Real student table: 10 to 50 columns; thousands of rows Convention: - Table names begin with uppercase - Mixed case for column names - First part of column name is an abbreviation for the table name - Upper case for data
32
CREATE TABLE Statement
CREATE TABLE Student
( StdSSN       CHAR(11),
  StdFirstName VARCHAR(50),
  StdLastName  VARCHAR(50),
  StdCity      VARCHAR(50),
  StdState     CHAR(2),
  StdZip       CHAR(10),
  StdMajor     CHAR(6),
  StdClass     CHAR(6),
  StdGPA       DECIMAL(3,2) )
Define table name, column names, and column data types Other clauses added later in the lecture Data type: - Set of values - Permissible values - Vary by DBMS - CHAR: fixed length character strings - VARCHAR: variable length character strings - DECIMAL: fixed precision numbers - Table 3-2 lists common data types
33
Common Data Types CHAR(L) VARCHAR(L) INTEGER FLOAT(P)
Date/Time: DATE, TIME, TIMESTAMP DECIMAL(W, R) BOOLEAN CHAR: fixed length character strings VARCHAR: variable length character strings Date/Time: SQL standard provides 3 data types; most DBMSs only support one data type; data type name is not standard across DBMSs
34
Relationships Shown by matching values
- First Student row ( ) related to 1st and 3rd rows of Enrollment table - First Offering row (1234) related to 1st two rows of Enrollment table Combine tables using matching values Relational databases can have many tables (hundreds) Follow matching values to combine tables: - Combine Student and Enrollment where StdSSN matches - Join operation
35
Alternative Terminology
Table-oriented / Set-oriented / Record-oriented terminology:
- Table = Relation = Record-type, file
- Row = Tuple = Record
- Column = Attribute = Field
Table-oriented: familiar Set-oriented: mathematical Record-oriented: IS staff Terminology is often mixed: table, record, field
36
Integrity Rules Entity integrity: primary keys
Each table has column(s) with unique values Ensures entities are traceable Referential integrity: foreign keys Values of a column in one table match values in a source table Ensures valid references among tables Informal definitions Examples: - Student rows are uniquely identified by StdSSN - Offering rows are uniquely identified by OfferNo - Enrollment rows are uniquely identified by the combination of StdSSN and OfferNo - Enrollment.StdSSN refers to a valid StdSSN value in the Student table - Enrollment.OfferNo refers to a valid OfferNo in the Offering table
37
Formal Definitions I Superkey: column(s) with unique values
Candidate key: minimal superkey Null value: special value meaning value unknown or inapplicable Primary key: a designated candidate key; cannot contain null values Foreign key: column(s) whose values must match the values in a candidate key of another table Prerequisite definitions Superkey: concept of uniqueness; the set of all columns is always a superkey Candidate key: unique without extra columns Null value: - Just moved: do not know phone number (value is unknown) - Not married: do not have a maiden name (value is inapplicable) Primary key: - should be stable; names even if unique can change - No null values Foreign keys: - linking columns - Usually match to primary keys, not to candidate keys that are not primary keys
38
Formal Definitions II Entity integrity Referential integrity
No two rows with the same primary key value No null values in any part of a primary key Referential integrity Foreign keys must match candidate key of source table Foreign keys can be null in some cases In SQL, foreign keys associated with primary keys Entity integrity rule: each table must have a primary key Referential integrity: foreign keys are valid references except when null
39
Course Table Example
CREATE TABLE Course
( CourseNo CHAR(6),
  CrsDesc  VARCHAR(250),
  CrsUnits SMALLINT,
  CONSTRAINT PKCourse PRIMARY KEY (CourseNo),
  CONSTRAINT UniqueCrsDesc UNIQUE (CrsDesc) )
Extended CREATE TABLE statement Primary key: CourseNo Candidate key: CrsDesc (course description) Named constraints: easier to reference; PKCourse, UniqueCrsDesc
40
Enrollment Table Example
CREATE TABLE Enrollment
( OfferNo  INTEGER,
  StdSSN   CHAR(11),
  EnrGrade DECIMAL(3,2),
  CONSTRAINT PKEnrollment PRIMARY KEY (OfferNo, StdSSN),
  CONSTRAINT FKOfferNo FOREIGN KEY (OfferNo) REFERENCES Offering,
  CONSTRAINT FKStdSSN FOREIGN KEY (StdSSN) REFERENCES Student )
Primary key: - combination of OfferNo and StdSSN - combined PK (or composite PK) Foreign key constraints: - OfferNo references Offering - StdSSN references Student
41
Offering Table Example
CREATE TABLE Offering
( OfferNo     INTEGER,
  CourseNo    CHAR(6) CONSTRAINT OffCourseNoRequired NOT NULL,
  OffLocation VARCHAR(50),
  OffDays     CHAR(6),
  OffTerm     CHAR(6) CONSTRAINT OffTermRequired NOT NULL,
  OffYear     INTEGER CONSTRAINT OffYearRequired NOT NULL,
  FacSSN      CHAR(11),
  OffTime     DATE,
  CONSTRAINT PKOffering PRIMARY KEY (OfferNo),
  CONSTRAINT FKCourseNo FOREIGN KEY (CourseNo) REFERENCES Course,
  CONSTRAINT FKFacSSN FOREIGN KEY (FacSSN) REFERENCES Faculty )
NOT NULL keywords Should use constraint names even for inline constraints Inline constraints associated with a specific column Easy to trace error when a constraint violation occurs Two foreign keys: - CourseNo: nulls not allowed - FacSSN: nulls allowed; prepare catalog before instructors are assigned; permits flexibility
42
Self-Referencing Relationships
Foreign key that references the same table Represents relationships among members of the same set Not common but important in specialized situations Common self-referencing relationships: - Organization chart: manages relationship - Genealogy chart: ancestor-descendant - Courses: prerequisites Specialized relationship: - Not common - Important when occurring
43
Faculty Data Omitted a few columns for brevity FacSupervisor:
- Represents the SSN of the supervising faculty - Null allowed because the top boss does not have a supervisor - Two top bosses (two professors)
44
Hierarchical Data Display
Partial hierarchical arrangement of faculty data Victoria Emmanual has no boss (null value for FacSupervisor column)
45
Faculty Table Definition
CREATE TABLE Faculty
( FacSSN        CHAR(11),
  FacFirstName  VARCHAR(50) NOT NULL,
  FacLastName   VARCHAR(50) NOT NULL,
  FacCity       VARCHAR(50) NOT NULL,
  FacState      CHAR(2) NOT NULL,
  FacZipCode    CHAR(10) NOT NULL,
  FacHireDate   DATE,
  FacDept       CHAR(6),
  FacSupervisor CHAR(11),
  CONSTRAINT PKFaculty PRIMARY KEY (FacSSN),
  CONSTRAINT FKFacSupervisor FOREIGN KEY (FacSupervisor) REFERENCES Faculty )
Omitted a few columns for brevity Omitted named inline constraints for brevity FacSupervisor: - Represents the SSN of the supervising faculty - Null allowed because the top boss does not have a supervisor
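To show how the self-referencing relationship is used, here is a hedged query sketch (a self-join; joins are covered later in the chapter) that lists each faculty member with the name of his or her supervisor; the Sub and Sup aliases are illustrative.
SELECT Sub.FacFirstName, Sub.FacLastName,
       Sup.FacFirstName AS SupervisorFirstName, Sup.FacLastName AS SupervisorLastName
FROM Faculty Sub, Faculty Sup
WHERE Sub.FacSupervisor = Sup.FacSSN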
46
Relationship Window with 1-M Relationships
Visual representation is easier to comprehend than CREATE TABLE statements 1 and ∞ (infinity) symbols: - 1-M relationships - Student can have many enrollments - Student is the parent (1) table - Enrollment is the child (M) table - Foreign key is shown near the ∞ symbol Meaning of the Faculty_1 table - Access representation for a self referencing relationship - Faculty_1 is not a real table (placeholder for self referencing relationship)
47
M-N Relationships Rows of each table are related to multiple rows of the other table Not directly represented in the relational model Use two 1-M relationships and an associative table Example: - Student and Offering tables - Student can take many offerings - Offering can have many enrolled students - Enrollment table and 1-M relationships represent this M-N relationship
48
Referenced Rows Referenced row Actions on referenced rows
Foreign keys reference rows in the associated primary key table Enrollment rows refer to Student and Offering Actions on referenced rows Delete a referenced row Change the primary key of a referenced row Referential integrity should not be violated Referenced row: has rows in associated foreign key tables that reference it Actions: - Delete a referenced row - Change the PK of a referenced row - Must maintain referential integrity; both events could invalidate referential integrity
49
Possible Actions Restrict: do not permit action on the referenced row
Cascade: perform action on related rows Nullify: only valid if foreign keys accept null values Default: set foreign keys to a default value Restrict: do not allow action on the referenced row - Most conservative (and common) approach - Related foreign key rows must be deleted (or updated) before the referenced primary key row can be deleted (or its key changed) - Update: awkward; insert a new PK row, update the foreign key rows, delete the old PK row Cascade: - Use carefully: can cause changes to many rows - Automation: only specify action on the referenced row - Use for closely related tables (deleting a PK row always results in deletion of related rows); Order – OrderLine tables Nullify: - a reasonable option for relationships that accept null values - do not forget to update the null values later Default: - an alternative to nullify; use TBA as the default instructor - do not delete the default row
50
SQL Syntax for Actions
CREATE TABLE Enrollment
( OfferNo  INTEGER NOT NULL,
  StdSSN   CHAR(11) NOT NULL,
  EnrGrade DECIMAL(3,2),
  CONSTRAINT PKEnrollment PRIMARY KEY (OfferNo, StdSSN),
  CONSTRAINT FKOfferNo FOREIGN KEY (OfferNo) REFERENCES Offering
    ON DELETE RESTRICT
    ON UPDATE CASCADE,
  CONSTRAINT FKStdSSN FOREIGN KEY (StdSSN) REFERENCES Student
    ON UPDATE CASCADE )
NO ACTION means restrict; Most DBMSs do not allow all options: - Access permits restrict (default) and cascade - Oracle does not have the ON UPDATE clause - Oracle only permits CASCADE for the ON DELETE clause; default is restrict
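For the nullify and default actions on the previous slide, the SQL standard provides SET NULL and SET DEFAULT; a hedged sketch for the Offering-Faculty relationship using an ALTER TABLE statement (support and exact syntax vary by DBMS, as noted above).
-- nullify: set Offering.FacSSN to NULL when the referenced Faculty row is deleted
ALTER TABLE Offering
  ADD CONSTRAINT FKFacSSN FOREIGN KEY (FacSSN) REFERENCES Faculty
    ON DELETE SET NULL
-- a SET DEFAULT action would instead require a default value (such as a 'TBA' instructor row)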
51
Relational Algebra Overview
Collection of table operators Transform one or two tables into a new table Understand operators in isolation Classification Table specific operators Traditional set operators Advanced operators You can think of relational algebra similarly to the algebra of numbers except that the objects are different: algebra applies to numbers and relational algebra applies to tables. In algebra, each operator transforms one or more numbers into another number. Similarly, each operator of relational algebra transforms a table (or two tables) into a new table. This section emphasizes the study of each relational algebra operator in isolation. For each operator, you should understand its purpose and inputs. While it is possible to combine operators to make complicated formulas, this level of understanding is not important for developing query formulation skills. Using relational algebra by itself to write queries can be awkward because of details such as ordering of operations and parentheses. Therefore, you should seek only to understand the meaning of each operator, not how to combine operators to write expressions. Table specific: restrict, project, join, outer join, cross product Traditional set: union, intersection, difference Advanced (specialized): summarize, division See Chapter03FiguresTables for extended operator examples
52
Subset Operators Simple and widely used operators
Restrict: an operator that retrieves a subset of the rows of the input table that satisfy a given condition; also known as select Project: an operator that retrieves a specified subset of the columns of the input table.
53
Subset Operator Notes Restrict Project Often used together
Logical expression as input Example: OffDays = 'MW' AND OffTerm = 'SPRING' AND OffYear = 2006 Project List of columns is input Duplicate rows eliminated if present Often used together The logical expression used in the restrict operator can include comparisons involving columns and constants. Complex logical expressions can be formed using the logical operators AND, OR, and NOT. A project operation can have a side effect. Sometimes after a subset of columns is retrieved, there are duplicate rows. When this occurs, the project operator removes the duplicate rows. For example, if Offering.CourseNo is the only column used in a project operation, only three rows are in the result (Table 3-9) even though the Offering table (Table 3-4) has nine rows. The column Offering.CourseNo contains only three unique values in Table 3-4. Note that if the primary key or a candidate key is included in the list of columns, the resulting table has no duplicates. For example, if OfferNo was included in the list of columns, the result table would have nine rows with no duplicate removal necessary.
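The restrict and project operators map directly to clauses of the SQL SELECT statement studied in Chapter 4; a small sketch combining the Offering condition and column list from this slide:
-- restrict: the WHERE clause keeps a subset of rows
-- project: the column list keeps a subset of columns (DISTINCT removes duplicates)
SELECT DISTINCT CourseNo
FROM Offering
WHERE OffDays = 'MW' AND OffTerm = 'SPRING' AND OffYear = 2006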
54
Extended Cross Product
Building block for join operator Builds a table consisting of all combinations of rows from each of the two input tables Produces excessive data Subset of cross product is useful (join) Extended Cross Product: an operator that builds a table consisting of all combinations of rows from each of the two input tables. The extended cross product operator can combine any two tables. Other table combining operators have conditions about the tables to combine. Because of its unrestricted nature, the extended cross product operator can produce tables with excessive data. The extended cross product operator is important because it is a building block for the join operator. When you initially learn the join operator, knowledge of the extended cross product operator can be useful. After you gain experience with the join operator, you will not need to rely on the extended cross product operator.
55
Extended Cross Product Example
The extended cross product (product for short) operator shows everything possible from two tables. The product of two tables is a new table consisting of all possible combinations of rows from the two input tables. Figure 4 depicts a product of two single column tables. Each result row consists of the columns of the Faculty table (only FacSSN) and the columns of the Student table (only StdSSN). The name of the operator (product) derives from the number of rows in the result. The number of rows in the resulting table is the product of the number of rows of the two input tables. In contrast, the number of result columns is the sum of the columns of the two input tables. In Figure 4, the result table has nine rows and two columns. [1] The extended cross product operator is also known as the “Cartesian” product after French mathematician Rene Descartes.
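In SQL, listing two tables in the FROM clause with no join condition produces the extended cross product; a hedged sketch of the Faculty-Student product described above:
-- every Faculty row paired with every Student row (3 x 3 = 9 rows in the example)
SELECT FacSSN, StdSSN
FROM Faculty, Student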
56
Join Operator Most databases have many tables
Combine tables using the join operator Specify matching condition Can be any comparison but usually = PK = FK most common join condition Relationship diagram useful when combining tables Most joins follow relationship diagram - PK-FK comparisons - What tables can be combined directly versus indirectly
57
Natural Join Operator Most common join operator Requirements
Equality matching condition Matching columns with the same unqualified names Remove one join column in the result Usually performed on PK-FK join columns
58
Natural Join Example Work with small tables:
- Useful for understanding the join operation - Useful for difficult problems Join condition: Faculty.FacSSN = Offering.FacSSN Matching rows: - First Faculty row with row 1 and row 3 of Offering - Second Faculty row with row 2 of Offering Join can be applied to multiple tables: - Join two tables - Join a third table to the result of the first two tables - Join Faculty to Offering - Join the result to Course Natural join: - Same unqualified column names (names without table names) - Equality - Discard one of the join columns (arbitrary for now which join column is discarded) - Most popular variation of the join
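The same Faculty-Offering join written in SQL (the join operator style shown again in Chapter 4); the column list is illustrative.
SELECT Offering.OfferNo, Offering.CourseNo, Faculty.FacFirstName, Faculty.FacLastName
FROM Faculty INNER JOIN Offering
  ON Faculty.FacSSN = Offering.FacSSN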
59
Visual Formulation of Join
Microsoft Access Query Design tool Similar tools in other DBMSs To form this join, you need only to select the tables. Access determines that you should join over the StdSSN column. Access assumes that most joins involve a primary key and foreign key combination. If Access chooses the join condition incorrectly, you can choose other join columns.
60
Outer Join Overview Join excludes non matching rows
Preserving non matching rows is important in some business situations Outer join variations Full outer join One-sided outer join Importance of preserving non matching rows: - Offerings without assigned faculty - Orders without sales associates Outer join variations: - Full: preserves non matching rows of both tables - One-sided: preserves non matching rows of the designated table - One-sided outer join is more common
61
Outer Join Operators Full outer join Left Outer Join Right Outer Join
Outer join matching: - join columns, not all columns as in traditional set operators - One-sided outer join: preserving non matching rows of a designated table (left or right) - Full outer join: preserving non matching rows of both tables - See outer join animation for interactive demonstration Unmatched rows of the left table Matched rows using the join condition Unmatched rows of the right table
62
Full Outer Join Example
Outer join result: - Join part: rows 1 – 3 - Outer join part: non matching rows (rows 4 and 5) - Null values in the non matching rows: columns from the other table One-sided outer join: - Preserve non matching rows of the designated table - Preserve the Faculty table in the result: first four rows - Preserve the Offering table: first three rows and fifth row
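A hedged SQL sketch of the one-sided outer join that preserves the Offering table; whether the keyword is LEFT or RIGHT depends on the order in which the tables are listed.
-- non-matching Offering rows are preserved; their Faculty columns are null
SELECT Offering.OfferNo, Offering.CourseNo, Faculty.FacFirstName, Faculty.FacLastName
FROM Offering LEFT JOIN Faculty
  ON Offering.FacSSN = Faculty.FacSSN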
63
Visual Formulation of Outer Join
Microsoft Access Query Design tool Similar tools in other DBMSs The slide depicts a one-sided outer join that preserves the rows of the Offering. The arrow from Offering to Faculty means that the nonmatched rows of Offering are preserved in the result. When combining the Faculty and Offering tables, Microsoft Access provides three choices: (1) show only the matched rows (a join); (2) show matched rows and nonmatched rows of Faculty; and (3) show matched rows and nonmatched rows of Offering. Choice (3) is shown in this slide. Choice (1) would appear similar to slide 31. Choice (2) would have the arrow from Faculty to Offering.
64
Traditional Set Operators
A UNION B, A INTERSECT B, A MINUS B Rows of a table are the analog of members of a set - Union: rows in either table - Intersection: rows common to both tables - Difference: rows in one table but not in the other table Usage: - More limited compared to join, restrict, project - Combine geographically dispersed tables (student tables from different branch campuses) - Difference operator: complex matching problems such as to find faculty not teaching courses in a given semester; Chapter 9 presentation
65
Union Compatibility Requirement for the traditional set operators
Strong requirement Same number of columns Each corresponding column is compatible Positional correspondence Apply to similar tables by removing columns first How are rows compared? - Join: compares rows on the join column(s) - Traditional set operators compare on all columns Strong requirement: - Usually on identical tables (geographically dispersed tables) - Compatible columns: data types are comparable (numbers cannot be compared to strings) - Positional: 1st column of table A to 1st column of table B, 2nd column etc Can be applied to similar tables (faculty and student) by removing columns before traditional set operator
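A minimal sketch of a union over union-compatible inputs, projecting the same columns from student tables at two hypothetical branch campuses (the table names Student1 and Student2 are illustrative):
SELECT StdSSN, StdLastName, StdCity FROM Student1
UNION
SELECT StdSSN, StdLastName, StdCity FROM Student2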
66
Summarize Operator Decision-making operator
Compresses groups of rows into calculated values Simple statistical (aggregate) functions Not part of original relational algebra Summarize: an operator that produces a table with rows that summarize the rows of the input table. Aggregate functions are used to summarize the rows of the input table. Summarize is a powerful operator for decision making. Because tables can contain many rows, it is often useful to see statistics about groups of rows rather than individual rows. The summarize operator allows groups of rows to be compressed or summarized by a calculated value. Almost any kind of statistical function can be used to summarize groups of rows. Because this is not a statistics book, we will use only simple functions such as count, min, max, average, and sum.
67
Summarize Example The summarize operator compresses a table by replacing groups of rows with individual rows containing calculated values. A statistical or aggregate function is used for the calculated values. The slide depicts a summarize operation for a sample enrollment table. The input table is grouped on the StdSSN column. Each group of rows is replaced by the average of the grade column. Relational algebra syntax is not important: study SQL syntax in Chapter 3
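The same summarize operation written with the SQL GROUP BY clause studied in Chapter 4 (the AvgGrade alias is illustrative):
-- one result row per student containing the average of the grade column
SELECT StdSSN, AVG(EnrGrade) AS AvgGrade
FROM Enrollment
GROUP BY StdSSN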
68
Divide Operator Match on a subset of values Specialized operator
Suppliers who supply all parts Faculty who teach every IS course Specialized operator Typically applied to associative tables representing M-N relationships Subset matching: - Use of every or all connecting different parts of a sentence - Use any or some: join problem - Specialized matching but important when necessary - Conceptually difficult Table structures: - Typically applied to associative tables such as Enrollment, Supp-Part, StdClub - Can also be applied to M tables in a 1-M relationship (Offering table)
69
Division Example Table structure:
- SuppPart: associative table between Part and Supp tables - List suppliers who supply every part Formulation: - See Division animation for interactive presentation - Sort SuppPart table by SuppNo - Choose Suppliers that are associated with every part - Set of parts for a supplier contains the set of all parts - S3 associated with P1, P2, and P3 - Must look at all rows with S3 to decide whether S3 is in the result
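Division has no single SQL operator; one common formulation, sketched here under the assumption of tables SuppPart(SuppNo, PartNo) and Part(PartNo) as in the example, counts the distinct parts per supplier and compares that count with the total number of parts (support for COUNT(DISTINCT ...) varies by DBMS):
SELECT SuppNo
FROM SuppPart
GROUP BY SuppNo
HAVING COUNT(DISTINCT PartNo) = (SELECT COUNT(*) FROM Part)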
70
Relational Algebra Summary
71
Summary Relational model is commercially dominant
Learn primary keys, data types, and foreign keys Visualize relationships Understanding existing databases is crucial to query formulation Commercial dominance: - Simple and familiar - Theoretically sound - Lots of R&D - SQL standard Understand a database is a prerequisite to query formulation - How are rows identified? PKs and CKs - What data can be compared? Data type knowledge - How can tables be combined? Foreign keys and relationship details (1-M, M-N, self-referencing) - Visualization: show the direct and indirect connections among tables
72
Query Formulation with SQL
Chapter 4 Query Formulation with SQL Welcome to Chapter 4 on query formulation with SQL Query formulation is an important skill in application development Everyone involved in application development must be competent in query formulation Most students will be involved (at least initially) in application development rather than in a role as a database specialist. Database specialists must also understand query formulation and SQL. Objectives: - Query formulation: problem statement into a database representation - SELECT statement: syntax and patterns for subset operations, joins, summarization, traditional set operators, and data manipulation operations - Write English descriptions to document SQL statements - Need lots of practice with query formulation and SQL
73
Outline Background Getting started Joining tables Summarizing tables
Problem solving guidelines Advanced problems Data manipulation statements Background: - SQL history - SQL usage contexts Getting started: - SQL syntax - Single table problems - Grouping problems Join styles: - Cross product style - Join operator style - Joins and grouping Problem solving guidelines: - Conceptual process - Critical questions Traditional set operators: - Union, intersection, difference - Union compatibility: requirement for using the operators - Data manipulation statements: INSERT, UPDATE, DELETE Include slides on join and traditional set operators: use these if covering relational algebra as part of SQL
74
What is SQL? Structured Query Language
Language for database definition, manipulation, and control International standard Standalone and embedded usage Intergalactic database speak Pronunciation: sequel due to its original name Comprehensive database language - Database definition: CREATE TABLE - Manipulation: retrieval and modification of rows - Control: integrity and security constraints Standards: - American National Standards Institute (ANSI), International Standards Organization (ISO) - SQL-86: 1986 (1989 revision to SQL-89) - SQL-92: 1992 - SQL:1999 - SQL:2003: the current standard at the time of this edition Usage contexts: - Write and execute statements using a specialized editor (standalone) - Embed statements into a procedural language (embedded) Michael Stonebraker called SQL “intergalactic database speak”
75
SQL Statements (statement: chapters where covered)
- CREATE TABLE: 3, 18
- SELECT: 3, 9, 10
- INSERT, UPDATE, DELETE: 3, 10
- CREATE VIEW: 10
- CREATE TRIGGER: 11
- GRANT, REVOKE: 14
- COMMIT, ROLLBACK: 15
- CREATE TYPE: 18
Categories: - Definition: CREATE TABLE, ALTER TABLE, CREATE VIEW - Manipulation: SELECT, INSERT, UPDATE, DELETE, COMMIT, ROLLBACK - Control: GRANT, REVOKE, CREATE ASSERTION - Other statements: SET (Chapter 15) - Oracle specific statements: CREATE MATERIALIZED VIEW (16), CREATE DIMENSION (16)
76
SQL Standardization Relatively simple standard: SQL-86 and revision (SQL-89) Modestly complex standard: SQL-92 Complex standards: SQL:1999 and SQL:2003 The size and scope of the SQL standard has increased significantly since the first standard was adopted. The original standard (SQL-86) contained about 150 pages, while the SQL-92 standard contained more than 600 pages. In contrast, the most recent standards (SQL:1999 and SQL:2003) contained more than 2,000 pages. The early standards (SQL-86 and SQL-89) had two levels (entry and full). SQL-92 added a third level (entry, intermediate, and full). The SQL:1999 and SQL:2003 standards contain a single level called Core SQL along with additional parts and packages for noncore features. SQL:2003 contains three core parts, six optional parts, and seven optional packages.
77
SQL Conformance No official conformance testing
Vendor claims about conformance Reasonable conformance on Core SQL Large variance on conformance outside of Core SQL Difficult to write portable SQL code outside of Core SQL The weakness of the SQL standards is the lack of conformance testing. Until 1996, the U.S. Department of Commerce’s National Institute of Standards and Technology conducted conformance tests to provide assurance that government software can be ported among conforming DBMSs. Since 1996, however, DBMS vendor claims have substituted for independent conformance testing. Even for Core SQL, the major vendors lack support for some features and provide proprietary support for other features. With the optional parts and packages, conformance has much greater variance. Writing portable SQL code requires careful study for Core SQL but is not possible for advanced parts of SQL.
78
SELECT Statement Overview
SELECT <list of column expressions>
FROM <list of tables and join operations>
WHERE <list of logical expressions for rows>
GROUP BY <list of grouping columns>
HAVING <list of logical expressions for groups>
ORDER BY <list of sorting specifications>
Expression: combination of columns, constants, operators, and functions Conventions: - Upper case: keywords - Angle brackets: supply data Expression examples: - StdFirstName: student first name - FacSalary * 1.1 : inflate salary by 10% Logical expression: - T/F value - AND, OR, NOT - Logical expressions can be rather complex (nested queries); will not discuss complexities until Unit 4 Rows vs. Groups: distinction will be made clear as lecture proceeds Show examples in Access SQL and Oracle SQL - Some important differences - Most vendors implement a super/subset of SQL92 - Vendors now support some parts of SQL:2003: Chapter 18
79
University Database Use university database for examples
Use some new examples not found in the textbook You should work the order entry examples: expect 40 to 60 problems to become proficient Both databases are available on the textbook’s web site.
80
First SELECT Examples
Example 1
SELECT * FROM Faculty
Example 2 (Access)
SELECT * FROM Faculty WHERE FacSSN = ' '
Example 3
SELECT FacFirstName, FacLastName, FacSalary
Example 4
WHERE FacSalary > AND FacRank = 'PROF'
Example 1: - Retrieves all rows and columns - * in the SELECT clause evaluates to all columns of the FROM tables Example 2: - Retrieves a single faculty row (subset of rows) - Relational algebra: restrict operation (row subset) - Oracle: use hyphens in FacSSN constant ( ) Example 3: - Retrieves a subset of columns - Relational algebra: subset of columns Example 4: - Retrieves a subset of rows and columns - Sequence of restrict and project operations
81
Using Expressions Example 5 (Access)
SELECT FacFirstName, FacLastName, FacCity, FacSalary*1.1 AS IncreasedSalary, FacHireDate FROM Faculty WHERE year(FacHireDate) > 1996 Example 5 (Oracle) WHERE to_number(to_char(FacHireDate, 'YYYY')) > 1996 Example 5: - Retrieves faculty hired after 1996 - Inflates salary by 10% Different functions by DBMS: need to carefully study documentation
82
Inexact Matching Match against a pattern: LIKE operator
Use meta characters to specify patterns Wildcard (* or %) Any single character (? or _) Example 6 (Access) SELECT * FROM Offering WHERE CourseNo LIKE 'IS*' Example 6 (Oracle) WHERE CourseNo LIKE 'IS%' Common patterns: - Strings with specified endings - Strings with specified beginnings - Strings containing a substring Meta characters: - Special meaning when using the LIKE operator - Many others available: study DBMS documentation Example 6: - Retrieves offerings of IS course numbers - Access supports SQL standard meta characters (% and _) in SQL-92 query mode
83
Using Dates Dates are numbers
Date constants and functions are not standard Example 7 (Access) SELECT FacFirstName, FacLastName, FacHireDate FROM Faculty WHERE FacHireDate BETWEEN #1/1/1999# AND #12/31/2000# Example 7 (Oracle) WHERE FacHireDate BETWEEN '1-Jan-1999' AND '31-Dec-2000' Date manipulation: - Not strings: do not use pattern matching characters even though some DBMSs permit (not portable) - Study documentation carefully for date functions and constant formats BETWEEN-AND operator: - Closed interval (includes end points) - Short cut for >= AND <= - No shortcuts for other intervals
84
Other Single Table Examples
Example 8: Testing for null values
SELECT OfferNo, CourseNo
FROM Offering
WHERE FacSSN IS NULL AND OffTerm = 'SUMMER' AND OffYear = 2006
Example 9: Mixing AND and OR
SELECT OfferNo, CourseNo, FacSSN
WHERE (OffTerm = 'FALL' AND OffYear = 2005) OR (OffTerm = 'WINTER' AND OffYear = 2006)
Example 8: - Retrieve summer 2006 offerings without an assigned instructor - Use IS NULL to test for null values Example 9: - Retrieve offerings in Fall 2005 or Winter 2006 - Always use parentheses when mixing AND and OR - Reader may not know default evaluation - Easy to make a mistake - May not be portable if parentheses are not used
85
Join Operator Most databases have many tables
Combine tables using the join operator Specify matching condition Can be any comparison but usually = PK = FK most common join condition Relationship diagram useful when combining tables Chapter 3 material: can present again for review Can instead present material in Chapter 4 and skip when initially covering chapter 3 Most joins follow relationship diagram - PK-FK comparisons - What tables can be combined directly versus indirectly
86
Join Example Work with small tables: Chapter 3 material; present again for review if necessary - Can instead present material in Chapter 4 and skip when initially covering chapter 3 - Useful for understanding the join operation - Useful for difficult problems Join condition: Faculty.FacSSN = Offering.FacSSN Matching rows: - First Faculty row with row 1 and row 3 of Offering - Second Faculty row with row 2 of Offering Join can be applied to multiple tables: - Join two tables - Join a third table to the result of the first two tables - Join Faculty to Offering - Join the result to Course Natural join: - Same unqualified column names (names without table names) - Equality - Discard one of the join columns (arbitrary for now which join column is discarded) - Most popular variation of the join
87
Cross Product Style List tables in the FROM clause
List join conditions in the WHERE clause Example 10 (Access)
SELECT OfferNo, CourseNo, FacFirstName, FacLastName
FROM Offering, Faculty
WHERE OffTerm = 'FALL' AND OffYear = 2005 AND FacRank = 'ASST'
  AND CourseNo LIKE 'IS*' AND Faculty.FacSSN = Offering.FacSSN
Meaning: details of offerings and assigned faculty for fall 2005 IS courses taught by assistant professors Cross Product Style: - Name comes from derivation of the join operator - Join is equivalent to a cross product followed by a selection (retain just the matching rows) Extension for multiple tables: - Add tables to the FROM clause - Add join conditions to the WHERE clause The order of tables and join conditions does not matter Oracle version: use % instead of *
88
Join Operator Style Use INNER JOIN and ON keywords
FROM clause contains join operations Example 11 (Access)
SELECT OfferNo, CourseNo, FacFirstName, FacLastName
FROM Offering INNER JOIN Faculty
  ON Faculty.FacSSN = Offering.FacSSN
WHERE OffTerm = 'FALL' AND OffYear = 2005 AND FacRank = 'ASST' AND CourseNo LIKE 'IS*'
Join Operator Style: - SQL:1999, SQL:2003, Oracle 9i, 10g, and Access - Oracle 8i does not support it Extension for multiple tables: - Need to use parentheses in Access - Conceptually no need for parentheses because join is associative (order of operations does not matter) - Nested parentheses are difficult to read - Harder to find the tables in the statement
89
Name Qualification Ambiguous column reference
More than one table in the query contains a column referenced in the query Ambiguity determined by the query not the database Use column name alone if query is not ambiguous Qualify with table name if query is ambiguous Readability versus writability Ambiguous: - More than one table in the query contains a column referenced in the query - Ambiguity determined by the query not the database Examples 10 and 11: - Offering and Faculty tables - Reference to FacSSN is ambiguous unless qualified - Reference to CourseNo is not ambiguous: Course table is not in the query Readability: - qualified names are easier to read (no context to imply) Writability: - Unqualified names require fewer keystrokes (less work) Column naming convention: - Table name abbreviation: Std for Student - Column names easy to associate with tables without qualification
90
Summarizing Tables Row summaries important for decision-making tasks
Row summary Result contains statistical (aggregate) functions Conditions involve statistical functions SQL keywords Aggregate functions in the output list GROUP BY: summary columns HAVING: summary conditions Row summary: compress multiple rows into a single row Row details are important for operational decision-making (resolving a customer complaint, finding lost shipment, …) Row summaries are important for tactical and strategic decision-making (remove details) Problem involves row summaries: - Result contains aggregate functions: count of students enrolled, average salary, sum of the credit hours - Conditions involve aggregate functions: number of students enrolled less than 10 SQL features for summarizing tables: - Aggregate functions in output list - Standard aggregate functions (COUNT, MIN, MAX, SUM, AVG) - Most DBMSs have many other functions available - GROUP BY columns: indicate columns to summarize on - HAVING (optional): indicate group conditions
91
GROUP BY Examples Example 12: Grouping on a single column
SELECT FacRank, AVG(FacSalary) AS AvgSalary
FROM Faculty
GROUP BY FacRank
Example 13: Row and group conditions
SELECT StdMajor, AVG(StdGPA) AS AvgGpa
FROM Student
WHERE StdClass IN ('JR', 'SR')
GROUP BY StdMajor
HAVING AVG(StdGPA) > 3.1
Example 12: - Row summary because output uses AVG function - Rename output column when using aggregate expressions - Retrieves one row per faculty rank (ASST, ASSC, PROF) Example 13: - Summarize majors of upperclass students by average GPA; only include majors with average GPA > 3.1 - Row condition: cannot use aggregate function - Group condition: uses aggregate functions - Do not use a condition in HAVING unless the condition involves an aggregate function
92
SQL Summarization Rules
Columns in SELECT and GROUP BY SELECT: non aggregate and aggregate columns GROUP BY: list all non aggregate columns WHERE versus HAVING Row conditions in WHERE Group conditions in HAVING Columns in SELECT and GROUP BY: - Syntactic rule: syntax error if you do not follow - All columns not part of aggregate function must appear in GROUP BY - Applicable to some difficult problems involving joins and grouping WHERE vs. HAVING: - WHERE conditions: cannot have aggregate functions (syntax error) - HAVING: only use conditions that involve an aggregate function - Query executes more slowly if HAVING includes row conditions
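A minimal sketch of the syntactic rule using the Offering table from earlier examples: the first statement is rejected because OffTerm appears in the SELECT list without being part of an aggregate function or the GROUP BY clause; the second statement lists every non-aggregate column in GROUP BY.
-- Rejected: OffTerm is missing from GROUP BY
SELECT OffTerm, OffYear, COUNT(*) AS NumOfferings
FROM Offering
GROUP BY OffYear

-- Accepted: all non-aggregate columns in SELECT also appear in GROUP BY
SELECT OffTerm, OffYear, COUNT(*) AS NumOfferings
FROM Offering
GROUP BY OffTerm, OffYear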
93
Summarization and Joins
Powerful combination List join conditions in the WHERE clause Example 14: List the number of students enrolled in each offering in 2006. SELECT Offering.OfferNo, COUNT(*) AS NumStudents FROM Enrollment, Offering WHERE Offering.OfferNo = Enrollment.OfferNo AND OffYear = 2006 GROUP BY Offering.OfferNo Why combine: - Relational databases have many tables - Data for decision making is often spread across many tables
94
Conceptual Evaluation Process
The conceptual evaluation process is a sequence of operations as indicated in this slide. This process is conceptual rather than actual because most SQL compilers can produce the same output using many shortcuts. Because the shortcuts are system specific, rather mathematical, or performance oriented, we will not review them. The conceptual evaluation process provides a foundation for understanding the meaning of SQL statements that is independent of system and performance issues. Step 1: FROM clause (cross product and join operators) Step 2: WHERE clause (row conditions) Step 3: GROUP BY clause (sort on grouping columns, compute aggregates) Step 4: HAVING clause (group conditions) Step 5: ORDER BY clause Step 6: eliminate columns not in SELECT (projection operation)
95
Conceptual Evaluation Lessons
Row operations before group operations FROM and WHERE before GROUP BY and HAVING Check row operations first Grouping occurs only one time Use small sample tables Conceptual evaluation process: - Sequence of steps to evaluate a SELECT statement - Conceptual not actual: DBMSs use many shortcuts Row operations occur first - Errors in formulation usually occur in row operations - Use small tables to understand relationship of row operations (FROM, WHERE) to group operations (GROUP, HAVING) - For large problems, execute row operations separately to ensure that results before grouping are what you expect Grouping only occurs one time: only an issue for advanced problems when summarizing on independent columns
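For instance, to check the row operations of Example 14 before grouping, one could run just its FROM and WHERE clauses on their own; the column list below is illustrative. The result shows the joined rows that GROUP BY will later compress.
SELECT Offering.OfferNo, Enrollment.StdSSN
FROM Enrollment, Offering
WHERE Offering.OfferNo = Enrollment.OfferNo AND OffYear = 2006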
96
Conceptual Evaluation Problem
Example 15: List the number of offerings taught in 2006 by faculty rank and department. Exclude combinations of faculty rank and department with fewer than two offerings taught. SELECT FacRank, FacDept, COUNT(*) AS NumOfferings FROM Faculty, Offering WHERE Offering.FacSSN = Faculty.FacSSN AND OffYear = 2006 GROUP BY FacRank, FacDept HAVING COUNT(*) > 1 Use the sample university database - Try to derive answer by hand (only derive a subset of the cross product) - Use Access SQL to derive row operations first - Execute entire query to see final result
97
Query Formulation Process
Problem Statement Database Representation Database Language Statement Problem statement: - Ill formed (without structure) - Most difficult part is to structure the problem statement (convert to db representation) - Convert problem vocabulary into database vocabulary - Problem statement is often ambiguous and incomplete Database representation: answers to query formulation questions - Tables - Columns - Conditions DB Language statement: - SQL or other language - Easy part after some practice
98
Critical Questions What tables? How to combine the tables?
Columns in output Conditions to test (including join conditions) How to combine the tables? Usually join PK to FK More complex ways to combine Individual rows or groups of rows? Aggregate functions in output Conditions with aggregate functions Answer questions explicitly or implicitly - Initially answer explicitly - As you gain skill, implicitly answer Chapter 9 for more complex ways to combine tables - Outer join - Difference - Division
99
Efficiency Considerations
Little concern for efficiency Intelligent SQL compilers Correct and non redundant solution No extra tables No unnecessary grouping Use HAVING for group conditions only Chapter 8 provides additional tips for avoiding inefficient SELECT statements Ideally little concern for efficiency SQL compilers: - Consider thousands of alternative plans to evaluate the query - Should not be sensitive to the order of clauses or join style - Complex queries do require attention to efficiency: an advanced database course topic Eliminate redundancy: - Slows performance - Avoid extra tables: performance most sensitive to the number of joins - Grouping is also expensive: avoid if not necessary - Slow performance if row conditions appear in HAVING - Improve performance by reducing the size of intermediate tables through WHERE conditions
100
Advanced Problems Joining multiple tables Self joins
Grouping after joining multiple tables Traditional set operators Let’s apply your query formulation skills and knowledge of the SELECT statement to more difficult problems. All problems in this section involve the parts of SELECT discussed in the preceding sections. The problems involve more difficult aspects such as joining more than two tables, grouping after joins of several tables, joining a table to itself, and traditional set operators.
101
Joining Three Tables Example 16: List Leonard Vince’s teaching schedule in fall 2005. For each course, list the offering number, course number, number of units, days, location, and time. SELECT OfferNo, Offering.CourseNo, OffDays, CrsUnits, OffLocation, OffTime FROM Faculty, Course, Offering WHERE Faculty.FacSSN = Offering.FacSSN AND Offering.CourseNo = Course.CourseNo AND OffYear = 2005 AND OffTerm = 'FALL' AND FacFirstName = 'Leonard' AND FacLastName = 'Vince' List Leonard Vince’s teaching schedule in fall 2005. For each course, list the offering number, course number, number of units, days, location, and time.
102
Joining Four Tables Example 17: List Bob Norbert’s course schedule in spring 2006. For each course, list the offering number, course number, days, location, time, and faculty name. SELECT Offering.OfferNo, Offering.CourseNo, OffDays, OffLocation, OffTime, FacFirstName, FacLastName FROM Faculty, Offering, Enrollment, Student WHERE Offering.OfferNo = Enrollment.OfferNo AND Student.StdSSN = Enrollment.StdSSN AND Faculty.FacSSN = Offering.FacSSN AND OffYear = 2006 AND OffTerm = 'SPRING' AND StdFirstName = 'BOB' AND StdLastName = 'NORBERT' The Enrollment table is needed even though it does not supply columns in the result or conditions to test. The Enrollment table is needed to connect the Student table with the Offering table.
103
Self-Join Join a table to itself
Usually involve a self-referencing relationship Useful to find relationships among rows of the same table Find subordinates within a preset number of levels Find subordinates within any number of levels requires embedded SQL Self-Join: a join between a table and itself (two copies of the same table). Self-joins are useful for finding relationships among rows of the same table. Problems involving self-referencing (unary) relationships are part of tree-structured queries. In tree-structured queries, a table can be visualized as a structure such as a tree or hierarchy. For example, the Faculty table has a structure showing an organization hierarchy. At the top, the college dean resides. At the bottom, faculty members without subordinates reside. Similar structures apply to the chart of accounts in accounting systems, part structures in manufacturing systems, and route networks in transportation systems. A more difficult problem than a self-join is to find all subordinates (direct or indirect) in an organization hierarchy. This problem can be solved in SQL if the number of subordinate levels is known. One join for each subordinate level is needed. Without knowing the number of subordinate levels, this problem cannot be done in SQL2 although it can be solved in SQL3 and with proprietary extensions of SQL2. In SQL2, tree-structured queries can be solved by using SQL inside a programming language.
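As a sketch of the one-join-per-level idea, the statement below lists faculty up to two levels below a given supervisor; the supervisor SSN value is hypothetical.
-- Level 1: direct subordinates of the chosen supervisor
SELECT FacSSN, FacLastName
FROM Faculty
WHERE FacSupervisor = '111-22-3333'
UNION
-- Level 2: subordinates of the direct subordinates (one extra self-join per level)
SELECT Lvl2.FacSSN, Lvl2.FacLastName
FROM Faculty Lvl1, Faculty Lvl2
WHERE Lvl1.FacSupervisor = '111-22-3333'
  AND Lvl2.FacSupervisor = Lvl1.FacSSN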
104
Self-Join Example Example 18: List faculty members who have a higher salary than their supervisor. List the social security number, name, and salary of the faculty and supervisor. SELECT Subr.FacSSN, Subr.FacLastName, Subr.FacSalary, Supr.FacSSN, Supr.FacLastName, Supr.FacSalary FROM Faculty Subr, Faculty Supr WHERE Subr.FacSupervisor = Supr.FacSSN AND Subr.FacSalary > Supr.FacSalary The foreign key, FacSupervisor, shows relationships among Faculty rows. To find the supervisor name of a faculty member, match on the FacSupervisor column with the FacSSN column. The trick is to imagine that you are working with two copies of the Faculty table. One copy plays the role of the subordinate, while the other copy plays the role of the superior. In SQL, a self-join requires alias names (Subr and Supr) in the FROM clause to distinguish between the two roles or copies.
105
Multiple Joins Between Tables
Example 19: List the names of faculty members and the course number for which the faculty member teaches the same course number as his or her supervisor in 2006. SELECT FacFirstName, FacLastName, O1.CourseNo FROM Faculty, Offering O1, Offering O2 WHERE Faculty.FacSSN = O1.FacSSN AND Faculty.FacSupervisor = O2.FacSSN AND O1.OffYear = 2006 AND O2.OffYear = 2006 AND O1.CourseNo = O2.CourseNo This problem involves two joins between the same two tables (Offering and Faculty). Alias table names (O1 and O2) are needed to distinguish between the two copies of the Offering table used in the statement.
106
Multiple Column Grouping
Example 20: List the course number, the offering number, and the number of students enrolled. Only include courses offered in spring 2006. SELECT CourseNo, Enrollment.OfferNo, Count(*) AS NumStudents FROM Offering, Enrollment WHERE Offering.OfferNo = Enrollment.OfferNo AND OffYear = 2006 AND OffTerm = 'SPRING' GROUP BY Enrollment.OfferNo, CourseNo After studying Example 20, you might be confused about the necessity to group on both OfferNo and CourseNo. One simple explanation is that any columns appearing in SELECT must be either a grouping column or an aggregate expression. However, this explanation does not quite tell the entire story. Grouping on OfferNo alone produces the same values for the computed column (NumStudents) because OfferNo is the primary key. Including non-unique columns such as CourseNo adds information to each result row but does not change the aggregate calculations. If you do not understand this point, use sample tables to demonstrate it. When evaluating your sample tables, remember that joins occur before grouping.
107
Traditional Set Operators
A UNION B A INTERSECT B A MINUS B Rows of a table are the analog of members of a set - Chapter 3 material: present again for review if desired; - Can present material in Chapter 4 and skip when initially covering chapter 3 - Union: rows in either table - Intersection: rows common to both tables - Difference: rows in one table but not in the other table Usage: - More limited compared to join, restrict, project - Combine geographically dispersed tables (student tables from different branch campuses) - Difference operator: complex matching problems such as to find faculty not teaching courses in a given semester; Chapter 9 presentation
108
Union Compatibility Requirement for the traditional set operators
Strong requirement Same number of columns Each corresponding column is compatible Positional correspondence Apply to similar tables by removing columns first How are rows compared? - Chapter 3 material: present again for review if desired - Can instead present material in Chapter 4 and skip when initially covering chapter 3 - Join: compares rows on the join column(s) - Traditional set operators compare on all columns Strong requirement: - Usually on identical tables (geographically dispersed tables) - Compatible columns: data types are comparable (numbers cannot be compared to strings) - Positional: 1st column of table A to 1st column of table B, 2nd column etc Can be applied to similar tables (faculty and student) by removing columns before traditional set operator
109
SQL UNION Example Example 21: Retrieve basic data about all university people SELECT FacSSN AS SSN, FacFirstName AS FirstName, FacLastName AS LastName, FacCity AS City, FacState AS State FROM Faculty UNION SELECT StdSSN AS SSN, StdFirstName AS FirstName, StdLastName AS LastName, StdCity AS City, StdState AS State FROM Student Example 21: - UNION keyword can be applied to two SELECT statements (one query) - Access and Oracle support - INTERSECT: Access does not support - MINUS: Access does not support; Other DBMSs use EXCEPT keyword - Rename columns so that output is meaningful
110
Oracle INTERSECT Example
Example 22: Show teaching assistants, that is, faculty who are also students. Only show the common columns in the result. SELECT FacSSN AS SSN, FacFirstName AS FirstName, FacLastName AS LastName, FacCity AS City, FacState AS State FROM Faculty INTERSECT SELECT StdSSN AS SSN, StdFirstName AS FirstName, StdLastName AS LastName, StdCity AS City, StdState AS State FROM Student Does not execute in Access SQL
111
Oracle MINUS Example Example 23: Show faculty who are not students (pure faculty). Only show the common columns in the result. SELECT FacSSN AS SSN, FacFirstName AS FirstName, FacLastName AS LastName, FacCity AS City, FacState AS State FROM Faculty MINUS SELECT StdSSN AS SSN, StdFirstName AS FirstName, StdLastName AS LastName, StdCity AS City, StdState AS State FROM Student Oracle uses the MINUS keyword instead of the EXCEPT keyword used in SQL:2003. Not supported in Access SQL
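Because Access does not support MINUS, a common workaround (not shown in the slides, and a simplification because it compares only the SSN identifiers rather than all columns) is a NOT IN subquery:
SELECT FacSSN, FacFirstName, FacLastName, FacCity, FacState
FROM Faculty
WHERE FacSSN NOT IN (SELECT StdSSN FROM Student)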
112
Data Manipulation Statements
INSERT: adds one or more rows UPDATE: modifies one or more rows DELETE: removes one or more rows Use SELECT statement to INSERT multiple rows UPDATE and DELETE can use a WHERE clause Not as widely used as SELECT statement The modification statements support entering new rows (INSERT), changing columns in one or more rows (UPDATE), and deleting one or more rows (DELETE). Although well designed and powerful, they are not as widely used as SELECT because data entry forms are easier to use for end users.
113
INSERT Example Example 24: Insert a row into the Student table supplying values for all columns. INSERT INTO Student (StdSSN, StdFirstName, StdLastName, StdCity, StdState, StdZip, StdClass, StdMajor, StdGPA) VALUES (' ','JOE','STUDENT','SEATAC', 'WA',' ','FR','IS', 0.0) In the first format, one row at a time can be added. You specify values for each column with the VALUES clause. You must format the constant values appropriate for each column. Refer to the documentation of your DBMS for details about specifying constants especially string and date constants. Specifying a null value for a column is also not standard across DBMSs. In some systems, you simply omit the column name and the value. In other systems, you specify a particular symbol for a null value. Of course, you must be careful that the table definition permits null values for the column of interest. Otherwise, the INSERT statement will be rejected.
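The second format inserts multiple rows by replacing the VALUES clause with a nested SELECT statement. A sketch, assuming a previously created StudentArchive table (hypothetical) with compatible columns:
INSERT INTO StudentArchive (StdSSN, StdFirstName, StdLastName, StdMajor, StdGPA)
SELECT StdSSN, StdFirstName, StdLastName, StdMajor, StdGPA
FROM Student
WHERE StdClass = 'SR'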
114
UPDATE Example Example 25: Change the major and class of Homer Wells.
UPDATE Student SET StdMajor = 'ACCT', StdClass = 'SO' WHERE StdFirstName = 'HOMER' AND StdLastName = 'WELLS' The UPDATE statement allows one or more rows to be changed. Any number of columns can be changed, although typically only one column at a time is changed. When changing the primary key, update rules on referenced rows may not allow the operation.
115
DELETE Example Example 26: Delete all IS majors who are seniors.
DELETE FROM Student WHERE StdMajor = 'IS' AND StdClass = 'SR' The DELETE statement allows one or more rows to be removed. DELETE is subject to the rules on referenced rows. For example, a Student row cannot be deleted if related Enrollment rows exist and the deletion action is restrict.
116
Summary SQL is a broad language SELECT statement is complex
Use problem solving guidelines Lots of practice to master query formulation and SQL SQL breadth: database definition, manipulation, and control SELECT statement: - Most complex part of SQL - Covered the basic parts of the SELECT statement - Textbook chapter 9 covers advanced query formulation - Textbook chapter 10 covers query formulation for forms and reports Problem solving guidelines: - Use small tables and conceptual evaluation process - Use query formulation questions before writing SQL (at least initially) - Understanding the database is crucial to query formulation - Any correct, non redundant solution is acceptable - Major error: incorrect solution - Moderate error: correct but redundant solution Lots of practice - Work many problems without seeing the solutions - 50 problems to develop understanding of query formulation and SQL - Do not rely on Query Design or other visual query tools; use SQL directly
117
Understanding Entity Relationship Diagrams
Chapter 5 Understanding Entity Relationship Diagrams Welcome to Chapter 5 on Understanding Entity Relationship Diagrams: - First data modeling chapter - Focused on the notation (syntax) not on the application of the ERD notation - Chapter 6 emphasizes application of the notation - Important skill for database development: data modeling - Data modeling is challenging - Ambiguity: part science, part art - Opportunity for some creative problem solving Objectives: understand notation - Entity types, relationships, attributes - Cardinalities - Relationship patterns - Generalization hierarchies - Representation of business rules in an ERD - Use the notation precisely to: - Construct diagrams without notational errors - Detect notational errors - Explore CASE tools: ER Assistant
118
Outline Notation basics Understanding relationships
Generalization hierarchies Business rule representation Diagram rules Alternative notations Basic notation: - Entities - Relationships - Cardinalities Understanding relationships: - Identification dependency - M-N relationships with attributes - Self-referencing relationships - M-way relationships - Equivalence between M-N relationships and 1-M relationships Generalization hierarchies: - Notation: supertypes and subtypes, inheritance - When to use Representation of business rules Formally in an ERD Informally in associated documentation Diagram rules: - Completeness rules - Consistency rules - Support in the ER Assistant
119
Basic Symbols Entity type:
- Collection of things of interest: persons, places, things, events - Contains attributes: like columns - Primary key - Entity: instance or member of an entity type Relationship: - Named association among entities: name is significant (gives it more status) - Similar to a foreign key in relational model except for name and cardinalities - Usually between two entity types: can involve more than two entity types or just one (self-referencing) - Bidirectional: - Can be used to navigate in both directions - Two names - Course to Offering: Has, Provides - Offering to Course: IsProvidedFor - Which name to use: try to use active verb; not always possible Attribute: - Properties of entity types or relationships - Data type to indicate the kind of values and permissible operations on the attribute - Shown inside entity type or next to relationship
120
Cardinalities Definition:
- a constraint on the number of entities that participate in a relationship - Specify the minimum and maximum cardinalities in both directions Instance diagram: - Shows occurrences of entity types (entities) - Useful to understand some relationships (similar to sample table usage) - Lines show relationships among entities - Course1 related to Offering1, Offering2, and Offering3 - Course2 related to Offering4 - Course3 not related to any offerings - Course related to a minimum of 0 and maximum of many - Offering: each related to exactly one course
121
Cardinality Notation Symbols: - Oval: means 0
- Perpendicular line: means 1 - Crow's foot: means many (0 or more); unconstrained - Some drawing tools support exact cardinalities (numbers) Placement: - Inside symbol: minimum cardinality - Outside symbol: maximum cardinality - Interpret the far cardinality symbols: near the other entity type - Course is related to a min of 0 and max of many offerings - Offering is related to a min of 1 and max of 1 courses (exactly one)
122
Classification of Cardinalities
Minimum cardinality based Mandatory: existence dependent Optional Maximum cardinality based Functional 1-M M-N 1-1 Classification by common values for minimum and maximum cardinalities Minimum cardinality based: - Min cardinality of one: mandatory; makes entity types existence dependent - Min cardinality of 0: optional; similar to a FK that allows null values Maximum cardinality based: - Functional: max cardinality of 1; mathematics based - 1-M: max cardinalities are 1 and M - M-N: max cardinalities are many in both directions - 1-1: max cardinality is one in both directions (not common)
123
Summary of Cardinalities
Existence dependency: an entity that cannot exist unless another related entity exists. A mandatory relationship produces an existence dependency.
124
More Relationship Examples
TeamTeaches: - M-N - Optional in both directions WorksIn: - 1-1 - Optional: office can be empty - Mandatory: faculty must be assigned to an office
125
Comparison to Access Notation
Skip slide if class has not yet covered chapter 3 on the relational data model Differences: - A matter of preference rather than one model being more powerful - Named relationships: no name in relational model (FKs instead) - ERD does not need FKs: redundant with relationship - No maximum cardinalities in relational model: M cardinality is implied by lack of FK - Relationships with attributes: not permitted in the relational model
126
Understanding Relationships
Identification dependency M-N relationships with attributes Self-referencing relationships M-way relationships Equivalence between M-N and 1-M relationships Understand relationships more deeply: - 1-M relationships are most common - To extend data modeling skills, learn more about relationships
127
Identification Dependency
Concept: - Some entity types borrow part or all of the PK - Specialized concept: important when it occurs but not too common - Similar to FK part of PK in relational model - Closely related entities: physical containment - Room is physically contained in a building - Identification of room includes building - Others: country-state, order-orderline Symbols: - Weak entity: - Borrows part or all of PK - Diagonal lines in the corners - Identifying relationship: - Solid line - Indicates the source of PK - Ambiguity if entity type participates in more than one relationship Example: - PK of Room is a combination of RoomNo (local key) and BldgID (borrowed attribute) - Cardinality of weak entity in the identifying relationship must be 1-1 - Room cannot exist unless associated building exists - Identification dependency involves existence dependency: - Weak entity is existence dependent on the other entity - Also borrows part or all of the PK
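Although conversion to a table design comes later, a sketch of the converted Room weak entity may help: the primary key combines the borrowed BldgID with the local RoomNo. The Building table, BldgName column, and data types are assumptions for illustration.
CREATE TABLE Building
( BldgID    CHAR(8) CONSTRAINT BuildingPK PRIMARY KEY,
  BldgName  VARCHAR(50) NOT NULL );

CREATE TABLE Room
( BldgID    CHAR(8) NOT NULL,   -- borrowed through the identifying relationship
  RoomNo    CHAR(6) NOT NULL,   -- local part of the key
  CONSTRAINT RoomPK PRIMARY KEY (BldgID, RoomNo),
  CONSTRAINT RoomFK FOREIGN KEY (BldgID) REFERENCES Building (BldgID) );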
128
M-N Relationships with Attributes
Relationships are first class citizens - Can have attributes just like entity types - Most typical for M-N relationships - Attribute depends on both entity types, not just one entity type - 1-M relationships with attributes are controversial Example: - EnrGrade: grade recorded for a student in a particular course - Depends on the combination of Student and Offering - EnrGrade is not part of the Student or Offering entity types
129
M-N Relationships with Attributes (II)
AuthOrder: the order in which the author’s name appears in the title of a book - Record order of authors: important in publishing disciplines - AuthOrder is part of the Writes relationship (combination of Author and Book) - AuthOrder is not part of the Author or Book entity types Qty: quantity of part supplied by a supplier; quantity varies by part and supplier
130
Instance Diagrams for Self-Referencing Relationships
Basic idea: - Associations among members of the same set - Employees: supervisory relationships - Courses: prerequisite structures - Specialized concept: important when it occurs but not too common Instance diagram: - Useful to depict self-referencing relationships - 1-M self-referencing: at most one upward connection (traditional org chart) - M-N self-referencing: more than one upward connection - IS461 has multiple prerequisites (IS480 and IS460) - IS320 is the prerequisite for multiple courses
131
ERD Notation for Self-Referencing Relationships
Relationship connects entity type to itself Position of cardinalities is not important: relationship involves the same entity type Otherwise, nothing is different about self-referencing relationships Key point: - 1-M vs. M-N - Use instance diagrams to help you reason
132
Associative Entity Types for M-way Relationships
Basic concept: - Relationships can involve more than 2 entity types - Specialized concept: important when it occurs but not common - Difficult concept: easy to use inappropriately - Track interaction of 3 entity types: 3 way relationship - 3 way relationship tracks who supplies a part on a specified project - Do not use when the requirement is only to know who supplies a part and which parts are used on which projects (two binary relationships suffice) - Local purchasing: suppliers chosen for each project rather than centrally - Chapters 6, 7, and 12 provide guidelines to reason about the need for M-way relationships ERD notation: - Crow's Foot does not support M-way relationships - Some ERD notations do support them - Use an associative entity type and identifying relationships - Associative entity type: - Weak entity that depends on two or more entity types - Replacement for M-N or M-way relationship - Usually borrows entire PK
133
Relationship Equivalence
Replace M-N relationship Associative entity type Two identifying 1-M relationships M-N relationship versus associative entity type Largely preference Associative entity type is more flexible in some situations Similar to Relational Model representation: - associative or linking table - 2 FKs Choice: - Some students find associative entity type easier to understand than M-N relationship - Neither representation is inherently better - Associative entity type is more flexible in some situations (next slide) Applies to M-way relationships - Some ERD notations do not directly support M-way relationships such as the Crow’s Foot notation - Replace with associative entity and M identifying 1-M relationships
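A sketch of the relational representation mentioned above: the Enrollment associative entity type becomes a linking table whose primary key combines two foreign keys, with EnrGrade as its attribute. Data types are assumptions, and the Student and Offering tables of the university database are assumed to exist.
CREATE TABLE Enrollment
( StdSSN    CHAR(11) NOT NULL,
  OfferNo   INTEGER NOT NULL,
  EnrGrade  DECIMAL(3,2),
  CONSTRAINT EnrollmentPK PRIMARY KEY (StdSSN, OfferNo),
  CONSTRAINT EnrollmentFK1 FOREIGN KEY (StdSSN) REFERENCES Student (StdSSN),
  CONSTRAINT EnrollmentFK2 FOREIGN KEY (OfferNo) REFERENCES Offering (OfferNo) );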
134
Associative Entity Type Example
Enrollment: - Associative entity type - Represents M-N relationship between Student and Offering Attendance: - Weak entity - PK: Combination of AttDate and PK of Enrollment Must use associative entity type for Enrollment rather than M-N relationship: - Relate Enrollment to other entity types (Attendance) - Cannot connect a relationship to other relationships - Use associative entity type when other relationships involved
135
Generalization Hierarchies
Classification: - Group objects by similarity - Pervasive in life and business: - Understand animal species by similarity of characteristics - Conduct business by classifying entities: investments, loans, customers, … Generalization hierarchy: - Shows similarity of entity types - Differences in attributes: use when entity types have similar but different attributes - Specialized technique: do not overuse; important when occurs but not common Vocabulary: - Supertype: parent entity type - Subtype: child entity type (additional attributes) - Generalization hierarchy: ISA (not an acronym) - SalaryEmp IS AN Employee - HourlyEmp IS AN Employee Set inclusion: - Subtypes are subsets of supertypes - Set of SalaryEmp entities is a subset of Employee entities
136
Inheritance Subtypes inherit attributes of supertypes (direct and indirect) Allows abbreviation of attribute list Applies to code (methods) as well as attributes (data) Inheritance related to sharing of characteristics (data and code) First developed for object-oriented programming languages: - Reduce the amount of code by inheriting code from similar objects - Later applied to databases - Chapter 18 on object database management Notation: - Do not show inherited attributes - Assume existence: SalaryEmp includes all attributes of Employee - Sometimes exception for PK: for emphasis can show it
137
Generalization Constraints
Cardinality of generalization hierarchy: - No need to specify - Subtype: min and max cardinalities are both 1 - Supertype: min cardinality is zero or one; max cardinality is one Disjointness: - Whether subtypes can have common entities - D means the intersection is empty - No symbol means the intersection may not be empty: Faculty and Student Completeness: - Whether the supertype can have free-standing entities (not in any subtype) - C: Complete; the union of subtype entities is the set of supertype entities - Nothing: supertype can have free-standing entities; the union of subtypes has fewer entities than the supertype
138
Multiple Levels of Generalization
Each subtype inherits from direct and indirect supertypes - Common inherits from Stock and Security Provide constraints for each level of generalization hierarchy Specialized technique: - Need for multiple levels is less common than a single level - Do not overuse generalization hierarchies - Use only to show similarity among attributes
139
Comprehensive Example
Demonstrates most of the notation previously shown - 1-M relationships: Has, Teaches, Supervises, Registers, Grants - Optional relationships: Teaches, Supervises - Mandatory relationships: Has, Registers, Grants - Self referencing relationship: Supervises - Weak entity (also associative entity type): Enrollment - Identifying relationships: Registers, Grants - Generalization hierarchy: UnivPerson, Student, Faculty - Could apply relationship equivalence to transform Enrollment, Registers, and Grants into an M-N relationship with an attribute
140
Business Rules Enforce organizational policies
Promote efficient communication Formal representation in ERD Informal representation in documentation associated with an ERD Use rules language to formally represent in relational database after conversion Rules language: It is possible to have a rules language with the ERD; since the relational model has a rules language, there is no need for one with the ERD Rule representation in relational model: PK, FK, Unique constraint, CHECK constraints (Chapter 14), and triggers (Chapter 11)
141
Formal Representation
Primary key constraints: entity identification Named relationships: direct connections among business entities Identification dependency: knowledge of other entities for identification Cardinalities: restrict number of related entities in a business situation Generalization hierarchies: classification of business entities and organizational policies Primary keys support entity identification, an important requirement in business communication. Identification dependency involves an entity that depends on other entities for identification, a requirement in some business communication. Relationships indicate direct connections among units of business communication. Cardinalities restrict the number of related entities in relationships supporting organizational policies and consistent business communication. Generalization hierarchies with disjointness and completeness constraints support classification of business entities and organizational policies. Thus, the elements of an ERD are crucial for enforcement of organizational policies and efficient business communication.
142
Informal Representation
Specify as documentation associated with elements of an ERD Candidate key constraints: alternate ways to identify business entities Reasonable values: fixed collection of values or consistent with another attribute Null value constraints: data collection completeness Default values: simplify data entry and provide value when unknown In the absence of a formal rules language, business rules can be stored as informal documentation associated with entity types, attributes, and relationships. Typical kinds of business rules to specify as informal documentation are candidate key constraints, attribute comparison constraints, null value constraints, and default values. Candidate keys provide alternative ways to identify business entities. Attribute comparison constraints restrict the values of attributes either to a fixed collection of values or to values of other attributes. Null value constraints and default values support policies about completeness of data collection activities.
143
Diagram Rules Ensure that ERD notation is correctly used
Similar to syntax rules for a computer language Completeness rules: no missing specifications Consistency rules: no conflicts among specifications Supported by the ER Assistant Apply these rules when completing an ERD to ensure that there are no notation errors in your ERD Similar to syntax rules for a computer language: Ensures proper language structure, not correct meaning Diagram rules ensure proper structure among symbols Do not ensure that you have considered multiple alternatives, correctly represented user requirements, and properly documented your design Completeness rules: no missing symbols or specifications Consistency rules: no conflicts among symbols or specifications Second version of ER Assistant supports the diagram rules
144
Completeness Rules Primary Key Rule: all entity types have a PK (direct, indirect, or inherited) Naming Rule: all entity types, relationships, and attributes have a name Cardinality Rule: cardinality is specified in both directions for each relationship Entity Participation Rule: all entity types participate in at least one relationship except for entity types in a generalization hierarchy Generalization Hierarchy Participation Rule: at least one entity type in a generalization hierarchy participates in a relationship Completeness rules: no missing specifications The first three rules are mandatory. A finished ERD should not violate the PK, Naming, and Cardinality rules. PK rule: Direct: Table contains the primary key attribute(s) Indirect: Table borrows part or all of the PK (identification dependency) Inherited: Table inherits PK from a supertype The next two rules are optional. Most ERDs will not violate the Entity Participation and Generalization Hierarchy Participation rules. Rule 5 applies to an entire generalization hierarchy, not to each entity type in a generalization hierarchy. In other words, at least one entity type in a generalization hierarchy should be connected to at least one entity type not in the generalization hierarchy. In many generalization hierarchies, multiple entity types participate in relationships. Generalization hierarchies permit subtypes to participate in relationships thus constraining relationship participation.
145
Primary Key Rule Issue Primary key rule is simple in most cases
For some weak entities, the PK rule is subtle Weak entity with only one 1-M identifying relationship Weak entity must have a local key to augment the borrowed PK from the parent entity type Violation of PK rule if local key is missing In some cases, weak entities must contribute partially to PK. Weak entities with a single 1-M identifying relationship Must provide a local key to augment borrowed PK from the parent entity type Borrowed PK alone cannot identify weak entity instances because there can be many weak entity instances related to the same parent entity Violation of the PK rule if a local key is not provided Associative entity types do not need to provide a local key although they can if needed
146
PK Rule Violation Example
Room violates the PK rule A single 1-M identifying relationship Does not provide a local key to augment the borrowed PK (BldgId)
147
Naming Consistency Rules
Entity Name Rule: entity type names must be unique Attribute Name Rule: attribute names must be unique within each entity type and relationship Inherited Attribute Rule: attribute names in a subtype do not match inherited (direct or indirect) attribute names. Consistency rules: no conflicting specifications Naming rules: no conflict among names Entity names must be unique Attribute names must be unique within each entity type and relationship Inherited attribute names (direct or indirect) should not conflict with local attribute names; indirect inheritance is from an ancestor that is not a direct parent
148
Relationship Names No uniqueness requirement
Participating entities provide a context for relationship names Use unique names as much as possible to distinguish relationships Must provide unique names for multiple relationships between the same entity types The consistency rules do not require unique relationship names because participating entity types provide a context for relationship names. However, it is good practice to use unique relationship names as much as possible to make relationships easy to distinguish. In addition, two or more relationships involving the same entity types should be unique because the entity types no longer provide a context to distinguish the relationships. Since it is uncommon to have more than one relationship between the same entity types, the consistency rules do not include this provision.
149
Connection Consistency Rules
Relationship/Entity Connection Rule: relationships connect two entity types (not necessarily distinct) Relationship/Relationship Connection Rule: relationships are not connected to other relationships Redundant Foreign Key Rule: foreign keys are not used. Connection Consistency rules: no conflicts or redundancies among relationships Relationship formation rules: Connect two entity types (not necessarily distinct) Do not connect relationships directly (connect through entity types) Self-referencing relationships: connect the same entity type two times Redundant foreign key rule: Foreign keys are redundant with 1-M relationships Use FKs in the relational model, not in ERDs Violation of this rule is common: confusion between ERDs and relational table designs Conversion replaces relationships with foreign keys How to detect redundant FKs: - Look in a child entity type (entity type on the M side of the relationship) for an attribute whose name matches the PK of the parent entity type (entity type on the 1 side of the relationship)
150
Identification Dependency Rules
Weak entity rule: weak entities have at least one identifying relationship Identifying relationship rule: at least one participating entity type must be weak for each identifying relationship Identification dependency cardinality rule: the minimum and maximum cardinality must equal 1 for a weak entity in all identifying relationships Identification Dependency Rules: no conflicts among components of identification dependency (weak entity, identifying relationships, cardinality specification) Common source of diagram errors Weak entity rule At least one identifying relationship Cannot be a weak entity without at least one identifying relationship Identifying relationship rule: At least one participating entity type must be weak Not an identifying relationship if there is not a weak entity Identification dependency cardinality rule: Minimum and maximum cardinality must be (1,1) for a weak entity in all identifying relationships (1,1) cardinality should appear near the parent entity type
151
Example of Diagram Errors
Weak entity rule violation: Faculty is a weak entity but it is not involved in any identifying relationships Resolution: remove weak entity symbols In some cases, resolution involves changing a relationship to identifying Identifying relationship rule violation: Has is an identifying relationship but neither Offering nor Course is a weak entity Resolution: make Has a regular (non-identifying) relationship Sometimes resolution involves making an entity type weak Identification Dependency Cardinality rule: The min/max cardinality of the Registers relationship should be (1,1) near Student Resolution: reverse the cardinalities on the Registers relationship Sometimes the resolution does not involve reversing the cardinality but just changing one cardinality specification Redundant foreign key rule: CourseNo in Offering is redundant with the Has relationship Resolution: remove the CourseNo attribute in Offering This rule can be violated even if the FK attribute has a different name than associated PK. It is more difficult to detect a redundancy if the FK has a different name.
152
Corrected ERD
The corrected diagram resolves the violations described on the previous slide: the weak entity symbols are removed from Faculty, Has becomes a regular (non-identifying) relationship, the cardinalities of the Registers relationship are reversed, and the redundant CourseNo attribute is removed from Offering.
153
Support in the ER Assistant
Relationship formation rules are supported by diagram construction Other rules are supported by the Check Diagram feature For the Redundant Foreign Key rule, the ER Assistant detects FKs that have the same name as the associated PKs ER Assistant support: Construction of diagrams ensures that relationships connect two entity types (not necessarily distinct) Check diagram button generates a report for violations of the other rules ER Assistant does not force corrections
154
ERD Variations No standard ERD notation Symbol variations
Placement of cardinality symbols Rule variations Be prepared to adjust to the ERD notation in use by each employer No ERD standard: Many notations: too many; source of confusion Many variations of a given notation: many variations of the Crow’s Foot notation Crow’s foot notation is widely used but it has many variations Symbol variations: Different symbols for entity types, relationships, and attributes Different symbols for identification dependency and generalization hierarchies Different definitions for commonly used terms: existence dependency is defined incorrectly by some authors to mean identification dependency No standard placement for cardinality symbols Be flexible with ERD notation: you need to adjust to the notation in use by each employer
155
ERD Rule Variations Lack of ERD standards M-way relationships
M-N relationships Relationships with attributes Self-referencing relationships Relationships connected to other relationships Adapt to notations in work environments Use notation precisely Notations controlled by CASE tools Expect to see variations because of lack of ERD standards: be adaptable Some notations do not support M-way relationships. Some notations do not support M-N relationships. Some notations do not support relationships with attributes. Some notations do not support self-referencing (unary) relationships. Some notations permit relationships to be connected to other relationships. Some notations show foreign keys as attributes. Some notations allow attributes to have more than one value (multi-valued attributes).
156
Chen ERD Notation Original ERD notation Widely known and used
Variations: Different relationship symbols Reversed cardinality positions Attributes are sometimes shown in ovals attached to entity types and relationships M-way relationships supported Different symbol for associative entity types and weak entities
157
Unified Modeling Language
Standard notation for object-oriented modeling Objects Object features Interactions among objects UML supports class diagrams, interface diagrams, and interaction diagrams More complex than ERD notation Object-oriented modeling is a separate subject Can be studied as part of object-oriented programming and as part of a systems analysis course Class diagrams are an alternative to ERDs Class diagrams should be studied as part of the entire UML, not just as an alternative to data modeling Need an entire course (or more) to study the UML
158
Simple Class Diagram Operations depicted because the UML involves not just data modeling but also process modeling Association is similar to a relationship Two role names
159
Association Class Association class: similar to a M-N relationship with attributes
160
Generalization Relationship
Generalization and inheritance are original features of the UML, not added later as generalization hierarchies are for ERDs
161
Composition Relationship
Composition is similar to identification dependency Dark (filled) diamond attached to the composite (whole) class
162
Summary Data modeling is an important skill
Crow’s Foot ERD notation is widely used Use notation precisely Use the diagram rules to ensure structural consistency and completeness Understanding the ERD notation is a prerequisite to applying the notation on business problems Data modeling - Most important skill for database design - Must understand the notation before you can apply it well - Chapter 6 covers the application of the notation - Important to use the notation precisely just as you use a programming language precisely - Diagram rules cover structure (similar to syntax for text language) - Use the diagram rules by yourself and with Check Diagram feature of the ER Assistant
163
Developing Data Models for Business Databases
Chapter 6 Developing Data Models for Business Databases Welcome to Chapter 6 on developing data models for business databases - Extends your knowledge of the notation of ERDs - Data modeling practice on narrative problems - Convert from ERD to table design - Data modeling is challenging - Ambiguity: part science, part art - Opportunity for some creative problem solving Objectives: - Data modeling practice: - Strategies for analyzing a narrative problem - Transformations for considering alternative designs - Avoidance of common design errors - Master data modeling with lots of practice - Apply conversion rules to transform ERD into a table design
164
Outline Guidelines for analyzing business information needs
Transformations for generating alternative designs Finalizing an ERD Schema Conversion Analysis of narrative problems: steps to identify entity types, primary keys, and relationships Transformations: Generation of alternatives Can help depict feasible alternatives Finalizing an ERD Documentation Avoiding common design errors Schema conversion: basic rules and specialized rules Alternative notations: Chen, UML
165
Characteristics of Business Data Modeling Problems
Poorly defined Conflicting statements Wide scope Missing details Many stakeholders Requirements in many formats Add structure Eliminate irrelevant details Add missing details Narrow scope Business requirements are rarely well structured. Rather, as an analyst you will often face an ill-defined business situation in which you need to add structure. You will need to interact with a variety of stakeholders who sometimes provide competing statements about the database requirements. In collecting the requirements, you will conduct interviews, review documents and system documentation, and examine existing data. To determine the scope of the database, you will need to eliminate irrelevant details and add missing details. On large projects, you may work on a subset of the requirements and then collaborate with a team of designers to determine the complete data model.
166
Goals of Narrative Problem Analysis
Consistency with narrative No contradictions of explicit narrative statements Identify shortcomings Ambiguous statements Missing details Simplicity preference Choose simpler designs especially in initial design Add refinements and additional details later The main goal when analyzing narrative problem statements is to create an ERD that is consistent with the narrative. The ERD should not contradict the implied ERD elements in the problem narrative. For example, if the problem statement indicates that concepts are related by words indicating more than one, the ERD should have a cardinality of many to match that part of the problem statement. You should have a bias toward simpler rather than more complex designs. For example, an ERD with one entity type is less complex than an ERD with two entity types and a relationship. In general, when a choice exists between two ERDs, you should choose the simpler design especially in the initial stages of the design process. As the design process progresses, you can add details and refinements to the original design.
167
Steps of Narrative Problem Analysis
Identify entity types and attributes Determine primary keys Add relationships Determine connections Determine relationship cardinalities Simplify relationships
168
Determine Entity Types and Attributes
For entity types, find nouns that represent groups of people, places, things, and events For attributes, look for properties that provide details about the entity types Simplicity principle: consider a noun as an attribute unless other details are given The simplicity principle should be applied during the search for entity types in the initial ERD, especially involving choices between attributes and entity types. Unless the problem description contains additional sentences or details about a noun, you should consider it initially as an attribute. For example, if courses have an instructor name listed in the catalog, you should consider instructor name as an attribute of the course entity type rather than as an entity type unless additional details are provided about instructors in the problem statement. If there is confusion between considering a concept as an attribute or entity type, you should follow up with more requirements collection later.
169
Determine Primary Keys
Stable: never change after assigned Single purpose: no other purpose Good choices: automatically generated values Compromise choice for industry practices Identify other unique attributes Identification of primary keys is an important part of entity type identification. Ideally, primary keys should be stable and single purpose. “Stable” means that a primary key should never change after it has been assigned to an entity. “Single purpose” means that a primary key attribute should have no purpose other than entity identification. Typically, good choices for primary keys are integer values automatically generated by a DBMS. For example, Access has the AutoNumber data type for primary keys and Oracle has the Sequence object for primary keys. If the requirements indicate the primary key for an entity type, you should ensure that the proposed primary key is stable and single purpose. If the proposed primary key does not meet either criterion, you should probably reject it as a primary key. If the proposed primary key only meets one criterion, you should explore other attributes for the primary key. Sometimes, industry or organizational practices dictate the choice of a primary key even if the choice is not ideal.
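A sketch of a stable, single-purpose primary key using the SQL standard identity syntax (supported by Oracle 12c and later and several other DBMSs; Access instead uses the AutoNumber data type). The column names echo the customer attributes in the water utility example on the next slide and are otherwise assumptions.
CREATE TABLE Customer
( CustNo    INTEGER GENERATED ALWAYS AS IDENTITY,  -- DBMS generated: never changes, no other purpose
  CustName  VARCHAR(50) NOT NULL,
  CustType  CHAR(1),                               -- commercial or residential
  CONSTRAINT CustomerPK PRIMARY KEY (CustNo) );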
170
Entity Identification Example
Derivation of entity types Customer data include a unique customer number, a name, a billing address, a type (commercial or residential), an applicable rate, and a collection (one or more) of meters Meter data include a unique meter number, an address, a size, and a model. A bill consists of a heading part and a list of detail lines. The heading part contains a customer number, a preparation date, a payment due date, and a date range for the consumption period. When a meter is read, a meter reading document is created containing a unique meter reading number, an employee number, a meter number, a time-stamp (includes date and time), and a consumption level. A rate includes a unique rate code, a description, a fixed dollar amount, a consumption threshold, and a variable amount (dollars per cubic foot). Derivation of primary keys Unique customer number Unique meter number Unique meter reading number Unique rate code Unique bill number: not specified in the problem statement
171
Identify Relationships
Identify relationships connecting previously identified entity types Relationship references involve associations among nouns representing entity types Sentences that involve an entity type having another entity type as a property Sentences that involve an entity type having a collection of another entity type Identifying relationships: Examine sentences that describe entity types Look for associations among nouns Noun as a property: single valued maximum cardinality Noun as a collection: M maximum cardinality
172
Relationship Simplification
Problem statement requires direct or indirect connections Hub entity types to simplify Connect other entity types Sometimes associated with important documents Reduce number of direct connections You should look for entity types that are involved in multiple relationships. These entity types can reduce the number of relationships in an ERD by being placed in the center as a hub connected directly to other entity types as spokes of a wheel. Entity types derived from important documents (orders, registrations, purchase orders, etc.) are often hubs in an ERD.
173
Relationship Identification Example
Derivation of relationships For the Assigned relationship, the narrative states that a customer has a rate, and many customers can be assigned the same rate. These two statements indicate a 1-M relationship from Rate to Customer. For the minimum cardinalities, the narrative indicates that a rate is required for a customer, and that rates are proposed before being associated with customers. For the Uses relationship, the narrative states that a customer includes a collection of meters and a meter is associated with one customer at a time. These two statements indicate a 1-M relationship from Customer to Meter. For the ReadBy relationship, the narrative states that a meter reading contains a meter number, and meters are periodically read. These two statements indicate a 1-M relationship from Meter to Reading. For the SentTo relationship, the narrative indicates that the heading part of a bill contains a customer number and bills are periodically sent to customers. These two statements indicate a 1-M relationship from Customer to Bill. The Includes relationship is 1-M because a bill may involve a collection of readings (one on each detail line), and a reading relates to one bill.
174
Diagram Refinements Construct initial ERD Revise many times
Generate feasible alternatives and evaluate according to requirements Gather additional requirements if needed Use transformations to suggest feasible alternatives Data modeling is usually an iterative or repetitive process. You construct a preliminary data model and then refine it many times. In refining a data model, you should generate feasible alternatives and evaluate them according to user requirements. You typically need to gather additional information from users to evaluate alternatives. This process of refinement and evaluation may continue many times for large databases.
175
Attribute to Entity Type Transformation
Allows more detail in an ERD Design approach: - Initial design: Reading only includes EmpNo - Revised design: learn more about employee details important to the problem - Add employee entity type
176
Compound Attribute Transformation
Finer level of detail supports improved search: - More difficult to search address (compound) because of lack of standardization - Primitive attributes: easier to search about address details - Initial design: compound attributes - Revised design: split into smaller attributes
177
Entity Type Expansion Transformation
Usage: - Change in requirements - Initially each rate contains a fixed and variable component - Add multiple tiers of fixed and variable components for each rate - Rates can be highly complex - Lots of effort to understand requirements and represent correctly - Identification dependency is not necessarily part of this transformation - In this situation, identification dependency is reasonable
178
Weak to Strong Entity Transformation
Usage: - Facilitates references after conversion to a table design - Table design involves a combined PK for a weak entity - Most useful for associative entity types that are on the 1 side in other 1-M relationships - Remove identification dependency symbols (identifying relationships and weak entity) - Find a PK: can use INTEGER data type with DBMS generated values
179
Attribute History Transformation
Usage: - May be necessary for legal requirements as well as strategic reporting requirements - Can be done for attributes and relationships - When applied to attributes, the transformation is similar to the attribute to entity type transformation - EmpTitle attribute is replaced with an entity type and a 1-M relationship - Use version number as a local key - Record effective dates (beginning and ending) of change
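A sketch of a table that this transformation could produce after conversion; the table and column names (EmpTitleHistory, VersionNo, BegEffDate, EndEffDate) are assumed for illustration:
CREATE TABLE EmpTitleHistory
( EmpNo INTEGER NOT NULL,
  VersionNo INTEGER NOT NULL, -- local key within an employee
  EmpTitle VARCHAR(50) NOT NULL,
  BegEffDate DATE NOT NULL,
  EndEffDate DATE, -- null for the current title
  PRIMARY KEY (EmpNo, VersionNo),
  FOREIGN KEY (EmpNo) REFERENCES Employee )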
180
1-M Relationship Transformation
When applied to a relationship, this transformation typically involves changing a 1-M relationship into an associative entity type and a pair of identifying 1-M relationships. The ERD depicts the transformation of the 1-M Uses relationship into an associative entity type with attributes for the version number and effective dates. The associative entity type is necessary because the combination of customer and meter may not be unique without a version number.
181
M-N Relationship Transformation
When applied to an M-N relationship, this transformation involves a similar result. The ERD depicts the transformation of the M-N ResidesAt relationship into an associative entity type with a version number and effective change date attributes.
182
Limited History Transformation
For a limited history, a fixed number of attributes can be added to the same entity type. For example, to maintain a history of the current and the most recent employee titles, two attributes (EmpCurrTitle and EmpPrevTitle) can be used as depicted in the figure. To record the change dates for employee titles, two effective date attributes per title attribute can be added.
183
Generalization Hierarchy Transformation
Usage of generalization hierarchy: - Use this transformation sparingly because generalization hierarchies are specialized modeling tools - Subtypes have specialized attributes that do not apply to all entity types - Accepted classification of entity types - Avoid null values - Residential customers do not have taxpayerid and enterprise zone - Instances of original Customer entity type have null values for specialized attributes
184
Summary of Transformations
Attribute to entity type Compound attribute split Entity type expansion Weak entity to strong entity Add history: attributes, 1-M relationships, and M-N relationships Generalization hierarchy addition See Table 6-2 in the Chapter06FiguresTables file
185
Documenting an ERD Important for resolving questions and communicating a design Identify inconsistency and incompleteness in a specification Identify situations when more than one feasible alternative exists Do not repeat the details of the ERD Incorporate documentation into the ERD A large specification typically contains many points of inconsistency and incompleteness. Recording each point allows systematic resolution through additional requirements gathering activities. An information system can undergo a long cycle of repair and enhancement before there is sufficient justification to redesign the system. Good documentation enhances an ERD by communicating the justification for important design decisions. If you are using an ERD tool that has a data dictionary, you should include design justifications in the data dictionary. The ER Assistant supports design justifications as well as comments associated with each item on a diagram. You can use the comments to describe the meaning of attributes.
186
Documentation with the ER Assistant
Attribute comments Entity type comments Relationship comments Design justifications Diagram notes Comments: Enter in the editing window for entity types, attributes, and relationships Attribute comments are most useful: units of measure, descriptive sentence, uniqueness Entity type comments: combined candidate keys; descriptive sentence Design justifications Numbered Arrange in the diagram so that the numbers indicate the applicable part of the ERD Hide/Show design justifications Explain choice among feasible alternatives; explain design subtleties Diagram notes Use for general information about the design (who, when, what, revision information) Can hide/show
187
Common Design Errors Misplaced relationships: wrong entity types connected Incorrect cardinalities: typically using a 1-M relationship instead of an M-N relationship Missing relationships: entity types that should be connected directly are not Overuse of specialized modeling tools: generalization hierarchies, identification dependency, self-referencing relationships, M-way associative entity types Redundant relationships: derived from other relationships Design errors: More difficult to detect and resolve than diagram errors Design errors involve the meaning (semantics) of ERD components, not just the structure of components Misplaced relationships: In a large ERD, it is easy to connect the wrong entity types. To help focus, you can look for clusters of entity types in which an entity type in the center is connected to other entity types. Incorrect cardinality: A typical error involves the usage of a 1-M relationship instead of an M-N relationship. This error can be caused by an omission in the requirements. Missing relationships: Sometimes the requirements do not directly indicate a relationship. Consider indirect implications to detect whether a relationship is required. Overuse of specialized modeling tools: A typical novice mistake is to use them inappropriately. Generalization hierarchies should not be used just because an entity can exist in multiple states. An associative entity type representing an M-way relationship should be used when the database should record combinations of three (or more) objects rather than just combinations of two objects. In most cases, only combinations of two objects should be recorded. Redundant relationships: Cycles in an ERD may indicate redundant relationships. A cycle involves a collection of relationships arranged in a loop starting and ending with the same entity type. In a cycle, a relationship is redundant if it can be derived from other relationships.
188
Resolving Design Errors
Misplaced relationships: use entity type clusters to reason about connections Incorrect cardinalities: incomplete requirements: inferences beyond the requirements Missing relationships: examine implications of requirements Overuse of specialized modeling tools: only use when usage criteria are met Redundant relationships: examine relationship cycles for derived relationships
189
Example Entity Type Cluster
Star pattern with an entity type in the center with 1-M relationships Simplifies connections among entity types Reading is the hub (center) of a cluster connecting Bill, Meter, and Employee Other cluster examples: Order entry database with Order connected to Customer, Employee, and Supplier Hospital database with Visit connected to Patient, Physician, and Care Facility
190
Summary of Data Modeling Guidelines
Use notation precisely Strive for simplicity ERD connections Avoid over connecting the ERD Identify hub(s) of the ERD Use specialized patterns carefully Justify important design decisions Notation: - Communication tool: poor communication if misused - Avoid common errors (next page) Simplicity: - Start with simple design - Use attributes instead of entity types - Consider alternatives that add more detail Connections: - Common mistake: over connecting - In small ERDs: one entity type is typically the hub - In larger ERDs: multiple hub entity types Specialized patterns: - M-way relationships, self-referencing relationships, ID dependency, Gen hierarchies - Not common - Important when they occur - Do not overuse: do not specifically look for specialized patterns Design justifications: - Design can be long-lived - More than one feasible alternative - Requirements that may be unclear: ambiguity and incompleteness are common
191
Summary of Basic Conversion Rules
Each entity type becomes a table. Each 1-M relationship becomes a foreign key in the table corresponding to the child entity type (the entity type near the crow’s foot symbol). Each M-N relationship becomes an associative table with a combined primary key. Each identifying relationship adds a column to a primary key. For more details see textbook Chapter 6 (section 6.4.1). Apply rules in order: all applications of rule 1, then rule 2, then rule 3, and rule 4. Second rule: fundamental difference between models. M-N relationship becomes an associative table with a combined PK.
192
Application of Basic Rules (I)
Entity type rule: - Two applications - Convert PK and attributes in the table 1-M relationship rule: - 1 application - Offering.CourseNo becomes a FK in the child (M) table (Offering) CREATE TABLE Course (… PRIMARY KEY (CourseNo) ) CREATE TABLE Offering (… PRIMARY KEY (OfferNo), FOREIGN KEY (CourseNo) REFERENCES Course )
193
Application of Basic Rules (II)
CREATE TABLE Enrollment (… PRIMARY KEY (StdSSN, OfferNo), FOREIGN KEY (StdSSN) REFERENCES Student, FOREIGN KEY (OfferNo) REFERENCES Offering ) Enrolls_In conversion: - Enrollment table: name change not necessary; use noun for table name - Foreign keys: StdSSN and OfferNo
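Since the fragments above elide the column lists, a fuller sketch follows; the column names match the university database example used throughout, while the data types are assumed, and a Student table is assumed to exist:
CREATE TABLE Course
( CourseNo CHAR(6),
  CrsDesc VARCHAR(50),
  PRIMARY KEY (CourseNo) )

CREATE TABLE Offering
( OfferNo INTEGER,
  OffTerm CHAR(6),
  OffYear INTEGER,
  CourseNo CHAR(6) NOT NULL, -- FK added by the 1-M relationship rule
  PRIMARY KEY (OfferNo),
  FOREIGN KEY (CourseNo) REFERENCES Course )

CREATE TABLE Enrollment
( StdSSN CHAR(11),
  OfferNo INTEGER,
  EnrGrade DECIMAL(3,2),
  PRIMARY KEY (StdSSN, OfferNo), -- combined PK from the M-N relationship rule
  FOREIGN KEY (StdSSN) REFERENCES Student,
  FOREIGN KEY (OfferNo) REFERENCES Offering )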
194
Application of Basic Rules (III)
Conversion process: - Entity type rule: 3 applications - 1-M relationship rule: 2 applications (FKs in the Enrollment table) - Identifying relationship rule: 2 applications - Each application of the identifying relationship rule adds a PK component Same conversion result as the previous slide Different application of rules
195
Generalization Hierarchy Rule
Mimic generalization hierarchy as much as possible Each subtype table contains specific columns plus the primary key of its parent table. Foreign key constraints for subtype tables CASCADE DELETE option for referenced rows Reduce need for null values Need joins and outer joins to combine tables Generalization hierarchy rule: - Simulate generalization hierarchy with tables and foreign keys - Minimize null values in the tables - Little redundancy except for PK: repeated for each subtype table - CASCADE DELETE: delete parent, automatically delete rows in subtype tables Combining tables: - Join: combine table and all of its parent tables - Full Outer join: combine a table with its sibling tables Other conversion possibilities: - One table: many null values but no need for join and outer join operations - Combine some subtype tables: fewer tables, more nulls, fewer join and outer joins
196
Generalization Hierarchy Example
CASCADE DELETE for Foreign Keys - SalaryEmp - HourlyEmp Employee table: EmpNo (PK) SalaryEmp table: EmpNo (PK), EmpNo (FK) HourlyEmp table: EmpNo (PK), EmpNo (FK)
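A SQL sketch of this conversion; the non-key columns (EmpName, EmpSalary, EmpRate) and data types are assumed. The CASCADE DELETE option on the subtype foreign keys removes a subtype row automatically when its parent Employee row is deleted:
CREATE TABLE Employee
( EmpNo INTEGER,
  EmpName VARCHAR(50),
  PRIMARY KEY (EmpNo) )

CREATE TABLE SalaryEmp
( EmpNo INTEGER, -- PK and FK: repeats only the parent key
  EmpSalary DECIMAL(10,2),
  PRIMARY KEY (EmpNo),
  FOREIGN KEY (EmpNo) REFERENCES Employee ON DELETE CASCADE )

CREATE TABLE HourlyEmp
( EmpNo INTEGER,
  EmpRate DECIMAL(10,2),
  PRIMARY KEY (EmpNo),
  FOREIGN KEY (EmpNo) REFERENCES Employee ON DELETE CASCADE )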
197
Optional 1-M Rule Separate table for each optional 1-M relationship
Avoids null values Requires an extra table and join operation Controversial: in most cases 1-M rule is preferred Optional 1-M relationship: - Relationship in which min cardinality is 0 - No existence dependency - Offering-Faculty: offering can be stored without a faculty assigned - Order-Employee: order can be stored without an employee (internet order) Rule: - Optional 1-M relationship becomes a table instead of just a FK in the M table - Avoids null values - Adds another table and join operation to combine - Rule is controversial - Rule is not necessary - 1-M relationship rule is more widely used for optional 1-M relationships
198
Optional 1-M Example Other conversion: - Faculty table - Offering table: no FK Teaches conversion: - Teaches table - OfferNo: PK - FacSSN: FK - No null values: Teaches only contains rows for Offerings with assigned faculty CREATE TABLE Teaches (… PRIMARY KEY (OfferNo) , FOREIGN KEY(OfferNo) REFERENCES Offering, FOREIGN KEY(FacSSN) REFERENCES Faculty )
199
1-1 Relationships 1-1 relationships: not common Use FKs in each mandatory 1-1 relationship: - Can also use for optional 1-1 relationship but will have null values - Office must have an employee: use FK in Office - Employee must be assigned to an office: use FK in Employee UNIQUE constraint: - FK is unique because of the 1-1 relationship - The same employee can be assigned to only one office. CREATE TABLE Office (… PRIMARY KEY (OfficeNo) , FOREIGN KEY(EmpNo) REFERENCES Employee, UNIQUE (EmpNo) )
200
Summary Data modeling is an important skill Use notation precisely
Preference for simpler designs Consider alternative designs Review design for common errors Work many problems Data modeling - Important skill - Use notation precisely - Problem solving requires careful consideration of alternatives Advanced concepts: - Avoid overuse: typical beginner's mistake - Important when they occur but not common
201
Normalization of Relational Tables
Chapter 7 Normalization of Relational Tables Welcome to Chapter 7 - Logical database design: converting and refining the ERD - Major part of logical database design: normalization - Normalization: refinement; identifying and resolving unwanted redundancy Objectives: - Identify modification anomalies - Define functional dependencies - Apply normalization rules to modest size problems: BCNF and simple synthesis procedure - Understand relationship independence problems - Appreciate the role and objective of normalization in the db development process
202
Outline Modification anomalies Functional dependencies
Major normal forms Relationship independence Practical concerns Modification anomalies: motivation for normalization; unwanted redundancy Functional dependencies: - Assertions or constraints about the data - Most important part of the process: recording FDs Normal forms: - Rules about allowable patterns of FDs - Apply on modest size problems - CASE tool for large problems Relationship independence: - More specialized redundancy problem - Not as common and important as BCNF Practical concerns: - Role of normalization in the development process: when to use; how important - Analyzing the objective
203
Modification Anomalies
Unexpected side effect Insert, modify, and delete more data than desired Caused by excessive redundancies Strive for one fact in one place Side effect: unintended consequence; sometimes good, sometimes bad Modification anomaly: - Cannot modify just the desired data - Must modify more than the desired data Cause: - Redundancy: facts stored multiple times - Remove unwanted redundancies to eliminate anomalies
204
Big University Database Table
- Table 7-1 except for omission of two columns (StdCity and OffTerm) - Typical beginner's mistake: use one table for the entire database Anomalies: - PK: combination of StdSSN and OfferNo - Insert: cannot insert a new student without enrolling in an offering (OfferNo part of PK) - Update: change a course description; change every enrollment of the course - Delete: remove third row; lose information about course C3 Table has obvious redundancies - Easier to query: no joins - More difficult to change: can work around problems (dummy PK) but tedious to do
205
Modification Anomaly Examples
Insertion Insert more column data than desired Must know student number and offering number to insert a new course Update Change multiple rows to change one fact Must change two rows to change student class of student S1 Deletion Deleting a row causes other facts to disappear Deleting enrollment of student S2 in offering O3 causes loss of information about offering O3 and course C3 To deal with these anomalies, users may circumvent them (such as using a default primary key to insert a new course) or database programmers may write code to prevent inadvertent loss of data. A better solution is to modify the table design to remove the redundancies that cause the anomalies.
206
Functional Dependencies
Constraint on the possible rows in a table Value neutral like FKs and PKs Asserted Understand business rules Assert: - Defining a business rule - Normative: should be statement - Look at data to see existing practices - Most important part of normalization: asserting FDs Value neutral: - no specific value mentioned in an FD - PK: can be any value but it must be unique - FK: can be any value that matches a row in the PK table
207
FD Definition X -> Y X (functionally) determines Y
X: left-hand-side (LHS) or determinant For each X value, there is at most one Y value Similar to candidate keys Notation: - X->Y - X determines Y (more properly X functionally determines Y) - Refer to X as the LHS or determinant - Each X has at most one Y - Like a mathematical function: f(X) = Y Example StdSSN -> StdClass - There is at most one class for each student - Place StdSSN and StdClass alone in the same table: StdSSN is a candidate key
208
FD Diagrams and Lists StdSSN -> StdCity, StdClass
OfferNo -> OffTerm, OffYear, CourseNo, CrsDesc CourseNo -> CrsDesc StdSSN, OfferNo -> EnrGrade FD diagram: - Figure 7.2 of textbook chapter 7 - See related FDs (same LHS) by line height - Useful for small sets of FDs - Unwieldy for large sets of FDs FD list: - Group by LHS - Shortcut notation: X -> Y, Z is a shortcut for X -> Y and X -> Z Compound LHS: - Similar to a combined PK - Compound LHS is not a shortcut (as is a compound RHS) - Combination of StdSSN and OfferNo determines EnrGrade (not either column alone) Minimality: - LHS must be minimal - Cannot remove columns from LHS without making the FD invalid - Usually non minimal LHS is not a problem: important that LHS does not have extraneous columns - Properly known as full functional dependence: minimal LHS makes full functional dep.
209
FDs in Data Prove non existence (but not existence) by looking at data
Two rows that have the same X value but a different Y value Looking at data: - Useful when explaining to a user - Automated tools ask for example rows to eliminate FDs Example: - OfferNo -> StdSSN: contradicting rows ( 2, 4) (same OfferNo but a different StdSSN) - StdSSN -> OfferNo: contradicting rows (<1,2>, <3,4>) - StdSSN -> OffYear: data has no contradictions - Add rows to provide contradiction (enroll S1 in a 2001 offering) Assignment 5: - Questions similar to this line of reasoning - Find contradictory rows or add rows if no contradiction is found
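This kind of data check can also be phrased as a query. A sketch assuming the sample rows are stored in a single table named UnivTable: the query returns any StdSSN associated with more than one OffYear, which would disprove StdSSN -> OffYear.
SELECT StdSSN
FROM UnivTable
GROUP BY StdSSN
HAVING COUNT(DISTINCT OffYear) > 1
-- An empty result shows no contradiction in the current data; it does not prove the FD holds.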
210
Identifying FDs Easy identification Difficult identification
Statements about uniqueness PKs and CKs resulting from ERD conversion 1-M relationship: FD from child to parent Difficult identification LHS is not a PK or CK in a converted table LHS is part of a combined primary or candidate key Ensure minimality of LHS Functional dependencies in which the LHS is not a primary or candidate key can also be difficult to identify. These FDs are especially important to identify after converting an ERD to a table design. You should carefully look for FDs in which the LHS is not a candidate key or primary key. You should also consider FDs in tables with a combined primary or candidate key in which the LHS is part of a key, but not the entire key. The presentation of normal forms in Section 7.2 explains that these kinds of FDs can lead to modification anomalies. Minimality: no extra columns in LHS Difficult FDs to identify: most important because most FDs are identified in developing ERD
211
Normalization Process of removing unwanted redundancies
Apply normal forms Identify FDs Determine whether FDs meet normal form Split the table to meet the normal form if there is a violation Normal form: - Rule about allowable pattern of FDs (1NF through BCNF) - Higher normal forms are not rules about FDs: more difficult to understand and use - Most important part is to record FDs: CASE tool can perform normalization - Check FDs to see if they violate the pattern permitted in the normal form Split table - Smaller tables do not violate the normal form - Smaller tables should not lose information contained in the larger table Difficulty: - Normalization is easy to apply to small tables with simple dependency structures - Use CASE tool for large databases and tables with complex dependency structures
212
Relationships of Normal Forms
1NF: least restrictive; every table in 1NF 2NF: more restrictive than 1NF; every table in 2NF is also in 1NF 3NF/BCNF: BCNF is a revised definition of 3NF; BCNF is more restrictive than 3NF 4NF: Inappropriate usage of an M-way relationship; Relationship independence and MVDs; does not involve FDs 5NF: does not involve FDs; Inappropriate usage of an M-way relationship; more specialized than 4NF DKNF: ideal rather than a practical normal form
213
1NF Starting point for most relational DBMSs
No repeating groups: flat rows Big university database table is not normalized - Not in 1NF - S1 row has repeating values (O1 and O2) - S2 row has repeating values (O3 and O2) Convert to 1NF: - Flatten rows - Split each repeating group into a separate row - Repeat the implied values in the new rows: - S1 and JUN repeated in row two with O2, C2, VB - S2 and JUN repeated in row four with O2, C2, VB Nested tables are permitted in SQL:2003 (Chapter 18) but nested tables are not important in most business databases. Nested tables are not in 1NF.
214
Combined Definition of 2NF/3NF
Key column: candidate key or part of candidate key Analogy to the traditional justice oath Every non key column depends on all candidate keys, whole candidate keys, and nothing but candidate keys Usually taught as separate definitions Candidate key: - Unique - Minimal: no extraneous columns without losing uniqueness property - Can have multiple candidate keys per table Key column: - A candidate key by itself - Part of a combined CK - Nonkey: a column that is not a key column Combined definition: - Analogy to traditional justice oath - So help me: Ted Codd (father of relational databases) - Usually taught as separate definitions for simplicity
215
2NF Every nonkey column depends on all candidate keys, not a subset of any candidate key Violations: part of key -> nonkey Violations only for combined keys First part of combined 2NF/3NF definition: dependent on the whole key Violation: - Part of a key determines a non key - If FDs for a table contain such an FD, split the table
216
2NF Example Many violations for the big university database table
StdSSN -> StdCity, StdClass OfferNo -> OffTerm, OffYear, CourseNo, CrsDesc Splitting the table UnivTable1 (StdSSN, StdCity, StdClass) UnivTable2 (OfferNo, OffTerm, OffYear, CourseNo, CrsDesc) Violating FDs: - All FDs violate except StdSSN, OfferNo -> EnrGrade - StdSSN: part of a key (not the entire key) - OfferNo: part of a key (not the entire key) Splitting: - Place each FD group in a separate table - UnivTable1: StdSSN group - UnivTable2: OfferNo group - UnivTable3: StdSSN, OfferNo group - No violations among new tables - CourseNo -> CrsDesc (CourseNo is not part of a key) - This FD violates 3NF, not 2NF Splitting process: - Recover original table with natural join - Not lose any FDs: all FDs are derivable - Books on normalization theory explain criteria: theory not important here
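A sketch of the split tables in SQL with assumed data types; the foreign keys (not formally part of normalization) are added for consistency:
CREATE TABLE UnivTable1
( StdSSN CHAR(11),
  StdCity VARCHAR(30),
  StdClass CHAR(3),
  PRIMARY KEY (StdSSN) )

CREATE TABLE UnivTable2
( OfferNo INTEGER,
  OffTerm CHAR(6),
  OffYear INTEGER,
  CourseNo CHAR(6),
  CrsDesc VARCHAR(50), -- CourseNo -> CrsDesc remains: a 3NF (not 2NF) violation handled on a later slide
  PRIMARY KEY (OfferNo) )

CREATE TABLE UnivTable3
( StdSSN CHAR(11),
  OfferNo INTEGER,
  EnrGrade DECIMAL(3,2),
  PRIMARY KEY (StdSSN, OfferNo),
  FOREIGN KEY (StdSSN) REFERENCES UnivTable1,
  FOREIGN KEY (OfferNo) REFERENCES UnivTable2 )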
217
3NF Every nonkey column depends only on candidate keys, not on non key columns Violations: nonkey -> nonkey Alternative formulation No transitive FDs A -> B, B -> C then A -> C OfferNo -> CourseNo, CourseNo -> CrsDesc then OfferNo -> CrsDesc Second part of combined 2NF/3NF definition: nothing but the key Violation: - Non key determines a non key - If FDs for a table contain such an FD, split the table Alternative formulation: - No transitive FDs - Law of transitivity: A < B, B < C then A < C - Transitivity applies to FDs - Not the preferred definition of 3NF: should not write down transitively derived FDs - Simple synthesis procedure for BCNF
218
3NF Example One violation in UnivTable2 Splitting the table
CourseNo -> CrsDesc Splitting the table UnivTable2-1 (OfferNo, OffTerm, OffYear, CourseNo) UnivTable2-2 (CourseNo, CrsDesc) Violating FDs: - CourseNo is non key - Alternatively, OfferNo -> CrsDesc is a transitively derived FD Splitting UnivTable2: - Arrange by LHS - UnivTable2-2: CourseNo is the PK
219
BCNF Every determinant must be a candidate key. Simpler definition
Apply with simple synthesis procedure Special cases not covered by 3NF: part of key -> part of key; nonkey -> part of key Special cases are not common Boyce-Codd Normal Form Determinant: LHS Candidate key: unique column(s) for the table Important because it is simpler and more direct to apply Revised 3NF definition Special case is not common
220
BCNF Example Primary key: (OfferNo, StdSSN)
Many violations for the big university database table StdSSN -> StdCity, StdClass OfferNo -> OffTerm, OffYear, CourseNo CourseNo -> CrsDesc Split into four tables Violating FDs: - All FDs violate except StdSSN, OfferNo -> EnrGrade - StdSSN: not a candidate key (part of a candidate key) - OfferNo: part of a key (not the entire key) Splitting: - Place each FD group in a separate table - UnivTable1: StdSSN group - UnivTable2: OfferNo group - UnivTable3: CourseNo group - UnivTable4: StdSSN, OfferNo group - No violations among new tables
221
Simple Synthesis Procedure
Eliminate extraneous columns from the LHSs Remove derived FDs Arrange the FDs into groups with each group having the same determinant. For each FD group, make a table with the determinant as the primary key. Merge tables in which one table contains all columns of the other table. Synthesis: - Combine parts into whole - Musical synthesis: combine individual sounds into larger musical units - Combine FDs into tables Simple synthesis procedure - Procedure to apply BCNF: results are BCNF tables - Simple: second step is much more complex in practice - Many ways to derive FDs - Compute minimal cover: difficult to compute by hand except for small FD lists - Use CASE tool for moderate to large FD lists Step 1: - Make sure LHSs are minimal - Usually not a problem Step 2: - For this class, only consider law of transitivity - Reduce amount of work by not recording derived FDs Step 3: - Sort FDs by LHS - Little work because natural to write FDs in this manner Step 4: - One table per FD group - LHS becomes PK Step 5: - Deals with tables that have multiple candidate keys - Merge FD groups when one table contains another table - Choose the primary key of one of the separate tables as the primary key of the new, merged table. - Define unique constraints for the other primary keys that were not designated as the primary key of the new table. Add FKs: - Not formally part of normalization process - Ensure consistency - Add FK: PK or CK used in another table
222
Simple Synthesis Example I
Begin with FDs shown in Slide 8 Step 1: no extraneous columns Step 2: eliminate OfferNo -> CrsDesc Step 3: already arranged by LHS Step 4: four tables (Student, Enrollment, Course, Offering) Step 5: no redundant tables Step 1: - No work - StdSSN, StdClass -> StdCity: remove StdClass Step 2: - OfferNo -> CrsDesc is derived by transitivity - Remove from the FD list Step 3: - StdSSN -> StdCity, StdClass - OfferNo -> OffTerm, OffYear, CourseNo - CourseNo -> CrsDesc - StdSSN, OfferNo -> EnrGrade Step 4: - Student table: StdSSN (PK) - Offering table: OfferNo (PK) - Course table: CourseNo (PK) - Enrollment table: StdSSN, OfferNo (combined PK) Step 5: - No redundant tables - Example if Email -> StdSSN: merge the Email and StdSSN tables - StdSSN as the PK: more stable than Email; some email addresses may be missing FKs: - Offering: CourseNo - Enrollment: StdSSN and OfferNo
223
Simple Synthesis Example II
AuthNo -> AuthName, AuthEmail, AuthAddress AuthEmail -> AuthNo PaperNo -> Primary-AuthNo, Title, Abstract, Status RevNo -> RevName, RevEmail, RevAddress RevEmail -> RevNo RevNo, PaperNo -> Auth-Comm, Prog-Comm, Date, Rating1, Rating2, Rating3, Rating4, Rating5 A database to track reviews of papers submitted to an academic conference. Prospective authors submit papers for review and possible acceptance in the published conference proceedings.
224
Simple Synthesis Example II Solution
Author(AuthNo, AuthName, AuthEmail, AuthAddress) UNIQUE (AuthEmail) Paper(PaperNo, Primary-Auth, Title, Abstract, Status) FOREIGN KEY (Primary-Auth) REFERENCES Author Reviewer(RevNo, RevName, RevEmail, RevAddress) UNIQUE (RevEmail) Review(PaperNo, RevNo, Auth-Comm, Prog-Comm, Date, Rating1, Rating2, Rating3, Rating4, Rating5) FOREIGN KEY (PaperNo) REFERENCES Paper FOREIGN KEY (RevNo) REFERENCES Reviewer Because the LHS is minimal in each FD, the first step is finished. The second step is not necessary because there are no transitive dependencies. Note that the FDs AuthEmail -> AuthName, AuthAddress and RevEmail -> RevName, RevAddress can be transitively derived. If any of these FDs were part of the original list, they should be removed. For each of the six FD groups, you should define a table. In the last step, you combine the FD groups with AuthNo and AuthEmail as determinants and the FD groups with RevNo and RevEmail as determinants. In addition, you should add unique constraints for AuthEmail and RevEmail because these columns were not selected as the primary keys of the new tables.
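A SQL rendering of this solution with assumed data types; hyphenated names (Primary-Auth, Auth-Comm, Prog-Comm) and the reserved word Date are renamed slightly here so the identifiers are valid without quoting:
CREATE TABLE Author
( AuthNo INTEGER,
  AuthName VARCHAR(50),
  AuthEmail VARCHAR(50) NOT NULL,
  AuthAddress VARCHAR(100),
  PRIMARY KEY (AuthNo),
  UNIQUE (AuthEmail) )

CREATE TABLE Paper
( PaperNo INTEGER,
  PrimaryAuth INTEGER NOT NULL,
  Title VARCHAR(100),
  Abstract VARCHAR(2000),
  Status CHAR(10),
  PRIMARY KEY (PaperNo),
  FOREIGN KEY (PrimaryAuth) REFERENCES Author )

CREATE TABLE Reviewer
( RevNo INTEGER,
  RevName VARCHAR(50),
  RevEmail VARCHAR(50) NOT NULL,
  RevAddress VARCHAR(100),
  PRIMARY KEY (RevNo),
  UNIQUE (RevEmail) )

CREATE TABLE Review
( PaperNo INTEGER,
  RevNo INTEGER,
  AuthComm VARCHAR(2000),
  ProgComm VARCHAR(2000),
  RevDate DATE,
  Rating1 INTEGER,
  Rating2 INTEGER,
  Rating3 INTEGER,
  Rating4 INTEGER,
  Rating5 INTEGER,
  PRIMARY KEY (PaperNo, RevNo),
  FOREIGN KEY (PaperNo) REFERENCES Paper,
  FOREIGN KEY (RevNo) REFERENCES Reviewer )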
225
Multiple Candidate Keys
Multiple candidate keys do not violate either 3NF or BCNF Step 5 of the Simple Synthesis Procedure creates tables with multiple candidate keys. You should not split a table just because it contains multiple candidate keys. Splitting a table unnecessarily can slow query performance. As this additional example demonstrates, multiple candidate keys do not violate BCNF. The fifth step of the Simple Synthesis Procedure creates tables with multiple candidate keys because it merges tables. Multiple candidate keys do not violate 3NF either. There is no reason to split a table just because it has multiple candidate keys. Splitting a table with multiple candidate keys can slow query performance due to extra joins.
226
Relationship Independence and 4NF
M-way relationship that can be derived from binary relationships Split into binary relationships Specialized problem 4NF does not involve FDs Analogy to statistical independence: - Do not store joint probabilities when variables are independent - Age of rock and age of person holding the rock are independent - Joint probability (Person age = X and rock age = y) can be derived from marginal probabilities Relationship independence: - Independent relationships: binary relationships that combine to show all possible combinations - Combine using the join operator - No need to store the join (M-way relationship) when it can be derived More specialized problem than BCNF: - M-way relationships are not common: important when occurring - Analysis of M way relationships is not a typical situation - Two ways to analyze M-way relationships: - Given binary relationships, should there be an M-way relationship instead - Given M-way relationship, should it be split into M-1 binary relationships - Relationship independence and 4NF only involve the splitting question - Chapter 12 (Section ) provides a more general (but less rigorous) way to reason about both questions
227
Relationship Independence Problem
Enroll: - Associative entity type containing combinations of students, offerings, and textbooks - All key: StdSSN, OfferNo, TextNo Design problem: - Should Enroll be split into two binary relationships - Student-Offering: students register for course offerings - Offering-Textbook: professors choose textbooks - Student-Textbook: can be derived from other two relationships - Student-Offering and Offering-Textbook are independent Solution: next slide
228
Relationship Independence Solution
- Split the Enroll entity type into two M-N relationships - Enroll relationship: Student-Offering - Orders relationship: Offering-Textbook A 3-way relationship would probably not have been considered because of knowledge of the business process - Enrollment and textbook ordering are independent events - Occur at different points in time - Separate data entry forms
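In table terms, the split replaces one all-key table with two smaller all-key tables whose join reproduces the derivable combinations. A sketch using the relationship names as table names (foreign keys to Student, Offering, and Textbook are omitted for brevity):
CREATE TABLE Enroll
( StdSSN CHAR(11),
  OfferNo INTEGER,
  PRIMARY KEY (StdSSN, OfferNo) )

CREATE TABLE Orders
( OfferNo INTEGER,
  TextNo INTEGER,
  PRIMARY KEY (OfferNo, TextNo) )

-- Student-textbook combinations derived with a join rather than stored:
SELECT Enroll.StdSSN, Enroll.OfferNo, Orders.TextNo
FROM Enroll, Orders
WHERE Enroll.OfferNo = Orders.OfferNo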
229
Extension to the Relationship Independence Solution
Problem extension: - Record textbook purchases - Add an associative entity type but keep the two binary relationships - Purchasing behavior cannot be derived from enrollment and ordering data If the assumptions change slightly, an argument can be made for an associative entity type representing a three-way relationship. Suppose that the bookstore wants to record textbook purchases by offering and student to estimate textbook demand. Then, the relationship between students and textbooks is no longer independent of the other two relationships. Even though a student is enrolled in an offering and the offering uses a textbook, the student may not purchase the textbook (perhaps borrow it) for the offering. In this situation, there is no independence and a three-way relationship is needed. In addition to the M-N relationships in Figure 7.8, there should be a new associative entity type and three 1-M relationships, as shown in the figure. You need the Enroll relationship to record student selections of offerings and the Orders relationship to record professor selections of textbooks. The Purchase entity type records purchases of textbooks by students in a course offering. However, a purchase cannot be known from the other relationships.
230
MVDs and 4NF MVD: difficult to identify 4NF: no non trivial MVDs
A ->> B | C (multi-determines) A associated with a collection of B and C values B and C are independent Non trivial MVD: not also an FD 4NF: no non trivial MVDs MVDs: - Difficult to identify - Rigorous definition of relationship independence - Concept is confusing to many practitioners: independence concept is often omitted - I stress the independence idea (the key idea) rather than the multi value idea MVD: - Association with a set of values (not just one) - Independence: key idea - An FD is an MVD in which the collection is a single value - Non trivial MVD: associated with more than one value
231
MVD Representation Given the two rows above the line, the two rows below the line are in the table if the MVD is true. A ->> B | C Given the two rows above the line, the two rows below the line are in the table if A multi determines B | C. Independence means that A is associated with every combination of B and C values. OfferNo ->> StdSSN | TextNo
232
Higher Level Normal Forms
5NF for M-way relationships DKNF: absolute normal form DKNF is an ideal, not a practical normal form 5NF: - More specialized than 4NF - More difficult to understand than 4NF - Split a three way relationship into three (not two) binary relationships DKNF (Domain Key Normal Form) - Domain: sets of values - Key: candidate key (uniqueness property) - All constraints derivable from domains and keys - Not possible to test a table for DKNF compliance - No known procedure to construct a DKNF table
233
Role of Normalization Refinement Initial design Use after ERD
Apply to table design or ERD Initial design Record attributes and FDs No initial ERD May reverse engineer an ERD after normalization Approach in textbook: - Refinement - Apply normalization after conversion - Normalization can be applied directly to an ERD Initial design approach: - Use attributes and FDs - May reverse engineer ERD later - Some strongly advocate this approach
234
Advantages of Refinement Approach
Easier to translate requirements into an ERD than list of FDs Fewer FDs to specify Fewer tables to split Easier to identify relationships especially M-N relationships without attributes This book clearly favors using normalization as a refinement tool, not as an initial design tool. Through development of an ERD, you intuitively group related fields. Much normalization is accomplished in an informal manner without the tedious process of recording functional dependencies. As a refinement tool, there are fewer FDs to specify and less normalization to perform. Applying normalization ensures that candidate keys and redundancies have not been overlooked. Another reason for favoring the refinement approach is that relationships can be overlooked when using normalization as the initial design approach. 1-M relationships must be identified in the child-to-parent direction. For novice data modelers, identifying relationships is easier when considering both sides of a relationship. For an M-N relationship without attributes, there will not be any functional dependencies that show the need for a table. For example, in a design about textbooks and course offerings, if the relationship between them has no attributes, there are no functional dependencies that relate textbooks and course offerings[1]. In drawing an ERD, however, the need for an M-N relationship becomes clear. [1] An FD can be written with a null right-hand side to represent M-N relationships. The FD for the offering-textbook relationship can be expressed as TextId, OfferNo -> (with an empty right-hand side). However, this kind of FD is awkward to state. It is much easier to define an M-N relationship.
235
Normalization Objective
Update biased Not a concern for databases without updates (data warehouses) Denormalization Purposeful violation of a normal form Some FDs may not cause anomalies May improve performance Carefully analyze objective: - Ignore normalization for FDs that do not cause anomalies - Be careful: most FDs will lead to anomalies Classic example: - ZipCode -> City - Only holds for the city in which the post office is located - Even when it holds, it may not lead to anomalies - Mail order business: track tax rates by zip code (not really accurate) - Important for mail order databases Performance: - Chapter 8 - Consider performance implications after logical design is complete
236
Summary Beware of unwanted redundancies FDs are important constraints
Strive for BCNF Use a CASE tool for large problems Important tool of database development Focus on the normalization objective Cause of difficult modifications: - Unwanted redundancies - Anomalies can occur with unwanted redundancies FD: - Like a candidate key constraint - Must be able to record FDs - Normalization can be performed by CASE tool: necessary for large databases - BCNF: revised definition of 3NF; most important in practice Role of normalization: - Refinement rather than initial design (my expert opinion) - Can be applied after conversion or directly to an ERD Normalization objective: - Update biased: make a db easier to change - Normalization makes many tables - Difficult and inefficient to query - If an FD does not cause a significant anomaly, perhaps relax from full BCNF - Denormalization can be done to improve performance (Chapter 8)
237
Physical Database Design
Chapter 8 Physical Database Design Welcome to Chapter 8 on Physical Database Design. This is the final phase of the database development process. Physical database design transforms a table design from the logical design phase to database implementation. Objectives Describe the inputs, outputs, and objectives of physical database design List characteristics of sequential, Btree, hash, and bitmap index file structures Appreciate the difficulties of performing physical database design and the need for periodic review of physical database design choices Understand the trade-offs in index selection and denormalization decisions Understand the need for good tools to help make physical database design decisions Understand the choices made by a query optimizer and the areas in which optimization decisions can be improved
238
Outline Overview of Physical Database Design File Structures
Query Optimization Index Selection Additional Choices in Physical Database Design You need to understand physical database design in order to achieve an efficient implementation of your table design. To become proficient in physical database design, you need to understand the process and environment. The process of physical database design includes the inputs, outputs, and objectives. The process works along with two critical parts of the environment: file structures and query optimization. Index selection is the most important choice of physical database design. For additional choices in physical database design, this chapter presents denormalization, record formatting, and parallel processing as techniques to improve database performance.
239
Overview of Physical Database Design
Sequence of decision-making processes. Decisions involve the storage level of a database: file structure and optimization choices. Importance of providing detailed inputs and using tools that are integrated with DBMS usage of file structures and optimization decisions Overview of Physical Database Design Storage Level of Databases Objectives and Constraints Inputs, Outputs, and Environment Difficulties
240
Storage Level of Databases
Closest to the hardware and operating system. Physical records organized into files. The number of physical record accesses is an important measure of database performance. Difficult to predict physical record accesses A physical record (also known as a block or page) is a collection of bytes that are transferred between volatile storage in main memory and stable storage on a disk. Main memory is considered volatile storage because the contents of main memory may be lost if a failure occurs.
241
Logical Records (LR) and Physical Records (PR)
This figure depicts relationships between logical records (rows of a table) and physical records stored in a file. Typically, a physical record contains multiple logical records (picture (a)). A large logical record may be split over multiple physical records (picture (b)). Another possibility is that logical records from more than one table are stored in the same physical record (picture (c)).
242
Transferring Physical Records
The DBMS and operating system work together to satisfy requests for logical records made by applications. This figure depicts the process of transferring physical and logical records between a disk, DBMS buffers, and application buffers. Normally, the DBMS and the application have separate memory areas known as buffers. When an application makes a request for a logical record, the DBMS locates the physical record containing it. In the case of a read operation, the operating system transfers the physical record from disk to the memory area of the DBMS. The DBMS then transfers the logical record to the application’s buffer. In the case of a write operation, the transfer process is reversed.
243
Objectives Minimize response time to access and change a database.
Minimizing computing resources is a substitute measure for response time. Database resources Physical record transfers CPU operations Communication network usage (distributed processing) The goal of physical database design is to minimize response time to access and change a database. Because response time is difficult to estimate directly, minimizing computing resources is used as a substitute measure. The resources that are consumed by database processing are physical record transfers, central processing unit (CPU) operations, main memory, and disk space. In distributed environments, minimizing communication network usage can cause higher response times so communication network usage is not minimized.
244
Constraints Main memory and disk space
Minimizing main memory and disk space can lead to high response times. Useful to consider additional memory and disk space For most choices in physical database design, the amounts of main memory and disk space are usually fixed. In other words, main memory and disk space are constraints of the physical database design process. As with constraints in other optimization problems, you should consider the effects of changing the given amounts of main memory and disk space. Increasing the amounts of these resources can improve performance. The amount of performance improvement may depend on many factors such as the DBMS, the table design, and the applications using the database. CPU usage also can be a factor in some database applications. For example, sorting requires a large number of comparisons and assignments. However, these operations performed by the CPU are many times faster than a physical record access.
245
Combined Measure of Database Performance
Weight combines physical record accesses and CPU usage Weight is usually close to 0 because many CPU operations can be performed in the time to perform one physical record transfer. The objective of physical database design is to minimize the combined measure for all applications using the database. Combined Measure of Database Performance: PRA + W * CPU-OP where PRA is the number of physical record accesses, CPU-OP is the number of CPU operations such as comparisons and assignments, and W is a weight, a real number between 0 and 1. Generally, improving performance on retrieval applications comes at the expense of update applications and vice-versa. Therefore, an important theme of physical database design is to balance the needs of retrieval and update applications. The measures of performance are too detailed to estimate by hand except for simple situations. Complex optimization software calculates estimates using detailed cost formulas. The optimization software is usually part of the SQL compiler. Understanding the nature of the performance measure helps one to interpret choices made by the optimization software.
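As a purely illustrative calculation with assumed numbers, an application performing 1,000 physical record accesses and 200,000 CPU operations with a weight of 0.001 has a combined measure of
PRA + W * CPU-OP = 1,000 + 0.001 * 200,000 = 1,000 + 200 = 1,200
so physical record accesses still dominate the measure even with a large number of CPU operations.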
246
Inputs, Outputs, and Environment
Physical database design consists of a number of different inputs and outputs as depicted in this figure. The starting point is the table design from the logical database design phase. The table and application profiles (Inputs) are used specifically for physical database design. The most important outputs are decisions about file structures and data placement. Knowledge about file structures and query optimization is in the environment of physical database design rather than being an input. Physical database design is better characterized as a sequence of decision-making processes rather than one large process.
247
Difficulty of physical database design
Number of decisions Relationship among decisions Detailed inputs Complex environment Uncertainty in predicting physical record accesses Physical database design is difficult due to the following factors. - The number of possible choices available to the designer can be large. For databases with many fields, the number of possible choices can be too large to evaluate even on large computers. - The decisions cannot be made in isolation of each other. For example, file structure decisions for one table can influence the decisions for other tables. - The quality of decisions is limited to the precision of the table and application profiles. However, these inputs can be large and difficult to collect. In addition, the inputs change over time so that periodic collection is necessary. - The environment knowledge is specific to each DBMS. Much of the knowledge is either a trade secret or too complex to fully know. - The number of physical record accesses is difficult to predict because of uncertainty about the contents of DBMS buffers. The uncertainty arises because the mix of applications accessing the database changes over time.
248
Inputs of Physical Database Design
Physical database design requires inputs specified in sufficient detail. Table profiles used to estimate performance measures. Application profiles provide importance of applications. Inputs specified without enough detail can lead to poor decisions in physical database design and query optimization.
249
Table Profile Tables Columns
Number of rows Number of physical records Columns Number of unique values Distribution of values Correlation of columns Relationships: distribution of related rows A table profile summarizes a table as a whole, the columns within a table, and the relationships between tables. Most enterprise DBMSs have programs to generate statistics. The designer may need to periodically run the statistics program so that the profiles do not become obsolete. For large databases, table profiles may be estimated on samples of the database. Using the entire database can be too time consuming and disruptive. For column and relationship summaries, the distribution conveys the number of rows and related rows for column values. The distribution of values can be specified in a number of ways. A simple way is to assume that the column values are uniformly distributed. Uniform distribution means that each value has an equal number of rows. A more detailed way to specify a distribution is to use a histogram where the x-axis represents column ranges and the y-axis represents the number of rows containing the range of values.
250
Histogram Specify distribution of values Two dimensional graph
Column values on the x axis Number of rows on the y axis Variations Equal-width: do not work well with skewed data Equal-height: control error by the number of ranges A table profile summarizes a table as a whole, the columns within a table, and the relationships between tables. Most enterprise DBMSs have programs to generate statistics. The designer may need to periodically run the statistics program so that the profiles do not become obsolete. For large databases, table profiles may be estimated on samples of the database. Using the entire database can be too time consuming and disruptive. For column and relationship summaries, the distribution conveys the number of rows and related rows for column values. The distribution of values can be specified in a number of ways. A simple way is to assume that the column values are uniformly distributed. Uniform distribution means that each value has an equal number of rows. A more detailed way to specify a distribution is to use a histogram where the x-axis represents column ranges and the y-axis represents the number of rows containing the range of values.
251
Equal-Width Histogram
For example, the first bar in Figure 8.4 means that 9,000 rows have a salary between $10,000 and $50,000. Traditional equal-width histograms do not work well with skewed data because a large number of ranges are necessary to control estimation errors. In the slide, estimating the number of employee rows using the first two ranges may lead to large estimation errors because more than 97% of employees have salaries less than $80,000. For example, you would calculate about 1,125 rows (12.5% of 9,000) to estimate the number of employees earning between $10,000 and $15,000 using the slide. However, the actual number of rows is much lower because few employees earn less than $15,000.
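The 1,125-row figure follows from assuming a uniform distribution inside the first range: the $10,000-$15,000 sub-range covers 12.5% of the $10,000-$50,000 range, so
estimated rows = ((15,000 - 10,000) / (50,000 - 10,000)) * 9,000 = 0.125 * 9,000 = 1,125
which badly overestimates the true count when the data within the range are skewed toward higher salaries.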
252
Equal-Height Histogram
Because skewed data can lead to poor estimates using traditional (equal-width) histograms, most DBMSs use equal-height histograms as shown in Figure 8.5. In an equal-height histogram, the ranges are determined so that each range has about the same number of rows. Thus the width of the ranges varies, but the height is about the same. Most DBMSs use equal-height histograms because the maximum and expected estimation errors can be controlled by increasing the number of ranges.
253
Application profiles Application profiles summarize the queries, forms, and reports that access a database. Application profiles summarize the queries, forms, and reports that access a database. For forms, the frequency of using the main form and the subform for each kind of operation (insert, update, delete, and retrieval) should be specified. For queries and reports, the distribution of parameter values encodes the number of times the query/report is executed with various parameter values.
254
File structures Selecting among alternative file structures is one of the most important choices in physical database design. In order to choose intelligently, you must understand characteristics of available file structures.
255
Sequential Files Simplest kind of file structure
Unordered: insertion order Ordered: key order Simple to maintain Provide good performance for processing large numbers of records
256
Unordered Sequential File
Inserting a New Logical Record into an Unordered Sequential File: New logical records are appended to the last physical record in the file. Unordered files are sometimes known as heap files because of the lack of order. The primary advantage of unordered sequential files is fast insertion. However, when logical records are deleted, insertion becomes more complicated.
257
Ordered Sequential File
Inserting a New Logical Record into an Ordered Sequential File. Ordered sequential files can be preferable to unordered sequential files when ordered retrieval is needed. Logical records are arranged in key order where the key can be any column, although it is often the primary key. Ordered sequential files are faster when retrieving in key order, either the entire file or a subset of records. The primary disadvantage to ordered sequential files is slow insertion speed. This figure demonstrates that records must sometimes be rearranged during the insertion process. The rearrangement process can involve movement of logical records between blocks and maintenance of an ordered list of physical records.
258
Hash Files Support fast access by unique key value
Convert a key value into a physical record address Mod function: typical hash function Divisor: large prime number close to the file capacity Physical record number: hash function plus the starting physical record number Hash File is a specialized file structure that supports search by unique key. The basic idea behind hash files is a function that converts a key value into a physical record address. The mod function (remainder division) is a simple hash function.
259
Example: Hash Function Calculations for StdSSN Key
This example applies the mod function to the StdSSN column values. For simplicity, assume that the file capacity is 100 physical records. The divisor for the mod function is 97, a large prime number close to the file capacity. The physical record number is the result of the hash function result plus the starting physical record number, assumed to be 150.
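A worked illustration with a made-up key value under the same assumptions (divisor 97, starting physical record 150): for a StdSSN value of 123456789,
123456789 mod 97 = 39
physical record number = 39 + 150 = 189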
260
Hash File after Insertions
This figure shows selected physical records of the hash file from the previous slide.
261
Collision Handling Example
During insertion, collision may occur. Hash functions may assign more than one key to the same physical record address. A collision occurs when two keys hash to the same physical record address. As long as the physical record has free space, a collision is no problem. However, if the original or home physical record is full, a collision-handling procedure locates a physical record with free space. This figure demonstrates the linear probe procedure for collision handling. In the linear probe procedure, a logical record is placed in the next available physical record if its home address is occupied. To retrieve a record by its key, the home address is initially searched. If the record is not found in its home address, a linear probe is initiated. The existence of collisions highlights a potential problem with hash files. If collisions do not occur often, insertions and retrievals are very fast. If collisions occur often, insertions and retrievals can be slow. The likelihood of a collision depends on how full the file is. Generally, if the file is less than 70 percent full, collisions do not occur often. However, maintaining a hash file that is only 70 percent full can be a problem if the table grows. If the hash file becomes too full, a reorganization is necessary. A reorganization can be time consuming and disruptive because a larger hash file is allocated and all logical records are inserted into the new file.
262
Hash File Limitations Poor performance for sequential search
Reorganization when capacity exceeds 70% Dynamic hash files reduce random search performance but eliminate periodic reorganization Generally, if the file is less than 70 percent full, collisions do not occur often. However, maintaining a hash file that is only 70 percent full can be a problem if the table grows. If the hash file becomes too full, a reorganization is necessary. A reorganization can be time-consuming and disruptive because a larger hash file is allocated and all logical records are inserted into the new file. Good hash functions tend to spread logical records uniformly among physical records. Because of gaps between physical records, sequential search may examine empty physical records. To eliminate reorganizations, dynamic hash files have been proposed. In a dynamic hash file, periodic reorganization is never necessary and search performance does not degrade after many insert operations. However, the average number of physical record accesses to retrieve a record may be slightly higher as compared to a static hash file that is not too full. The basic idea in dynamic hashing is that the size of the hash file grows as records are inserted.
263
Multi-Way Tree (Btrees) Files
A popular file structure supported by most DBMSs. Btree provides good performance on both sequential search and key search. Btree characteristics: Balanced Bushy: multi-way tree Block-oriented Dynamic Sequential files perform well on sequential search but poorly on key search, while hash files perform well on key search but poorly on sequential search; the Btree is a compromise and a widely used file structure. Balanced: all leaf nodes (nodes without children) reside on the same level of the tree. A balanced tree ensures that all leaf nodes can be retrieved with the same access cost. Bushy: the number of branches from a node is large, perhaps 10 to 100 branches. Multi-way, meaning more than two, is a synonym for bushy. The width (number of arrows from a node) and height (number of nodes between root and leaf nodes) are inversely related: increase width, decrease height. The ideal Btree is wide (bushy) but short (few levels). Block-Oriented: each node in a Btree is a block or physical record. To search a Btree, you start in the root node and follow a path to a leaf node containing data of interest. The height of a Btree is important because it determines the number of physical record accesses for searching. Dynamic: the shape of a Btree changes as logical records are inserted and deleted. Periodic reorganization is never necessary for a Btree.
264
Structure of a Btree of Height 3
A Btree is a special kind of tree as depicted in this figure. A tree is a structure in which each node has at most one parent except for the root or top node. The Btree structure possesses a number of characteristics, discussed in the following list, that make it a useful file structure. Some of the characteristics are possible meanings for the letter “B” in the name. - Balanced: all leaf nodes (nodes without children) reside on the same level of the tree. In this figure, all leaf nodes are two levels beneath the root. A balanced tree ensures that all leaf nodes can be retrieved with the same access cost. - Bushy: the number of branches from a node is large, perhaps 10 to 100 branches. Multi-way, meaning more than two, is a synonym for bushy. The width (number of arrows from a node) and height (number of nodes between root and leaf nodes) are inversely related: increase width, decrease height. The ideal Btree is wide (bushy) but short (few levels). - Block-Oriented: each node in a Btree is a block or physical record. To search a Btree, you start in the root node and follow a path to a leaf node containing data of interest. The height of a Btree is important because it determines the number of physical record accesses for searching. - Dynamic: the shape of a Btree changes as logical records are inserted and deleted. Periodic reorganization is never necessary for a Btree. The next subsection describes node splitting and concatenation, the ways that a Btree changes as records are inserted and deleted. - Ubiquitous: the Btree is a widely implemented and used file structure.
265
Btree Node Containing Keys and Pointers
This figure depicts the contents of a node in the tree. Each node consists of pairs with a key value and a pointer, sorted by key value. The pointer identifies the physical record that contains the logical record with the key value. Other data in a logical record, besides the key, do not usually reside in the nodes. The other data may be stored in separate physical records or in the leaf nodes. An important property of a Btree is that each node, except the root, must be at least half full. The physical record size, the key size, and the pointer size determine node capacity.
266
Btree Insertion Examples
Insertions are handled by placing the new key in a nonfull node or by splitting nodes, as depicted in this figure. In the partial Btree in (a), each node contains a maximum of four keys. Inserting the key value 55 in (b) requires rearrangement in the right-most leaf node. Inserting the key value 58 in (c) requires more work because the right-most leaf node is full. To accommodate the new value, the node is split into two nodes and a key value is moved to the root node. When a split occurs at the root, the tree grows another level.
267
Btree Deletion Examples
Deletions are handled by removing the deleted key from a node and repairing the structure if needed as demonstrated in this figure. If the node is still at least half-full, no additional action is necessary (figure(b)). However, if the node is less than half-full, the structure must be changed. If a neighboring node contains more than half capacity, a key can be borrowed as shown in Figure (c). If a key cannot be borrowed, nodes must be concatenated.
268
Cost of Operations The height of a Btree dominates the number of physical record accesses per operation. Logarithmic search cost Upper bound of height: log function Log base: minimum number of keys in a node Insertion cost Cost to locate the nearest key Cost to change nodes The height of a Btree is small even for a large table when the branching factor is large. The cost in terms of physical record accesses to find a key is less than or equal to the height. The cost to insert a key includes the cost to locate the nearest key plus the cost to change nodes. In the best case (Btree insertion example (b)), the additional cost is one physical record access to change the index record and one physical record access to write the row data. The worst case occurs when a new level is added to the tree. Even in the worst case, the height of the tree still dominates.
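As a rough, hypothetical worked example of the slide's logarithmic bound (the log base is the minimum number of keys per node; the branching factor and row count below are illustrative, not from the textbook):
h \lesssim \log_{d} n, \qquad d = 50,\; n = 1{,}000{,}000 \;\Rightarrow\; \log_{50} 1{,}000{,}000 \approx 3.5
so a Btree over a million rows needs only about three or four levels, and a key search costs only a handful of physical record accesses.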
269
B+Tree Provides improved performance on sequential and range searches.
In a B+tree, all keys are redundantly stored in the leaf nodes. Sequential searches can be a problem with Btrees. To perform a range search, the search procedure must travel up and down the tree. This procedure has problems with retention of physical records in memory. Operating systems may replace physical records if there have not been recent accesses. Because some time may elapse before a parent node is accessed again, the operating system may replace it with another physical record if main memory becomes full. Thus, another physical record access may be necessary when the parent node is accessed again. To ensure that physical records are not repeatedly replaced during range searches, the B+tree variation is usually implemented.
270
B+tree Illustration A B+tree has two parts: an index set and a sequence set, which contains the leaf nodes. All keys reside in the leaf nodes even if a key appears in the index set. The leaf nodes are connected together so that sequential searches do not need to move up the tree. Once the initial key is found, the search process accesses only nodes in the sequence set.
271
Index Matching Determining usage of an index for a query
Complexity of condition determines match. Single column indexes: =, <, >, <=, >=, IN <list of values>, BETWEEN, IS NULL, LIKE ‘Pattern’ (meta character not the first symbol) Composite indexes: more complex and restrictive rules Determining whether an index can be used in a query is known as index matching. When a condition in a WHERE clause references an indexed column, the DBMS must determine if the index can be used. The complexity of a condition determines whether an index can be used. For single column indexes, an index matches a condition if the column appears alone without functions or operators and the comparison operator matches one of the following items. For composite indexes involving more than one column, the matching rules are more complex and restrictive. Composite indexes are ordered by the most significant (first column in the index) to the least significant (last column in the index) column.
272
Index Matching Examples
C2 BETWEEN 10 AND 20: match on C2 C3 IN (10,20): match on C3 C1 <> 10: no match C4 LIKE 'A%': match on C4 C4 LIKE '%A': no match C2 = 5 AND C3 = 20 AND C1 = 10: matches on index with C1, C2, and C3 For complete examples, see Chapter08FiguresTables document
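A hedged sketch of the composite index behind the last example; the table name T and index name are hypothetical. Conditions on the leading columns can match the composite index, while a condition on a trailing column alone typically cannot.
-- Hypothetical composite index ordered by C1 (most significant), then C2, then C3
CREATE INDEX T_C1C2C3 ON T (C1, C2, C3);
-- Matches the composite index: equality conditions cover the leading columns C1, C2, and C3
SELECT * FROM T WHERE C2 = 5 AND C3 = 20 AND C1 = 10;
-- Typically does not match it: C3 is not the leading column of the index
SELECT * FROM T WHERE C3 = 20;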
273
Bitmap Index Can be useful for stable columns with few values Bitmap:
String of bits: 0 (no match) or 1 (match) One bit for each row Bitmap index record Column value Bitmap DBMS converts bit position into row identifier. Btree and hash files work best for columns with unique values. For non-unique columns, Btree index nodes can store a list of row identifiers instead of the individual row identifier stored for unique columns. However, if a column has few values, the list of row identifiers can be very long. As an alternative structure for columns with few values, many DBMSs support bitmap indexes. A bitmap contains a string of bits (0 or 1 values) with one bit for each row of a table. A record of a bitmap column index contains a column value and a bitmap. A 0 value in a bitmap indicates that the associated row does not have the column value. A 1 value indicates that the associated row has the column value. The DBMS provides an efficient way to convert a position in a bitmap to a row identifier.
274
Bitmap Index Example Faculty Table Bitmap Index on FacRank
This slide depicts a bitmap column index for a sample Faculty table. A bitmap contains a string of bits (0 or 1 values) with one bit for each row of a table. In this slide, the length of the bitmap is 12 positions because there are 12 rows in the sample Faculty table. Bitmap Index on FacRank
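In Oracle, a bitmap index like the one depicted could be created as sketched below; the index name is an assumption.
-- One index record per FacRank value, each holding a bitmap with one bit per Faculty row
CREATE BITMAP INDEX FacRankIdx ON Faculty (FacRank);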
275
Bitmap Join Index Bitmap identifies rows of a related table.
Represents a precomputed join Can define for a join column or a non-join column Typically used in query dominated environments such as data warehouses (Chapter 16) In a bitmap join index, the bitmap identifies the rows of a related table, not the table containing the indexed column. Thus, a bitmap join index represents a precomputed join from a column in a parent table to the rows of a child table that join with rows of the parent table. A bitmap join index can be defined for a join column such as FacSSN or a non-join column such as FacRank.
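A hedged sketch of an Oracle bitmap join index for the FacRank case mentioned above; the index name is an assumption, and Oracle generally requires a primary key or unique constraint on the parent join column (Faculty.FacSSN).
-- Each index record pairs a FacRank value with a bitmap identifying Offering rows
-- whose joined Faculty row has that rank (a precomputed join)
CREATE BITMAP INDEX OffFacRankBJI ON Offering (Faculty.FacRank)
FROM Offering, Faculty
WHERE Offering.FacSSN = Faculty.FacSSN;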
276
Summary of File Structures
In the first row, hash files can be used for sequential access but there may be extra physical records because keys are evenly spread among physical records. In the second row, unordered and ordered sequential files must examine on average half the physical records (linear). Hash files examine a constant number (usually close to 1) of physical records assuming that the file is not too full. Btrees have logarithmic search costs because of the relationship between the height, the log function, and search cost formulas. File structures can store all the data of a table (primary file structure) or store only key data along with pointers to the data records (secondary file structure). A secondary file structure or index provides an alternative path to the data. A bitmap index supports range searches by performing union operations on the bitmaps for each column value in the range.
277
Query Optimization Query optimizer determines implementation of queries. Major improvement in software productivity Improve performance with knowledge about the optimization process In most relational DBMSs, you do not have the choice of how queries are implemented on the physical database. The query optimization component assumes this responsibility. Your productivity increases because you do not need to make these tedious decisions. However, you can sometimes improve the optimization process if you understand it. To provide you with an understanding of the optimization process, this section describes the tasks performed and discusses tips to improve optimization results.
278
Translation Tasks When you submit an SQL statement for execution, the DBMS translates your query in four phases as shown in this figure. The first and fourth phases are common to any computer language translation process. The second phase has some unique aspects. The third phase is unique to translation of database languages. The first phase, Syntax and Semantic Analysis, analyzes a query for syntax and simple semantic errors. Syntax errors involve misuse of keywords such as keyword misspelling. Semantic errors involve misuse of columns and tables such as incompatible data types. The second phase, Query Transformation, transforms a query into a simplified and standardized format so that the query can be executed faster. While the simplification could be eliminating redundant parts of a logical expression, the standardized format is usually based on relational algebra. The third phase, Access Plan Evaluation, determines how to implement the rearranged relational algebra expression as an access plan. An access plan is a tree that encodes decisions about file structures to access individual tables, the order of joining tables, and the algorithm to join tables. Typically, the query optimization component evaluates a large number of access plans; this evaluation can involve a significant amount of time when the query contains more than four tables. Each operation in an access plan has a corresponding cost formula that estimates the physical record accesses and CPU operations. The cost formulas use table profiles to estimate the number of rows in a result. The query optimization component chooses the access plan with the lowest cost. The last phase, Access Plan Execution, executes the selected access plan. The query optimization component either generates machine code or interprets the access plan. Execution of machine code results in faster response than interpreting an access plan. However, most DBMSs interpret access plans because of the variety of hardware supported. The performance difference between interpretation and machine code execution is usually not significant for most users.
279
Access Plan Evaluation
Optimizer evaluates thousands of access plans Access plans vary by join order, file structures, and join algorithm. Some optimizers can use multiple indexes on the same table. Access plan evaluation can consume significant resources The query optimization component evaluates a large number of access plans. Access plans vary by join orders, file structures, and join algorithms. For file structures, some optimization components can consider set operations to combine the results of multiple indexes on the same table. The query optimization component can evaluate many more access plans than you can mentally comprehend. Typically, the query optimization component evaluates thousands of access plans. Evaluating access plans can involve a significant amount of time when the query contains more than four tables.
280
Access Plan Example 1 An access plan indicates how to implement a query as operations on files, as depicted in this slide. In an access plan, the leaf nodes are individual tables in the query, and the arrows point upwards to indicate the flow of data. The nodes above the leaf nodes indicate decisions about accessing individual tables. In this slide, Btree indexes are used to access individual tables. The first join combines the Enrollment and the Offering tables. The Btree file structures provide the sorting needed for the merge join algorithm. The second join combines the result of the first join with the Faculty table. The intermediate result must be sorted on FacSSN before the merge join algorithm can be used.
281
Access Plan Example 2 This plan shows a variation of the access plan from the previous slide in which the join order is changed.
282
Join Algorithms Nested loops: inner and outer loops; universal
Sort merge: join column sorting or indexes Hybrid join: combination of nested loops and sort merge Hash join: uses internal hash table Star join: uses bitmap join indexes For each join operation in a query, the optimization component considers each supported join algorithm. For the nested loops and the hybrid algorithms, the optimization component also must choose the outer table and the inner table. All algorithms except the star join involve two tables at a time. The star join can combine any number of tables matching the star pattern (a child table surrounded by parent tables in 1-M relationships). The nested loops algorithm can be used with any join operation, not just an equi-join operation.
283
Improving Optimization Results
Monitor poorly performing access plans Look for problems involving table profiles and query coding practices Use hints carefully to improve results Override optimizer judgment Cover file structures, join algorithms, and join orders Use as a last resort To improve poor decisions in access plans, some enterprise DBMSs allow hints that influence the choice of access plans. For example, Oracle allows hints to choose the optimization goal, the file structures to access individual tables, the join algorithm, and the join order. Hints should be used with caution because they override the judgment of the optimizer. Hints with join algorithms and join orders are especially problematic because of the subtlety of these decisions. Overriding the judgment of the optimizer should only be done as a last resort after determining the cause of poor performance. In many cases, the database administrator can fix problems with table profile deficiencies and query coding style to improve performance rather than override the judgment of the optimizer.
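As an illustration of the Oracle hint style mentioned above (the index name is hypothetical and this is only a sketch, not a recommendation):
-- Ask the optimizer to access Faculty through the named index instead of letting it choose;
-- override the optimizer only as a last resort
SELECT /*+ INDEX(Faculty FacDeptIdx) */ FacSSN, FacLastName
FROM Faculty
WHERE FacDept = 'FIN';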
284
Table Profile Deficiencies
Detailed and current statistics needed Beware of uniform value assumption and independence assumption Use hints to overcome optimization blind spots Estimation of result size for parameterized queries Correlated columns: multiple index access may be useful A hint can be useful for conditions involving parameter values. If the database administrator knows that the typical parameter values result in the selection of few rows, a hint can be used to force the optimization component to use an index. An optimization component sometimes needs detailed statistics on combinations of columns. If a combination of columns appears in the WHERE clause of a query, statistics on the column combination are important if the columns are not independent. Most optimization components assume that combinations of columns are statistically independent to simplify the estimation of the number of rows. Unfortunately, few DBMSs maintain statistics on combinations of columns. If a DBMS does not maintain statistics on column combinations, a database designer may want to use hints to override the judgment of the DBMS when a joint condition in a WHERE clause generates few rows. Using a hint could force the optimization component to combine indexes when accessing a table rather than using a sequential table scan.
285
Query Coding Practices
Avoid functions on indexable columns Eliminate unnecessary joins For conditions on join columns, test the condition on the parent table. Do not use the HAVING clause for row conditions. Avoid repetitive binding of complex queries Beware of queries that use complex views - You should not use functions on indexable columns as functions eliminate the opportunity to use an index. You should be especially aware of implicit type conversions even if a function is not used. An implicit type conversion occurs if the data type of a column and the associated constant value do not match. - For queries involving 1-M relationships in which there is a condition on the join column, make the condition on the parent table rather than the child table. - Conditions involving simple comparisons of columns in the GROUP BY clause belong in the WHERE clause, not the HAVING clause. Moving these conditions to the WHERE clause will eliminate rows sooner, thus providing faster execution.
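Two hedged sketches of these coding practices on the university tables, assuming an index exists on OffYear:
-- An expression on the indexed column defeats index matching ...
SELECT OfferNo FROM Offering WHERE OffYear + 1 = 2007;
-- ... while the equivalent rewrite leaves the column bare and can use the index
SELECT OfferNo FROM Offering WHERE OffYear = 2006;
-- Row conditions belong in the WHERE clause, not the HAVING clause, so rows are eliminated early
SELECT OffYear, COUNT(*) AS NumOfferings
FROM Offering
WHERE OffTerm = 'FALL'
GROUP BY OffYear;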
286
Index Selection Most important decision Difficult decision
Choice of clustered and nonclustered indexes Index selection is the most important decision available to the physical database designer. However, it also can be one of the most difficult decisions. As a designer, you need to understand why index selection is difficult and the limitations of performing index selection without an automated tool.
287
Clustering Index Example
In a clustering index, the order of the rows is close to the index order. Close means that physical records containing rows will not have to be accessed more than one time if the index is accessed sequentially. This figure shows the sequence set of a B+tree index pointing to associated rows inside physical records. Note that for a given node in the sequence set, most associated rows are clustered inside the same physical record. Ordering the row data by the index field is a simple way to make a clustered index.
288
Nonclustering Index Example
In contrast to a clustering index, a nonclustering index does not have this closeness property. In a nonclustering index, the order of the rows is not related to the index order. This figure shows that the same physical record may be repeatedly accessed when using the sequence set. The pointers from the sequence set nodes to the rows cross many times, indicating that the index order is different than the row order.
289
Inputs and Outputs of Index Selection
Index selection involves choices about clustered and non-clustered indexes as shown in this figure. It is usually assumed that each table is stored in one file. The SQL statements indicate the database work to be performed by applications. The weights should combine the frequency of a statement with its importance. The table profiles must be specified in the same level of detail as required for query optimization. Usually, the index selection problem is restricted to Btree indexes and separate files for each table.
290
Trade-offs in Index Selection
Balance retrieval against update performance Nonclustering index usage: Few rows satisfy the condition in the query Join column usage if a small number of rows result in child table Clustering index usage: Larger number of rows satisfy a condition than for nonclustering index Use in sort merge join algorithm to avoid sorting More expensive to maintain A clustering index can improve retrievals under more situations than a non-clustering index. A clustering index is useful in the same situations as a nonclustering index except that the number of resulting rows can be larger. Merging rows is often a fast way to join tables if the tables do not need to be sorted (clustered indexes exist). Clustering index choices are more sensitive to maintenance than nonclustering index choices. Clustering indexes are more expensive to maintain than nonclustering indexes because the data file must be changed similar to an ordered sequential file.
291
Difficulties of Index Selection
Application weights are difficult to specify. Distribution of parameter values needed Behavior of the query optimization component must be known. The number of choices is large. Index choices can be interrelated. Index selection is difficult to perform well for a variety of reasons: - Application weights are difficult to specify. Judgments that combine frequency and importance can make the result subjective. - Distribution of parameter values is sometimes needed. Many SQL statements in reports and forms use parameter values. If parameter values vary from being highly selective to not very selective, selecting indexes is difficult. - The behavior of the query optimization component must be known. Even if an index appears useful for a query, the query optimization component must use it. There may be subtle reasons why the query optimization component does not use an index, especially a non-clustering index. - The number of choices is large. Even if indexes on combinations of columns are ignored, the theoretical number of choices is exponential in the number of columns (2^NC, where NC is the number of columns). Although many of these choices can be easily eliminated, the number of practical choices is still quite large. - Index choices can be interrelated. The interrelationships can be subtle especially when choosing indexes to improve join performance. An index selection tool can help with the last three problems. A good tool should use the query optimization component to derive cost estimates for each application under a given choice of indexes. However, a good tool cannot help alleviate the difficulty of specifying application weights and parameter value distributions.
292
Selection Rules I Rule 1: A primary key is a good candidate for a clustering index. Rule 2: To support joins, consider indexes on foreign keys. Rule 3: A column with many values may be a good choice for a non-clustering index if it is used in equality conditions. Rule 4: A column used in highly selective range conditions is a good candidate for a non-clustering index. Rule 5: A combination of columns used together in query conditions may be good candidates for nonclustering indexes if the joint conditions return few rows, the DBMS optimizer supports multiple index access, and the columns are stable. Despite the difficulties previously discussed, you usually can avoid poor index choices by following some simple rules.
293
Selection Rules II Rule 6: A frequently updated column is not a good index candidate. Rule 7: Volatile tables (lots of insertions and deletions) should not have many indexes. Rule 8: Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions. Rule 9: Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table.
294
Index Creation To create the indexes, the CREATE INDEX statement can be used. The word following the INDEX keyword is the name of the index. CREATE INDEX is not part of SQL:2003. Examples
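Hedged examples of the statement (index names are hypothetical; clustering and other physical options are DBMS-specific and omitted):
-- The word after INDEX names the index; UNIQUE restricts the indexed column to unique values
CREATE UNIQUE INDEX StdSSNIndex ON Student (StdSSN);
-- Nonunique index on a foreign key to support joins
CREATE INDEX OfferingFacSSNIndex ON Offering (FacSSN);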
295
Denormalization Additional choice in physical database design
Denormalization combines tables so that they are easier to query. Use carefully because normalized designs have important advantages. Although index selection is the most important decision of physical database design, there are other decisions that can significantly improve performance.
296
Normalized designs Better update performance
Require less coding to enforce integrity constraints Support more indexes to improve query performance Denormalization should always be done with extreme care because a normalized design has important advantages.
297
Repeating Groups Collection of associated values.
Normalization rules force repeating groups to be stored in a child (M) table separate from the associated parent (1) table. If a repeating group is always accessed with its associated parent table, denormalization may be a reasonable alternative. Repeating groups are an additional situation under which denormalization may be justified.
298
Denormalizing a Repeating Group
This figure shows a denormalization example of quarterly sales data. Although the denormalized design does not violate BCNF, it is less flexible for updating than the normalized design. The normalized design supports an unlimited number of quarterly sales as compared to only four quarters of sales results for the denormalized design. However, the denormalized design does not require a join to combine territory and sales data.
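A hedged sketch of the two designs for the quarterly sales example; the table and column names are assumptions rather than the textbook's.
-- Normalized: one row per territory per quarter (unlimited quarters, but a join is needed for reporting)
CREATE TABLE TerrSales
( TerrNo   CHAR(8) NOT NULL,
  Quarter  CHAR(6) NOT NULL,
  Sales    DECIMAL(12,2),
  PRIMARY KEY (TerrNo, Quarter) );
-- Denormalized: the repeating group stored with its parent (no join, but fixed at four quarters)
CREATE TABLE Territory
( TerrNo    CHAR(8) PRIMARY KEY,
  TerrName  VARCHAR(50),
  Qtr1Sales DECIMAL(12,2),
  Qtr2Sales DECIMAL(12,2),
  Qtr3Sales DECIMAL(12,2),
  Qtr4Sales DECIMAL(12,2) );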
299
Denormalizing a Generalization Hierarchy
Generalization Hierarchies can result in many tables. If queries often need to combine these separate tables, it may be reasonable to store the separate tables as one table. This figure demonstrates denormalization of the Emp, HourlyEmp, and SalaryEmp tables. They have 1-1 relationships because they represent a generalization hierarchy. Although the denormalized design does not violate BCNF, the combined table may waste much space because of null values. However, the denormalized design avoids the outer join operator to combine the tables.
300
Codes and Meanings Normalization rules require that foreign keys be stored alone to represent 1-M relationships. If a foreign key represents a code, the user often requests an associated name or description in addition to the foreign key value. For example, the user may want to see the state name in addition to the state code. Storing the name or description column along with the code violates BCNF, but it eliminates some join operations. If the name or description column is not changed often, denormalization may be a reasonable choice. This figure demonstrates denormalization for the Dept and Emp tables. In the denormalized design, the DeptName column has been added to the Emp table.
301
Record Formatting Record formatting decisions involve compression and derived data. Compression is a trade-off between input-output and processing effort. Derived data is a trade-off between query and update operations. Record formatting is another choice to improve database performance. With an increasing emphasis on storing complex data types such as audio, video, and images, compression is becoming an important issue. Compression reduces the number of physical records transferred but may require considerable processing effort to compress and decompress the data. For query purposes, storing derived data reduces the need to retrieve data needed to calculate the derived data. However, updates to the underlying data require additional updates to the derived data. Storing derived data to reduce join operations may be reasonable.
302
Storing Derived Data to Improve Query Performance
This figure demonstrates derived data in the Order table. If the total amount of an order is frequently requested, storing the derived column OrdAmt may be reasonable. Calculating order amount requires a summary or aggregate calculation of related OrdLine and Product rows to obtain the Qty and ProdPrice columns. Storing the OrdAmt column avoids two join operations.
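A hedged sketch of the trade-off; OrderTbl stands in for the Order table, and OrdNo and ProdNo are assumed column names. Without the derived column, the amount is computed with joins and aggregation; with it, a single-table read suffices, but changes to OrdLine or Product must also maintain OrdAmt.
-- Without derived data: aggregate over the joined OrdLine and Product rows
SELECT OrdLine.OrdNo, SUM(OrdLine.Qty * Product.ProdPrice) AS OrdAmt
FROM OrdLine, Product
WHERE OrdLine.ProdNo = Product.ProdNo
GROUP BY OrdLine.OrdNo;
-- With derived data: read the stored column directly
SELECT OrdNo, OrdAmt FROM OrderTbl;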
303
Parallel Processing Parallel processing can improve retrieval and modification performance. Retrieving many records can be improved by reading physical records in parallel. Many DBMSs provide parallel processing capabilities with RAID systems. RAID is a collection of disks (a disk array) that operates as a single disk. For example, a report to summarize daily sales activity may read thousands of records from several tables. Parallel reading of physical records can reduce significantly the execution time of the report. As a response to the potential performance improvements, many DBMSs provide parallel processing capabilities. These capabilities require hardware and software support for Redundant Arrays of Independent Disks (RAID). More on parallel database processing in Chapter 17.
304
Striping in RAID Storage Systems
Striping is an important concept for RAID storage. Striping involves the allocation of physical records to different disks. A stripe is the set of physical records that can be read or written in parallel. Normally, a stripe contains a set of adjacent physical records. This figure depicts an array of four disks that allows the reading or writing of four physical records in parallel. To utilize RAID storage, a number of architectures have emerged. The architectures, known as RAID-0 through RAID-6, support parallel processing with varying amounts of performance and reliability. Reliability is an important issue because the mean time between failures (a measure of disk drive reliability) decreases as the number of disk drives increases. To combat reliability concerns, RAID architectures incorporate redundancy and error-correcting codes.
305
Other Ways to Improve Performance
Transaction processing: add computing capacity and improve transaction design. Data warehouses: add computing capacity and store derived data. Distributed databases: allocate processing and data to various computing locations. There are a number of other ways to improve database performance that are related to a specific kind of processing. For transaction processing (Chapter 15), you can add computing capacity (faster and more processors, memory, and hard disk) and make trade-offs in transaction design. For data warehouses (Chapter 16), you can add computing capacity and design new tables with derived data. For distributed database processing (Chapter 17), you can allocate processing and data to various computing locations. Data can be allocated by partitioning a table vertically (column subset) and horizontally (row subset) to locate data close to its usage. These design choices are discussed in the respective chapters in Part 7. In addition to tuning performance for specific processing requirements, you also can improve performance by utilizing options specific to a DBMS. For example, most DBMSs have options for file structures that can improve performance. You must carefully study the specific DBMS to understand these options. It may take several years of experience and specialized education to understand options of a particular DBMS. However, the payoff of increased salary and demand for your knowledge can be worth the study.
306
Summary Goal: minimize response time
Constraints: disk space, memory, communication bandwidth Table profiles and application profiles must be specified in sufficient detail. Environment: file structures and query optimization Monitor and possibly improve query optimization results Index selection: most important decision Other techniques: denormalization, record formatting, and parallel processing Response time involves minimization of processing time and disk accesses (most important) in centralized computing environments. Constraints: disk space, memory, communication bandwidth
307
Advanced Query Formulation with SQL
Chapter 9 Advanced Query Formulation with SQL Welcome to Chapter 9 on advanced query formulation Chapter 9 covers advanced matching problems: - These problems are not common but important when they occur. - Solving these problems requires more specialized knowledge: higher class grade and job expertise Objectives: - Analyze problem statement - Identify the essential problem element - Apply template SQL solutions - SELECT statement extensions - Outer join operator - Nested queries - Recognize - Understand conceptual evaluation process - Interpret SQL statements that use the extensions - Use in advanced matching problems
308
Outline Outer join problems Type I nested queries
Type II nested queries and difference problems Nested queries in the FROM clause Division problems Null value effects Outer join problems: - One-sided and full outer join - Outer join animation supplements the lecture notes - SQL:2003 notation for outer joins: Access 2002 and Oracle 9i Type I nested queries: - Understand evaluation - Use in SELECT and DELETE statements Difference problems: - Understand evaluation of Type II nested queries - Solve difference problems using Type I and II nested queries Nested queries in the FROM clause: - Motivation: nested aggregates and multiple aggregate computations - Example problems Division problems - Division operator: chapter 3 material - Animation supplement to reinforce the meaning of division - Use the Count method to solve division problems Null value effects: - Simple conditions - Compound conditions - Grouping and aggregate functions
309
Outer Join Overview Join excludes non matching rows
Preserving non matching rows is important in some business situations Outer join variations Full outer join One-sided outer join Review of Chapter 3 material: cover for review if desired Can skip this material in initial coverage of Chapter 3 and cover now Importance of preserving non matching rows: - Offerings without assigned faculty - Orders without sales associates Outer join variations: - Full: preserves non matching rows of both tables - One-sided: preserves non matching rows of the designated table - One-sided outer join is more common
310
Outer Join Operators Full outer join Left Outer Join Right Outer Join
Review of Chapter 3 material: cover for review if desired Can skip this material in initial coverage of Chapter 3 and cover now Outer join matching: - join columns, not all columns as in traditional set operators - One-sided outer join: preserving non matching rows of a designated table (left or right) - Full outer join: preserving non matching rows of both tables Unmatched rows of the left table Matched rows using the join condition Unmatched rows of the right table
311
Full Outer Join Example
Outer join result: - Join part: rows 1 – 3 - Outer join part: non matching rows (rows 4 and 5) - Null values in the non matching rows: columns from the other table One-sided outer join: - Preserve non matching rows of the designated table - Preserve the Faculty table in the result: first four rows - Preserve the Offering table: first three rows and fifth row See outer join animation for interactive example
312
University Database Knowledge of optional relationships useful for one-sided outer joins - Optional relationship: FK accepts null values - Offering.FacSSN allows null values for Offerings without assigned faculty - OrderTbl.EmpNo (Order Entry database) allows null values for Internet orders (orders without a sales associate)
313
LEFT JOIN and RIGHT JOIN Keywords
Example 1 (Access) SELECT OfferNo, CourseNo, Offering.FacSSN, FacFirstName, FacLastName FROM Offering LEFT JOIN Faculty ON Offering.FacSSN = Faculty.FacSSN WHERE CourseNo LIKE 'IS*' Example 2 (Oracle) SELECT OfferNo, CourseNo, Offering.FacSSN, FacFirstName, FacLastName FROM Faculty RIGHT JOIN Offering ON Offering.FacSSN = Faculty.FacSSN WHERE CourseNo LIKE 'IS%' Example 1: - Preserve rows of the table on the left (Offering) - Result includes Offering rows that join with Faculty rows as well as non matching Offering rows Example 2: - Result is identical to Example 1 Outer join keywords: Access syntax as well as SQL:2003 syntax and Oracle 9i Oracle 9i/10g examples are identical except for % wild card character
314
Full Outer Join Example I
Example 3 (SQL:2003 and Oracle 9i/10g) SELECT FacSSN, FacFirstName, FacLastName, FacSalary, StdSSN, StdFirstName, StdLastName, StdGPA FROM Faculty FULL JOIN Student ON Student.StdSSN = Faculty.FacSSN Example 3: - Full outer join: retrieve all rows of two similar but not union compatible tables - Student and Faculty both are kinds of university people - Retrieves selected columns of student and faculty tables - Result contains every university person (student only, faculty only, TA) - Can use full outer join on tables that are not union compatible - Full outer join makes comparison on join column(s) not all columns - SQL:1999/2003: FULL JOIN keyword
315
Full Outer Join Example II
Example 4 (Access) SELECT FacSSN, FacFirstName, FacLastName, FacSalary, StdSSN, StdFirstName, StdLastName, StdGPA FROM Faculty RIGHT JOIN Student ON Student.StdSSN = Faculty.FacSSN UNION SELECT FacSSN, FacFirstName, FacLastName, FacSalary, StdSSN, StdFirstName, StdLastName, StdGPA FROM Faculty LEFT JOIN Student ON Student.StdSSN = Faculty.FacSSN Notation: - Access: UNION of two one-sided outer joins; LEFT JOIN, RIGHT JOIN notation - Result is identical to Example 3 (FULL JOIN)
316
Mixing Inner and Outer Joins I
Example 5 (Access) SELECT OfferNo, Offering.CourseNo, OffTerm, CrsDesc, Faculty.FacSSN, FacLastName FROM ( Faculty RIGHT JOIN Offering ON Offering.FacSSN = Faculty.FacSSN ) INNER JOIN Course ON Course.CourseNo = Offering.CourseNo WHERE Course.CourseNo LIKE 'IS*' Mixing inner and outer joins: - Very common when using one-sided outer join - Access requires parentheses when using more than one join operation in the FROM clause (should not be required for INNER JOINs) - Nested parentheses make the statement somewhat difficult to read - Access rule on combining inner and outer joins: - Cannot nest inner join inside outer joins - Perform outer joins first: make outer join most deeply nested - Oracle 9i/10g solution: use % instead of *
317
Type I Nested Queries Query inside a query
Use in WHERE and HAVING conditions Similar to a nested procedure Executes one time No reference to outer query Also known as non-correlated or independent nested query Nested query: - query inside a query (SELECT statement inside a SELECT statement) - Use in conditions in the WHERE and HAVING clauses - Also use nested queries in the FROM clause Type I: - Primarily an alternative join style Nested procedure execution model: procedure executes and is replaced by a value - Executes one time and is replaced by a list of values Distinguishing feature: - No reference to outer query - Type I nested query is independent of outer query - Non-correlated or independent nested query
318
Type I Nested Query Examples I
Example 6 (Access): List finance faculty who teach IS courses. SELECT FacSSN, FacLastName, FacDept FROM Faculty WHERE FacDept = 'FIN' AND FacSSN IN ( SELECT FacSSN FROM Offering WHERE CourseNo LIKE 'IS*' ) Example 6: - Type I nested query replaces a join to Offering table - IN keyword: "set element of" operator - Nested query executes one time and produces a list of FacSSN values - Oracle: use % instead of * Applicability of Type I nested queries: - Do not need columns from nested query in the final result - Cannot use Type I nested query if Offering columns needed in the result - Use Type I nested queries to test a condition from another table
319
Type I Nested Query Examples II
Example 7 (Oracle): List finance faculty who teach 4 unit IS courses. SELECT FacSSN, FacLastName, FacDept FROM Faculty WHERE FacDept = 'FIN' AND FacSSN IN ( SELECT FacSSN FROM Offering WHERE CourseNo LIKE 'IS%' AND CourseNo IN ( SELECT CourseNo FROM Course WHERE CrsUnits = 4 ) ) Applicability of Type I nested queries: - Do not need columns from nested query in the final result - Cannot use Type I nested query if Offering columns needed in the result - Use Type I nested queries to test a condition from another table Example 7: - Nested query inside a nested query - Second nested query tests a condition from the Course table - Access: use * instead of % - Cannot use type I nested queries if a column from the offering or course table is needed
320
DELETE Example Use Type I nested queries to test conditions on other tables Use for UPDATE statements also Example 8: Delete offerings taught by Leonard Vince. DELETE FROM Offering WHERE Offering.FacSSN IN ( SELECT FacSSN FROM Faculty WHERE FacFirstName = 'Leonard' AND FacLastName = 'Vince' ) Type I nested query good for complex deletions: - Test conditions on other tables - Portable across most DBMSs
321
Type II Nested Queries Similar to nested loops
Executes one time for each row of outer query Reference to outer query Also known as correlated or variably nested query Use for difference problems not joins Type II: - More complex execution model - Nested loop execution model: inner loop executes one time for each outer loop iteration - Look for a column in nested query that refers to a table used in the outer query - Use for difference problems - Do not use for joins: performance penalty likely - See animation slides on Type II nested queries
322
Type II Nested Query Example for a Difference Problem
Example 9: Retrieve MS faculty who are not teaching in winter 2006. SELECT FacSSN, FacLastName, FacDept FROM Faculty WHERE FacDept = 'MS' AND NOT EXISTS ( SELECT * FROM Offering WHERE OffTerm = 'WINTER' AND OffYear = 2006 AND Faculty.FacSSN = Offering.FacSSN ) EXISTS: - Table comparison operator - True if nested query produces 1 or more rows; false otherwise - NOT EXISTS: true if nested query produces 0 rows; false otherwise Difference problem: - Elements in one set but not another set - All MS faculty MINUS MS faculty teaching in winter 2006 - Remaining set contains MS faculty not teaching in winter 2006 - Look for "not" connecting different parts of a sentence (faculty not teaching) Example 9: - Type II nested query: Faculty.FacSSN references Faculty table in the outer query - See Nested Query animation for derivation of the result
323
Limited Formulations for Difference Problems
Type I nested query with NOT IN condition One-sided outer join with IS NULL condition Difference operation using MINUS (EXCEPT) operator Type I formulation: only one column for comparing rows of the two tables One-sided outer join formulation: no conditions (except the IS NULL condition) on the excluded table Difference operation using MINUS: result must contain only union-compatible columns
324
Type I Difference Formulation
Example 10: Retrieve MS faculty who are not teaching in winter 2006. SELECT FacSSN, FacLastName, FacDept FROM Faculty WHERE FacDept = 'MS' AND FacSSN NOT IN ( SELECT FacSSN FROM Offering WHERE OffTerm = 'WINTER' AND OffYear = 2006 ) Example 10: - Nested query is Type I (no reference to outer query) - Type I formulation does not work for complex difference problems such as Example 9.20 which involves a comparison of multiple correlated columns in the Type II nested query - One-sided outer join formulation does not work because of conditions on the excluded table (Offering) - Cannot use MINUS formulation because tables are not union compatible even on a subset of columns
325
One-Sided Outer Join Difference Formulation
Example 11: Retrieve MS faculty who have never taught a course (research faculty). SELECT FacSSN, FacLastName, FacDept FROM Faculty LEFT JOIN Offering ON Faculty.FacSSN = Offering.FacSSN WHERE FacDept = 'MS' AND Offering.FacSSN IS NULL Example 11: - Example 11 is not as useful as Example 10: more interested in faculty not teaching in a particular academic term - One-sided difference formulation cannot test conditions on the excluded table (Offering) - Cannot use MINUS formulation because tables are not union compatible even on a subset of columns
326
MINUS Operator Difference Formulation
Example 12 (Oracle): Retrieve faculty who are not students SELECT FacSSN AS SSN, FacFirstName AS FirstName, FacLastName AS LastName, FacCity AS City, FacState AS State FROM Faculty MINUS SELECT StdSSN AS SSN, StdFirstName AS FirstName, StdLastName AS LastName, StdCity AS City, StdState AS State FROM Student Example 12: - Faculty and student tables are partially union compatible - Cannot use MINUS operator on Examples 10 and 11 because Faculty and Offering tables are not even partially union compatible
327
Nested Queries in the FROM Clause
More recent introduction than nested queries in the WHERE and HAVING clauses Consistency in language design Wherever table appears, table expression can appear Specialized uses Nested aggregates Multiple independent aggregate calculations Nested queries in the WHERE and HAVING clauses were part of original SELECT statement design Nested queries in the FROM clause were introduced in SQL-92 Consistency in language design: wherever X constants are permitted, X expressions should be permitted FROM clause supports table constants and table expressions (SELECT statements) Specialized uses: not as important as nested queries in the WHERE and HAVING clauses Nested aggregates: average of the sum of credit hours or average of the number of students Multiple aggregates: number of students enrolled and the average resources consumed in the course (grouping of different tables)
328
Nested FROM Query Example
Example 13: Retrieve the course number, course description, the number of offerings, and the average enrollment across offering. SELECT T.CourseNo, T.CrsDesc, COUNT(*) AS NumOfferings, Avg(T.EnrollCount) AS AvgEnroll FROM (SELECT Course.CourseNo, CrsDesc, Offering.OfferNo, COUNT(*) AS EnrollCount FROM Offering, Enrollment, Course WHERE Offering.OfferNo = Enrollment.OfferNo AND Course.CourseNo = Offering.CourseNo GROUP BY Course.CourseNo, CrsDesc, Offering.OfferNo) T GROUP BY T.CourseNo, T.CrsDesc Example 13: Needs a nested query in the FROM clause because nested aggregate is computed Nested aggregate: average number of students enrolled per offering Nested query retrieves one row per offering: course number, description, and number of students enrolled The outer query groups the nested query by course number and description so that number of offerings and the average enrollment per offering can be computed A useful query to compare introductory courses containing many sections Must use two queries to formulate if not using a nested query in the FROM clause
329
Divide Operator Match on a subset of values Specialized operator
Suppliers who supply all parts Faculty who teach every IS course Specialized operator Typically applied to associative tables representing M-N relationships Chapter 3 material: - Present as review if desired - Can skip material when covering chapter 3 and instead cover now Subset matching: - Use of every or all connecting different parts of a sentence - Use any or some: join problem - Specialized matching but important when necessary - Conceptually difficult Table structures: - Typically applied to associative tables such as Enrollment, Supp-Part, StdClub - Can also be applied to child (M) tables in a 1-M relationship (Offering table)
330
Division Example Chapter 3 material: - Present as review if desired
- Can skip material when covering Chapter 3 and instead cover now - See also animation slides for the division operator Table structure: - SuppPart: associative table between Part and Supp tables - List suppliers who supply every part Formulation: - See Division animation for interactive presentation - Sort SuppPart table by SuppNo - Choose Suppliers that are associated with every part - Set of parts for a supplier contains the set of all parts - S3 associated with P1, P2, and P3 - Must look at all rows with S3 to decide whether S3 is in the result
331
COUNT Method for Division Problems
Compare the number of rows associated with a group to the total number in the subset of interest Type I nested query in the HAVING clause Example 14: List the students who belong to all clubs. SELECT StdNo FROM StdClub GROUP BY StdNo HAVING COUNT(*) = ( SELECT COUNT(*) FROM Club ) Count method: - Compare the number of rows in two sets rather than comparing the sets directly - Make sure that the sets are comparable - Use Type I nested query in the HAVING clause: - COUNT(*) to table with a single row and column (COUNT(*)) - Much simpler than alternative formulations: doubly nested Type II subqueries Example 14: - Left-hand COUNT(*): number of rows in a StdNo group - Nested query: total number of clubs - Nested query executes one time: no reference to the outer query
332
Typical Division Problems
Compare to an interesting subset rather than entire table Use similar conditions in outer and nested query Example 15: List the students who belong to all social clubs. SELECT Student1.StdNo, SName FROM StdClub, Club, Student1 WHERE StdClub.ClubNo = Club.ClubNo AND Student1.StdNo = StdClub.StdNo AND CPurpose = 'SOCIAL' GROUP BY Student1.StdNo, SName HAVING COUNT(*) = ( SELECT COUNT(*) FROM Club WHERE CPurpose = 'SOCIAL' ) Example 15: - Interesting subset: social clubs - Same condition in WHERE clause of both queries (outer and nested) - Ensures that sets are comparable
333
Advanced Division Problems
Count distinct values rather than rows Faculty who teach at least one section of selected course offerings Offering table has duplicate CourseNo values Use COUNT(DISTINCT column) Use stored query or nested FROM query in Access Use COUNT(DISTINCT …) to count unique column values See Section for detailed explanation Access: - Does not support COUNT(DISTINCT …) - Use SELECT DISTINCT in a stored query for the same effect - Use a nested query in the FROM clause for the same effect - See Appendix 9.A for Access details about stored queries
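A hedged sketch of the nested-FROM workaround for Access, which lacks COUNT(DISTINCT ...); the derived-table syntax and its use inside HAVING can vary by Access version.
SELECT T.FacSSN, T.FacFirstName, T.FacLastName
FROM ( SELECT DISTINCT Faculty.FacSSN, FacFirstName, FacLastName, CourseNo
       FROM Faculty, Offering
       WHERE Faculty.FacSSN = Offering.FacSSN
         AND OffTerm = 'FALL' AND OffYear = 2005 AND CourseNo LIKE 'IS*' ) T
GROUP BY T.FacSSN, T.FacFirstName, T.FacLastName
HAVING COUNT(*) =
  ( SELECT COUNT(*)
    FROM ( SELECT DISTINCT CourseNo
           FROM Offering
           WHERE OffTerm = 'FALL' AND OffYear = 2005 AND CourseNo LIKE 'IS*' ) T2 );
-- The DISTINCT in each derived table removes duplicate course rows, so COUNT(*) counts distinct courses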
334
Advanced Division Problem Example
Example 16: List the SSN and the name of faculty who teach at least one section of all of the fall 2005 IS courses. SELECT Faculty.FacSSN, FacFirstName, FacLastName FROM Faculty, Offering WHERE Faculty.FacSSN = Offering.FacSSN AND OffTerm = 'FALL' AND CourseNo LIKE 'IS%' AND OffYear = 2005 GROUP BY Faculty.FacSSN, FacFirstName, FacLastName HAVING COUNT(DISTINCT CourseNo) = ( SELECT COUNT(DISTINCT CourseNo) FROM Offering WHERE OffTerm = 'FALL' AND OffYear = 2005 AND CourseNo LIKE 'IS%' ) Requiring that an instructor has taught every offering (textbook Example 9.29) is not particularly useful because it is unlikely that any instructor has done so. Rather, it is more useful to require that an instructor has taught at least one offering of every course, as demonstrated in this example (textbook Example 9.30). Rather than counting the rows in each group, count the unique CourseNo values. This change is necessary because CourseNo is not unique in the Offering table. There can be multiple rows with the same CourseNo, corresponding to a situation where there are multiple offerings for the same course. The solution only executes in Oracle because Access does not support the DISTINCT keyword in aggregate functions.
335
Null Value Effects Simple conditions Compound conditions
Grouping and aggregate functions SQL:2003 standard but implementation may vary The last section of this chapter does not involve difficult matching problems or new parts of SQL. Rather, this section presents interpretation of query results when tables contain null values. These effects have largely been ignored until this section to simplify the presentation. Because most databases use null values, you need to understand the effects to attain a deeper understanding of query formulation. Null values affect simple conditions involving comparison operators, compound conditions involving Boolean operators, aggregate calculations, and grouping. As you will see, some of the null value effects are rather subtle. Because of these subtle effects, a good table design minimizes, although it usually does not eliminate, the use of null values. The null effects described in this section are specified in the SQL2 standard. Because specific DBMSs may provide different results, you may need to experiment with your DBMS.
336
Simple Conditions A simple condition is null if either the left-hand or right-hand side is null. Discard rows evaluating to false or null Retain rows evaluating to true Rows evaluating to null will not appear in the result of the simple condition or its negation Simple conditions involve a comparison operator, a column or column expression and a constant, column, or column expression. A simple condition results in a null value if either column (or column expression) in a comparison is null. A row qualifies in the result if the simple condition evaluates to true for the row. Rows evaluating to false and null are discarded. A more subtle result can occur when a simple condition involves two columns and at least one column contains null values. If neither column contains null values, every row will be in the result of either the simple condition or the opposite (negation) of the simple condition. For example, if < is the operator of a simple condition, the opposite condition contains >= as its operator, assuming the columns remain in the same positions. If at least one column contains null values, some rows will not appear in the result of either the simple condition or its negation.
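A hedged sketch of the effect using the Offering table, whose FacSSN column allows nulls (offerings without assigned faculty); the SSN literal is hypothetical.
-- An offering with a null FacSSN evaluates both conditions to null, so it appears in neither result
SELECT OfferNo FROM Offering WHERE FacSSN = '111-11-1111';
SELECT OfferNo FROM Offering WHERE FacSSN <> '111-11-1111';
-- Only an explicit null test retrieves such rows
SELECT OfferNo FROM Offering WHERE FacSSN IS NULL;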
337
Compound Conditions Compound conditions involve one or more simple conditions connected by the Boolean operators AND, OR, and NOT. Like simple conditions, compound conditions evaluate to true, false, or null. A row is selected if the entire compound condition in the WHERE clause evaluates to true. To evaluate the result of a compound condition, the SQL2 standard uses truth tables with three values. A truth table shows how combinations of values (true, false, and null) combine with the Boolean operators. Truth tables with three values define a three-valued logic. The above tables depict truth tables for the AND, OR, and NOT operators. The internal cells in these tables are the result values. For example, the first internal cell (True) in Table 13 results from the AND operator applied to two conditions with true values.
338
Aggregate Functions Null values ignored Effects can be subtle
COUNT(*) may differ from Count(Column) SUM(Column1) + SUM(Column2) may differ from SUM(Column1 + Column2) Null values are ignored in aggregate calculations. Although this statement sounds simple, the results can be subtle. For the COUNT function, COUNT(*) returns a different value than COUNT(column) if the column contains null values. COUNT(*) always returns the number of rows. COUNT(column) returns the number of non-null values in the column. An even more subtle effect can occur if the SUM or AVG functions are applied to a column with null values. Without regard to null values, the following equation is true: SUM(Column1) + SUM(Column2) = SUM(Column1 + Column2). With null values in at least one of the columns, the equation may not be true because a calculation involving a null value yields a null value. If Column1 has a null value in one row, the plus operation in SUM(Column1 + Column2) produces a null value for that row. However, the value of Column2 in the same row is counted in SUM(Column2).
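A hedged sketch of the COUNT difference on the Offering table (FacSSN allows nulls):
-- COUNT(*) counts every Offering row; COUNT(FacSSN) skips rows with a null FacSSN,
-- so NumAssigned can be smaller than NumOfferings
SELECT COUNT(*) AS NumOfferings, COUNT(FacSSN) AS NumAssigned
FROM Offering;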
339
Grouping Effects Rows with null values are grouped together
Grouping column contains null values Null group can be placed at beginning or end of the non-null groups Null values also can affect grouping operations performed in the GROUP BY clause. The SQL:2003 standard stipulates that all rows with null values are grouped together. The grouping column shows null values in the result. In the university database, this kind of grouping operation is useful to find course offerings without assigned professors.
340
Summary Advanced matching problems not common but important when necessary Understand outer join, difference, and division operators Nested queries important for advanced matching problems Lots of practice to master query formulation and SQL Gain competitive advantage by understanding advanced matching problems: - Class: difference between A and B (this may not be true in your class) - Job: solving difficult but important problems Skills: - Understand outer join, difference, and division operators - Recognize problem statements that involve these operators - Modify a template SQL solution to formulate the SELECT statement - Nested queries used in difference and division problems Lots of practice - Work many problems without seeing the solutions - 50 problems to develop understanding of query formulation and SQL - Do not rely on visual tools such as Query Design in Access; use SQL directly
341
Oracle 8i Notation for One-Sided Outer Joins
Example A1 (Oracle 8i) SELECT OfferNo, CourseNo, Offering.FacSSN, FacFirstName, FacLastName FROM Faculty, Offering WHERE Offering.FacSSN = Faculty.FacSSN (+) AND CourseNo LIKE 'IS%' Example A2 (Oracle 8i) SELECT OfferNo, CourseNo, Offering.FacSSN, FacFirstName, FacLastName FROM Faculty, Offering WHERE Faculty.FacSSN (+) = Offering.FacSSN AND CourseNo LIKE 'IS%' Oracle 8i does not support SQL:2003 syntax for outer joins - Oracle developed its own proprietary syntax that predates SQL-92 and SQL:1999 - Oracle 9i supports the standard SQL notation for outer joins - Oracle notation is inferior to SQL:2003 notation when multiple outer joins are involved - For multiple outer joins (not common), the ordering of the outer joins must be specified - Oracle 8i is still used so the older notation is presented (+) notation: - Place by the null table (table with null values in the result) - Opposite of the SQL:1999 notation Examples A1 and A2 produce the same results as Examples 1 and 2
342
Full Outer Join Example III
Example A3 (Oracle 8i) SELECT FacSSN, FacFirstName, FacLastName, FacSalary, StdSSN, StdFirstName, StdLastName, StdGPA FROM Faculty, Student WHERE Student.StdSSN = Faculty.FacSSN (+) UNION SELECT FacSSN, FacFirstName, FacLastName, FacSalary, StdSSN, StdFirstName, StdLastName, StdGPA FROM Faculty, Student WHERE Student.StdSSN (+) = Faculty.FacSSN Notation: - SQL:1999: FULL JOIN keyword - Access: UNION of two one-sided outer joins; LEFT JOIN, RIGHT JOIN notation - Oracle: UNION of two one-sided outer joins; (+) notation - Result is identical to Examples 3 and 4 (Oracle 9i and Access)
343
Mixing Inner and Outer Joins II
Example A4 (Oracle 8i) SELECT OfferNo, Offering.CourseNo, OffTerm, CrsDesc, Faculty.FacSSN, FacLastName FROM Faculty, Course, Offering WHERE Offering.FacSSN = Faculty.FacSSN (+) AND Course.CourseNo = Offering.CourseNo AND Course.CourseNo LIKE 'IS%' Mixing inner and outer joins: - Very common when using one-sided outer join - Oracle 8i solution is easier to read: no parentheses - Oracle 8i notation does not generalize to difficult problems involving multiple outer joins