Fall CSE330/CIS550: Introduction to Database Management Systems
Prof. Susan Davidson
Office: 278 Moore
Office hours: TTh 10-11

Administrative Stuff
– What you should know to take this class.
– Handouts: Syllabus and Homework 1.
– Resources: text, TAs, Web site, bulletin board, and office hours.
– Coursework: homeworks, exams, project.
– Computer accounts.

What the subject is about
– Modeling and organization of data
– Efficient (expressive?) retrieval of data
– Reliable and consistent storage of data
Not surprisingly, all these topics are interrelated.

What is a DBMS?
– A database (DB) is a large, integrated collection of data.
– A DB models a real-world enterprise.
– A database management system (DBMS) is a software package designed to store and manage databases.

Why study databases?
– Everybody needs them, i.e. $$$.
– There are lots of interesting problems, both in database research and in implementation.
– Good design is always a challenge.

Connection to other areas of CS…
– Programming languages and software engineering (obviously)
– Algorithms (obviously)
– Logic, discrete math, and theory of computation
– “Systems” issues: concurrency, operating systems, file organization, and networks

But 80% of the world’s data is not in a DB!
Examples:
– scientific data (large images, complex programs that analyze the data)
– personal data
– WWW

Why don't we “program up” databases when we need them?
For simple and small databases this is often the best solution. Flat files and grep get us a long way.
We run into problems when
– The structure is complicated (more than a simple table)
– The database gets large
– Many people want to use it simultaneously

Example: Personal Calendar
We might start by building a file with the following structure:

What    Day    When  Who          Where
Lunch   10/24  1pm   Rick         Joe’s Diner
CS123   10/25  9am   Dr. Egghead  Morris234
Biking  10/26  9am   Jane         Jane’s house
Dinner  10/26  6PM   Jane         Café Le Boeuf

This text file is easy to deal with. So there's no need for a DBMS!

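The flat-file approach can be sketched in a few lines of Python. This is only an illustration: it assumes the calendar is stored tab-separated, and the names `CALENDAR_TEXT`, `load_calendar`, and `appointments_with` are hypothetical, not part of the course material.

```python
import csv
import io

# Hypothetical file contents: the slide's rows, assumed to be tab-separated.
CALENDAR_TEXT = (
    "What\tDay\tWhen\tWho\tWhere\n"
    "Lunch\t10/24\t1pm\tRick\tJoe's Diner\n"
    "CS123\t10/25\t9am\tDr. Egghead\tMorris234\n"
    "Biking\t10/26\t9am\tJane\tJane's house\n"
    "Dinner\t10/26\t6PM\tJane\tCafé Le Boeuf\n"
)


def load_calendar(text):
    """Parse the tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def appointments_with(rows, who):
    """Grep-style linear scan: keep the rows whose Who field matches exactly."""
    return [row for row in rows if row["Who"] == who]


rows = load_calendar(CALENDAR_TEXT)
jane = appointments_with(rows, "Jane")
for row in jane:
    print(row["What"], row["Day"], row["When"], row["Where"])
```

Every lookup scans the whole file, which is fine at this size and is exactly what stops scaling once the file grows or its structure gets richer.
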
Problem 1: Data Organization
Consider the all-important “who” field. Do we also want to keep addresses, telephone numbers, etc.? Expand our file to look like:

What  When  Who-name  Who-  Who-tel  ….  Where
…

Now we are keeping our address book in our calendar, and doing so redundantly.

“Link” Calendar with Address Book?
Two conceptual “entities” -- contact information and calendar -- with a relationship between them, linking people in the calendar to their contact information.
This link could be based on something as simple as the person's name.

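One way to picture the two entities and the name-based link is as two tables joined on the person's name. The sketch below uses Python's built-in `sqlite3` module; the `Contact` table, its `Tel` column, and the phone numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (Name TEXT PRIMARY KEY, Tel TEXT);
CREATE TABLE Calendar (What TEXT, Day TEXT, Who TEXT);
""")

# Contact details are stored once per person (hypothetical numbers).
conn.executemany("INSERT INTO Contact VALUES (?, ?)", [
    ("Rick", "555-0101"),
    ("Jane", "555-0102"),
])

# Calendar rows carry only the person's name, which acts as the link.
conn.executemany("INSERT INTO Calendar VALUES (?, ?, ?)", [
    ("Lunch", "10/24", "Rick"),
    ("Biking", "10/26", "Jane"),
    ("Dinner", "10/26", "Jane"),
])

# Follow the link: join each of Jane's appointments to her contact record.
jane_rows = conn.execute("""
    SELECT c.What, c.Day, p.Tel
    FROM Calendar AS c JOIN Contact AS p ON c.Who = p.Name
    WHERE c.Who = 'Jane'
    ORDER BY c.What
""").fetchall()
print(jane_rows)
```

Jane's telephone number now lives in one place, however many appointments mention her.
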
Problem 2: Efficiency
The size of a personal address book is probably less than one hundred entries, but there are things we'd like to do quickly and efficiently:
– “Give me all appointments on 10/28”
– “When am I next meeting Jim?”
We want to “program” these as quickly as possible, and to have these programs executed efficiently.
What would happen if you were using a corporate calendar with hundreds of thousands of entries?

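With a query language, both requests become one-line declarative queries. The sketch below again uses `sqlite3`; the sample rows are invented, and the dates are stored as `YYYY-MM-DD` strings with an arbitrary year so that plain string comparison orders them correctly (an assumption of this example, not something the slides specify).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calendar (What TEXT, Day TEXT, Who TEXT)")
conn.executemany("INSERT INTO Calendar VALUES (?, ?, ?)", [
    ("Planning", "2000-10-20", "Jim"),   # hypothetical sample data
    ("Lunch",    "2000-10-24", "Rick"),
    ("Review",   "2000-10-28", "Jim"),
    ("Seminar",  "2000-10-28", "Jane"),
    ("Coffee",   "2000-11-02", "Jim"),
])

# "Give me all appointments on 10/28"
on_day = conn.execute(
    "SELECT What, Who FROM Calendar WHERE Day = ? ORDER BY What",
    ("2000-10-28",),
).fetchall()

# "When am I next meeting Jim?" relative to an assumed current date.
today = "2000-10-25"
next_jim = conn.execute(
    "SELECT MIN(Day) FROM Calendar WHERE Who = ? AND Day >= ?",
    ("Jim", today),
).fetchone()[0]

print(on_day)
print(next_jim)
```

The queries say what is wanted, not how to find it, which leaves the DBMS free to pick an efficient strategy as the calendar grows.
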
Problem 3: Concurrency and Reliability
Suppose other people are allowed to access your calendar and to modify it. How do we stop two people from changing the file at the same time and leaving it in a physical (or logical) mess?
Suppose the system crashes while we are changing the calendar. How do we recover our work?

Transactions
– The key concept for concurrency is that of a transaction: an atomic sequence of database actions (read/write) on data items (e.g. a calendar entry).
– The key concept for recoverability is that of a log: keeping track of all actions carried out by the DB.
Sounds like operating systems all over again!

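Atomicity can be seen in a small sketch: rescheduling an appointment as a delete followed by an insert, with a simulated failure between the two writes. It uses `sqlite3`, where a connection used as a context manager commits on success and rolls back if an exception escapes. The function `reschedule` and its `fail_midway` flag are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calendar (What TEXT, Day TEXT, Who TEXT)")
conn.execute("INSERT INTO Calendar VALUES ('Dinner', '10/26', 'Jane')")
conn.commit()


def reschedule(conn, what, new_day, fail_midway=False):
    """Move an appointment: delete the old row, insert the new one, as one unit."""
    with conn:  # commits on success, rolls back if an exception escapes
        who = conn.execute(
            "SELECT Who FROM Calendar WHERE What = ?", (what,)
        ).fetchone()[0]
        conn.execute("DELETE FROM Calendar WHERE What = ?", (what,))
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute("INSERT INTO Calendar VALUES (?, ?, ?)", (what, new_day, who))


try:
    reschedule(conn, "Dinner", "10/27", fail_midway=True)
except RuntimeError:
    pass
# The half-finished delete was undone, so the original row is still there.
after_failure = conn.execute("SELECT * FROM Calendar").fetchall()

reschedule(conn, "Dinner", "10/27")
after_success = conn.execute("SELECT * FROM Calendar").fetchall()

print(after_failure)
print(after_success)
```

Either both writes take effect or neither does; the engine's own record of pending changes plays the role of the log here.
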
Database architecture -- the traditional view
It is common to describe databases in two ways:
– The logical structure. What users see. The program or query language interface.
– The physical structure. How files are organized. What indexing mechanisms are used.
Further, it is traditional to split the logical level into two components: the overall database design (conceptual) and the views that various users get to see.

Three-level architecture
View 1    View 2    …    View N
Conceptual Level (schema)
Physical Level (file organization, indexing)

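The top level of the diagram can be illustrated with an SQL view: different users query different views over the same conceptual schema. In this sketch the view name `BusyDays` is hypothetical; it exposes only the days that are taken, hiding what the appointments are and who they are with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual level: the full calendar schema.
CREATE TABLE Calendar (What TEXT, Day TEXT, Who TEXT);
INSERT INTO Calendar VALUES ('Lunch',  '10/24', 'Rick');
INSERT INTO Calendar VALUES ('Biking', '10/26', 'Jane');
INSERT INTO Calendar VALUES ('Dinner', '10/26', 'Jane');

-- View level: what a colleague checking availability gets to see.
CREATE VIEW BusyDays AS SELECT DISTINCT Day FROM Calendar;
""")

busy = conn.execute("SELECT Day FROM BusyDays ORDER BY Day").fetchall()
print(busy)
```

How the rows are laid out on disk (the physical level) is not visible in any of this SQL.
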
Data independence
A user of a relational database system should be able to use SQL to query the database without knowing precisely how the data is stored, e.g.

SELECT "When", "Where"
FROM Calendar
WHERE Who = 'Bill'

After all, you don't worry much about how numbers are stored when you program some arithmetic or use a computer-based calculator.

More on data independence
– Logical data independence protects the user from changes in the logical structure of the data: we could completely reorganize the calendar “schema” without changing how I query it.
– Physical data independence protects the user from changes in the physical structure of the data: we could add an index on Who without changing how the user would write the query, but the query would execute faster (query optimization).

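Physical data independence can be observed directly in SQLite: the query text stays the same before and after an index on Who is added, and only the plan chosen by the engine changes. The index name `idx_calendar_who` and the helper `plan` are made up for this sketch, and the exact plan wording varies between SQLite versions. `When` and `Where` are quoted because they are SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Calendar (What TEXT, Day TEXT, "When" TEXT, Who TEXT, "Where" TEXT)'
)
conn.executemany("INSERT INTO Calendar VALUES (?, ?, ?, ?, ?)", [
    ("Lunch",  "10/24", "1pm", "Rick",        "Joe's Diner"),
    ("CS123",  "10/25", "9am", "Dr. Egghead", "Morris234"),
    ("Biking", "10/26", "9am", "Jane",        "Jane's house"),
    ("Dinner", "10/26", "6PM", "Jane",        "Café Le Boeuf"),
])

# The user's query: written once, never changed below.
QUERY = 'SELECT "When", "Where" FROM Calendar WHERE Who = ?'


def plan(conn, sql, params):
    """Return SQLite's query-plan description (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


result_before = sorted(conn.execute(QUERY, ("Jane",)).fetchall())
plan_before = plan(conn, QUERY, ("Jane",))

# Physical change only: add an index on Who.
conn.execute("CREATE INDEX idx_calendar_who ON Calendar (Who)")

result_after = sorted(conn.execute(QUERY, ("Jane",)).fetchall())
plan_after = plan(conn, QUERY, ("Jane",))

print(plan_before)
print(plan_after)
```

The answers are identical; the second plan mentions the index, whereas the first falls back to scanning the table.
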
That's the traditional view, but...
– The three-level architecture is not always achievable for database programmers. When databases get big, queries must be carefully written to achieve efficiency.
– There are databases over which we have no control.
– The Web is a giant, disorganized database.
– There are also well-organized databases on the Web (e.g., the Movie database) for which the terminology does not quite apply.

In this course...
– Study relational databases: their design, how to query them, and what forms of indices to use.
– Beyond relational algebra: a logical model of data (Datalog), recursion.
– Beyond “first-normal form”: object-oriented databases, how to query them, using OO design techniques.
– XML and semi-structured data models.

What we won’t cover in any depth...
The “technology” of databases:
– details of physical design
– concurrency control
– transaction management
– query optimization
(although a few of these issues will be briefly discussed)