Why use a DBMS in your website? Suppose we are building a web-based music distribution site. Several questions arise: How do we store the data? (file organization, etc.) How do we query the data? (write programs…) How do we make sure that updates don’t mess things up? How do we provide different views on the data? (registrar versus students) How do we deal with crashes? Way too complicated! Buy a database system!
Functionality of a DBMS
Storage management
Abstract data model
High-level query and data manipulation language
  May tell us what we are missing in text-based search
Efficient query processing
  May change in the internet scenario
Transaction processing
Resiliency: recovery from crashes
Different views of the data, security
  May be useful to model a collection of databases together
Interface with programming languages
Database Outline What we care about Structured data representations Relational databases Deductive databases Structured query languages SQL Views (& materialized views) Query optimization overview
Building an Application with a Database System Requirements modeling (conceptual, pictures) Decide what entities should be part of the application and how they should be linked. Schema design and implementation Decide on a set of tables, attributes. Define the tables in the database system. Populate database (insert tuples). Write application programs using the DBMS Now much easier, with data management API
Conceptual Modeling [Figure: entity-relationship diagram. Entities: Student, Course, Professor. Relationships: Takes, Advises, Teaches. Attribute labels shown: name, category, ssn, quarter, field, address.]
Schema Design & Implementation Table Students Separates the logical view from the physical view of the data.
Terminology
Table name: Product
Attribute names: Name, Price, Category, Manufacturer

Name          Price     Category      Manufacturer
gizmo         $19.99    gadgets       GizmoWorks
Power gizmo   $29.99    gadgets       GizmoWorks
SingleTouch   $149.99   photography   Canon
MultiTouch    $203.99   household     Hitachi

Each row is a tuple (arity = 4).
Schema: Product(Name: string, Price: real, Category: enum, Manufacturer: string)
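As a rough illustration of the terminology, the sketch below builds the Product table with Python's built-in sqlite3 module. The column types are an assumption: SQLite has no enum type, so category is stored as TEXT.

```python
import sqlite3

# In-memory database; column names follow the slide's Product schema.
# Assumption: TEXT stands in for the slide's "enum" type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product ("
    " name TEXT, price REAL, category TEXT, manufacturer TEXT)"
)

# Each row is one tuple of arity 4.
rows = [
    ("gizmo", 19.99, "gadgets", "GizmoWorks"),
    ("Power gizmo", 29.99, "gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "photography", "Canon"),
    ("MultiTouch", 203.99, "household", "Hitachi"),
]
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", rows)

# cursor.description carries the attribute names of the result.
cur = conn.execute("SELECT * FROM Product")
attribute_names = [col[0] for col in cur.description]
tuples = cur.fetchall()
print(attribute_names, len(tuples))
```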
Querying a Database
Find all the students taking CSE490i in Winter, 2000.
S(tructured) Q(uery) L(anguage):
select E.name from Enroll E where E.course = 'CSE490i' and E.quarter = 'Winter, 2000'
Query processor figures out how to answer the query efficiently.
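A minimal sketch of running this query from an application program, assuming an Enroll(name, course, quarter) table. The table layout and the sample rows are hypothetical; only the query shape comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Enroll table and rows, for illustration only.
conn.execute("CREATE TABLE Enroll (name TEXT, course TEXT, quarter TEXT)")
conn.executemany(
    "INSERT INTO Enroll VALUES (?, ?, ?)",
    [
        ("Alice", "CSE490i", "Winter, 2000"),
        ("Bob", "CSE490i", "Spring, 2000"),
        ("Carol", "CSE444", "Winter, 2000"),
    ],
)

# The application states what it wants; the DBMS decides how to compute it.
query = "SELECT E.name FROM Enroll E WHERE E.course = ? AND E.quarter = ?"
students = [row[0] for row in conn.execute(query, ("CSE490i", "Winter, 2000"))]
print(students)
```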
Relational Algebra Operators
Basic binary set operators: tuple sets as input, new set as output
  Result is a table (set) with the same attributes
  Sets must be compatible! For R1(A1,A2,A3) and R2(B1,B2,B3): Domain(Ai) = Domain(Bi)
  Union: all tuples in either R1 or R2
  Intersection: all tuples in both R1 and R2
  Difference: all tuples in R1 but not in R2
  Complement: what’s the universe?
Selection, Projection, Cartesian Product, Join
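The set operators can be tried out directly with Python sets of tuples. This is only a sketch with made-up data: two compatible relations, each with one integer and one string attribute.

```python
# Two compatible relations: same arity, matching attribute domains.
r1 = {(1, "a"), (2, "b"), (3, "c")}
r2 = {(2, "b"), (3, "c"), (4, "d")}

union = r1 | r2          # tuples in either R1 or R2
intersection = r1 & r2   # tuples in both R1 and R2
difference = r1 - r2     # tuples in R1 but not in R2

print(sorted(union), sorted(intersection), sorted(difference))
```

Complement has no direct counterpart here, because it would need an explicit universe of all possible tuples to subtract from.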
Selection σ
Grab a subset of the tuples in a relation that satisfy a given condition
Use and, or, not, >, <… to build the condition
Unary operation: returns a set with the same attributes, but ‘selects’ rows
Projection π
Unary operation, selects columns
Returned schema is different, so returned tuples are not a subset of the original set (contrast with selection)
Eliminates duplicate tuples
Cartesian Product ×
Binary operation
Result of R1 × R2 is the set of tuples combining all elements of R1 with all elements of R2
Schema is the union of Schema(R1) & Schema(R2)
Notice we could do selection on the result to get meaningful info!
Join
Most common (and exciting!) operator
Combines 2 relations, selecting only related tuples
Equivalent to cross product followed by selection followed by projection
Result has all attributes of the two relations
Equijoin: join condition is equality between two attributes
Natural join: equijoin on attributes of the same name; result has only one copy of the join condition attribute
Example: Natural Join Employee Dependents
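Selection, projection, and natural join can be sketched in a few lines of Python, with a relation modelled as a list of dicts. The function names and the Employee and Dependents rows below are hypothetical, chosen only to mirror the example's relation names.

```python
from itertools import product as cartesian


def select(rel, pred):
    """Selection: keep the tuples that satisfy pred; schema is unchanged."""
    return [t for t in rel if pred(t)]


def project(rel, attrs):
    """Projection: keep only attrs and eliminate duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(r1, r2):
    """Cross product, then keep pairs that agree on all shared attributes."""
    out = []
    for t1, t2 in cartesian(r1, r2):
        shared = t1.keys() & t2.keys()
        if all(t1[a] == t2[a] for a in shared):
            out.append({**t1, **t2})  # one copy of each shared attribute
    return out


employee = [{"name": "Ann", "ssn": 1}, {"name": "Bob", "ssn": 2}]
dependents = [
    {"ssn": 1, "dname": "Cal"},
    {"ssn": 1, "dname": "Dee"},
    {"ssn": 3, "dname": "Eve"},
]

joined = natural_join(employee, dependents)
names = project(joined, ["name"])
high_ssn = select(employee, lambda t: t["ssn"] > 1)
print(joined, names, high_ssn)
```

Merging the two dicts is what leaves a single copy of the join attribute, and projecting the join result onto name collapses the two rows for Ann into one.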
Complex Queries Product ( pname, price, category, maker) Purchase (buyer, seller, store, prodname) Company (cname, stock price, country) Person( per-name, phone number, city) Find phone numbers of people who bought gizmos from Fred. Find telephony products that somebody bought
Exercises
Product (pname, price, category, maker)
Purchase (buyer, seller, store, prodname)
Company (cname, stock price, country)
Person (per-name, phone number, city)
Ex #1: Find people who bought telephony products.
Ex #2: Find names of people who bought American products.
Ex #3: Find names of people who bought American products and did not buy French products.
Ex #4: Find names of people who bought American products and live in Seattle.
Ex #5: Find people who bought stuff from Joe or bought products from a company whose stock price is more than $50.
Deductive Databases
Relations viewed as predicates.
Interrelations between relations expressed as “datalog” rules (Horn clauses, without function symbols)
Enames(Name) :- Employee(Name, SSN) [Projection]
Wealthy-Employee(Name) :- Employee(Name, SSN), Salary(SSN, Money), Money > 100000 [Selection]
Ed(Name, Dname) :- Employee(Name, SSN), Employee_Dependents(SSN, Dname) [Join]
Emprelated(Name, Dname) :- Ed(Name, Dname)
Emprelated(Name, Dname) :- Ed(Name, D1), Emprelated(D1, Dname) [Recursion]
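One way to see what the recursive rules compute is naive bottom-up evaluation: apply the rules repeatedly until no new facts appear. The sketch below assumes the recursive rule is read as a transitive closure, Emprelated(N, D) :- Ed(N, D1), Emprelated(D1, D). The Ed facts and the function name are hypothetical.

```python
def emprelated(ed):
    """Naive bottom-up evaluation of the two Emprelated rules to a fixpoint."""
    derived = set(ed)  # base rule: Emprelated(N, D) :- Ed(N, D)
    while True:
        # recursive rule: Emprelated(N, D) :- Ed(N, D1), Emprelated(D1, D)
        new = {
            (n, d)
            for (n, d1) in ed
            for (x, d) in derived
            if x == d1
        } - derived
        if not new:  # nothing new was derived, so the fixpoint is reached
            return derived
        derived |= new


# Hypothetical Ed facts forming a chain ann -> bob -> cal -> dee.
ed = {("ann", "bob"), ("bob", "cal"), ("cal", "dee")}
closure = emprelated(ed)
print(sorted(closure))
```

Because there are no function symbols, the set of derivable facts is finite and the loop always terminates.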
More datalog terminology A datalog program is a set of datalog rules. A program with a single rule is a conjunctive query. We distinguish EDB predicates and IDB predicates EDB’s are stored in the database, appear only in the bodies IDB’s are intensionally defined, appear in both bodies and heads.
Structured Query Language 2/23/2019 11:46 AM 22
SQL Introduction
Standard language for querying and manipulating data: Structured Query Language
Many standards out there: SQL92 (also called SQL2), SQL99 (also called SQL3)
Vendors support various subsets of these (but we’ll only discuss a subset of what they support)
Basic form = syntax on relational algebra (but many other features too):
  SELECT attributes
  FROM relations (possibly multiple, joined)
  WHERE conditions (selections)
Selections σ
SELECT * FROM Company WHERE country = 'USA' AND stockPrice > 50
You can use:
  Attribute names of the relation(s) used in the FROM.
  Comparison operators: =, <>, <, >, <=, >=
  Arithmetic operations: stockPrice*2
  Operations on strings (e.g., “||” for concatenation).
  Lexicographic order on strings.
  Pattern matching: s LIKE p
  Special stuff for comparing dates and times.
Projection π
Select only a subset of the attributes:
SELECT name, stockPrice FROM Company WHERE country = 'USA' AND stockPrice > 50
Rename the attributes in the resulting table:
SELECT name AS company, stockPrice AS price FROM Company WHERE country = 'USA' AND stockPrice > 50
Ordering the Results
SELECT name, stockPrice FROM Company WHERE country = 'USA' AND stockPrice > 50 ORDER BY country, name
Ordering is ascending, unless you specify the DESC keyword. Ties are broken by the second attribute on the ORDER BY list, etc.
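Selection, projection with renaming, and ordering can be combined in one statement. The sketch below runs such a query in SQLite. The Company rows are made up, and ordering by price descending is a variation on the slide's query chosen to show the DESC keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (name TEXT, stockPrice REAL, country TEXT)")
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO Company VALUES (?, ?, ?)",
    [
        ("GizmoWorks", 25.0, "USA"),
        ("Canon", 65.0, "Japan"),
        ("Acme", 80.0, "USA"),
        ("Zenith", 55.0, "USA"),
    ],
)

rows = conn.execute(
    "SELECT name AS company, stockPrice AS price "  # projection with renaming
    "FROM Company "
    "WHERE country = 'USA' AND stockPrice > 50 "    # selection
    "ORDER BY price DESC, company"                  # ties broken by company
).fetchall()
print(rows)
```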
Join
Product (pname, price, category, maker)
Purchase (buyer, seller, store, product)
Company (cname, stock price, country)
Person (per-name, phone number, city)
SELECT per-name, store FROM Person, Purchase WHERE per-name = buyer AND city = 'Seattle' AND product = 'gizmo'
Disambiguating Attributes
Product (name, price, category, maker)
Purchase (buyer, seller, store, product)
Person (name, phone number, city)
Find names of people buying telephony products:
SELECT Person.name FROM Person, Purchase, Product WHERE Person.name = buyer AND product = Product.name AND Product.category = 'telephony'
Tuple Variables Find pairs of companies making products in the same category SELECT product1.maker, product2.maker FROM Product AS product1, Product AS product2 WHERE product1.category = product2.category AND product1.maker <> product2.maker Product ( name, price, category, maker)
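The self-join with tuple variables can be tried in SQLite as sketched below. The Product rows are hypothetical, and DISTINCT and ORDER BY are additions to the slide's query so that the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (name TEXT, price REAL, category TEXT, maker TEXT)"
)
# Hypothetical rows: two makers share the 'gadgets' category.
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("gizmo", 19.99, "gadgets", "GizmoWorks"),
        ("widget", 9.99, "gadgets", "Acme"),
        ("SingleTouch", 149.99, "photography", "Canon"),
    ],
)

# p1 and p2 are tuple variables ranging over the same relation.
pairs = conn.execute(
    "SELECT DISTINCT p1.maker, p2.maker "
    "FROM Product AS p1, Product AS p2 "
    "WHERE p1.category = p2.category AND p1.maker <> p2.maker "
    "ORDER BY p1.maker"
).fetchall()
print(pairs)
```

Each pair shows up in both orders, because the condition p1.maker <> p2.maker is symmetric.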
Union, Intersection, Difference
(SELECT name FROM Person WHERE City = 'Seattle')
UNION
(SELECT name FROM Person, Purchase WHERE buyer = name AND store = 'The Bon')
Similarly, you can use INTERSECT and EXCEPT. Inputs must have the same attribute names (otherwise: rename).
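A small sketch of UNION and EXCEPT in SQLite, with made-up Person and Purchase rows. SQLite does not accept parentheses around the individual SELECTs of a compound query, so they are left out here. The ORDER BY is an addition for deterministic output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (name TEXT, city TEXT)")
conn.execute("CREATE TABLE Purchase (buyer TEXT, store TEXT)")
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [("Ann", "Seattle"), ("Bob", "Portland"), ("Cal", "Seattle")],
)
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?)",
    [("Bob", "The Bon"), ("Ann", "The Bon"), ("Dee", "Sears")],
)

seattle = "SELECT name FROM Person WHERE city = 'Seattle'"
bon = "SELECT name FROM Person, Purchase WHERE buyer = name AND store = 'The Bon'"

# UNION removes duplicates: Ann qualifies on both sides but appears once.
either = conn.execute(f"{seattle} UNION {bon} ORDER BY name").fetchall()
# EXCEPT keeps Seattle residents who did not buy at The Bon.
only_seattle = conn.execute(f"{seattle} EXCEPT {bon} ORDER BY name").fetchall()
print(either, only_seattle)
```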
Subqueries
SELECT Purchase.product FROM Purchase WHERE buyer = (SELECT name FROM Person WHERE social-security-number = '123 - 45 - 6789');
In this case, the subquery returns one value. If it returns more, it’s a run-time error.
Subqueries Returning Relations
Find companies who manufacture products bought by Joe Blow.
SELECT Company.name FROM Company, Product WHERE Company.name = maker AND Product.name IN (SELECT product FROM Purchase WHERE buyer = 'Joe Blow');
You can also use: s > ALL R, s > ANY R, EXISTS R
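The IN subquery can be exercised in SQLite as follows. This is a sketch with hypothetical rows, and each table is trimmed to the columns the query actually touches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (name TEXT, country TEXT)")
conn.execute("CREATE TABLE Product (name TEXT, maker TEXT)")
conn.execute("CREATE TABLE Purchase (buyer TEXT, product TEXT)")
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO Company VALUES (?, ?)",
    [("GizmoWorks", "USA"), ("Canon", "Japan")],
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [("gizmo", "GizmoWorks"), ("SingleTouch", "Canon")],
)
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?)",
    [("Joe Blow", "gizmo"), ("Ann", "SingleTouch")],
)

# The subquery returns a relation (a set of product names) tested with IN.
companies = conn.execute(
    "SELECT Company.name FROM Company, Product "
    "WHERE Company.name = maker "
    "AND Product.name IN "
    "(SELECT product FROM Purchase WHERE buyer = 'Joe Blow')"
).fetchall()
print(companies)
```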
Views
Defining Views
Views are relations, except that they are not physically stored.
They are used mostly in order to simplify complex queries and to define conceptually different views of the database to different classes of users.
View: purchases of telephony products:
CREATE VIEW telephony-purchases AS
  SELECT product, buyer, seller, store
  FROM Purchase, Product
  WHERE Purchase.product = Product.name AND Product.category = 'telephony'
A Different View
CREATE VIEW Seattle-view AS
  SELECT buyer, seller, product, store
  FROM Person, Purchase
  WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer
We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND Product.category = 'shoes'
What’s really happening when we query a view?
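One way to think about querying a view is that the view's definition is substituted into the query. The sketch below checks, on made-up data, that the query over the view returns the same rows as a hand-expanded query over the base tables. The view is named Seattle_view because a hyphen is not valid in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (name TEXT, city TEXT)")
conn.execute(
    "CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT)"
)
conn.execute("CREATE TABLE Product (name TEXT, category TEXT)")
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [("Ann", "Seattle"), ("Bob", "Portland")],
)
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Joe", "The Bon", "sneaker"),
        ("Bob", "Joe", "The Bon", "sneaker"),
        ("Ann", "Sue", "Sears", "gizmo"),
    ],
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [("sneaker", "shoes"), ("gizmo", "gadgets")],
)

conn.execute(
    "CREATE VIEW Seattle_view AS "
    "SELECT buyer, seller, product, store FROM Person, Purchase "
    "WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer"
)

via_view = conn.execute(
    "SELECT name, store FROM Seattle_view, Product "
    "WHERE Seattle_view.product = Product.name AND Product.category = 'shoes'"
).fetchall()

# The same query with the view definition expanded by hand.
expanded = conn.execute(
    "SELECT Product.name, Purchase.store FROM Person, Purchase, Product "
    "WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer "
    "AND Purchase.product = Product.name AND Product.category = 'shoes'"
).fetchall()
print(via_view, expanded)
```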
Updating Views
How can I insert a tuple into a table that doesn’t exist?
CREATE VIEW bon-purchase AS
  SELECT store, seller, product
  FROM Purchase
  WHERE store = 'The Bon Marche'
If we make the following insertion:
INSERT INTO bon-purchase VALUES ('The Bon Marche', 'Joe', 'Denby Mug')
we can simply add the tuple (NULL, 'Joe', 'The Bon Marche', 'Denby Mug') to relation Purchase(buyer, seller, store, product), with NULL for the buyer attribute that the view does not expose.
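How a view insertion reaches the base table depends on the system. As one illustration, SQLite views are read-only, but an INSTEAD OF trigger can translate an insert on the view into an insert on Purchase with NULL for the missing buyer. The trigger name and the underscore in bon_purchase are choices made for this sketch, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT)"
)
conn.execute(
    "CREATE VIEW bon_purchase AS "
    "SELECT store, seller, product FROM Purchase "
    "WHERE store = 'The Bon Marche'"
)

# Translate an insert on the view into an insert on the base table;
# the buyer column is not visible through the view, so it becomes NULL.
conn.execute(
    "CREATE TRIGGER bon_purchase_insert INSTEAD OF INSERT ON bon_purchase "
    "BEGIN "
    "INSERT INTO Purchase (buyer, seller, store, product) "
    "VALUES (NULL, NEW.seller, NEW.store, NEW.product); "
    "END"
)

conn.execute(
    "INSERT INTO bon_purchase (store, seller, product) "
    "VALUES ('The Bon Marche', 'Joe', 'Denby Mug')"
)

base = conn.execute("SELECT * FROM Purchase").fetchall()
view = conn.execute("SELECT * FROM bon_purchase").fetchall()
print(base, view)
```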
Materialized Views
Views whose corresponding queries have been executed and whose data is stored in a separate database
Uses: caching
Issues:
  Using views in answering queries
    Normally, the views are available in addition to the database (so, views are local caches)
    In information integration, views may be the only things we have access to.
      An internet source that specializes in Woody Allen movies can be seen as a view on a database of all movies. Except, there is no database out there which contains all movies.
  Maintaining consistency of materialized views
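The consistency issue can be seen in a small experiment. SQLite has no built-in materialized views, so the sketch below emulates one with CREATE TABLE ... AS SELECT. The Movie table and the refresh-by-recomputation strategy are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (title TEXT, director TEXT)")
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?)",
    [("Annie Hall", "Woody Allen"), ("Alien", "Ridley Scott")],
)

view_query = "SELECT title FROM Movie WHERE director = 'Woody Allen'"

# "Materialize" the view: run its query once and store the result.
conn.execute(f"CREATE TABLE woody_allen_movies AS {view_query}")

# The base table changes, but the stored copy does not follow.
conn.execute("INSERT INTO Movie VALUES ('Manhattan', 'Woody Allen')")
stale = [r[0] for r in conn.execute(
    "SELECT title FROM woody_allen_movies ORDER BY title")]

# Simplest maintenance strategy: throw the copy away and recompute it.
conn.execute("DELETE FROM woody_allen_movies")
conn.execute(f"INSERT INTO woody_allen_movies {view_query}")
fresh = [r[0] for r in conn.execute(
    "SELECT title FROM woody_allen_movies ORDER BY title")]
print(stale, fresh)
```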
Query Optimization
Query Optimization
Goal: Declarative SQL query → Imperative query execution plan

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND Q.city = 'seattle' AND Q.phone > '5430000'

Execution plan (root first):
  project buyer
  select City = 'seattle' and phone > '5430000'
  join Buyer = name (Simple Nested Loops)
    Purchase (Table scan)
    Person (Index scan)

Inputs: the query; statistics about the data (indexes, cardinalities, selectivity factors); available memory
Ideally: want to find the best plan. Practically: avoid the worst plans!
Goal of optimization: to find more efficient plans that compute the same answer.

SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Plan 1 (root first):
  project sname (On-the-fly)
  select bid=100 and rating > 5 (On-the-fly)
  join sid=sid (Simple Nested Loops)
    Reserves
    Sailors

Plan 2 (root first):
  project sname (On-the-fly)
  join sid=sid (Sort-Merge Join)
    select bid=100 on Reserves (Scan; write to temp T1)
    select rating > 5 on Sailors (Scan; write to temp T2)

Plan 3 (root first):
  project sname (On-the-fly)
  select rating > 5 (On-the-fly)
  join sid=sid (with pipelining)
    select bid=100 on Reserves (Use hash index; do not write result to temp)
    Sailors
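The point that different plans compute the same answer at different cost can be illustrated with two hand-written plans for the Sailors query. This is a toy sketch: the rows are made up, both plans use plain nested loops, and cost is measured simply as the number of tuple pairs examined.

```python
# Hypothetical data: Reserves(sid, bid) and Sailors(sid, sname, rating).
reserves = [(1, 100), (2, 100), (3, 101), (4, 100)]
sailors = [(1, "Ann", 7), (2, "Bob", 3), (3, "Cal", 9), (4, "Dee", 8)]


def join_then_filter(reserves, sailors):
    """Nested-loops join over everything, applying selections afterwards."""
    examined, names = 0, []
    for r_sid, bid in reserves:
        for s_sid, sname, rating in sailors:
            examined += 1
            if r_sid == s_sid and bid == 100 and rating > 5:
                names.append(sname)
    return names, examined


def filter_then_join(reserves, sailors):
    """Push both selections below the join, then join the smaller inputs."""
    r100 = [r for r in reserves if r[1] == 100]
    s5 = [s for s in sailors if s[2] > 5]
    examined, names = 0, []
    for r_sid, _ in r100:
        for s_sid, sname, _ in s5:
            examined += 1
            if r_sid == s_sid:
                names.append(sname)
    return names, examined


names_a, cost_a = join_then_filter(reserves, sailors)
names_b, cost_b = filter_then_join(reserves, sailors)
print(names_a, cost_a, names_b, cost_b)
```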
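Relational Algebra Equivalences
Allow us to choose different join orders and to ‘push’ selections and projections ahead of joins.
Selections:
  $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots \sigma_{c_n}(R))$ (Cascade)
  $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
Projections:
  $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R)))$, where $a_1 \subseteq \dots \subseteq a_n$ (Cascade)
Joins:
  $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
  $(R \bowtie S) \equiv (S \bowtie R)$ (Commute)
Show that: $R \bowtie (S \bowtie T) \equiv (T \bowtie R) \bowtie S$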
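One possible derivation of the last equivalence uses only the commutativity and associativity of join stated on this slide:

```latex
\begin{align*}
R \bowtie (S \bowtie T)
  &\equiv (S \bowtie T) \bowtie R && \text{(Commute)} \\
  &\equiv S \bowtie (T \bowtie R) && \text{(Associative)} \\
  &\equiv (T \bowtie R) \bowtie S && \text{(Commute)}
\end{align*}
```

Each step rewrites the whole expression with a single rule, which is how an optimizer can move from one join order to another.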
Optimizing Joins
Q(u,x) :- R(u,v), S(v,w), T(w,x), i.e. R join S join T
Many ways of doing a single join (R join S)
  Symmetric vs. asymmetric join operations
    Nested join, hash join, double pipelined hash join, etc.
  Processing costs alone vs. processing + transfer costs
    Get R and S together vs. get R, then get just the tuples of S that will join with R (“semi-join”)
Many orders in which to do the join
  (R join S) join T
  (S join R) join T
  (T join S) join R, etc.
All with different costs
Determining Join Order
In principle, we need to consider all possible join orderings.
As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
System-R: consider only left-deep join trees.
  Left-deep trees allow us to generate all fully pipelined plans: intermediate results are not written to temporary files.
  Not all left-deep trees are fully pipelined (e.g., SM join).
[Figure: example join trees over relations A, B, C, D]
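To get a feel for how much the left-deep restriction shrinks the search space, the sketch below enumerates left-deep orders and counts all join trees. The function names are hypothetical, and the total-count formula (2(n-1))!/(n-1)! is the standard count of ordered binary trees with n labelled leaves, not something stated on the slide.

```python
from itertools import permutations
from math import factorial


def left_deep_orders(relations):
    """Each permutation (r1, r2, ..., rn) stands for ((r1 join r2) join r3) ..."""
    return list(permutations(relations))


def count_all_join_trees(n):
    """Ordered binary trees with n labelled leaves: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


orders = left_deep_orders("ABCD")
n_left_deep = len(orders)
n_all = count_all_join_trees(4)
print(n_left_deep, n_all)
```

Even with the restriction the count is n!, which is why System-R style optimizers pair it with dynamic programming rather than enumerating every permutation.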
Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree: identify distinct blocks (corresponding to nested sub-queries or views). Query rewrite phase: apply algebraic transformations to yield a cheaper plan. Merge blocks and move predicates between blocks. Optimize each block: join ordering. Complete the optimization: select scheduling (pipelining strategy).
Cost Estimation
For each plan considered, must estimate cost:
  Must estimate cost of each operation in plan tree.
    Depends on input cardinalities.
  Must estimate size of result for each operation in tree!
    Use information about the input relations.
    For selections and joins, assume independence of predicates.
System R cost estimation approach:
  Very inexact, but works OK in practice.
  More sophisticated techniques known now.
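Under the independence assumption, a result size can be estimated by multiplying the input cardinalities by one reduction factor per predicate. The sketch below follows that idea. The function name, the cardinalities, and the reduction factors are all hypothetical numbers chosen for illustration.

```python
def estimate_result_size(cardinalities, reduction_factors):
    """Product of input sizes times the product of per-predicate factors.

    Multiplying the factors together is exactly the independence assumption.
    """
    size = 1.0
    for card in cardinalities:
        size *= card
    for factor in reduction_factors:
        size *= factor
    return size


# Hypothetical statistics for the Reserves/Sailors query.
n_reserves, n_sailors = 100_000, 40_000
factors = [
    1 / 40_000,  # R.sid = S.sid, assuming 40,000 distinct sid values
    1 / 100,     # R.bid = 100, assuming 100 distinct bid values
    0.5,         # S.rating > 5, assuming ratings spread evenly over 1..10
]

estimate = estimate_result_size([n_reserves, n_sailors], factors)
print(round(estimate))
```

When predicates are correlated the true size can be far from this estimate, which is one reason the approach is described as very inexact.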
Key Lessons in Optimization There are many approaches and many details to consider in query optimization Classic search/optimization problem! Not completely solved yet! Main points to take away are: Algebraic rules and their use in transformations of queries. Deciding on join ordering: System-R style (Selinger style) optimization. Estimating cost of plans and sizes of intermediate results.