CS639: Data Management for Data Science

Slides:



Advertisements
Similar presentations
COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 5A Relational Algebra.
Advertisements

Relational Algebra Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
Lecture 07: Relational Algebra
1 Relational Algebra. Motivation Write a Java program Translate it into a program in assembly language Execute the assembly language program As a rough.
Relational Algebra Ch. 7.4 – 7.6 John Ortiz. Lecture 4Relational Algebra2 Relational Query Languages  Query languages: allow manipulation and retrieval.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A Modified by Donghui Zhang.
INFS614, Fall 08 1 Relational Algebra Lecture 4. INFS614, Fall 08 2 Relational Query Languages v Query languages: Allow manipulation and retrieval of.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
FALL 2004CENG 351 File Structures and Data Managemnet1 Relational Algebra.
1 Relational Algebra. 2 Relational Query Languages Query languages: Allow manipulation and retrieval of data from a database. Relational model supports.
1 Relational Model. 2 Relational Database: Definitions  Relational database: a set of relations  Relation: made up of 2 parts: – Instance : a table,
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
1 Lecture 07: Relational Algebra. 2 Outline Relational Algebra (Section 6.1)
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
Rutgers University Relational Algebra 198:541 Rutgers University.
Relational Algebra Chapter 4 - part I. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
One More Normal Form Consider the dependencies: Product Company Company, State Product Is it in BCNF?
Relation Decomposition A, A, … A 12n Given a relation R with attributes Create two relations R1 and R2 with attributes B, B, … B 12m C, C, … C 12l Such.
Relational Algebra.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
Lecture 05 Structured Query Language. 2 Father of Relational Model Edgar F. Codd ( ) PhD from U. of Michigan, Ann Arbor Received Turing Award.
1 Introduction to Database Systems CSE 444 Lecture 20: Query Execution: Relational Algebra May 21, 2008.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
CS 4432query processing1 CS4432: Database Systems II Lecture #11 Professor Elke A. Rundensteiner.
1.1 CAS CS 460/660 Introduction to Database Systems Relational Algebra.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Transactions, Relational Algebra, XML February 11 th, 2004.
CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
Relational Algebra 2. Relational Algebra Formalism for creating new relations from existing ones Its place in the big picture: Declartive query language.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
 CS 405G: Introduction to Database Systems Lecture 6: Relational Algebra Instructor: Chen Qian.
CMPT 258 Database Systems Relational Algebra (Chapter 4)
Lecture 13: Relational Decomposition and Relational Algebra February 5 th, 2003.
The Relational Model Lecture 16. Today’s Lecture 1.The Relational Model & Relational Algebra 2.Relational Algebra Pt. II [Optional: may skip] 2 Lecture.
Lectures 2&3: Introduction to SQL. Lecture 2: SQL Part I Lecture 2.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
CENG 351 File Structures and Data Management1 Relational Model Chapter 3.
Lecture 14: Relational Algebra Projects XML?
Relational Algebra.
COP4710 Database Systems Relational Algebra.
Relational Algebra at a Glance
Lecture 8: Relational Algebra
Relational Algebra Chapter 4, Part A
Database Management Systems (CS 564)
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Relational Algebra.
Lecture 18 The Relational Model.
April 6th – relational algebra
LECTURE 3: Relational Algebra
Relational Algebra Chapter 4 - part I.
Relational Algebra Chapter 4, Sections 4.1 – 4.2
RDBMS RELATIONAL DATABASE MANAGEMENT SYSTEM.
Chapter 2: Intro to Relational Model
Lecture 33: The Relational Model 2
Chapter 2: Intro to Relational Model
Relational Algebra Friday, 11/14/2003.
Chapter 2: Intro to Relational Model
CS639: Data Management for Data Science
Lecture 3: Relational Algebra and SQL
Syllabus Introduction Website Management Systems
CENG 351 File Structures and Data Managemnet
Relational Algebra & Calculus
Relational Algebra Chapter 4 - part I.
Lecture 11: Functional Dependencies
Presentation transcript:

CS639: Data Management for Data Science Lecture 4: Relational Algebra Theodoros Rekatsinas

Announcements Assignment 1 Hints and Grading

Today’s Lecture The Relational Model & Relational Algebra Relational Algebra Pt. II

1. The Relational Model & Relational Algebra

What you will learn about in this section The Relational Model Relational Algebra: Basic Operators

Levels of abstraction

Database maps internally into this Motivation The Relational model is precise, implementable, and we can operate on it (query/update, etc.) Database maps internally into this procedural language.

The Relational Model: Schemata Relational Schema: Students(sid: string, name: string, gpa: float) Relation name String, float, int, etc. are the domains of the attributes Attributes

The Relational Model: Data Student sid name gpa 001 Bob 3.2 002 Joe 2.8 003 Mary 3.8 004 Alice 3.5 An attribute (or column) is a typed data entry present in each tuple in the relation The number of attributes is the arity of the relation

The Relational Model: Data Student sid name gpa 001 Bob 3.2 002 Joe 2.8 003 Mary 3.8 004 Alice 3.5 The number of tuples is the cardinality of the relation A tuple or row (or record) is a single entry in the table having the attributes specified by the schema

The Relational Model: Data Student sid name gpa 001 Bob 3.2 002 Joe 2.8 003 Mary 3.8 004 Alice 3.5 In practice DBMSs relax the set requirement, and use multisets. A relational instance is a set of tuples all conforming to the same schema

To Reiterate A relational schema describes the data that is contained in a relational instance Let R(f1:Dom1,…,fm:Domm) be a relational schema then, an instance of R is a subset of Dom1 x Dom2 x … x Domn In this way, a relational schema R is a total function from attribute names to types

Note here that order matters, attribute name doesn’t… One More Time A relational schema describes the data that is contained in a relational instance A relation R of arity t is a function: R : Dom1 x … x Domt  {0,1} I.e. returns whether or not a tuple of matching types is a member of it Then, the schema is simply the signature of the function Note here that order matters, attribute name doesn’t… We’ll (mostly) work with the other model (last slide) in which attribute name matters, order doesn’t!

A relational database A relational database schema is a set of relational schemata, one for each relation A relational database instance is a set of relational instances, one for each relation Two conventions: We call relational database instances as simply databases We assume all instances are valid, i.e., satisfy the domain constraints

Remember the CMS Relation DB Schema Relation Instances Note that the schemas impose effective domain / type constraints, i.e. Gpa can’t be “Apple” Relation DB Schema Students(sid: string, name: string, gpa: float) Courses(cid: string, cname: string, credits: int) Enrolled(sid: string, cid: string, grade: string) Sid Name Gpa 101 Bob 3.2 123 Mary 3.8 cid cname credits 564 564-2 4 308 417 2 Relation Instances sid cid Grade 123 564 A Students Courses Enrolled

2nd Part of the Model: Querying SELECT S.name FROM Students S WHERE S.gpa > 3.5; We don’t tell the system how or where to get the data- just what we want, i.e., Querying is declarative “Find names of all students with GPA > 3.5” To make this happen, we need to translate the declarative query into a series of operators… we’ll see this next!

Virtues of the model Physical independence (logical too), Declarative Simple, elegant clean: Everything is a relation

Relational Algebra

Relational Algebra (RA) Plan RDBMS Architecture How does a SQL engine work ? SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution Declarative query (from user) Translate to relational algebra expression Find logically equivalent- but more efficient- RA expression Execute each operator of the optimized plan!

Relational Algebra (RA) Plan RDBMS Architecture How does a SQL engine work ? SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!

Relational Algebra (RA) Five basic operators: Selection: s Projection: P Cartesian Product:  Union:  Difference: - Derived or auxiliary operators: Intersection, complement Joins (natural,equi-join, theta join, semi-join) Renaming: r Division

Keep in mind: RA operates on sets! RDBMSs use multisets, however in relational algebra formalism we will consider sets! Also: we will consider the named perspective, where every attribute must have a unique name attribute order does not matter… Now on to the basic RA operators…

1. Selection (𝜎) 𝜎 𝑔𝑝𝑎 >3.5 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠) SQL: Students(sid,sname,gpa) SQL: Returns all tuples which satisfy a condition Notation: sc(R) Examples sSalary > 40000 (Employee) sname = “Smith” (Employee) The condition c can be =, <, , >, , <> SELECT * FROM Students WHERE gpa > 3.5; RA: 𝜎 𝑔𝑝𝑎 >3.5 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

Another example: SSN Name Salary 1234545 John 20000 5423341 Smith 600000 4352342 Fred 500000 sSalary > 40000 (Employee) SSN Name Salary 5423341 Smith 600000 4352342 Fred 500000

2. Projection (Π) Π 𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠) SQL: Students(sid,sname,gpa) SQL: Eliminates columns, then removes duplicates Notation: P A1,…,An (R) Example: project social-security number and names: P SSN, Name (Employee) Output schema: Answer(SSN, Name) SELECT DISTINCT sname, gpa FROM Students; RA: Π 𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

SSN Name Salary 1234545 John 200000 5423341 600000 4352342 Another example: P Name,Salary (Employee) Name Salary John 200000 600000

Note that RA Operators are Compositional! Students(sid,sname,gpa) SELECT DISTINCT sname, gpa FROM Students WHERE gpa > 3.5; Π 𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎 ( 𝜎 𝑔𝑝𝑎>3.5 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)) 𝜎 𝑔𝑝𝑎>3.5 (Π 𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎 ( 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)) How do we represent this query in RA? Are these logically equivalent?

3. Cross-Product (×) 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒 SQL: Students(sid,sname,gpa) People(ssn,pname,address) SQL: Each tuple in R1 with each tuple in R2 Notation: R1  R2 Example: Employee  Dependents Rare in practice; mainly used to express joins SELECT * FROM Students, People; RA: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒

× 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒 Another example: People Students ssn pname address 1234545 John 216 Rosse 5423341 Bob 217 Rosse sid sname gpa 001 John 3.4 002 Bob 1.3 × 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒 ssn pname address sid sname gpa 1234545 John 216 Rosse 001 3.4 5423341 Bob 217 Rosse 002 1.3

Renaming (𝜌) SQL: Changes the schema, not the instance Students(sid,sname,gpa) SQL: Changes the schema, not the instance A ‘special’ operator- neither basic nor derived Notation: r B1,…,Bn (R) Note: this is shorthand for the proper form (since names, not order matters!): r A1B1,…,AnBn (R) SELECT sid AS studId, sname AS name, gpa AS gradePtAvg FROM Students; RA: 𝜌 𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠) We care about this operator because we are working in a named perspective

𝜌 𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠) Another example: Students sid sname gpa 001 John 3.4 002 Bob 1.3 𝜌 𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠) Students studId name gradePtAvg 001 John 3.4 002 Bob 1.3

Natural Join (⋈) 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒 SQL: RA: Notation: R1 ⋈ R2 Students(sid,name,gpa) People(ssn,name,address) SQL: Notation: R1 ⋈ R2 Joins R1 and R2 on equality of all shared attributes If R1 has attribute set A, and R2 has attribute set B, and they share attributes A⋂B = C, can also be written: R1 ⋈𝐶 R2 Our first example of a derived RA operator: Meaning: R1 ⋈ R2 = PA U B(sC=D( 𝜌 𝐶→𝐷 (R1)  R2)) Where: The rename 𝜌 𝐶→𝐷 renames the shared attributes in one of the relations The selection sC=D checks equality of the shared attributes The projection PA U B eliminates the duplicate common attributes SELECT DISTINCT ssid, S.name, gpa, ssn, address FROM Students S, People P WHERE S.name = P.name; RA: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒

⋈ 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈𝑃𝑒𝑜𝑝𝑙𝑒 Another example: Students S People P sid S.name gpa 001 John 3.4 002 Bob 1.3 ssn P.name address 1234545 John 216 Rosse 5423341 Bob 217 Rosse ⋈ 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈𝑃𝑒𝑜𝑝𝑙𝑒 sid S.name gpa ssn address 001 John 3.4 1234545 216 Rosse 002 Bob 1.3 5423341

Natural Join Given schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ? Given R(A, B, C), S(D, E), what is R ⋈ S ? Given R(A, B), S(A, B), what is R ⋈ S ?

Example: Converting SQL Query -> RA Students(sid,sname,gpa) People(ssn,sname,address) SELECT DISTINCT gpa, address FROM Students S, People P WHERE gpa > 3.5 AND sname = pname; Π 𝑔𝑝𝑎,𝑎𝑑𝑑𝑟𝑒𝑠𝑠 ( 𝜎 𝑔𝑝𝑎>3.5 (𝑆⋈𝑃))

Logical Equivalence of RA Plans Given relations R(A,B) and S(B,C): Here, projection & selection commute: 𝜎 𝐴=5 (Π 𝐴 (𝑅))= Π 𝐴 ( 𝜎 𝐴=5 (𝑅)) What about here? 𝜎 𝐴=5 (Π 𝐵 (𝑅)) ?= Π 𝐵 ( 𝜎 𝐴=5 (𝑅))

1. Union () and 2. Difference (–) R1  R2 Example: ActiveEmployees  RetiredEmployees R1 – R2 AllEmployees -- RetiredEmployees R1 R2 R1 R2

What about Intersection () ? It is a derived operator R1  R2 = R1 – (R1 – R2) Also expressed as a join! Example UnionizedEmployees  RetiredEmployees R1 R2

RA Expressions Can Get Complex! P name buyer-ssn=ssn pid=pid seller-ssn=ssn P ssn P pid sname=fred sname=gizmo Person Purchase Person Product

RA has Limitations ! Cannot compute “transitive closure” Find all direct and indirect relatives of Fred Cannot express in RA !!! Need to write C program, use a graph engine, or modern SQL… Name1 Name2 Relationship Fred Mary Father Joe Cousin Bill Spouse Nancy Lou Sister