Database Systems Relational Algebra
Query Languages A query – An expression that enables extracting data from a database A query language – A language for expressing queries
Relational Algebra Relational algebra is a language that enables expressing queries over relational databases The syntax is similar to algebra, but the expressions operate on tables The result of such an algebraic expression (query) is a table (relation) Unless stated otherwise, we assume that all relations are sets and all algebraic operators operate on sets
Algebraic Operators There are 5 basic operators plus 2 technical ones – Unary operators: Projection, Selection – Binary operators: Cartesian product, Subtraction, Union – Renaming attributes: a technical operation to enable composing operators – Assignment: a technical operation to enable execution in stages Complex operators that can be obtained by composing basic operators: – Intersection, Join (including its variants), Division, etc. For performance reasons, some complex operators might have direct implementations
Projection Projection eliminates some of the attributes from the records of a given relation In terms of tables manipulation, this means removing columns that do not appear in the projection index and then eliminating multiple lines formed due to the column elimination RomanHebNo i א 1 ii ב 2 T= RomanHeb i א ii ב π Heb,Roman T=
Selection Let T=T(A 1,…,A m ) be a table with schema consisting of attributes A 1,…,A m. For an expression θ, σ θ T consists of all records in T satisfying the expression θ θ may consist of the following: – Comparisons (through the operators =,,≥,≤,≠) between an attribute and a constant or between two attributes. When the attribute is a set, it is also possible to use set operators like etc. – Boolean operators, e.g., (A 1 =“Cat”) (A 2 ≤8) RomanHebNo i א 1 ii ב 2 iii ג 3 T=T= RomanHebNo i א 1 ii ב 2 σ No≤2 T=
Union, Subtraction and Intersection These operations are applied only to relations of the same schema and are identical to their counterparts from set theory – Union S T: obtains all records that appear in either S or T – Subtraction S\T: obtains all records that appear in S but not in T – Intersection: S T: obtains all records that appear in both S and T Intersection is not a basic operator – it can be expressed as S\(S\T) Examples: ColorAnimal BrownHorse WhiteGoat ColorAnimal BlackDog WhiteGoat ColorAnimal BrownHorse WhiteGoat BlackDog ColorAnimal BrownHorse ColorAnimal WhiteGoat T=T= S=S= ST=ST= S\T= ST=ST=
Cartesian Product A Cartesian product yields all combinations of records from the first relation with records from the second relation In terms of tables, we take all concatenations of rows from the first table with rows from the second table Whenever S and T have attributes with the same name, we distinguish between them either by adding the table name as a prefix of the attribute name, e.g., “S.Name” and “T.Name”, or by adding a sequence number, e.g., “Name1” and “Name2” ColorAnimal BlackDog WhiteGoat T=T= Size Big Small S=S= ColorAnimalSize BlackDogBig WhiteGoatBig BlackDogSmall WhiteGoatSmall S x T= What happens if one of the relations is empty?
Renaming This is not an algebraic operation, but rather a technical helper operation that is used to compose complex operations (examples appear shortly) For a given table T=T(A 1,…,A m ) with a schema consisting of attributes A 1,…,A m, the operator ρ A 1 → B 1,…, A m → B m (T) returns an identical table in which the attribute names were changed to B 1,…,B m ColorAnimal BlackDog WhiteGoat T=T= צבעחיה BlackDog WhiteGoat ρ Animal→ חיה,Color → צבע (T)=
Basic Operators The five operators - Projection, Selection, Cartesian Product, Subtraction, and Union – are basic, i.e., – None of these operators can be obtained from the other four – More complex operators (e.g., Join, Division, Intersection) can be obtained by composing some of the basic ones How can we show that an operator is basic? – We need to find a property that is satisfied by this operator and cannot be obtained by any composition of the others
Proving that an Operator is Basic Claim: Projection cannot be expressed by composing Select, Cartesian Product, Subtraction, and Union Proof sketch: Let R be a relation with n attributes A 1,…,A n. Then π A 1 R results in a relation with fewer than n attributes. However, the result of applying any of the operators Select, Cartesian Product, Subtraction, and Union on R (and possibly other relations) yields at least n attributes. – This can be shown by induction on the number of operators in the expression How can this proof be adjusted to show that Cartesian Product is a basic operator?
θ-Join Given two relations S(A 1,…,A n ) and T(B 1,…,B m ) and an expression θ on the attributes A 1,…,A n,B 1,…,B m, denote by S ⋈ θ T the result of the algebraic expression σ θ (SxT) Example BA DC DCBA S=S=T=T= S ⋈ B>C T= An SQL server walks into a bar. He approaches two tables at the far corner asking them: “do you mind if I join you”
Natural Join A very common operation on databases For the relations S(A 1,…,A n,B 1,…,B m ) and T(B 1,…,B m,C 1,…,C k ), denote by S ⋈ T the relation that includes all possible combinations of a record from S with a record from T whose common attributes are the same, in which only a single attribute (column) is kept for each pair of common attributes. More precisely, S ⋈ T= A 1,...,A n,S.B 1,...,S.B m,C 1,...,C k (S ⋈ (S.B 1 =T.B 1 ) ... (S.B m =T.B m ) T) ColorAnimal BlackDog WhiteGoat PinkElephant S=S= ShadeColor LightBlue DarkBlue LightPink DarkPink T=T= ShadeColorAnimal LightPinkElephant DarkPinkElephant S⋈T=S⋈T=
Semi-Join For the relations S(A 1,…,A n,B 1,…,B m ) and T(B 1,…,B m,C 1,…,C k ), denote by S ⋉ T the relation that includes all records in S for which there exists a record in T whose common attributes are the same More precisely, S ⋉ T= A 1,...,A n,B 1,...,B m (S ⋈ T) For performance reasons, this is usually implemented directly ColorAnimal BlackDog WhiteGoat PinkElephant S=S= ShadeColor LightBlue DarkBlue LightPink DarkPink T=T= ColorAnimal PinkElephant S⋉T=S⋉T=
Division For the relations S(A 1,…,A n,B 1,…,B m ) and T(B 1,…,B m ) (i.e., the attributes of T are a subset of the attributes of S), denote by S T the relation R(A 1,…,A n ) consisting of all records for which there exists a record in S corresponding to all records in T More precisely, S T is the maximal relation R such that RxT S I.e., S T= A 1,...,A n S\ A 1,...,A n ((( A 1,...,A n S) T)\S) ColorAnimal BlackDog WhiteDog PinkElephant S=S= Color Black White T=T= Animal Dog ST=ST=
Division - Example We’d like to obtain from table A the supplier number of all suppliers that sell all parts in table B pnosno P1S1 P2S1 P3S1 P4S1 P1S2 P2S2 P2S3 P2S4 P4S4 A= pno P2 sno S1 S2 S3 S4 AB=AB= B=
Division - Example We’d like to obtain from table A the supplier number of all suppliers that sell all parts in table B pnosno P1S1 P2S1 P3S1 P4S1 P1S2 P2S3 P2S4 P4S4 A= pno P2 P4 sno S1 S4 AB=AB= B=
Division - Example We’d like to obtain from table A the supplier number of all suppliers that sell all parts in table B pnosno P1S1 P2S1 P3S1 P4S1 P1S2 P2S3 P2S4 P4S4 A= pno P1 P2 P4 sno S1 AB=AB= B=
Relational Algebra – Summarizing Example Recall the train operation example Station Height S_Name S_Type Line L_Type L_Num Direction Serves Km Train T_Num Days Service T_Category Class Food Gives Arrives Platform D_TimeA_Time
Relational Algebra – Summarizing Example What Tables do we Extract? – What columns should exist for the relationship set “Serves”? – The key S_Name (of the “Station” entity set) – The key’s attributes L_Num and Direction (of “Line”) These triplet would serve as the key for “Serves” In addition, a column for the relation attribute Km – What columns should exist for the relationship set “Arrives”? The key T_Num of the entity set “Train” The key’s attributes for the aggregated relationship set “Serves”, i.e., S_Name, L_num, and Direction The three attributes of the relationship set “Arrives” itself – Platform, A_time, D_time
The Schemas From the previous slide (underlined attribute names represent keys) – Serves(S_Name, L_Num, Direction, Km) – Arrives(T_Num, S_Name, L_Num, Direction, Platform, – D_Time, A_Time) We will represent the multiple value attribute as a separate relation – Station(S_Name, Height) – Station_Type(S_Name, S_Type)
Sample Queries Which stations are served by the line 1-South? – Here, all information is in the table/relation “Serves” S_Name ( (L_Num=1) (Direction=“south”) (Serves)) Which lines serve stations below sea level? – Here, we need to join “Serves” and “Station” L_Num,Direction ( Height<0 (Station ⋈ Serves)) – We can also use semi-join L_Num,Direction (Serves ⋉ S_Name ( Height<0 (Station)))
Sample Queries What stations are served by more than one line? – Here, we need to examine two lines from “Serves” at a time S_Name ( (S_Name=S) ((L_Num L) (Direction D)) ( S_Name → S, L_Num → L, Direction → D, Km → K (Serves) Serves)) Suppose we do not care about different directions of the same line? S_Name ( (S_Name=S) (L_Num L) ( S_Name → S, L_Num → L, Direction → D, Km → K (Serves) Serves)) Which stations serve exactly one line (in any direction)? S_Name (Serves) \ S_Name ( (S_Name=S) (L_Num L) ( S_Name → S, L_Num → L, Direction → D, Km → K (Serves) Serves))
Sample Queries What is the name of the highest station? – It is in fact easier to find all other stations R = S_Name ( (Height<H) (Station S_Name → N, Height → H Station)) – Now, we can complete the query S_Name (Station) \ R
Example of Using Division Which trains (by number) serve all stations? T_Num,S_Name (Arrives) S_Name (Station) What if there are stations that do not appear in any line? How can we avoid referring to them? T_Num,S_Name (Arrives) S_Name (Serves) What if there are stations that serve a line but no train stops there? How can we ignore them? T_Num,S_Name (Arrives) S_Name (Arrives)
More Complex Examples Which lines share “change” stations? – First, we will find all couples for which the station is a “change” station R = (Line ⋈ Serves) ⋉ S_Name ( S_Type =“change” (Station_Type)) – Now, the Cartesian product RxR would include all combinations of couples for which the station is a “change” station. Suppose we distinguish between attributes of each using indices, e.g., S_Name 1,S_Name 2, etc. – We would like to select from the product only couples for which the station is the same S = (S_Name 1 =S_Name 2 ) (L_Num 1 L_Num 2 ) (R R) – Finally, we project to obtain only the attributes we are intereted in T = L_Type 1,L_Num 1,Direction 1,L_Type 2,L_Num 2,Direction 2 (S)
Which Queries Cannot be Expressed in RA? Aggregate functions: – How many lines travel “North”? – What is the average distance between stations in Line 1- South? – How many trains stop in each station on Monday? – These functions operate on an unknown number of parameters (counting, sum, average, …) from the set of values obtained from the relation – There are several extensions to RA that enable such queries as well as in SQL
Inexpressible Queries - Continued Transitive Closure – The following query cannot be expressed in RA: What are all the stations that can be reached from station S in a finite number of train changes? – Notice, for any given constant k, we can express the following query What are all the stations that can be reached from station S in at most k train changes? – The key here is whether k is bounded and given or not These type of queries cannot be expressed in standard ANSI SQL either without the help of the hosting language In contrast, some other query languages supports recursive queries, and in particular this type of queries
Attribute-less Relations In order to facilitate handling yes/no questions, and to simplify some other queries, often RA is extended to allow relations with no attributes (tables with no rows) What can such a relation contain? – It can be empty – It can include a single empty line This represents the value “true” vs. “false” in an empty relation, or “exists” vs. “non-exist” in an empty relation
Generalizing Algebraic Operators Projection – For a relation R, the empty set is considered a subset of the set of its attributes and the corresponding projection is denoted π λ R – The result is the empty relation if R is an empty relation and the single empty record if R is not empty Cartesian Product – When S has no attributes, RxS results in an empty relation if S is empty and in R if S includes the empty record Division – When R and S have the same attributes, R S includes the empty record if S R and is empty otherwise
Queries Equivalence Two expressions are equivalent if they return the same result when applied to any possible database content Why is this important? – For performance reasons, we may be able to translate one query into an equivalent one that is easier/faster to evaluate Example: – For the relation R(A,B,C), which of the following equivalent expressions can be evaluated more efficiently? A (R) A ( A,B ( R)) A ( A,C ( R))
Example Given the relations R(A,B), S(A,B), we will show that A R A S is equivalent to A (R S) – We will show that A R A S A (R S) – As well as that A R A S A (R S)
Example - Proof (1/2) We show that A R A S A (R S) For any record t, if t ( A R A S) then by the definition of union, either t A R or t A S (or both) Hence, either there exists a u R such that u[A]=t[A] or there exists a u S such that u[A]=t[A] (or both) In other words, t A (R S)
Example - Proof (2/2) We show that A R A S A (R S) If there exists a t A (R S), then by definition there exists a u R S such that u[A]=t[A] Hence, either there is a u R such that u[A]=t[A] or there is a u S such that u[A]=t[A] (or both) In other words, t ( A R A S)
On the Importance of Equivalence A typical SQL query One needs to ensure that the optimization does not alter the semantics SELECT DISTINCT A FROM R,S WHERE R.B=S.B AND C=5; Direct translation A ( (R.B=S.B) (C=5) (R S)) Optimization A (R ⋉ B ( C=5 S)))
Outerjoin Natural Join suffers from information loss – that is, the result does not include records that do not have a corresponding record in the second table – To that end, the missing attributes are padded with null More precisely, for the relations S(A 1,…,A n,B 1,…,B m ) and T(B 1,…,B m,C 1,…,C k ), denote by S ⋈ outer T the natural join of S and T such that we append null attributes to each record in S that does not have a corresponding record in T and vice versa That is, S ⋈ outer T=(S ⋈ T) ((S\(S ⋉ T)) C 1,...,C k T null ) ( A 1,...,A n S null (T\(T ⋉ S))) where T null is a relation with the same schema as T that includes one record with null values and similarly S null is a record with the same schema as S that includes one record with null values
Example ColorAnimal BlackDog WhiteGoat PinkElephant S= ShadeColor LightBlue DarkBlue LightPink DarkPink T= ShadeColorAnimal LightPinkElephant DarkPinkElephant NullBlackDog NullWhiteGoat LightBlueNull DarkBlueNull S ⋈ outer T= What happens when one of the relations is empty? There are also left-outer join in which only the records of the left relation are padded and right-outer join in which only the records of the right relation are padded
Null Values Once we allow null values, the definition of the RA operators need to be generalized to encompass them This definition is complicated and outside the scope of this course
Bag/Multi-set Semantics Until now, we assumed that relations are sets In practice, and in particular in SQL, we may wish to allow relations that include multiple entries of the same record To that end, the operators should be generalized accordingly
RA Operators in Bag Semantics Projection – we eliminate attributes without eliminating duplicates Selection – we return the corresponding records without eliminating duplicates Cartesian product RxS – if a record u appears n times in R and a record v appears m times in S, then their concatenation appears nm times in RxS Union S T – if a record u appears n times in R and m times in S, it will appear n+m times in S T Subtraction S\T – if a record u appears n times in R and m times in S, it will appear max{0,n-m} times in S\T Intersection S T – if a record u appears n times in R and m times in S, it will appear min{n,m} times in S T Operator δ – explicit duplicate elimination Unless otherwise stated, we assume a set semantics in this course