
Database Systems Relational Algebra assoc. prof., dr. Vladimir Dimitrov e-mail: cht@fmi.uni-sofia.bg web: is.fmi.uni-sofia.bg

Contents
An Example Database Schema
An Algebra of Relational Operations
    Basics of Relational Algebra
    Set Operations on Relations
    Projection
    Selection
    Cartesian Product
    Natural Joins
    Theta-Joins
    Combining Operations to Form Queries
    Renaming
    Dependent and Independent Operations
    A Linear Notation for Algebraic Expressions
Relational Operations on Bags
    Why Bags?
    Union, Intersection, and Difference of Bags
    Projection of Bags
    Selection on Bags
    Product of Bags
    Joins of Bags
Extended Operators of Relational Algebra
    Duplicate Elimination
    Aggregation Operators
    Grouping
    The Grouping Operator
    Extending the Projection Operator
    The Sorting Operator
    Outerjoins
Constraints on Relations
    Relational Algebra as a Constraint Language
    Referential Integrity Constraints
    Additional Constraint Examples
Summary

Relational Algebra We begin a study of database programming, that is, how the user can ask queries of the database and can modify the contents of the database. Our focus is on the relational model, and in particular on a notation for describing queries about the content of relations called "relational algebra." While ODL uses methods that, in principle, can perform any operation on data, and the E/R model does not embrace a specific way of manipulating data, the relational model has a concrete set of "standard" operations on data. Surprisingly, these operations are not "Turing complete" the way ordinary programming languages are. Thus, there are operations we cannot express in relational algebra that could be expressed, for instance, in ODL methods written in C++. This situation is not a defect of the relational model or relational algebra, because the advantage of limiting the scope of operations is that it becomes possible to optimize queries written in a very high level language such as SQL. We begin by introducing the operations of relational algebra. This algebra formally applies to sets of tuples, i.e., relations. However, commercial DBMS's use a slightly different model of relations, which are bags, not sets. That is, relations in practice may contain duplicate tuples. While it is often useful to think of relational algebra as a set algebra, we also need to be conscious of the effects of duplicates on the results of the operations in relational algebra. In the final section of this chapter, we consider the matter of how constraints on relations can be expressed. Later we will see the languages and features that today's commercial DBMS's offer the user. The operations of relational algebra are all implemented by the SQL query language. These algebraic operations also appear in the OQL language, an object-oriented query language based on the ODL data model.

An Example Database Schema As we begin our focus on database programming in the relational model, it is useful to have a specific schema on which to base our examples of queries. Our chosen database schema draws upon the running example of movies, stars, and studios, and it uses normalized relations similar to the ones that we developed earlier. However, it includes some attributes that we have not used previously in examples, and it includes one relation, MovieExec, that has not appeared before. The purpose of these changes is to give us some opportunities to study different data types and different ways of representing information.

Example database schema about movies
Movie(TITLE: string, YEAR: integer, length: integer, inColor: boolean, studioName: string, producerC#: integer)
StarsIn(MOVIETITLE: string, MOVIEYEAR: integer, STARNAME: string)
MovieStar(NAME: string, address: string, gender: char, birthdate: date)
MovieExec(name: string, address: string, CERT#: integer, netWorth: integer)
Studio(NAME: string, address: string, presC#: integer)

An Example Database Schema
Our schema has five relations. The attributes of each relation are listed, along with the intended domain for that attribute. The key attributes for a relation are shown in capitals, although when we refer to them in text, they will be lower-case as they have been heretofore. For instance, all three attributes together form the key for relation StarsIn. Relation Movie has six attributes: title and year together constitute the key for Movie, as they have previously. Attribute title is a string, and year is an integer.
The major modifications to the schema compared with what we have seen so far are:
1. There is a notion of a certificate number for movie executives, that is, studio presidents and movie producers. This certificate is a unique integer that we imagine is maintained by some external authority, perhaps a registry of executives or a "union."
2. We use certificate numbers as the key for movie executives, although movie stars do not always have certificates and we shall continue to use name as the key for stars. That decision is probably unrealistic, since two stars could have the same name, but we take this road in order to illustrate some different options.
3. We introduced the producer as another property of movies. This information is represented by a new attribute, producerC#, of relation Movie. This attribute is intended to be the certificate number of the producer. Producers are expected to be movie executives, as are studio presidents. There may also be other executives in the MovieExec relation.
4. Attribute filmType of Movie has been changed from an enumerated type to a boolean-valued attribute called inColor: true if the movie is in color and false if it is in black and white.
5. The attribute gender has been added for movie stars. Its type is "character," either M for male or F for female. Attribute birthdate, of type "date" (a special type supported by many commercial database systems, or just a character string if we prefer), has also been added.
6. All addresses have been made strings, rather than pairs consisting of a street and city. The purpose is to make addresses in different relations easily comparable and to simplify operations on addresses.

An Algebra of Relational Operations To begin our study of operations on relations, we shall learn about a special algebra, called relational algebra, that consists of some simple but powerful ways to construct new relations from given relations. When the given relations are stored data, then the constructed relations can be answers to queries about this data. The development of an algebra for relations has a history, which we shall follow roughly in our presentation. Initially, relational algebra was proposed by E. F. Codd as an algebra on sets of tuples (i.e., relations) that could be used to express typical queries about those relations. It consisted of five operations on sets: union, set difference, and Cartesian product, with which you might already be familiar, and two unusual operations, selection and projection. To these, several operations that can be defined in terms of these five were added; varieties of "join" are the most important. When DBMS's that used the relational model were first developed, their query languages largely implemented the relational algebra. However, for efficiency purposes, these systems regarded relations as bags, not sets. That is, unless the user asked explicitly that duplicate tuples be condensed into one (i.e., that "duplicates be eliminated"), relations were allowed to contain duplicates. Thus, we shall study the same relational operations on bags and see the changes necessary. Another change to the algebra that was necessitated by commercial implementations of the relational model is that several other operations are needed. Most important is a way of performing aggregation, e.g., finding the average value of some column of a relation. We shall study these additional operations.

Why Bags Can Be More Efficient Than Sets As a simple example of why bags can lead to implementation efficiency, if you take the union of two relations but do not eliminate duplicates, then you can just copy the relations to the output. If you insist that the result be a set, you have to sort the relations, or do something similar to detect identical tuples that come from the two relations.

Basics of Relational Algebra An algebra, in general, consists of operators and atomic operands. For instance, in the algebra of arithmetic, the atomic operands are variables like x and constants like 15. The operators are the usual arithmetic ones: addition, subtraction, multiplication, and division. Any algebra allows us to build expressions by applying operators to atomic operands and/or other expressions of the algebra. Usually, parentheses are needed to group operators and their operands. For instance, in arithmetic we have expressions such as (x + y) * z or ((x + 7)/(y - 3)) + x. Relational algebra is another example of an algebra. Its atomic operands are: 1. Variables that stand for relations. 2. Constants, which are finite relations.

Basics of Relational Algebra
As we mentioned, in the classical relational algebra, all operands and the results of expressions are sets. The operations of the traditional relational algebra fall into four broad classes:
a) The usual set operations (union, intersection, and difference) applied to relations.
b) Operations that remove parts of a relation: "selection" eliminates some rows (tuples), and "projection" eliminates some columns.
c) Operations that combine the tuples of two relations, including "Cartesian product," which pairs the tuples of two relations in all possible ways, and various kinds of "join" operations, which selectively pair tuples from two relations.
d) An operation called "renaming" that does not affect the tuples of a relation, but changes the relation schema, i.e., the names of the attributes and/or the name of the relation itself.
We shall generally refer to expressions of relational algebra as queries. While we don't yet have the symbols needed to show many of the expressions of relational algebra, you should be familiar with the operations of group (a), and thus recognize R ∪ S as an example of an expression of relational algebra. R and S are atomic operands standing for relations, whose sets of tuples are unknown. This query asks for the union of whatever tuples are in the relations named R and S.

Set Operations on Relations
The three most common operations on sets are union, intersection, and difference. We assume the reader is familiar with these operations, which are defined as follows on arbitrary sets R and S:
1. R ∪ S, the union of R and S, is the set of elements that are in R or S or both. An element appears only once in the union even if it is present in both R and S.
2. R ∩ S, the intersection of R and S, is the set of elements that are in both R and S.
3. R - S, the difference of R and S, is the set of elements that are in R but not in S. Note that R - S is different from S - R; the latter is the set of elements that are in S but not in R.

Set Operations on Relations When we apply these operations to relations, we need to put some conditions on R and S: 1. R and S must have schemas with identical sets of attributes, and the types (domains) for each attribute must be the same in R and S. 2. Before we compute the set-theoretic union, intersection, or difference of sets of tuples, the columns of R and S must be ordered so that the order of attributes is the same for both relations. Sometimes we would like to take the union, intersection, or difference of relations that have the same number of attributes, with corresponding domains, but that use different names for their attributes. If so, we may use the renaming operator, to be discussed later, to change the schema of one or both relations and give them the same set of attributes.

Examples
Relation R
name            address                       gender  birthdate
Carrie Fisher   123 Maple St., Hollywood      F       9/99/99
Mark Hamill     456 Oak Rd., Brentwood        M       8/88/88

Relation S
name            address                       gender  birthdate
Carrie Fisher   123 Maple St., Hollywood      F       9/99/99
Harrison Ford   789 Palm Dr., Beverly Hills   M       7/77/77

R ∩ S
name            address                       gender  birthdate
Carrie Fisher   123 Maple St., Hollywood      F       9/99/99

Examples
R ∪ S
name            address                       gender  birthdate
Carrie Fisher   123 Maple St., Hollywood      F       9/99/99
Mark Hamill     456 Oak Rd., Brentwood        M       8/88/88
Harrison Ford   789 Palm Dr., Beverly Hills   M       7/77/77

R - S
name            address                       gender  birthdate
Mark Hamill     456 Oak Rd., Brentwood        M       8/88/88
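
The three set operations on the relations R and S of this example can be tried out directly. The following is a minimal sketch, assuming that a relation is modeled as a Python set of tuples whose components follow one fixed attribute order (name, address, gender, birthdate); the variable names are ours and are not part of relational algebra.

```python
# A relation is modeled as a set of tuples with a fixed attribute order:
# (name, address, gender, birthdate). This modeling choice is an assumption.
R = {
    ("Carrie Fisher", "123 Maple St., Hollywood", "F", "9/99/99"),
    ("Mark Hamill", "456 Oak Rd., Brentwood", "M", "8/88/88"),
}
S = {
    ("Carrie Fisher", "123 Maple St., Hollywood", "F", "9/99/99"),
    ("Harrison Ford", "789 Palm Dr., Beverly Hills", "M", "7/77/77"),
}

union = R | S          # tuples in R or S or both, each appearing once
intersection = R & S   # tuples in both R and S
difference = R - S     # tuples in R but not in S

for row in sorted(union):
    print(row)
```

Because both relations use the same attribute order, the built-in set operators compare whole tuples, which is exactly the condition placed on R and S above.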

Projection The projection operator is used to produce from a relation R a new relation that has only some of R's columns. The value of expression π_{A1, A2, ..., An}(R) is a relation that has only the columns for attributes A1, A2, ..., An of R. The schema for the resulting value is the set of attributes {A1, A2, ..., An}, which we conventionally show in the order listed.

Example
Relation Movie
title           year  length  inColor  studioName  producerC#
Star Wars       1977  124     true     Fox         12345
Mighty Ducks    1991  104     true     Disney      67890
Wayne's World   1992  95      true     Paramount   99999

π_{title, year, length}(Movie)
title           year  length
Star Wars       1977  124
Mighty Ducks    1991  104
Wayne's World   1992  95

π_{inColor}(Movie)
inColor
true
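
The projection example can be reproduced with a few lines of code. This is only a sketch under our own assumptions: relations are sets of tuples, the schema is a separate tuple of attribute names, and the helper name project is hypothetical.

```python
def project(relation, schema, attrs):
    """Set projection: keep only the columns named in attrs, in that order."""
    positions = [schema.index(a) for a in attrs]
    # Building a set merges projected tuples that have become identical.
    return {tuple(row[i] for i in positions) for row in relation}

MOVIE_SCHEMA = ("title", "year", "length", "inColor", "studioName", "producerC#")
Movie = {
    ("Star Wars", 1977, 124, True, "Fox", 12345),
    ("Mighty Ducks", 1991, 104, True, "Disney", 67890),
    ("Wayne's World", 1992, 95, True, "Paramount", 99999),
}

short = project(Movie, MOVIE_SCHEMA, ("title", "year", "length"))
color = project(Movie, MOVIE_SCHEMA, ("inColor",))
print(sorted(short))
print(color)
```

Note that the projection onto inColor yields a single tuple, because all three movies have the same value and a set keeps only one copy.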

Selection The selection operator, applied to a relation R, produces a new relation with a subset of R's tuples. The tuples in the resulting relation are those that satisfy some condition C that involves the attributes of R. We denote this operation σ_C(R). The schema for the resulting relation is the same as R's schema, and we conventionally show the attributes in the same order as we use for R. C is a conditional expression of the type with which we are familiar from conventional programming languages; for example, conditional expressions follow the keyword if in programming languages such as C or Java. The only difference is that the operands in condition C are either constants or attributes of R. We apply C to each tuple t of R by substituting, for each attribute A appearing in condition C, the component of t for attribute A. If after substituting for each attribute of C the condition C is true, then t is one of the tuples that appear in the result of σ_C(R); otherwise t is not in the result.

Example
σ_{length≥100}(Movie)
title           year  length  inColor  studioName  producerC#
Star Wars       1977  124     true     Fox         12345
Mighty Ducks    1991  104     true     Disney      67890

σ_{length≥100 AND studioName='Fox'}(Movie)
title           year  length  inColor  studioName  producerC#
Star Wars       1977  124     true     Fox         12345
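
The way condition C is applied to each tuple can be sketched as follows. We assume a relation is a set of tuples and that the condition is a Python function receiving the tuple as a dictionary from attribute name to value; the helper name select is hypothetical.

```python
def select(relation, schema, condition):
    """Selection: keep the tuples for which condition(row) is true."""
    # Each tuple is turned into {attribute: value} so that the condition
    # can refer to attributes by name, as in sigma_C(R).
    return {t for t in relation if condition(dict(zip(schema, t)))}

MOVIE_SCHEMA = ("title", "year", "length", "inColor", "studioName", "producerC#")
Movie = {
    ("Star Wars", 1977, 124, True, "Fox", 12345),
    ("Mighty Ducks", 1991, 104, True, "Disney", 67890),
    ("Wayne's World", 1992, 95, True, "Paramount", 99999),
}

long_movies = select(Movie, MOVIE_SCHEMA, lambda row: row["length"] >= 100)
long_fox = select(
    Movie, MOVIE_SCHEMA,
    lambda row: row["length"] >= 100 and row["studioName"] == "Fox",
)
print(sorted(long_movies))
print(sorted(long_fox))
```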

Cartesian Product The Cartesian product (or cross-product, or just product) of two sets R and S is the set of pairs that can be formed by choosing the first element of the pair to be any element of R and the second any element of S. This product is denoted R × S. When R and S are relations, the product is essentially the same. However, since the members of R and S are tuples, usually consisting of more than one component, the result of pairing a tuple from R with a tuple from S is a longer tuple, with one component for each of the components of the constituent tuples. By convention, the components from R precede the components from S in the attribute order for the result. The relation schema for the resulting relation is the union of the schemas for R and S. However, if R and S should happen to have some attributes in common, then we need to invent new names for at least one of each pair of identical attributes. To disambiguate an attribute A that is in the schemas of both R and S, we use R.A for the attribute from R and S.A for the attribute from S.

Example
Relation R
A  B
1  2
3  4

Relation S
B  C   D
2  5   6
4  7   8
9  10  11

R × S
A  R.B  S.B  C   D
1  2    2    5   6
1  2    4    7   8
1  2    9    10  11
3  4    2    5   6
3  4    4    7   8
3  4    9    10  11
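
The product of this example, including the renaming of the shared attribute B to R.B and S.B, can be sketched in code. The function name product and its parameters are our own choices for illustration; relations are assumed to be sets of tuples with separately stored schemas.

```python
def product(r, r_schema, s, s_schema, r_name="R", s_name="S"):
    """Cartesian product of two relations given as sets of tuples."""
    common = set(r_schema) & set(s_schema)
    # Attributes that occur in both schemas get the relation name as prefix.
    schema = tuple(f"{r_name}.{a}" if a in common else a for a in r_schema) + \
             tuple(f"{s_name}.{a}" if a in common else a for a in s_schema)
    # Components from R precede the components from S.
    rows = {t + u for t in r for u in s}
    return rows, schema

R = {(1, 2), (3, 4)}                       # R(A, B)
S = {(2, 5, 6), (4, 7, 8), (9, 10, 11)}    # S(B, C, D)

rows, schema = product(R, ("A", "B"), S, ("B", "C", "D"))
print(schema)
for row in sorted(rows):
    print(row)
```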

Natural Joins More often than we want to take the product of two relations, we find a need to join them by pairing only those tuples that match in some way. The simplest sort of match is the natural join of two relations R and S, denoted R ⋈ S, in which we pair only those tuples from R and S that agree in whatever attributes are common to the schemas of R and S. More precisely, let A1, A2, ..., An be all the attributes that are in both the schema of R and the schema of S. Then a tuple r from R and a tuple s from S are successfully paired if and only if r and s agree on each of the attributes A1, A2, ..., An. If the tuples r and s are successfully paired in the join R ⋈ S, then the result of the pairing is a tuple, called the joined tuple, with one component for each of the attributes in the union of the schemas of R and S. The joined tuple agrees with tuple r in each attribute in the schema of R, and it agrees with s in each attribute in the schema of S. Since r and s are successfully paired, the joined tuple is able to agree with both these tuples on the attributes they have in common. The construction of the joined tuple is suggested on the next slide. Note also that this join operation is the same one that we used to recombine relations that had been projected onto two subsets of their attributes. There the motivation was to explain why BCNF decomposition made sense. We shall see another use for the natural join: combining two relations so that we can write a query that relates attributes of each.

[Figure: Joining tuples. A tuple r of relation R and a tuple s of relation S combine into the joined tuple.]

Example
Relation R
A  B
1  2
3  4

Relation S
B  C   D
2  5   6
4  7   8
9  10  11    (dangling tuple)

R ⋈ S
A  B  C  D
1  2  5  6
3  4  7  8

Natural join of relations
Relation U
A  B  C
1  2  3
6  7  8
9  7  8

Relation V
B  C  D
2  3  4
2  3  5
7  8  10

Relation U ⋈ V
A  B  C  D
1  2  3  4
1  2  3  5
6  7  8  10
9  7  8  10
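
The pairing rule for the natural join can be written out as a short nested loop. This is a sketch under our own assumptions (relations as sets of tuples, schemas as tuples of attribute names, a hypothetical helper natural_join); it is not meant as an efficient join algorithm.

```python
def natural_join(r, r_schema, s, s_schema):
    """Natural join: pair tuples that agree on all common attributes."""
    common = [a for a in r_schema if a in s_schema]
    s_only = [a for a in s_schema if a not in r_schema]
    schema = tuple(r_schema) + tuple(s_only)
    rows = set()
    for t in r:
        t_row = dict(zip(r_schema, t))
        for u in s:
            u_row = dict(zip(s_schema, u))
            if all(t_row[a] == u_row[a] for a in common):
                # Joined tuple: all of t, plus the attributes found only in S.
                rows.add(t + tuple(u_row[a] for a in s_only))
    return rows, schema

U = {(1, 2, 3), (6, 7, 8), (9, 7, 8)}      # U(A, B, C)
V = {(2, 3, 4), (2, 3, 5), (7, 8, 10)}     # V(B, C, D)

joined, schema = natural_join(U, ("A", "B", "C"), V, ("B", "C", "D"))
print(schema)
for row in sorted(joined):
    print(row)
```

A tuple that agrees with no tuple of the other relation, such as a dangling tuple, simply contributes nothing to the result.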

Theta-Joins
The natural join forces us to pair tuples using one specific condition. While this way, equating shared attributes, is the most common basis on which relations are joined, it is sometimes desirable to pair tuples from two relations on some other basis. For that purpose, we have a related notation called the theta-join. Historically, the "theta" refers to an arbitrary condition, which we shall represent by C rather than θ. The notation for a theta-join of relations R and S based on condition C is R ⋈_C S. The result of this operation is constructed as follows:
1. Take the product of R and S.
2. Select from the product only those tuples that satisfy the condition C.
As with the product operation, the schema for the result is the union of the schemas of R and S, with "R." or "S." prefixed to attributes if necessary to indicate from which schema the attribute came.

Example
U ⋈_{A<D} V
A  U.B  U.C  V.B  V.C  D
1  2    3    2    3    4
1  2    3    2    3    5
1  2    3    7    8    10
6  7    8    7    8    10
9  7    8    7    8    10

Theta-Joins Notice that the schema for the result on the previous slide consists of all six attributes, with U and V prefixed to their respective occurrences of attributes B and C to distinguish them. Thus, the theta-join contrasts with natural join, since in the latter common attributes are merged into one copy. Of course it makes sense to do so in the case of the natural join, since tuples don't pair unless they agree in their common attributes. In the case of a theta-join, there is no guarantee that compared attributes will agree in the result, since they may not be compared with =.

Example
U ⋈_{A<D AND U.B≠V.B} V
A  U.B  U.C  V.B  V.C  D
1  2    3    7    8    10
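
Both theta-join examples on relations U(A, B, C) and V(B, C, D) follow the two-step construction: take the product, then keep the tuples satisfying C. The sketch below assumes relations are sets of tuples and that the condition is a Python function over a dictionary of (possibly prefixed) attribute names; theta_join is a hypothetical helper name.

```python
def theta_join(r, r_schema, s, s_schema, condition, r_name="R", s_name="S"):
    """Theta-join: product of r and s, filtered by an arbitrary condition."""
    common = set(r_schema) & set(s_schema)
    schema = tuple(f"{r_name}.{a}" if a in common else a for a in r_schema) + \
             tuple(f"{s_name}.{a}" if a in common else a for a in s_schema)
    rows = {
        t + u
        for t in r
        for u in s
        if condition(dict(zip(schema, t + u)))
    }
    return rows, schema

U = {(1, 2, 3), (6, 7, 8), (9, 7, 8)}      # U(A, B, C)
V = {(2, 3, 4), (2, 3, 5), (7, 8, 10)}     # V(B, C, D)
U_SCHEMA, V_SCHEMA = ("A", "B", "C"), ("B", "C", "D")

first, schema = theta_join(
    U, U_SCHEMA, V, V_SCHEMA, lambda row: row["A"] < row["D"], "U", "V")
second, _ = theta_join(
    U, U_SCHEMA, V, V_SCHEMA,
    lambda row: row["A"] < row["D"] and row["U.B"] != row["V.B"], "U", "V")
print(schema)
print(sorted(first))
print(sorted(second))
```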

Combining Operations to Form Queries If all we could do was to write single operations on one or two relations as queries, then relational algebra would not be as useful as it is. However, relational algebra, like all algebras, allows us to form expressions of arbitrary complexity by applying operators either to given relations or to relations that are the result of applying one or more relational operators to relations. One can construct expressions of relational algebra by applying operators to subexpressions, using parentheses when necessary to indicate grouping of operands. It is also possible to represent expressions as expression trees; the latter often are easier for us to read, although they are less convenient as a machine-readable notation.

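What are the titles and years of movies made by Fox that are at least 100 minutes long?
Expression tree: the root π_{title, year} is applied to the intersection (∩) of the two subtrees σ_{length≥100}(Movie) and σ_{studioName='Fox'}(Movie).
As a linear expression: π_{title, year}(σ_{length≥100}(Movie) ∩ σ_{studioName='Fox'}(Movie))
An equivalent expression: π_{title, year}(σ_{length≥100 AND studioName='Fox'}(Movie))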

Find the stars of movies that are at least 100 minutes long
Movies1(title, year, length, filmType, studioName)
Movies2(title, year, starName)
π_{starName}(σ_{length≥100}(Movies1 ⋈ Movies2))

Equivalent Expressions and Query Optimization All database systems have a query-answering system, and many of them are based on a language that is similar in expressive power to relational algebra. Thus, the query asked by a user may have many equivalent expressions (expressions that produce the same answer, whenever they are given the same relations as operands), and some of these may be much more quickly evaluated. An important job of the query "optimizer" is to replace one expression of relational algebra by an equivalent expression that is more efficiently evaluated. Optimization of relational algebra expressions is covered extensively later.

Renaming In order to control the names of the attributes used for relations that are constructed by applying relational algebra operations, it is often convenient to use an operator that explicitly renames relations. We shall use the operator ρ_{S(A1, A2, ..., An)}(R) to rename a relation R. The resulting relation has exactly the same tuples as R, but the name of the relation is S. Moreover, the attributes of the result relation S are named A1, A2, ..., An, in order from the left. If we only want to change the name of the relation to S and leave the attributes as they are in R, we can just say ρ_S(R).

Example
Relation R
A  B
1  2
3  4

Relation S
B  C   D
2  5   6
4  7   8
9  10  11

R × ρ_{S(X, C, D)}(S)
A  B  X  C   D
1  2  2  5   6
1  2  4  7   8
1  2  9  10  11
3  4  2  5   6
3  4  4  7   8
3  4  9  10  11

Dependent and Independent Operations
Some of the operations that we have described can be expressed in terms of other relational algebra operations. For example, intersection can be expressed in terms of set difference:
R ∩ S = R - (R - S)
That is, if R and S are any two relations with the same schema, the intersection of R and S can be computed by first subtracting S from R to form a relation T consisting of all those tuples in R but not S. We then subtract T from R, leaving only those tuples of R that are also in S.
The two forms of join are also expressible in terms of other operations. Theta-join can be expressed by product and selection:
R ⋈_C S = σ_C(R × S)
The natural join of R and S can be expressed by starting with the product R × S. We then apply the selection operator with a condition C of the form
R.A1 = S.A1 AND R.A2 = S.A2 AND ... AND R.An = S.An
where A1, A2, ..., An are all the attributes appearing in the schemas of both R and S. Finally, we must project out one copy of each of the equated attributes. Let L be the list of attributes in the schema of R followed by those attributes in the schema of S that are not also in the schema of R. Then
R ⋈ S = π_L(σ_C(R × S))
The rewriting rules above are the only "redundancies" among the operations that we have introduced. The six remaining operations (union, difference, selection, projection, product, and renaming) form an independent set, none of which can be written in terms of the other five.

Example
U(A, B, C), V(B, C, D)
U ⋈ V = π_{A, U.B, U.C, D}(σ_{U.B=V.B AND U.C=V.C}(U × V))
U ⋈_{A<D AND U.B≠V.B} V = σ_{A<D AND U.B≠V.B}(U × V)
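
The rewriting rules can be checked on small relations. The sketch below, under the assumption that relations are Python sets of tuples with positional attributes, verifies the intersection rule on two sample sets of our own choosing and rebuilds the natural join of U and V from product, selection, and projection.

```python
# Intersection expressed through difference: R ∩ S = R - (R - S).
R = {(1, 2), (3, 4), (5, 6)}    # sample data chosen for illustration
S = {(3, 4), (5, 6), (7, 8)}
via_difference = R - (R - S)

# Natural join expressed through product, selection, and projection.
U = {(1, 2, 3), (6, 7, 8), (9, 7, 8)}      # U(A, B, C)
V = {(2, 3, 4), (2, 3, 5), (7, 8, 10)}     # V(B, C, D)

# U x V has schema (A, U.B, U.C, V.B, V.C, D), i.e. positions 0..5.
product = {u + v for u in U for v in V}
# sigma_{U.B = V.B AND U.C = V.C}
selected = {t for t in product if t[1] == t[3] and t[2] == t[4]}
# pi_{A, U.B, U.C, D}: keep one copy of each equated attribute.
natural = {(t[0], t[1], t[2], t[5]) for t in selected}

print(via_difference == (R & S))
print(sorted(natural))
```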

A Linear Notation for Algebraic Expressions We used trees to represent complex expressions of relational algebra. Another alternative is to invent names for the temporary relations that correspond to the interior nodes of the tree and write a sequence of assignments that create a value for each. The order of the assignments is flexible, as long as the children of a node N have had their values created before we attempt to create the value for N itself. The notation we shall use for assignment statements is: 1. A relation name and parenthesized list of attributes for that relation. The name Answer will be used conventionally for the result of the final step; i.e., the name of the relation at the root of the expression tree. 2. The assignment symbol :=. 3. Any algebraic expression on the right. We can choose to use only one operator per assignment, in which case each interior node of the tree gets its own assignment statement. However, it is also permissible to combine several algebraic operations in one right side, if it is convenient to do so.

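What are the titles and years of movies made by Fox that are at least 100 minutes long?
Expression tree: the root π_{title, year} is applied to the intersection (∩) of the two subtrees σ_{length≥100}(Movie) and σ_{studioName='Fox'}(Movie).
As a linear expression: π_{title, year}(σ_{length≥100}(Movie) ∩ σ_{studioName='Fox'}(Movie))
An equivalent expression: π_{title, year}(σ_{length≥100 AND studioName='Fox'}(Movie))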

Example
With one operator per assignment:
R(t, y, l, i, s, p) := σ_{length≥100}(Movie)
S(t, y, l, i, s, p) := σ_{studioName='Fox'}(Movie)
T(t, y, l, i, s, p) := R ∩ S
Answer(title, year) := π_{t, y}(T)
With the last two steps combined:
R(t, y, l, i, s, p) := σ_{length≥100}(Movie)
S(t, y, l, i, s, p) := σ_{studioName='Fox'}(Movie)
Answer(title, year) := π_{t, y}(R ∩ S)
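
The sequence of assignments maps naturally onto ordinary variable assignments in a programming language. The following sketch assumes the three sample Movie tuples used in the earlier examples and models each relation as a set of tuples with positional attributes (t, y, l, i, s, p).

```python
# Movie(title, year, length, inColor, studioName, producerC#)
Movie = {
    ("Star Wars", 1977, 124, True, "Fox", 12345),
    ("Mighty Ducks", 1991, 104, True, "Disney", 67890),
    ("Wayne's World", 1992, 95, True, "Paramount", 99999),
}

# Each statement creates the value of one interior node of the tree;
# a node is computed only after its children have been computed.
R = {m for m in Movie if m[2] >= 100}       # sigma_{length >= 100}(Movie)
S = {m for m in Movie if m[4] == "Fox"}     # sigma_{studioName = 'Fox'}(Movie)
T = R & S                                   # R ∩ S
Answer = {(m[0], m[1]) for m in T}          # pi_{t, y}(T)

print(Answer)
```

Since R and S do not depend on each other, their two assignments could be swapped without changing the result.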

Relational Operations on Bags While a set of tuples (i.e., a relation) is a simple, natural model of data as it might appear in a database, commercial database systems rarely, if ever, are based purely on sets. In some situations, relations as they appear in database systems are permitted to have duplicate tuples. Recall that if a "set" is allowed to have multiple occurrences of a member, then that set is called a bag or multiset. We shall consider relations that are bags rather than sets; that is, we shall allow the same tuple to appear more than once in a relation. When we refer to a "set," we mean a relation without duplicate tuples; a "bag" means a relation that may (or may not) have duplicate tuples.

Example
A bag of tuples with schema (A, B):
A  B
1  2
3  4
1  2
1  2

Why Bags? When we think about implementing relations efficiently, we can see several ways that allowing relations to be bags rather than sets can speed up operations on relations. We mentioned how allowing the result to be a bag could speed up the union of two relations. For another example, when we do a projection, allowing the resulting relation to be a bag (even when the original relation is a set) lets us work with each tuple independently. If we want a set as the result, we need to compare each projected tuple with all the other projected tuples, to make sure that each projection appears only once. However, if we can accept a bag as the result, then we simply project each tuple and add it to the result; no comparison with other projected tuples is necessary.

Example projection on bag and set
Relation R
A  B  C
1  2  5
3  4  6
1  2  7
1  2  8

π_{A,B}(R) as a bag
A  B
1  2
3  4
1  2
1  2

π_{A,B}(R) as a set
A  B
1  2
3  4
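
The difference in work between the two projections shows up even in a tiny sketch. We assume here that a bag is modeled as a Python list of tuples (order is irrelevant, duplicates are kept) and a set as a Python set.

```python
R = [(1, 2, 5), (3, 4, 6), (1, 2, 7), (1, 2, 8)]    # R(A, B, C)

# Bag projection: each tuple is handled independently, nothing is compared.
bag_projection = [(a, b) for (a, b, c) in R]

# Set projection: duplicates among the projected tuples must be detected
# and removed (here by hashing, hidden inside the set constructor).
set_projection = set(bag_projection)

print(bag_projection)
print(sorted(set_projection))
```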

Union, Intersection, and Difference of Bags When we take the union of two bags, we add the number of occurrences of each tuple. That is, if R is a bag in which the tuple t appears n times, and S is a bag in which the tuple t appears m times, then in the bag R ∪ S tuple t appears n + m times. Note that either n or m (or both) can be 0. When we intersect two bags R and S, in which tuple t appears n and m times, respectively, in R ∩ S tuple t appears min(n, m) times. When we compute R - S, the difference of bags R and S, tuple t appears in R - S max(0, n - m) times. That is, if t appears in R more times than it appears in S, then in R - S tuple t appears the number of times it appears in R, minus the number of times it appears in S. However, if t appears at least as many times in S as it appears in R, then t does not appear at all in R - S. Intuitively, occurrences of t in S each "cancel" one occurrence in R.

Example
Bag R
A  B
1  2
3  4
1  2
1  2

Bag S
A  B
1  2
3  4
3  4
5  6

R ∪ S
A  B
1  2
1  2
1  2
1  2
3  4
3  4
3  4
5  6

R ∩ S
A  B
1  2
3  4

R - S
A  B
1  2
1  2

S - R
A  B
3  4
5  6
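
The counting rules n + m, min(n, m), and max(0, n - m) correspond to the multiset operators of Python's collections.Counter, which maps each element to its number of occurrences. The sketch below models the bags R and S of this example that way; using Counter is our own modeling choice.

```python
from collections import Counter

# Each bag maps a tuple to the number of times it occurs.
R = Counter({(1, 2): 3, (3, 4): 1})
S = Counter({(1, 2): 1, (3, 4): 2, (5, 6): 1})

bag_union = R + S          # n + m occurrences of each tuple
bag_intersection = R & S   # min(n, m) occurrences
r_minus_s = R - S          # max(0, n - m) occurrences
s_minus_r = S - R          # max(0, m - n) occurrences

# elements() expands a Counter back into individual occurrences.
print(sorted(bag_union.elements()))
print(sorted(bag_intersection.elements()))
print(sorted(r_minus_s.elements()))
print(sorted(s_minus_r.elements()))
```

These Counter operators drop tuples whose resulting count is zero or negative, which matches the rule that such tuples do not appear in the result at all.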

Projection of Bags We have already illustrated the projection of bags. As we saw in that example, each tuple is processed independently during the projection. If R is the bag from the example and we compute the bag-projection π_{A, B}(R), then we get the corresponding bag. If the elimination of one or more attributes during the projection causes the same tuple to be created from several tuples, these duplicate tuples are not eliminated from the result of a bag-projection. Thus, the three tuples (1, 2, 5), (1, 2, 7), and (1, 2, 8) of the relation R each gave rise to the same tuple (1, 2) after projection onto attributes A and B. In the bag result, there are three occurrences of tuple (1, 2), while in the set-projection, this tuple appears only once.

Bag Operations on Sets Imagine we have two sets R and S. Every set may be thought of as a bag: the bag just happens to have at most one occurrence of any tuple. Suppose we compute R ∩ S, but we think of R and S as bags and use the bag intersection rule. Then we get the same result as we would get if we thought of R and S as sets. That is, thinking of R and S as bags, the number of times a tuple t is in R ∩ S is the minimum of the number of times it is in R and in S. Since R and S are sets, t can be in each only 0 or 1 times. Whether we use the bag or set intersection rules, we find that t can appear at most once in R ∩ S, and it appears once exactly when it is in both R and S. Similarly, if we use the bag difference rule to compute R - S or S - R we get exactly the same result as if we used the set rule. However, union behaves differently, depending on whether we think of R and S as sets or bags. If we use the bag rule to compute R ∪ S, then the result may not be a set, even if R and S are sets. In particular, if tuple t appears in both R and S, then t appears twice in R ∪ S if we use the bag rule for union. But if we use the set rule then t appears only once in R ∪ S. Thus, when taking unions, we must be especially careful to specify whether we are using the bag or set definition of union.

Selection on Bags To apply a selection to a bag, we apply the selection condition to each tuple independently. As always with bags, we do not eliminate duplicate tuples in the result.

Example

R:
A  B  C
1  2  5
3  4  6
1  2  7
1  2  7

σ C≥6 (R):
A  B  C
3  4  6
1  2  7
1  2  7
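Selection on a bag can be sketched the same way. This is a minimal illustration, assuming a bag is a collections.Counter and using a hypothetical helper bag_select; the data are the relation R of the example, with the condition C ≥ 6.

```python
from collections import Counter

# Bag R over (A, B, C); the tuple (1, 2, 7) occurs twice.
R = Counter({(1, 2, 5): 1, (3, 4, 6): 1, (1, 2, 7): 2})

def bag_select(bag, predicate):
    """Keep every tuple that satisfies the predicate, with its full multiplicity."""
    return Counter({row: count for row, count in bag.items() if predicate(row)})

selected = bag_select(R, lambda row: row[2] >= 6)  # condition C >= 6
```

Both copies of (1, 2, 7) survive, since the condition is applied to each tuple independently.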

Algebraic Laws for Bags An algebraic law is an equivalence between two expressions of relational algebra whose arguments are variables standing for relations. The equivalence asserts that no matter what relations we substitute for these variables, the two expressions define the same relation. An example of a well-known law is the commutative law for union: R ∪ S = S ∪ R. This law happens to hold whether we regard relation-variables R and S as standing for sets or bags. However, there are a number of other laws that hold when relational algebra is applied to sets but that do not hold when relations are interpreted as bags. A simple example of such a law is the distributive law of set difference over union, (R ∪ S) - T = (R - T) ∪ (S - T). This law holds for sets but not for bags. To see why it fails for bags, suppose R, S, and T each have one copy of tuple t. Then R ∪ S has two copies of t, so the expression on the left has one t, while R - T and S - T are both empty, so the expression on the right has none. As sets, neither would have t.
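The counterexample can be checked mechanically. The sketch below assumes bags are modelled as collections.Counter objects, where + is bag union and - is bag difference; the tuple t is a placeholder value.

```python
from collections import Counter

t = ("t",)
R = Counter({t: 1})
S = Counter({t: 1})
T = Counter({t: 1})

# Bag interpretation of (R ∪ S) - T and (R - T) ∪ (S - T).
bag_left = (R + S) - T
bag_right = (R - T) + (S - T)

# Set interpretation of the same two expressions.
set_left = (set(R) | set(S)) - set(T)
set_right = (set(R) - set(T)) | (set(S) - set(T))
```

Under the bag rules the left side keeps one copy of t and the right side keeps none; under the set rules both sides are empty.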

Product of Bags The rule for the Cartesian product of bags is the expected one. Each tuple of one relation is paired with each tuple of the other, regardless of whether it is a duplicate or not. As a result, if a tuple r appears in a relation R m times, and tuple s appears n times in relation S, then in the product R × S, the tuple rs will appear mn times.

Computing the product of bags

R:
A  B
1  2
1  2

S:
B  C
2  3
4  5
4  5

R × S:
A  R.B  S.B  C
1  2    2    3
1  2    2    3
1  2    4    5
1  2    4    5
1  2    4    5
1  2    4    5
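The mn rule for products can be sketched as follows, assuming bags are collections.Counter objects and using a hypothetical helper bag_product; R and S are the bags of the example.

```python
from collections import Counter

R = Counter({(1, 2): 2})             # (A, B): the tuple (1, 2) occurs twice
S = Counter({(2, 3): 1, (4, 5): 2})  # (B, C): the tuple (4, 5) occurs twice

def bag_product(r, s):
    """Pair every tuple of r with every tuple of s; multiplicities multiply."""
    result = Counter()
    for left, m in r.items():
        for right, n in s.items():
            result[left + right] += m * n
    return result

product = bag_product(R, S)  # columns: A, R.B, S.B, C
```

The tuple (1, 2, 4, 5) appears 2 × 2 = 4 times and (1, 2, 2, 3) appears 2 × 1 = 2 times, six tuples in all.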

Joins of Bags Joining bags also presents no surprises. We compare each tuple of one relation with each tuple of the other, decide whether or not this pair of tuples joins successfully, and if so we put the resulting tuple in the answer. When constructing the answer, we do not eliminate duplicate tuples.

Example

R:
A  B
1  2
1  2

S:
B  C
2  3
4  5
4  5

R ⋈ S:
A  B  C
1  2  3
1  2  3

R ⋈ R.B<S.B S:
A  R.B  S.B  C
1  2    4    5
1  2    4    5
1  2    4    5
1  2    4    5
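Both joins of the example can be sketched in a few lines. This is an illustration only: bags are collections.Counter objects, the helper names bag_theta_join and bag_natural_join are hypothetical, and the code is hard-wired to the two-column schemas R(A, B) and S(B, C).

```python
from collections import Counter

R = Counter({(1, 2): 2})             # (A, B)
S = Counter({(2, 3): 1, (4, 5): 2})  # (B, C)

def bag_theta_join(r, s, condition):
    """Pair tuples of r and s that satisfy condition(R.B, S.B); keep duplicates."""
    result = Counter()
    for (a, rb), m in r.items():
        for (sb, c), n in s.items():
            if condition(rb, sb):
                result[(a, rb, sb, c)] += m * n
    return result

def bag_natural_join(r, s):
    """Equate the shared attribute B and keep a single copy of it."""
    result = Counter()
    equal_b = bag_theta_join(r, s, lambda rb, sb: rb == sb)
    for (a, rb, _sb, c), count in equal_b.items():
        result[(a, rb, c)] += count
    return result

natural = bag_natural_join(R, S)
theta = bag_theta_join(R, S, lambda rb, sb: rb < sb)  # condition R.B < S.B
```

The natural join yields (1, 2, 3) twice, and the theta-join yields (1, 2, 4, 5) four times, with no duplicate elimination in either case.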

Extended Operators of Relational Algebra We presented the classical relational algebra, and then introduced the modifications necessary to treat relations as bags of tuples rather than sets. The ideas of these two sections serve as a foundation for most modern query languages. However, languages such as SQL have several other operations that have proved quite important in applications. Thus, a full treatment of relational operations must include a number of other operators, which we introduce in this section. The additions:
1. The duplicate-elimination operator δ turns a bag into a set by eliminating all but one copy of each tuple.
2. Aggregation operators, such as sums or averages, are not operations of relational algebra, but are used by the grouping operator (described next). Aggregation operators apply to attributes (columns) of a relation, e.g., the sum of a column produces the one number that is the sum of all the values in that column.
3. Grouping of tuples according to their value in one or more attributes has the effect of partitioning the tuples of a relation into "groups." Aggregation can then be applied to columns within each group, giving us the ability to express a number of queries that are impossible to express in the classical relational algebra. The grouping operator γ is an operator that combines the effect of grouping and aggregation.
4. The sorting operator τ turns a relation into a list of tuples, sorted according to one or more attributes. This operator should be used judiciously, because other relational-algebra operators apply to sets or bags, but never to lists. Thus, τ only makes sense as the final step of a series of operations.
5. Extended projection gives additional power to the operator π. In addition to projecting out some columns, in its generalized form π can perform computations involving the columns of its argument relation to produce new columns.
6. The outerjoin operator is a variant of the join that avoids losing dangling tuples. In the result of the outerjoin, dangling tuples are "padded" with the null value, so the dangling tuples can be represented in the output.

Duplicate Elimination Sometimes, we need an operator that converts a bag to a set. For that purpose, we use δ(R) to return the set consisting of one copy of every tuple that appears one or more times in relation R.

Example

R:
A  B
1  2
3  4
1  2
1  2

δ(R):
A  B
1  2
3  4

Aggregation Operators There are several operators that apply to sets or bags of atomic values. These operators are used to summarize or "aggregate" the values in one column of a relation, and thus are referred to as aggregation operators. The standard operators of this type are: 1. SUM produces the sum of a column with numerical values. 2. AVG produces the average of a column with numerical values. 3. MIN and MAX, applied to a column with numerical values, produce the smallest or largest value, respectively. When applied to a column with character-string values, they produce the lexicographically (alphabetically) first or last value, respectively. 4. COUNT produces the number of (not necessarily distinct) values in a column. Equivalently, COUNT applied to any attribute of a relation produces the number of tuples of that relation, including duplicates.

Example

A  B
1  2
3  4
1  2
1  2

1. SUM(B) = 2 + 4 + 2 + 2 = 10.
2. AVG(A) = (1 + 3 + 1 + 1) / 4 = 1.5.
3. MIN(A) = 1.
4. MAX(B) = 4.
5. COUNT(A) = 4.
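The five aggregations of the example can be reproduced with Python built-ins. This is a minimal sketch in which the bag is simply a list of (A, B) tuples, so duplicates are kept.

```python
# The relation of the example as a list of (A, B) tuples; duplicates are kept.
rows = [(1, 2), (3, 4), (1, 2), (1, 2)]

column_a = [a for a, _ in rows]
column_b = [b for _, b in rows]

sum_b = sum(column_b)                  # SUM(B)
avg_a = sum(column_a) / len(column_a)  # AVG(A)
min_a = min(column_a)                  # MIN(A)
max_b = max(column_b)                  # MAX(B)
count_a = len(column_a)                # COUNT(A) counts duplicates too
```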

Grouping Often we do not want simply the average or some other aggregation of an entire column. Rather, we need to consider the tuples of a relation in groups, corresponding to the value of one or more other columns, and we aggregate only within each group. As an example, suppose we wanted to compute the total number of minutes of movies produced by each studio, i.e., a relation such as:

studio   sumOfLengths
Disney   123456
MGM      54321
...      ...

Starting with the relation Movie(title, year, length, inColor, studioName, producerC#) from our example database schema, we must group the tuples according to their value for attribute studioName. We must then sum the length column within each group. That is, we imagine that the tuples of Movie are grouped as suggested, and we apply the aggregation SUM(length) to each group independently.

Figure: A relation with imaginary division into groups. The studioName column reads Disney, Disney, Disney, MGM, MGM, ..., and the tuples of each studio form one group.

The Grouping Operator We shall now introduce an operator that allows us to group a relation and/or aggregate some columns. If there is grouping, then the aggregation is within groups. The subscript used with the γ operator is a list L of elements, each of which is either: a) An attribute of the relation R to which the γ is applied: this attribute is one of the attributes by which R will be grouped. This element is said to be a grouping attribute. b) An aggregation operator applied to an attribute of the relation. To provide a name for the attribute corresponding to this aggregation in the result, an arrow and new name are appended to the aggregation. The underlying attribute is said to be an aggregated attribute. The relation returned by the expression γ L (R) is constructed as follows: 1. Partition the tuples of R into groups. Each group consists of all tuples having one particular assignment of values to the grouping attributes in the list L. If there are no grouping attributes, the entire relation R is one group. 2. For each group, produce one tuple consisting of: i. The grouping attributes' values for that group and ii. The aggregations, over all tuples of that group, for the aggregated attributes on list L.
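The two construction steps can be sketched in Python. This is an illustration under stated assumptions, not a reference implementation: tuples are represented as dicts, the function name gamma and its parameter layout are hypothetical, and the movie rows are invented sample data.

```python
from collections import defaultdict

def gamma(rows, grouping, aggregations):
    """Sketch of the grouping operator.

    rows: list of dicts, one per tuple.
    grouping: list of grouping attribute names.
    aggregations: list of (function, aggregated attribute, new name).
    """
    # Step 1: partition the tuples by their values on the grouping attributes.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in grouping)].append(row)
    # Step 2: produce one tuple per group.
    result = []
    for key, members in groups.items():
        out = dict(zip(grouping, key))
        for func, attribute, new_name in aggregations:
            out[new_name] = func([m[attribute] for m in members])
        result.append(out)
    return result

# Invented sample rows standing in for the Movie relation.
movies = [
    {"title": "M1", "studioName": "Disney", "length": 100},
    {"title": "M2", "studioName": "MGM", "length": 120},
    {"title": "M3", "studioName": "Disney", "length": 90},
]
totals = gamma(movies, ["studioName"], [(sum, "length", "sumOfLengths")])
```

With an empty grouping list every tuple falls under the same key, so a non-empty relation is treated as a single group, as step 1 requires.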

δ is a Special Case of γ Technically, the δ operator is redundant. If R(A 1, A 2,..., A n) is a relation, then δ(R) is equivalent to γ A 1, A 2,..., A n (R). That is, to eliminate duplicates, we group on all the attributes of the relation and do no aggregation. Then each group corresponds to a tuple that is found one or more times in R. Since the result of γ contains exactly one tuple from each group, the effect of this "grouping" is to eliminate duplicates. However, because δ is such a common and important operator, we shall continue to consider it separately when we study algebraic laws and algorithms for implementing the operators. One can also see γ as an extension of the projection operator on sets. That is, γ A 1, A 2,..., A n (R) is also the same as π A 1, A 2,..., A n (R) if R is a set. However, if R is a bag, then γ eliminates duplicates while π does not. For this reason, γ is often referred to as generalized projection.

Example

StarsIn(title, year, starName)

π starName, minYear (σ ctTitle≥3 (γ starName, MIN(year)->minYear, COUNT(title)->ctTitle (StarsIn)))

Expression tree, from the root down:
π starName, minYear
σ ctTitle≥3
γ starName, MIN(year)->minYear, COUNT(title)->ctTitle
StarsIn

Extending the Projection Operator Let us reconsider the projection operator π L (R). In the classical relational algebra, L is a list of (some of the) attributes of R. We extend the projection operator to allow it to compute with components of tuples as well as choose components. In extended projection, also denoted π L (R), projection lists can have the following kinds of elements: 1. A single attribute of R. 2. An expression x -> y, where x and y are names for attributes. The element x -> y in the list L asks that we take the attribute x of R and rename it y, i.e., the name of this attribute in the schema of the result relation is y. 3. An expression E -> z, where E is an expression involving attributes of R, constants, arithmetic operators, and string operators, and z is a new name for the attribute that results from the calculation implied by E. For example, a + b -> x as a list element represents the sum of the attributes a and b, renamed x. Element c || d -> e means concatenate the (presumably string-valued) attributes c and d and call the result e. The result of the projection is computed by considering each tuple of R in turn. We evaluate the list L by substituting the tuple's components for the corresponding attributes mentioned in L and applying any operators indicated by L to these values. The result is a relation whose schema is the names of the attributes on list L, with whatever renaming the list specifies. Each tuple of R yields one tuple of the result. Duplicate tuples in R surely yield duplicate tuples in the result, but the result can have duplicates even if R does not.

Example

R:
A  B  C
0  1  2
0  1  2
3  4  5

π A,B+C->X (R):
A  X
0  3
0  3
3  9

π B-A->X,C-B->Y (R):
X  Y
1  1
1  1
1  1
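Extended projection can be sketched as a per-tuple computation. In this illustration tuples are dicts, the helper extended_project is hypothetical, and each element of the projection list is given as an output name paired with a function of the input tuple.

```python
# The relation R of the example, one dict per tuple.
R = [
    {"A": 0, "B": 1, "C": 2},
    {"A": 0, "B": 1, "C": 2},
    {"A": 3, "B": 4, "C": 5},
]

def extended_project(rows, projection):
    """projection maps each output attribute name to a function of one input tuple."""
    return [{name: expr(row) for name, expr in projection.items()} for row in rows]

# Projection list: A, B+C -> X
first = extended_project(R, {"A": lambda r: r["A"], "X": lambda r: r["B"] + r["C"]})

# Projection list: B-A -> X, C-B -> Y
second = extended_project(R, {"X": lambda r: r["B"] - r["A"], "Y": lambda r: r["C"] - r["B"]})
```

The second projection turns three tuples, two of them distinct, into three copies of the same tuple, which shows how the result can gain duplicates that were not duplicates in R.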

The Sorting Operator There are several contexts in which we want to sort the tuples of a relation by one or more of its attributes. Often, when querying data, one wants the result relation to be sorted. For instance, in a query about all the movies in which Sean Connery appeared, we might wish to have the list sorted by title, so we could more easily find whether a certain movie was on the list. We shall also see how execution of queries by the DBMS is often made more efficient if we sort the relations first. The expression τ L (R), where R is a relation and L a list of some of R's attributes, is the relation R, but with the tuples of R sorted in the order indicated by L. If L is the list A 1, A 2,..., A n, then the tuples of R are sorted first by their value of attribute A 1. Ties are broken according to the value of A 2; tuples that agree on both A 1 and A 2 are ordered according to their value of A 3, and so on. Ties that remain after attribute A n is considered may be ordered arbitrarily.

The Sorting Operator If R is a relation with schema R(A, B, C), then τ C, B (R) orders the tuples of R by their value of C, and tuples with the same C-value are ordered by their B-value. Tuples that agree on both B and C may be ordered arbitrarily. The operator τ is anomalous, in that it is the only operator in our relational algebra whose result is a list of tuples, rather than a set or bag. Thus, in terms of expressing queries, it only makes sense to talk about τ as the final operator in an algebraic expression. If another operator of relational algebra is applied after τ, the result of the τ is treated as a set or bag, and no ordering of the tuples is implied. However, as we shall see, it sometimes speeds execution of the query if we sort intermediate results.
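Sorting by C and then by B corresponds to sorting on a composite key. The sketch below uses invented (A, B, C) tuples. Python's sorted is stable, so tuples that tie on both C and B keep their input order, which is one admissible choice among the arbitrary orders the definition allows.

```python
# Invented sample tuples over the schema (A, B, C).
R = [(1, 5, 2), (2, 3, 1), (3, 4, 2), (4, 3, 1)]

# Sort by C first, then by B to break ties on C.
sorted_rows = sorted(R, key=lambda row: (row[2], row[1]))
```

The result is a list, not a set or bag, which is why the sort belongs at the end of an expression.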

Outerjoins A property of the join operator is that it is possible for certain tuples to be "dangling"; that is, they fail to match any tuple of the other relation in the common attributes. Dangling tuples leave no trace in the result of the join, so the join may not represent the data of the original relations completely. In cases where this behavior is undesirable, a variation on the join, called "outerjoin," has been proposed and appears in various commercial systems. We shall consider the "natural" case first, where the join is on equated values of all attributes in common to the two relations. The outerjoin R ⋈° S is formed by starting with R ⋈ S, and adding any dangling tuples from R or S. The added tuples must be padded with a special null symbol, ⊥, in all the attributes that they do not possess but that appear in the join result. When we study SQL, we shall find that the null symbol ⊥ is written out as NULL. You may use NULL in place of ⊥ here if you wish.

Example

U:
A  B  C
1  2  3
4  5  6
7  8  9

V:
B  C  D
2  3  10
2  3  11
6  7  12

U ⋈° V:
A  B  C  D
1  2  3  10
1  2  3  11
4  5  6  ⊥
7  8  9  ⊥
⊥  6  7  12

Outerjoins There are many variants of the basic (natural) outerjoin idea. The left outerjoin R ⋈° L S is like the outerjoin, but only dangling tuples of the left argument R are padded with ⊥ and added to the result. The right outerjoin R ⋈° R S is like the outerjoin, but only the dangling tuples of the right argument S are padded with ⊥ and added to the result.

Example

U:
A  B  C
1  2  3
4  5  6
7  8  9

V:
B  C  D
2  3  10
2  3  11
6  7  12

U ⋈° L V:
A  B  C  D
1  2  3  10
1  2  3  11
4  5  6  ⊥
7  8  9  ⊥

Example

U:
A  B  C
1  2  3
4  5  6
7  8  9

V:
B  C  D
2  3  10
2  3  11
6  7  12

U ⋈° R V:
A  B  C  D
1  2  3  10
1  2  3  11
⊥  6  7  12
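The full, left, and right natural outerjoins of U and V can be sketched with one function. This is an illustration only: the function name natural_outerjoin and its keep_left and keep_right flags are hypothetical, the code is hard-wired to the schemas U(A, B, C) and V(B, C, D), and Python's None stands in for the null symbol.

```python
U = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]        # (A, B, C)
V = [(2, 3, 10), (2, 3, 11), (6, 7, 12)]     # (B, C, D)

def natural_outerjoin(u_rows, v_rows, keep_left=True, keep_right=True):
    """Join on the shared attributes B and C; pad dangling tuples with None."""
    result = []
    matched_v = set()
    for u in u_rows:
        joined = False
        for j, v in enumerate(v_rows):
            if u[1:] == v[:2]:               # same (B, C) values
                result.append(u + (v[2],))
                matched_v.add(j)
                joined = True
        if not joined and keep_left:
            result.append(u + (None,))       # dangling tuple of U, D is null
    if keep_right:
        for j, v in enumerate(v_rows):
            if j not in matched_v:
                result.append((None,) + v)   # dangling tuple of V, A is null
    return result

full = natural_outerjoin(U, V)
left = natural_outerjoin(U, V, keep_right=False)
right = natural_outerjoin(U, V, keep_left=False)
```

Setting both flags to False would give the ordinary natural join, in which the dangling tuples disappear.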

Outerjoins In addition, all three natural outerjoin operators have theta-join analogs, where first a theta-join is taken and then those tuples that failed to join with any tuple of the other relation, when the condition of the theta-join was applied, are padded with ⊥ and added to the result. We use ⋈° C to denote a theta-outerjoin with condition C. This operator can also be modified with L or R to indicate left- or right-outerjoin.

Example

U:
A  B  C
1  2  3
4  5  6
7  8  9

V:
B  C  D
2  3  10
2  3  11
6  7  12

U ⋈° A>V.C V:
A  U.B  U.C  V.B  V.C  D
4  5    6    2    3    10
4  5    6    2    3    11
7  8    9    2    3    10
7  8    9    2    3    11
1  2    3    ⊥    ⊥    ⊥
⊥  ⊥    ⊥    6    7    12

Constraints on Relations Relational algebra provides a means to express common constraints, such as the referential integrity constraints. In fact, we shall see that relational algebra offers us convenient ways to express a wide variety of other constraints. Even functional dependencies can be expressed in relational algebra, as we shall see. Constraints are quite important in database programming, and we shall cover how SQL database systems can enforce the same sorts of constraints as we can express in relational algebra.

Relational Algebra as a Constraint Language There are two ways in which we can use expressions of relational algebra to express constraints. 1. If R is an expression of relational algebra, then R = ∅ is a constraint that says "The value of R must be empty," or equivalently "There are no tuples in the result of R." 2. If R and S are expressions of relational algebra, then R ⊆ S is a constraint that says "Every tuple in the result of R must also be in the result of S." Of course the result of S may contain additional tuples not produced by R.

Relational Algebra as a Constraint Language These ways of expressing constraints are actually equivalent in what they can express, but sometimes one or the other is clearer or more succinct. That is, the constraint R ⊆ S could just as well have been written R - S = ∅. To see why, notice that if every tuple in R is also in S, then surely R - S is empty. Conversely, if R - S contains no tuples, then every tuple in R must be in S (or else it would be in R - S). On the other hand, a constraint of the first form, R = ∅, could just as well have been written R ⊆ ∅. Technically, ∅ is not an expression of relational algebra, but since there are expressions that evaluate to ∅, such as R - R, there is no harm in using ∅ as a relational-algebra expression. Note that these equivalences hold even if R and S are bags, provided we make the conventional interpretation of R ⊆ S: each tuple t appears in S at least as many times as it appears in R. In the following sections, we shall see how to express significant constraints in one of these two styles. As we shall see, it is the first style, equal-to-the-emptyset, that is most commonly used in SQL programming. However, as shown above, we are free to think in terms of set-containment if we wish and later convert our constraint to the equal-to-the-emptyset style.
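The equivalence of the two styles for bags can be spot-checked in Python. This is a sketch under the assumption that bags are collections.Counter objects; the helper names contained_in and difference_is_empty and the sample bags are hypothetical.

```python
from collections import Counter

def contained_in(r, s):
    """Bag containment: every tuple occurs in s at least as often as in r."""
    return all(s[t] >= count for t, count in r.items())

def difference_is_empty(r, s):
    """Bag difference r - s has no tuples (Counter subtraction drops counts <= 0)."""
    return not (r - s)

S = Counter({"t": 3, "u": 1})
R_ok = Counter({"t": 2})            # contained in S
R_bad = Counter({"t": 4, "u": 1})   # one copy of t too many

results = [(contained_in(r, S), difference_is_empty(r, S)) for r in (R_ok, R_bad)]
```

In both cases the two tests agree, matching the claim that containment and the empty-difference form express the same constraint.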

Referential Integrity Constraints A common kind of constraint, called "referential integrity", asserts that a value appearing in one context also appears in another, related context. We saw referential integrity as a matter of relationships "making sense." That is, if an object or entity A is related to object or entity B, then B must really exist. For example, in ODL terms, if a relationship in object A is represented physically by a pointer, then referential integrity of this relationship asserts that the pointer must not be null and must point to a genuine object. In the relational model, referential integrity constraints look somewhat different. If we have a value v in a tuple of one relation R, then because of our design intentions we may expect that v will appear in a particular component of some tuple of another relation S. An example will illustrate how referential integrity in the relational model can be expressed in relational algebra.

Example

Movie(title, year, length, inColor, studioName, producerC#)
MovieExec(name, address, cert#, netWorth)

π producerC# (Movie) ⊆ π cert# (MovieExec)

or, equivalently,

π producerC# (Movie) - π cert# (MovieExec) = ∅

Example

StarsIn(movieTitle, movieYear, starName)
Movie(title, year, length, inColor, studioName, producerC#)

π movieTitle, movieYear (StarsIn) ⊆ π title, year (Movie)
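A referential integrity constraint of this kind can be checked by computing the projection difference and testing it for emptiness. The sketch below uses invented rows, and the Movie relation is cut down to its (title, year) columns for brevity.

```python
# Invented sample data; Movie is reduced to (title, year) for this sketch.
stars_in = [
    ("Movie A", 1990, "Star X"),
    ("Movie B", 1995, "Star Y"),
]
movie = [("Movie A", 1990)]

# Projection of StarsIn onto (movieTitle, movieYear) and of Movie onto (title, year).
referenced = {(title, year) for title, year, _star in stars_in}
existing = set(movie)

# The constraint holds exactly when this difference is empty.
violations = referenced - existing
constraint_holds = not violations
```

Here ("Movie B", 1995) is referenced by StarsIn but missing from Movie, so the constraint is violated.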

Additional Constraint Examples The same constraint notation allows us to express far more than referential integrity. For example, we can express any functional dependency as an algebraic constraint, although the notation is more cumbersome than the FD notation introduced.

Example

Functional dependency name -> address on MovieStar(name, address, gender, birthdate):

MS1 is ρ MS1(name,address,gender,birthdate) (MovieStar)
MS2 is ρ MS2(name,address,gender,birthdate) (MovieStar)

σ MS1.name=MS2.name AND MS1.address≠MS2.address (MS1 × MS2) = ∅
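The same check can be sketched in Python by pairing the relation with itself and selecting the offending pairs. The rows below are invented sample data, and the two loop variables ms1 and ms2 play the role of the renamed copies MS1 and MS2.

```python
from itertools import product

# Invented sample rows: (name, address, gender, birthdate).
movie_star = [
    ("Star X", "1 Main St", "F", "1970-01-01"),
    ("Star Y", "2 Oak Ave", "M", "1980-02-02"),
    ("Star X", "9 Elm Rd", "F", "1970-01-01"),
]

# Selection over the product of the relation with itself:
# same name but different address.
violations = [
    (ms1, ms2)
    for ms1, ms2 in product(movie_star, repeat=2)
    if ms1[0] == ms2[0] and ms1[1] != ms2[1]
]
fd_holds = not violations
```

Each violating pair shows up twice, once in each order, because the product pairs every tuple with every other tuple.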

Additional Constraint Examples Some domain constraints can also be expressed in relational algebra. Often, a domain constraint simply requires that values for an attribute have a specific data type, such as integer or character string of length 30, so we may associate that domain with the attribute. However, often a domain constraint involves specific values that we require for an attribute. If the set of acceptable values can be expressed in the language of selection conditions, then this domain constraint can be expressed in the algebraic constraint language.

σ gender≠'F' AND gender≠'M' (MovieStar) = ∅

Additional Constraint Examples Finally, there are some constraints that fall into none of the categories outlined, nor are they functional or multivalued dependencies. The algebraic constraint language lets us express many new kinds of constraints. We offer one example here.

Example

MovieExec(name, address, cert#, netWorth)
Studio(name, address, presC#)

σ netWorth<10000000 (Studio ⋈ presC#=cert# MovieExec) = ∅

or, equivalently,

π presC# (Studio) ⊆ π cert# (σ netWorth≥10000000 (MovieExec))

Summary Classical Relational Algebra: This algebra underlies most query languages for the relational model. Its principal operators are union, intersection, difference, selection, projection, Cartesian product, natural join, theta-join, and renaming. Selection and Projection: The selection operator produces a result consisting of all tuples of the argument relation that satisfy the selection condition. Projection removes undesired columns from the argument relation to produce the result. Joins: We join two relations by comparing tuples, one from each relation. In a natural join, we splice together those pairs of tuples that agree on all attributes common to the two relations. In a theta-join, pairs of tuples are concatenated if they meet a selection condition associated with the theta-join. Relations as Bags: In commercial database systems, relations are actually bags, in which the same tuple is allowed to appear several times. The operations of relational algebra on sets can be extended to bags, but there are some algebraic laws that fail to hold. Extensions to Relational Algebra: To match the capabilities of SQL or other query languages, some operators not present in the classical relational algebra are needed. Sorting of a relation is an example, as is an extended projection, where computation on columns of a relation is supported. Grouping, aggregation, and outerjoins are also needed.

Summary Grouping and Aggregation: Aggregations summarize a column of a relation. Typical aggregation operators are sum, average, count, minimum, and maximum. The grouping operator allows us to partition the tuples of a relation according to their value(s) in one or more attributes before computing aggregation(s) for each group. Outerjoins: The outerjoin of two relations starts with a join of those relations. Then, dangling tuples (those that failed to join with any tuple) from either relation are padded with null values for the attributes belonging only to the other relation, and the padded tuples are included in the result. Constraints in Relational Algebra: Many common kinds of constraints can be expressed as the containment of one relational algebra expression in another, or as the equality of a relational algebra expression to the empty set. These constraints include functional dependencies and referential integrity constraints, for example.
