CIS 550, Fall 2001 Handout 2. SQL, Relational Calculus and Datalog

What we cannot compute with RA
Recursive queries. Given a relation Parent(Parent, Child), compute the Ancestor relation. (Can do this in Datalog.)
Aggregate operations. E.g. "The number of climbers who have climbed 'Last Tango'" or "The average age of climbers."
Computing with non-1NF relations, e.g. lists, arrays, multisets, nested relations.

Basic Query
    SELECT [DISTINCT] target-list
    FROM relation-list
    WHERE qualification
relation-list: A list of relation names (possibly with a range-variable after each name).
target-list: A list of attributes of relations in relation-list. * can be used to denote all attributes.
qualification: Comparisons (Attr op const or Attr1 op Attr2, where op is one of <, >, =, <=, >=, <>) combined using AND, OR and NOT.
DISTINCT: Optional keyword indicating that the answer should not contain duplicates. By default, duplicates are not eliminated!

Conceptual Evaluation Strategy
1. Compute the product of relation-list.
2. Discard tuples that fail qualification.
3. Project over attributes in target-list.
4. If DISTINCT, eliminate duplicates.
This is probably a very bad way of executing the query, and a good query optimizer will use all sorts of tricks to find efficient strategies to compute the same answer.
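
The four conceptual steps can be sketched directly in Python. This is only an illustration of the strategy, not how a real engine works; the function name `evaluate`, the dictionary-based rows and the two-row sample instance are assumptions made for the example.

```python
from itertools import product

def evaluate(relations, qualification, target, distinct=False):
    """Conceptual SELECT-FROM-WHERE evaluation (hypothetical helper).

    relations: dict mapping a range-variable name to a list of row dicts.
    qualification: function from {range_variable: row} to True/False.
    target: list of (range_variable, attribute) pairs to project on.
    """
    names = list(relations)
    answer = []
    # Step 1: product of the relation-list.
    for combo in product(*(relations[n] for n in names)):
        env = dict(zip(names, combo))
        # Step 2: discard tuples that fail the qualification.
        if not qualification(env):
            continue
        # Step 3: project over the target-list.
        row = tuple(env[r][a] for r, a in target)
        # Step 4: eliminate duplicates only if DISTINCT was requested.
        if distinct and row in answer:
            continue
        answer.append(row)
    return answer

# Hypothetical miniature instance, not the handout's data.
climbers = [{"CId": 1, "CName": "Ann"}, {"CId": 2, "CName": "Bob"}]
climbs = [{"CId": 1, "RId": 1}, {"CId": 1, "RId": 1}, {"CId": 2, "RId": 2}]

def climbed_route_1(env):
    return env["c"]["CId"] == env["b"]["CId"] and env["b"]["RId"] == 1

rels = {"c": climbers, "b": climbs}
with_duplicates = evaluate(rels, climbed_route_1, [("c", "CName")])
without_duplicates = evaluate(rels, climbed_route_1, [("c", "CName")], distinct=True)
```

With this instance the first call keeps both copies of Ann's row and the second keeps one, which mirrors the default-versus-DISTINCT behaviour described above.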

Sample tables
Routes:
    RId  RName        Grade  Rating  Height
    1    Last Tango   II     12      100
    2    Garden Path  I      2       60
    3    The Sluice   I      8       60
    4    Picnic       III    3       400
Climbers: Cid CName Skill Age
    123 Edmund EXP
    Arnold BEG
    Bridget EXP
    James MED
Climbs: CId RId Date Duration
    /10/ /08/ /08/ /07/ /07/94 3

Select/project queries
    SELECT * FROM Routes WHERE Height < 200;
    RID  RNAME        GRADE  RATING  HEIGHT
    1    Last Tango   II     12      100
    2    Garden Path  I      2       60
    3    The Sluice   I      8       60

    SELECT Grade, Height FROM Routes;
    GRADE  HEIGHT
    II     100
    I      60
    I      60
    III    400

Distinct
Note that SQL did not eliminate duplicates. We need to request this explicitly.
    SELECT DISTINCT Grade, Height FROM Routes;
    GRADE  HEIGHT
    I      60
    II     100
    III    400

String Matching
LIKE can be used in the WHERE clause. "_" denotes any single character, "%" denotes 0 or more characters.
    SELECT * FROM Routes WHERE RName LIKE 'L_%o';
    RId  RName       Grade  Rating  Height
    1    Last Tango  II     12      100

Arithmetic
"AS" can be used to label columns in the output; arithmetic can be used to compute results.
    SELECT DISTINCT Grade, Height/10 AS H FROM Routes;
    Grade  H
    II     10
    I      6
    III    40

Set operations -- union
    SELECT CId FROM Climbers WHERE Age < 40
    UNION
    SELECT CId FROM Climbs WHERE RId = 1;
    CID
Duplicates do not occur in the union.

The UNION ALL operator preserves duplicates
    SELECT CId FROM Climbers WHERE Age < 40
    UNION ALL
    SELECT CId FROM Climbs WHERE RId = 1;
    CID

What does "union compatible" mean?
    SELECT CId FROM Climbers UNION SELECT RId FROM Routes;      Ok
    SELECT CName FROM Climbers UNION SELECT RId FROM Routes;    Error

Intersection and difference
    SELECT CId FROM Climbers WHERE Age > 40
    INTERSECT
    SELECT CId FROM Climbs WHERE RId = 1;
    CID
    123

    SELECT CId FROM Climbers WHERE Age < 40
    MINUS
    SELECT CId FROM Climbs WHERE RId = 1;
    CID

Nested queries
We could also have written the previous queries as follows:
    SELECT CId FROM Climbers
    WHERE Age > 40 AND CId IN (SELECT CId FROM Climbs WHERE RId = 1);

    SELECT CId FROM Climbers
    WHERE Age < 40 AND CId NOT IN (SELECT CId FROM Climbs WHERE RId = 1);

Nested queries with correlation
    SELECT CId FROM Climbers c
    WHERE EXISTS (SELECT * FROM Climbs b WHERE c.CId = b.CId AND b.RId = 1);

    SELECT CId FROM Climbers c
    WHERE NOT EXISTS (SELECT * FROM Climbs b WHERE c.CId = b.CId);

    SELECT CId FROM Climbers c
    WHERE EXISTS UNIQUE (SELECT * FROM Climbs b WHERE c.CId = b.CId AND RId = 1);

More on set comparison ops
Besides IN, NOT IN, EXISTS, NOT EXISTS, UNIQUE and NOT UNIQUE we can also say op ANY and op ALL, where op is any of <, >, =, <=, >=, <>.
What does the following mean in English?
    SELECT CName, Age FROM Climbers
    WHERE Age >= ALL (SELECT Age FROM Climbers);
    CName   Age
    Edmund  80

Set comparison ops, cont.
What does the following mean in English?
    SELECT CName, Age FROM Climbers
    WHERE Age > ANY (SELECT Age FROM Climbers WHERE CName = 'Arnold');
    Cid CName Skill Age
    123 Edmund EXP
    Bridget EXP
    James MED 27

Using expressions for relation names
Consider the following query: "Find the names of climbers who have not climbed any route."
    SELECT CName
    FROM (SELECT CId FROM Climbers MINUS SELECT CId FROM Climbs) Temp, Climbers
    WHERE Temp.CId = Climbers.CId;
    CNAME
    James

Products
    SELECT * FROM Climbers, Climbs;
Note that the CID column name is duplicated in the output.
    CID CNAME SKILL AGE CID RID DAY DURATION
    123 Edmund EXP OCT
    Arnold BEG OCT
    Bridget EXP OCT
    James MED OCT
    Edmund EXP NOV
    Arnold BEG NOV

Conditional join
    SELECT * FROM Climbers, Climbs WHERE Climbers.CId = Climbs.CId;
    CID CNAME SKILL AGE CID RID DAY DURATION
    123 Edmund EXP OCT
    Edmund EXP NOV
    Bridget EXP DEC
    Arnold BEG AUG
    Bridget EXP JUN-94 3

Example 1
The names of climbers who have climbed route 1.
    SELECT CName FROM Climbers, Climbs
    WHERE Climbers.CId = Climbs.CId AND RId = 1;
    CNAME
    Edmund
    Bridget

Example 2
The names of climbers who have climbed the route named "Last Tango".
    SELECT CName FROM Climbers, Climbs, Routes
    WHERE Climbers.CId = Climbs.CId AND Routes.RId = Climbs.RId AND RName = 'Last Tango';
    CNAME
    Edmund
    Bridget

Example 3
The IDs of climbers who have climbed the same route twice. Note the use of aliases for relations.
    SELECT C1.CId FROM Climbs C1, Climbs C2
    WHERE C1.CId = C2.CId AND C1.RId = C2.RId
      AND (C1.Day <> C2.Day OR C1.Duration <> C2.Duration);
    CID
    313

Example 4
Recall: The names of climbers who have not climbed any route.
    SELECT CName
    FROM (SELECT CId FROM Climbers MINUS SELECT CId FROM Climbs) Temp, Climbers
    WHERE Temp.CId = Climbers.CId;
    CNAME
    James

Example 4, cont.
A simpler alternative:
    SELECT CName FROM Climbers
    WHERE CId NOT IN (SELECT CId FROM Climbs);
    CNAME
    James

Universal Quantification
The IDs of climbers who have climbed all routes. The subquery over Routes computes the routes not climbed by c1; c1 qualifies when that set is empty.
    SELECT CId FROM Climbs c1
    WHERE NOT EXISTS
      (SELECT RId FROM Routes r
       WHERE NOT EXISTS
         (SELECT * FROM Climbs c2
          WHERE c1.CId = c2.CId AND c2.RId = r.RId));
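
To see the double NOT EXISTS pattern at work, here is a small sketch using Python's built-in sqlite3 module. The two-route, two-climber instance is hypothetical, and DISTINCT is added (an assumption, not part of the slide's query) so that each qualifying climber is reported once rather than once per Climbs row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Routes (RId INTEGER)")
conn.execute("CREATE TABLE Climbs (CId INTEGER, RId INTEGER)")
# Hypothetical data: climber 10 has climbed both routes, climber 20 only route 1.
conn.executemany("INSERT INTO Routes VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO Climbs VALUES (?, ?)", [(10, 1), (10, 2), (20, 1)])

query = """
SELECT DISTINCT c1.CId FROM Climbs c1
WHERE NOT EXISTS
  (SELECT r.RId FROM Routes r          -- routes not climbed by c1
   WHERE NOT EXISTS
     (SELECT * FROM Climbs c2
      WHERE c1.CId = c2.CId AND c2.RId = r.RId))
"""
climbed_all = [row[0] for row in conn.execute(query)]
```

Climber 20 is excluded because the middle subquery returns route 2 for that climber, so the outer NOT EXISTS fails.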

Non-algebraic operations
SQL has a number of operations that cannot be expressed in relational algebra. The first is to express arithmetic in queries.
    SELECT RName, Rating * Height AS Difficulty FROM Routes;
    RNAME        DIFFICULTY
    Last Tango   1200
    Garden Path  120
    The Sluice   480
    Picnic       1200

Arithmetic, cont.
Arithmetic (and other expressions) cannot be used at the top level. E.g. 2+2 isn't an SQL query.
Question -- how would you get SQL to compute 2+2?

Counting
Surprisingly, the answer to both of these is the following:
    SELECT COUNT(RId) FROM Routes;
    SELECT COUNT(Grade) FROM Routes;
    COUNT(GRADE)
    4

Counting, cont.
To fix this, we use the keyword "DISTINCT":
    SELECT COUNT(DISTINCT Grade) FROM Routes;
    COUNT(GRADE)
    3
Can also use SUM, AVG, MIN and MAX.
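
The two counts can be checked with Python's built-in sqlite3 module. The Rating and Height values below are an assumption: they are inferred from the Rating * Height and DISTINCT Grade, Height outputs shown in this handout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Routes (RId INTEGER, RName TEXT, Grade TEXT, "
    "Rating INTEGER, Height INTEGER)"
)
# Rating/Height values are inferred from outputs elsewhere in the handout.
conn.executemany(
    "INSERT INTO Routes VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Last Tango", "II", 12, 100),
        (2, "Garden Path", "I", 2, 60),
        (3, "The Sluice", "I", 8, 60),
        (4, "Picnic", "III", 3, 400),
    ],
)

# COUNT(Grade) counts rows with a non-null Grade, duplicates included.
count_grades = conn.execute("SELECT COUNT(Grade) FROM Routes").fetchone()[0]
# COUNT(DISTINCT Grade) counts each grade value once.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT Grade) FROM Routes"
).fetchone()[0]
```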

Group by
So far, these aggregate operators have been applied to all qualifying tuples. Sometimes we want to apply them to each of several groups of tuples. For example: "Print the number of routes in each grade."

Group by
    SELECT Grade, COUNT(*) FROM Routes GROUP BY Grade;
    GRADE  COUNT(*)
    I      2
    II     1
    III    1
Note that only the columns that appear in the GROUP BY clause and "aggregated" columns can appear in the output. So the following would generate an error:
    SELECT Grade, RName, COUNT(*) FROM Routes GROUP BY Grade;

Group by ... having
HAVING is to GROUP BY as WHERE is to FROM. "HAVING" is used to restrict the groups that appear in the result.
    SELECT Height, AVG(Rating) FROM Routes GROUP BY Height HAVING Height < 300;
    HEIGHT AVG(RATING)

Another example
    SELECT Height, AVG(Rating) FROM Routes GROUP BY Height HAVING MAX(Rating) < 10;
    HEIGHT AVG(RATING)
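
A sketch of the grouping queries, run through Python's sqlite3 module. The Rating and Height values are again inferred from other outputs in this handout (an assumption), and ORDER BY is added only to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Routes (RId INTEGER, RName TEXT, Grade TEXT, "
    "Rating INTEGER, Height INTEGER)"
)
conn.executemany(
    "INSERT INTO Routes VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Last Tango", "II", 12, 100),
        (2, "Garden Path", "I", 2, 60),
        (3, "The Sluice", "I", 8, 60),
        (4, "Picnic", "III", 3, 400),
    ],
)

# One output row per grade.
per_grade = conn.execute(
    "SELECT Grade, COUNT(*) FROM Routes GROUP BY Grade ORDER BY Grade"
).fetchall()

# HAVING on a grouping column: keep only groups whose Height is below 300.
low_heights = conn.execute(
    "SELECT Height, AVG(Rating) FROM Routes "
    "GROUP BY Height HAVING Height < 300 ORDER BY Height"
).fetchall()

# HAVING on an aggregate: keep only groups whose highest rating is below 10.
easy_groups = conn.execute(
    "SELECT Height, AVG(Rating) FROM Routes "
    "GROUP BY Height HAVING MAX(Rating) < 10 ORDER BY Height"
).fetchall()
```

Under these assumed values the two routes of height 60 form one group with average rating 5.0, and the height-100 group is dropped from the last query because its maximum rating is 12.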

Null Values
The value of an attribute can be unknown (e.g., a rating has not been assigned) or inapplicable (e.g., no spouse).
–SQL provides a special value null for such situations.
The presence of null complicates many issues. E.g.:
–Special operators are needed to check if a value is/is not null.
–Is rating>8 true or false when rating is equal to null? What about the AND, OR and NOT connectives? We need a 3-valued logic (true, false and unknown).
–The meaning of constructs must be defined carefully (e.g., the WHERE clause eliminates rows that don't evaluate to true).
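
The effect of the third truth value on WHERE can be observed with a short sqlite3 sketch. The route named "Unrated" and its null rating are hypothetical data added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Routes (RName TEXT, Rating INTEGER)")
conn.executemany(
    "INSERT INTO Routes VALUES (?, ?)",
    [("Last Tango", 12), ("Garden Path", 2), ("Unrated", None)],  # None -> NULL
)

def names(sql):
    return [row[0] for row in conn.execute(sql)]

# The comparison with NULL is unknown, so the row fails both tests below.
high = names("SELECT RName FROM Routes WHERE Rating > 8")
not_high = names("SELECT RName FROM Routes WHERE NOT (Rating > 8)")
# A special operator is needed to find the null rating.
unrated = names("SELECT RName FROM Routes WHERE Rating IS NULL")
# The comparison itself evaluates to NULL, which Python sees as None.
comparison = conn.execute("SELECT NULL > 8").fetchone()[0]
```

The "Unrated" row appears in neither `high` nor `not_high`: both conditions evaluate to unknown for it, and WHERE keeps only rows that evaluate to true.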

Outer Join
A variant of the join that relies on null values:
    SELECT Climbers.CId, Climbs.RId
    FROM Climbers NATURAL LEFT OUTER JOIN Climbs
Tuples of Climbers that do not match any tuple in Climbs would normally be excluded from the result; the "left" outer join preserves them, with null values for the missing Climbs attributes.

Result of left outer join
    CId CName Skill Age RId Date Duration
    123 Edmund EXP /10/
    Edmund EXP /08/
    Arnold BEG /07/
    Bridget EXP /08/
    Bridget EXP /07/
    James MED 27
Null values can be disallowed in a column by declaring it NOT NULL in the table definition.

Summary
SQL is "relationally complete": all of the operators of the relational algebra can be simulated.
Additional features: string comparisons, set membership, arithmetic and grouping.

Views in SQL
A view is a query with a name that can be used in SELECT statements.
    CREATE VIEW ExpClimbers AS
      SELECT CId, CName, Age FROM Climbers WHERE Skill = 'EXP';

    SELECT CName FROM ExpClimbers WHERE Age < 50;
Note that ExpClimbers is not a stored relation!

Querying views
The system would perform the following translation:
    SELECT CName FROM ExpClimbers WHERE Age < 50;
is translated to
    SELECT CName FROM Climbers WHERE Skill = 'EXP' AND Age < 50;
This is done using the relational algebra "operator tree" representation of the query, and relational algebra equivalences.

The "how" of translation
The operator tree for
    SELECT CName FROM ExpClimbers WHERE Age < 50;
is expanded by replacing the ExpClimbers leaf with the operator tree of the view definition over Climbers.
[Figure: operator tree over ExpClimbers, expanded into an operator tree over Climbers]

Changing the database
How do we initialize the database? How do we update and modify the database state?
SQL supports an update language for insertions, deletions and modifications of tuples.
–INSERT INTO R(A1,…,An) VALUES (V1,…,Vn);
–DELETE FROM R WHERE <condition>;
–UPDATE R SET <new-value assignments> WHERE <condition>;

Tuple insertion
Recall our rock climbing database, with the following instance of Routes:
    RId  RName        Grade  Rating  Height
    1    Last Tango   II     12      100
    2    Garden Path  I      2       60
    3    The Sluice   I      8       60
    4    Picnic       III    3       400
To insert a new tuple into Routes:
    INSERT INTO Routes(RId, RName, Grade, Rating, Height)
    VALUES (5, 'Desperation', 'III', 12, 600);

Tuple insertion, cont.
Alternatively, we could omit the attributes since the order given matches the DDL for Routes:
    INSERT INTO Routes VALUES (5, 'Desperation', 'III', 12, 600);
    RId  RName        Grade  Rating  Height
    1    Last Tango   II     12      100
    2    Garden Path  I      2       60
    3    The Sluice   I      8       60
    4    Picnic       III    3       400
    5    Desperation  III    12      600

Set insertion
Suppose we had the following relation and wanted to add all the routes with rating > 8:
    HardClimbs:
    Route       Rating  FeetHigh
    SlimyClimb  9       200
    The Sluice  8       60

    INSERT INTO HardClimbs(Route, Rating, FeetHigh)
      SELECT DISTINCT RName, Rating, Height FROM Routes WHERE Rating > 8;

    Route       Rating  FeetHigh
    SlimyClimb  9       200
    The Sluice  8       60
    Last Tango  12      100

Deletion
Deletion is set-oriented: the only way to delete a single tuple is to specify its key.
Suppose we wanted to get rid of all tuples in HardClimbs that are in Routes:
    DELETE FROM HardClimbs
    WHERE Route IN (SELECT RName FROM Routes);
    HardClimbs:
    Route       Rating  FeetHigh
    SlimyClimb  9       200

Modifying tuples
Non-key values of a relation can be changed using UPDATE.
Suppose we want to increase the age of all experienced climbers by 1:
    UPDATE Climbers SET Age = Age + 1 WHERE Skill = 'EXP';
NOTE: SQL uses an "old-value" semantics. New values are calculated using the old state, not a partially modified state.

Old-value semantics
"Give a $1000 raise to every employee who earns less than their manager."
    Emp Manager Salary
    , ,500 3 21,000
Old-value semantics: employees 1 and 3 are given a raise.
Otherwise: employee 2 will get a raise if they are considered after employee 3 receives a raise!
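
The difference between the two semantics can be sketched in a few lines of Python. The salary figures and manager assignments below are hypothetical, chosen only so that they reproduce the behaviour the slide describes; the function names are likewise made up for the illustration.

```python
def raise_old_value(emps):
    """Decide every raise against the original state, then apply them all."""
    new = {e: dict(row) for e, row in emps.items()}
    for e, row in emps.items():
        # Both sides of the comparison read the OLD state.
        if row["salary"] < emps[row["manager"]]["salary"]:
            new[e]["salary"] = row["salary"] + 1000
    return new

def raise_in_place(emps, order):
    """Update one employee at a time, reading the partially modified state."""
    state = {e: dict(row) for e, row in emps.items()}
    for e in order:
        row = state[e]
        if row["salary"] < state[row["manager"]]["salary"]:
            row["salary"] += 1000
    return state

# Hypothetical instance: employees 1 and 2 report to 3, employee 3 reports to 2.
emps = {
    1: {"manager": 3, "salary": 20000},
    2: {"manager": 3, "salary": 21500},
    3: {"manager": 2, "salary": 21000},
}

old_value = raise_old_value(emps)
visit_3_first = raise_in_place(emps, [3, 2, 1])
visit_1_first = raise_in_place(emps, [1, 2, 3])
```

Under old-value semantics only employees 1 and 3 get a raise. The in-place variant gives the same answer when employee 2 is visited before 3, but also raises employee 2 when 3 is visited first, so its result depends on processing order.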

Modifying views
Since the view definition is not stored, the view "changes" as the relations in the FROM clause change.
We could also think of making changes to the view itself:
    INSERT INTO ExpClimbers VALUES (7, 'Jean', 48);
Unfortunately, this particular view definition is not updatable!

Modifying views, cont.
    INSERT INTO ExpClimbers VALUES (7, 'Jean', 48);
This would imply the following insertion, since we are not given a value for Skill:
    INSERT INTO Climbers VALUES (7, 'Jean', NULL, 48);
If the view were computed after this update, the new tuple would not appear because 'EXP' = NULL does not evaluate to true!

An updatable view
The problem with ExpClimbers was the projection, which eliminated an attribute used to create the view.
    CREATE VIEW OldClimbers AS
      SELECT * FROM Climbers WHERE Age > 40;

Deleting using views
We may also want to delete a tuple in the view:
    DELETE FROM ExpClimbers WHERE CName = 'Jeremy';
would translate to
    DELETE FROM Climbers WHERE CName = 'Jeremy';
What about views involving joins?
    CREATE VIEW ClimbInfo AS
      SELECT B.CId, B.CName, RId, Date, Duration
      FROM Climbers B, Climbs C
      WHERE C.CId = B.CId

When is a view updatable?
For a view to be updatable:
–it must involve a single relation R
–the WHERE clause must not involve R in a subquery
–the SELECT clause must include enough attributes that the missing ones can be filled with null or default values.

Schema modification
Requirements change over time, so it is useful to be able to add/delete columns, drop tables and drop views:
–DROP TABLE Climbers;
–DROP VIEW ExpClimbers;
–ALTER TABLE Climbs ADD Weather CHAR(50);
–ALTER TABLE Routes DROP Grade;
Problem: Must validate changes against legacy applications and code! Views can be useful here.

Summary
Views are useful for frequently executed queries and as a layer to shield applications from changes in the schema.
SQL has an update language that allows set-oriented updates.
Updates (insertions, deletions and modifications) change the database state.

Relational Calculus
First-order logic (FOL) can also be thought of as a query language, and can be used in two ways:
–Tuple relational calculus
–Domain relational calculus
The difference is the level at which variables are used: for attributes (domains) or for tuples.
The calculus is non-procedural (declarative) as compared to the algebra.

Domain relational calculus
Queries have the form {<x1, x2, …, xn> | p}, where x1, x2, …, xn are domain variables and p is a predicate which may mention the variables x1, x2, …, xn.
Example: simple projection
    {<RN,H> | ∃RI,G,R. <RI,RN,G,R,H> ∈ Routes}
Example: selection and projection
    {<RN,H> | ∃RI,G,R. <RI,RN,G,R,H> ∈ Routes ∧ G > 5.5}

DRC examples, cont.
Join:
    {<I,R> | ∃RI,RN,G,H,RI’,Da,Du. <RI,RN,G,R,H> ∈ Routes ∧ <I,RI’,Da,Du> ∈ Climbs ∧ RI = RI’}
We could also have written the above as:
    {<I,R> | ∃RI,RN,G,H,Da,Du. <RI,RN,G,R,H> ∈ Routes ∧ <I,RI,Da,Du> ∈ Climbs}

Predicate Logic - a quick review
The syntax of predicate logic starts with variables, constants and predicates that can be built using a collection of boolean-valued operators (boolean expressions).
Examples: 1=2, x ≤ y, prime(x), contains(t,"Joe").
Precisely what operations are available depends on the domain and on the query language. For now we will assume the following boolean expressions:
–<X1,…,Xn> ∈ Rel, X op Y, X op constant, or constant op X, where op is =, ≠, <, >, ≤, ≥ and X, Y, … are domain variables

Predicate Logic, cont.
Starting with these basic predicates (also called atomic), we can build up new predicates by the following rules:
–Logical connectives: If p and q are predicates, then so are p ∧ q, p ∨ q, ¬p, and p → q
    (x>2) ∧ (x<4)
    (x>2) → (x>0)
–Existential quantification: If p is a predicate, then so is ∃x.p
    ∃x. (x>2) ∧ (x<4)
–Universal quantification: If p is a predicate, then so is ∀x.p
    ∀x. x>2
    ∀x. ∃y. y>x

Logical Equivalences
There are two logical equivalences that will be heavily used:
–p → q ≡ ¬p ∨ q (Whenever p is true, q must also be true.)
–∀x. p(x) ≡ ¬∃x. ¬p(x) (p is true for all x)
The second will be especially important when we study SQL.

Free and bound variables
A variable v is bound in a predicate p when p is of the form ∀v… or ∃v…
A variable occurs free in p if it occurs in a position where it is not bound by an enclosing ∀ or ∃.
Examples:
–x is free in x>2
–x is bound in ∃x. x>y
–x is free in (x>17) ∧ (∃x. x>2)
Note that there are two occurrences of x in the last example; only the first is free.

Renaming variables
When a variable is bound, one can replace it with some other variable without altering the meaning of the expression, provided there are no name clashes.
Example: ∃x. x>2 is equivalent to ∃y. y>2

Some queries…
Try the following examples:
–The names and ages of climbers
–The names and ages of climbers who have climbed route 214
–The names of climbers who have climbed "Last Tango"
–The names of climbers who have climbed all routes with rating greater than 5.5
–The names of climbers who have climbed the same route twice

Safety
There is a problem with what we have done so far. How should we treat a query like:
    {<I,N,S,A> | ¬(<I,N,S,A> ∈ Climbers)}
This presumably means the set of all tuples (of the appropriate type) that are not climbers, which is presumably an infinite set.
A query is safe if, no matter how we instantiate the relations, it always produces a finite answer.
Unfortunately, safety (a semantic condition) is undecidable. That is, there is no program which can look at the syntax of a query and decide if it is safe.
A more restrictive syntactic condition (domain independence) can be used.

Translating from RA to DRC
Recall that the relational algebra consists of σ, π, ∪, ×, -. We need to work our way through the structure of an RA expression, translating each possible form.
Let TR[e] be the translation of RA expression e into DRC.
Relation names: For the RA expression R, the DRC expression is
    {<x1,…,xn> | <x1,…,xn> ∈ R}

Selection
Suppose the RA expression is σ_C(e’), where e’ is another RA expression with
    TR[e’] = {<x1,…,xn> | p}
Then the translation of σ_C(e’) is
    {<x1,…,xn> | p ∧ C’}
where C’ is the condition obtained from C by replacing each attribute with the corresponding variable.
Example: TR[σ_{#1=#2 ∧ #4>2.5}(R)] (where R has arity 4) is
    {<x1,x2,x3,x4> | <x1,x2,x3,x4> ∈ R ∧ x1=x2 ∧ x4>2.5}

Projection
If TR[e] = {<x1,…,xn> | p} then
    TR[π_{i1,i2,…,im}(e)] = {<x_i1,x_i2,…,x_im> | ∃x_j1,x_j2,…,x_jk. p}
where x_j1, x_j2, …, x_jk are the variables in x1, x2, …, xn that are not in x_i1, x_i2, …, x_im.
Example: With R as before,
    TR[π_{#1,#3}(R)] = {<x1,x3> | ∃x2,x4. <x1,x2,x3,x4> ∈ R}

Union
We know that R and S in R ∪ S must be union compatible, so they must have the same arity.
Therefore we can assume that for e1 ∪ e2, where e1, e2 are algebra expressions, TR[e1] = {<x1,…,xn> | p} and TR[e2] = {<y1,…,yn> | q}.
Relabel the variables in the second so that TR[e2] = {<x1,…,xn> | q’}. This may involve relabeling bound variables in q to avoid clashes.
Then TR[e1 ∪ e2] = {<x1,…,xn> | p ∨ q’}.
Example: TR[R ∪ S] = {<x1,x2,x3,x4> | <x1,x2,x3,x4> ∈ R ∨ <x1,x2,x3,x4> ∈ S}

Other binary operators
Difference: The same conditions hold as for union. So if TR[e1] = {<x1,…,xn> | p} and TR[e2] = {<x1,…,xn> | q}, then
    TR[e1 - e2] = {<x1,…,xn> | p ∧ ¬q}
Product: If TR[e1] = {<x1,…,xn> | p} and TR[e2] = {<y1,…,ym> | q}, then
    TR[e1 × e2] = {<x1,…,xn,y1,…,ym> | p ∧ q}
Example: TR[R × S] = {<x1,…,xn,y1,…,ym> | <x1,…,xn> ∈ R ∧ <y1,…,ym> ∈ S}

Summary
We've seen how to translate relational algebra into (domain) relational calculus.
There are various syntactic restrictions for guaranteeing the safety of a DRC query. From any of these we can translate back into relational algebra.
It was this correspondence between an (implementable and optimizable) algebra and first-order logic that was responsible for the initial development of relational databases: a prime example of some theory leading to highly successful practical developments!

What we cannot compute with relational calculus/algebra
Aggregate operations, e.g. "The number of climbers who have climbed 'Last Tango'" or "The average age of climbers." These are possible in SQL.
Recursive queries. Given a relation Parent(Parent, Child), compute the ancestor relation. This appears to call for an arbitrary number of joins. It is known that it cannot be expressed in first-order logic, hence it cannot be expressed in relational algebra.

What we cannot compute with relational algebra, cont.
Computing with complex structures that are not (1NF) relations, e.g. lists, arrays, multisets.
Of course, we can always compute such things if we can "talk to" a database from a full-blown (Turing complete) programming language, and we'll see how to do this later. However, communicating with a database in this way may well be inefficient, and adding computational power to a query language remains an important research topic.

Datalog
The general idea behind Datalog is to use Horn clauses -- "if-then" rules -- as a query language for relational databases.
Relations are represented by predicates, e.g. Climbers, Climbs and Routes are interpreted as predicates with fixed arity.
Arguments are interpreted positionally; e.g. Climbers(X,"Bridget","EXP","33").
The arguments can be constants (e.g. "Bridget", "EXP", and "33") or variables (e.g. X).
–We will use upper case for variables and lower case for constants.

Truth values
A predicate is ground if all of its arguments are constants.
Ground predicates have truth values, which mirror whether or not the "tuple" is in the relation.
    Climbers: CId Cname Skill Age
    123 edmund exp
    arnold beg
    bridget exp
    james med 27
Climbers(313,bridget,exp,33) is true.
Climbers(518,jeremy,exp,17) is false.
Predicates can also be negated: NOT Climbers(518,jeremy,exp,17) is true.

"Arithmetic" Predicates
We will want to mirror conditions, and will use the predicates. I.e. is true
Note that in contrast to "relational" predicates, arithmetic predicates are infinite!

Datalog Rules
A rule has the form p ← q, where p is a relational predicate called the head and q is a conjunction of predicates (subgoals) called the body.
    head ← body
Example:
    EXPClimbers(I,N,A) ← Climbers(I,N,S,A) AND S=exp
When a variable is not used, it can be replaced by "_" (anonymous variable):
    EXPClimbers(N) ← Climbers(_,N,exp,_)

Some examples…
The names of climbers older than 32.
    OLD(N) ← Climbers(I,N,S,A) AND A>32
The names of climbers who have climbed route 1.
    Route1(N) ← Climbers(I,N,_,_) AND Climbs(I,1,_,_)
The names of climbers with age less than 40 who have climbed a route with rating higher than 5.
    Rating5(N) ← Climbers(I,N,_,A) AND Climbs(I,R,_,_) AND Routes(R,_,_,Ra,_) AND Ra>5 AND A<40
Note the positional interpretation of attributes!

Safety
A rule is safe if every variable occurs at least once in a positive relational predicate in the body.
Some unsafe rules:
    Likes(X,Y) ← Starved(X)
    Sedate(X) ← NOT Climbers(_,X,_,_)
Some safe rules (and all the ones we have seen so far):
    Likes(X,Y) ← Starved(X) AND Food(Y)
    Sedate(X) ← Person(X) AND NOT Climbers(_,X,_,_)
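
The safety condition is purely syntactic, so it can be checked mechanically. Below is a minimal Python sketch; the rule encoding (a head argument list plus body triples of predicate name, arguments and a negation flag) is an assumption made for illustration, and the anonymous variable "_" is ignored by the check, which is what makes the Sedate rule above count as safe.

```python
# Comparison predicates are not relational, so they never make a variable safe.
ARITHMETIC = {"=", "<>", "<", ">", "<=", ">="}

def is_variable(term):
    # Handout convention: upper case for variables, lower case for constants.
    # "_" has no upper-case letter, so it is skipped by this test.
    return isinstance(term, str) and term[:1].isupper()

def is_safe(head_args, body):
    """body is a list of (predicate, args, negated) triples (hypothetical encoding)."""
    positive = set()
    used = {a for a in head_args if is_variable(a)}
    for pred, args, negated in body:
        variables = {a for a in args if is_variable(a)}
        used |= variables
        if not negated and pred not in ARITHMETIC:
            positive |= variables
    # Safe iff every variable occurs in some positive relational subgoal.
    return used <= positive

unsafe_likes = is_safe(["X", "Y"], [("Starved", ["X"], False)])
unsafe_sedate = is_safe(["X"], [("Climbers", ["_", "X", "_", "_"], True)])
safe_likes = is_safe(["X", "Y"], [("Starved", ["X"], False), ("Food", ["Y"], False)])
safe_sedate = is_safe(
    ["X"], [("Person", ["X"], False), ("Climbers", ["_", "X", "_", "_"], True)]
)
old_rule = is_safe(["N"], [("Climbers", ["I", "N", "S", "A"], False), (">", ["A", 32], False)])
```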

Datalog Query
A query is a collection of one or more rules.
A rule with an empty body is called a fact (positive ground relational predicate).
    Student(123,j.smith,compsci)
    Student(456,k.tappet,french)
    Offers(cookery,baking)
    Offers(compsci,compilers)
    Enroll(123,baking)
    Enroll(012,compilers)
    InterestedIn(X,S) ← Student(X,Y,S)
    InterestedIn(X,S) ← Enroll(X,Z) AND Offers(S,Z)

The query in relational algebra
The previous query corresponds to the following relational algebra expression:
What would you expect the output of the query to be?

Intensional versus Extensional Predicates
Extensional predicates are those whose relations are stored in the db; intensional predicates are those which are computed by applying one or more rules.
–Student, Offers, and Enroll are extensional
–InterestedIn is intensional
Extensional predicates can never appear in the head of a rule.

Another example
    Parent(mary,jane)
    Parent(jane,fred)
    Parent(ed,bob)
    Parent(bob,fred)
    Parent(fred,jill)
    Ancestor(X,Y) ← Parent(X,Y)
    Ancestor(X,Y) ← Parent(X,Z) AND Ancestor(Z,Y)
Can this be translated to relational algebra?
What do you expect the output of the query to be?
Which are EDB predicates and which are IDB?

Meaning of Datalog rules
Consider every possible assignment of values to variables. For every such assignment which makes all the subgoals true, the tuple corresponding to the head is true and added to the result.
    Offers(compsci,compilers)
    Enroll(012,compilers)
    InterestedIn(X,S) ← Enroll(X,Z) AND Offers(S,Z)
Example: X=012, Z=compilers, S=compsci
So InterestedIn(012,compsci) is added to the result.

Another way to define meaning...
The "assignment" method of defining meaning considers "meaningless" variable assignments. For example: X=compilers, Z=012, S=f.dunham
    InterestedIn(X,S) ← Enroll(X,Z) AND Offers(S,Z)
Another method is to consider the set of tuples in each non-negated relational subgoal, and look at "consistent" variable assignments. If all subgoals are true (negated as well as arithmetic), then the tuple in the head is added to the result.
This will suggest an implementation using RA!

An interesting "incorrectly" written query
It is easy to write queries that do not express your intention. E.g.
    Single(X) ← Person(X) AND NOT Married(X,Y)
What does this query mean in English?
If the intention was to get all people who are not married, how should the query have been written?
The query also isn't safe!

RA versus Datalog
The Ancestor example is called recursive because the definition of ancestor depends on itself (directly). This cannot be simulated in RA, and we will need to add a fixpoint operator to the algebra to simulate it.
If subgoals are not allowed to be negated, we cannot emulate set difference in Datalog.
However, if subgoals can be negated we can simulate any RA expression in Datalog.

Simulating RA in Datalog
Intersection: simulate by a rule with each relation as a subgoal. E.g. recall Climbers and Hikers from a previous lecture. Climbers ∩ Hikers would be written as
    CandH(I,N,S,A) ← Climbers(I,N,S,A) AND Hikers(I,N,S,A)
Difference: simulate by a rule with each relation as a subgoal, with the second subgoal negated. So write Climbers - Hikers as
    CnotH(I,N,S,A) ← Climbers(I,N,S,A) AND NOT Hikers(I,N,S,A)

Simulating RA, contd.
Union: simulate by two rules, each of which has a body consisting of one of the relations as its sole subgoal. So Climbers ∪ Hikers would be written as
    CorH(I,N,S,A) ← Climbers(I,N,S,A)
    CorH(I,N,S,A) ← Hikers(I,N,S,A)
Projection: simulate by one rule, the head of which uses variables corresponding to the attributes being projected on. So π_{CName,Age}(Climbers) would be written as
    Result(N,A) ← Climbers(_,N,_,A)

Simulating Selection
If the condition is a conjunction of arithmetic atoms, this is easy: append each conjunct as a subgoal. So σ_{CName=bridget ∧ Age>30}(Climbers) becomes
    Result(I,N,S,A) ← Climbers(I,N,S,A) AND N=bridget AND A>30
OR can be simulated using union; recall the equivalence σ_{C1 ∨ C2}(R) = σ_{C1}(R) ∪ σ_{C2}(R).

Simulating Selection, contd.
Now, recall from logic that any expression involving AND, OR and NOT can be put into disjunctive normal form: an OR of clauses, each of which is the AND of (possibly negated) comparisons.

Simulating Product and Join
The product of two relations is expressed by a single rule with both relations as subgoals; all the variables in the relations appear in the head:
    R(A,B) ← S(A) AND T(B)
A join is just an equality selection and projection on a product; the equal terms can be expressed by reusing variables:
    R(A,B,C) ← S(A,B) AND T(D,C) AND B=D
or
    R(A,B,C) ← S(A,B) AND T(B,C)

Simulating multiple operation RA expressions
Create the "operator tree".
Create an IDB predicate for each interior node, and write the corresponding rule. The IDB predicate corresponding to the root is the result.
[Figure: operator tree with leaves Climbers and Climbs]

Datalog: Depends-On Graph
Nodes correspond to relational predicates; there is an edge from A to B if B appears as a subgoal (positive or negative) in a rule with head A. The edge is annotated with "-" for negated subgoals:
    InterestedIn(X,S) ← Student(X,Y,S)
    InterestedIn(X,S) ← Enroll(X,Z) AND Offers(S,Z)
[Figure: Depends-On graph with edges from InterestedIn to Student, Enroll and Offers]

Cyclic Depends-On Graphs
A cyclic Depends-On graph indicates a recursive query. Recall the ancestor example:
    Ancestor(X,Y) ← Parent(X,Y)
    Ancestor(X,Y) ← Parent(X,Z) AND Ancestor(Z,Y)
[Figure: Depends-On graph with an edge from Ancestor to Parent and a self-loop on Ancestor]

Evaluating Datalog in RA
Assume for now non-recursive Datalog programs with no negated predicates.
Evaluation of IDB predicates proceeds "bottom-up" from the leaves of the Depends-On graph so that subgoals are completely evaluated by the time they are used. (Note that the EDB predicates must be leaves in this graph!)

Evaluating Datalog in RA, cont.
Procedure to evaluate one rule (no negation):
–Take the product of the relational (non-arithmetic) subgoals
–Form a selection of the product with a condition that equates positions with the same variable and captures all arithmetic predicates
–Project over the variables appearing in the head
For each IDB predicate R, take the union of the expressions of the rules with head R.
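
The one-rule procedure can be sketched in Python for rules without negation or arithmetic subgoals. Instead of materializing the full product and then selecting, this sketch extends variable bindings one subgoal at a time, which yields the same tuples as product, selection on repeated variables, and projection on the head. The function names and the rule encoding are assumptions for illustration; the facts are those of the InterestedIn example, with Offers(S,Z) read as department and course.

```python
def is_var(term):
    # Handout convention: upper case for variables, lower case for constants.
    return isinstance(term, str) and term[:1].isupper()

def extend(binding, args, row):
    """Return binding extended so that args matches row, or None on a clash."""
    new = dict(binding)
    for arg, value in zip(args, row):
        if arg == "_":
            continue                      # anonymous variable matches anything
        if is_var(arg):
            if new.setdefault(arg, value) != value:
                return None               # same variable, different values
        elif arg != value:
            return None                   # constant does not match
    return new

def eval_rule(head_args, body, db):
    """Evaluate one rule: join the subgoals, then project on the head."""
    bindings = [{}]
    for pred, args in body:
        joined = []
        for b in bindings:
            for row in db[pred]:
                b2 = extend(b, args, row)
                if b2 is not None:
                    joined.append(b2)
        bindings = joined
    return {tuple(b[a] for a in head_args) for b in bindings}

def eval_predicate(rules, db):
    """Union of the results of all rules sharing one head predicate."""
    result = set()
    for head_args, body in rules:
        result |= eval_rule(head_args, body, db)
    return result

db = {
    "Student": {("123", "j.smith", "compsci"), ("456", "k.tappet", "french")},
    "Offers": {("cookery", "baking"), ("compsci", "compilers")},
    "Enroll": {("123", "baking"), ("012", "compilers")},
}
interested_in = eval_predicate(
    [
        (("X", "S"), [("Student", ("X", "Y", "S"))]),
        (("X", "S"), [("Enroll", ("X", "Z")), ("Offers", ("S", "Z"))]),
    ],
    db,
)
```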

Examples
    Result(N) ← Climbers(I,N,_,A) AND A 5

    InterestedIn(X,S) ← Student(X,Y,S)
    InterestedIn(X,S) ← Enroll(X,Z) AND Offers(S,Z)

    NotSingle(X) ← Married(X,Y)
    NotSingle(X) ← Married(Y,X)

Handling negated subgoals
Suppose NOT R(X,Y,Z) appears as a negated subgoal in some query.
Let DOM = {any symbol that appears in the rule} ∪ {any symbol that appears in any relational instance appearing in a subgoal of the rule}.
It is sufficient to "evaluate" NOT R(X,Y,Z) as (DOM × DOM × DOM) - R.
However, there are problems when this is combined with recursion! (More later…)

Example
The correct version of the example finding all single people in the database is:
    NotSingle(X) ← Married(X,Y)
    NotSingle(X) ← Married(Y,X)
    Single(X) ← Person(X) AND NOT NotSingle(X)
How do we evaluate this program?
Let DOM = {all symbols in Person} ∪ {all symbols in NotSingle}

Recursive queries (no negation) - "Naïve" evaluation
Let R, S, ... be IDB predicates occurring in a single cycle in the Depends-On graph.
    R = ∅, S = ∅
    while there is a change to R, S, … do
        R = R ∪ {evaluation of R}
        S = S ∪ {evaluation of S}

Example
    Parent(mary,jane)
    Parent(jane,fred)
    Parent(ed,bob)
    Parent(bob,fred)
    Parent(fred,jill)
    Ancestor(X,Y) ← Parent(X,Y)
    Ancestor(X,Y) ← Parent(X,Z) AND Ancestor(Z,Y)
Evaluation of Ancestor:
1. Ancestor = ∅
2. Ancestor = {(mary,jane), (jane,fred), (ed,bob), (bob,fred), (fred,jill)}

Example, cont.
3. Ancestor = {(mary,jane), (jane,fred), (ed,bob), (bob,fred), (fred,jill), (mary,fred), (jane,jill), (ed,fred), (bob,jill)}
4. Ancestor = {(mary,jane), (jane,fred), (ed,bob), (bob,fred), (fred,jill), (mary,fred), (jane,jill), (ed,fred), (bob,jill), (mary,jill), (ed,jill)}
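
The naïve loop for the Ancestor program can be written out as a short Python sketch. The function name and the set-of-pairs representation are assumptions; the two rules are hard-coded rather than interpreted.

```python
def naive_ancestor(parent):
    """Naive bottom-up evaluation of the two Ancestor rules.

    Returns the final relation and the list of intermediate states.
    """
    ancestor = set()
    history = [set()]
    while True:
        # Rule 1: Ancestor(X,Y) <- Parent(X,Y)
        derived = set(parent)
        # Rule 2: Ancestor(X,Y) <- Parent(X,Z) AND Ancestor(Z,Y)
        derived |= {
            (x, y) for (x, z) in parent for (z2, y) in ancestor if z == z2
        }
        new = ancestor | derived
        if new == ancestor:           # no change: fixpoint reached
            return ancestor, history
        ancestor = new
        history.append(set(ancestor))

parent = {
    ("mary", "jane"), ("jane", "fred"), ("ed", "bob"),
    ("bob", "fred"), ("fred", "jill"),
}
ancestor, history = naive_ancestor(parent)
sizes = [len(state) for state in history]
```

The recorded states correspond to steps 1 through 4 of the worked example, growing from the empty set to 5, 9 and finally 11 tuples; one further pass derives nothing new and stops the loop.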

Negation in Recursive Rules
Problem: what is the semantics? For example, suppose an EDB consisting of R(0), and the rules:
    P(X) ← R(X) AND NOT Q(X)
    Q(X) ← R(X) AND NOT P(X)
There are two "correct" answers: {R(0), P(0)} and {R(0), Q(0)}.
Both are minimal in the sense that we cannot throw out anything and get a correct answer (i.e. {R(0)} is inconsistent).

Stratified Negation: an overview
A technique for assigning a single meaning to certain safe Datalog programs with negation.
Works by dividing predicates up into "strata", which are linearly ordered. Each stratum must be completely evaluated before the next stratum is evaluated.
It would not be able to handle the previous example, as that program has a cycle containing "negative" arcs in the Depends-On graph.

Stratified Negation: the Details
A program is stratified iff whenever there is a rule with head p in which q occurs as a negated subgoal, there is no path from q to p in the Depends-On graph.
    NotSingle(X) ← Married(X,Y)
    NotSingle(X) ← Married(Y,X)
    Single(X) ← Person(X) AND NOT NotSingle(X)
[Figure: Depends-On graph with edges from Single to Person, from Single to NotSingle (annotated ¬), and from NotSingle to Married]

Strata-labeling Algorithm
    for each predicate p do stratum(p) = 1
    repeat until no changes to any stratum or some stratum exceeds the number of predicates
        for each rule r with head p do begin
            for each negated subgoal of r with predicate q do
                stratum(p) := max(stratum(p), 1 + stratum(q))
            for each positive subgoal of r with predicate q do
                stratum(p) := max(stratum(p), stratum(q))
        end for
    end repeat
Stratum(1): Person, Married, NotSingle
Stratum(2): Single
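
The labeling algorithm translates almost line for line into Python. In this sketch a rule is reduced to its head predicate and a list of (subgoal predicate, negated) pairs, which is an assumed encoding; returning None to signal a non-stratifiable program is likewise a choice made for the example.

```python
def stratify(rules):
    """rules: list of (head_pred, [(body_pred, negated), ...]).

    Returns a dict predicate -> stratum, or None if some stratum exceeds
    the number of predicates (the program is not stratified).
    """
    preds = {head for head, _ in rules}
    preds |= {q for _, body in rules for q, _ in body}
    stratum = {p: 1 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for q, negated in body:
                # Negated subgoals must sit in a strictly lower stratum.
                needed = stratum[q] + 1 if negated else stratum[q]
                if needed > stratum[head]:
                    stratum[head] = needed
                    changed = True
                    if stratum[head] > len(preds):
                        return None
    return stratum

single_program = [
    ("NotSingle", [("Married", False)]),   # NotSingle(X) <- Married(X,Y)
    ("NotSingle", [("Married", False)]),   # NotSingle(X) <- Married(Y,X)
    ("Single", [("Person", False), ("NotSingle", True)]),
]
pq_program = [
    ("P", [("R", False), ("Q", True)]),    # P(X) <- R(X) AND NOT Q(X)
    ("Q", [("R", False), ("P", True)]),    # Q(X) <- R(X) AND NOT P(X)
]
single_strata = stratify(single_program)
pq_strata = stratify(pq_program)
```

For the Single program this reproduces the strata listed above; for the P/Q program the strata of P and Q keep pushing each other up until the bound is exceeded, so the sketch reports that no stratification exists.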

Summary: RA versus Datalog
If subgoals can be negated we can simulate any RA expression in Datalog. (Without negated subgoals we cannot handle "-".)
To simulate "recursive" Datalog queries, we must add the fixpoint operator to the relational algebra.
Datalog with negated subgoals AND recursion may yield programs with more than one "minimal model" -- stratified negation is a technique for evaluating some of these programs.