Virtual Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary.

Slides:



Advertisements
Similar presentations
1 Datalog: Logic Instead of Algebra. 2 Datalog: Logic instead of Algebra Each relational-algebra operator can be mimicked by one or several Database Logic.
Advertisements

Manipulation of Query Expressions. Outline Query unfolding Query containment and equivalence Answering queries using views.
Relational Calculus and Datalog
CHAPTER 3: DESCRIBING DATA SOURCES
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman.
2005conjunctive-ii1 Query languages II: equivalence & containment (Motivation: rewriting queries using views)  conjunctive queries – CQ’s  Extensions.
Lecture 11: Datalog Tuesday, February 6, Outline Datalog syntax Examples Semantics: –Minimal model –Least fixpoint –They are equivalent Naive evaluation.
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
1 CHAPTER 4 RELATIONAL ALGEBRA AND CALCULUS. 2 Introduction - We discuss here two mathematical formalisms which can be used as the basis for stating and.
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
SECTION 21.5 Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Advanced Views: XML-Relational Storage and Datalog Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 8, 2005.
Datalog and Data Integration Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 12, 2007 LSD Slides courtesy.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A Modified by Donghui Zhang.
1 Relational Algebra & Calculus. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
SECTIONS 21.4 – 21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Conjunctive Queries, Datalog, and Recursion Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 23, 2003 Some slide.
CSE 636 Data Integration Datalog Rules / Programs / Negation Slides by Jeffrey D. Ullman.
2005conjunctive1 Query languages, equivalence & containment  conjunctive queries – CQ’s  More expressive languages.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Recursive Views and Global Views Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 9, 2004 Some slide content.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Rutgers University Relational Algebra 198:541 Rutgers University.
Deductive Databases Chapter 25
CSCD343- Introduction to databases- A. Vaisman1 Relational Algebra.
ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 12: Ontologies and Knowledge Representation PRINCIPLES OF DATA INTEGRATION.
Relational Algebra, R. Ramakrishnan and J. Gehrke (with additions by Ch. Eick) 1 Relational Algebra.
Lecture 2 The Relational Model. Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations.
Chapter 4 The Relational Model Pearson Education © 2014.
Chapter 4 The Relational Model.
Chapter 3 The Relational Model Transparencies Last Updated: Pebruari 2011 By M. Arief
Presenter: Dongning Luo Sept. 29 th 2008 This presentation based on The following paper: Alon Halevy, “Answering queries using views: A Survey”, VLDB J.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
INTRODUCTION TO THE THEORY OF COMPUTATION INTRODUCTION MICHAEL SIPSER, SECOND EDITION 1.
Recursive query plans for Data Integration Oliver Michael By Rajesh Kanisetti.
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
Navigational Plans For Data Integration Marc Friedman Alon Levy Todd Millistein Presented By Avinash Ponnala Avinash Ponnala.
Chapter 2 Adapted from Silberschatz, et al. CHECK SLIDE 16.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
Relational Calculus Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 17, 2007 Some slide content courtesy.
Datalog Inspired by the impedance mismatch in relational databases. Main expressive advantage: recursive queries. More convenient for analysis: papers.
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman Fall 2006.
CS 4432query processing1 CS4432: Database Systems II Lecture #11 Professor Elke A. Rundensteiner.
1 Relational Algebra & Calculus Chapter 4, Part A (Relational Algebra)
1 Relational Algebra and Calculas Chapter 4, Part A.
Relational Algebra.
INFORMATION INTEGRATION Shengyu Li CS-257 ID-211.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Chapter 5 Notes. P. 189: Sets, Bags, and Lists To understand the distinction between sets, bags, and lists, remember that a set has unordered elements,
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
Database Management Systems 1 Raghu Ramakrishnan Relational Algebra Chpt 4 Xin Zhang.
For Monday Finish chapter 19 No homework. Program 4 Any questions?
CSCD34-Data Management Systems - A. Vaisman1 Relational Algebra.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
Lu Chaojun, SJTU 1 Extended Relational Algebra. Bag Semantics A relation (in SQL, at least) is really a bag (or multiset). –It may contain the same tuple.
1 Reasoning with Infinite stable models Piero A. Bonatti presented by Axel Polleres (IJCAI 2001,
Presented by Kyumars Sheykh Esmaili Description Logics for Data Bases (DLHB,Chapter 16) Semantic Web Seminar.
1 The Relational Data Model David J. Stucki. Relational Model Concepts 2 Fundamental concept: the relation  The Relational Model represents an entire.
Extensions of Datalog Wednesday, February 13, 2001.
CS589 Principles of DB Systems Fall 2008 Lecture 4c: Query Language Equivalence Lois Delcambre
Lecture 2 The Relational Model
Computing Full Disjunctions
Cse 344 January 29th – Datalog.
Datalog, View Unfolding, Recursion and Foundations of Integration
XML Views & Reasoning about Views
Logic Based Query Languages
Datalog Inspired by the impedance mismatch in relational databases.
Presentation transcript:

Virtual Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Agenda Terminology Conjunctive queries and Datalog Virtual Data Integration Architecture

(Known) Terminology (1) Relational database composed of a set of relations (or tables) Database schema includes a relational schema for each table + set of integrity constraints  Key constraints, functional dependencies, foreign key constraints The rows of a relational table are called tuples (or records) The NULL value means the values is not known or doesn’t exist  Equality test envolving NULL always returns FALSE A database instance is a particular snapshot of the contents of the DB

(Known) Terminology (2) Queries used to formulate users’ needs  Structured (SQL or XQuery) or unstructured (for the Web, e.g., list of keywords)  Q(D): result of applying query Q to the database D Views are created when we want to reuse the same query expression in other queries  Materialized view: the answer is computed and maintained as the DB changes  Queries can also be used to specify relationships between schemas of data sources in a data integration context Two different notations for queries:  SQL  Conjunctive queries, for formal purposes

Conjunctive queries Based on Mathematical Logic Conjunctive query has the form: Q(X):- R1(X1),....., Rn(Xn), c1,..., cm  R1(X1)....Rn(Xn) are the sub-goals (or conjuncts) and form the body of the query  Ris are database relations, Xi’s are tuples of variables and constants  The variables in X are called head variables or distinguished variables; the others are called existencial variables  The predicate Q denotes the answer relation of the query  The ci’s are interpreted atoms and have the form X θ Y, with X, Y either variables or constants (but at least one is a variable), and θ is an interpreted predicate such as =,, !=, >, >=

Semantics of conjunctive queries Semantics of a conjunctive query Q over DB instance D:  Ψ: any mapping that maps each of the variables in Q to constants in D  Ψ(Ri): result of applying Ψ to Ri(Xi), which consists of ground atoms (constants)  Ψ(Q): the result of applying Ψ to Q(X), which consists of ground atoms  If Each of Ψ(R1),...., Ψ(Rn) is in D, and For each 1<=j<=m, Ψ(cj) is satisfied,  Then, Ψ(Q) is in the answer to Q over D

Correspondence between SQL queries and conjunctive queries Base relations: Interview(candidate, name, recruiter, hireDecision, grade) EmployeePerformance(empID, name, reviewQuarter, grade, reviewer) SQL: Select recruiter, candidate From Interview, EmployeePerformance Where recruiter = name And grade < 2.5 Conjunctive query: Q1(Y,X):- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W <= 2.5

Safety, disjunction Conjunctive queries must be safe:  Every variable appearing in the head also appears in the body  Otherwise, the set of possible answers to the query may be infinite Disjunction can be expressed writing two or more conjunctive queries with the same head predicate: Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W<=2.5 Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W >=3.9

Negation Conjunctive queries with negated goals can also be considered: Q(X):- R1(X1),..., Rn(Xn), ¬S1(Y1),…., ¬Sm(Ym) The notion of safety is extended to:  Any variable appearing in the head of the query must also appear in a positive subgoal To produce an answer for the query, the mappings from the variables of Q to the constants in the database must satisfy: Ψ(S1(Y1)),.... Ψ(Sm(Ym))  D

Datalog program Set of rules, each of which is a conjunctive query Instead of computing a single answer query, computes a set of intensional relations (IDB relations), one of them is designated the query predicate Extensional Database (EDB) relations: are the database relations, given by a set of tuples; can occur only in the body of rules Intensional Database (IDB) relations: are defined by a set of rules; can occur both in the head and in the body of rules

Datalog Terminology An example datalog rule: idb(x,y)  r1(x,z), r2(z,y), z < 10  Irrelevant variables can be replaced by _ (anonymous var)  Extensional relations or database schemas (edbs) are relations only occurring in rules’ bodies, or as base relations: Ground facts only have constants, e.g., r1(‘abc’, 123)  Intensional relations (idbs) appear in the heads – these are basically views Distinguished variables are the ones output in the head head subgoals body

Example A database that includes a relation representing the edges in a graph: edge(X,Y) The Datalog query: r1: path(X,Y) :- edge(X,Y) r2: path(X,Y) :- edge(X,Z), path(Z,Y) computes the paths in the graph If r2 was replaced by: r3: path(X,Y) :- path(X,Z), path(Z,Y) It would produce the same result

Semantics of a Datalog program Start with empty extensions for the IDB predicates Choose a rule in the program and apply it to the current extension of the EDB and IDB relations Add the tuples computed for the head to its extension Continue applying the rules of the program until no new tuples are computed for the IDB relations The answer to the query is the extension of the query predicate

Example (cont) r1: path(X,Y) :- edge(X,Y) r2: path(X,Y) :- edge(X,Z), path(Z,Y) Start with: edge(1,2), edge(2,3), edge(3,4) Apply r1 returns: path(1,2), path(2,3), path(3,4) Apply r2 once: path(1,3), path(2,4) Apply r2 twice: path(1,4)

Datalog is Relationally Complete We can map RA  Datalog:  Selection  p : p becomes a datalog subgoal  Projection  A : we drop projected-out variables from head  Cross-product r  s: q(A,B,C,D) :- r(A,B),s(C,D)  Join r ⋈ s : q(A,B,C,D) :- r(A,B),s(C,D), condition  Union r U s: q(A,B) :- r(A,B) ; q(C, D) :- s(C,D)  Difference r – s: q(A,B) :- r(A,B), ¬ s(A,B) Great… But then why do we care about Datalog?

A Query We Can’t Answer in RA Recall our example of a binary relation for graphs or trees (similar to an XML Edge relation): edge(from, to) If we want to know what nodes are reachable: reachable(F, T, 1) :- edge(F, T)dist. 1 reachable(F, T, 2) :- edge(F, X), edge(X, T)dist. 2 reachable(F, T, 3) :- reachable(F, X, 2), edge(X, T)dist. 3 But how about all reachable paths? (Note this was easy in XPath over an XML representation -- //edge)

Recursive Datalog Queries Define a recursive query in datalog: reachable(F, T, 1) :- edge(F, T)distance 1 reachable(F, T, D + 1) :- reachable(F, X, D), edge(X, T)distance >1 What does this mean, exactly, in terms of logic?  There are actually three different (equivalent) definitions of semantics; we will use that of fixpoint: We start with an instance of data, then derive all immediate consequences We repeat as long as we derive new facts  In the RA, this requires a (restricted) while loop! “Inflationary semantics” (which terminates in time polynomial in the size of the database!)

Our Query in RA + while (inflationary semantics, no negation) Datalog: reachable(F, T, 1) :- edge(F, T) reachable(F, T, D+1) :- reachable(F, X, D), edge(X, T) RA procedure with while: reachable += edge ⋈ literal1 while change { reachable +=  F, T, D (  F ! X (edge) ⋈  T ! X,D ! D0 (reachable) ⋈ add1 ) } Note literal1(F,1) and add1(D0,D) are actually arithmetic and literal functions modeled here as relations.

A Special Type of Query: Conjunctive Queries A single Datalog rule with no “Ç,” “:,” “8” can express select, project, and join – a conjunctive query  Conjunctive queries are possible to reason about statically (Note that we can write CQ’s in other languages, e.g., SQL!) We know how to “minimize” conjunctive queries An important simplification that can’t be done for general SQL We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance)  Why might this be useful?

Example of Containment Suppose we have two queries: q1(S,C) :- Student(S,N),Takes(S,C),Course(C, X), inCIS(C),Course(C,’DB & Info Systems’) q2(S,C) :- Student(S,N),Takes(S,C),Course(C,X) Intuitively, q1 must contain the same or fewer answers vs. q2:  It has all of the same conditions, except one extra conjunction (i.e., it’s more restricted)  There’s no union or any other way it can add more data We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

Agenda Terminology Conjunctive queries and Datalog Virtual Data Integration Architecture

Building a Data Integration System Create a middleware mediator or data integration system over the sources  Can be warehoused (a data warehouse) or virtual  Presents a uniform query interface and schema  Abstracts away multitude of sources; consults them for relevant data Unifies different source data formats (and possibly schemas) Sources are generally autonomous, not designed to be integrated  Sources may be local DBs or remote web sources/services  Sources may require certain input to return output (e.g., web forms) binding patterns describe these

Logical components of a virtual integration system Programs whose role is to send queries to a data source, receive answers and possibly apply some basic transformation to the answer Is built for the data integration application and contains only the aspects of the domain relevant to the application. Most probably will contain a subset of the attributes seen in sources Specify the properties of the sources the system needs to know to use their data. Main component is semantic mappings that specify how attributes in the sources correspond to attributes in the mediated schema, how to resolve differences in how data values are specified in different sources Other info is whether sources are complete

Example: a data integration scenario

Semantic mappings Semantic mappings in the source descriptions describe: the relationship between the sources and the mediated schema, for example:  mapping of source S1 states it contains movies; the attribute name in Movies maps to attribute title in the mediator Movie relation; the Actors relation in the mediated schema is a projection of the Movies source on the attributes name and actors Whether the sources are complete, for example:  Source S2 may not contain all the movie showing times in the entire country Can specify limited access patterns to the sources, e.g.,:  All the playing times sources require a movie title as input

Components of a data integration system

Query reformulation (1) Rewrite the user query that was posed in terms of the relations in the mediated schema, into queries referring to the schemas of data sources Result is called a logical query plan:  Set of queries that refer to the schemata of the data sources and whose combination will yield the answer to the original query

Query reformulation (2) Ex: SELECT title, startTime FROM Movie, Plays WHERE Movie.title = Plays.movie AND location= ‘New York’ AND director= ‘Woody Allen’ Tuples for Movie can be obtained from source S1 but attribute title needs to be reformulated to name Tuples for Plays can be obtained from S2 or S3. Since S3 is complete for showings in NY, we choose it Since source S3 requires the title of a movie as input and the title is not specified in the query, the query plan must first access S1 and then feed the movie titles returned from S1 as inputs to S3.

Query optimization Acepts a logical query plan as input and produces a physical query plan  Specifies the exact order in which sources are accessed, when results are combined, which algorithms are used for performing operations on the data Ex:  The optimizer will decide the join algorithm to combine results from S1 and S3. It may stream tuples from S1 and input them into S3, or it may batch them up before sending them to S3

Query execution Responsible for the execution of the physical query plan  Dispatches the queries to the individual sources through the wrappers and combines the results as specified by the query plan. Also may ask the optimizer to reconsider its plan based on its monitoring of the plan’s progress (e.g., if source S3 is slow)

Referências Chapter 2, Draft of the book on “Principles of Data Integration” by AnHai Doan, Alon Halevy, Zachary Ives (in preparation). Slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives) Raghu Ramakrishnan and Johannes Gehrke, “Database Management Systems”, 3rd edition, McGraw-Hill A. Silberchatz et al, “Database System Concepts”, 5th edition, McGraw-Hill S. Abiteboul et al, “Foundations of Databases”, Addison Weley, 1995