Describing and Using Query Capabilities of Heterogeneous Sources Vasilis Vassalos& Yannis Papakonstantinou Presented by Srujan Kothapally.

Slides:



Advertisements
Similar presentations
1 Datalog: Logic Instead of Algebra. 2 Datalog: Logic instead of Algebra Each relational-algebra operator can be mimicked by one or several Database Logic.
Advertisements

Theory of Computation CS3102 – Spring 2014 A tale of computers, math, problem solving, life, love and tragic death Nathan Brunelle Department of Computer.
10 October 2006 Foundations of Logic and Constraint Programming 1 Unification ­An overview Need for Unification Ranked alfabeths and terms. Substitutions.
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman.
2005conjunctive-ii1 Query languages II: equivalence & containment (Motivation: rewriting queries using views)  conjunctive queries – CQ’s  Extensions.
Lecture 11: Datalog Tuesday, February 6, Outline Datalog syntax Examples Semantics: –Minimal model –Least fixpoint –They are equivalent Naive evaluation.
1. An Overview of Prolog.
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
1 Conjunctions of Queries. 2 Conjunctive Queries A conjunctive query is a single Datalog rule with only non-negated atoms in the body. (Note: No negated.
Generalization and Specialization of Kernelization Daniel Lokshtanov.
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
Closure Properties of CFL's
SECTION 21.5 Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Copyright © Cengage Learning. All rights reserved.
Efficient Query Evaluation on Probabilistic Databases
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
A Semantic Characterization of Unbounded-Nondeterministic Abstract State Machines Andreas Glausch and Wolfgang Reisig 1.
1 Introduction to Computability Theory Lecture12: Decidable Languages Prof. Amos Israeli.
Introduction to Computability Theory
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
SECTIONS 21.4 – 21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Firewall Policy Queries Author: Alex X. Liu, Mohamed G. Gouda Publisher: IEEE Transaction on Parallel and Distributed Systems 2009 Presenter: Chen-Yu Chang.
Catriel Beeri Pls/Winter 2004/5 type reconstruction 1 Type Reconstruction & Parametric Polymorphism  Introduction  Unification and type reconstruction.
Local-as-View Mediators Priya Gangaraju(Class Id:203)
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz.
Boolean Unification EE219B Presented by: Jason Shamberger March 1, 2000.
Taylor Expansion Diagrams (TED): Verification EC667: Synthesis and Verification of Digital Systems Spring 2011 Presented by: Sudhan.
Normal forms for Context-Free Grammars
CS246 Query Translation. Mind Your Vocabulary Q: What is the problem? A: How to integrate heterogeneous sources when their schema & capability are different.
1 Introduction to Computability Theory Lecture11: The Halting Problem Prof. Amos Israeli.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Propositional Calculus Math Foundations of Computer Science.
Chapter 4: A Universal Program 1. Coding programs Example : For our programs P we have variables that are arranged in a certain order: Y 1 X 1 Z 1 X 2.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
The ACL2 Proof Assistant Formal Methods Jeremy Johnson.
CSE 636 Data Integration Limited Source Capabilities Slides by Hector Garcia-Molina Fall 2006.
DECIDABILITY OF PRESBURGER ARITHMETIC USING FINITE AUTOMATA Presented by : Shubha Jain Reference : Paper by Alexandre Boudet and Hubert Comon.
A Query Translation Scheme for Rapid Implementation of Wrappers Presented By Preetham Swaminathan 03/22/2007 Yannis Papakonstantinou, Ashish Gupta, Hector.
Navigational Plans For Data Integration Marc Friedman Alon Levy Todd Millistein Presented By Avinash Ponnala Avinash Ponnala.
Theory of Computing Lecture 17 MAS 714 Hartmut Klauck.
Lesson 3 CDT301 – Compiler Theory, Spring 2011 Teacher: Linus Källberg.
Mediators, Wrappers, etc. Based on TSIMMIS project at Stanford. Concepts used in several other related projects. Goal: integrate info. in heterogeneous.
1 Relational Algebra. 2 Relational Query Languages v Query languages: Allow manipulation and retrieval of data from a database. v Relational model supports.
ISBN Chapter 3 Describing Semantics -Attribute Grammars -Dynamic Semantics.
1 Relational Databases and SQL. Learning Objectives Understand techniques to model complex accounting phenomena in an E-R diagram Develop E-R diagrams.
Week 10Complexity of Algorithms1 Hard Computational Problems Some computational problems are hard Despite a numerous attempts we do not know any efficient.
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman Fall 2006.
1 Relational Algebra and Calculas Chapter 4, Part A.
INFORMATION INTEGRATION Shengyu Li CS-257 ID-211.
Formal Specification of Intrusion Signatures and Detection Rules By Jean-Philippe Pouzol and Mireille Ducassé 15 th IEEE Computer Security Foundations.
Propositional Calculus CS 270: Mathematical Foundations of Computer Science Jeremy Johnson.
ECSE Software Engineering 1I HO 4 © HY 2012 Lecture 4 Formal Methods A Library System Specification (Continued) From Specification to Design.
Ch. 13 Ch. 131 jcmt CSE 3302 Programming Languages CSE3302 Programming Languages (notes?) Dr. Carter Tiernan.
Complexity & Computability. Limitations of computer science  Major reasons useful calculations cannot be done:  execution time of program is too long.
Advanced Engineering Mathematics, 7 th Edition Peter V. O’Neil © 2012 Cengage Learning Engineering. All Rights Reserved. CHAPTER 4 Series Solutions.
1 Reasoning with Infinite stable models Piero A. Bonatti presented by Axel Polleres (IJCAI 2001,
Raluca Paiu1 Semantic Web Search By Raluca PAIU
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Overview of the theory of computation Episode 3 0 Turing machines The traditional concepts of computability, decidability and recursive enumerability.
Presented by Kyumars Sheykh Esmaili Description Logics for Data Bases (DLHB,Chapter 16) Semantic Web Seminar.
Lecture 9: Query Complexity Tuesday, January 30, 2001.
CS 2130 Lecture 18 Bottom-Up Parsing or Shift-Reduce Parsing Warning: The precedence table given for the Wff grammar is in error.
Modeling Arithmetic, Computation, and Languages Mathematical Structures for Computer Science Chapter 8 Copyright © 2006 W.H. Freeman & Co.MSCS SlidesTuring.
Introduction to Logic for Artificial Intelligence Lecture 2
Programming Languages Translator
Answering Queries using Templates with Binding Patterns
Containment Mappings Canonical Databases Sariaya’s Algorithm
Propositional Calculus: Boolean Algebra and Simplification
This Lecture Substitution model
INSTRUCTOR: MRS T.G. ZHOU
Presentation transcript:

Describing and Using Query Capabilities of Heterogeneous Sources Vasilis Vassalos& Yannis Papakonstantinou Presented by Srujan Kothapally

What is that the paper deals with? Information Integration systems have to cope with the different and limited query interfaces of the underlying information sources. This paper addresses different problems faced by the integration systems in handling wide range of sources.

It gives detailed view of different types of description used to know the query capabilities of each source. It also addresses different algorithms used for deciding how a query can be answered given the capabilities of the sources. Explains how a client query can be translated into the source specific format.

Mediator Wrapper1Wrapper 2Wrapper 3 Information Source 1 Information Source 2 Information Source 3 User/Application Integrated view

The mediators use the description to adapt to the query capabilities of the sources Example: consider a source that exports a lookup catalog lookup (Employee, Manager, Specialty) First the mediator retrieves the set of managers of Java employees, then the set of managers of Database employees, and finally it intersects the two sets.  The wrappers need descriptions of the source capabilities in order to translate the supported common language queries into queries and commands understood by the source interface.

The first program suggested for the descriptions of supported queries along with the translating actions was Yacc programs. These programs do not capture the logical properties of queries as they perceive them as mere strings. This behavior imposes limitations on both the mediators and the wrappers. To overcome the limitations of Yacc program, a more powerful description language was proposed i.e., Datalog variants

P-Datalog Source Description Language Are these languages expressive enough ? Given a description of the wrappers capabilities, how can we answer a client query using only queries supported by the wrappers. This problem is nothing Capabilities –Based Rewriting Problem.

A Datalog program is a natural encoding of conjunctive queries. In this language the source is described using a set of parameterized queries. Parameters, called tokens specify that some constant is expected in some fixed position in the query. It is assumed that a designated predicate ‘ans’ exists, which is the head of all the parameterized queries of the description

Consider an example of bibliographic information source. This source exports a ‘book’ predicate b (isbn, author, title, publisher, year, pages). It also exports ‘indexes’ on authors au_i (au_name, isbn), publishers pub_i (pub,isbn) and titles titl_i (t_word, isbn). Parameterized queries for the above example: ans (I, A, T, P, Y, Pg) b (I, A, T, P, Y, Pg), au_i ($c, I) ans (I, A, T, P, Y, Pg) b (I, A, T, P, Y, Pg), titl_i ($c, I) ans (I, A, T, P, Y, Pg) b (I, A, T, P, Y, Pg), pub_i ($c, I) The following query ans (I, A, T, P, Y, Pg) b (I, A, T, P, Y, Pg), au_i (‘Doe’, I) can be answered by the source

Example using IDB predicates The predicates that are defined using source predicates and other IDB predicates. Consider the same source as previous example but here the source can answer queries that specify any combination of the three indexes. The p-Datalog that describes this source: ans (I, A, T, P,Y, Pg) b (I, A, T, P, Y, Pg), ind 1 (I), ind 2 (I), ind 3 (I) (1) ind 1 (I) titl_i($c, I) ind 1 (I) E (2) ind 2 (I) au_i($c, I) (3) ind 2 (I) E ind 3 (I) pub_i ($c, I) (4) ind 3 (I) E The query ans (I, A, T, P,Y, Pg) b (I, A, T, P, Y, Pg), au_i (‘Doe’, I) can be answered by the source. p-Datalog rules that have the ans predicate in the head can be expanded into a possibly infinite set of conjunctive queries, but only some will refer to source predicate and these expansions are termed as terminal expansions

Example using recursive rules to describe a source that accepts an infinite set of query patterns Consider the same example but assume that there is an abstract index ab_i (ab_word, I) that indexes books based on words contained in their abstracts. The following p-Datalog program can be used to describe this source ans (I, A, T, P, Y, Pg) b (I, A, T, P, Y, Pg), ind(I) ind(I) ab_i ($c, I) ind(I) ind(I), ab_i ($c, I)

A parameterized Datalog rule or p-Datalog rule is an expression of the form p (u) p 1 (u),….,p n (u) where p, p 1,p 2,…., p,p are relation names and u, u 1, u 2,…u n tuples of constants, variables and tokens of appropriate arities. A p-Datalog program is a finite set of p-Datalog rules. P-Datalog as a source description language: let P be a p-Datalog program with a particular IDB predicate ans. The set of expansions Ep of P is the smallest set of rules such that Each rule of P that has ‘ans’ as the head predicate is in Ep; If r 1 : p q 1,..,q n is in E p,, r 2 : r s 1,….,s m is in P and a substitution is the most general unifier of some q and r then the resolvent of r 1 with r 2 using is in E p.. The set of terminal expansion T p of P is the subset of all expansions e E p containing only EDB predicates in the body. The set of queries described by P is the set of all rules p(r), where r T p and p assigns arbitrary constants to all tokens in r. The set of queries expressible by P is the set of all queries that are equivalent to some query described by P

Rectification For deciding expressibility as well as for solving the CBR problem the following rectified form of p-Datalog rules simplifies the algorithms. We assume the following conditions are satisfied: 1. No variable appears twice in subgoals of the query body or in the head of the query. Equalities between variables are made explicit through the use of an equality predicate e (X,Y). 2. No constants or tokens appear among the ordinary subgoals. Every constant or token is replaced by a unique variable that is equated to the constant or token through an e subgoal. 3. No variables appear only in an e subgoal of a query  The rules that obey the above conditions are called rectified rules and the process that transforms any rule to a rectified rule is rectification and the inverse procedure is de-rectification

Query Expressibility and CBR with p-Datalog descriptions QED algorithm is an extension of the classic algorithm for deciding query containment in a Datalog program. Consider the previous example but assume that the source contains a table on books b (isbn, author, publisher), a word index on titles, titl_i (t_word,isbn) and an author index au_i (au_name,isbn) ans(A,P) b (I, A, P), ind 1 (I 1 ), ind 2 (I 2 ), e(I,I 1 ), e(I,I 2 ) ind 1 (I) title_i (V, I), e (V, $c) ind 1 (I) E ind 2 (I) au_i (V,I), e (V, $c) ind 2 (I) E Let us consider the query Q ans (X,Y) b (I, X, Y), title_i (‘Zen’, I), au_i(‘Doe’,I) first we produce its rectified equivalent of Q: ans (X,Y) b (I, X, Y), title_i(V 1, I 1 ), au_i(V 2, I 2 ), e (V 1, ’Zen’), e (V 2, ’Doe’), e (I, I 1 ),e (I,, I 2 )

The Database for the example is b (i, x, y), title_i(v, i), au_i(v, i), e( i,i,i),e(v,’Zen),e(v,Doe) Set of facts and supporting sets derived: (5): <ind (i) { title_i (v, i),e (v,’Zen’) (6): The two containing queries correspond to (5) and (6). Indeed the corresponding query to (6) is ans (X, Y) b(I,X,Y), title_i (‘Zen', I), au_i(‘Doe’,I) which is equivalent to our given query.  QED algorithm discovers expressibility by matching the datalog program rules with the subgoals. Matching is done by creating a Database which contains a frozen fact for every subgoal of the query and then evaluates the p-Datalog program on it trying to produce the frozen head of the query..  It also keeps track of different ways to produce the same fact: that is achieved by annotating each produced fact f with its supporting facts, i.e, the facts of the canonical DB that were used in that derivation of f.

Theorem: (correctness) Algorithm QED terminates and produces the set of minimal containing queries of input query Q. Lemma : Q is expressible by P iff the set of supporting facts for the frozen head of Q is identical to the canonical DB for Q. Theorem : (complexity) Algorithm QED terminates in time exponential to the size of the description and the size of the query

Translation We could extend Algorithm QED to annotate each fact not only with its supporting set, but also with its proof tree. The wrapper then uses this parse tree to perform the actual translation of the user query in source specific queries and commands, by applying the translating actions that are associated with each rule of the description.

Addressing CBR Problem CBR problem is equivalent to the problem of answering the user query using an infinite set of views. CBR algorithm proceeds in two steps:  Firstly it uses QED algorithm to generate the finite set of expansions.  Secondly it uses an algorithm for answering queries using views to combine some of these expansions to answer the query.  The time complexity of CBR algorithm is non deterministic exponential in the size of the query and the description, which is a considerable improvement over the previously known solution.

Theorem : Assume we have a query Q and a p-Datalog description P, and let {Q i } be the result of applying algorithm QED on Q and P. There exists a rewriting Q’ of Q, such that Q’ Q, using any {Q j / Q j is expressible by P} if and only if there exists a rewriting Q”, such that Q” Q using only {Q i } Using the set {Q i } of minimal containing queries as input to one of these algorithms, we obtain a solution to the CBR problem for p-Datalog that is non- deterministic exponential, since/ {Q i } / is exponential in the size of p- Datalog description and the user query.

Expressive Power of p-Datalog There exist recursive sets of conjunctive queries that are not expressible by any p- Datalog description. Theorem : Let k be some integer and S be a database schema. There exists a p- Datalog program that describes all conjunctive queries over S with at most k variables.

Limitations of p- Datalog : In every p-Datalog description program P, the arity of the result is exactly the arity of the “ans” predicate. This restriction is somewhat artificial since we can define descriptions with more than one “answer” predicate. A more serious restriction is due to the fixed number of variables that occur in any one of the rules of the program. Theorem : Let the database schema S have a relation of arity atleast two. For every p- Datalog description P over S, there exists a Boolean query Q over S, such that Q is not expressible. The following theorem points out a serious limitation of p- Datalog descriptions.

The RQDL Description Language Relational Query Description Language is a datalog – based rule language used for description of query capabilities. It shows its advantages over datalog when it is used for descriptions that are not schema- specific. RQDL, unlike p-Datalog, can describe the set of all conjunctive queries The decision on whether a given conjunctive query is expressed by an RQDL description is reduced to deciding expressibility of the query by the resulting p-Datalog program which allows us to give a complete solution to the CBR problem for RQDL.

RQDL allows the use of predicate tokens in place of relation names to support schema independent descriptions. To allow tables of arbitrary arity and column names, RQDL provides special variables called vector variables, or simply vectors, that match with sets of relation attributes that appear in a query. Vectors can “carry” arbitrarily large sets of attributes. It is this property that eventually allows the description of large, interesting sets of conjunctive queries

Example Consider a source that accepts queries that refer to exactly one relation and pose exactly one selection condition over the source schema. ans() $r (V), item( V, _A, X’), e (X’, $c) The above RQDL description describes, among others, the query ans() b( f1 ; X, f2 : Z), e( X, ‘Data’)

Reducing RQDL to p-Datalog with Function Symbols To decide whether a query is expressible by an RQDL description requires “ matching” the RQDL description with the query. This is the challenging problem because vectors have to match with non atomic entities. Brute force approach is used for matching but this approach breaks down in the presence of unsafe rules that have vectors in the head. The algorithm avoids these problems by reducing the problem of query expressibility by RQDL descriptions to the problem of query expressibility by p-Datalog with function symbols. Reduction is based on the idea that every database DB can be reduced into an equivalent database DB’ (standard schema database) such that the attribute names and relation names of DB appear in the data of DB’

Example Consider the following database DB with schema b( au, id) and f( subj, id ).The standard schema consists of two relations, a tuple relation t( rel, tuple id) and an attribute relation a( tuple id, attr, value). f subj id b au id ‘Doe’ 1 ‘Sax’ 2 ‘Law’ 1 ‘Art’ 2

The corresponding standard schema database DB’ is bb(‘Doe’,1) bb(‘Sax’,2) ff(‘Law’,1) ff(‘Art’,2) rel tuple id t

b(‘Doe’,1) au ‘Doe’ b(‘Doe’,1) id 1 b(‘Sax’,2) au ‘Sax’ b(‘Sax’,2) id 2 f(‘Law’,1) subj ‘Law’ f(‘Law’,1) id 1 f(‘Art’,2) subj ‘Art’ f(‘Art’,2) id 2 a tuple id attr value

Reduction of CQ queries to standard schema queries The RQDL expressibility algorithm first reduces a given conjunctive query Q over some database DB into a corresponding query Q’ over the standard schema database DB’ Boolean query Q over the schema of previous example ans() b( au : X, id : S 1 ), f( subj : A, id : S 2 ), e(S 1, S 2 ), e ( A, ’Art’) Query Q is reduced into the following query Q’: t( ans, ans()) t( b, B), t( f, F), a( B, id, S 1 ), e(S 1, S 2 ), a( F, id, S 2 ), a( F, subj, A), e( A, ‘Art’)

Reduction of RQDL programs to standard schema Datalog programs Reduce RQDL descriptions into p-Datalog descriptions that do not use higher order features such as meta predicates and vectors. We reduce vectors to tuple identifiers, if a vector matches with the arguments of a subgoal, then the tuple identifier associated with this subgoal is finding all the attributes-variable pairs that the vector will match to. The reduction carefully produces the program which terminates despite the use of the u function. Theorem : Let P be an RQDl description and P’ its reduction in p- Datalog with functions. Let also DB be a canonical standard schema database of a query Q. Then P’ applied on DB terminates

Expressibilty and CBR with RQDL descriptions Theorem : Let us consider a query Q over some schema and RQDL description P. Also let Q’= {Q i } be the set of standard schema queries that is the reduction of Q and P’ be the standard schema reduction of P. Then Q is expressible by P if and only if each Q i is expressible by P’, that is iff set Q’ is expressible by P’. QED algorithm is used to answer the expressibility question in RQDL

The CBR Problem for RQDL CBR problem for a given query can be solved in two steps:  Generate the set of relevant described queries from the output of the algorithm QED, by glueing together the tuple and attr subgoals that have the same supporting set.  Given a query and a number of relevant described queries over the same schema, we can apply an answering queries using views algorithm, where the views are relevant described queries.

Conclusion Discussed the problems of  Describing the query capabilities of sources.  Using the descriptions for source wrapping and mediation.  Expressibility and CBR and provided algorithms to solve them.  P-Datalog cannot model a fully fledged relational DBMS.  Described RQDL, which is more expressive language than p-Datalog and also reduced RQDL descriptions into p- Datalog augmented with function symbols.

Questions?

Thank You!!