From Enterprise Information Integration to Community-Based Mediation Alin Deutsch, Yannis Katsis, Michalis Petropoulos Yannis Papakonstantinou A presentation.

Presentation transcript:

From Enterprise Information Integration to Community-Based Mediation
A presentation by Yannis Papakonstantinou, CSE Department, on joint works with Alin Deutsch, Yannis Katsis, Michalis Petropoulos

Data Integration Requirements & Desiderata (high level)
Provide the application with an integrated database: a single point of (query/update) access to the data
Provide distribution and heterogeneity transparency
–heterogeneous formats, heterogeneous interfaces, different rates of change (static versus dynamic), autonomous sources
Decouple application logic from integration
Easily add/change sources
Customize the delivery of content

Most-Generic Integration System Architecture
[Diagram: several Client Applications access several Information Sources through a layer of Integration Software.]

SIGMOD Community’s Architecture for Unified Access to Data & Services
[Diagram components: Information + Service Sources, each behind a Wrapper that exposes a Local Common Model (XML) View + Services; a Mediator with Cache & Replication that offers an Integrated (XML) Global View / Ontology + Services; (Web) Client Applications.]

Approaches towards View-Based Data Integration
Integration Specification Method: Local As View (LAV), Global As View (GAV), GLAV = GAV + LAV
Info Model & Query Language: Relational (SQL), XML (XQuery), Object-Oriented
Storage Method: Warehousing (materialized views), On-Demand (virtual views)

Enterprise Information Integration Reaches Maturity
Materialized View (Warehousing) approach well-adopted since mid/late 90s
–GAV function role played by Extract-Transform-Load tools
–Human intervention occasionally needed: cleaning up, concordance tables for object identification
Virtual View (Mediation) approach at early adoption
–many years of research: distributed db’s, federated db’s, mediators
–moving well into mainstream: BEA AquaLogic (XML, Virtual, GAV view), IBM DB2 Enterprise

Current Enterprise Information Integration Deployments
[Diagram, within the Enterprise: the Integration Admin uses a View Builder (design time) to define a GAV View V over the Marketing Local View M, the Sales Local View S and the Service Local View E; the Mediator Query Processor (run time) offers the Integrated Global View V(M, S, E). Labels: Schemas, Data.]
Small domain
Mostly vertical partition of sources
Primarily application-driven view or identity view
Integration administrator/developer in charge

Opportunities and Needs Presented by “Motivated” Communities
Emerging myriads of Internet communities
–Myriads of sources and clients
–Source owners motivated to participate
EII does not address their needs
–Expensive
–Bottleneck of a single integration admin
Make building the corresponding portals similar to starting and participating in newsgroups
Appropriate tools needed to enable source owner and client participation

A Community-Based Information Modeling Architecture
[Diagram: in each Source Owner’s Domain, Information Source i (i = 1 … n) exposes a Local XML View S_i and Data Services i, connected to the Integrated XML View G by a GLAV mapping V_i. In the Integrated View Owner’s Domain, G offers Data Services. Each Client Application j (j = 1 … m) uses an Application View V_j^a(G), defined over G by a GAV mapping V_j^a.]

Visual Tools Matter! (example from the Enosys Query Builder)
[Screenshot: Enosys Query Builder, project C:\Enosys\projects\allPONS.qpr. Callouts: (1) open & view source schemas in XML; (2) drag & drop to create the target XML view; target schema (XML view); automatically generated maps.]

[Screenshot: Enosys Query Builder, project C:\Enosys\projects\allPONS.qpr. Callouts: (3) run & test XQuery; XQuery based on design specs; XML result.]

Architecture for Large-Scale Data Integration System and Design Tools
[Diagram spanning the Source, Integration, Application and Web Domains. Components: Data Sources with Source Schemas, Web Services, Web Services Cache (Metadata), Mediator, Global View Schema, Application, Web Forms & Reports. Roles: Source Owner, Integration Engineer, Developer.]
How can the user query and browse the integrated data? QURSED
What queries can my app issue? What integrated view services can I build? CLIDE
How do I export my database services functionality? RIDE-Services
How do I export my data? RIDE

Dual Interactive Registration Problems
Register Source Given Global Schema, Constraints & Queries: guide the source owner in registering a new source and services
Register Client Given Sources: guide the client in query/form writing
[Diagram: a New Source and Services on one side, and a New App / New Query on the other, connect to the Global View next to the existing Source Services, Apps and Queries.]

Source Data Registration
How do my source attributes map to global attributes?
–mappers & automatic matchers
How do my data relate to queries & other sources?
–Inconsistencies?
–What does it take to contribute to queries?
–How much should I clean up?
Multiple ways of dealing with redundancy
[Diagram: on the server side, a New Source joins the Global View, which serves Apps and Queries.]

How to achieve this Goal
Before: look at all sources & queries, then decide how to register your source
Now: follow the suggestions of the interface (Source Registration Tool)
[Diagram: in both cases a New Source joins the Global View, which serves Apps and Queries.]

Our Goal in Source Registration
Guide the source owner visually through the registration of the source so as to
–avoid/warn about (potential) inconsistencies and
–contribute information to the answer of the queries
while exposing the minimum information possible and/or minimizing effort

The Problem
[Diagram: Client Queries are posed against the Mediator (Global DB), which sits on top of the Sources (Actual Local DBs).]

The Contribution Problem
What is the contribution of source S to the result of the query Q?
[Diagram: Client Queries, Mediator (Global DB), Sources (Actual Local DBs); source S and query Q are highlighted.]

The Problem
What is the contribution of source S to the result of the query Q?
Q: cars. S is Self Sufficient w.r.t. Q
Q: cars JOIN reviews. S is Now Complementary w.r.t. Q
[Diagram: Client Queries, Mediator (Global DB), Sources (Actual Local DBs) holding cars and reviews.]

Relational Schemas: Local and Global
Source 1, Business Magazine (S1): make(Carmake, Origin, Sales)
Source 2, Car Magazine (S2): auto(Id, Model, Carmake), detail(Id, Engine, Baseprice)
Global Car Portal (G): car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
[Visual representation: each relation is drawn as a box listing its attributes.]

Source Registration using GLAV Mappings
Source Registration: correspondence between a source schema and the global schema = set of mapping constraints of the form (U ⊆ V)
–U: CQ over the source schema
–V: CQ over the global schema
Open World
Global and Local As View (GLAV)

Target Constraints
Constraints on the global schema = set of constraints of the form (U ⊆ V)
–U, V: CQs over the global schema
Also expresses dependencies (PKs, referential integrity, …)

Visual Representation of Mappings (1)
Business Magazine: provides Carmake and Origin
U1(C, O) :- make(C, O, S)
V1(C, O) :- brand(C, O)
(U1 ⊆ V1)
Schemas: S1: make(Carmake, Origin, Sales); G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
[Visual representation (IBM Clio): lines connect make.Carmake to brand.Carmake and make.Origin to brand.Origin.]
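
A minimal sketch of what it means for instances to satisfy the mapping (U1 ⊆ V1) under the open-world reading: every (Carmake, Origin) pair exported by make must appear in brand, while brand may contain more. The sample tuples and the function names are illustrative assumptions, not part of the slides.

```python
# Hypothetical instances; only the relation layouts come from the slides.
make = {("BMW", "Germany", 1200), ("Honda", "Japan", 3400)}             # make(Carmake, Origin, Sales)
brand = {("BMW", "Germany"), ("Honda", "Japan"), ("Audi", "Germany")}   # brand(Carmake, Origin)


def u1(make_rel):
    """U1(C, O) :- make(C, O, S): project the Sales column away."""
    return {(c, o) for (c, o, _sales) in make_rel}


def v1(brand_rel):
    """V1(C, O) :- brand(C, O)."""
    return set(brand_rel)


def satisfies_mapping(make_rel, brand_rel):
    """True when U1 is contained in V1; extra brand tuples are allowed (open world)."""
    return u1(make_rel) <= v1(brand_rel)


print(satisfies_mapping(make, brand))
```

A global instance that lacks one of the source's pairs, for example a brand relation holding only ("BMW", "Germany"), violates the mapping.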

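Visual Representation of Mappings (2)
Car Magazine: provides Model, Carmake and Baseprice
Source side: auto(I, M, C), detail(I, E, B)
Global side: car(M, C, ?, B)
Schemas: S2: auto(Id, Model, Carmake), detail(Id, Engine, Baseprice); G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
[Visual representation (IBM Clio): lines connect the source attributes Model, Carmake and Baseprice to the corresponding attributes of car.]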

Example of Target Constraint
(Model, Carmake) is a PK of car
U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)
V1(M, C, D, B, D, B) :- car(M, C, D, B)
(U1 ⊆ V1)
Schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
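
The containment above can be checked directly on a small instance. The sketch below evaluates U1 and V1 over a car relation and tests U1 ⊆ V1; the sample tuples and the function names are assumptions made for illustration.

```python
from itertools import product


def u1(car):
    """U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)."""
    return {
        (m1, c1, d1, b1, d2, b2)
        for (m1, c1, d1, b1), (m2, c2, d2, b2) in product(car, repeat=2)
        if (m1, c1) == (m2, c2)
    }


def v1(car):
    """V1(M, C, D, B, D, B) :- car(M, C, D, B)."""
    return {(m, c, d, b, d, b) for (m, c, d, b) in car}


def satisfies_pk(car):
    """(Model, Carmake) is a key of car exactly when U1 is contained in V1."""
    return u1(car) <= v1(car)


# Hypothetical instances of car(Model, Carmake, Doors, Baseprice).
ok_car = {("M3", "BMW", 2, "45K"), ("A4", "Audi", 4, "30K")}
bad_car = {("M3", "BMW", 2, "45K"), ("M3", "BMW", 4, "45K")}

print(satisfies_pk(ok_car), satisfies_pk(bad_car))
```

In bad_car the two M3/BMW tuples disagree on Doors, so U1 produces a tuple with D1 ≠ D2 that V1 cannot match.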

Query Semantics
Queries in UCQ
Possible global instances = set of global instances that satisfy all constraints
Query answers = set of certain answers: the tuples appearing in the answer to Q for every possible global instance
[Diagram: the certain answers to Q are contained in the answer to Q over each of the possible global instances.]
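
As a rough illustration of the certain-answer definition, the sketch below intersects the answers to Q over a finite, hand-picked list of possible global instances. In general the set of possible instances is infinite, so this only illustrates the definition and is not a query-answering procedure; the instances and the query are assumptions.

```python
from functools import reduce


def certain_answers(query, possible_instances):
    """Tuples returned by the query on every one of the given instances."""
    answers = [query(instance) for instance in possible_instances]
    return reduce(set.intersection, answers) if answers else set()


def baseprices_of_models(car):
    """Q(M, B) :- car(M, C, D, B)."""
    return {(m, b) for (m, _c, _d, b) in car}


# Two hypothetical possible global instances of car(Model, Carmake, Doors, Baseprice).
instance_1 = {("M3", "BMW", 2, "45K")}
instance_2 = {("M3", "BMW", 4, "45K"), ("A4", "Audi", 4, "30K")}

print(certain_answers(baseprices_of_models, [instance_1, instance_2]))
```

Only (M3, 45K) is returned on both instances; the A4 tuple appears in one possible instance only, so it is not certain.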

Source Instance’s Contribution
Answer to Q: for given instances of the sources
Contribution to Q of a source instance = the tuples in the answer to Q not provided by the other sources
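
The definition can be read as a set difference: the answer with all sources minus the answer without the source in question. The sketch below makes the simplifying assumption that every source exports car tuples directly into the global relation (identity mappings) and that the answer is computed over their union; source names and tuples are hypothetical.

```python
def answer(query, sources):
    """Evaluate the query over the union of the tuples exported by the sources."""
    global_car = set().union(*sources.values())
    return query(global_car)


def contribution(query, sources, name):
    """Tuples in the answer that disappear when source `name` is left out."""
    others = {n: tuples for n, tuples in sources.items() if n != name}
    return answer(query, sources) - answer(query, others)


def baseprices_of_models(car):
    """Q(M, B) :- car(M, C, D, B)."""
    return {(m, b) for (m, _c, _d, b) in car}


# Hypothetical source instances over car(Model, Carmake, Doors, Baseprice).
sources = {
    "S1": {("M3", "BMW", None, "45K")},
    "S2": {("A4", "Audi", 4, "30K")},
}

print(contribution(baseprices_of_models, sources, "S1"))
```

If a third source exported the same M3 tuple, the contribution of S1 would become empty, since the other sources would already provide it.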

Source Registration’s Contribution
Source registration: source mappings
Degrees of a source registration’s contribution, from more contribution to less:
Self Sufficient
Now Complementary
Later Complementary
Unusable

Self Sufficient Registration: Example
Query: Baseprices of Models
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Example global tuple: car(M3, BMW, ?, 45K)
Green registration is Self Sufficient

Self Sufficient Registration: Definition
Self Sufficient: there exists a source instance such that the source has a non-empty contribution to the answer to Q in the absence of the other sources

Now Complementary Registration: Example
Query: Baseprices of Models by German manufacturers (brand.Origin = ‘Germany’)
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Example global tuples: car(M3, BMW, ?, 45K), brand(BMW, Germany)
Green registration is Now Complementary

Now Complementary Registration: Definition
Now Complementary: not Self Sufficient, and there exist source instances such that the source has a non-empty contribution to the answer to Q in combination with the other existing sources

Later Complementary Registration: Example
Query: Baseprices of Models by German manufacturers (brand.Origin = ‘Germany’)
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Example global tuples: car(M3, BMW, ?, 45K), brand(BMW, Germany)
Green registration is Later Complementary

Later Complementary Registration: Definition
Later Complementary: not Self Sufficient, not Now Complementary, and there exist potential future sources and source instances such that the source has a non-empty contribution to the answer to Q in combination with the future sources

Unusable Registration: Example
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
[Diagram labels: brand(Carmake, Origin); Origin of Carmakes.]
Green registration is Unusable

Unusable Registration: Definition
Unusable: not Self Sufficient, not Now Complementary and not Later Complementary; the source has an empty contribution to the answer to Q regardless of what sources enter the system

Subtleties for Unusable Registrations
Query: Baseprices and Doors of Models
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Example global tuples: car(M3, BMW, ?, 45K), car(M3, BMW, 2, ?)
Green registration is Unusable

In presence of PK, the Unusable Example becomes Later Complementary
Query: Baseprices and Doors of Models
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
The tuples car(M3, BMW, ?, 45K) and car(M3, BMW, 2, ?) agree on the key (M3, BMW) and merge into car(M3, BMW, 2, 45K)
Green registration is Later Complementary
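
The merge that the primary key enables can be sketched as follows. This is a simplified stand-in for enforcing the key constraint on incomplete tuples: unknown values are modelled as None, and the function name and conflict handling are assumptions.

```python
def merge_on_key(tuples, key_len):
    """Merge tuples that agree on the first `key_len` columns, filling unknowns (None)."""
    merged = {}
    for row in tuples:
        key, rest = row[:key_len], row[key_len:]
        if key not in merged:
            merged[key] = list(rest)
            continue
        for i, value in enumerate(rest):
            if merged[key][i] is None:
                merged[key][i] = value
            elif value is not None and value != merged[key][i]:
                # Two known, different values for the same key violate the PK.
                raise ValueError(f"key violation for {key}")
    return {key + tuple(rest) for key, rest in merged.items()}


# car(Model, Carmake, Doors, Baseprice) with (Model, Carmake) as the key.
partial = [("M3", "BMW", None, "45K"), ("M3", "BMW", 2, None)]
print(merge_on_key(partial, key_len=2))
```

Without the key the two partial tuples stay separate and neither answers "Baseprices and Doors of Models"; with it, the merged tuple carries both values.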

Decidability Results
Overview: What is decidable
Target constraints (columns): None; Primary keys; Primary keys + Referential Integrity Constraints
Degrees (rows):
Self Sufficient: Yes, No
Now Complementary: Yes, No
Later Complementary: Yes, ?
Unusable: Yes, ?

Issues
Unique client query vs. multiple client queries
–Contribute to: all queries? one query? specific queries? some queries based on some ranking?
Data independence vs. data dependence
–e.g. M1: cars, refPrices; M2: reviews; Q: cars JOIN reviews JOIN refPrices
–DB1: cars, refPrices (Audis); DB2: reviews (Hondas)
–(M2, Q) is now-complementary, but Certain Answers for instances DB1, DB2 = ∅
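
The data-dependence point can be made concrete with a toy instance: the registration is now-complementary at the schema level, yet the actual data do not join. The attribute layouts and all values below are assumptions; only the Audi/Honda split comes from the slide.

```python
# Hypothetical instances: DB1 holds Audis, DB2 reviews Hondas.
cars = [("A4", "Audi"), ("A6", "Audi")]          # cars(model, make)
ref_prices = [("A4", 30000), ("A6", 40000)]      # refPrices(model, price)
reviews = [("Civic", "reliable")]                # reviews(model, text)


def q(cars, reviews, ref_prices):
    """Q: cars JOIN reviews JOIN refPrices, joined on model."""
    return {
        (model, make, text, price)
        for (model, make) in cars
        for (r_model, text) in reviews if r_model == model
        for (p_model, price) in ref_prices if p_model == model
    }


print(q(cars, reviews, ref_prices))                        # empty for these instances
print(q(cars, reviews + [("A4", "sporty")], ref_prices))   # non-empty once a matching review exists
```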

Putting it all together
Architecture: Query Answering / Mappings / Schemas
[Diagram: query Q over the Global Schema; registered sources S1 … Sn with mappings M1 … Mn; new source Sn+1 (S’) with mapping Mn+1; Local Schemas.]
Contribution: 4 categories: Self Sufficient / Now Complementary / Later Complementary / Unusable
Goal: guide the source owner visually through the registration of the source so as to raise contribution to the answer of the queries while exposing the minimum info possible and/or minimizing effort

Example 1: without primary keys in the target
[Diagram: the Global Schema (Community) and the Query (AppQuery) are drawn as trees over car, usedAd*, review* and refPrice* with attributes vin, model, price, drive, condition and quality; the Local Schema (AutoTrader) is drawn over car and ad* with attributes vin, cmodel, price, carId and id.]
Step 1, no mappings drawn: Unusable. BLUE: map at least one of the groups
Step 2, mapping drawn to model (label “= cmo”): Unusable
Step 3, mappings drawn to model (label “= cmo”) and to price (label “= price”): Later Complementary

Example 2: with primary keys in the target
[Diagram: same Global Schema (Community), Query (AppQuery) and Local Schema (AutoTrader) as in Example 1.]
Step 1, no mappings drawn: Unusable
Step 2, mappings drawn to price (label “= price”) and to vin (label “= vin”): Later Complementary

Lessons learned
To merge data with that of other sources (become complementary), pick a relation and provide…
–In absence of primary keys: …all its attributes asked by the query
–In presence of primary keys: …its primary key and one of its attributes asked by the query
The number of choices increases in presence of primary keys
Foreign keys on the target affect the suggestions
Target constraints make a difference
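
The two rules above can be turned into a small helper that enumerates the attribute groups a source owner could map for one target relation. This is only an illustrative reading of the slide, not the registration tool's algorithm; the function name is hypothetical and treating vin as the key of car is an assumption.

```python
def complementary_choices(asked_attrs, primary_key=()):
    """Attribute groups of one target relation that a source could map.

    Without a primary key: all attributes asked by the query, as one group.
    With a primary key: the key plus any single asked non-key attribute.
    """
    asked = list(asked_attrs)
    if not primary_key:
        return [set(asked)]
    key = set(primary_key)
    return [key | {attr} for attr in asked if attr not in key]


# Relation car with the query asking for model and price (as in Examples 1 and 2).
print(complementary_choices(["model", "price"]))            # one group, no key
print(complementary_choices(["model", "price"], ["vin"]))   # one group per asked attribute
```

With a key there are two alternative groups instead of one, which matches the remark that the number of choices increases in presence of primary keys.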

Large-Scale Data Integration Systems
[Diagram spanning the Source, Integration, Application and Web Domains. Components: Data Sources with Source Schemas, Web Services, Mediator, Global View Schema, Application, Web Forms & Reports. Roles: Source Owner, Integration Engineer, Developer.]
How can the user query and browse the integrated data? QURSED
What queries can the mediator answer for me? CLIDE
How do I export my database services functionality? RIDE-Services
How do I export my data? RIDE

Running Example: Parameterized Views
Dell
Schema
  Computers(cid, cpu, ram, price)
  NetCards(cid, rate, standard, interface)
Views
  V1: ComByCpu(cpu) → (Computer)* (Computers for a given cpu)
    SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu
  V2: ComNetByCpuRate(cpu, rate) → (Computer, NetCard)* (Computers & NetCards for a given cpu & rate)
    SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, NetCards Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate
Cisco
Schema
  Routers(rate, standard, price, type)
Views
  V3: RouByTypeW() → (Router)* (Wired Routers)
    SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired'
  V4: RouByTypeWL() → (Router)* (Wireless Routers)
    SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless'
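
A parameterized view behaves like a function whose arguments are bound into the WHERE clause. The sketch below wraps V1 and V2 as Python functions over an in-memory SQLite database; the sample rows, the column types and the function names are assumptions, while the SQL follows the view definitions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Computers(cid TEXT, cpu TEXT, ram INTEGER, price INTEGER);
CREATE TABLE NetCards(cid TEXT, rate TEXT, standard TEXT, interface TEXT);
-- Hypothetical sample rows.
INSERT INTO Computers VALUES ('A123', 'P4', 512, 900);
INSERT INTO Computers VALUES ('B123', 'P4', 1024, 1100);
INSERT INTO Computers VALUES ('C123', 'M', 256, 700);
INSERT INTO NetCards VALUES ('A123', '54Mbps', '802.11g', 'USB');
INSERT INTO NetCards VALUES ('B123', '11Mbps', '802.11b', 'USB');
""")


def com_by_cpu(cpu):
    """V1 ComByCpu(cpu): Computers for a given cpu."""
    sql = "SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu = ?"
    return conn.execute(sql, (cpu,)).fetchall()


def com_net_by_cpu_rate(cpu, rate):
    """V2 ComNetByCpuRate(cpu, rate): Computers & NetCards for a given cpu & rate."""
    sql = (
        "SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, NetCards Net1 "
        "WHERE Com1.cid = Net1.cid AND Com1.cpu = ? AND Net1.rate = ?"
    )
    return conn.execute(sql, (cpu, rate)).fetchall()


print(com_by_cpu("P4"))
print(com_net_by_cpu_rate("P4", "54Mbps"))
```

Neither function can be called without a cpu value, which is why a query such as "Get all Computers" has no rewriting over these views.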

Running Example
Global schema puts together the Dell and Cisco schemas
Resembles the schema of the CNET.com portal
Column Associations
(Computers.cid, NetCards.cid)
(NetCards.rate, Routers.rate)
(NetCards.standard, Routers.standard)
[Diagram: the Developer’s Application queries the Global Schema at the Mediator, which accesses Dell through views V1 and V2 and Cisco through views V3 and V4.]

Sophisticated Mediators Make Feasibility Hard to Predict
Feasible Queries FQ: queries that have equivalent CQ rewritings using the views
–Might involve more than one view
–Order might matter
Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers. Feasible
Query: Get all Computers. Infeasible
[Diagram, steps A to E: the Mediator calls V4 RouByTypeWL() and receives Routers.* rows of type Wireless; it then calls V2 ComNetByCpuRate(‘P4’, ‘10’) and ComNetByCpuRate(‘P4’, ‘54’) and receives Computers.*, NetCards.* rows; the combined result has columns Computers.*, NetCards.*, Routers.*.]
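
The feasible query can be sketched as a mediator plan in which order matters: V4 is called first, and the rates it returns are used to bind the rate parameter of V2. The in-memory rows and function names are assumptions; the join conditions follow the column associations (NetCards.rate, Routers.rate) and (NetCards.standard, Routers.standard).

```python
# Hypothetical rows; column layouts follow the schemas of the running example.
COMPUTERS = [("A123", "P4", 512, 900), ("B123", "P4", 1024, 1100), ("C123", "M", 256, 700)]
NETCARDS = [("A123", "54Mbps", "802.11g", "USB"), ("B123", "11Mbps", "802.11b", "USB")]
ROUTERS = [
    ("54Mbps", "802.11g", 120, "Wireless"),
    ("11Mbps", "802.11b", 50, "Wireless"),
    ("100Mbps", "Ethernet", 40, "Wired"),
]


def rou_by_type_wl():
    """V4 RouByTypeWL(): Wireless Routers; needs no input."""
    return [r for r in ROUTERS if r[3] == "Wireless"]


def com_net_by_cpu_rate(cpu, rate):
    """V2 ComNetByCpuRate(cpu, rate): both parameters must be bound."""
    return [
        (c, n)
        for c in COMPUTERS
        for n in NETCARDS
        if c[0] == n[0] and c[1] == cpu and n[1] == rate
    ]


def p4_with_wireless_routers():
    """Plan: call V4 first, then V2 once per router rate, then match the standard."""
    result = []
    for router in rou_by_type_wl():
        for com, net in com_net_by_cpu_rate("P4", router[0]):
            if net[2] == router[1]:
                result.append(com + net + router)
    return result


for row in p4_with_wireless_routers():
    print(row)
```

Reversing the order does not work: V2 cannot be called before some rate value is known, and no view returns Computers without a cpu binding.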

Problem
1. Large number of sources
2. Large number of views
3. Mediator capabilities
Developer formulates an application query
–Is an application query feasible?
–If not, how do I know which ones are feasible?
Previous options:
–The developer had to browse the view definitions and somehow formulate a feasible query
–Or formulate queries until a feasible one is found (trial-and-error)
No system-provided guidance

The CLIDE Solution
A query formulation interface, which interactively guides the user toward feasible queries by employing a coloring scheme
[Diagram: the Developer builds the Application with CLIDE, on top of the Mediator’s Global Schema; the Mediator accesses Dell through views V1 and V2 and Cisco through views V3 and V4.]

QBE-Like Interfaces Microsoft SQL-Server

CLIDE Interface
Table, selection, projection and join actions
Color-based suggestions
Feasibility flag
[Screenshot callouts: Table Boxes, Table Alias, Selection Boxes, Projection Boxes, Feasibility Flag.]

CLIDE Interface: Snapshot 1
Yellow: required action
–All feasible queries require this action
White: optional action
–Feasible queries can be formulated w/ or w/o these actions

CLIDE Interface: Snapshot 2
Blue: required choice of action
–At least one feasible (next) query cannot be formulated unless this action is performed
[Diagram, steps A to C: the Mediator calls V1 ComByCpu(‘P4’), which returns Computers rows (cid, cpu, ram, price); the query result shows ram and price.]

CLIDE Interface: Snapshot 3
Join lines
–Only yellow and blue ones are displayed
–Must appear in the Column Associations

CLIDE Interface Snapshot 4

CLIDE Interface: Snapshot 5
*: any other constant
Red: prohibited action
–Does not appear in any feasible query
–Leads to a “Dead End” state

CLIDE Interface: Snapshot 6
[Diagram, steps A to F: the Mediator calls V4 RouByTypeWL(), which returns Routers.* rows of type Wireless, and V2 ComNetByCpuRate(‘P4’, rate), which returns Computers.*, NetCards.* rows; the query result shows ram, price, rate, interface and price.]

CLIDE Facts
Rapid Convergence
–At every step, yellow and blue actions lead to a feasible query in a minimum number of steps
Completeness of Suggestions
–Every feasible query can be formulated by performing yellow and blue actions at every step
Minimality of Suggestions
–At every step, only a minimal number of actions are suggested, i.e., the ones that are needed to preserve completeness

Interaction Graph
Nodes are queries
–One for each q ∈ CQ
Edges are actions
–Table, selection, projection and join actions
Green nodes are feasible queries
Infinitely big structure
–All CQ queries
–All possible combinations of actions formulating them
[Diagram: example edges include the Table Actions Com1, Net1 and Rou1, the Selection Action Com1.cpu=‘P4’, the Join Action Com1.cid=Net1.cid and the projections Com1.ram and Com1.price.]

Interaction Graph: Colorable Actions
Colorable actions A_C label the outgoing edges of the current node
[Diagram: from the Current Node, outgoing edges are labeled Net1, Rou1, Com2, Com1.cpu=*, Com1.price=*, Com1.ram=*, Com1.cid=*, Com1.cid and Com1.cpu.]

Interaction Graph: Colors
Yellow action α
–Every path from the current node n to a feasible node contains α
Blue action α
–At least one feasible query cannot be formulated unless this action is performed (minimality)
Red action α
–No path to a feasible node contains α
[Diagram: paths from the Current Node through actions such as Com1.cpu=*, Net1, Com1.cid=Net1.cid, Net1.rate=’54Mbps’, Rou1, Net1.rate=Rou1.rate, Com2, Com2.cid=Net1.cid and Com2.cpu=‘P4’.]

Color Determined By a Finite Set of Feasible Queries
Challenge: infinitely many feasible queries
Start by considering the closest feasible queries FQ_C
–FQ_C is sufficient to color the actions in A_C
Theorem: the set of closest feasible queries is finite
How far can the closest feasible queries FQ_C be?
–Radius based on the Maximally Contained Queries FQ_MC
[Diagram: the closest feasible queries FQ_C lie within a radius around the current node n; infinitely many feasible queries lie beyond.]

Color Algorithm: Maximally Contained Queries FQ_MC
Assuming fixed SELECT clause (projection list)
Covered extensively in the literature
–MiniCon, Bucket, InverseRules
FQ_MC is finite
Example queries:
Q1: Get all Computers
Q2: Get all Computers with a given cpu
Q3: Get all Computers with a given cpu & ram
Q4: Get all Computers with a given ram
[Diagram labels: Maximally Contained Query, Not Maximally Contained.]

Color Algorithm: Radius
Compute the maximally contained queries FQ_MC
The radius p_L is the length of the longest path to a node n’ such that q(n’) is in FQ_MC
All FQ_C queries are reachable via a path of length p ≤ p_L
[Diagram: around the current node n, the Closest Feasible Queries FQ_C and the Maximally Contained Queries FQ_MC lie within radius p_L.]

Color Algorithm: More on Closest Feasible Queries
Theorem: all queries in FQ_MC are in FQ_C
But not all queries in FQ_C are in FQ_MC
[Diagram: around node n, the Maximally Contained Feasible Queries FQ_MC are among the Closest Feasible Queries FQ_C; more feasible nodes lie further out.]

Color Algorithm: More on Closest Feasible Queries
Naïve approach
–Start from n and explore paths up to length p_L

Color Algorithm: Collapse Aliases
Collapse aliases to compute FQ_C \ FQ_MC
Check satisfiability

Color Algorithm
Coloring Non-Projection Actions
No interaction graph materialization
Use of containment mappings from the current query to the closest feasible ones
An action α is colored
–Yellow, if α is mapped into all queries in FQ_C
–Red, if α is not mapped into any query in FQ_C
–Blue, if α is mapped into at least one query q_F in FQ_C, no other action in A_P is mapped into q_F, and α is neither yellow nor red
Coloring Projection Actions
Never colored yellow
Can be colored blue only if
–the current query is feasible
–it is not colored red
Which ones are red?
–Bring all projection atoms from views such that feasibility is preserved
–If action α is not mapped into any query in FQ_C, then α is red
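
The three rules for non-projection actions can be sketched over a toy encoding in which every closest feasible query is represented as the set of candidate actions it uses. Set membership stands in for the containment-mapping test, which is a strong simplification of the approach on the slide; the encoding, the function name and the fall-back to white for the remaining actions are all assumptions.

```python
def color_actions(candidate_actions, closest_feasible):
    """Color each candidate action against the closest feasible queries.

    `closest_feasible` is a list of sets of actions; `a in q` approximates
    "action a is mapped into query q".
    """
    colors = {}
    for action in candidate_actions:
        hits = [q for q in closest_feasible if action in q]
        if closest_feasible and len(hits) == len(closest_feasible):
            colors[action] = "yellow"      # mapped into all closest feasible queries
        elif not hits:
            colors[action] = "red"         # mapped into none of them
        elif any(
            all(other == action or other not in q for other in candidate_actions)
            for q in hits
        ):
            colors[action] = "blue"        # some query is reached through this action only
        else:
            colors[action] = "white"       # assumption: everything else is optional
    return colors


# Two alternative closest feasible queries, each reached through a different table action.
print(color_actions(["Net1", "Rou1", "Com1.ram=*"], [{"Net1"}, {"Rou1"}]))
# A single closest feasible query that needs the cpu selection.
print(color_actions(["Com1.cpu=*", "Com1.ram=*"], [{"Com1.cpu=*"}]))
```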

CLIDE Implementation
[Diagram: Front-End: the Developer performs an Action and sees Colored Actions. Back-End components: MiniCon, Containment Test, Collapse Aliases, Color Actions; Other Back-End. Data exchanged: Current Query, Schemas, Parameterized Views, Column Associations, Maximally Contained Queries, Optimal Maximally Contained Queries, Closest Feasible Queries, Colored Actions.]
MiniCon
–Outputs redundant and non-minimal queries
–Affects CLIDE’s rapid convergence and minimality properties
Containment Test
–Well-known NP-complete problem
–Polynomial when the query is acyclic
Collapse Aliases / Color Actions
–Reuse containment mappings created by MiniCon

CLIDE Performance
Chains of Stars
[Figure: schema, views and queries over relations A, B1 … Bi and C1 … Ci, arranged as chains of stars.]
Queries: A-span = 7, B-span = 4, Selections = 4, 6, 8, 10