From Enterprise Information Integration to Community-Based Mediation
A presentation by Yannis Papakonstantinou on joint works with Alin Deutsch, Yannis Katsis, Michalis Petropoulos
CSE Department
Data Integration Requirements & Desiderata (high level)
Provide application with integrated database
–single point of (query/update) access to the data
Provide distribution and heterogeneity transparency
–heterogeneous formats, heterogeneous interfaces, different rates of change (static versus dynamic), autonomous sources
Decouple application logic from integration
Easily add/change sources
Customize the delivery of content
Most-Generic Integration System Architecture
[Figure: several Client Applications access several Information Sources through a layer of Integration Software.]
SIGMOD Community’s Architecture for Unified Access to Data & Services
[Figure: each Information + Service Source is exposed by a Wrapper as a Local Common Model (XML) View + Services; a Mediator with Cache & Replication provides the Integrated (XML) Global View / Ontology + Services to the (Web) Client Applications.]
Approaches towards View-Based Data Integration
Integration Specification Method: Global As View (GAV), Local As View (LAV), GLAV = GAV + LAV
Info Model & Query Language: Relational (SQL), XML (XQuery), Object-Oriented
Storage Method: Warehousing (materialized views), On-Demand (virtual views)
Enterprise Information Integration Reaches Maturity
Materialized View (Warehousing) approach well-adopted since mid/late 90s
–GAV function role played by Extract-Transform-Load tools
–Human intervention occasionally needed in cleaning up: concordance tables for object identification
Virtual View (Mediation) approach at early adoption
–many years of research: distributed db’s, federated db’s, mediators
–moving well into mainstream: BEA AquaLogic (XML, Virtual, GAV view), IBM DB2 Enterprise
Current Enterprise Information Integration Deployments
[Figure: within the Enterprise, an Integration Admin uses a View Builder (design time) to define a GAV View V over the Marketing Local View M, the Sales Local View S, and the Service Local View E; the Mediator Query Processor (run time) serves the Integrated Global View V(M, S, E).]
Schemas & Data: Small Domain; Mostly Vertical Partition of Sources
View: Primarily Application-Driven View or Identity View
Integration Administrator/Developer in charge
Opportunities and Needs Presented by “Motivated” Communities
Emerging myriads of Internet communities of
–Myriads of sources and clients
–Source owners motivated to participate
EII does not address their needs
–Expensive
–Bottleneck of a single Integration Admin
Make building the corresponding portals similar to starting and participating in newsgroups
Appropriate tools needed to enable source owner and client participation
A Community-Based Information Modeling Architecture
[Figure: in each Source Owner’s Domain, Information Source i (i = 1 ... n) exposes a Local XML View S_i and Data Services i, registered through a GLAV mapping V_i. In the Integrated View Owner’s Domain, the Integrated XML View G and its Data Services feed the Application Views V_1^a(G) ... V_m^a(G), defined in GAV style, for Client Applications 1 ... m.]
Visual Tools Matter! (example from the Enosys Query Builder)
[Screenshots: C:\Enosys\projects\allPONS.qpr* - Enosys Query Builder]
1. Open & view source schemas in XML
2. Drag & drop to create target XML view: target schema (XML view), automatically generated maps
3. Run & test XQuery: XQuery based on design specs, XML result
Architecture for Large-Scale Data Integration System and Design Tools
How do I export my data? RIDE
How do I export my database services functionality? RIDE-Services
What queries can my app issue? What integrated view services can I build? CLIDE
How can the user query and browse the integrated data? QURSED
[Figure: four domains (Source Domain, Integration Domain, Application Domain, Web Domain) containing the Data Sources with their Source Schemas and Web Services, the Mediator with the Global View Schema and a Cache (Metadata), the Application, and the Web Forms & Reports. Roles shown: Source Owner, Integration Engineer, Developer.]
Dual Interactive Registration Problems
Register Source Given Global Schema, Constraints & Queries
–Guide the source owner in registering a new source and services
Register Client Given Sources
–Guide the client in query/form writing
[Figure: a New Source and Services on one side, and a New App / New Query on the other, each registered against the Global View, the existing Source Services, and the existing Apps and Queries.]
Source Data Registration
How do my source attributes map to global attributes
–mappers & automatic matchers
How do my data relate to queries & other sources
–Inconsistencies?
–What does it take to contribute to queries?
–How much should I clean up?
Multiple ways of dealing with redundancy
How to achieve this Goal
Before: look at all sources & queries, then decide how to register your source
Now: follow the suggestions of the interface (Source Registration Tool)
Our Goal in Source Registration Guide the source owner visually through the registration of the source so as to avoid/warn about (potential) inconsistencies and contribute information to the answer of the queries while exposing the minimum information possible and/or minimizing effort
The Problem
[Figure: Client Queries are posed to the Mediator (Global DB), which draws on the Sources (Actual Local DBs).]
The Contribution Problem
What is the contribution of source S to the result of the query Q?
Example (sources: cars, reviews)
–Q: cars. S is Self Sufficient w.r.t. Q
–Q: cars JOIN reviews. S is Now Complementary w.r.t. Q
Relational Schemas: Local and Global
Source 1 (Business Magazine), schema S1
–make(Carmake, Origin, Sales)
Source 2 (Car Magazine), schema S2
–auto(Id, Model, Carmake)
–detail(Id, Engine, Baseprice)
Global (Car Portal), schema G
–car(Model, Carmake, Doors, Baseprice)
–brand(Carmake, Origin)
[Visual representation: relations drawn as boxes listing their attributes.]
Source Registration using GLAV Mappings
Source Registration: Correspondence between a source schema and the global schema
= Set of Mapping Constraints of the form (U ⊆ V)
–U = CQ over source schema
–V = CQ over global schema
Open World
Global and Local As View (GLAV)
Target Constraints
Constraints on the global schema
= Set of Constraints of the form (U ⊆ V)
–U, V = CQ over global schema
Also Expresses Dependencies (PKs, Ref Integrity, …)
Visual Representation of Mappings (1)
Business Magazine: Provides Carmake and Origin
U1(C, O) :- make(C, O, S)
V1(C, O) :- brand(C, O)
(U1 ⊆ V1)
[Visual Representation (IBM Clio): lines connect make.Carmake and make.Origin in S1 to brand.Carmake and brand.Origin in G.]
Visual Representation of Mappings (2)
Car Magazine: Provides Model, Carmake and Baseprice
Source atoms: auto(I, M, C), detail(I, E, B)
Target atom: car(M, C, ?, B)
[Visual Representation (IBM Clio): lines connect auto.Model, auto.Carmake and detail.Baseprice in S2 to car.Model, car.Carmake and car.Baseprice in G; car.Doors is left unmapped.]
Example of Target Constraint
(Model, Carmake) is a PK of car
U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)
V1(M, C, D, B, D, B) :- car(M, C, D, B)
(U1 ⊆ V1)
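The primary-key constraint above can be checked on a concrete instance by evaluating both conjunctive queries and testing set containment. The following is a minimal sketch, not the authors' implementation; the instance data and the function names `u1`, `v1`, and `satisfies_pk` are made up for illustration.

```python
from itertools import product

# Hypothetical instance of the global relation car(Model, Carmake, Doors, Baseprice).
car = {
    ("M3", "BMW", 2, "45K"),
    ("A4", "Audi", 4, "40K"),
}

def u1(car_rel):
    """U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)."""
    return {
        (m1, c1, d1, b1, d2, b2)
        for (m1, c1, d1, b1), (m2, c2, d2, b2) in product(car_rel, repeat=2)
        if (m1, c1) == (m2, c2)
    }

def v1(car_rel):
    """V1(M, C, D, B, D, B) :- car(M, C, D, B)."""
    return {(m, c, d, b, d, b) for (m, c, d, b) in car_rel}

def satisfies_pk(car_rel):
    """U1 is contained in V1 exactly when (Model, Carmake) is a key of car_rel."""
    return u1(car_rel) <= v1(car_rel)

# Adding a second M3/BMW tuple with different Doors and Baseprice breaks the key.
violating = car | {("M3", "BMW", 4, "50K")}
```

Here `satisfies_pk(car)` holds, while `satisfies_pk(violating)` does not, because the pair of M3 tuples produces a U1 tuple with D1 ≠ D2 that V1 cannot produce.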
Query Semantics
Queries in UCQ
Set of Possible Global Instances = Set of global instances that satisfy all constraints
Query Answers = Set of Certain Answers = The tuples appearing in the answer to Q for every possible global instance
[Figure: Q is evaluated over each of the possible global instances; the certain answers to Q are the tuples common to all of these answers.]
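As an illustration of the certain-answer semantics, the sketch below intersects the answers to a query over an explicitly listed set of possible global instances. This is a simplification under a loud assumption: the real set of possible instances is in general infinite, so only two hypothetical instances are enumerated here, and the names `certain_answers`, `baseprices`, and `doors` are invented for the example.

```python
def certain_answers(query, possible_instances):
    """Tuples returned by `query` on every listed instance (set intersection)."""
    answers = [query(instance) for instance in possible_instances]
    if not answers:
        return set()
    return set.intersection(*answers)

# Two hypothetical possible instances of car(Model, Carmake, Doors, Baseprice).
# They agree on everything the source provided and differ on the unknown Doors value.
POSSIBLE = [
    {("M3", "BMW", 2, "45K")},
    {("M3", "BMW", 4, "45K")},
]

def baseprices(car_rel):
    """Q(Model, Baseprice) :- car(Model, Carmake, Doors, Baseprice)."""
    return {(m, b) for (m, _c, _d, b) in car_rel}

def doors(car_rel):
    """Q(Model, Doors) :- car(Model, Carmake, Doors, Baseprice)."""
    return {(m, d) for (m, _c, d, _b) in car_rel}
```

On these two instances the base price of the M3 is a certain answer, while no Doors value is, since the instances disagree on it.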
Source Instance’s Contribution
For given instances of the sources:
Contribution to Q of Source Instance = The tuples in the answer of Q not provided by the other sources
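One way to read this definition is as a set difference: the answer computed with all sources, minus the answer computed without the source in question. The sketch below follows that reading under a strong simplifying assumption, namely that every source maps its tuples directly into a single global relation (an identity mapping), so the answer is just the query over the union of the sources. The source names and data are hypothetical.

```python
def answer(query, sources):
    """Evaluate `query` over the union of all source instances (identity mappings assumed)."""
    global_relation = set().union(*sources.values())
    return query(global_relation)

def contribution(query, sources, name):
    """Tuples in the answer that disappear when source `name` is left out."""
    others = {n: inst for n, inst in sources.items() if n != name}
    return answer(query, sources) - answer(query, others)

# Hypothetical source instances, each a set of (Model, Carmake) tuples.
SOURCES = {
    "s1": {("M3", "BMW"), ("A4", "Audi")},
    "s2": {("A4", "Audi"), ("Civic", "Honda")},
}

def all_cars(rel):
    """Q: cars (return every tuple)."""
    return set(rel)
```

With this data, the contribution of `s1` is only the M3 tuple, because the A4 tuple is also provided by `s2`.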
Source Registration’s Contribution
Source Registration: Source Mappings
Degrees of Source Registration’s Contribution, from more contribution to less contribution:
–Self Sufficient
–Now Complementary
–Later Complementary
–Unusable
Self Sufficient Registration: Example
Query: Baseprices of Models
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Tuple: car(M3, BMW, ?, 45K)
Green Registration is Self Sufficient
Self Sufficient Registration: Definition
There is a source instance s.t. the source has a non-empty contribution to the answer to Q in the absence of the other sources
Now Complementary Registration: Example
Query: Baseprices of Models by German manufacturers (brand.Origin = ‘Germany’)
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Tuples: car(M3, BMW, ?, 45K), brand(BMW, Germany)
Green Registration is Now Complementary
Now Complementary Registration: Definition
Not Self Sufficient, and there are source instances s.t. the source has a non-empty contribution to the answer to Q in combination with the other existing sources
Later Complementary Registration: Example
Query: Baseprices of Models by German manufacturers (brand.Origin = ‘Germany’)
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Tuples: car(M3, BMW, ?, 45K), brand(BMW, Germany)
Green Registration is Later Complementary
Later Complementary Registration: Definition
Not Self Sufficient, Not Now Complementary, and there are potential future sources and source instances s.t. the source has a non-empty contribution to the answer to Q in combination with the future sources
Unusable Registration: Example
Query: Origin of Carmakes, over brand(Carmake, Origin)
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Green Registration is Unusable
Unusable Registration: Definition
Not Self Sufficient, Not Now Complementary, and Not Later Complementary
The source has an empty contribution to the answer to Q regardless of what sources enter the system
Subtleties for Unusable Registrations
Query: Baseprices and Doors of Models
Global schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
Tuples: car(M3, BMW, ?, 45K), car(M3, BMW, 2, ?)
Green Registration is Unusable
In presence of PK, the Unusable Example becomes Later Complementary
Query: Baseprices and Doors of Models
Tuples car(M3, BMW, ?, 45K) and car(M3, BMW, 2, ?) agree on (M3, BMW) and merge into car(M3, BMW, 2, 45K)
Green Registration is Later Complementary
Decidability Results
Overview: what is decidable, per degree of contribution and per kind of target constraints
Target constraints considered: None; Primary keys; Primary keys + Referential Integrity Constraints
Self Sufficient: Yes, No
Now Complementary: Yes, No
Later Complementary: Yes, ?
Unusable: Yes, ?
Issues
Unique client query vs. multiple client queries
Contribute to:
- all queries?
- one query?
- specific queries?
- some queries based on some ranking?
Data independence vs. data dependence
e.g. M_1: cars, refPrices; M_2: reviews; Q: cars JOIN reviews JOIN refPrices
DB_1: cars, refPrices (Audis); DB_2: reviews (Hondas)
(M_2, Q) is now-complementary, but the Certain Answers for instances DB_1, DB_2 = ∅
Putting it all together
Architecture: Query Q over the Global Schema; registered sources S_1 ... S_n with mappings M_1 ... M_n over the Local Schemas; new source S_n+1 with mappings M_n+1
Query Answering / Mappings / Schemas
Contribution: 4 categories: Self Sufficient / Now Complementary / Later Complementary / Unusable
Goal: Guide the source owner visually through the registration of the source so as to raise contribution to the answer of the queries while exposing the minimum info possible and/or minimizing effort
Example 1 (without primary keys in the target)
[Figure: Local Schema AutoTrader (car, vin, cmodel, price, ad*, carId, id*); Global Schema Community and Query AppQuery, each drawn as the tree car, price, model, drive, usedAd*, model, review*, vin*, price, refPrice*, model, condition, quality.]
Step 1: no mappings drawn. Registration is Unusable. BLUE: Map at least one of the groups
Step 2: mapping model = cmo drawn. Registration is still Unusable
Step 3: mappings model = cmo and = price drawn. Registration is Later Complementary
Example 2 (with primary keys in the target)
[Figure: Local Schema AutoTrader (car, vin, cmodel, price, ad*, carId, id*); Global Schema Community and Query AppQuery, each drawn as the tree car, price, model, drive, usedAd*, model, review*, vin*, price, refPrice*, model, condition, quality.]
Step 1: no mappings drawn. Registration is Unusable
Step 2: mappings = price and = vin drawn. Registration is Later Complementary
Lessons learned
To merge data with that of other sources (become complementary): pick a relation and provide…
–In absence of primary keys: …all its attributes asked by the query
–In presence of primary keys: …its primary key and one of its attributes asked by the query
Target constraints make a difference
–The number of choices increases in presence of primary keys
–Foreign keys on the target affect the suggestions
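The rule of thumb on this slide can be written down as a small check. The sketch below is one reading of the slide, not the registration tool's actual test: the function name `can_become_complementary` and its parameters are invented, foreign keys are ignored, and the primary-key case is assumed to keep the key-free option available, since the slide says the number of choices increases.

```python
def can_become_complementary(provided, asked, primary_key=None):
    """Apply the slide's rule of thumb to one target relation.

    provided:    attributes of the relation that the source maps into
    asked:       attributes of the relation that the query asks for
    primary_key: key attributes of the relation, or None if no key is declared
    """
    provided, asked = set(provided), set(asked)
    provides_all_asked = asked <= provided
    if primary_key is None:
        return provides_all_asked
    provides_key = set(primary_key) <= provided
    return provides_all_asked or (provides_key and bool(provided & asked))

# The car example: the query asks for Model, Doors and Baseprice,
# and the source provides Model, Carmake and Baseprice.
ASKED = {"Model", "Doors", "Baseprice"}
PROVIDED = {"Model", "Carmake", "Baseprice"}
```

For the car example this returns False without a key and True once (Model, Carmake) is declared as the primary key, matching the change from Unusable to Later Complementary shown earlier in the deck.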
Large-Scale Data Integration Systems
How do I export my data? RIDE
How do I export my database services functionality? RIDE-Services
What queries can the mediator answer for me? CLIDE
How can the user query and browse the integrated data? QURSED
[Figure: four domains (Source Domain, Integration Domain, Application Domain, Web Domain) containing the Data Sources with their Source Schemas and Web Services, the Mediator with the Global View Schema, the Application, and the Web Forms & Reports. Roles shown: Source Owner, Integration Engineer, Developer.]
Running Example: Parameterized Views
Dell
  Schema
    Computers(cid, cpu, ram, price)
    NetCards(cid, rate, standard, interface)
  Views
    V1 ComByCpu(cpu) → (Computer)*   (Computers for a given cpu)
      SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu
    V2 ComNetByCpuRate(cpu, rate) → (Computer, NetCard)*   (Computers & NetCards for a given cpu & rate)
      SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, NetCards Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate
Cisco
  Schema
    Routers(rate, standard, price, type)
  Views
    V3 RouByTypeW() → (Router)*   (Wired Routers)
      SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired'
    V4 RouByTypeWL() → (Router)*   (Wireless Routers)
      SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless'
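To make the notion of a parameterized view concrete, the sketch below wraps the two Dell views as Python functions over an in-memory SQLite database. The table contents are made up for illustration, the function names `com_by_cpu` and `com_net_by_cpu_rate` are hypothetical, and the second view is assumed to join Computers with the NetCards table of the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Computers(cid TEXT, cpu TEXT, ram INTEGER, price INTEGER);
CREATE TABLE NetCards(cid TEXT, rate TEXT, standard TEXT, interface TEXT);
-- Hypothetical sample rows.
INSERT INTO Computers VALUES ('A123', 'P4', 512, 900);
INSERT INTO Computers VALUES ('B123', 'P4', 1024, 1100);
INSERT INTO Computers VALUES ('C456', 'M', 256, 700);
INSERT INTO NetCards VALUES ('A123', '11Mbps', '802.11b', 'USB');
INSERT INTO NetCards VALUES ('B123', '54Mbps', '802.11g', 'USB');
""")

def com_by_cpu(cpu):
    """V1: Computers for a given cpu. The caller must supply the cpu value."""
    sql = "SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu = ?"
    return conn.execute(sql, (cpu,)).fetchall()

def com_net_by_cpu_rate(cpu, rate):
    """V2: Computers and their NetCards for a given cpu and rate."""
    sql = (
        "SELECT DISTINCT Com1.*, Net1.* "
        "FROM Computers Com1, NetCards Net1 "
        "WHERE Com1.cid = Net1.cid AND Com1.cpu = ? AND Net1.rate = ?"
    )
    return conn.execute(sql, (cpu, rate)).fetchall()
```

Because each function requires its parameters, there is no way to ask these views for all Computers without naming a cpu, which is the kind of access restriction the rest of the example relies on.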
Running Example
Global schema puts together the Dell and Cisco schemas
–Resembles the schema of the CNET.com portal
Column Associations
–(Computers.cid, NetCards.cid)
–(NetCards.rate, Routers.rate)
–(NetCards.standard, Routers.standard)
[Figure: the Developer’s Application queries the Global Schema at the Mediator, which accesses Dell through views V1, V2 and Cisco through views V3, V4.]
Sophisticated Mediators Make Feasibility Hard to Predict
Feasible Queries FQ
–Equivalent CQ query rewritings using the views
–Might involve more than one view
–Order might matter
Feasible query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers
Infeasible query: Get all Computers
[Figure: mediator plan for the feasible query, steps A to E. The Mediator calls V4 RouByTypeWL() to obtain Routers.* for the wireless routers, calls V2 ComNetByCpuRate(‘P4’, ‘10’) and ComNetByCpuRate(‘P4’, ‘54’) to obtain Computers.* and NetCards.*, and returns the combined Computers.*, NetCards.*, Routers.* rows.]
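The feasible query on this slide can be answered only by combining two views in a particular order. The sketch below imitates such a plan in plain Python: it first calls the wireless-router view, then feeds each router's rate into the Computers and NetCards view, and joins on the rate and standard column associations. All data values, and the names `v4_rou_by_type_wl`, `v2_com_net_by_cpu_rate`, and `p4_with_wireless_routers`, are hypothetical; this is an illustration of the idea, not the mediator's actual plan.

```python
# Hypothetical source data. Tuple layouts follow the schemas of the running example.
COMPUTERS = [("A123", "P4", 512, 900), ("B123", "P4", 1024, 1100)]
NETCARDS = [("A123", "11Mbps", "802.11b", "USB"), ("B123", "54Mbps", "802.11g", "USB")]
ROUTERS = [
    ("11Mbps", "802.11b", 50, "Wireless"),
    ("54Mbps", "802.11g", 120, "Wireless"),
    ("100Mbps", "Ethernet", 80, "Wired"),
]

def v4_rou_by_type_wl():
    """V4: Wireless Routers (no parameters)."""
    return [r for r in ROUTERS if r[3] == "Wireless"]

def v2_com_net_by_cpu_rate(cpu, rate):
    """V2: Computers and NetCards for a given cpu and rate (both required)."""
    return [
        c + n
        for c in COMPUTERS
        for n in NETCARDS
        if c[0] == n[0] and c[1] == cpu and n[1] == rate
    ]

def p4_with_wireless_routers():
    """Call V4 first, then call V2 once per router rate, and join on standard."""
    result = []
    for router in v4_rou_by_type_wl():
        rate, standard = router[0], router[1]
        for row in v2_com_net_by_cpu_rate("P4", rate):
            netcard_standard = row[6]  # 4 Computers columns, then cid, rate, standard
            if netcard_standard == standard:
                result.append(row + router)
    return result
```

The order matters here because V2 needs a rate value that only becomes known after V4 has been called, whereas "Get all Computers" has no such source of parameter values.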
Problem
1. Large number of sources
2. Large number of views
3. Mediator capabilities
Developer formulates an application query
–Is an application query feasible?
–If not, how do I know which ones are feasible?
Previous options:
–The developer had to browse the view definitions and somehow formulate a feasible query
–Or formulate queries until a feasible one is found (trial-and-error)
No system-provided guidance
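The CLIDE Solution
A query formulation interface, which interactively guides the user toward feasible queries by employing a coloring scheme
[Figure: the Developer formulates the Application’s queries through CLIDE against the Global Schema at the Mediator, which accesses Dell through views V1, V2 and Cisco through views V3, V4.]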
QBE-Like Interfaces Microsoft SQL-Server
CLIDE Interface
Table, selection, projection and join actions
Color-based suggestions
Feasibility Flag
[Screenshot labels: Table Boxes, Table Alias, Selection Boxes, Projection Boxes, Feasibility Flag]
CLIDE Interface Snapshot 1
Yellow: required action
–All feasible queries require this action
White: optional action
–Feasible queries can be formulated w/ or w/o these actions
CLIDE Interface Snapshot 2
Blue: required choice of action
–At least one feasible (next) query cannot be formulated unless this action is performed
[Figure: steps A to C. The Mediator calls V1 ComByCpu(‘P4’), which returns cid, cpu, ram, price rows, and outputs their ram and price.]
CLIDE Interface Snapshot 3
Join Lines:
–Only yellow and blue are displayed
–Must appear in Column Associations
CLIDE Interface Snapshot 4
CLIDE Interface Snapshot 5
Red: prohibited action
–Does not appear in any feasible query
–Leads to a “Dead End” state
* = any other constant
CLIDE Interface Snapshot 6
[Figure: steps A to F. The Mediator calls V4 RouByTypeWL() to obtain Routers.*, calls V2 ComNetByCpuRate(‘P4’, rate) to obtain Computers.* and NetCards.*, and outputs ram, price, rate, interface, and the router price.]
CLIDE Facts
Rapid Convergence
–At every step, yellow and blue actions lead to a feasible query in a minimum number of steps
Completeness of Suggestions
–Every feasible query can be formulated by performing yellow and blue actions at every step
Minimality of Suggestions
–At every step, only a minimal number of actions are suggested, i.e., the ones that are needed to preserve completeness
Interaction Graph
Nodes are queries
–One for each q ∈ CQ
Edges are actions
–Table, selection, projection and join actions
Green nodes are feasible queries
Infinitely big structure
–All CQ queries
–All possible combinations of actions formulating them
[Figure: graph fragment whose edges are labeled with table actions (Com1, Net1, Rou1), selection actions (Com1.cpu=‘P4’), projection actions (Com1.ram, Com1.price), and join actions (Com1.cid=Net1.cid).]
Interaction Graph: Colorable Actions
Colorable actions A_C label the outgoing edges of the current node
[Figure: from the Current Node, outgoing edges labeled Net1, Rou1, Com2, Com1.cid, Com1.cpu, Com1.cid=*, Com1.cpu=*, Com1.ram=*, Com1.price=*.]
Interaction Graph: Colors
Yellow action
–Every path from the current node n to a feasible node contains the action
Blue action
–At least one feasible query cannot be formulated unless this action is performed (minimality)
Red action
–No path to a feasible node contains the action
[Figure: paths from the Current Node through actions such as Com1.cpu=*, Net1, Com1.cid=Net1.cid, Net1.rate=’54Mbps’, Com2, Com2.cid=Net1.cid, Com2.cpu=‘P4’, Rou1, Net1.rate=Rou1.rate toward feasible nodes.]
Color Determined By a Finite Set of Feasible Queries
Challenge: Infinitely Many Feasible Queries
Start by considering the closest feasible queries FQ_C
–FQ_C is sufficient to color the actions in A_C
Theorem: The Set of Closest Feasible Queries is Finite
How far can the closest feasible queries FQ_C be?
–Based on Maximally Contained Queries FQ_MC
[Figure: the closest feasible queries FQ_C lie within a finite radius of the current node n; infinitely many feasible queries lie beyond it.]
Color Algorithm: Maximally Contained Queries FQ_MC
Assuming fixed SELECT clause (projection list)
Covered extensively in literature
–MiniCon, Bucket, InverseRules
FQ_MC is finite
Example queries
–Q1: Get all Computers
–Q2: Get all Computers with a given cpu
–Q3: Get all Computers with a given cpu & ram
–Q4: Get all Computers with a given ram
[Figure labels attached to the example queries: Maximally Contained Query, Not Maximally Contained.]
Color Algorithm
Compute maximally contained queries FQ_MC
The radius p_L is the longest path to a node n’ such that q(n’) is in FQ_MC
All FQ_C queries are reachable via a path of length p ≤ p_L
[Figure: the current node n, the Closest Feasible Queries FQ_C, and the Maximally Contained Queries FQ_MC at radius p_L.]
Color Algorithm: More on Closest Feasible Queries
Theorem: All queries in FQ_MC are in FQ_C
–But not all queries in FQ_C are in FQ_MC
Naïve Approach
–Start from n and explore paths up to length p_L
Color Algorithm: Collapse Aliases
Collapse Aliases to compute FQ_C \ FQ_MC
Check satisfiability
[Figure: the current node n, the Maximally Contained Feasible Queries FQ_MC, and the additional feasible nodes that complete the Closest Feasible Queries FQ_C.]
Color Algorithm
Coloring Non-Projection Actions
–No interaction graph materialization
–Use of containment mapping from current query to the closest feasible ones
–An action is colored
  Yellow, if it is mapped into all queries in FQ_C
  Red, if it is not mapped into any query in FQ_C
  Blue, if it is mapped into at least one query q_F in FQ_C, no other action in A_P is mapped into q_F, and it is neither yellow nor red
Coloring Projection Actions
–Never colored yellow
–Can be colored blue only if
  the current query is feasible
  it is not colored red
–Which ones are red?
  Bring all projection atoms from views such that feasibility is preserved
  If the action is not mapped into any query in FQ_C, then it is red
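The yellow and red rules for non-projection actions can be sketched with plain set membership. This is a deliberate simplification and not CLIDE's algorithm: each closest feasible query is represented as a set of action labels, "mapped into" is approximated by membership instead of a containment mapping, and the additional minimality condition on blue actions is omitted, so every action that is neither yellow nor red is reported as blue. The queries listed in `FQ_C` are hypothetical and use only the Dell views.

```python
def color_actions(candidate_actions, closest_feasible):
    """Color each candidate action against a list of closest feasible queries.

    closest_feasible: list of sets of action labels (simplified stand-in for FQ_C)
    """
    colors = {}
    for action in candidate_actions:
        hits = sum(1 for query in closest_feasible if action in query)
        if hits == 0:
            colors[action] = "red"       # appears in no closest feasible query
        elif hits == len(closest_feasible):
            colors[action] = "yellow"    # appears in every closest feasible query
        else:
            colors[action] = "blue"      # needed by some, but not all, of them
    return colors

# Hypothetical closest feasible queries when the current query contains only Com1.
FQ_C = [
    frozenset({"Com1", "Com1.cpu=*"}),
    frozenset({"Com1", "Com1.cpu=*", "Net1", "Com1.cid=Net1.cid", "Net1.rate=*"}),
]
ACTIONS = ["Com1.cpu=*", "Net1", "Rou1"]
COLORS = color_actions(ACTIONS, FQ_C)
```

With these inputs the cpu selection is yellow, adding Net1 is blue, and adding Rou1 is red, because no query in the hypothetical `FQ_C` mentions routers.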
CLIDE Implementation
[Figure: the Front-End (Developer) sends the Action and Current Query to the Back-End and receives Colored Actions. Back-End inputs: Schemas, Parameterized Views, Column Associations. Back-End components: MiniCon, Containment Test, Collapse Aliases, Color Actions. Intermediate results: Maximally Contained Queries, Optimal Maximally Contained Queries, Closest Feasible Queries.]
MiniCon
–Outputs redundant and non-minimal queries
–Affects CLIDE’s rapid convergence and minimality properties
Containment Test
–Well-known NP-complete problem
–Polynomial when query is acyclic
Collapse Aliases / Color Actions
–Reuse containment mappings created by MiniCon
CLIDE Performance
Setup: Chains of Stars
A-span = 7, B-span = 4, Selections = 4, 6, 8, 10
[Figure: the Chains of Stars Schema, Views, and Queries used in the experiments, built from relations A, B_1 ... B_K, C_1 ... C_L.]