Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006.

Slides:



Advertisements
Similar presentations
Interaction Design: Visio
Advertisements

Heuristic Search techniques
COMP 482: Design and Analysis of Algorithms
Scalable and Dynamic Quorum Systems Moni Naor & Udi Wieder The Weizmann Institute of Science.
NP-Hard Nattee Niparnan.
13-Optimization Assoc.Prof.Dr. Ahmet Zafer Şenalp Mechanical Engineering Department Gebze Technical.
ICDT'2001, London, UK1 Minimizing View Sets without Losing Query-Answering Power Chen Li Stanford University joint work with Mayank Bawa and Jeff Ullman.
Fundamental tools: clustering
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
Computer science is a field of study that deals with solving a variety of problems by using computers. To solve a given problem by using computers, you.
The Out of Kilter Algorithm in Introduction The out of kilter algorithm is an example of a primal-dual algorithm. It works on both the primal.
SASH Spatial Approximation Sample Hierarchy
June 3, 2015Windows Scheduling Problems for Broadcast System 1 Amotz Bar-Noy, and Richard E. Ladner Presented by Qiaosheng Shi.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
SECTIONS 21.4 – 21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
CPSC 322, Lecture 12Slide 1 CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12 (Textbook Chpt ) January, 29, 2010.
NuCAD ACG - Adjacent Constraint Graph for General Floorplans Hai Zhou and Jia Wang ICCD 2004, San Jose October 11-13, 2004.
Query Incentive Networks Jon Kleinberg and Prabhakar Raghavan - Presented by: Nishith Pathak.
Tracking Moving Objects in Anonymized Trajectories Nikolay Vyahhi 1, Spiridon Bakiras 2, Panos Kalnis 3, and Gabriel Ghinita 3 1 St. Petersburg State University.
Data Flow Analysis Compiler Design Nov. 8, 2005.
Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou CSE 636 Data Integration, March.
Abstract Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts Centralized Systems Run on a single computer system and do not interact with other computer.
Sequential Redundancy Removal w/o State Space Exploration A. Mehrotra, S. Qadeer, V. Singhal, R. Brayton, A. Aziz, A. Sangiovanni-Vincentelli, “Sequential.
Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006 SIGMOD.
Introduction to Bioinformatics Algorithms Clustering and Microarray Analysis.
CS 61B Data Structures and Programming Methodology Aug 11, 2008 David Sun.
Data Structures Week 9 Towards Weighted BFS So, far we have measured d s (v) in terms of number of edges in the path from s to v. Equivalent to assuming.
Mobility Limited Flip-Based Sensor Networks Deployment Reporter: Po-Chung Shih Computer Science and Information Engineering Department Fu-Jen Catholic.
CAFE router: A Fast Connectivity Aware Multiple Nets Routing Algorithm for Routing Grid with Obstacles Y. Kohira and A. Takahashi School of Computer Science.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
Nattee Niparnan. Easy & Hard Problem What is “difficulty” of problem? Difficult for computer scientist to derive algorithm for the problem? Difficult.
CS774. Markov Random Field : Theory and Application Lecture 13 Kyomin Jung KAIST Oct
Representing and Using Graphs
Network Optimization Problems
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
CP Summer School Modelling for Constraint Programming Barbara Smith 2. Implied Constraints, Optimization, Dominance Rules.
INFORMATION INTEGRATION Shengyu Li CS-257 ID-211.
Efficient Fault-Tolerant Certificate Revocation Rebecca Wright Patrick Lincoln Jonathan Millen AT&T Labs SRI International.
1 11 Channel Assignment for Maximum Throughput in Multi-Channel Access Point Networks Xiang Luo, Raj Iyengar and Koushik Kar Rensselaer Polytechnic Institute.
Marina Drosou, Evaggelia Pitoura Computer Science Department
From Enterprise Information Integration to Community-Based Mediation Alin Deutsch, Yannis Katsis, Michalis Petropoulos Yannis Papakonstantinou A presentation.
Search CPSC 386 Artificial Intelligence Ellen Walker Hiram College.
Search Worms, ACM Workshop on Recurring Malcode (WORM) 2006 N Provos, J McClain, K Wang Dhruv Sharma
1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet.
CPSC 322, Lecture 5Slide 1 Uninformed Search Computer Science cpsc322, Lecture 5 (Textbook Chpt 3.5) Sept, 13, 2013.
The NP class. NP-completeness Lecture2. The NP-class The NP class is a class that contains all the problems that can be decided by a Non-Deterministic.
Answering Queries Using Views Presented by: Mahmoud ELIAS.
Beard & McLain, “Small Unmanned Aircraft,” Princeton University Press, 2012, Chapter 12: Slide 1 Chapter 12 Path Planning.
The NP class. NP-completeness
Capability-Sensitive Query Processing on Internet Sources
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
Parallel Tasks Decomposition
Solver & Optimization Problems
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
The minimum cost flow problem
Haim Kaplan and Uri Zwick
ME 521 Computer Aided Design 15-Optimization
Chapter 5. Optimal Matchings
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
Searching for Solutions
1.206J/16.77J/ESD.215J Airline Schedule Planning
Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs
Troposphere (sic) Fliers
Computational Advertising and
Clustering.
Unit III – Chapter 3 Path Testing.
Presentation transcript:

Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006

2 Large-Scale Data Integration Systems Source Domain Web Domain End User  Application Domain Integration Domain  Application Data Source Data Source Mediator Integrated Schema Developer  Integration Engineer  Source Owner  Application Web Forms & Reports Source Schema …  Web Service Web Service Web Service Source Schema … Dell Computers Cisco Routers HP Printers Dell Computers by CPU Cisco Routers by Rate HP Printers by Speed CNET Computer PCWorld Portals Compatible Combinations of Computers, Routers and Printers CNET’s Top Combinations CNET’s Search Desktops PCWorld’s Product Finder

3 Large-Scale Data Integration Systems What queries can the mediator answer for me? CLIDE Source Domain Web Domain End User  Application Domain Integration Domain  Application Data Source Data Source Mediator Integrated Schema Developer  Integration Engineer  Source Owner  Application Web Forms & Reports Source Schema …  Web Service Web Service Web Service Source Schema …

4 Running Example Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Views V1ComByCpu(cpu)  (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu V2ComNetByCpuRate(cpu, rate)  (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate Parameterized Views DellCisco Schema Routers(rate, standard, price, type) Views V3RouWired()  (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' V4RouWireless()  (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Conjunctive Queries CQ Equality & Comparison Conditions Parameters Computers for a given cpu Computers & NetCards for a given cpu & rate Wired Routers Wireless Routers

5 Running Example Integrated schema puts together the Dell and Cisco schemas Attribute Associations (Computers.cid, NetCards.cid) (NetCards.rate, Routers.rate) (NetCards.standard, Routers.standard) Integrated Schema V1  Application V3V2 DellCisco Mediator Integrated Schema Developer  V4

6 Sophisticated Mediators Make Feasible Queries Hard to Predict Feasible Queries FQ Equivalent CQ query rewritings using the views Might involve more than one views Order might matter V4 Mediator RouWireless() Routers.* 10.11b50Wireless 54.11g120Wireless A B V2 ComNetByCpuRate(‘P4’, ‘10’) C D Computers.*NetCards.* A123P A bUSB B123P B gUSB Feasible ComNetByCpuRate(‘P4’, ‘54’) Computers.*NetCards.*Routers.* A123P A bUSB10.11b50Wireless B123P B gUSB54.11g120Wireless E Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers V1 Mediator Query: Get all Computers Infeasible

7 Problem 1.Large number of sources 2.Large number of views (web-services) 3.Mediator capabilities Developer formulates an application query  Is an application query feasible?  If not, how do I know which ones are feasible? Previous options: –The developer had to browse the view definitions and somehow formulate a feasible query –Or formulate queries until a feasible one is found (trial-and-error) No system-provided guidance

8 The CLIDE Solution  A query formulation interface, which interactively guides the developer toward feasible queries by employing a coloring scheme CLIDE V1  Application V3V2 DellCisco Mediator Integrated Schema Developer  V4

9 QBE-Like Interfaces Microsoft SQL-Server

10 CLIDE Interface Table, selection, projection and join actions Feasibility Flag Color-based suggestions Selection Boxes Feasibility Flag

11 Example Interaction Yellow  required action –All feasible queries require this action White  optional action –Feasible queries can be formulated w/ or w/o these actions Snapshot 1

12 Example Interaction Snapshot 2 Blue  required choice of action –At least one feasible query cannot be formulated unless this action is performed V1 Mediator ComByCpu(‘P4’) cidcpuramprice A123P B123P ramprice A B C

13 Example Interaction Blue  required choice of action –At least one of them is required to be taken in order to reach a feasible query Join Lines: Only yellow and blue are displayed Must appear in Attribute Associations Snapshot 2

14 CLIDE Properties Completeness of Suggestions –Every feasible query can be formulated by performing yellow and blue actions at every step Minimality of Suggestions –At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness Rapid Convergence By Following Suggestions –The shortest sequence of actions from a query to any feasible query consists of suggested actions

15 Example Interaction Snapshot 3 *  any other constant Red  prohibited action –Does not appear in any feasible query –Lead to “Dead End” state

16 Example Interaction Computers.*NetCards.* A123P A b50 B123P B g120 Snapshot 4 V4 Mediator RouWireless() Routers.* 10.11b512Wireless 54.11g1024Wireless A B V2 ComNetByCpuRate(‘P4’, rate) D E rampricerateinterfaceprice USB USB120 F

17 CLIDE Properties Completeness of Suggestions –Every feasible query can be formulated by performing yellow and blue actions at every step Minimality of Suggestions –At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness Rapid Convergence By Following Suggestions –The shortest sequence of actions from a query to any feasible query consists of suggested actions

18 Join Action Table Action Selection Action Interaction Graph Nodes are queries –One for each qCQ Edges are actions –Table, selection, projection and join actions Green nodes are feasible queries Infinitely big structure –All CQ queries –All possible combinations of actions formulating them Com1.cid=Net1.cidCom1.cpu=‘P4’Com1Com1.ramRou1 …… Com1.price … … …………… Net1 …

19 Com1.cpu=* Interaction Graph: Colors Com1.cpu=* … … … … … … … … … … … … Current Node Net1Com1.cid=Net1.cid Com2.cid=Net1.cid Com2 Com2.cpu=‘P4’Net1.rate=‘54Mbps’ Net1.rate=’54Mbps’ … ………… …… Com1.cpu=*Rou1Net1.rate=Rou1.rate ………… Net1.rate=’54Mbps’ … Com1.cid=Net1.cid … Net1 Com1.price=* Rou1 Com1.ram=* Com1.cid=* Com2 Com1.cid Com1.cpu Yellow action  –Every path from current node n to a feasible node contains  Blue action  –At least one path from current node n to a feasible node n F contains  –There is not path from n to n F that contains a feasible node Red action  –No path to a feasible node contains  Current Node Com1.cid=Net1.cid Com2 Rou1 Net1.rate=’54Mbps’

20 CLIDE Properties Rapid Convergence By Following Suggestions –The shortest path from the current node to any feasible node consists of suggested actions

21 CLIDE Architecture Back-End invoked every time the user performs an action –i.e., the user arrives at a new node in the interactions graph Back-End Closest Feasible Queries Algorithm User Closest Feasible Queries FQ C Current Query Color Algorithm Colored Actions + Feasibility Flag Aliases Collapse Rule Maximally-Contained Rewriter ViewsSchemas Attribute Associations Minimal Feasible Extension Queries  Front-End Actions Parameters Algorithm Seed Queries SQ

22 Color Determined By a Finite Set of Feasible Queries Finite Set of Feasible Queries Closest Feasible Queries FQ C FQ C is sufficient to color actions n … … … … … … Challenge: Infinitely Many Feasible Queries Radius ? Solution: Challenge: How far can the Closest Feasible Queries FQ C be? Closest Feasible Queries FQ C

23 Closest Feasible Queries FQ C Algorithm Compute maximally contained queries FQ MC Radius p L is the longest path to a maximally contained query Theorem:All FQ C queries are reachable via a path of length p  p L Solution: Based on Maximally Contained Queries FQ MC n … … … … … … p L Radius Challenge: The set of nodes within p L can be too many to explore Maximally Contained Queries FQ MC Closest Feasible Queries FQ C

24 Closest Feasible Queries FQ C Algorithm Theorem: All queries in FQ MC are in FQ C But not all queries in FQ C are in FQ MC Collapse Aliases to compute FQ C \ FQ MC Solution: Find the Closest Feasible Queries Directly n … … … … … … Maximally Contained Queries FQ MC Closest Feasible Queries FQ C More feasible nodes

25 Remaining Issues of the Algorithm Color remaining actions white or red Handling projections Dealing with parameterized queries

26 CLIDE Implementation Issues Maximally contained queries need to be minimized (affect CLIDE’s rapid convergence and minimality)  Implemented the minimization module from scratch  MiniCon computes mappings between the query and the views  Exposed these mappings to guide minimization Optimizations to bring the response time below 3 sec  Amortize the computation across multiple interactions steps by observing that query is changed minimally at every step  Instead of calling MiniCon from scratch at every step, we incrementally compute the closest feasible queries

27 Thank you Play with our demo:

28 Maximally Contained Queries FQ MC Assuming fixed SELECT clause (projection list) Covered extensively in literature –MiniCon, Bucket, InverseRules Algorithms FQ MC is finite Maximally Contained Query Query: Q1 Get all Computers Query: Q2 Get all Computers with a given cpu Query: Q3 Get all Computers with a given cpu & ram Not Maximally Contained Maximally Contained Query Query: Q4 Get all Computers with a given ram

29 Closest Feasible Queries FQ C Algorithm Collapse Aliases to compute FQ C \ FQ MC Check satisfiability Closest Feasible Queries FQ C Maximally Contained Feasible Queries FQ MC n … … … … … … Solution: Collapse Aliases

30 Color Algorithm Yellow and Blue An action  is colored based on which closest feasible queries it appear in Yellow, if  appears in all queries in FQ C Blue, if  appears in at least one (but not all) query in FQ C White and Red Attach Maximum Projection Lists to Closest Feasible Queries –Projections that can be added to a feasible query, without compromising feasibility Projection  is white if in the maximum projection list Color selections based on projections

31 CLIDE Implementation & Optimizations Views expansion introduce redundancy –Affects CLIDE’s rapid convergence and minimality Efficient containment test crucial to redundancy removal Maximally-Contained Rewriter Feasible Extension Queries + Maximum Projection Lists Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Maximally-Contained Feasible Queries over Views + Containment Mappings MiniCon Containment Mappings Logging Redundant Queries Removal Minimal Feasible Extension Queries FQ ME + Maximum Projection Lists Redundant Actions Removal Views Expansion Back-End Closest Feasible Queries Algorithm Closest Feasible Queries FQ C Current Query Color Algorithm Colored Actions + Feasibility Flag Aliases Collapse Rule Maximally-Contained Rewriter ViewsSchemas Attribute Associations Minimal Feasible Extension Queries Front-End Parameters Algorithm Seed Queries SQ

32 CLIDE Performance Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A B1B1 … C1C1 B2B2 C1C1 A B K B1B1 … C1C1 C L … Schema … B i … C i Views A B K B1B1 … C1C1 C L … … … B iM B i1 … C iM C i1 … Chains of Stars