Discovering Queries based on Example Tuples

Discovering Queries based on Example Tuples
Yanyan Shen (1), Kaushik Chakrabarti (2), Surajit Chaudhuri (2), Bolin Ding (2), Lev Novik (2)
(1) National University of Singapore, (2) Microsoft Corporation

Complex Database Schema
Source                       #tables  #columns  #text columns  #foreign-key references
IMDB                         21       101       42             22
Axon (customer-support)      100      1263      614            63
CRM (customer-relationship)  347      5595      1074           586

Challenge: Querying Complex Databases
Writing the SQL query below requires knowing the target schema, the relevant tables, and the join path:
SELECT CustName, DevName, AppName
FROM Customer, Sales, Device, App
WHERE Sales.CustId = Customer.CustId
  AND Sales.DevId = Device.DevId
  AND Sales.AppId = App.AppId
Any help to formulate a SQL query?

Can Keyword Search Help?
Input: Mike ThinkPad Office (search for sales tuples)
Database instance:
Sales(SId, CustId, DevId, AppId): (s1, c1, d1, a1), (s2, c2, d2, a2), (s3, c3, d3, a3)
Customer(CustId, CustName): (c1, Mike Jones), (c2, Mary Smith), (c3, Bob Evans)
Employee(EmpId, EmpName): (e1, Mike Stone), (e2, Mary Lee), (e3, Bob Nash)
Device(DevId, DevName): (d1, ThinkPad X1), (d2, iPad Air), (d3, Nexus 7)
App(AppId, AppName): (a1, Office 2013), (a2, Evernote), (a3, Dropbox)
Owner(OId, EmpId, DevId, AppId): (o1, e1, d1, a1), (o2, e2, d3, a3), (o3, e3, d2, a2)
ESR(ESRId, EmpId, AppId, Desc): (sr1, e1, a1, Office crash), (sr2, e2, a3, Dropbox can't sync)
Output: matched rows
Mike Jones, s1, ThinkPad X1, Office 2013
Mike Stone, o1, ThinkPad X1, Office 2013
Problems: Where is the schema information? Ambiguity.

Our Proposal
Input (example table): who bought which product with which app installed.
A     B         C
Mike  ThinkPad  Office
Mary  iPad
Bob             Dropbox
Output (project join query): Sales(SId, CustId, DevId, AppId) joined with Customer(CustId, CustName), Device(DevId, DevName) and App(AppId, AppName), with A mapped to Customer, B to Device and C to App.

Roadmap
- Motivation & proposal
- Problem statement
- Solution
  - Candidate query generation
  - Candidate query verification: VerifyAll, SimplePrune, Filter
- Experimental results
- Conclusion

Problem Statement
Input: an example table T
Output: project join queries such that
- (valid): every row r in T is present in the query result
- (minimal): removing any edge or node from the join tree leads to an invalid query
(Figure: the example table with a minimal join tree and a non-minimal one; the non-minimal tree is labeled with an extra Developer node.)
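The validity condition can be illustrated with a small sketch. It assumes an in-memory SQLite copy of the toy tables from the keyword-search slide, treats a cell as matched when each of its keywords appears as a whole token in the output value, and lets empty cells match anything. The names `cell_matches` and `is_valid` are hypothetical, and token matching is only a stand-in for whatever full-text predicate the real system uses.

```python
import sqlite3

# Toy instance copied from the slides (assumption: SQLite stands in for the real DBMS).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer(CustId TEXT, CustName TEXT);
CREATE TABLE Employee(EmpId TEXT, EmpName TEXT);
CREATE TABLE Device(DevId TEXT, DevName TEXT);
CREATE TABLE App(AppId TEXT, AppName TEXT);
CREATE TABLE Sales(SId TEXT, CustId TEXT, DevId TEXT, AppId TEXT);
CREATE TABLE Owner(OId TEXT, EmpId TEXT, DevId TEXT, AppId TEXT);
INSERT INTO Customer VALUES ('c1','Mike Jones'),('c2','Mary Smith'),('c3','Bob Evans');
INSERT INTO Employee VALUES ('e1','Mike Stone'),('e2','Mary Lee'),('e3','Bob Nash');
INSERT INTO Device VALUES ('d1','ThinkPad X1'),('d2','iPad Air'),('d3','Nexus 7');
INSERT INTO App VALUES ('a1','Office 2013'),('a2','Evernote'),('a3','Dropbox');
INSERT INTO Sales VALUES ('s1','c1','d1','a1'),('s2','c2','d2','a2'),('s3','c3','d3','a3');
INSERT INTO Owner VALUES ('o1','e1','d1','a1'),('o2','e2','d3','a3'),('o3','e3','d2','a2');
""")

EXAMPLE_TABLE = [("Mike", "ThinkPad", "Office"), ("Mary", "iPad", ""), ("Bob", "", "Dropbox")]

SALES_QUERY = """SELECT CustName, DevName, AppName FROM Customer, Sales, Device, App
WHERE Sales.CustId = Customer.CustId AND Sales.DevId = Device.DevId AND Sales.AppId = App.AppId"""
OWNER_QUERY = """SELECT EmpName, DevName, AppName FROM Employee, Owner, Device, App
WHERE Owner.EmpId = Employee.EmpId AND Owner.DevId = Device.DevId AND Owner.AppId = App.AppId"""


def cell_matches(cell, value):
    """True if every keyword of the example cell is a token of the output value."""
    tokens = value.lower().split()
    return all(keyword in tokens for keyword in cell.lower().split())


def is_valid(conn, query, example_table):
    """A query is valid if every example row is contained in some output row."""
    result = conn.execute(query).fetchall()
    return all(
        any(all(cell_matches(c, v) for c, v in zip(row, out)) for out in result)
        for row in example_table
    )


print(is_valid(conn, SALES_QUERY, EXAMPLE_TABLE))  # Sales-based query
print(is_valid(conn, OWNER_QUERY, EXAMPLE_TABLE))  # Owner-based query
```

On this toy instance the Sales-based query covers all three example rows, while the Owner-based query does not cover the (Mary, iPad) row.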

Solution Overview
Example table -> Candidate Query Generation -> Candidate Query Verification -> Result queries
- Candidate Query Generation consists of Candidate Projection Column Retrieval and Schema Graph Traversal.
- Supporting components: an IR engine maintaining an inverted index on text columns (CI), the database schema, and the database instance.

Candidate Query Generation: Candidate Projection Column Retrieval
For each column in the example table, find candidate projection columns in the database satisfying the column constraint: the database column contains all the keywords in the example-table column.
Input column  Candidate projection columns
A             Customer.CustName, Employee.EmpName
B             Device.DevName
C             App.AppName, ESR.Desc
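A minimal sketch of this retrieval step, assuming a token-level inverted index from keyword to the set of text columns containing it. The index layout and the names `build_column_index` and `candidate_columns` are illustrative assumptions, not the IR engine used in the work; each example cell is treated as a single keyword.

```python
from collections import defaultdict

# Text columns of the toy instance from the slides.
TEXT_COLUMNS = {
    "Customer.CustName": ["Mike Jones", "Mary Smith", "Bob Evans"],
    "Employee.EmpName": ["Mike Stone", "Mary Lee", "Bob Nash"],
    "Device.DevName": ["ThinkPad X1", "iPad Air", "Nexus 7"],
    "App.AppName": ["Office 2013", "Evernote", "Dropbox"],
    "ESR.Desc": ["Office crash", "Dropbox can't sync"],
}


def build_column_index(text_columns):
    """Map each lower-cased token to the set of columns in which it occurs."""
    index = defaultdict(set)
    for column, values in text_columns.items():
        for value in values:
            for token in value.lower().split():
                index[token].add(column)
    return index


def candidate_columns(index, keywords):
    """Columns that contain all keywords of one example-table column."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


index = build_column_index(TEXT_COLUMNS)
for name, keywords in {"A": ["Mike", "Mary", "Bob"],
                       "B": ["ThinkPad", "iPad"],
                       "C": ["Office", "Dropbox"]}.items():
    print(name, sorted(candidate_columns(index, keywords)))
```

Intersecting the posting sets enforces the column constraint directly: a column survives only if every keyword of the input column occurs somewhere in it.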

Candidate Query Generation: Candidate Query Enumeration
Follow the candidate network generation algorithm [1]. No join is required!
Steps:
- Generate join tree J
- Generate mapping 𝜙
- Check minimal: every leaf node contains a column that is mapped by an input column
(Figure: five candidate queries CQ1 to CQ5. CQ1 joins Sales with Customer (A), Device (B) and App (C). CQ2 joins Owner with Employee (A), Device (B) and App (C). CQ3 to CQ5 are further join trees over Owner, Employee, Device, App and ESR.)
[1] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. VLDB 2002.
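The minimality check on leaf nodes can be sketched as follows. The representation (a node list, undirected edge pairs, and a mapping from input column to "Table.Column") and the function name `is_minimal` are assumptions made for illustration.

```python
from collections import Counter


def is_minimal(nodes, edges, mapping):
    """Check that every leaf of the join tree holds a mapped projection column.

    nodes:   table names in the join tree
    edges:   undirected join edges as (table, table) pairs
    mapping: input column -> "Table.Column"
    """
    degree = Counter()
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
    mapped_tables = {column.split(".")[0] for column in mapping.values()}
    # A node with at most one incident edge is a leaf (this covers a one-table tree).
    leaves = [node for node in nodes if degree[node] <= 1]
    return all(leaf in mapped_tables for leaf in leaves)


# CQ1 from the slides: Sales joined with Customer, Device and App.
cq1_nodes = ["Sales", "Customer", "Device", "App"]
cq1_edges = [("Sales", "Customer"), ("Sales", "Device"), ("Sales", "App")]
cq1_map = {"A": "Customer.CustName", "B": "Device.DevName", "C": "App.AppName"}
print(is_minimal(cq1_nodes, cq1_edges, cq1_map))

# Hypothetical non-minimal variant: ESR hangs off App but no input column maps to it.
extra_nodes = cq1_nodes + ["ESR"]
extra_edges = cq1_edges + [("App", "ESR")]
print(is_minimal(extra_nodes, extra_edges, cq1_map))
```

An unmapped leaf can always be removed without losing any projected column, which is why this check is enough to reject the second tree.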

Algorithm 1: VerifyAll
Iterate over candidate queries in the outer loop and rows of the example table (ET) in the inner loop (or vice versa), and verify whether candidate query CQ contains row r in its output. A candidate is valid iff its output contains all the rows in ET.
(CQ2, 1)-verification:
SELECT TOP 1 * FROM Owner, Employee, Device, App
WHERE Owner.EmpId = Employee.EmpId AND Owner.DevId = Device.DevId AND Owner.AppId = App.AppId
  AND CONTAINS(EmpName, 'Mike') AND CONTAINS(DevName, 'ThinkPad') AND CONTAINS(AppName, 'Office')
A non-empty result implies CQ2 satisfies row 1.
(CQ2, 2)-verification:
SELECT TOP 1 * FROM Owner, Employee, Device, App
WHERE Owner.EmpId = Employee.EmpId AND Owner.DevId = Device.DevId AND Owner.AppId = App.AppId
  AND CONTAINS(EmpName, 'Mary') AND CONTAINS(DevName, 'iPad')
An empty result implies CQ2 fails for row 2.
Performing a (CQ, r)-verification is expensive! VerifyAll is wasteful because most candidate queries are invalid!
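A runnable sketch of the VerifyAll loop, under several assumptions: SQLite replaces the DBMS on the slides, so `LIKE` and `LIMIT 1` stand in for `CONTAINS` and `TOP 1`; the candidate encoding and the names `verify` and `verify_all` are hypothetical; and the loop stops at a candidate's first failing row, which the slide does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer(CustId TEXT, CustName TEXT);
CREATE TABLE Employee(EmpId TEXT, EmpName TEXT);
CREATE TABLE Device(DevId TEXT, DevName TEXT);
CREATE TABLE App(AppId TEXT, AppName TEXT);
CREATE TABLE Sales(SId TEXT, CustId TEXT, DevId TEXT, AppId TEXT);
CREATE TABLE Owner(OId TEXT, EmpId TEXT, DevId TEXT, AppId TEXT);
INSERT INTO Customer VALUES ('c1','Mike Jones'),('c2','Mary Smith'),('c3','Bob Evans');
INSERT INTO Employee VALUES ('e1','Mike Stone'),('e2','Mary Lee'),('e3','Bob Nash');
INSERT INTO Device VALUES ('d1','ThinkPad X1'),('d2','iPad Air'),('d3','Nexus 7');
INSERT INTO App VALUES ('a1','Office 2013'),('a2','Evernote'),('a3','Dropbox');
INSERT INTO Sales VALUES ('s1','c1','d1','a1'),('s2','c2','d2','a2'),('s3','c3','d3','a3');
INSERT INTO Owner VALUES ('o1','e1','d1','a1'),('o2','e2','d3','a3'),('o3','e3','d2','a2');
""")

# Candidate = (FROM tables, join predicates, mapping of input column to projection column).
CANDIDATES = {
    "CQ1": (["Sales", "Customer", "Device", "App"],
            ["Sales.CustId=Customer.CustId", "Sales.DevId=Device.DevId", "Sales.AppId=App.AppId"],
            {"A": "CustName", "B": "DevName", "C": "AppName"}),
    "CQ2": (["Owner", "Employee", "Device", "App"],
            ["Owner.EmpId=Employee.EmpId", "Owner.DevId=Device.DevId", "Owner.AppId=App.AppId"],
            {"A": "EmpName", "B": "DevName", "C": "AppName"}),
}
# Example table; empty cells are simply omitted from a row.
EXAMPLE_TABLE = [{"A": "Mike", "B": "ThinkPad", "C": "Office"},
                 {"A": "Mary", "B": "iPad"},
                 {"A": "Bob", "C": "Dropbox"}]


def verify(conn, candidate, row):
    """(CQ, r)-verification: does the candidate's join contain a tuple matching row r?"""
    tables, joins, mapping = candidate
    conditions = list(joins) + [f"{mapping[col]} LIKE ?" for col in row]
    sql = f"SELECT 1 FROM {', '.join(tables)} WHERE {' AND '.join(conditions)} LIMIT 1"
    params = [f"%{keyword}%" for keyword in row.values()]
    return conn.execute(sql, params).fetchone() is not None


def verify_all(conn, candidates, rows):
    """Return the valid candidate names and the number of verifications issued."""
    valid, calls = [], 0
    for name, candidate in candidates.items():
        ok = True
        for row in rows:
            calls += 1
            if not verify(conn, candidate, row):
                ok = False
                break  # assumption: give up on a candidate at its first failing row
        if ok:
            valid.append(name)
    return valid, calls


valid, calls = verify_all(conn, CANDIDATES, EXAMPLE_TABLE)
print(valid, calls)
```

Every call to `verify` is a join query against the database, which is the cost the later pruning techniques try to avoid.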

Opportunity of Pruning
Failure dependency: if (CQ2, 2) fails, then (CQ5, 2) fails as well, because the CQ5 query contains every predicate of the CQ2 query plus one more join.
(CQ2, 2)-verification:
SELECT TOP 1 * FROM Owner, Employee, Device, App
WHERE Owner.EmpId = Employee.EmpId AND Owner.DevId = Device.DevId AND Owner.AppId = App.AppId
  AND CONTAINS(EmpName, 'Mary') AND CONTAINS(DevName, 'iPad')
(CQ5, 2)-verification:
SELECT TOP 1 * FROM Owner, Employee, Device, App, ESR
WHERE Owner.EmpId = Employee.EmpId AND Owner.DevId = Device.DevId AND Owner.AppId = App.AppId
  AND ESR.AppId = App.AppId
  AND CONTAINS(EmpName, 'Mary') AND CONTAINS(DevName, 'iPad')
Verifying candidates with smaller join trees first is more beneficial!

Algorithm 2: SimplePrune
- Order candidate queries by increasing join tree size.
- Keep a list of the CQ-row verifications performed so far that failed.
- Iterate over the ordered candidate queries in the outer loop and rows in the inner loop.
- When verifying candidate Q, check whether its failure is implied by the verifications in the list. If so, prune Q immediately. Otherwise, verify Q for all the rows.
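The steps above can be sketched as follows. This is an illustration under assumptions: join trees are modeled as sets of edges, a recorded failure is taken to imply another candidate's failure when its edge set is contained in the other's and both map the row's non-empty columns identically, and the database is replaced by a lookup table of verification outcomes (`OUTCOMES`) consistent with the toy instance. The names `Candidate`, `implies_failure` and `simple_prune` are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    edges: frozenset   # join edges, each written as a (table, table) tuple
    mapping: dict      # input column -> projection column


def implies_failure(failed, other, row):
    """A failure of `failed` on `row` carries over to `other` if other's join
    tree contains failed's and both map the row's non-empty columns alike."""
    return failed.edges <= other.edges and all(
        failed.mapping[col] == other.mapping[col] for col in row)


def simple_prune(candidates, rows, verify):
    failures = []  # (candidate, row index) pairs whose verification failed
    valid = []
    for cq in sorted(candidates, key=lambda c: len(c.edges)):
        if any(implies_failure(f, cq, rows[i]) for f, i in failures):
            continue  # pruned without touching the database
        for i in range(len(rows)):
            if not verify(cq, i):
                failures.append((cq, i))
                break
        else:
            valid.append(cq.name)
    return valid


OWNER = {("Owner", "Employee"), ("Owner", "Device"), ("Owner", "App")}
CQ1 = Candidate("CQ1", frozenset({("Sales", "Customer"), ("Sales", "Device"), ("Sales", "App")}),
                {"A": "Customer.CustName", "B": "Device.DevName", "C": "App.AppName"})
CQ2 = Candidate("CQ2", frozenset(OWNER),
                {"A": "Employee.EmpName", "B": "Device.DevName", "C": "App.AppName"})
CQ5 = Candidate("CQ5", frozenset(OWNER | {("App", "ESR")}),
                {"A": "Employee.EmpName", "B": "Device.DevName", "C": "ESR.Desc"})
ROWS = [{"A": "Mike", "B": "ThinkPad", "C": "Office"}, {"A": "Mary", "B": "iPad"},
        {"A": "Bob", "C": "Dropbox"}]

# Stand-in for real (CQ, r)-verifications; there is deliberately no entry for CQ5.
OUTCOMES = {("CQ1", 0): True, ("CQ1", 1): True, ("CQ1", 2): True,
            ("CQ2", 0): True, ("CQ2", 1): False}
calls = []


def verify(cq, row_index):
    calls.append((cq.name, row_index))
    return OUTCOMES[(cq.name, row_index)]


valid = simple_prune([CQ5, CQ1, CQ2], ROWS, verify)
print(valid, calls)
```

In this run CQ5 is never verified: the recorded failure of (CQ2, 2) already implies that CQ5 fails on row 2.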

Observation limited pruning! Mike Mary ThinkPad iPad Office Dropbox Bob limited pruning!

Opportunity
Evaluating a common sub-structure on a certain row may prune multiple invalid candidates!

Filter
A filter pairs a sub-structure of a candidate's join tree with a row of the example table.
Candidate: Owner, Employee, Device, App with 𝜙(A) = Employee.EmpName, 𝜙(B) = Device.DevName, 𝜙(C) = App.AppName
Filter: Owner, Employee, Device with 𝜙'(A) = Employee.EmpName, 𝜙'(B) = Device.DevName, 𝜙'(C) = *
Filter evaluation query: evaluated on the rows (Mike, ThinkPad, Office) and (Mary, iPad)
Filter success and failure: F1 succeeds, F2 fails

Dependency Properties of Filters
F1: Owner, Employee, Device (A, B) on row (Mary, iPad)
F2: Owner, Employee, Device, App (A, B, C) on row (Mary, iPad)
- Filter-candidate dependency: F1 fails implies CQ2 is invalid
- Inter-filter failure dependency: F1 fails implies F2 fails
- Inter-filter success dependency: F2 succeeds implies F1 succeeds
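These three dependencies can be sketched as simple set-containment rules. The encoding is an assumption: a filter is an edge set plus a row index, all filters are taken to share one column mapping restricted to the tables they contain, and the names `infer_outcomes` and `invalid_candidates` are hypothetical.

```python
def infer_outcomes(known, filters):
    """Propagate known filter outcomes along the inter-filter dependencies.

    known:   {filter name: True/False} for filters already evaluated
    filters: {filter name: (frozenset of join edges, row index)}
    """
    inferred = dict(known)
    for name, outcome in known.items():
        edges, row = filters[name]
        for other, (other_edges, other_row) in filters.items():
            if other in inferred or other_row != row:
                continue
            if not outcome and edges <= other_edges:
                inferred[other] = False   # a failing sub-structure fails every extension
            elif outcome and other_edges <= edges:
                inferred[other] = True    # a succeeding structure implies its sub-structures
    return inferred


def invalid_candidates(outcomes, filters, candidates):
    """Filter-candidate dependency: a failed filter invalidates every candidate
    whose join tree contains the filter's join tree."""
    failed = [filters[name][0] for name, ok in outcomes.items() if not ok]
    return {cq for cq, edges in candidates.items()
            if any(f <= edges for f in failed)}


F1_EDGES = frozenset({("Owner", "Employee"), ("Owner", "Device")})
F2_EDGES = F1_EDGES | {("Owner", "App")}
FILTERS = {"F1": (F1_EDGES, 1), "F2": (F2_EDGES, 1)}   # both on the (Mary, iPad) row
CANDIDATE_TREES = {"CQ2": F2_EDGES}

after_failure = infer_outcomes({"F1": False}, FILTERS)
after_success = infer_outcomes({"F2": True}, FILTERS)
print(after_failure, after_success)
print(invalid_candidates(after_failure, FILTERS, CANDIDATE_TREES))
```

A success of F1 alone says nothing about F2, so `infer_outcomes({"F1": True}, FILTERS)` leaves F2 undecided.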

Adaptive Filter Selection
Sub-structures: J1: Owner, Employee, Device (A, B); J2: Owner, App, Device (B, C); J3: Owner, Employee, App (A, C); J4: ESR, App, Employee (C, A)
Filters: (J1,1) (J1,2) (J1,3) (J2,1) (J2,2) (J2,3) (J3,1) (J3,2) (J3,3) (J4,1) (J4,2) (J4,3)
Candidates: CQ2, CQ3, CQ4
(Figure: two selections of filters over the same candidates. The first needs 5 evaluations, the second only 2.)

Filter Selection Problem
Given the set of filters for all the candidate queries, select a set of filters with minimized cost such that all the candidate queries are verified as valid/invalid after evaluating the selected filters.
- Cost of F_i: number of joins in the join tree of F_i
- Problem complexity: NP-hard
- Greedy algorithm with an approximation ratio
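To give a feel for the greedy idea, here is a deliberately simplified sketch. It treats the problem as static weighted set cover: each filter has a cost and a fixed set of candidates it would resolve, and the heuristic repeatedly picks the filter with the lowest cost per newly resolved candidate. This is a textbook stand-in rather than the greedy algorithm from the talk, which has to cope with filter outcomes that are unknown in advance; the filter names, costs and coverage sets below are made up for illustration.

```python
def greedy_filter_selection(candidates, filters):
    """Weighted set-cover greedy.

    candidates: iterable of candidate names to resolve
    filters:    {filter name: (cost, set of candidates it would resolve)}
    Returns the chosen filter names in selection order.
    """
    unresolved = set(candidates)
    chosen = []
    while unresolved:
        useful = [name for name, (_, covers) in filters.items() if covers & unresolved]
        if not useful:
            break  # the remaining candidates cannot be resolved by any filter
        best = min(useful,
                   key=lambda n: filters[n][0] / len(filters[n][1] & unresolved))
        chosen.append(best)
        unresolved -= filters[best][1]
    return chosen


# Hypothetical instance: cost = number of joins in the filter's join tree.
FILTER_POOL = {
    "F1": (2, {"Q1", "Q2"}),
    "F2": (2, {"Q3"}),
    "F3": (5, {"Q1", "Q2", "Q3"}),
}
selection = greedy_filter_selection(["Q1", "Q2", "Q3"], FILTER_POOL)
total_cost = sum(FILTER_POOL[name][0] for name in selection)
print(selection, total_cost)
```

Here two cheap filters are preferred over the single expensive filter F3, even though F3 alone would resolve all three candidates.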

Experiment Settings
- Dataset: IMDB
- Example table generation parameters: #rows, #columns, sparsity, value length for non-empty cells
- Implementations: VerifyAll, SimplePrune, Filter, Weave [2]
- Measures: number of verifications performed, execution time
[2] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. SIGMOD 2012.

Results on Various Example Tables (varying #rows)
- Filter performs 5X fewer verifications than VerifyAll and 2X fewer than SimplePrune.
- Filter is robust to #rows, i.e. it requires a similar number of verifications as #rows varies.
- Filter runs 4X faster than VerifyAll and 3X faster than SimplePrune.

Comparison with Weave
- Filter requires 10X fewer verifications than Weave.
- Filter runs 4X faster than Weave.

Conclusion
- Developed a new search interface for discovering queries
- Addressed challenges in query discovery
- Verified candidate queries efficiently: filter selection problem, greedy strategy

Thanks! Q&A