Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun.

Slides:



Advertisements
Similar presentations
What is a Database By: Cristian Dubon.
Advertisements

What is a CAT?. Introduction COMPUTER ADAPTIVE TEST + performance task.
Database Security CS461/ECE422 Spring Overview Database model – Relational Databases Access Control Inference and Statistical Databases Database.
Database Management Systems 3ed, Online chapter, R. Ramakrishnan and J. Gehrke1 Query-by-Example (QBE) Online Chapter Example is the school of mankind,
SQL Review.
Structure Query Language (SQL) COMSATS INSTITUTE OF INFORMATION TECHNOLOGY, VEHARI.
Chapter 7 Sampling Distributions
Algorithms and Problem Solving-1 Algorithms and Problem Solving.
Paper Title Your Name CMSC 838 Presentation. CMSC 838T – Presentation Motivation u Problem paper is trying to solve  Characteristics of problem  … u.
Case-based Reasoning System (CBR)
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Introduction to Structured Query Language (SQL)
TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner,
Concepts of Database Management Sixth Edition
A Guide to SQL, Seventh Edition. Objectives Retrieve data from a database using SQL commands Use compound conditions Use computed columns Use the SQL.
Chapter 11.1 and 11.2 Data Manipulation: Relational Algebra and SQL Brian Cobarrubia Introduction to Database Management Systems October 4, 2007.
Microsoft Access 2010 Chapter 7 Using SQL.
Automated Diagnosis of Software Configuration Errors
State of Connecticut Core-CT Project Query 4 hrs Updated 1/21/2011.
Coding for Excel Analysis Optional Exercise Map Your Hazards! Module, Unit 2 Map Your Hazards! Combining Natural Hazards with Societal Issues.
Attribute Data in GIS Data in GIS are stored as features AND tabular info Tabular information can be associated with features OR Tabular data may NOT be.
Chapter 3 Single-Table Queries
“Here is my data. Where do I start?” Examples of Ad Hoc Databases Automatic Example Queries for Ad Hoc Databases Bill Howe 1, Garret Cole 2, Nodira Khoussainova.
Feasibility Study.
An-Najah National University Faculty Of Engineering Industrial Engineering Department Implementation Of Quality Function Deployment On Engineering Faculty.
Analyzing Data For Effective Decision Making Chapter 3.
Question 10 What do I write?. Spreadsheet Make sure that you have got a printout of your spreadsheet - no spreadsheet, no marks!
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
Using SAS® Information Map Studio
Introduction Algorithms and Conventions The design and analysis of algorithms is the core subject matter of Computer Science. Given a problem, we want.
IST 210: ORGANIZATION OF DATA Chapter 1. Getting Started IST210 1.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
© 2001 Business & Information Systems 2/e1 Chapter 8 Personal Productivity and Problem Solving.
Lead Black Slide Powered by DeSiaMore1. 2 Chapter 8 Personal Productivity and Problem Solving.
INFO425: System Design INFORMATION X Chapter 8 Evaluating Alternatives for Requirements, Environment, and Implementation Evaluating Alternatives.
1 Single Table Queries. 2 Objectives  SELECT, WHERE  AND / OR / NOT conditions  Computed columns  LIKE, IN, BETWEEN operators  ORDER BY, GROUP BY,
Concepts of Database Management Seventh Edition
1 TAC2000/ Protocol Engineering and Application Research Laboratory (PEARL) Structured Query Language Introduction to SQL Structured Query Language.
Using Special Operators (LIKE and IN)
Concepts of Database Management Seventh Edition
 Agenda 2/20/13 o Review quiz, answer questions o Review database design exercises from 2/13 o Create relationships through “Lookup tables” o Discuss.
Midterm Stats Min: 16/38 (42%) Max: 36.5/38 (96%) Average: 29.5/36 (78%)
Automatically Repairing Broken Workflows for Evolving GUI Applications Sai Zhang University of Washington Joint work with: Hao Lü, Michael D. Ernst.
6 1 Lecture 8: Introduction to Structured Query Language (SQL) J. S. Chou, P.E., Ph.D.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Concepts of Database Management Eighth Edition Chapter 3 The Relational Model 2: SQL.
User Interfaces 4 BTECH: IT WIKI PAGE:
Concepts of Database Management Seventh Edition Chapter 3 The Relational Model 2: SQL.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2009.
SqlExam1Review.ppt EXAM - 1. SQL stands for -- Structured Query Language Putting a manual database on a computer ensures? Data is more current Data is.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
A Guide to SQL, Eighth Edition Chapter Four Single-Table Queries.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 5 SQL.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
1 Chapter 3 Single Table Queries. 2 Simple Queries Query - a question represented in a way that the DBMS can understand Basic format SELECT-FROM Optional.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Survey Training Pack Session 3 – Questionnaire Design.
MIDTERM REVIEW IST 210 Organization of Data IST210 1.
SQL: Interactive Queries (2) Prof. Weining Zhang Cs.utsa.edu.
Chapter 1. Getting Started IST 210: Organization of Data IST2101.
Retrieving Information Pertemuan 3 Matakuliah: T0413/Current Popular IT II Tahun: 2007.
Concepts of Database Management, Fifth Edition Chapter 3: The Relational Model 2: SQL.
Algorithms and Problem Solving
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
CS 405G: Introduction to Database Systems
Lecture 12: Data Wrangling
SQL: Structured Query Language
M1G Introduction to Database Development
Probabilistic Databases
Presentation transcript:

Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun

Goal: making it easier for non-expert users to write correct SQL queries 2 Non-expert database end-users –Business analysts, scientists, marketing managers, etc. can describe what the query task is do not know how write a correct query This paper: bridge the gap!

An example SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1 namestu_id Alice1 Bob2 Charlie3 Dan4 stu_idcourse_idscore Table: student Table: enrolled nameMAX(score) Alice100 Charlie88 Output table Find the name and the maximum course score of each student enrolled in more than 1 course. The correct SQL query:

Existing solutions for querying a database General programming languages + powerful − learning barriers GUI tools + easy to use − limited in customization and personalization − hard to discover desired features in complex GUIs 4

SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1 namestu_id Alice1 Bob2 Charlie3 Dan4 stu_idcourse_idscore Table: student Table: enrolled nameMAX(score) Alice100 Charlie88 Output table SQLSynthesizer Our solution: programming by example

How do end-users use SQLSynthesizer? 6 Real, large database tables Desired output result Small, representative Input-output examples SQLSynthesizer

SQLSynthesizer’s advantages 7 Fully automated −Only requires input-output examples −No need of annotations, hints, or specification of any form Support a wide range of SQL queries −Beyond the “select-from-where” queries [Tran’09]

Expressiveness Ease of Use GUI tools SQLSynthesizer Programming languages Comparison of solutions

Outline Motivation A SQL Subset Synthesis Approach Evaluation Related Work Conclusion 9

Designing a SQL subset 10 The full SQL language A SQL Subset pages specification PSPace-Completeness [Sarma’10] Some features are rarely used SQLSynthesizer’s focus: a widely-used SQL subset

How to design a SQL subset? 11 The full SQL language A SQL Subset Previous approaches: –Decided by the paper authors [Kandel’11] [Tran’09] Our approach: –Ask experienced IT professionals for the most widely-used SQL features

Our approach in designing a SQL subset 1.Online survey: eliciting design requirement 2. Designing the SQL subset 3. Follow-up interview: obtaining feedback −Ask each participant to select 10 most widely-used SQL features −Got 12 responses −Ask each participant to rate the sufficiency of the subset Supported SQL features 1)SELECT.. FROM…WHERE 2)JOIN 3)GROUP BY / HAVING 4)Aggregators (e.g., MAX, COUNT, SUM, etc) 5)ORDER BY Supported in the previous work [Tran’09] Not sufficient at all Completely sufficient 0 5 Average rating: 4.5

Our approach in designing a SQL subset 1.Online survey: eliciting design requirement 1.Designing the SQL subset 2.Follow-up interview: obtaining feedback −Ask each participant to select 10 most widely-used SQL features −Got 12 respondents −Ask each participant to rate the sufficiency of the subset Supported SQL features -SELECT.. FROM…WHERE -JOIN -GROUP BY / HAVING -Aggregators (e.g., MAX, COUNT, SUM, etc) -ORDER BY Supported in the previous work [Tran’09] Not sufficient at all Completely sufficient 0 5 Average rating: 4.5 The SQL subset is enough to write most common queries.

Outline Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 14

SQLSynthesizer Workflow 15 Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples Input tables Output table

SQLSynthesizer Workflow 16 Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples Input tables Output table A SQL query

SQLSynthesizer Workflow 17 Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples Input tables Output table Filter ProjectCombine

SQLSynthesizer Workflow 18 Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples Input tables Output table Filter Project SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1 Query condition Join condition Projection columns Join condition Query condition Projection columns A complete SQL: Combine

Multiple solutions 19 Input tables Output table Query 2 Query 1 Query 3 …

Multiple solutions 20 Input tables Output table Filter Project Combine … … Computes all solutions, ranks them, and shows them to the user.

Key techniques 21 Input tables Output table Filter Project Join condition Query condition Projection columns Combine 1.Combine: Exhaustive search over legal combinations (e.g., cannot join columns with different types) 2.Filter: A machine learning approach to infer query conditions 3.Project: Exhaustive search over legal columns (e.g., cannot apply AVG to a string column) Key contribution

22 Cast as a rule learning problem: Finding rules that can perfectly divide a search space into a positive part and a negative part All rows in the joined table Rows contained in the output table The rest of the rows Learning query conditions

Search space: the joined table 23 namestu_id Alice1 Bob2 Charlie3 Dan4 stu_idcourse_idscore Table: student Table: enrolled Namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie Join on the stu_id column The joined table (inferred in the Combine step)

Finding rules selecting rows contained in the output table 24 namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie The joined table nameMAX(score) Alice100 Charlie88 Output table

Finding rules selecting rows containing the output table 25 namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie The joined table namestu_idcourse_idscore Alice Alice Charlie Charlie Charlie namestu_idcourse_idscore Alice Alice Charlie Charlie Charlie450568

Finding rules selecting rows containing the output table namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie The joined table No good rules! namestu_idcourse_idscore Alice Alice Charlie Charlie Charlie450568

Solution: computing additional features Key idea: –Expand the search space with additional features Enumerate all possibilities that a table can be aggregated Precompute aggregation values as features namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie Suppose grouping it by stu_id MAX(score)SUM (score) COUNT (course_id) additional features... The joined table

Finding rules without additional features 28 namestu_idcourse_idscore Alice Alice Bob Charlie Charlie Charlie The joined table No good rules! namestu_idcourse_idscore Alice Alice Charlie Charlie Charlie450568

Finding rules with additional features The joined table namestu_idcourse_idscoreCOUNT(course_id)MIN(score) Alice Alice Bob Charlie Charlie Charlie COUNT(course_id) > 1 (after groupping by stu_id ) SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1 after the table is grouped by stu_id … namestu_idcourse_idscore Alice Alice Charlie Charlie Charlie450568

Ranking multiple SQL queries Occam’s razor principle: rank simpler queries higher –A simpler query is less likely to overfit the examples Approximate a query’s complexity by its text length 30 nameagescore Alice20100 Bob2099 Charlie3099 name Alice Bob Query 1: select name from student where age < 30 Query 2: select name from student where name = ‘Alice’ || name = ‘Bob’ Input table: student Output table

Outline Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 31

Research Questions Success ratio in synthesizing SQL queries? What is the tool time cost? How much human effort is needed in writing examples? Comparison to existing techniques. 32

Benchmarks 23 SQL query related exercises from a classic textbook –All exercises in chapters 5.1 and forum questions –Can be answered by using standard SQL (Most questions are vendor-specific) –2 questions contain example tables 33

Evaluation Procedure Rank of the correct SQL query Tool time cost Manual cost –Example size, time cost, and the number of interaction rounds (All experiments are done by the second author) 34 Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples

Results: success ratio SQL questions Fail on 8 questions Succeed on 20 questions 5 forum questions 23 textbook exercises Succeed on 5 forum questions Succeed on 15 exercises Fail on 8 exercises Require writing sub-queries, which are not supported in SQLSynthesizer The correct query ranks 1 st in all succeeded questions

Result: tool time cost On average, 8 seconds per benchmark –Min: 1 second, max: 120 seconds –Roughly proportional to the #table and #column 36

Results: manual cost Example size –22 rows, on average (min: 8 rows, max: 52 rows) Time cost in writing examples –3.6 minutes per benchmark, on average (min: 1 minute, max: 7 minutes) Number of interaction rounds –2.3 rounds per benchmark, on average (min: 1 round, max: 5 rounds) 37

Comparison with an existing approach Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning features 38 nameagescore Alice20100 Bob2099 Charlie3080 name Alice Bob select name from student where age < 30 Table: student ageMAX(score) select age, max(score) from student group by age Output table

Query-by-Output vs. SQLSynthesizer 39 Query-by-Output SQLSynthesizer 28 SQL questions Fail on 26 questions Succeed on 2 questions Succeed on 20 questions Fail on 8 questions - Many realistic SQL queries use aggregation features. - Users are unlikely to get stuck on simple “select-from-where” queries.

Experimental Conclusions Good success ratio (71%) Low tool time cost –8 seconds on average Reasonable manual cost –3.6 minutes on average –2.3 interaction rounds Outperform an existing technique –Success ratio: QBO (7%) vs. SQLSynthesizer (71%) 40

Outline Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 41

Related Work Reverse engineering SQL queries Query-by-Examples [Zloof’75] A new GUI with a domain-specific language to write queries Query-by-Output [Tran’09] Uses data values as features, and supports a small SQL subset. View definition Synthesis [Sarma’10] Theoretical analysis, and is limited to 1 input/output table. Automated program synthesis PADS [Fisher’08], Wrangler [Kandel’11], Excel Macro [Harris’11], SQLShare [Howe’11], SnippetSuggest [Khoussainova’11], SQL Inference from Java code [Cheung’13] − Targets different problems, or requires different input. − Inapplicable to SQL synthesis 42

Outline Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 43

Contributions A programming-by-example technique –Synthesize SQL queries from input-output examples –Core idea: using machine learning to infer query conditions Experiments that demonstrate its usefulness –Accurate and efficient Inferred correct answers for 20 out of 28 SQL questions 8 seconds for each question –Outperforms an existing technique The SQLSynthesizer implementation 44

[Backup Slides] 45

The most widely-used SQL features 46 Number of votes 21 features Aggregation features The standard select.. from.. where.. feature Joining features Existential features Value matching features

Design a SQL subset 47 Special joins Wildcard matching Sub-query Covered features Uncovered features Covered 15 features