Programming by Examples applied to Data Wrangling Invited SYNT July 2015 Sumit Gulwani.



The New Opportunity
Software developers
– Traditional customer for PL technology
End users (non-programmers with access to computers)
– Two orders of magnitude more end users than software developers
– Struggle with simple repetitive tasks
– Need domain-specific expert systems

Excel help forums

Typical help-forum interaction
300_w5_aniSh_c1_b → w5
=MID(B1,5,2)
300_w30_aniSh_c1_b → w30
=MID(B1,FIND("_",$B:$B)+1, FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)
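The two formulas in this interaction can be mirrored in Python to see why the fixed-offset answer stops working on the second input. This is only an illustrative sketch: the helper names `mid` and `between_first_two_underscores` are hypothetical and do not come from the talk.

```python
def mid(text: str, start: int, length: int) -> str:
    """Mimic Excel's MID: 1-based start position, fixed length."""
    return text[start - 1:start - 1 + length]


def between_first_two_underscores(text: str) -> str:
    """Return the text between the first and the second underscore."""
    first = text.find("_")
    second = text.find("_", first + 1)
    return text[first + 1:second]


print(mid("300_w5_aniSh_c1_b", 5, 2))    # w5
print(mid("300_w30_aniSh_c1_b", 5, 2))   # w3  (fixed length cuts off the 0)
print(between_first_two_underscores("300_w30_aniSh_c1_b"))  # w30
```

A single example is consistent with both programs, which is the ambiguity that the ranking and interaction slides later in the talk address.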

Flash Fill (Excel 2013 feature) demo “Automating string processing in spreadsheets using input-output examples”; POPL 2011; Sumit Gulwani

Data Wrangling
Data is locked up in silos in various formats.
– Great flexibility in organizing (hierarchical) data for viewing, but challenging to manipulate and reason about the data.
A typical data wrangling workflow might involve:
– Extraction, Transformation, Querying, Formatting
Data scientists spend 80% of their time wrangling data.
Programming-by-examples (PBE) can provide an easier and faster data wrangling experience.

To get Started! Data Science Class Assignment

FlashExtract Demo
“FlashExtract: A Framework for data extraction by examples”; PLDI 2014; Vu Le, Sumit Gulwani

FlashExtract

PBE Architecture
Inductive Spec (example-based specification) → Search Algorithm → Program

Inductive Specification
Examples: Conjunction of (input state, output value)
Inductive Spec generalizes Examples in 2 ways.
Generalization 1: Conjunction of (input state, output property)
– Motivation: Output properties make it easier to specify intent.

Output properties
– Elements belonging to the output list
– Elements not belonging to the output list
– Contiguous subsequence of the output list
– Prefix of the output list

Output properties
– Prefix of the output table (sequence of records)
– We do not require explicit (magenta) record boundaries, in which case the spec is: prefixes of projections of the output table

Inductive Specification
Examples: Conjunction of (input state, output value)
Inductive Spec generalizes Examples in 2 ways.
Generalization 1: Conjunction of (input state, output property)
– Motivation: Output properties make it easier to specify intent.
Generalization 2: Boolean combination of (input state, output property)
– Motivation: Arises internally as part of problem reduction.
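One way to picture an inductive spec is as a list of (input state, output property) pairs, where each property is a predicate over the program's output. The sketch below is an assumption-laden illustration, not the representation used in the actual tools: the names `contains`, `excludes`, `has_prefix`, `has_contiguous`, `any_of`, and `satisfies` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Output = Sequence[str]
Property = Callable[[Output], bool]


def contains(elements: Sequence[str]) -> Property:
    """Elements that must belong to the output list."""
    return lambda out: all(e in out for e in elements)


def excludes(elements: Sequence[str]) -> Property:
    """Elements that must not belong to the output list."""
    return lambda out: all(e not in out for e in elements)


def has_prefix(prefix: Sequence[str]) -> Property:
    """The output list must start with the given prefix."""
    return lambda out: list(out[:len(prefix)]) == list(prefix)


def has_contiguous(sub: Sequence[str]) -> Property:
    """The output list must contain sub as a contiguous subsequence."""
    n = len(sub)
    return lambda out: any(
        list(out[i:i + n]) == list(sub) for i in range(len(out) - n + 1)
    )


def any_of(*props: Property) -> Property:
    """Disjunction, as one example of a Boolean combination of properties."""
    return lambda out: any(p(out) for p in props)


def satisfies(program: Callable[[str], Output],
              spec: List[Tuple[str, Property]]) -> bool:
    """A program meets the spec if every (input, property) pair holds."""
    return all(prop(program(state)) for state, prop in spec)


split_on_comma = lambda s: s.split(",")
spec = [
    ("a,b,c", has_prefix(["a", "b"])),
    ("a,b,c", any_of(contains(["z"]), has_contiguous(["b", "c"]))),
    ("x,y", excludes(["z"])),
]
print(satisfies(split_on_comma, spec))  # True
```

A plain input-output example is the special case where the property is equality with one fixed output value.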

PBE Architecture
Inductive Spec + DSL → Search Algorithm → Program
Challenge 1: Designing an efficient search algorithm.

Domain-specific Language (DSL)
Balanced expressiveness
– Expressive enough to cover a wide range of tasks
– Restricted enough to enable efficient search
Operators should have a small set of inverses
– To enable efficient problem reduction
Natural computation patterns
– Increased user understanding/confidence
– Enables selection between programs, editing

DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
A regular expression suffices for both, but is not ideal.
– Difficult to synthesize
– Difficult to explain to the user
We propose abstractions that involve simpler regexes.

DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
DSL for Task 1, i.e., [String s -> Substring] :=
    let p1 = [s -> index] in
    let p2 = [s -> index] in
    SubStr(s, p1, p2)
DSL for [String s -> index] :=
    Constant
  | Pos(s, regex1, regex2, k)   // k-th position in s whose left/right side matches regex1/regex2
| let t = Suffix(s,p1) in [t -> index]

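The SubStr Operator
Let w = SubStr(s, p, p') where p = Pos(s, r1, r2, k) and p' = Pos(s, r1', r2', k').
w is the substring of s between positions p and p'. The text to the left of p matches r1 and the text to the right of p matches r2; the text to the left of p' matches r1' and the text to the right of p' matches r2'.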
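The Pos and SubStr operators can be sketched in a few lines of Python. The semantics below are assumptions made for illustration: positions are 0-based string indices, k is 1-based and positive, regex1 must match text ending exactly at the position, and regex2 must match text starting exactly at the position. The function names `pos`, `substr`, and `extract` are hypothetical.

```python
import re


def pos(s: str, regex1: str, regex2: str, k: int) -> int:
    """k-th (1-based) position i in s such that regex1 matches a suffix of
    s[:i] and regex2 matches a prefix of s[i:]."""
    hits = [
        i for i in range(len(s) + 1)
        if re.search(f"(?:{regex1})\\Z", s[:i]) and re.match(regex2, s[i:])
    ]
    return hits[k - 1]


def substr(s: str, p1: int, p2: int) -> str:
    """Substring of s between positions p1 and p2."""
    return s[p1:p2]


def extract(s: str) -> str:
    """SubStr(s, Pos(s, "_", "", 1), Pos(s, "", "_", 2))."""
    p1 = pos(s, "_", "", 1)   # first position with "_" on its left
    p2 = pos(s, "", "_", 2)   # second position with "_" on its right
    return substr(s, p1, p2)


print(extract("300_w5_aniSh_c1_b"))   # w5
print(extract("300_w30_aniSh_c1_b"))  # w30
```

Because both positions are described by what surrounds them rather than by a fixed offset, the same program handles both help-forum inputs.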

DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
DSL for Task 2, i.e., [String s -> List of substrings] :=
    let L = Filter(Split(s,"\n"), [Line -> Bool]) in
    Map(L, [String -> Substring])
DSL for [Line t -> Bool] :=
    MatchRegex(t, regex)
  | MatchRegex(t.previous, regex)
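The Filter/Split/Map shape of the Task 2 DSL can be sketched as follows. This is a hedged illustration: `match_regex` is assumed to search anywhere in the line (the slide does not say whether the match is anchored), the line predicate receives the previous line explicitly instead of through a `t.previous` field, and the names `match_regex` and `extract_list` are hypothetical.

```python
import re
from typing import Callable, List, Optional


def match_regex(text: Optional[str], regex: str) -> bool:
    """True if text exists and regex matches somewhere in it."""
    return text is not None and re.search(regex, text) is not None


def extract_list(s: str,
                 line_pred: Callable[[str, Optional[str]], bool],
                 line_to_substring: Callable[[str], str]) -> List[str]:
    """Map(Filter(Split(s, "\\n"), line_pred), line_to_substring)."""
    lines = s.split("\n")
    selected = [
        line for i, line in enumerate(lines)
        if line_pred(line, lines[i - 1] if i > 0 else None)
    ]
    return [line_to_substring(line) for line in selected]


text = "Name: Ada\nCity: London\nName: Alan\nCity: Wilmslow"
after_colon = lambda line: line.split(": ", 1)[1]

# MatchRegex(t, regex): keep lines that themselves start with "Name:".
print(extract_list(text, lambda t, prev: match_regex(t, r"^Name:"), after_colon))
# ['Ada', 'Alan']

# MatchRegex(t.previous, regex): keep lines whose previous line starts with "Name:".
print(extract_list(text, lambda t, prev: match_regex(prev, r"^Name:"), after_colon))
# ['London', 'Wilmslow']
```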

PBE Architecture
Inductive Spec + DSL → Search Algorithm → Program
Search Algorithm: uses inverse semantics of operators for problem reduction.
Challenge 1: Designing an efficient search algorithm.

Search Algorithm

Problem Reduction
DSL for [String s -> List of substrings] :
    let L = Filter(Split(s,"\n"), [Line -> Bool]) in
    Map(L, [String -> Substring])
Spec for [String -> List of substrings] reduces to:
– Spec for [Line -> Bool]
– Spec for [String -> Substring]

Problem Reduction
DSL for [String s -> Substring] :=
    let p1 = [s -> index] in
    let p2 = [s -> index] in
    SubStr(s, p1, p2)
Example: Redmond, WA
Spec for [String -> Substring] reduces to:
– Spec for p1 [String -> Index]
– Spec for p2 [String -> Index]
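Problem reduction for SubStr can be illustrated with an inverse function: given the input string and the desired output substring, it returns every pair of positions that would produce that output. Those positions then become the specs for p1 and p2. This is a simplified sketch under stated assumptions (0-based indices, overlapping occurrences allowed); `inverse_substr` and `position_specs` are hypothetical names, not the actual tool's API.

```python
from typing import List, Set, Tuple


def inverse_substr(s: str, w: str) -> List[Tuple[int, int]]:
    """All (p1, p2) with s[p1:p2] == w, i.e. the inverse of SubStr on (s, w)."""
    pairs = []
    start = s.find(w)
    while start != -1:
        pairs.append((start, start + len(w)))
        start = s.find(w, start + 1)
    return pairs


def position_specs(s: str, w: str) -> Tuple[Set[int], Set[int]]:
    """Reduce the substring spec to two index specs: the allowed values
    for p1 and the allowed values for p2 on input s."""
    pairs = inverse_substr(s, w)
    return {p1 for p1, _ in pairs}, {p2 for _, p2 in pairs}


print(inverse_substr("ab_ab", "ab"))   # [(0, 2), (3, 5)]
print(position_specs("300_w30_aniSh_c1_b", "w30"))  # ({4}, {7})
```

Treating the two index specs independently drops the pairing between a start and its matching end, so a candidate SubStr program still has to be checked against the original example.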

PBE Architecture
Inductive Spec + DSL → Search Algorithm → Program
Search Algorithm: uses inverse semantics of operators for problem reduction.
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

Ranking
Synthesize multiple programs & rank them.
Basic ranking scheme: define a partial order over program expressions.
– Prefer shorter programs.
– Prefer programs with fewer constants.
Machine-learning based ranking: score using a weighted combination of program features.
– Weights are learned using training data.
“Predicting a correct program in Programming by Example”; CAV 2015; Rishabh Singh, Sumit Gulwani
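The two ranking schemes can be contrasted with a small sketch. It makes several simplifying assumptions: the partial order is flattened into a lexicographic key (size first, then number of constants), the only features are `size` and `constants`, and the weights are made up for illustration rather than learned from training data. The `Candidate` class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    text: str        # printed form of the program
    size: int        # number of operator nodes
    constants: int   # number of constant positions used


def basic_key(c: Candidate) -> Tuple[int, int]:
    """Basic scheme: shorter first, then fewer constants (smaller is better)."""
    return (c.size, c.constants)


def learned_score(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted combination of program features (larger is better)."""
    return weights["size"] * c.size + weights["constants"] * c.constants


def top_k(cands: List[Candidate], k: int,
          score: Callable[[Candidate], float]) -> List[Candidate]:
    """Return the k highest-scoring candidates."""
    return sorted(cands, key=score, reverse=True)[:k]


candidates = [
    Candidate('SubStr(s, 4, 6)', size=3, constants=2),
    Candidate('SubStr(s, Pos(s,"_","",1), Pos(s,"","_",2))', size=3, constants=0),
]

print(min(candidates, key=basic_key).text)
# SubStr(s, Pos(s,"_","",1), Pos(s,"","_",2))

weights = {"size": -1.0, "constants": -2.0}  # illustrative values, not learned
best = top_k(candidates, 1, lambda c: learned_score(c, weights))[0]
print(best.text)
# SubStr(s, Pos(s,"_","",1), Pos(s,"","_",2))
```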

Comparison of Ranking Strategies over FlashFill Benchmarks
Strategy    Average # of examples required
Basic       4.17
Learning    1.48
“Predicting a correct program in Programming by Example”; CAV 2015; Rishabh Singh, Sumit Gulwani

FlashFill Ranking Demo

PBE Architecture
Inductive Spec + DSL + Ranking Function → Search Algorithm → Top-k Programs
Search Algorithm: uses inverse semantics of operators for problem reduction.
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

The Inductive Synthesis Problem
Definition: Inductive Spec x DSL x Ranking function -> Top-k Programs
Solution Strategy: Divide-and-conquer based on inverse semantics
PBE Architecture: Inductive Spec + DSL + Ranking Function → Search Algorithm → Top-k Programs
Search Algorithm: uses inverse semantics of operators for problem reduction.
“FlashMeta: A Framework for Inductive Program Synthesis” [Submitted to OOPSLA 2015]; Alex Polozov, Sumit Gulwani

Comparison of FlashMeta with hand-tuned implementations
Projects: FlashFill, FlashExtractText, FlashRelate, FlashNormalize, FlashExtractWeb
Lines of Code (K), Original vs. FlashMeta (only one row of values survives): N/A vs. 2.5
Development time (months), Original vs. FlashMeta (only one row of values survives): N/A vs. 1.5
Running times of FlashMeta implementations vary between x of the corresponding original implementations.
– Faster because of some free optimizations
– Slower because of larger feature sets & a generalized framework
“FlashMeta: A Framework for Inductive Program Synthesis” [Submitted to OOPSLA 2015]; Alex Polozov, Sumit Gulwani

PBE Architecture
Inductive Spec + DSL + Ranking Function → Search Algorithm → Top-k Programs
Search Algorithm: uses inverse semantics of operators for problem reduction.
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

Need for a better User Interaction Model!
“It's a great concept, but it can also lead to lots of bad data. I think many users will look at a few "flash filled" cells, and just assume that it worked. … Be very careful.”
“most of the extracted data will be fine. But there might be exceptions that you don't notice unless you examine the results very carefully.”

User Interaction Models for Ambiguity Resolution
Make it easy to inspect output correctness
– User can accordingly provide more examples
Show programs
– in any desired programming language; in English
– Enable effective navigation between programs
Computer-initiated interactivity (Active learning)
– Highlight less confident entries in the output.
– Ask directed questions based on distinguishing inputs.
“User Interaction Models for Disambiguation in Programming by Example” [Submitted to UIST 2015]; Mayer, Soares, Grechkin, Le, Marron, Polozov, Singh, Zorn, Gulwani
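A distinguishing input is one on which two candidate programs that agree on the given examples produce different outputs. The sketch below shows one simple way such inputs could be found and flagged for the user. It is an assumption for illustration, not the mechanism described in the cited paper, and `distinguishing_inputs` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def distinguishing_inputs(programs: Sequence[Callable[[str], str]],
                          inputs: Sequence[str]) -> List[str]:
    """Inputs on which the candidate programs do not all agree."""
    flagged = []
    for x in inputs:
        outputs = {p(x) for p in programs}
        if len(outputs) > 1:
            flagged.append(x)
    return flagged


# Two candidates that both satisfy the example 300_w5_aniSh_c1_b -> w5.
fixed_offset = lambda s: s[4:6]
between_underscores = lambda s: s.split("_")[1]

rows = ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b"]
print(distinguishing_inputs([fixed_offset, between_underscores], rows))
# ['300_w30_aniSh_c1_b']
```

Rows flagged this way are natural candidates for highlighting as less confident, or for asking the user which of the competing outputs is intended.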

FlashExtract Demo (User Interaction Models)

PBE tools for Data Manipulation
Extraction
– FlashExtract: Extract data from text files, web pages [PLDI 2014; PowerShell ConvertFrom-String cmdlet]
– FlashRelate: Extract data from spreadsheets [PLDI 2015]
Transformation
– Flash Fill: Excel feature for Syntactic String Transformations [POPL 2011]
– Semantic String Transformations [VLDB 2012]
– Number Transformations [CAV 2013]
– FlashNormalize: Text normalization [IJCAI 2015]
Querying
– NLyze: an Excel programming-by-natural-language add-in [SIGMOD 2014]
Formatting
– Table re-formatting [PLDI 2011]
– FlashFormat: a PowerPoint add-in [AAAI 2014]

FlashRelate Demo
“FlashRelate: Extracting Relational Data from Semi-Structured Spreadsheets Using Examples”; PLDI 2015; Barowy, Gulwani, Hart, Zorn

Collaborators
Vu Le, Dan Barowy, Ted Hart, Maxim Grechkin, Alex Polozov, Dileep Kini, Rishabh Singh, Mikael Mayer, Mark Marron, Gustavo Soares, Ben Zorn

Other Directions
– Other application domains (e.g., robotics).
– Integration with existing programming environments.
– Multi-modal intent specification using a combination of examples and NL.

Data Manipulation using Programming-by-Examples
Data manipulation is challenging!
– Data scientists spend 80% of their time cleaning data.
– 99% of end users are non-programmers.
PBE can enable easy and fast data wrangling!
Cross-disciplinary inspiration
– Theory/Logical Reasoning (Search algorithm)
– Language Design (DSL)
– Machine Learning (Ranking)
– HCI (User interaction models)