Dagstuhl Seminar Oct 2015 Sumit Gulwani Applications of Inductive Programming in Data Wrangling.

Presentation transcript:

Applications of Inductive Programming in Data Wrangling
Sumit Gulwani
Dagstuhl Seminar, Oct 2015

Collaborators: Vu Le, Dan Barowy, Ted Hart, Alex Polozov, Dileep Kini, Rishabh Singh, Mikael Mayer, Gustavo Soares, Ben Zorn, Bill Harris

Reference
“Programming by Examples (and its applications in Data Wrangling)”; Gulwani; 2016; in Verification and Synthesis of Correct and Secure Systems; IOS Press [based on Marktoberdorf Summer School 2015 Lecture Notes]

The New Opportunity
Software developers: the traditional customer for PL technology.
End users (non-programmers with access to computers): two orders of magnitude more numerous, and they struggle with simple repetitive tasks.
Inductive programming can play a significant role (in conjunction with ML and HCI)!

Spreadsheet help forums

Typical help-forum interaction
300_w5_aniSh_c1_b → w5
=MID(B1,5,2)
300_w30_aniSh_c1_b → w30
=MID(B1,FIND("_",$B:$B)+1, FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)
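The gap between the two forum answers can be sketched in Python. This is only an illustration: the function names are hypothetical. The first function mirrors the fixed-position =MID(B1,5,2); the second mirrors the intent of the longer formula, which is to take the text between the first and second underscore.

```python
def mid_fixed(cell: str) -> str:
    """Mirror of =MID(B1,5,2): two characters starting at 1-based position 5."""
    return cell[4:6]


def between_first_two_underscores(cell: str) -> str:
    """Return the text between the first and the second underscore."""
    first = cell.index("_")
    second = cell.index("_", first + 1)
    return cell[first + 1:second]


if __name__ == "__main__":
    for cell in ("300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b"):
        print(cell, mid_fixed(cell), between_first_two_underscores(cell))
```

The fixed-position version happens to work on the first example and silently truncates the second, which is why the forum thread needed a second round.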

Flash Fill (Excel 2013 feature) demo “Automating string processing in spreadsheets using input-output examples”; POPL 2011; Sumit Gulwani

Number Transformations
“Synthesizing Number Transformations from Input-Output Examples”; CAV 2012; Singh, Gulwani

Input        Output (nearest lower half hour)
0d 5h 26m    5:00
0d 4h 57m    4:30
0d 4h 27m    4:00
0d 3h 57m    3:30

Round to 2 decimal places, format strings by language:
Excel/C#:    #.00
Python/C:    .2f
Java:        #.##
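For comparison, here is a hand-written Python sketch of the two number transformations on this slide, the kind of program a synthesizer would be expected to produce from the examples. The function names are hypothetical. Folding days into the hour count is an assumption: every example on the slide has 0 days, so the slide does not settle that case.

```python
import re


def floor_to_half_hour(duration: str) -> str:
    """Map a duration such as '0d 5h 26m' to the nearest lower half hour."""
    match = re.fullmatch(r"(\d+)d (\d+)h (\d+)m", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    days, hours, minutes = (int(group) for group in match.groups())
    total_hours = days * 24 + hours  # assumption: days are folded into hours
    half = "30" if minutes >= 30 else "00"
    return f"{total_hours}:{half}"


def round_two_decimals(value: float) -> str:
    """Round to 2 decimal places using the Python/C format string '.2f'."""
    return f"{value:.2f}"


if __name__ == "__main__":
    for text in ("0d 5h 26m", "0d 4h 57m", "0d 4h 27m", "0d 3h 57m"):
        print(text, "->", floor_to_half_hour(text))
```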

Semantic String Transformations
“Learning Semantic String Transformations from Examples”; VLDB 2012; Singh, Gulwani

Input v1     Input v2      Output (Price + Markup*Price)
Stroller     10/12/2010    $ *
Bib          23/12/2010    $ *3.56
Diapers      21/1/2011
Wipes        2/4/2009
Aspirator    23/2/2010

MarkupRec Table
Id     Name         Markup
S33    Stroller     30%
B56    Bib          45%
D32    Diapers      35%
W98    Wipes        40%
A46    Aspirator    30%

CostRec Table
Id     Date       Price
S33    12/2010    $
S33    11/2010    $
B56    12/2010    $3.56
D32    1/2011     $21.45
W98    4/2009     $5.12
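The lookup chain behind this example can be sketched in Python as follows. This is a hand-written illustration, not the synthesized program, and the names are hypothetical. It uses only the rows whose prices survive in the transcript, so the S33 rows are left out, and it returns a number because the exact output formatting is not recoverable from the slide.

```python
# Name -> (Id, Markup), taken from the MarkupRec table.
MARKUP_REC = {
    "Stroller": ("S33", 0.30),
    "Bib": ("B56", 0.45),
    "Diapers": ("D32", 0.35),
    "Wipes": ("W98", 0.40),
    "Aspirator": ("A46", 0.30),
}

# (Id, "month/year") -> Price, taken from the CostRec table.
# The S33 prices are missing in the transcript, so those rows are omitted.
COST_REC = {
    ("B56", "12/2010"): 3.56,
    ("D32", "1/2011"): 21.45,
    ("W98", "4/2009"): 5.12,
}


def marked_up_price(name: str, date: str) -> float:
    """Compute Price + Markup*Price by joining the two tables."""
    item_id, markup = MARKUP_REC[name]          # Name -> Id and Markup
    _day, month, year = date.split("/")         # dates are day/month/year
    price = COST_REC[(item_id, f"{month}/{year}")]  # (Id, month/year) -> Price
    return price + markup * price


if __name__ == "__main__":
    print(marked_up_price("Bib", "23/12/2010"))
```

The point of the example is that the transformation mixes syntactic steps (dropping the day from the date) with semantic ones (table lookups), which a purely string-based language cannot express.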

Data is the new Oil
New currency of the digital world that enables business decisions, advertising, recommendations.
Sources:
– Digital revolution
– Cloud computing, IoT
– Social media
Raw data needs to be extracted and refined to enable monetization!

Data Wrangling
Raw data is locked up in many formats: flexible organization for viewing, but challenging to manipulate.
Wrangling steps: Extraction, Transformation, Formatting.
Result: processed data for drawing insights and driving decisions.
Data scientists spend 80% of their time wrangling data.
Inductive Programming can enable easier & faster wrangling!

To get Started! Data Science Class Assignment

FlashExtract Demo
“FlashExtract: A Framework for Data Extraction by Examples”; PLDI 2014; Vu Le, Sumit Gulwani
Recently shipped inside two Microsoft products:
– PowerShell ConvertFrom-String cmdlet
– Azure OMS custom field extractor

FlashExtract

FlashRelate: Table Re-formatting
Trifacta: small, guided steps
From: Skills of the Agile Data Wrangler (tutorial by Hellerstein and Heer)
Start with: [table image] End goal: [table image]
Trifacta provides a series of small transformations:
1. Split on “:” delimiter
2. Delete empty rows
3. Fill values down
4. Pivot number on type
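The four guided steps can be sketched in plain Python. The slide's actual tables are not preserved in the transcript, so the sample data and the column names (name, type, number) below are hypothetical and were chosen only so that each step has something to do.

```python
# Hypothetical raw lines: "name:type:number", with the name left blank on
# continuation rows and an empty line between groups.
RAW = [
    "Alice:phone:555",
    ":age:30",
    "",
    "Bob:phone:777",
    ":age:41",
]


def wrangle(lines):
    # 1. Split on ":" delimiter.
    rows = [line.split(":") for line in lines]
    # 2. Delete empty rows.
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    # 3. Fill values down: a blank name inherits the name from the row above.
    filled, last_name = [], ""
    for name, kind, number in rows:
        last_name = name.strip() or last_name
        filled.append((last_name, kind.strip(), number.strip()))
    # 4. Pivot number on type: one record per name, one field per type.
    table = {}
    for name, kind, number in filled:
        table.setdefault(name, {})[kind] = number
    return table


if __name__ == "__main__":
    print(wrangle(RAW))
```

Each step is simple on its own. The contrast drawn on the slide is that the user has to discover and order the steps, whereas an example-based tool aims to infer the whole pipeline from a few output rows.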

FlashRelate Demo
“FlashRelate: Extracting Relational Data from Semi-Structured Spreadsheets Using Examples”; PLDI 2015; Barowy, Gulwani, Hart, Zorn

Table Layout Transformations
“Spreadsheet Table Transformations from Examples”; PLDI 2011; Harris, Gulwani

Flat layout:
PROJ    CAT      SPONSOR     DEPT      ELTS     DATE
SPEC    OOH      Infiniti    Design    Elt 1    11/10
SPEC    OOH      Infiniti    Desing    Elt 2
SPEC    Print                Design    Elt 3    11/30
SPEC    Print                Design    Elt 4    11/30
SPEC    Print    Infiniti    Design    Elt 5    11/30

Nested layout:
SPEC
  OOH
    Infiniti    Design    Elt 1    11/10
    Infiniti    Desing    Elt 2
  Print
                Design    Elt 3    11/30
                Design    Elt 4    11/30
    Infiniti    Design    Elt 5    11/30

Table Layout Transformations
“Spreadsheet Table Transformations from Examples”; PLDI 2011; Harris, Gulwani

List layout:
Alice    Art&Des      B
         CreateArt    A
         English      A
         Geo.         A*
Bob      Art&Des      C
         CreateArt    B
         English      C
         Geo.         C

Cross-tab layout:
         Art&Des    CreateArt    English    Geo.
Alice    B          A            A          A*
Bob      C          B            C          C
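One direction of this layout change, from the list layout to the cross-tab layout, can be written by hand in a few lines of Python. This is a sketch of the target program only, not the synthesis algorithm from the paper, and the function name is hypothetical.

```python
# List layout from the slide: the student name appears only on the first
# row of each group.
LIST_LAYOUT = [
    ("Alice", "Art&Des", "B"),
    ("", "CreateArt", "A"),
    ("", "English", "A"),
    ("", "Geo.", "A*"),
    ("Bob", "Art&Des", "C"),
    ("", "CreateArt", "B"),
    ("", "English", "C"),
    ("", "Geo.", "C"),
]


def to_cross_tab(rows):
    """Turn (student, subject, grade) rows into a grid with one row per student."""
    students, subjects, grades = [], [], {}
    current = None
    for student, subject, grade in rows:
        if student:  # a non-blank name starts a new group
            current = student
            students.append(student)
        if subject not in subjects:  # keep subjects in first-seen order
            subjects.append(subject)
        grades[(current, subject)] = grade
    header = [""] + subjects
    body = [[s] + [grades.get((s, subj), "") for subj in subjects] for s in students]
    return [header] + body


if __name__ == "__main__":
    for row in to_cross_tab(LIST_LAYOUT):
        print(row)
```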

PBE tools for Data Manipulation
Extraction
– FlashExtract: extract data from text files, web pages [PLDI 2014]; shipped in the PowerShell ConvertFrom-String cmdlet and the Azure OMS custom field extractor
Transformation
– Flash Fill: Excel 2013 feature for syntactic string transformations [POPL 2011, CAV 2015]
– Semantic string transformations [VLDB 2012]
– Number transformations [CAV 2012]
– FlashNormalize: text normalization [IJCAI 2015]
Formatting
– FlashRelate: extract data from spreadsheets [PLDI 2015]
– Table layout transformations [PLDI 2011]
– FlashFormat: a PowerPoint add-in [AAAI 2014]

Two key messages
1. Data wrangling is a killer application for inductive programming today!
2. Ambiguity resolution needs to be an integral part of inductive programming techniques.

Inductive Programming
Example-based specification → Search Algorithm → Program
Ambiguous/under-specified intent may result in unintended programs!

Dealing with Ambiguity
Ranking
– Synthesize multiple programs and rank them.

Basic ranking scheme
Prefer programs with lower Kolmogorov complexity:
– Prefer fewer constants.
– Prefer smaller constants.

Input            Output
Rishabh Singh    Rishabh
Ben Zorn         Ben

Candidate programs:
– 1st Word
– If (input = “Rishabh Singh”) then “Rishabh” else “Ben”
– “Rishabh”
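The basic scheme can be sketched in Python as a filter followed by a sort. The three candidates are the ones listed on the slide. The Candidate class and the cost function are hypothetical, and treating “1st Word” as having no constants is an assumption made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    name: str
    run: Callable[[str], str]
    constants: Tuple[str, ...]  # string constants that appear in the program


EXAMPLES = [("Rishabh Singh", "Rishabh"), ("Ben Zorn", "Ben")]

CANDIDATES = [
    Candidate("1st word", lambda s: s.split()[0], ()),
    Candidate(
        "if-then-else",
        lambda s: "Rishabh" if s == "Rishabh Singh" else "Ben",
        ("Rishabh Singh", "Rishabh", "Ben"),
    ),
    Candidate("constant", lambda s: "Rishabh", ("Rishabh",)),
]


def consistent(candidate: Candidate, examples) -> bool:
    """A candidate is kept only if it reproduces every example."""
    return all(candidate.run(inp) == out for inp, out in examples)


def cost(candidate: Candidate) -> Tuple[int, int]:
    """Fewer constants first, then smaller constants (total length)."""
    return (len(candidate.constants), sum(len(c) for c in candidate.constants))


def rank(candidates: List[Candidate], examples) -> List[Candidate]:
    return sorted((c for c in candidates if consistent(c, examples)), key=cost)


if __name__ == "__main__":
    print([c.name for c in rank(CANDIDATES, EXAMPLES)])
```

With only the first example, all three candidates are consistent and the ranking alone has to pick “1st word”. The second example removes the constant program but still leaves the if-then-else program, so ranking remains necessary.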

Challenges with basic ranking scheme
Prefer programs with lower Kolmogorov complexity:
– Prefer fewer constants.
– Prefer smaller constants.

Input            Output
Rishabh Singh    Singh, Rishabh
Ben Zorn         Zorn, Ben

Candidate programs:
– 2nd Word + “, ” + 1st Word
– “Singh, Rishabh”

How to select between fewer larger constants vs. more smaller constants?
Idea: Associate numeric weights with constants.

Challenges with basic ranking scheme
Prefer programs with lower Kolmogorov complexity:
– Prefer fewer constants.
– Prefer smaller constants.

Input / Output example: Missing page numbers, [remaining values not preserved]

Candidate programs:
– 1st Number from the beginning
– 1st Number from the end

How to select between programs with the same number of same-sized constants?
Idea: Examine data features (in addition to program features).

Machine learning based ranking scheme
“Predicting a Correct Program in Programming by Example”; CAV 2015; Rishabh Singh, Sumit Gulwani
Rank score of a program: weighted combination of various features. Weights are learned using machine learning.
Program features:
– Number of constants
– Size of constants
Features over user data: similarity of generated output (or even intermediate values) over various user inputs
– IsYear, Numeric Deviation, Number of characters
– IsPersonName
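A rank score of this shape can be sketched in Python for two programs that tie on program features: first number from the beginning and first number from the end. Everything specific here is an assumption made for illustration. The sample inputs are hypothetical, IsYear is reduced to a simple four-digit range check, and the weights are hand-picked, whereas the paper learns them.

```python
import re


def first_number_from_start(text: str) -> str:
    numbers = re.findall(r"\d+", text)
    return numbers[0] if numbers else ""


def first_number_from_end(text: str) -> str:
    numbers = re.findall(r"\d+", text)
    return numbers[-1] if numbers else ""


def is_year(text: str) -> bool:
    """Crude IsYear feature: a four-digit number in a plausible range."""
    return text.isdigit() and len(text) == 4 and 1900 <= int(text) <= 2100


# Hand-picked weights (assumption): constants are penalized, year-like
# outputs are rewarded.
WEIGHTS = {"num_constants": -1.0, "year_fraction": 2.0}


def score(program, num_constants: int, inputs) -> float:
    """Weighted combination of one program feature and one data feature."""
    outputs = [program(text) for text in inputs]
    year_fraction = sum(is_year(out) for out in outputs) / len(outputs)
    return (WEIGHTS["num_constants"] * num_constants
            + WEIGHTS["year_fraction"] * year_fraction)


# Hypothetical user inputs; only the first one has an example output.
INPUTS = ["Missing page numbers, 1993", "64-67, 1995", "pp. 12-20, 2001"]

if __name__ == "__main__":
    print(score(first_number_from_start, 1, INPUTS))
    print(score(first_number_from_end, 1, INPUTS))
```

Both programs agree on the first input, so the example alone cannot separate them. The data feature does: only the second program yields year-like values on every row.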

Need for a fall-back mechanism
“It's a great concept, but it can also lead to lots of bad data. I think many users will look at a few "flash filled" cells, and just assume that it worked. … Be very careful.”
“most of the extracted data will be fine. But there might be exceptions that you don't notice unless you examine the results very carefully.”

Dealing with Ambiguity
Ranking
– Synthesize multiple programs and rank them.
User Interaction Models
– Communicate actionable information to the user.

User Interaction Models for Ambiguity Resolution
“User Interaction Models for Disambiguation in Programming by Example”; UIST 2015; Mayer, Soares, Grechkin, Le, Marron, Polozov, Singh, Zorn, Gulwani
Make it easy to inspect output correctness
– User can accordingly provide more examples
Show programs
– in any desired programming language; in English
– Enable effective navigation between programs
Computer-initiated interactivity (active learning)
– Highlight less confident entries in the output.
– Ask directed questions based on distinguishing inputs.
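The notion of a distinguishing input can be sketched in Python: an input on which programs that agree with all the examples still disagree with each other. The two candidate programs and the third input below are hypothetical and serve only to illustrate the idea. They are not taken from the paper.

```python
def first_word(text: str) -> str:
    return text.split()[0]


def all_but_last_word(text: str) -> str:
    return " ".join(text.split()[:-1])


# Both programs map "Rishabh Singh" to "Rishabh" and "Ben Zorn" to "Ben".
PROGRAMS = [first_word, all_but_last_word]

# Hypothetical remaining inputs in the user's column.
INPUTS = ["Rishabh Singh", "Ben Zorn", "Mary Ann Smith"]


def distinguishing_inputs(programs, inputs):
    """Return each input on which the programs disagree, with the competing outputs."""
    flagged = []
    for text in inputs:
        outputs = {program(text) for program in programs}
        if len(outputs) > 1:
            flagged.append((text, sorted(outputs)))
    return flagged


if __name__ == "__main__":
    for text, options in distinguishing_inputs(PROGRAMS, INPUTS):
        print(f"For {text!r}, should the output be one of {options}?")
```

The flagged rows are the ones worth highlighting as low confidence, and the competing outputs are natural choices to offer in a directed question.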

FlashExtract Demo (User Interaction Models)

Two key messages
1. Data wrangling is a killer application for inductive programming today!
– 99% of end users are non-programmers.
– Data scientists spend 80% of their time cleaning data.
2. Ambiguity resolution needs to be an integral part of inductive programming techniques.
– Cross-disciplinary inspiration from ML and HCI
“Programming by Examples (and its applications in Data Wrangling)”; Gulwani; 2016 [based on Marktoberdorf Summer School 2015 Lecture Notes]