Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University.

Presentation transcript:

Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2 Target audience In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces. Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). For professional programmers, programs are a deliverable. For EUPs, programs are a means to an end. motivation ● overview ● algorithm ● evaluation ● conclusion

3 Current practice: Storing data as strings
Typical tools of EUPs:
– Excel “text” cells
– Access/MSSQL “varchar” fields
– FrontPage/Dreamweaver “textfield” inputs
Validation involves…
– Learning an exotic new notation (VBScript, regexps, etc.)
– Writing cumbersome expressions in that notation
Most EUPs do not know these notations and have no time, interest, or incentive to learn them.

4 Our Topes system to date…
Formats are presented in human-readable notation in our format editor
– Format = sequence of parts with constraints on parts
– Constraints can be “often” true (rather than “always”)
The format is automatically converted to a context-free grammar, with constraints attached to productions.
At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value.

5 Needed: Format inference
Problem: To date, the system offers limited support for helping users get started. Although users do not need to learn a specialized notation, there is still the cognitive work of…
– examining data
– breaking it into parts
– representing parts in the format
Solution: Infer a boilerplate format from examples

6 Talk Outline
Motivation / Problem
Solution
– Overview of Topei
– Inference algorithm
Evaluation
Conclusion

7 Prototype: Task flow diagram
[Diagram] User highlights spreadsheet cells; then algorithm infers a format from cell values, or user creates a format from scratch, or user loads an existing format from a file; user reviews and customizes format; plug-in flags cells that don’t match format

8 Sample task: validating a spreadsheet with the prototype we have built
The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in.

9 Sample task: validating a spreadsheet
Customizing an inferred format
Inferred format is presented in the editor with sentence-like prompts to improve human-readability
User can specify meaningful names for parts

10 Sample task: validating a spreadsheet
Customizing constraints in our prototype
User can add/edit constraints

11 Sample task: validating a spreadsheet
Flagging potential errors
A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message

12 Our algorithm has 2 phases
Input: An array of strings
Phase 1: Identify format parts
Phase 2: Identify constraints on each part of each format
Output: An array of formats
– Sorted according to how many examples they match

13 Phase 1: Identify format parts
For each string, replace each character with its class, then collapse runs, generating a string “signature”
– Supported character classes:
   A  uppercase letter
   a  lowercase letter
   0  digit
Example: [figure]
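
The signature step above can be sketched in Python. This is a minimal illustration, not the prototype's code: the names `char_class` and `signature` are hypothetical, and passing any other character through unchanged (so that it can later act as a separator) is an assumption.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to its class symbol (ASCII only)."""
    if "A" <= ch <= "Z":
        return "A"  # uppercase letter
    if "a" <= ch <= "z":
        return "a"  # lowercase letter
    if "0" <= ch <= "9":
        return "0"  # digit
    return ch  # assumption: other characters are kept literally


def signature(value: str) -> str:
    """Replace each character with its class, then collapse runs."""
    classes = (char_class(ch) for ch in value)
    # groupby yields one key per run of identical consecutive symbols
    return "".join(key for key, _ in groupby(classes))
```

Under these assumptions, `signature("Chris")` gives `"Aa"` and `signature("412-555-0100")` gives `"0-0-0"`.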

14 Phase 1: Identify format parts
Pack strings with identical signatures (often leads to significant performance improvement)
Example: [figure]
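
Packing can be pictured as grouping the examples by signature, so that later steps work on one entry per distinct signature instead of one per string. The sketch below is hypothetical (the names `signature` and `pack` are not from the talk), and it orders the groups by size only to mirror the idea that formats are ranked by how many examples they match.

```python
from itertools import groupby


def signature(value: str) -> str:
    """Hypothetical helper: class-per-character, runs collapsed."""
    def char_class(ch: str) -> str:
        if "A" <= ch <= "Z":
            return "A"
        if "a" <= ch <= "z":
            return "a"
        if "0" <= ch <= "9":
            return "0"
        return ch  # assumption: other characters are kept literally
    return "".join(k for k, _ in groupby(char_class(ch) for ch in value))


def pack(values: list[str]) -> dict[str, list[str]]:
    """Group example strings by signature, largest group first."""
    groups: dict[str, list[str]] = {}
    for value in values:
        groups.setdefault(signature(value), []).append(value)
    # sorted() is stable, so equally sized groups keep first-seen order
    ordered = sorted(groups.items(), key=lambda item: -len(item[1]))
    return dict(ordered)
```

For example, `pack(["B.", "Chris", "Mary", "Brad"])` yields two groups, `"Aa"` with three names followed by `"A."` with one initial.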

15 Phase 1: Identify format parts
Align signatures based on separators
[figure: a. a. A. a]

16 Phase 1: Identify format parts
Abstract to least general composite character class, yielding the parts of each format.
Example (3 formats): [figure: a. a. A. a a0A aA a - a. a a. a. a]
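
One plausible reading of "least general composite character class" is the smallest set of character classes that still covers everything observed in a given part position. The sketch below follows that reading; it is an assumption about the step, not the talk's actual procedure, and `least_general_class` is a hypothetical name.

```python
CLASS_ORDER = "Aa0"  # uppercase, lowercase, digit (the classes named on slide 13)


def least_general_class(part_signatures: list[str]) -> str:
    """Union the class symbols seen in one aligned part position.

    Each input is the signature fragment of one part, e.g. "a" or "a0A".
    The result lists every class that occurs, in a fixed order, and
    nothing more: the narrowest composite class covering all inputs.
    """
    seen = set()
    for fragment in part_signatures:
        seen.update(fragment)
    return "".join(symbol for symbol in CLASS_ORDER if symbol in seen)
```

With fragments `["a", "aA", "a0A"]` in the same position, this returns `"Aa0"`, while `["a", "a"]` stays at `"a"`.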

17 Phase 2: Identify constraints on each part
Constrain each part’s contents to the character classes
Require indicated separators before/after parts
Infer an additional content constraint that is “often” true:
– Must be in a set of 3 or fewer literals?
– Must be in a numeric range?
– Must start with or end with certain characters?
A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature.
Afterward, the user can review/customize the format.
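
The 95% coverage rule can be illustrated for the first candidate constraint, "must be in a set of 3 or fewer literals". This is a sketch under assumptions: the function name and parameters are hypothetical, and choosing the most frequent literals as the candidate set is one simple way to apply the rule, not necessarily what the prototype does.

```python
from collections import Counter
from typing import Optional


def infer_literal_set(
    values: list[str], max_literals: int = 3, coverage: float = 0.95
) -> Optional[set[str]]:
    """Return a small literal set if it covers enough examples, else None."""
    if not values:
        return None
    # Candidate set: the max_literals most frequent values of this part
    top = Counter(values).most_common(max_literals)
    covered = sum(count for _, count in top)
    if covered / len(values) >= coverage:
        return {literal for literal, _ in top}
    return None  # no "often true" literal-set constraint is inferred
```

For 100 examples of a part made up of 60 "Mr", 35 "Ms", 3 "Dr" and 2 "Prof", the top three literals cover 98% and the set {"Mr", "Ms", "Dr"} is inferred; the two "Prof" values would then fall short of the constraint when checked.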

18 Evaluation as an outlier finder
Outlier finding:
– Infer a format from example values
– Use the inferred format to check the examples
→ Reveals “outliers” that might contain typos or other errors
Comparison algorithm: Lapis
[Figure: sample Lapis pattern for a Date, whose parts are defined as “Number equal to /[12][0-9]|3[01]|0?[1-9]/”, “Number equal to /1[012]|0?[1-9]/”, and “Number equal to /\d\d/”, ignoring either Spaces or Punctuation]
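
The infer-then-check loop can be sketched with a deliberately simplified stand-in for a format: the most common signature among the examples. Real formats also carry parts and soft constraints, so this is only an illustration, and `signature` and `find_outliers` are hypothetical names.

```python
from collections import Counter
from itertools import groupby


def signature(value: str) -> str:
    """Hypothetical helper: class-per-character, runs collapsed."""
    def char_class(ch: str) -> str:
        if "A" <= ch <= "Z":
            return "A"
        if "a" <= ch <= "z":
            return "a"
        if "0" <= ch <= "9":
            return "0"
        return ch  # assumption: other characters are kept literally
    return "".join(k for k, _ in groupby(char_class(ch) for ch in value))


def find_outliers(values: list[str]) -> list[str]:
    """Infer the dominant signature, then flag values that do not match it."""
    if not values:
        return []
    signatures = [signature(v) for v in values]
    dominant, _ = Counter(signatures).most_common(1)[0]
    return [v for v, s in zip(values, signatures) if s != dominant]
```

On a first-name column such as `["Chris", "Brad", "M.", "Mary", "J."]`, the initials `"M."` and `"J."` are the values flagged.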

19 Evaluation data
Drawn from EUSES spreadsheet corpus
6288 US phone numbers in 37 columns
– First cell in column contains “phone”
– And at least 20 cells have exactly 10 digits
– And at least 2/3 of cells have exactly 10 digits
1124 country names in 7 columns
– First cell in column contains “country”
– And there are at least 20 cells
– And at least one cell contains “Portugal”
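
The column-selection criteria can be written as two small predicates. The sketch assumes details the slide does not state: header matching is case-insensitive, the header cell is excluded from the counts, and only ASCII digits are counted. The function names are hypothetical.

```python
def count_digits(cell: str) -> int:
    """Number of ASCII digits in a cell."""
    return sum(1 for ch in cell if "0" <= ch <= "9")


def is_phone_column(cells: list[str]) -> bool:
    """Header contains "phone"; >= 20 and >= 2/3 of cells have 10 digits."""
    if not cells or "phone" not in cells[0].lower():
        return False
    body = cells[1:]  # assumption: the header is not counted
    ten_digit = sum(1 for cell in body if count_digits(cell) == 10)
    # Integer comparison avoids floating-point error in the 2/3 test
    return ten_digit >= 20 and 3 * ten_digit >= 2 * len(body)


def is_country_column(cells: list[str]) -> bool:
    """Header contains "country"; >= 20 cells; one contains "Portugal"."""
    if not cells or "country" not in cells[0].lower():
        return False
    body = cells[1:]  # assumption: the header is not counted
    return len(body) >= 20 and any("Portugal" in cell for cell in body)
```

A column headed "Phone" with 20 ten-digit values and 10 other cells passes exactly at the 2/3 boundary; one more non-matching cell makes it fail.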

20 Run each algorithm (Topei & Lapis) and compare their output to hand-labeling
For determining “true outliers” in calculating accuracy:
– Outlier phone numbers have an area code that is not in service, or contain errant separators (such as spaces) not shared by most cells in the column.
– Outlier country names contain abbreviations, misspellings, or a different name than the one usually used by English speakers, except for a specific list of allowed exceptions that are very commonly used (e.g., Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy]

21 Results: Topei’s precision/recall exceed Lapis’s
Standard machine learning measures for outlier finding
– Precision = # outliers found / # outliers claimed
– Recall = # outliers found / # true outliers
Task      Algorithm   Precision (%)   Recall (%)
Country   Topei
Country   Lapis
Phone     Topei
Phone     Lapis
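
The two measures can be computed as below, reading "# outliers found" as the number of claimed outliers that are also true outliers. The function name and the set-based inputs are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(claimed: set[str], true_outliers: set[str]) -> tuple[float, float]:
    """Precision and recall of an outlier finder against hand labels."""
    found = claimed & true_outliers  # claimed outliers that are real
    precision = len(found) / len(claimed) if claimed else 0.0
    recall = len(found) / len(true_outliers) if true_outliers else 0.0
    return precision, recall
```

If an algorithm claims four outliers, two of which are among three hand-labeled true outliers, precision is 2/4 and recall is 2/3.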

22 Limitations & future work
Topei still makes mistakes:
– Doesn’t infer constraints aggressively enough
– Doesn’t recognize non-ASCII chars in character classes
– Doesn’t handle formats with repeating parts
Need deeper integration with EUPs’ tools
Computational complexity:
– Intended to be O(# examples); this seems to hold in practice
– More careful verification needed
Usability has not yet been evaluated in a user study

23 Thank You…
…to you for your interest and attention
…to INSTICC for the opportunity to present
…to NSF and EUSES for funding (ITR and CCF )

24 Another example: Carnegie Mellon University phone #: [figure]

25 Integration
Formats can be inferred from…
– Spreadsheet cells
– Database queries (e.g., Access/MSSQL)
– Arbitrary collection of text strings (via C#)
Formats can then be used without modification in other venues, as well.
– E.g., infer a format from spreadsheet cells, then use it to create a trigger for a database table

26 Related Work
Many algorithms train a recognizer to notice features
– See (Mitchell, 1997) for a summary
– Such algorithms do not infer a human-editable format.
Others generate formats in specialized notation.
– (Miller, 2001) (Blackwell, 2001) (Lerman, 2000) (Lieberman, 2001) (Nardi, 1998)
– Regular expressions and CFGs have limited readability.
Several tools recognize or manipulate some of the same kinds of data as Topei.
– (Hong, 2006) (Pandit, 1997) (Stylos, 2004)
– Custom formats are unsupported (only hardcoded formats)