Topes: Enabling End-User Programmers to Validate and Reformat Data
Christopher Scaffidi
Key collaborators: Brad Myers, Mary Shaw
Carnegie Mellon University



2 Hurricane Katrina “Person Locator” site: many inputs unvalidated... and error-ridden
Introduction → Challenges → Topes → Tools → Evaluation → Conclusion

3 Data errors reduce the usefulness of data.
Even little typos impede data de-duplication.
Age is not useful for flying my helicopter to come rescue you.
Nor is a “city name” with 1 letter.

4 Hurricane Katrina sites are not alone in lacking input validation.
E.g.: Google Base web application
  – 13 primary web forms
  – Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
E.g.: Spreadsheets
  – 40% of cells are non-numeric, non-date textual data
  – Often used to gather/organize textual data for reports

5 Outline
1. Challenges of data validation
2. Topes
  – Model for describing data
  – Tools for creating/using topes
3. Evaluations
4. Conclusion

6 Digging into the details: real user inputs that need validation.
Sources:
  – Interviews of Hurricane Katrina website creators
  – Survey of Information Week readers
  – Contextual inquiry of information workers who created and used websites
  – Logs of what administrative assistants typed into browsers
  – Exploration of the EUSES spreadsheet corpus
Validating user inputs has 3 primary challenges…

7 Challenge 1: Inputs don’t always conform well to the simple “binary” validation model.
Data is sometimes questionable… yet valid.
  – E.g.: a suspiciously long address
  – In practice, person names and other proper nouns are never validated with regexps… too brittle.
  – Life is full of corner cases and exceptions.
If code can identify questionable data, then it can double-check the data:
  – Ask an application end user to confirm the input
  – Flag the input for checking by a system administrator
  – Compare the value to a list of known exceptions
  – Call up a server and see if it can confirm the value

8 Challenge 2: User inputs can often occur in multiple different formats.
Two different strings can be equivalent.
  – How many ways can you write a date?
  – What if an end user types a date in the wrong format?
  – “Jan ” and “1/1/2007” mean the same thing because of the category that they are in: date.
  – Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation.
If code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format as needed.
  – Display the result so users can check/fix the interpretation
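One way to picture this challenge is a small Python sketch that tries several date formats and reports every distinct interpretation, so that an ambiguous input can be shown to the user for checking. The format list and the name `interpretations` are assumptions made for illustration; they are not part of the tools described in this talk.

```python
from datetime import datetime

# Hypothetical set of accepted formats (an assumption, not from the slides).
KNOWN_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%b %d, %Y"]


def interpretations(text):
    """Return the distinct ISO dates that `text` could mean, in format order."""
    results = []
    for fmt in KNOWN_FORMATS:
        try:
            iso = datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format does not recognize the string
        if iso not in results:
            results.append(iso)
    return results


# One interpretation: safe to reformat automatically.
print(interpretations("Jan 1, 2007"))
# Two interpretations: display them and let the user pick.
print(interpretations("3/4/2007"))
```

An empty result means no format recognized the input, and more than one result means the code should ask rather than guess.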

9 Challenge 3: The meaning of data is often tied to its “parts”, not directly to its characters.
Data often has parts, each with a meaning.
  – What are the parts of a date, 12/31/2008?
  – Valid data obeys intra- and inter-part constraints.
  – Constraints are usually platform-independent.
  – Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.
If code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
  – Especially if it were platform-independent!
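The date example can be sketched in Python by splitting the string into named parts and then stating the constraints on those parts directly. This is only an illustration: the helper names `split_us_date` and `check_date_parts` are hypothetical, and month/day/year order is assumed.

```python
import calendar


def split_us_date(text):
    """Split 'month/day/year' into integer parts (assumes US ordering)."""
    month, day, year = (int(part) for part in text.split("/"))
    return month, day, year


def check_date_parts(month, day, year):
    """Return the list of violated constraints; an empty list means valid."""
    problems = []
    # Intra-part constraint: the month alone must be in range.
    if not 1 <= month <= 12:
        problems.append("month must be 1-12")
    # Inter-part constraint: the allowed days depend on month and year.
    if day < 1:
        problems.append("day must be at least 1")
    elif 1 <= month <= 12 and 1 <= year <= 9999:
        if day > calendar.monthrange(year, month)[1]:
            problems.append("day exceeds the length of the month")
    return problems


print(check_date_parts(*split_us_date("12/31/2008")))
print(check_date_parts(*split_us_date("2/30/2008")))
```

The day-versus-month rule is a one-line check on parts, whereas encoding it in a single regular expression (leap years included) is exactly the kind of translation the slide calls tough.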

10 Limitations of existing approaches
Types do not support questionable values.
Grammars do not, either, nor can they reformat.
Information extraction algorithms rely on grammatical cues that are absent during validation.
Cues, Forms/3, -calculus, Slate, pollution markers, etc., infer numerical constraints but not constraints on strings, nor are they platform-independent.

11 Imagine a world where…
Code can ask an oracle, “Is this a company name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.
Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.
People teach the oracle about a new data category by concisely stating its parts and constraints.

12 New Approach: Topes
A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data.
Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain.
Validating with topes improves
  – Accuracy of validation
  – Reusability of validation code
  – Consistency of data formatting

13 A tope is a graph. Node = format, edge = transformation.
Notional representation for a CMU room number tope…
  – Formal building name & room number: Elliot Dunlap Smith Hall 225
  – Colloquial building name & room number: Smith 225
  – Building abbreviation & room number: EDSH 225
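The graph view can be sketched in Python as a dictionary of edges, with a breadth-first search that chains transformations when no direct edge exists. The single-building lookup table, the choice of which edges exist, and the names `make_trf` and `transform` are all assumptions for illustration, not the actual tope implementation.

```python
from collections import deque

# One illustrative building taken from the slide; a real table would hold many.
BUILDINGS = [
    {"formal": "Elliot Dunlap Smith Hall", "colloquial": "Smith", "abbrev": "EDSH"},
]


def make_trf(src, dst):
    """Build a transformation (edge) that rewrites the building-name part."""
    def trf(text):
        name, room = text.rsplit(" ", 1)
        for building in BUILDINGS:
            if building[src] == name:
                return f"{building[dst]} {room}"
        raise ValueError(f"unknown building: {name!r}")
    return trf


# Edges of the graph; assumed here to be only three of the six possible pairs.
EDGES = {
    ("formal", "abbrev"): make_trf("formal", "abbrev"),
    ("abbrev", "colloquial"): make_trf("abbrev", "colloquial"),
    ("colloquial", "formal"): make_trf("colloquial", "formal"),
}


def transform(text, src, dst):
    """Follow the shortest chain of edges from format src to format dst."""
    queue = deque([(src, text)])
    seen = {src}
    while queue:
        fmt, value = queue.popleft()
        if fmt == dst:
            return value
        for (a, b), trf in EDGES.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, trf(value)))
    raise ValueError(f"no transformation path from {src} to {dst}")


print(transform("Elliot Dunlap Smith Hall 225", "formal", "colloquial"))
```

Here the formal-to-colloquial request has no direct edge, so the search goes through the abbreviation format.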

14 A tope is a conceptual abstraction. A tope implementation is code.
Each tope implementation has executable functions:
  – 1 $\mathit{isa}: \text{string} \to [0,1]$ function per format, for recognizing instances of the format (a fuzzy set)
  – 0 or more $\mathit{trf}: \text{string} \to \text{string}$ functions linking formats, for transforming values from one format to another
Validation function: $\phi(\mathit{str}) = \max_f \mathit{isa}_f(\mathit{str})$, where $f$ ranges over the tope’s formats
  – Valid when $\phi(\mathit{str}) = 1$
  – Invalid when $\phi(\mathit{str}) = 0$
  – Questionable when $0 < \phi(\mathit{str}) < 1$
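A minimal Python sketch of the validation function follows, assuming two hand-written `isa` functions for the room number example. The regular expressions and the score of 0.5 for a lowercase building name are illustrative assumptions; only the max-over-formats rule and the three outcomes come from the slide.

```python
import re


def isa_abbrev(s):
    """Format: building abbreviation and room number, e.g. 'EDSH 225'."""
    return 1.0 if re.fullmatch(r"[A-Z]{2,4} \d{3,4}", s) else 0.0


def isa_colloquial(s):
    """Format: colloquial building name and room number, e.g. 'Smith 225'."""
    if re.fullmatch(r"[A-Z][a-z]+ \d{3,4}", s):
        return 1.0
    if re.fullmatch(r"[a-z]+ \d{3,4}", s):
        return 0.5  # assumed soft constraint: missing capital is questionable
    return 0.0


ISAS = [isa_abbrev, isa_colloquial]


def phi(s, isas=ISAS):
    """Validation function: the best score over all of the tope's formats."""
    return max(isa(s) for isa in isas)


def classify(score):
    """Map a score in [0, 1] to the three outcomes named on the slide."""
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"


for text in ["EDSH 225", "smith 225", "???"]:
    print(text, classify(phi(text)))
```

A caller would accept valid strings, reject invalid ones, and route questionable ones to a confirmation step.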

15 Common kinds of topes: enumerations and proper nouns
Multi-format enumerations, e.g.: US states
  – “New York”, “CA”, maybe “Guam”
Open-set proper nouns, e.g.: company names
  – Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
  – Augmented with a pattern for promising inputs that are not yet on the whitelist
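The whitelist-plus-pattern idea for open-set proper nouns might look like the Python sketch below. The whitelist entries come from the slide, but the capitalized-words pattern and the score of 0.5 are assumptions chosen for illustration.

```python
import re

# Names that are definitely valid (from the slide's example).
WHITELIST = {"Google", "Google Corp", "GOOG"}

# Assumed pattern for "promising" inputs: one or more capitalized words.
PROMISING = re.compile(r"[A-Z][\w&.\-]*( [A-Z][\w&.\-]*)*")


def isa_company(name):
    """Score a candidate company name in [0, 1]."""
    name = name.strip()
    if name in WHITELIST:
        return 1.0  # known name: valid
    if PROMISING.fullmatch(name):
        return 0.5  # plausible but unknown: questionable
    return 0.0      # does not even look like a name: invalid


for candidate in ["Google", "Acme Widgets", "12345"]:
    print(candidate, isa_company(candidate))
```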

16 Two other common kinds of topes: numeric and hierarchical
Numeric, e.g.: human masses
  – Numeric and in a certain range
  – Values slightly outside the range might be questionable
  – Sometimes labeled with an explicit unit
  – Transformation usually by multiplication
Hierarchical, e.g.: address lines
  – Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
  – Simple isas can be implemented with regexps.
  – Transformations involve permutation of parts, lookup tables, and changes to separators & capitalization.
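A numeric tope for human mass can be sketched in Python as a range check with a questionable band around it, plus a unit transformation by multiplication. The specific bounds (1 to 200 kg as typical, up to 650 kg as questionable) are assumptions for illustration and do not come from the talk.

```python
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound


def isa_mass_kg(value):
    """Score a mass in kilograms; the bounds are illustrative assumptions."""
    if value <= 0 or value > 650:
        return 0.0  # invalid
    if 1 <= value <= 200:
        return 1.0  # within the assumed typical range: valid
    return 0.5      # slightly outside the range: questionable


def lb_to_kg(pounds):
    """Transformation between unit formats is a single multiplication."""
    return pounds * KG_PER_LB


print(isa_mass_kg(lb_to_kg(154)))  # an ordinary adult mass
print(isa_mass_kg(250))            # unusual, worth double-checking
```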

17 Tope Development Environment (TDE)
Topei module: infers a tope from examples
Toped module: enables end-user programmers (EUPs) to create/edit topes
Topeg module: generates context-free grammars and transformations
Topep module: parses data against grammars, performs transformations
Repository: stores topes for sharing/reuse
Plug-ins: read/write program data
  – Robofox: web macros
  – Vegemite/CoScripter: web macros
  – Microsoft Excel: spreadsheets
  – Visual Studio.NET: web applications
  – …

18 Toped User Interface
Features:
  – Format inference
  – Format/part names
  – Soft constraints
  – Value whitelists
  – Testing features
  – Format reusability

19 Integration with programming platforms
Microsoft Excel: buttons and menus
Visual Studio: drag-and-drop code generation

20 Integration with programming platforms
Recommends a tope for the data at hand
Convenient access to reformatting

21 Other integrations to date: CoScripter, Robofox, XML/HTML library

22 Evaluating accuracy
Implemented topes for spreadsheet data
  – Grouped 1712 columns of spreadsheet data (from the EUSES spreadsheet corpus) into data categories
  – Created 32 topes for the 32 most common data categories (~70% of the data)
  – Compared validation with topes to validation with regexps or enumerations from the web
  – Tope-based validation was over 3 times as accurate (for 5 formats or regexps per data category)

23 Evaluating reusability
Reused spreadsheet-based topes on webform data
  – Downloaded data for 8 data categories on Google Base and 5 in the Hurricane Katrina website
  – Reused spreadsheet-based topes on the web data
  – Validation was just as accurate (and sometimes even better, as the webform data was from just two sources and therefore less diverse than the spreadsheet data)

24 Evaluating support for data cleaning
Used topes to put web data into consistent formats
  – Again with the 5 columns in the Hurricane Katrina website
  – Used transformation functions to put each string into the most common format for that data category
  – Increased the number of duplicate strings found by 10%
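The effect of reformatting on de-duplication can be illustrated with a short Python sketch that counts duplicates before and after putting each string into one common format. The tiny state lookup table and the sample data are invented for illustration and are unrelated to the Hurricane Katrina data or the reported 10% figure.

```python
from collections import Counter


def count_duplicates(strings, normalize=lambda s: s):
    """Count strings that repeat an earlier string once normalized."""
    counts = Counter(normalize(s) for s in strings)
    return sum(n - 1 for n in counts.values())


def normalize_state(text):
    """Hypothetical transformation into the two-letter state format."""
    table = {"pennsylvania": "PA", "pa": "PA", "louisiana": "LA", "la": "LA"}
    key = text.strip().lower().rstrip(".")
    return table.get(key, text.strip())


data = ["PA", "Pennsylvania", "pa.", "LA", "Louisiana", "TX"]
print(count_duplicates(data))                   # raw strings
print(count_duplicates(data, normalize_state))  # after reformatting
```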

25 Evaluating usability for data validation
End users validating data with single-format topes
  – Between-subjects lab study (early version of Toped)
  – 8 users validated spreadsheet data with Toped; for comparison, 8 users validated with Lapis patterns
  – Toped users found twice as many of the typos as Lapis users
  – Topes were 50% more accurate than Lapis patterns
  – Toped gave significantly higher user satisfaction
  – (Comparison to an earlier regular expression study that had similar but not identical tasks: Toped users were faster and more accurate, but the difference was not statistically significant)

26 Evaluating usability for data reformatting
End users reformatting data with multi-format topes
  – Within-subjects lab study (latest version of Toped)
  – 9 users reformatted spreadsheet data by creating & using topes; for comparison, they then did it manually
  – Effort of creating a tope “pays off” at only 47 strings (further reuse is essentially “free”)
  – Every participant strongly preferred using Toped instead of doing tasks manually

27 Evaluating tope recommendations
Quickly recommend an existing tope for the data at hand
  – Supports keyword-based search + search-by-match (e.g.: topes that match “ ”)
  – Evaluated by searching through topes for the 32 most common data categories in the EUSES spreadsheet corpus, using strings from the corpus
  – High accuracy: recall over 80% (result set size = 5)
  – Adequate speed: a user is likely to have a few dozen topes on the computer, taking under 1 sec to search
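Search-by-match could be sketched in Python as scoring each stored tope against the example strings and returning the best few. This is a guess at the general idea rather than the actual recommendation algorithm; the three regex-based topes, the averaging rule, and the name `recommend` are assumptions.

```python
import re


def regex_isa(pattern):
    """Build a simple 0/1 isa function from a regular expression."""
    compiled = re.compile(pattern)
    return lambda s: 1.0 if compiled.fullmatch(s) else 0.0


# Hypothetical local collection of topes, keyed by name.
TOPES = {
    "phone": regex_isa(r"\d{3}-\d{3}-\d{4}"),
    "zip": regex_isa(r"\d{5}"),
    "year": regex_isa(r"\d{4}"),
}


def recommend(topes, samples, limit=5):
    """Rank topes by their average isa score over the sample strings."""
    if not samples:
        return []
    scored = []
    for name, isa in topes.items():
        average = sum(isa(s) for s in samples) / len(samples)
        if average > 0:
            scored.append((average, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


print(recommend(TOPES, ["15213", "70112"]))
```

The `limit` default of 5 mirrors the result set size used in the evaluation on the slide.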

28 Conclusion: Topes improve data validation
Validating with topes improves
  – Accuracy of validation
  – Consistency of data formatting
  – Reusability of validation code
Primary contributions:
  – Support for ambiguous data categories
  – Support for reformatting values
  – Platform-independent, reusable validation

29 Future work: quality control
Quality control (of topes) within the topes repository
  – Indicators of tope reusability
      E.g.: meaningful names given to parts in formats?
      E.g.: plenty of test strings that match the tope?
  – Extension of work on identifying reusable web macros
Quality control (by topes) of data exchange
  – Two modules (components/web services/…) may use the same kind of data, but require different formats.
  – Topes can automatically reformat strings on demand.
  – One step toward a larger goal… helping end users to create, share, and combine their code (ask for details!)

30 Thank You…
For this opportunity to present
To NSF for funding

31 Professional programmers use lots of tricks to simplify validation code.
E.g.: njtransit.com
Split inputs into many easy-to-validate fields.
  – Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?
Make users pick from drop-downs.
  – Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!)
I implemented this site in

32 Even with these tricks, writing validation is still very time-consuming.
Overall, the site had over 1100 lines of JavaScript just for validation… plus equivalent server-side Java code (too bad code isn’t platform-independent).
    if (!rfcCheck (frm.primary .value))
        return messageHelper(frm.primary , "Please enter a valid Primary address.");
    var atloc =
    if (atloc > 31 || atloc < frm.primary .value.length-33)
        return messageHelper(frm.primary , "Sorry. You may only enter 32 characters or less for your name\r\n"+
            "and 32 characters or less for your domain

33 That was the worst case. Best case: reusable regexps.
Many IDEs allow the programmer to enter one regular expression for validating each input field.
  – Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
  – So why don’t programmers validate most inputs?