Topes: Reusable Abstractions for Validating Data Christopher Scaffidi Brad Myers, Mary Shaw Carnegie Mellon University.


2 Even when lives are at stake, people still make typos.
Example: the Hurricane Katrina “Person Locator” Web site.
Outline: Problem, Topes, Validation, Conclusion

3 Data errors reduce the usefulness of data.
Kinds of errors: wrong data category, questionable input, incorrect formatting.

4 The website creators omitted input validation.
Primary reason: rejecting obviously wrong inputs would prevent collecting questionable data.
– E.g., would you accept a city name with 1 letter?
The slide shows the UI code for the web form where users entered data for this site. The UI was created with a RAD tool called CodeCharge Studio.

5 This site was not alone in lacking input validation.
E.g., the Google Base web application
– 13 primary web forms
– Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
E.g., spreadsheets
– 40% of cells are non-numeric, non-date textual data
– Commonly used to gather and organize textual data for reports

6 Validation of these short human-readable strings must support…
Testing membership in a data category
– Categories based on standards (e.g., address)
– Categories lacking standards (e.g., city name)
Ambiguously defined categories
– Identify questionable values for double-checking
Multiple formats
– Format consistency, post-validation
Platform-independent implementation
– Reuse in web apps, spreadsheets, and other platforms

7 Limitations of existing approaches
Types do not support questionable values.
Grammars do not either, nor can they reformat.
Information extraction algorithms rely on grammatical cues that are absent during validation.
Cues, Forms/3,  -calculus, Slate, pollution markers, etc. infer numerical constraints but not constraints on strings, and they are not platform-independent.

8 New approach: topes
A tope is a platform-independent abstraction describing how to recognize and transform strings in one category of data.
“Tope” is the Greek word for “place”: each tope corresponds to a data category with a natural place in the problem domain.
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification

9 A tope is a graph: node = format, edge = transformation.
Notional representation of a CMU room number tope:
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Colloquial building name & room number: Smith 225
– Building abbreviation & room number: EDSH 225
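The graph view can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the format names, the choice of which edges exist, and the string-replacement lookups are all assumptions built around the single room-number example on the slide.

```python
from collections import deque

# Hypothetical node names for the three formats on the slide.
FORMAL, COLLOQUIAL, ABBREV = "formal", "colloquial", "abbreviation"

# Edges of the graph: (source format, target format) -> transformation.
# Which edges exist, and the replacement rules, are assumptions that only
# cover the slide's one example building.
TRANSFORMS = {
    (FORMAL, COLLOQUIAL): lambda s: s.replace("Elliot Dunlap Smith Hall", "Smith"),
    (COLLOQUIAL, ABBREV): lambda s: s.replace("Smith", "EDSH"),
    (ABBREV, FORMAL): lambda s: s.replace("EDSH", "Elliot Dunlap Smith Hall"),
}


def find_path(src, dst):
    """Breadth-first search for a chain of formats leading from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for (a, b) in TRANSFORMS:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def transform(value, src, dst):
    """Apply the transformations along the path from src to dst."""
    path = find_path(src, dst)
    if path is None:
        raise ValueError(f"no transformation path from {src} to {dst}")
    for a, b in zip(path, path[1:]):
        value = TRANSFORMS[(a, b)](value)
    return value


print(transform("Elliot Dunlap Smith Hall 225", FORMAL, ABBREV))
```

Because the edges compose, a value can reach a format that has no direct transformation from its current one, as long as some path through the graph exists.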

10 A tope is a conceptual abstraction. A tope implementation is code.
Each tope implementation has executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
Validation function: φ(str) = max(isa_f(str)), where f ranges over the tope’s formats
– Valid when φ(str) = 1
– Invalid when φ(str) = 0
– Questionable when 0 < φ(str) < 1
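As a minimal sketch of these definitions, the Python below implements two isa functions and the validation function as the maximum over formats. The regular expressions, the set of known buildings, and the 0.5 score for unknown building names are assumptions chosen for illustration; the slides do not specify them.

```python
import re

# Assumed whitelist of colloquial building names (illustrative only).
KNOWN_BUILDINGS = {"Smith", "Wean", "Newell"}


def isa_abbrev(s):
    """Score 1.0 for strings shaped like 'EDSH 225', else 0.0."""
    return 1.0 if re.fullmatch(r"[A-Z]{2,4} \d{3,4}", s) else 0.0


def isa_colloquial(s):
    """Score strings shaped like 'Smith 225'.

    A known building scores 1.0; a well-formed but unknown building
    scores 0.5, which marks the input as questionable.
    """
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_BUILDINGS else 0.5


# One isa function per format of the tope.
ISA = {"abbreviation": isa_abbrev, "colloquial": isa_colloquial}


def phi(s):
    """Validation function: the maximum isa score over all formats."""
    return max(isa(s) for isa in ISA.values())


def classify(s):
    """Map the score onto the slide's three outcomes."""
    score = phi(s)
    if score == 1.0:
        return "valid"
    if score == 0.0:
        return "invalid"
    return "questionable"


for text in ["EDSH 225", "Smith 225", "Jones 101", "???"]:
    print(text, "->", classify(text))
```

Taking the maximum means a string only needs to match one format well; the intermediate scores are what let a caller ask for a double-check instead of rejecting outright.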

11 Common kinds of topes: enumerations and proper nouns
Multi-format enumerations, e.g., US states
– “New York”, “CA”, maybe “Guam”
Open-set proper nouns, e.g., company names
– Whitelist of definitely valid names (“Google”), with alternate formats (e.g., “Google Corp”, “GOOG”)
– Augmented with a pattern for promising inputs that are not yet on the whitelist
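Both kinds can be sketched as isa functions in Python. The tiny state table, the whitelist contents, the pattern for "promising" company names, and the 0.5 questionable score are assumptions for illustration, not data from the talk.

```python
import re

# Truncated, illustrative enumeration: full name -> postal abbreviation.
US_STATES = {"New York": "NY", "California": "CA"}
# Values that "maybe" belong to the category, per the slide's Guam example.
QUESTIONABLE_STATES = {"Guam": "GU"}


def isa_state(s):
    """Multi-format enumeration: accept either the name or the abbreviation."""
    if s in US_STATES or s in US_STATES.values():
        return 1.0
    if s in QUESTIONABLE_STATES or s in QUESTIONABLE_STATES.values():
        return 0.5
    return 0.0


# Whitelist of definitely valid names, including alternate formats.
COMPANY_WHITELIST = {"Google", "Google Corp", "GOOG"}
# Assumed pattern for promising inputs: capitalized words separated by spaces.
COMPANY_PATTERN = re.compile(r"[A-Z][\w&.\-]*( [A-Z][\w&.\-]*)*")


def isa_company(s):
    """Open-set proper noun: whitelist first, then fall back to the pattern."""
    if s in COMPANY_WHITELIST:
        return 1.0
    if COMPANY_PATTERN.fullmatch(s):
        return 0.5  # plausible, but not yet on the whitelist
    return 0.0


print(isa_state("CA"), isa_state("Guam"), isa_state("Narnia"))
print(isa_company("GOOG"), isa_company("Acme Widgets"), isa_company("12345"))
```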

12 Two more common kinds of topes: numeric and hierarchical
Numeric, e.g., human masses
– Numeric and in a certain range
– Values slightly outside the range might be questionable
– (Very rarely) labeled with an explicit unit
– Transformation usually by multiplication
Hierarchical, e.g., address lines
– Parts described with other topes (e.g., “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps
– Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables
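A rough Python sketch of one numeric and one hierarchical tope follows. The mass ranges, the street-suffix table, and the address regular expression are assumptions picked to make the example concrete; only the pound-to-kilogram factor is a fixed fact.

```python
import re

KG_PER_LB = 0.45359237  # exact by definition of the international pound


def isa_mass_kg(s):
    """Numeric tope: human mass in kilograms, with a soft outer range."""
    try:
        value = float(s)
    except ValueError:
        return 0.0
    if 2.0 <= value <= 200.0:   # assumed "certainly plausible" range
        return 1.0
    if 0.0 < value <= 400.0:    # assumed "slightly outside" range
        return 0.5
    return 0.0


def trf_lb_to_kg(s):
    """Transformation by multiplication, rounded to one decimal place."""
    return format(float(s) * KG_PER_LB, ".1f")


# Lookup table for the enum part of an address line (illustrative subset).
STREET_SUFFIXES = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}
# Parts: a number, a proper noun, and a suffix from the enum.
ADDRESS_RE = re.compile(r"(\d+) ([A-Z][a-z]+) (\S+)")


def isa_address_line(s):
    """Hierarchical tope: the line is valid when every part is valid."""
    m = ADDRESS_RE.fullmatch(s)
    if not m:
        return 0.0
    return 1.0 if m.group(3) in STREET_SUFFIXES else 0.0


def trf_address_long(s):
    """Transformation by lookup table: expand the abbreviated suffix."""
    m = ADDRESS_RE.fullmatch(s)
    if not m or m.group(3) not in STREET_SUFFIXES:
        raise ValueError(f"not an abbreviated address line: {s!r}")
    number, name, suffix = m.groups()
    return f"{number} {name} {STREET_SUFFIXES[suffix]}"


print(isa_mass_kg("70"), isa_mass_kg("250"), isa_mass_kg("-45"))
print(trf_lb_to_kg("150"))
print(trf_address_long("100 Main St."))
```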

13 Formal tool demonstration on Friday
Features:
– Format inference
– Format/part names
– Soft constraints
– Testing features
– Format reusability

14 Formal tool demonstration on Friday
Microsoft Excel: buttons and menus
Visual Studio: drag-and-drop code generation

15 Evaluating accuracy, reusability, and usefulness for data cleaning
Implemented topes for spreadsheet data
– 32 topes based on 720 online spreadsheets
– Tested accuracy
Reused topes on web application data
– 8 data categories in Google Base and 5 data categories in the Hurricane Katrina site
– Tested accuracy
Used transformations to reformat data
– 5 data categories in the Hurricane Katrina site
– Measured increase in number of duplicates identified

16 Extracting spreadsheet test data
Clustered spreadsheet columns based on data category
– EUSES spreadsheet corpus, “database” section
– Hierarchical agglomerative clustering
– Manual inspection
– Result = 1713 columns in 246 clusters (1 cluster per data category)
Created 1 tope for each of the 32 most common categories
– Yielding 32 topes
– Covered 70% of clustered columns

17 We considered 5 validation strategies
Strategy 1: Current spreadsheet practice (accept all inputs)
Strategy 2: Current web app practice (validate with regexp or fixed list, when available; accept all other inputs)
– 36 regexps + 35 fixed lists, in 7 categories
Strategy 3A: Tope rejecting questionable (accept when φ(str) = 1)
Strategy 3B: Tope accepting questionable (accept when φ(str) > 0)
Strategy 4: Tope warning on questionable (simulate double-check by user when 0 < φ(str) < 1)
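The five strategies reduce to small accept/reject rules. The Python below is one possible reading of the slide, not the evaluation code: the function names, the `pattern` and `fixed_list` parameters, and the `ask_user` callback that stands in for the simulated double-check are all hypothetical.

```python
import re


def strategy_1(value):
    """Current spreadsheet practice: accept every input."""
    return True


def strategy_2(value, pattern=None, fixed_list=None):
    """Current web app practice: use a regexp or fixed list when one exists."""
    if pattern is not None:
        return re.fullmatch(pattern, value) is not None
    if fixed_list is not None:
        return value in fixed_list
    return True  # no validator available for this category


def strategy_3a(phi):
    """Tope rejecting questionable: accept only a score of exactly 1."""
    return phi == 1.0


def strategy_3b(phi):
    """Tope accepting questionable: accept any score above 0."""
    return phi > 0.0


def strategy_4(phi, ask_user):
    """Tope warning on questionable: defer to the user for 0 < phi < 1."""
    if 0.0 < phi < 1.0:
        return ask_user()
    return phi == 1.0


# A questionable value (score 0.5) under each tope-based strategy.
print(strategy_3a(0.5), strategy_3b(0.5), strategy_4(0.5, lambda: True))
```

Strategies 3A, 3B, and 4 take the tope's validation score as input, so the same tope can back all three policies.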

18 Measurements
Based on 100 random values per category
Used F1 to measure accuracy
– Standard measure of accuracy for classifiers: F1 = (precision*recall)/avg(precision,recall)
Considered topes with 1, 2, 3, 4, or 5 formats
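The F1 formula on this slide can be checked with a short Python sketch. It assumes the positive class is "invalid input", which matches the later remark that accepting everything gives recall 0; the function names and the convention of returning 0 when a ratio is undefined are choices made here.

```python
def precision_recall(is_invalid, flagged):
    """Compute precision and recall for detecting invalid inputs.

    is_invalid[i] is True when value i is truly invalid;
    flagged[i] is True when the validator rejected value i.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    pairs = list(zip(is_invalid, flagged))
    tp = sum(1 for truth, hit in pairs if truth and hit)
    fp = sum(1 for truth, hit in pairs if not truth and hit)
    fn = sum(1 for truth, hit in pairs if truth and not hit)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """F1 as written on the slide: product divided by the average."""
    if precision + recall == 0:
        return 0.0
    return (precision * recall) / ((precision + recall) / 2)


truth = [True, True, False, False]
print(f1(*precision_recall(truth, [True, False, True, False])))
print(f1(*precision_recall(truth, [False, False, False, False])))
```

Dividing the product by the average is algebraically the same as the more familiar form 2·precision·recall / (precision + recall).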

19 Recognizing multiple formats and questionable inputs raises accuracy
Condition 4: Hypothetical user has to help on ~3% of inputs
Condition 1: Recall = 0 (fails to identify any invalid inputs)

20 Topes based on spreadsheet data were accurate on web application data.
Data sets shown: Hurricane Katrina, Google Base

21 Putting data in a consistent format improves duplicate identification.
Randomly extracted values for each of 5 Hurricane Katrina data categories
Implemented transformations for each 5-format tope, from the less commonly used formats to the most commonly used format
Found approximately 8% more duplicates after transformation
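The effect of reformatting on duplicate detection can be illustrated with a small Python sketch. The phone-number category, the sample values, and the choice of `412-555-0100` style as the "most commonly used" format are assumptions; the talk does not list the five Katrina categories or their formats.

```python
import re
from collections import Counter


def count_duplicates(values):
    """Count values that repeat an earlier, exactly equal value."""
    return sum(n - 1 for n in Counter(values).values())


def to_common_format(s):
    """Rewrite a 10-digit phone number as NNN-NNN-NNNN.

    Strings that do not contain exactly 10 digits are returned unchanged,
    so values in unrecognized formats are left for other handling.
    """
    digits = re.sub(r"\D", "", s)
    if len(digits) != 10:
        return s
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# Made-up sample: three spellings of one number plus a different number.
raw = ["412-555-0100", "(412) 555-0100", "412.555.0100", "412-555-0199"]
cleaned = [to_common_format(v) for v in raw]

print(count_duplicates(raw), count_duplicates(cleaned))
```

Exact string comparison finds no duplicates in the raw sample, while the same comparison on the reformatted values finds two, which is the kind of gain the slide measures.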

22 Topes improve data validation
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
Contributions:
– Support for ambiguous data categories
– Support for transforming values
– Platform-independent validation

23 Future work: sharing topes
Repository search mechanisms based on
– Relevance to new applications
– Quality criteria
Integrate with more programming platforms
– Microsoft Excel
– Microsoft Visual Studio.NET
– A simple XML processing API
– Univ. Nebraska’s Robofox
– IBM’s CoScripter
– Your tool or platform?

24 Thank you…
To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret Burnett, and others at ICSE 2007 for early feedback
To NSF for funding
To ICSE 2008 for this opportunity to present