A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006.

Presentation transcript:

2 Motivation Consider automating repetitive actions in a web browser. Our recent contextual inquiry revealed that administrative assistants fill out many expense reports. Given a location and date, they used a government site to find the per diem rate.

3 Motivation Web macros cannot automate this task. Existing macro tools cannot convert from two-letter state abbreviation to full state name.

4 Motivation Web macros cannot automate this task. Nor can they convert dates from MM/DD/YYYY to Month DD.

5 Motivation The world is full of “user-level” data. Examples: –Dates –Credit card numbers –Person names –Quantities of RAM –Product codes –State names –US phone numbers –Bus route numbers –Dewey decimal numbers –Etc… Such data are –“bigger” than floats and strings. –“smaller” than a database row. –typically domain-specific. –full of exceptional cases.

6 Problem Tools do not “understand” user-level data such as states and dates. Limited support for data manipulation –Reformatting data in web macros or spreadsheets –Transporting (transforming) data between applications Limited support for data validation –Are any values mistyped? –Does the dataset contain duplicates? Information Week respondents complained more about data manipulation & interoperability problems than about software reliability problems!

7 Problem To be useful, representations of these data must meet 3 requirements. Extensibility: Different people use different data. → Let users represent the data that they care about. Shareability: Different people sometimes use the same kinds of data. → Help end users find & evaluate representations of data. Flexibility: Data appear in many formats, with exceptions to every rule. → Support multiple formats, and permit exceptions.

8 Existing approaches Existing approaches do not meet the requirements. Regexps / grammars / data detectors represent syntax, not semantics (e.g.: how to represent “FL” = “Florida”?). Research on units typically applies only to numeric data in certain applications (e.g.: spreadsheets). Knowledge systems (e.g.: ConceptNet) do not contain representations of data formats. OO and formal types are too difficult for many end users and typically disallow exceptions to type rules. Federated database systems deal with heterogeneous joins but require the attention of a professional DBA. And each lacks built-in support for helping users decide whose code to trust.

9 Proposed model A “tope” defines the basic semantics of a single user-level data abstraction. A “tope” is a pair of functions defined by a user: isa: string → [0, 1] returns a context-independent estimate of the likelihood that the string is an instance of this tope. eq: string × string → [0, 1] returns an estimate of the likelihood that the strings are equivalent, conditional on being instances of this tope. Topes will be defined in files and compiled, just like types.
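
As a rough illustration of the pair of functions described on this slide, here is a minimal Python sketch of a tope for US state names. The `Tope` class, the `us_state` instance, and the three-entry lookup table are all hypothetical names introduced for this example; the slides do not prescribe any particular representation, and a real tope could return graded scores between 0 and 1 rather than only 0.0 and 1.0.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Tope:
    """Hypothetical container for the two functions a tope defines."""
    name: str
    isa: Callable[[str], float]      # string -> [0, 1]
    eq: Callable[[str, str], float]  # string x string -> [0, 1]


# Illustrative, deliberately incomplete table (assumption for this sketch).
STATE_NAMES = {"FL": "Florida", "PA": "Pennsylvania", "NE": "Nebraska"}


def _canonical(s: str) -> Optional[str]:
    """Map an abbreviation or a full name to its abbreviation, else None."""
    t = s.strip()
    if t.upper() in STATE_NAMES:
        return t.upper()
    for abbr, full in STATE_NAMES.items():
        if full.lower() == t.lower():
            return abbr
    return None


def state_isa(s: str) -> float:
    # This sketch is all-or-nothing; a fuller tope might score near-misses.
    return 1.0 if _canonical(s) is not None else 0.0


def state_eq(a: str, b: str) -> float:
    ca, cb = _canonical(a), _canonical(b)
    return 1.0 if ca is not None and ca == cb else 0.0


us_state = Tope("us_state", state_isa, state_eq)
```

With this sketch, `us_state.eq("FL", "florida")` treats the abbreviation and the full name as the same state, which is the kind of semantic equivalence that a purely syntactic pattern cannot express.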

10 Proposed model Reformatting functions would transform instances from one tope to another. Two topes are “isotopes” if instances of one can be reformatted into the other. fmt: string → string treats the input as an instance of one tope and returns an equivalent string that is in another format.
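
A reformatting function of this shape might look like the following Python sketch, which covers the MM/DD/YYYY to “Month DD” conversion mentioned in the motivation slides. The function name `fmt_mdy_to_month_day` and the choice to raise `ValueError` on invalid input are assumptions made for illustration; the slides do not say how a `fmt` function should signal failure.

```python
import re

# Hard-coded lookup table, so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

_MDY = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def fmt_mdy_to_month_day(s: str) -> str:
    """Treat s as MM/DD/YYYY and return the equivalent 'Month DD' string."""
    m = _MDY.fullmatch(s.strip())
    if m is None:
        raise ValueError(f"not in MM/DD/YYYY form: {s!r}")
    month, day = int(m.group(1)), int(m.group(2))
    # Coarse range checks only; days per month are not validated here.
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        raise ValueError(f"month or day out of range: {s!r}")
    return f"{MONTHS[month - 1]} {day:02d}"
```

Note that this particular conversion drops the year, so it cannot be inverted without extra context.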

11 Proposed meta-model Repositories would permit sharing of topes. Topes could be implemented in arbitrary languages. The binaries would be stored in “repositories”. Each user might subscribe to multiple repositories: –personal repository of custom topes –university repository of organization-specific topes –general repository of generic topes Users would be able to define new topes, search for topes, add them to their repository, and prune their repository of outdated topes.
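
The subscription idea on this slide can be sketched in a few lines of Python. Everything here is hypothetical: the `Repository` class, its `add`, `prune`, and `find` methods, and the rule that earlier subscriptions take precedence are assumptions for illustration, not part of the proposal as stated.

```python
from typing import Callable, Dict, List, Optional

Isa = Callable[[str], float]


class Repository:
    """Hypothetical in-memory repository mapping tope names to isa functions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._topes: Dict[str, Isa] = {}

    def add(self, tope_name: str, isa: Isa) -> None:
        self._topes[tope_name] = isa

    def prune(self, tope_name: str) -> None:
        # Removing a tope that is absent is a no-op.
        self._topes.pop(tope_name, None)

    def find(self, tope_name: str) -> Optional[Isa]:
        return self._topes.get(tope_name)


def resolve(tope_name: str, subscriptions: List[Repository]) -> Optional[str]:
    """Return the name of the first subscribed repository holding the tope.

    Assumption: subscriptions are searched in order, so a personal
    repository listed first shadows a general one listed later.
    """
    for repo in subscriptions:
        if repo.find(tope_name) is not None:
            return repo.name
    return None
```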

12 Proposed meta-model A meta-model would represent aspects of tope trustworthiness. How do end users decide which topes to use? –Topes would be annotated with platform (e.g.: JDK1.5), author names, and other meta information to facilitate finding and choosing topes. What if a tope consistently over-estimates “isa” scores? –Let the user give feedback, either positive or negative. –“Wrap” the tope and correct for its bias (renormalization). What other renormalization might be handy? –Words in the context might “cue” that certain tope instances are nearby; “isa” functions for those topes may be more trustworthy in that context.
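
One simple way to “wrap” a tope and correct for its bias is sketched below in Python. The class name `RenormalizedIsa` and the scaling rule (ratio of confirmed instances to the sum of raw scores over the feedback seen so far) are assumptions chosen for brevity; the slides do not specify a renormalization formula.

```python
from typing import Callable


class RenormalizedIsa:
    """Wrap an isa function and rescale its scores using user feedback."""

    def __init__(self, isa: Callable[[str], float]) -> None:
        self._isa = isa
        self._sum_scores = 0.0  # total raw score over feedback examples
        self._sum_truth = 0.0   # number of examples the user confirmed

    def feedback(self, s: str, is_instance: bool) -> None:
        """Record whether the user says s really is an instance."""
        self._sum_scores += self._isa(s)
        self._sum_truth += 1.0 if is_instance else 0.0

    def __call__(self, s: str) -> float:
        raw = self._isa(s)
        if self._sum_scores == 0.0:
            return raw  # no usable feedback yet: pass the score through
        scale = self._sum_truth / self._sum_scores
        return max(0.0, min(1.0, raw * scale))  # keep the result in [0, 1]
```

If a tope always answers 0.8 but the user confirms only half of the examples, the wrapper scales later scores down to about 0.5.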

13 Tope implementation Macro tools would download topes on demand from repositories. Back to the macro example… 1. The macro tool retrieves topes. 2. The tool tries to infer a tope for each value in the macro. The user could override this assignment, of course. 3. The tool can now automatically reformat data if needed.
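
Step 2 could plausibly be done by scoring each value against every available tope and keeping the best match. The Python sketch below assumes that approach; the tope names, the two toy `isa` functions, and the 0.5 threshold are hypothetical.

```python
import re
from typing import Callable, Dict, Optional

Isa = Callable[[str], float]


def isa_state_abbrev(s: str) -> float:
    # Toy table with three entries, for illustration only.
    return 1.0 if s in {"FL", "PA", "NE"} else 0.0


def isa_date_mdy(s: str) -> float:
    return 1.0 if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", s) else 0.0


TOPES: Dict[str, Isa] = {
    "state_abbrev": isa_state_abbrev,
    "date_mdy": isa_date_mdy,
}


def infer_tope(value: str,
               topes: Dict[str, Isa] = TOPES,
               threshold: float = 0.5) -> Optional[str]:
    """Return the name of the best-scoring tope, or None if none is confident."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name, isa in topes.items():
        score = isa(value)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Returning `None` for low-confidence values leaves room for the user override mentioned in step 2.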

14 Tope implementation Most isa functions could be implemented with an augmented context-free grammar. We logged data from information workers’ web browsers. It appears that most data can be recognized using probabilistic context-free grammars with constraints on the grammar terms. –E.g.: time → HH : MM ap; HH, MM → ## {MM >= 0 && MM <= 59 && … } I will need to… –Verify that an augmented grammar is expressive enough –Identify what constraint primitives are necessary
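
The time example can be approximated in Python with a pattern for the syntax plus explicit checks for the constraints. This is only a sketch: a regular expression stands in for the grammar, the result is not probabilistic, and the hour range of 1 to 12 is an assumption, since the slide elides the remaining constraints.

```python
import re

# Syntax: HH : MM ap, where "ap" is a/p with an optional trailing "m".
_TIME = re.compile(r"(\d{1,2})\s*:\s*(\d{2})\s*([ap])m?", re.IGNORECASE)


def isa_time(s: str) -> float:
    """Return 1.0 if s looks like a 12-hour clock time, else 0.0."""
    m = _TIME.fullmatch(s.strip())
    if m is None:
        return 0.0
    hh, mm = int(m.group(1)), int(m.group(2))
    if not 0 <= mm <= 59:   # constraint stated on the slide
        return 0.0
    if not 1 <= hh <= 12:   # assumed constraint, not given on the slide
        return 0.0
    return 1.0
```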

15 Tope implementation Most equivalence and reformatting functions are built from very few primitives. Equivalence functions combine –Lookup in hard-coded tables –Arithmetic and conditionals –Numeric comparisons –“Identicalness” comparison –Case-insensitive comparison Reformatting functions combine –Lookup in hard-coded tables –Arithmetic and conditionals –Permutation
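
To show how the primitives listed above can be combined, the following Python sketch builds an equivalence function for quantities of RAM from a table lookup, arithmetic, and a numeric comparison. The name `eq_ram`, the unit table, and the use of binary multiples (1 KB = 1024 bytes) are assumptions made for this example.

```python
import re
from typing import Optional

# Lookup in a hard-coded table (assumption: binary multiples).
UNIT_BYTES = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

_QUANTITY = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*")


def _to_bytes(s: str) -> Optional[float]:
    m = _QUANTITY.fullmatch(s)
    if m is None:
        return None
    unit = m.group(2).lower()  # case-insensitive comparison of the unit
    if unit not in UNIT_BYTES:
        return None
    return float(m.group(1)) * UNIT_BYTES[unit]  # arithmetic


def eq_ram(a: str, b: str) -> float:
    """Return 1.0 if both strings denote the same number of bytes."""
    x, y = _to_bytes(a), _to_bytes(b)
    if x is None or y is None:
        return 0.0
    return 1.0 if x == y else 0.0  # numeric comparison
```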

16 Tope implementation A prototype system will help users create, compile, and share topes. I will need to provide a prototype system with… –A user-friendly editor for end users to define topes. –A program that turns these definitions into binary modules. –A repository server to store binaries and the meta-model. Remember: Sophisticated end users (or professionals) can fall back on an arbitrary language to create topes. The simple grammar language is for the common case.

17 Applications of topes Equipping tools with topes will help users create programs of higher quality. During end user programming, tools would download useful topes from repositories. –Web macros could perform reformatting automatically. → Improved composability of web macros –Spreadsheets could be checked for malformed values. → Improved correctness of spreadsheets –Web applications could validate & reformat data from web. → Improved security of applications

18 Applications of topes Providing a tope system to users will improve data interoperability. When receiving data, users could define a new tope (or use an existing tope) and apply it to validate data. –Users could reformat values to a uniform format. –Users could find and remove duplicate values. Data could carry along tope definitions, particularly if the representation is “secure” (e.g.: context-free grammar) –a form of self-describing data.
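
Duplicate removal with a tope's `eq` function might be sketched as follows in Python. The helper `remove_duplicates`, the 0.5 threshold, and the toy `eq_case_insensitive` function are hypothetical; the quadratic scan is kept for clarity rather than speed.

```python
from typing import Callable, List


def eq_case_insensitive(a: str, b: str) -> float:
    """Toy eq function: equal if the strings match ignoring case and padding."""
    return 1.0 if a.strip().casefold() == b.strip().casefold() else 0.0


def remove_duplicates(values: List[str],
                      eq: Callable[[str, str], float],
                      threshold: float = 0.5) -> List[str]:
    """Keep the first value from each group that eq judges equivalent."""
    kept: List[str] = []
    for v in values:
        if not any(eq(v, k) >= threshold for k in kept):
            kept.append(v)
    return kept
```

Swapping in a richer `eq`, such as one that knows “FL” and “Florida” are the same state, would let the same loop catch duplicates that differ in format rather than only in case.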

19 Thank You… To Mary Shaw, Brad Myers, Robin Abraham, Allen Blackwell, Margaret Burnett, Michael Coblenz, Allen Cypher, Sebastian Elbaum, Martin Erwig, Josh Gross, John Hosking, Andy Ko, Andhy Koesnandar, Henry Lieberman, Ericka Orrick, John Pane, Mary Beth Rosson, Jeff Stylos, Steve Tanimoto, and others for helpful discussions. To NSF for funding