Intelligently Creating and Recommending Reusable Reformatting Rules Christopher Scaffidi Brad Myers, Mary Shaw Carnegie Mellon University.



2 People often use spreadsheets to store and organize “string” data
According to a study by Univ. Nebraska, nearly 40% of spreadsheet cells are strings (i.e., not numbers, formulas, or dates).
Example task found while observing administrative assistants (contextual inquiry): build a roster of employee contact info
–Visit several project teams’ web sites
–Copy data from web sites into spreadsheet
–Manually put data into consistent format (because users care about formatting when creating reports)
Introduction  Editor  Recommendation  Evaluation

3 Mishmash of formats and invalid strings
–Illustrative example (not actually the spreadsheet in the contextual inquiry)
–Part of an actual spreadsheet from CMU web site

4 Needed: automated support for validating and reformatting domain-specific strings
Finding and fixing strings is tedious and error-prone.
Excel and other tools provide no features for automatically reformatting domain-specific strings
–Only for numeric data & a few specific kinds of strings (not domain-extensible)

5 Underlying problem: abstraction mismatch
Tools support strings, ints, floats, sometimes dates.
Problem domain involves higher-level, multi-format categories of strings:
–Person names
–CMU department names
–CMU course numbers
–CMU building room numbers

6 Tope: Each tope describes how to validate and reformat one kind of string
A notional depiction of a tope for CMU room numbers (node = format, edge = reformatting rule):
–Formal building name & room number: Elliot Dunlap Smith Hall 225
–Colloquial building name & room number: Smith 225
–Building abbreviation & room number: EDSH 225
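The graph depiction can be sketched in code. This is a minimal sketch under stated assumptions, not the Toped++ implementation: the `Tope` class, the use of regular expressions as format recognizers, and the prefix-swapping rules are all hypothetical. Only the three example strings come from the slide.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Tope:
    """Hypothetical tope: nodes are formats, edges are reformatting rules."""
    name: str
    # format name -> regular expression recognizing that format (an assumption)
    formats: Dict[str, str] = field(default_factory=dict)
    # (source format, target format) -> function that rewrites the string
    rules: Dict[Tuple[str, str], Callable[[str], str]] = field(default_factory=dict)

    def format_of(self, s: str) -> Optional[str]:
        """Return the first format whose pattern matches s, or None if s is invalid."""
        for fmt, pattern in self.formats.items():
            if re.fullmatch(pattern, s):
                return fmt
        return None

    def reformat(self, s: str, target: str) -> str:
        """Follow the edge from the string's current format to the target format."""
        source = self.format_of(s)
        if source is None:
            raise ValueError(f"{s!r} is not a valid {self.name}")
        if source == target:
            return s
        return self.rules[(source, target)](s)


# Building names taken from the slide; one name per format.
NAMES = {
    "formal": "Elliot Dunlap Smith Hall",
    "colloquial": "Smith",
    "abbreviation": "EDSH",
}

room = Tope(
    name="CMU room number",
    formats={fmt: re.escape(name) + r" \d+" for fmt, name in NAMES.items()},
)

# One edge per ordered pair of formats: swap the building name, keep the room number.
for src, old in NAMES.items():
    for dst, new in NAMES.items():
        if src != dst:
            room.rules[(src, dst)] = lambda s, old=old, new=new: new + s[len(old):]

print(room.reformat("Smith 225", "abbreviation"))
print(room.reformat("EDSH 225", "formal"))
```

Keeping validation (`format_of`) and reformatting (`reformat`) on the same object mirrors the slide's point that one tope covers both jobs for one kind of string.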

7 What’s new and interesting today? Auto-reformatting and recommendation
Previous work:
–Early tope editing tool for creating topes to validate and reformat spreadsheet, web form, and web macro data [ICSE’08, FSE’08]
–Inferring new topes from example strings [ICEIS’07]
–Usability evaluation of the early tope editing tool [ISEUD’09]
Limitations of previous work:
–Tedious to implement reformatting rules
–Tedious to reuse topes
Contributions today:
–Automatic reformatting
–Tope recommendation

8 New “Format As” feature

9 Today’s presentation
Introduction
–Problem overview
–Topes overview
Tope editing tool: Toped++
Tope recommendation
Evaluation
–Evaluation of usability
–Evaluation of accuracy & speed
Conclusion

10 Creating a new tope
Highlight cells containing example strings; the system then infers a boilerplate tope.

11 Toped++: an improved editor for topes
Data Description Editor

12 Whitelist tab
Other kinds of data easily described with a whitelist:
–US state names & abbreviations
–Campus building names & abbreviations

13 Auto-reformatting
Topes with a single word-like part
–4 formats: UPPER CASE, lower case, Title Case, miXeD cAse
Topes with a single numeric part
–One format per # digits allowed: pad with “0” and/or round
Topes with multiple parts and separators
–(Recursively) reformat each part, then concatenate with separators
Topes that also have a whitelist
–One format per synonym column: use lookup table
Important: after reformatting, test the resulting string against the target format’s grammar to detect errors.
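The four reformatting cases and the final validation step can be sketched as follows. This is an illustrative sketch, not the tool's code: every function name is hypothetical, "# digits" is read here as decimal digits, a regular expression stands in for the target format's grammar, and the state rows are sample data.

```python
import re


def reformat_word(s, case):
    """Single word-like part: pick one of the four case formats ('mixed' keeps s as typed)."""
    return {"upper": s.upper(), "lower": s.lower(), "title": s.title(), "mixed": s}[case]


def reformat_number(s, decimals):
    """Single numeric part: round or zero-pad to a fixed number of decimal digits (assumed reading)."""
    return f"{float(s):.{decimals}f}"


def reformat_whitelist(s, rows, target_col):
    """Whitelist: each row lists synonyms; each column is one format."""
    for row in rows:
        if s in row:
            return row[target_col]
    raise ValueError(f"{s!r} is not in the whitelist")


def reformat_parts(parts, part_rules, separators):
    """Multiple parts: reformat each part, then concatenate with the target separators."""
    pieces = [rule(part) for rule, part in zip(part_rules, parts)]
    out = pieces[0]
    for sep, piece in zip(separators, pieces[1:]):
        out += sep + piece
    return out


def check_against_grammar(result, target_grammar):
    """Final step from the slide: test the result against the target format to detect errors."""
    if re.fullmatch(target_grammar, result) is None:
        raise ValueError(f"{result!r} does not fit the target format")
    return result


# Sample whitelist: column 0 = full state name, column 1 = abbreviation.
STATES = [("Pennsylvania", "PA"), ("Ohio", "OH")]

city_state = reformat_parts(
    parts=["pittsburgh", "Pennsylvania"],
    part_rules=[
        lambda p: reformat_word(p, "title"),
        lambda p: reformat_whitelist(p, STATES, 1),
    ],
    separators=[", "],
)
result = check_against_grammar(city_state, r"[A-Z][a-z]+, [A-Z]{2}")
print(result)
```

The closing check matters because a part-by-part rewrite can succeed mechanically and still produce a string the target format rejects, for example when the input contained a typo.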

14
Introduction
–Problem overview
–Topes overview
Tope editing tool: Toped++
Tope recommendation
Evaluation
–Evaluation of usability
–Evaluation of accuracy & speed
Conclusion

15 Supporting reuse: Recommendation via search-by-match algorithm
Algorithm summary:
1. Sort topes by # keywords hit
2. Break ties by testing examples against whitelists
3. Break remaining ties by testing examples against the rest of the tope
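The three-step ranking can be sketched as a sort on a tuple of scores, so that each later score only breaks ties left by the earlier ones. This is a sketch under assumptions, not the authors' implementation: the `Tope` fields are hypothetical, and a single regular expression stands in for "the rest of the tope". For brevity it computes all three scores for every tope, whereas an implementation could compute the later ones only when ties occur.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Tope:
    """Hypothetical, simplified tope used only for ranking."""
    name: str
    keywords: Set[str]
    whitelist: Set[str] = field(default_factory=set)
    pattern: str = ""  # stand-in for the non-whitelist part of the tope


def recommend(topes: List[Tope], keywords: List[str], examples: List[str]) -> List[Tope]:
    """Rank topes by keyword hits, then whitelist hits, then pattern hits."""
    query = {k.lower() for k in keywords}

    def score(t: Tope):
        keyword_hits = len(query & t.keywords)
        whitelist_hits = sum(e in t.whitelist for e in examples)
        pattern_hits = sum(
            bool(t.pattern) and re.fullmatch(t.pattern, e) is not None
            for e in examples
        )
        # Tuples compare left to right, which gives the tie-breaking order.
        return (keyword_hits, whitelist_hits, pattern_hits)

    return sorted(topes, key=score, reverse=True)


topes = [
    Tope("Phone number", {"phone"}, pattern=r"\d{3}-\d{3}-\d{4}"),
    Tope("Country", {"state", "country"}, whitelist={"Canada"}),
    Tope("US state", {"state"}, whitelist={"Pennsylvania", "PA", "Ohio", "OH"}),
]

ranked = [t.name for t in recommend(topes, keywords=["State"], examples=["PA", "OH"])]
print(ranked)
```

In this toy query, "US state" and "Country" tie on the keyword, and the whitelist test separates them.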

16 Implementation details: Speeding up the recommendation
Counting keyword hits and whitelist hits is easy: just use an inverted index.
But testing every example on every tope is wasteful. Why test a tope if it couldn’t match anyway?
For example, if a phone number tope can only match formats like “ ” and “ ”, then it only needs to be tested against examples that have 10 digits and 2 hyphens or digits.
–Index topes according to their “character content”
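One way to read "character content" is as a count of each character class in a string. The sketch below makes that reading explicit. It is an assumption, not the published index structure: the signature function, the `register` and `candidates` helpers, and the sample phone numbers are all hypothetical, and registering by sample strings only works for formats whose character counts are fixed.

```python
from collections import Counter, defaultdict


def char_content(s):
    """Summarize a string as sorted (character class, count) pairs."""
    counts = Counter()
    for ch in s:
        if ch.isdigit():
            counts["digit"] += 1
        elif ch.isalpha():
            counts["letter"] += 1
        else:
            counts[ch] += 1  # punctuation and spaces are counted individually
    return tuple(sorted(counts.items()))


# signature -> names of topes that could match a string with that signature
index = defaultdict(set)


def register(tope_name, sample_strings):
    """Index a tope under the signature of one sample string per format."""
    for s in sample_strings:
        index[char_content(s)].add(tope_name)


def candidates(example):
    """Return only the topes worth testing against this example."""
    return index.get(char_content(example), set())


# Hypothetical sample strings, one per phone number format.
register("phone number", ["412-555-0100", "(412) 555-0100"])

print(candidates("800-555-0199"))
print(candidates("Smith 225"))
```

A lookup like this costs one pass over the example string, after which the expensive grammar test runs only on the topes that share its signature.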

17
Introduction
–Problem overview
–Topes overview
Tope editing tool: Toped++
Tope recommendation
Evaluation
–Evaluation of usability
–Evaluation of accuracy & speed
Conclusion

18 Evaluating usability for fixing spreadsheet data
9 master’s students, primarily in business
Baseline: fixing strings manually
Within-subject study design with 4 phases:
–Tutorial task (up to 30 minutes)
–Three tasks using Toped++ to fix typos and reformat 100 cells each (up to 30 minutes total)
–Same three tasks manually (up to 1 minute each)
–Satisfaction questionnaire

19 Task details
Each task = find and fix typos in 100 spreadsheet cells, then put the cells into a specified format
–E.g., add “.com” to addresses lacking top-level domain, then reformat like
Different kinds of data assigned to different users:
–3 users: Person first name, last name, university (single-part word-like topes)
–3 users: Course number, state name, country name (whitelist-driven topes; we provided whitelists from web)
–3 users: address, phone number, person name (multi-part topes)

20 Usability: Improves user speed with negligible errors
Table: minutes required with Toped++ (actual) and manually (projected), plus breakeven point (# cells), for Group 1 (single word data), Group 2 (whitelist data), Group 3 (multi-part data), and the overall average.
–With ~1/1000 error rate
–Projected, based on how many seconds participants spent fixing typos & reformatting each cell
–Even without reuse!

21 User satisfaction: They want to use topes
User preference: Toped++ or doing tasks manually
–Every user strongly preferred Toped++
5-point Likert scales asking…
–How easy Toped++ was to use
–How much users trusted it
–How pleasant it was to use
–If they would use it if made available
Every participant but one gave a score of 4 or 5 on every question (the good end of the scale).
Two users described how they wished a tool like this had been available in previous office environments.

22 Evaluating accuracy and speed of tope recommendation
A prior study found that 32 categories covered 70% of the columns that could be categorized in the EUSES spreadsheet corpus.
Evaluate accuracy & speed of tope recommendation:
–Create a tope in Toped++ for each data category
–Randomly choose a subset of these topes
–Randomly choose examples from a column
–Grab keywords from the column header
–Query for a tope: Is it right? How long does the query take?
–Repeat many times
–Then vary # topes, # examples, keywords to measure impact on accuracy & speed
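The evaluation loop can be sketched as a small harness. This is a hypothetical outline, not the script used in the study: the `evaluate` function, its parameters, and the toy data are assumptions. It also assumes the correct tope is always among the randomly chosen subset, and it leaves out column-header keywords for brevity.

```python
import random
import time


def evaluate(columns, recommend, n_topes, n_examples, menu_size, trials, seed=0):
    """Estimate recommendation accuracy and mean query time.

    columns:   category name -> list of strings from a column of that category
    recommend: function (available categories, examples) -> ranked category names
    """
    rng = random.Random(seed)
    categories = sorted(columns)
    hits, elapsed = 0, 0.0
    for _ in range(trials):
        # Randomly choose the target column and the other topes on the "computer".
        target = rng.choice(categories)
        distractors = rng.sample([c for c in categories if c != target], n_topes - 1)
        available = [target] + distractors
        rng.shuffle(available)  # avoid favoring the target through list order
        examples = rng.sample(columns[target], n_examples)

        # Query for a tope: how long does it take, and is the right one in the menu?
        start = time.perf_counter()
        ranked = recommend(available, examples)
        elapsed += time.perf_counter() - start
        hits += target in ranked[:menu_size]
    return hits / trials, elapsed / trials


# Toy data and a toy recommender that ranks by how many examples each column contains.
columns = {
    "state": ["PA", "OH", "NY"],
    "digit": ["1", "2", "3"],
    "color": ["red", "blue", "green"],
}


def toy_recommend(available, examples):
    return sorted(
        available,
        key=lambda c: sum(e in columns[c] for e in examples),
        reverse=True,
    )


accuracy, mean_seconds = evaluate(
    columns, toy_recommend, n_topes=2, n_examples=2, menu_size=1, trials=50
)
print(accuracy)
```

Varying `n_topes`, `n_examples`, and `menu_size` across runs corresponds to the last step on the slide, where each factor's impact on accuracy and speed is measured.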

23 Recommendation accuracy: Even a short menu usually has the right tope
Chart: horizontal axis is # choices in the drop-down menu (result set size); series vary # examples and whether keywords are used.

24 Recommendation speed: Menu can be populated in < 1 second
Chart: horizontal axis is the number of topes on the computer to choose from; series vary # examples and whether keywords are used.

25 Toped++: first system to integrate user-extensible string validation with executable reformatting rules
Other tools described in Related Work:
–Grammex & SWYN: No reformatting rules
–Potluck & Lapis: No “replayable” reformatting rules
–Nix edit-by-example: No validation
RE-Trees: search-by-match for regular expressions
Topes are basically one way to model named entities, a central concept in information extraction research.

26 Conclusion
Contributions
–Auto-generate reformatting rules
  Very strongly preferred by users
  Users quickly & correctly fix typos and reformat data
–Recommend based on examples of strings to match
  Good accuracy based on even just a few strings
  Fast enough to search the user’s computer as they work
Future Opportunities
–Improving accuracy of recommendations
  Learn from user responses to previous recommendations
  Provide repository for intra-organizational tope reuse
–Further integrations
  Adding reformatting-based Joins to DataSpaces?

27 Thank You…
To Margaret Burnett, James Lin, Simone Stumpf, Weng-Keen Wong and others in the EUSES Consortium for feedback over the years on topes
To NSF for funding
To IUI 2009 for this opportunity to present

28 References
[ICSE’08] Topes data model: C. Scaffidi, B. Myers, and M. Shaw. Topes: Reusable Abstractions for Validating Data. International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 2008, pp
[ISEUD’09] User evaluation of early tool: C. Scaffidi, B. Myers, and M. Shaw. Fast, Accurate Creation of Data Validation Formats by End-User Developers. 2nd International Symposium on End-User Development (ISEUD 2009), March 2009, to appear.
[FSE’08] Use in web macros: A. Koesnandar, S. Elbaum, G. Rothermel, L. Hochstein, K. Thomasset, and C. Scaffidi. Using Assertions to Help End-User Programmers Create Dependable Web Macros. Proc. 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2008), Atlanta, GA, November 2008,
[ICEIS’07] Inferring new topes: C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems - HCI Volume (ICEIS 2007), Madeira, Portugal, June 2007, pp