Intelligently Creating and Recommending Reusable Reformatting Rules
Christopher Scaffidi, Brad Myers, Mary Shaw
Carnegie Mellon University

2 People often use spreadsheets to store and organize “string” data
According to a study by the University of Nebraska, nearly 40% of spreadsheet cells are strings (i.e., not numbers, formulas, or dates).
Example task found while observing administrative assistants (contextual inquiry): build a roster of employee contact info
–Visit several project teams’ web sites
–Copy data from the web sites into a spreadsheet
–Manually put the data into a consistent format (because users care about formatting when creating reports)
Introduction  Editor  Recommendation  Evaluation

3 Mishmash of formats and invalid strings
–Illustrative example (not actually the spreadsheet in the contextual inquiry)
–Part of an actual spreadsheet from the CMU web site

4 Needed: automated support for validating and reformatting domain-specific strings
Finding and fixing invalid or inconsistently formatted strings by hand is tedious and error-prone.
Excel and other tools provide no features for automatically reformatting domain-specific strings
–They only handle numeric data and a few specific kinds of strings (not domain-extensible)

5 Underlying problem: abstraction mismatch
Tools support strings, ints, floats, and sometimes dates.
The problem domain involves higher-level, multi-format categories of strings:
–Person names
–CMU department names
–CMU course numbers
–CMU building room numbers

6 Topes: each tope describes how to validate and reformat one kind of string
A notional depiction of a tope for CMU room numbers (node = format, edge = reformatting rule):
–Formal building name & room number: Elliot Dunlap Smith Hall 225
–Colloquial building name & room number: Smith 225
–Building abbreviation & room number: EDSH 225
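The node-and-edge picture on this slide can be made concrete with a small sketch. This is only an illustration of the idea, not the actual tope data model: the names BUILDINGS, FORMATS, parse, and reformat are hypothetical, and a regular expression stands in for whatever grammar a real tope uses to recognize each format.

```python
import re

# Hypothetical lookup table: (formal name, colloquial name, abbreviation).
BUILDINGS = [("Elliot Dunlap Smith Hall", "Smith", "EDSH")]

# The nodes of the tope: one entry per format, in table-column order.
FORMATS = ["formal", "colloquial", "abbrev"]

def parse(s, fmt):
    """Validate s against format fmt; return (building row, room) or None."""
    col = FORMATS.index(fmt)
    for row in BUILDINGS:
        m = re.fullmatch(re.escape(row[col]) + r" (\d+)", s)
        if m:
            return row, m.group(1)
    return None

def reformat(s, src, dst):
    """An edge of the tope: rewrite a valid src-format string in dst format."""
    parsed = parse(s, src)
    if parsed is None:
        raise ValueError(f"{s!r} is not valid in format {src!r}")
    row, room = parsed
    return f"{row[FORMATS.index(dst)]} {room}"

print(reformat("Smith 225", "colloquial", "abbrev"))
print(reformat("EDSH 225", "abbrev", "formal"))
```

Each format doubles as a validator (parse returns None for strings that do not fit), which is why one abstraction can serve both validation and reformatting.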

7 What’s new and interesting today? Auto-reformatting and recommendation
Previous work:
–Early tope editing tool for creating topes to validate and reformat spreadsheet, web form, and web macro data [ICSE’08, FSE’08]
–Inferring new topes from example strings [ICEIS’07]
–Usability evaluation of the early tope editing tool [ISEUD’09]
Limitations of previous work:
–Tedious to implement reformatting rules
–Tedious to reuse topes
Contributions today:
–Automatic reformatting
–Tope recommendation

8 New “Format As” feature

9 Today’s presentation
Introduction
–Problem overview
–Topes overview
Tope editing tool: Toped++
Tope recommendation
Evaluation
–Evaluation of usability
–Evaluation of accuracy & speed
Conclusion

10 Creating a new tope
Highlight cells containing example strings; the system infers a boilerplate tope.

11 Toped++: an improved editor for topes
Data Description Editor

12 Whitelist tab
Other kinds of data easily described with a whitelist:
–US state names & abbreviations
–Campus building names & abbreviations

13 Auto-reformatting
Topes with a single word-like part
–4 formats: UPPER CASE, lower case, Title Case, miXeD cAse
Topes with a single numeric part
–One format per number of digits allowed: pad with “0” and/or round
Topes with multiple parts and separators
–(Recursively) reformat each part, then concatenate with separators
Topes that also have a whitelist
–One format per synonym column: use a lookup table
Important: after reformatting, test the resulting string against the target format’s grammar to detect errors.
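A minimal sketch of the four rule kinds listed on this slide, plus the final validation step. All function names are hypothetical. It assumes that "# digits allowed" means decimal places and that a regular expression can stand in for the target format's grammar; Toped++ itself may do both differently.

```python
import re

def reformat_word(word, target):
    """Single word-like part: switch letter case to the target format."""
    if target == "upper":
        return word.upper()
    if target == "lower":
        return word.lower()
    if target == "title":
        return word.title()
    return word  # "mixed" accepts any casing, so leave the word unchanged

def reformat_number(text, decimals):
    """Single numeric part: round and zero-pad to a fixed number of decimals."""
    return f"{float(text):.{decimals}f}"

def reformat_whitelist(value, rows, target_col):
    """Whitelist: find the row containing value, return its target synonym."""
    for row in rows:
        if value in row:
            return row[target_col]
    raise ValueError(f"{value!r} is not in the whitelist")

def reformat_parts(text, src_sep, dst_sep, part_fns):
    """Multiple parts: reformat each part, then join with the new separator."""
    parts = text.split(src_sep)
    if len(parts) != len(part_fns):
        raise ValueError(f"expected {len(part_fns)} parts in {text!r}")
    return dst_sep.join(fn(part) for fn, part in zip(part_fns, parts))

def check(result, target_grammar):
    """Final step: test the result against the target format's grammar."""
    if re.fullmatch(target_grammar, result) is None:
        raise ValueError(f"{result!r} does not fit the target format")
    return result

STATES = [("Pennsylvania", "PA"), ("Hawaii", "HI")]  # illustrative whitelist
PHONE_DASHED = r"\d{3}-\d{3}-\d{4}"                   # stand-in target grammar

print(reformat_word("scaffidi", "title"))
print(reformat_number("2.5", 2))
print(reformat_whitelist("PA", STATES, 0))
print(check(reformat_parts("808.202.3030", ".", "-", [str, str, str]),
            PHONE_DASHED))
```

The check step is what catches inputs that were malformed to begin with: a phone number missing a digit is reformatted mechanically, then rejected because the result does not fit the target grammar.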

15 Supporting reuse: recommendation via a search-by-match algorithm
Algorithm summary:
1. Sort topes by number of keywords hit
2. Break ties by testing examples against whitelists
3. Break remaining ties by testing examples against the rest of the tope
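The three-step ranking can be sketched as a sort on a tuple key, so that each later criterion only matters when the earlier ones tie. The Tope class and its fields are hypothetical stand-ins: here a regular expression plays the role of "the rest of the tope", and every score is computed naively for every tope.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Tope:
    name: str
    keywords: set
    whitelist: set = field(default_factory=set)
    pattern: str = r".*"  # stand-in for the rest of the tope's grammar

def recommend(topes, keywords, examples):
    """Rank topes: keyword hits first, then whitelist hits, then pattern hits."""
    query = {k.lower() for k in keywords}

    def score(tope):
        keyword_hits = len(query & tope.keywords)
        whitelist_hits = sum(e in tope.whitelist for e in examples)
        pattern_hits = sum(
            re.fullmatch(tope.pattern, e) is not None for e in examples
        )
        # Tuples compare element by element, which gives the tie-breaking.
        return (keyword_hits, whitelist_hits, pattern_hits)

    return sorted(topes, key=score, reverse=True)

TOPES = [
    Tope("phone", {"phone", "tel"}, pattern=r"\d{3}[-.]\d{3}[-.]\d{4}"),
    Tope("state", {"state"}, {"PA", "Pennsylvania"}, r"[A-Za-z ]+"),
    Tope("zip", {"zip"}, pattern=r"\d{5}"),
]

for tope in recommend(TOPES, ["Phone"], ["808-202-3030"]):
    print(tope.name)
```

The first few entries of the returned list would populate the recommendation menu; with no keywords at all, the whitelist and pattern scores alone decide the order.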

16 Implementation details: speeding up the recommendation
Counting keyword hits and whitelist hits is easy: just use an inverted index.
But testing every example on every tope is wasteful. Why test a tope if it couldn’t match anyway?
For example, if a phone number tope can only match formats like “808-202-3030” and “808.202.3030”, then it only needs to be tested against examples that have 10 digits and 2 hyphens or periods.
–Index topes according to their “character content”
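One way to picture a "character content" index is sketched below. It is an assumption-laden simplification: the signature is just a triple of counts (digits, letters, other characters), each tope is registered under the signature of one sample string per format, and only fixed-length formats are handled. The real index may use a richer description of character content.

```python
import re
from collections import defaultdict

def signature(s):
    """Coarse character content: counts of digits, letters, and other chars."""
    digits = sum(c.isdigit() for c in s)
    letters = sum(c.isalpha() for c in s)
    return (digits, letters, len(s) - digits - letters)

class ContentIndex:
    def __init__(self):
        self._by_sig = defaultdict(list)

    def add(self, name, pattern, samples):
        """Register a tope under the signature of each sample string."""
        for sig in {signature(s) for s in samples}:
            self._by_sig[sig].append((name, re.compile(pattern)))

    def candidates(self, example):
        """Topes whose character content is compatible with the example."""
        return self._by_sig.get(signature(example), [])

    def matches(self, example):
        """Run the expensive test only on the compatible topes."""
        return [name for name, rx in self.candidates(example)
                if rx.fullmatch(example)]

index = ContentIndex()
index.add("phone", r"\d{3}[-.]\d{3}[-.]\d{4}",
          ["808-202-3030", "808.202.3030"])
index.add("zip", r"\d{5}", ["15213"])

print(index.matches("412.555.0100"))
print(index.matches("15213"))
```

An example such as "15213" has 5 digits and nothing else, so the phone tope is never tested against it; only the lookup by signature and one pattern test are needed.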

18 Evaluating usability for fixing spreadsheet data
Participants: 9 master’s students, primarily in business
Baseline: fixing strings manually
Within-subject study design with 4 phases:
–Tutorial task (up to 30 minutes)
–Three tasks using Toped++ (up to 30 minutes total), each to fix typos and reformat 100 cells
–Same three tasks manually (up to 1 minute each)
–Satisfaction questionnaire

19 Task details
Each task: find and fix typos in 100 spreadsheet cells, then put the cells into a specified format
–E.g., add “.com” to email addresses lacking a top-level domain, then reformat like “CSCAFFID@gmail.com”
Different kinds of data were assigned to different users:
–3 users: person first name, last name, university (single-part word-like topes)
–3 users: course number, state name, country name (whitelist-driven topes; we provided whitelists from the web)
–3 users: email address, phone number, person name (multi-part topes)

20 Usability: improves user speed with negligible errors
Group | Toped++ minutes required (actual) | Manual minutes required (projected) | Breakeven point (# cells)
Group 1: Single word data | 3.0 | 5.0 | 60
Group 2: Whitelist data | 6.9 | 15.9 | 43
Group 3: Multi-part data | 3.6 | 10.2 | 35
Overall average | 4.5 | 9.6 | 47
–With ~1/1000 error rate
–Manual times are projected, based on how many seconds participants spent fixing typos & reformatting each cell
–Even without reuse!
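The breakeven column on this slide is consistent with a simple reading, sketched below: treat the Toped++ time as a one-off cost and the manual time as proportional to the number of cells (100 per task). This reading is an assumption inferred from the numbers, not something the slide states, and the function name is hypothetical.

```python
CELLS_PER_TASK = 100  # each task covered 100 spreadsheet cells

# (Toped++ minutes, projected manual minutes), copied from the slide.
RESULTS = {
    "Group 1: Single word data": (3.0, 5.0),
    "Group 2: Whitelist data": (6.9, 15.9),
    "Group 3: Multi-part data": (3.6, 10.2),
    "Overall average": (4.5, 9.6),
}

def breakeven_cells(toped_minutes, manual_minutes, cells=CELLS_PER_TASK):
    """Cell count at which manual work takes as long as the Toped++ task."""
    manual_minutes_per_cell = manual_minutes / cells
    return toped_minutes / manual_minutes_per_cell

for group, (toped, manual) in RESULTS.items():
    print(group, round(breakeven_cells(toped, manual)))
```

Under this assumption, manual fixing only wins for columns shorter than the breakeven count; for longer columns the tool is faster even when the tope is created from scratch.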

21 User satisfaction: they want to use topes
User preference: Toped++ or doing the tasks manually
–Every user strongly preferred Toped++
5-point Likert scales asking…
–How easy Toped++ was to use
–How much users trusted it
–How pleasant it was to use
–If they would use it if made available
–Every participant but one gave a score of 4 or 5 on every question (the good end of the scale)
Two users described how they wished a tool like this had been available in previous office environments

22 Evaluating accuracy and speed of tope recommendation
A prior study found that 32 categories covered 70% of the columns that could be categorized in the EUSES spreadsheet corpus.
Evaluating accuracy & speed of tope recommendation:
–Create a tope in Toped++ for each data category
–Randomly choose a subset of these topes
–Randomly choose examples from a column
–Grab keywords from the column header
–Query for a tope: Is it right? How long does the query take?
–Repeat many times
–Then vary # topes, # examples, and use of keywords to measure the impact on accuracy & speed

23 Recommendation accuracy: even a short menu usually has the right tope
[Figure: recommendation accuracy. Axis: # choices in the drop-down menu (result set size). Series: # examples; use keywords?]

24 Recommendation speed: menu can be populated in < 1 second
[Figure: recommendation speed. Axis: number of topes on the computer to choose from. Series: # examples; use keywords?]

25 Toped++: first system to integrate user-extensible string validation with executable reformatting rules
Other tools described in Related Work:
–Grammex & SWYN: no reformatting rules
–Potluck & Lapis: no “replayable” reformatting rules
–Nix edit-by-example: no validation
RE-Trees: search-by-match for regular expressions
Topes are basically one way to model named entities, a central concept in information extraction research

26 Conclusion
Contributions
–Auto-generate reformatting rules
  Very strongly preferred by users
  Users quickly & correctly fix typos and reformat data
–Recommend based on examples of strings to match
  Good accuracy based on even just a few strings
  Fast enough to search the user’s computer as they work
Future opportunities
–Improving accuracy of recommendations
  Learn from user responses to previous recommendations
  Provide a repository for intra-organizational tope reuse
–Further integrations
  Adding reformatting-based joins to DataSpaces?

27 Thank you…
–To Margaret Burnett, James Lin, Simone Stumpf, Weng-Keen Wong, and others in the EUSES Consortium for feedback over the years on topes
–To NSF for funding
–To IUI 2009 for this opportunity to present

28 References
[ICSE’08] Topes data model: C. Scaffidi, B. Myers, and M. Shaw. Topes: Reusable Abstractions for Validating Data. International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 2008, pp. 1-10.
[ISEUD’09] User evaluation of the early tool: C. Scaffidi, B. Myers, and M. Shaw. Fast, Accurate Creation of Data Validation Formats by End-User Developers. 2nd International Symposium on End-User Development (ISEUD 2009), March 2009, to appear.
[FSE’08] Use in web macros: A. Koesnandar, S. Elbaum, G. Rothermel, L. Hochstein, K. Thomasset, and C. Scaffidi. Using Assertions to Help End-User Programmers Create Dependable Web Macros. Proc. 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2008), Atlanta, GA, November 2008, pp. 124-134.
[ICEIS’07] Inferring new topes: C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems - HCI Volume (ICEIS 2007), Madeira, Portugal, June 2007, pp. 236-241.
