Published by Melinda Potter. Modified over 9 years ago.
1
Intelligently Creating and Recommending Reusable Reformatting Rules
Christopher Scaffidi, Brad Myers, Mary Shaw
Carnegie Mellon University
2
People often use spreadsheets to store and organize “string” data
According to a study by the Univ. of Nebraska, nearly 40% of spreadsheet cells are strings (i.e., not numbers, formulas, or dates)
Example task found while observing administrative assistants (contextual inquiry): build a roster of employee contact info
–Visit several project teams’ web sites
–Copy data from the web sites into a spreadsheet
–Manually put the data into a consistent format (because users care about formatting when creating reports)
Introduction Editor Recommendation Evaluation
3
Mishmash of formats and invalid strings
–Illustrative example (not actually the spreadsheet in the contextual inquiry)
–Part of an actual spreadsheet from a CMU web site
4
Needed: automated support for validating and reformatting domain-specific strings
Finding and fixing strings is tedious and error-prone
Excel and other tools provide no features for automatically reformatting domain-specific strings
–Only for numeric data & a few specific kinds of strings (not domain-extensible)
5
Underlying problem: abstraction mismatch
Tools support strings, ints, floats, sometimes dates.
The problem domain involves higher-level, multi-format categories of strings:
–Person names
–CMU department names
–CMU course numbers
–CMU building room numbers
6
Tope: Each tope describes how to validate and reformat one kind of string
A notional depiction of a tope for CMU room numbers (node = format, edge = reformatting rule):
–Formal building name & room number: Elliot Dunlap Smith Hall 225
–Colloquial building name & room number: Smith 225
–Building abbreviation & room number: EDSH 225
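The node-and-edge picture of the room-number tope can be sketched in code. This is a minimal illustration, not the authors' implementation: the format names, the `BUILDINGS` table, `make_rule`, and `reformat` are all hypothetical, and chaining rules along a path between formats is an assumption made here for the sake of the example.

```python
from collections import deque

# Hypothetical names for the three formats (the nodes of the graph).
FORMAL, COLLOQUIAL, ABBREV = "formal", "colloquial", "abbreviation"

# Assumed lookup data: one row per building, one column per format.
BUILDINGS = [
    {FORMAL: "Elliot Dunlap Smith Hall", COLLOQUIAL: "Smith", ABBREV: "EDSH"},
]

def make_rule(src, dst):
    """Build one edge: a function rewriting a string from format src to dst."""
    def rule(text):
        name, _, room = text.rpartition(" ")
        for row in BUILDINGS:
            if row[src] == name:
                return f"{row[dst]} {room}"
        raise ValueError(f"{text!r} is not a valid {src} room number")
    return rule

# Edges of the graph. Only neighbouring formats are linked, in both directions.
RULES = {
    (a, b): make_rule(a, b)
    for a, b in [(FORMAL, COLLOQUIAL), (COLLOQUIAL, FORMAL),
                 (COLLOQUIAL, ABBREV), (ABBREV, COLLOQUIAL)]
}

def reformat(text, src, dst):
    """Follow the shortest chain of rules from src to dst (breadth-first)."""
    queue = deque([(src, text)])
    seen = {src}
    while queue:
        fmt, value = queue.popleft()
        if fmt == dst:
            return value
        for (a, b), rule in RULES.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, rule(value)))
    raise ValueError(f"no rule path from {src} to {dst}")

print(reformat("Elliot Dunlap Smith Hall 225", FORMAL, ABBREV))
```

Here the formal-to-abbreviation conversion has no direct edge, so it passes through the colloquial format on the way.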
7
What’s new and interesting today? Auto-reformatting and recommendation
Previous work:
–Early tope editing tool for creating topes to validate and reformat spreadsheet, web form and web macro data [ICSE’08, FSE’08]
–Inferring new topes from example strings [ICEIS’07]
–Usability evaluation of the early tope editing tool [ISEUD’09]
Limitations of previous work:
–Tedious to implement reformatting rules
–Tedious to reuse topes
Contributions today:
–Automatic reformatting
–Tope recommendation
8
New “Format As” feature
9
Today’s presentation
Introduction
–Problem overview
–Topes overview
Tope editing tool: Toped ++
Tope recommendation
Evaluation
–Evaluation of usability
–Evaluation of accuracy & speed
Conclusion
10
Creating a new tope
Highlight cells containing example strings… the system infers a boilerplate tope
11
Toped ++ : an improved editor for topes
Data Description Editor
12
Whitelist tab
Other kinds of data easily described with a whitelist:
–US state names & abbreviations
–Campus building names & abbreviations
13
Auto-reformatting
Topes with a single word-like part
–4 formats: UPPER CASE, lower case, Title Case, miXeD cAse
Topes with a single numeric part
–One format per # digits allowed: pad with “0” and/or round
Topes with multiple parts and separators
–(Recursively) reformat each part, concatenate with separators
Topes that also have a whitelist
–One format per synonym column: use lookup table
Important: after reformatting, test the resulting string against the target format’s grammar to detect errors.
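The four rule families and the final validation step can be sketched as follows. This is an illustrative sketch under stated assumptions, not Toped ++ code: all function names are hypothetical, “# digits” is interpreted here as the number of decimal places, and a regular expression stands in for the target format’s grammar.

```python
import re

def reformat_word(text, case):
    """Single word-like part: one rule per target letter case."""
    if case == "upper":
        return text.upper()
    if case == "lower":
        return text.lower()
    if case == "title":
        return text.title()
    return text  # mixed case accepts any capitalization, so leave as is

def reformat_number(text, decimals):
    """Single numeric part: round or zero-pad to a fixed number of decimals."""
    return f"{float(text):.{decimals}f}"

def reformat_whitelist(text, table, src_col, dst_col):
    """Whitelist: look the string up in one synonym column, return another."""
    for row in table:
        if row[src_col] == text:
            return row[dst_col]
    raise ValueError(f"{text!r} is not in the whitelist")

def reformat_parts(parts, part_rules, separators):
    """Multiple parts: reformat each part, then join with target separators."""
    pieces = [rule(part) for rule, part in zip(part_rules, parts)]
    result = pieces[0]
    for sep, piece in zip(separators, pieces[1:]):
        result += sep + piece
    return result

def checked(result, target_pattern):
    """Final step: test the output against the target format (regex stand-in)."""
    if re.fullmatch(target_pattern, result) is None:
        raise ValueError(f"{result!r} does not match the target format")
    return result

# Email example: upper-case user name, lower-case domain, "@" separator.
email = reformat_parts(
    ["cscaffid", "Gmail.com"],
    [lambda s: reformat_word(s, "upper"), lambda s: reformat_word(s, "lower")],
    ["@"],
)
print(checked(email, r"[A-Z]+@[a-z]+\.[a-z]+"))

states = [{"name": "Pennsylvania", "abbr": "PA"}]
print(reformat_whitelist("Pennsylvania", states, "name", "abbr"))
print(reformat_number("7", 2))
```

The `checked` call is what catches a reformatting rule that produces a string the target format would reject, for example an email address whose user name contains a digit under the pattern used here.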
15
Supporting reuse: Recommendation via search-by-match algorithm
Algorithm summary:
1. Sort topes by # keywords hit
2. Break ties by testing examples against whitelists
3. Break remaining ties by testing examples against the rest of the tope
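The three-level ordering above maps naturally onto a lexicographic sort key. The sketch below is only an illustration of that idea: the `Tope` class, its fields, and `recommend` are hypothetical, a regular expression stands in for “the rest of the tope”, and all three scores are computed eagerly for simplicity even though a real system would presumably run the costlier tests only when ties remain.

```python
import re
from dataclasses import dataclass

@dataclass
class Tope:
    name: str
    keywords: set    # words associated with the tope, e.g. from column headers
    whitelist: set   # exact strings the tope accepts
    pattern: str     # regex stand-in for the rest of the tope's grammar

def recommend(topes, header_keywords, examples, limit=3):
    """Rank topes by keyword hits, then whitelist hits, then grammar hits."""
    def score(tope):
        keyword_hits = len(tope.keywords & header_keywords)
        whitelist_hits = sum(e in tope.whitelist for e in examples)
        grammar_hits = sum(
            re.fullmatch(tope.pattern, e) is not None for e in examples
        )
        # Tuples compare left to right, so later entries only break ties.
        return (keyword_hits, whitelist_hits, grammar_hits)
    return sorted(topes, key=score, reverse=True)[:limit]

topes = [
    Tope("phone", {"phone"}, set(), r"\d{3}[-.]\d{3}[-.]\d{4}"),
    Tope("department", {"dept"}, {"CS", "ECE"}, r"[A-Z]{2,4}"),
    Tope("state", {"state"}, {"PA", "NY"}, r"[A-Z]{2}"),
]

# No header keywords: whitelist hits decide between "state" and "department".
print([t.name for t in recommend(topes, set(), ["PA", "NY"])])
```

With an empty keyword set, every tope ties on the first criterion, so the whitelist test separates the state tope from the department tope even though both grammars accept two capital letters.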
16
Implementation details: Speeding up the recommendation
Counting keyword hits and whitelist hits is easy: just use an inverted index.
But testing every example on every tope is wasteful. Why test a tope if it couldn’t match anyway?
For example, if a phone number tope can only match formats like “808-202-3030” and “808.202.3030”, then it only needs to be tested against examples that have 10 digits and 2 hyphens or periods.
–Index topes according to their “character content”
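One way to picture a “character content” index is sketched below. Everything here is an assumption made for illustration: the `signature` function, the choice of digit count, letter count, and remaining characters as the key, and the use of one sample string per format to populate the index. A real system would presumably derive the possible character content from each tope’s grammar, which this sketch does not attempt, and sample-based signatures only work for fixed-length formats.

```python
from collections import defaultdict

def signature(text):
    """Summarize character content: digit count, letter count, and the
    sorted sequence of all remaining characters."""
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    others = "".join(sorted(c for c in text if not c.isalnum()))
    return (digits, letters, others)

def build_index(topes):
    """Map each signature to the names of the topes that could match it."""
    index = defaultdict(set)
    for name, samples in topes.items():
        for sample in samples:
            index[signature(sample)].add(name)
    return index

def candidates(index, example):
    """Return only the topes worth testing against this example."""
    return index.get(signature(example), set())

# Hypothetical topes, each registered with one sample string per format.
topes = {
    "phone": ["808-202-3030", "808.202.3030"],
    "zip": ["15213"],
}
index = build_index(topes)

print(candidates(index, "412-268-2000"))  # same content as a hyphenated phone
print(candidates(index, "hello"))         # nothing indexed has 5 letters
```

The lookup is a single dictionary access per example, so the expensive grammar test runs only on topes that share the example’s character content.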
18
Evaluating usability for fixing spreadsheet data
9 master’s students, primarily in business
Baseline: fixing strings manually
Within-subject study design with 4 phases:
–Tutorial task (up to 30 minutes)
–Three tasks using Toped ++ (up to 30 minutes total), each to fix typos and reformat 100 cells
–Same three tasks manually (up to 1 minute each)
–Satisfaction questionnaire
19
Task details
Each task = Find and fix typos in 100 spreadsheet cells, then put the cells into a specified format
–E.g., add “.com” to email addresses lacking a top-level domain, then reformat like “CSCAFFID@gmail.com”
Different kinds of data assigned to different users:
–3 users: Person first name, last name, university (single-part word-like topes)
–3 users: Course number, state name, country name (whitelist-driven topes; we provided whitelists from the web)
–3 users: Email address, phone number, person name (multi-part topes)
20
Usability: Improves user speed with negligible errors

                             Minutes required                          Breakeven point
                             Toped ++ (actual)   Manual (projected)    (# cells)
Group 1: Single word data    3.0                 5.0                   60
Group 2: Whitelist data      6.9                 15.9                  43
Group 3: Multi-part data     3.6                 10.2                  35
Overall average              4.5                 9.6                   47

Slide annotations:
–With ~ 1/1000 error rate
–Manual times are projected, based on how many seconds participants spent fixing typos & reformatting each cell
–Even without reuse!
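The breakeven figures above are consistent with a simple reading, which is an inference from the numbers rather than something the slide states: treat the Toped ++ time as a fixed cost, treat the manual time as scaling linearly over the 100 cells of a task, and ask at how many cells the manual effort equals the Toped ++ time. The function name `breakeven_cells` is hypothetical.

```python
def breakeven_cells(toped_minutes, manual_minutes, cells=100):
    """Cells at which manual effort equals the (assumed fixed) Toped ++ time.

    manual_minutes is the projected manual time for `cells` cells, so the
    manual cost per cell is manual_minutes / cells.
    """
    manual_per_cell = manual_minutes / cells
    return toped_minutes / manual_per_cell

# (Toped ++ minutes, manual minutes) copied from the figures above.
rows = {
    "Group 1: Single word data": (3.0, 5.0),
    "Group 2: Whitelist data": (6.9, 15.9),
    "Group 3: Multi-part data": (3.6, 10.2),
    "Overall average": (4.5, 9.6),
}

for label, (toped, manual) in rows.items():
    print(f"{label}: {round(breakeven_cells(toped, manual))} cells")
```

Rounded to whole cells, this reproduces 60, 43, 35, and 47, which suggests the slide’s breakeven column was computed along these lines.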
21
User satisfaction: They want to use topes
User preference: Toped ++ or doing tasks manually
–Every user strongly preferred Toped ++
5-point Likert scales asking…
–How easy Toped ++ was to use
–How much users trusted it
–How pleasant it was to use
–If they would use it if made available
–Every participant but one gave a score of 4 or 5 on every question (the good end of the scale)
Two users described how they wished a tool like this had been available in previous office environments
22
Evaluating accuracy and speed of tope recommendation
A prior study found that 32 categories covered 70% of columns that could be categorized in the EUSES spreadsheet corpus
Evaluate accuracy & speed of tope recommendation
–Create a tope in Toped ++ for each data category
–Randomly choose a subset of these topes
–Randomly choose examples from a column
–Grab keywords from the column header
–Query for a tope: Is it right? How long does the query take?
–Repeat many times
–Then vary # topes, # examples, keywords to measure impact on accuracy & speed
23
Recommendation accuracy: Even a short menu usually has the right tope
[Chart. Axis: # choices in the drop-down menu (result set size). Series: # examples; use keywords?]
24
Recommendation speed: Menu can be populated in < 1 second
[Chart. Axis: number of topes on the computer to choose from. Series: # examples; use keywords?]
25
Toped ++ : first system to integrate user-extensible string validation with executable reformatting rules
Other tools described in Related Work:
–Grammex & SWYN: No reformatting rules
–Potluck & Lapis: No “replayable” reformatting rules
–Nix edit-by-example: No validation
–RE-Trees: search-by-match for regular expressions
Topes are essentially one way to model named entities, a central concept in information extraction research
26
Conclusion
Contributions
–Auto-generate reformatting rules
  Very strongly preferred by users
  Users quickly & correctly fix typos and reformat data
–Recommend based on examples of strings to match
  Good accuracy based on even just a few strings
  Fast enough to search the user’s computer as they work
Future opportunities
–Improving accuracy of recommendations
  Learn from user responses to previous recommendations
  Provide a repository for intra-organizational tope reuse
–Further integrations
  Adding reformatting-based joins to DataSpaces?
27
Thank You…
To Margaret Burnett, James Lin, Simone Stumpf, Weng-Keen Wong and others in the EUSES Consortium for feedback over the years on topes
To NSF for funding
To IUI 2009 for this opportunity to present
28
References
ICSE’08 (topes data model): C. Scaffidi, B. Myers, and M. Shaw. Topes: Reusable Abstractions for Validating Data. International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 2008, pp. 1-10.
ISEUD’09 (user evaluation of the early tool): C. Scaffidi, B. Myers, and M. Shaw. Fast, Accurate Creation of Data Validation Formats by End-User Developers. 2nd International Symposium on End-User Development (ISEUD 2009), March 2009, to appear.
FSE’08 (use in web macros): A. Koesnandar, S. Elbaum, G. Rothermel, L. Hochstein, K. Thomasset, and C. Scaffidi. Using Assertions to Help End-User Programmers Create Dependable Web Macros. Proc. 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2008), Atlanta, GA, November 2008, pp. 124-134.
ICEIS’07 (inferring new topes): C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems - HCI Volume (ICEIS 2007), Madeira, Portugal, June 2007, pp. 236-241.