Download presentation
Presentation is loading. Please wait.
1
Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Carnegie Mellon University
2
2 Target population In 2012, there will be 90 million computer end users in American workplaces. Of these, at least 55 million will create spreadsheets, databases, web applications, or other programs. –Spreadsheets for computing budgets –Spreadsheets and databases for storing information –Web applications for collecting data from coworkers And similar programs for automating a wide range of tedious or error-prone work tasks. Introduction Requirements Topes Tools Evaluation Conclusion
3
3 Contextual inquiry: What are the problems of end users? Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each) Introduction Requirements Topes Tools Evaluation Conclusion
4
4 Lots of manual labor— validating and reformatting strings Building a staff roster, merging data from web sites: –Had to scrutinize data to identify questionable values (e.g.: CMU campus phone numbers are usually 258-xxxx but 259-xxxx might be right) –Had to manually transform data to consistent format (e.g.: Put person names in Lastname, Firstname format) Cannot automate with “web macro” tools –Intended for automating tasks like these –Tools don’t “know” how to check campus phone numbers or reformat person names. => Users simply performed the tasks manually Introduction Requirements Topes Tools Evaluation Conclusion
5
5 Another person’s task: validate web forms-- but he didn’t know JavaScript / regexps Is the input valid? “EDSH 225” Does it need reformatting? “Smith 225” Is the input questionable? “Gates 225” Or is it obviously invalid? “412-555-5444” Introduction Requirements Topes Tools Evaluation Conclusion
6
6 Collaborations of programmers with widely varying skills, interests, concerns Interviewing creators of Hurricane Katrina “person locator” sites (helping survivors publish their status) 4 managers in IT firms, 1 student, 1 graphic designer –2 people each created a site on their own –4 people collaborated with other programmers (principally on site aggregation) Introduction Requirements Topes Tools Evaluation Conclusion
7
7 Hurricane Katrina “Person Locator” site: Many inputs unvalidated Introduction Requirements Topes Tools Evaluation Conclusion
8
8 Data errors reduce the usefulness of data. Even little typos impede data de-duplication. Age is not useful for flying my helicopter to come rescue you. Nor is a “city name” with 1 letter. Introduction Requirements Topes Tools Evaluation Conclusion
9
9 Hurricane Katrina sites are not alone in lacking input validation. Eg: Google Base web application –13 primary web forms –Even numeric fields accept unreasonable inputs (such as a salary of “-45”) If professional programmers can’t get this right, then it’s unsurprising that those 90 million end users also have so much trouble. So many unvalidated inputs. So many data errors. So much time to find mistakes. So many millions of people laboriously reformatting data by hand. We need a better way! Introduction Requirements Topes Tools Evaluation Conclusion
10
10Outline 1.Requirements for a better data model 2.Topes Model for describing data Tools for creating/using topes 3.Evaluations 4.Conclusion Introduction Requirements Topes Tools Evaluation Conclusion
11
11 Underlying problem: abstraction mismatch Tools support strings, integers, floats, maybe dates. Problem domain involves higher-level data categories: –Person names “ Scaffidi, Chris”, “Chris Scaffidi” –CMU phone numbers “ 8-1234”, “x8-1234” –CMU room numbers “ WeH 4623”, “Wean 4623” Introduction Requirements Topes Tools Evaluation Conclusion
12
12 Approach: Create a new abstraction for each category of data Like software “libraries,” implementations of these abstractions could be reused in many programs. Abstractions would need to include functions for: –Recognizing instances of the category (for automating data validation) –Transforming instances among various formats (for automating data reformatting) Introduction Requirements Topes Tools Evaluation Conclusion
13
13 1. Identify valid, invalid, and questionable values Data is sometimes questionable… yet valid. –Eg: an unusually long email address –In practice, person names and other proper nouns are never validated with regexps… too brittle. –Life is full of corner cases and exceptions. If code can identify questionable data, then it can double-check the data: –Ask an application end user to confirm the input –Flag the input for checking by a system administrator –Compare the value to a list of known exceptions –Call up a server and see if it can confirm the value Introduction Requirements Topes Tools Evaluation Conclusion
14
14 2. Capture reformatting rules Two different strings can be equivalent. –What if an end user types a date in the wrong format? –“Jan-3-2007” and “1/3/2007” mean the same thing because of the category that they are in: date. –Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation. If code can transform among formats, then it can put data in an unambiguous format as needed. –Display result so users can check/fix interpretation Introduction Requirements Topes Tools Evaluation Conclusion
15
15 3. User-extensibility Many kinds of data are organization-specific But users at those organizations know what the data values mean—take advantage of what they know… Users can describe the constrained parts of data. –Eg: CMU room numbers, “EDSH 303”, have a building name and an internal room number –Valid data obeys intra- and inter-part constraints. Introduction Requirements Topes Tools Evaluation Conclusion
16
16 4. Reusability across programming environments (“platforms”) If a CMU room number is invalid, it’s generally because the room does not exist… –i.e.: it does not matter whether the room number is in a spreadsheet or a webform or a database To validate a kind of data, people don’t want to write –JavaScript for webforms on the client side –C#/Java/PHP for webforms on the server side –Stored procedures for databases –VBScript for spreadsheets Introduction Requirements Topes Tools Evaluation Conclusion
17
17 Limitations of existing approaches Types do not support questionable values Grammars (eg: regexps, CFGs, Lapis) do not either, and cannot reformat Tools to integrate heterogeneous databases require a professional DBA and are specific to database systems (ie: not spreadsheets, webforms, etc). Cues, Forms/3, -calculus, Slate, etc, infer numerical constraints but not constraints on strings, and they are tied to specific programming platforms Information extraction algorithms rely on grammatical cues that are absent during validation Introduction Requirements Topes Tools Evaluation Conclusion
18
18 Imagine a world where… Code can ask an oracle, “Is this a person name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray. Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed. Regardless of whether they are working in spreadsheets, webforms, or other programming environment, end users can teach the oracle about a new data category by concisely stating its parts and constraints. Introduction Requirements Topes Tools Evaluation Conclusion
19
19Topes A “tope” = a platform-independent abstraction that describes how to recognize and reformat instances of a data category Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain Introduction Requirements Topes Tools Evaluation Conclusion
20
20 A tope is a graph. Node = format, edge = transformation Notional representation for a CMU room number tope… Formal building name & room number Elliot Dunlap Smith Hall 225 Colloquial building name & room number Smith 225 Building abbreviation & room number EDSH 225 Introduction Requirements Topes Tools Evaluation Conclusion
21
21 A tope has functions for recognizing and transforming instances of a data category Each tope implementation has executable functions: –1 isa:string [0,1] function per format, for recognizing instances of the format (a fuzzy set) –0 or more trf:string string functions linking formats, for transforming values from one format to another Validation function: (str) = max(isa f (str)) where f ranges over tope’s formats –Valid when (str) = 1 –Invalid when (str) = 0 –Questionable when 0 < (str) < 1 Introduction Requirements Topes Tools Evaluation Conclusion
22
22 Common kinds of topes: enumerations and proper nouns Multi-format Enumerations, e.g: US states –“New York”, “CA”, maybe “Guam” Open-set proper nouns, e.g.: company names –Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”) –Augmented with a pattern for promising inputs that are not yet on the whitelist Introduction Requirements Topes Tools Evaluation Conclusion
23
23 Two other common kinds of topes: numeric and hierarchical Numeric, e.g.: human masses –Numeric and in a certain range –Values slightly outside range might be questionable –Sometimes labeled with an explicit unit –Transformation usually by multiplication Hierarchical, e.g.: address lines –Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum) –Simple isas can be implemented with regexps. –Transformations involve permutation of parts, lookup tables, and changes to separators & capitalization. Introduction Requirements Topes Tools Evaluation Conclusion
24
24 Topes in action 1.Users implement new topes to describe data categories. 2.Users publish tope implementations on repositories. 3.Other users download topes to a local cache. 4.Tool plug-ins help users browse their local cache and associate topes with variables and input fields. 5.Plug-ins get topes from local cache and use them at runtime to validate and reformat data. Introduction Requirements Topes Tools Evaluation Conclusion
25
25 Role of good tool support Some simple isa functions could be implemented as –Enumerations –Regular expressions / formal grammars But for many topes, we also need to support questionable values and reformatting And usability can almost always be improved by tailoring the tools to the problem domain –Integrate with users’ familiar tools –Match the user interface to the problem’s structure Introduction Requirements Topes Tools Evaluation Conclusion
26
26 What the user sees Introduction Requirements Topes Tools Evaluation Conclusion User highlights cells Clicks “New” button on our Validation toolbar
27
27 System infers a boilerplate tope and presents it for review and customization Introduction Requirements Topes Tools Evaluation Conclusion Induction steps: 1.Identify number & word parts 2.Align parts based on punctuation 3.Infer simple constraints on parts
28
28 User gives names to the parts and edits constraints Features Part names Soft constraints Value whitelists Testing features Introduction Requirements Topes Tools Evaluation Conclusion
29
29 System identifies typos Introduction Requirements Topes Tools Evaluation Conclusion Features Targeted messages Overridable Filterable Can add to “whitelist” Integrated with Excel’s “reviewing” functionality Checking inputs 1.Convert description to CFG w/ constraints on productions 2.Parse each input string 3.For each constraint violation, downgrade parse’s isa score
30
30 Easy access to reformatting functionality Introduction Requirements Topes Tools Evaluation Conclusion Reformatting string 1.Parse with input format’s CFG 2.For each part in target format, a)Get node from parse tree b)Reformat node if needed (recurse) c)Concatenate (with separators if needed) 3.Validate result with target format’s CFG
31
31 Recommending topes based on label and examples-to-match Introduction Requirements Topes Tools Evaluation Conclusion Efficient recommendation Only consider a tope if its instances could possibly have the “character content” of each example string. (eg.: could this have 12 letters & 1 space?)
32
32 Search repository by label and/or examples Note: many repositories will be organization-specific Introduction Requirements Topes Tools Evaluation Conclusion
33
33 Integration with Visual Studio.NET Introduction Requirements Topes Tools Evaluation Conclusion Features Targeted messages Overridable Drag & drop code generation
34
34 Other integrations to date: CoScripter, Robofox, XML/HTML library Introduction Requirements Topes Tools Evaluation Conclusion
35
35 Other integration underway Introduction Requirements Topes Tools Evaluation Conclusion RedRover –Spreadsheet auditing –They already support formula auditing –Goal: Using topes for checking strings LogicBlox –Decision-support –Helping users enter data & make decisions from it –Goal: Using topes for validating data –Goal: Using topes for data de-duplication
36
36 Evaluating accuracy Implemented topes for spreadsheet data –Grouped 1712 spreadsheet columns into categories –Created 32 topes for the most common categories (~ 70% of the data) –Compared validation with topes to validation with existing regexps or enumerations from the web –Tope-based validation was 3 times as accurate Most benefit from supporting multi-format topes; smaller benefit from double-checking questionable values (~ 3% of inputs) Introduction Requirements Topes Tools Evaluation Conclusion
37
37 Evaluating reusability Reused spreadsheet-based topes on webform data –Downloaded data for 8 data categories on Google Base and 5 in Hurricane Katrina website –Reused spreadsheet-based topes on the web data –Validation was just as accurate as on spreadsheets Introduction Requirements Topes Tools Evaluation Conclusion
38
38 Evaluating support for data cleaning Used topes to put web data into consistent formats –Again with the 5 columns in Hurricane Katrina website –Used transformation functions to put each string into the most common format for that data category –Increased number of duplicate strings found by 10% Introduction Requirements Topes Tools Evaluation Conclusion
39
39 Evaluating usability for data validation Users validating data with single-format topes –Between-subjects lab study –8 users validated spreadsheet data with our tools; for comparison, 8 users validated with Lapis patterns –Yes/no validation tasks (no questionable data) –Our tool users vs Lapis users Found three times as many typos Were twice as fast Reported significantly higher user satisfaction –Our tool users vs users in earlier regexp study Faster & more accurate (Similar but not identical tasks: not statistically comparable) Introduction Requirements Topes Tools Evaluation Conclusion
40
40 Evaluating usability for data reformatting Users reformatting data with multi-format topes –Within-subjects lab study –9 users reformatted spreadsheet data by creating & using topes; for comparison, they then did it manually –Effort of creating a tope “pays off” at only 47 strings (further reuse is essentially “free”) –Every participant strongly preferred using our tools instead of doing tasks manually Introduction Requirements Topes Tools Evaluation Conclusion
41
41 Evaluating tope recommendations Quickly recommend existing tope for data at hand –Supports keyword-based search + search-by-match (eg: topes that match “888-555-1212”) –Evaluated by searching through topes for the 32 most common data categories in EUSES spreadsheet corpus, using strings from corpus –High accuracy: Recall over 80% (result set size = 5) –Adequate speed: User is likely to have a few dozen topes on computer, taking under 1 sec to search Introduction Requirements Topes Tools Evaluation Conclusion
42
42 Topes improve data validation Validating with topes improves –Accuracy of validation –Consistency of data formatting –Reusability of validation code Primary contributions: –Support for ambiguous data categories –Support for reformatting values –Platform-independent, reusable validation Introduction Requirements Topes Tools Evaluation Conclusion
43
43 Research approach 1.Understand users’ needs & context (empirically) 2.Identify a general abstract problem 3.Apply, adapt and extend methods and models of software engineering & human-computer interaction 4.Evaluate empirically; iterate w/ step 3 until adequate Introduction Requirements Topes Tools Evaluation Conclusion
44
44 Long-term goal: Improving the benefit/cost ratio of end-user programming Helping users automate larger tasks/computations –Finding reusable pieces of code –Repurposing and combining code Topes as “glue” Reducing cost of supporting end-user programming –Need appropriate software application architectures –May impact design and maintenance of applications –Requires partnership with software development companies to reach those 90 million end users Introduction Requirements Topes Tools Evaluation Conclusion
45
45 Thank You… To you for this opportunity to present To Oregon State University for leading EUSES To my advisor, Mary Shaw at Carnegie Mellon, and EUSES for great feedback To NSF for funding Introduction Requirements Topes Tools Evaluation Conclusion
46
46References For more information on end users and topes -End users’ counts and needs: VL/HCC’05, VL/HCC’07 -Topes model: ICSE’08 -Format inferrence: ICEIS’07 -Integration with other systems: WEUSE’08 & FSE’08 -Our latest tools + usability validation: ISEUD’09 & IUI’09 For more information on some related work -Dependent types, eg: X. Ou, Dynamic Typing with Dependent Types, Tech Rpt TR-695-04, Princeton Univ, 2004 -Regexp induction, eg: K. Lerman, S. Minton. Learning the Common Structure of Data, Proc. AAAI, 2000. -Lapis system : R. Miller, Lightweight structure in text, Tech Rpt CMU-CS-02-134, Carnegie Mellon Univ., 2002. -SWYN regexp editor : A. Blackwell, See What You Need: Helping End-users to Build Abstractions, JVLC, 2001. -Federated databases, eg: A. Sheth, J. Larsen, Federated database systems for managing distributed, heterogeneous, and autonomous databases, CSUR, 1990. -ETL Tools, eg: E. Rahn, H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bulletin, 2000. -Potter’s Wheel : V. Raman, J. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, VLDB, 2001. -Forms/3 : M. Burnett et al, End-user software engineering with assertions in the spreadsheet paradigm, ICSE, 2003. - -calculus : M. Erwig, M. Burnett, Adding Apples and Oranges. Symp. Practical Aspects of Declarative Lang., 2002. -Named entities, eg: Message Understanding Conference series. Introduction Requirements Topes Tools Evaluation Conclusion
47
47 Professional programmers use lots of tricks to simplify validation code. Eg: njtransit.com Split inputs into many easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field? Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!) I implemented this site in 2003. Introduction Requirements Topes Tools Evaluation Conclusion
48
48 Even with these tricks, writing validation is still very time-consuming. Overall, the site had over 1100 lines of JavaScript just for validation…. Plus equivalent server-side Java code (too bad code isn’t platform-independent) if (!rfcCheckEmail(frm.primaryemail.value)) return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address."); var atloc = frm.primaryemail.value.indexOf('@'); if (atloc > 31 || atloc < frm.primaryemail.value.length-33) return messageHelper(frm.primaryemail, "Sorry. You may only enter 32 characters or less for your email name\r\n”+ ”and 32 characters or less for your email domain (including @)."); Introduction Requirements Topes Tools Evaluation Conclusion
49
49 That was worst case. Best case: reusable regexps. Many IDEs allow the programmer to enter one regular expression for validating each input field. –Usually, this drastically reduces the amount of code, since most validation ain’t fancy. –Yet programmers don’t validate most inputs. Introduction Requirements Topes Tools Evaluation Conclusion
50
50 Users’ spreadsheets are rife with formatting inconsistencies & other typos In one study by Univ Nebraska, nearly 40% of spreadsheet cell values were strings (not numbers or dates). Part of an actual spreadsheet on Carnegie Mellon’s intranet Introduction Requirements Topes Tools Evaluation Conclusion
51
51 Evaluating expressiveness Implemented topes for common webform inputs –Instrumented web browsers of 4 administrative assistants for 3 weeks –Logged strings that they typed into forms – in a regexp-masked format e.g.: user0@XYZ.EDU [a-z]{4}[0-9]@[A-Z]{3}.[A-Z]{3} –Also logged strings nearby to textfields –Semi-automatically grouped strings into categories e.g.: project number, expense type, email address, zip code –Implemented 14 most common topes –Found 22 probable typos in user inputs (0.5%) Introduction Requirements Topes Tools Evaluation Conclusion
52
52 Tope Development Environment (TDE) Topei Module Infers tope from examples Toped Module Enables users to create/edit topes Topeg Module Generates context-free grammars and transformations Topep Module Parses data against grammars, performs transformations Plug-ins Read/write program data Robofox Web macros Vegemite/CoScripter Web macros Visual Studio.NET Web applications Microsoft Excel Spreadsheets … Introduction Requirements Topes Tools Evaluation Conclusion Repository Stores topes for sharing/reuse
53
53 As a tool builder, what do I have to do so that people can use topes in my tool? You need to make a plug-in 1.Figure out what kind of fields you want to help your users validate/reformat (eg: spreadsheets’ cells; webforms’ textboxes) 2.Download our open source C# or Java API (library) 3.In your tool’s UI, add buttons and other widgets so user can select a tope for the fields; in your event handler, call our API methods 4.At runtime, pass field’s value (a string) to our API methods to validate or reformat strings 5.Display validation error messages; update value in UI Introduction Requirements Topes Tools Evaluation Conclusion
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.