Topes: Reusable Abstractions for Validating Data
Christopher Scaffidi, Brad Myers, Mary Shaw
Carnegie Mellon University
2 Even when lives are at stake, people still make typos. Hurricane Katrina “Person Locator” Web site Problem Topes Validation Conclusion
3 Data errors reduce the usefulness of data.
Wrong data category
Questionable input
Incorrect formatting
4 The website creators omitted input validation.
Primary reason: rejecting obviously wrong inputs would prevent collecting questionable data
– E.g.: Would you accept a city with 1 letter?
This is the UI code for the web form where users entered data for this website. A RAD tool called CodeCharge Studio was used to create the UI.
5 This site was not alone in lacking input validation.
E.g.: Google Base web application
– 13 primary web forms
– Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
E.g.: Spreadsheets
– 40% of cells are non-numeric, non-date textual data
– Commonly used to gather and organize textual data for reports
6 Validation of these short human-readable strings must support…
Testing membership in a data category
– Categories based on standards (e.g.: address)
– Categories lacking standards (e.g.: city name)
Ambiguously defined categories
– Identify questionable values for double-checking
Multiple formats
– Format consistency, post-validation
Platform-independent implementation
– Reuse in webapps, spreadsheets, others
7 Limitations of existing approaches
Types do not support questionable values
Grammars do not, either, nor can they reformat
Information extraction algorithms rely on grammatical cues that are absent during validation
Cues, Forms/3, -calculus, Slate, pollution markers, etc., infer numerical constraints but not constraints on strings, nor are they platform-independent
8 New Approach: Topes
A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data
Named after the Greek word for “place,” because each tope corresponds to a data category with a natural place in the problem domain
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
9 A tope is a graph.
Node = format, edge = transformation
Notional representation for a CMU room number tope…
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Colloquial building name & room number: Smith 225
– Building abbreviation & room number: EDSH 225
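The graph view can be sketched in Python as a table of edges keyed by (source format, target format). This is only an illustration: the format names, the `BUILDINGS` lookup table, the helper functions, and the choice of which edges exist are assumptions, not the authors' implementation. Only the three example strings come from the slide.

```python
# Hypothetical lookup table: building abbreviation -> (formal name, colloquial name).
BUILDINGS = {"EDSH": ("Elliot Dunlap Smith Hall", "Smith")}

def split_room(value):
    """Split 'building room' into (building, room) at the last space."""
    building, _, room = value.rpartition(" ")
    return building, room

def formal_to_abbrev(value):
    """Edge: formal building name & room -> building abbreviation & room."""
    building, room = split_room(value)
    for abbrev, (formal, _colloquial) in BUILDINGS.items():
        if formal == building:
            return f"{abbrev} {room}"
    raise ValueError(f"unknown building: {building!r}")

def abbrev_to_colloquial(value):
    """Edge: building abbreviation & room -> colloquial building name & room."""
    abbrev, room = split_room(value)
    if abbrev not in BUILDINGS:
        raise ValueError(f"unknown abbreviation: {abbrev!r}")
    return f"{BUILDINGS[abbrev][1]} {room}"

# Nodes are format names; each edge maps (source, target) to a transformation.
EDGES = {
    ("formal", "abbrev"): formal_to_abbrev,
    ("abbrev", "colloquial"): abbrev_to_colloquial,
}

def transform(value, path):
    """Apply the transformations along a path of format names."""
    for src, dst in zip(path, path[1:]):
        value = EDGES[(src, dst)](value)
    return value
```

For example, `transform("Elliot Dunlap Smith Hall 225", ["formal", "abbrev", "colloquial"])` walks two edges and yields "Smith 225".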
10 A tope is a conceptual abstraction. A tope implementation is code.
Each tope implementation has executable functions:
– One isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
– Zero or more trf: string → string functions linking formats, for transforming values from one format to another
Validation function: φ(str) = max_f isa_f(str), where f ranges over the tope’s formats
– Valid when φ(str) = 1
– Invalid when φ(str) = 0
– Questionable when 0 < φ(str) < 1
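A minimal Python sketch of the validation function (called `phi` in the code) as the maximum over per-format isa functions. The two isa functions, their regular expressions, the known-building sets, and the 0.5 score for plausible but unknown buildings are all illustrative assumptions.

```python
import re

# Hypothetical sets of known buildings for a room-number tope.
KNOWN_CODES = {"EDSH"}
KNOWN_NAMES = {"Smith"}

def isa_abbrev(s):
    """Score for the 'abbreviation & room' format: 1.0 known, 0.5 plausible, 0.0 no match."""
    m = re.fullmatch(r"([A-Z]{2,4}) (\d{1,4}[A-Z]?)", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_CODES else 0.5

def isa_colloquial(s):
    """Score for the 'colloquial name & room' format, on the same scale."""
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{1,4}[A-Z]?)", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_NAMES else 0.5

ISAS = [isa_abbrev, isa_colloquial]

def phi(s, isas=ISAS):
    """Validation function: the best score over all of the tope's formats."""
    return max(isa(s) for isa in isas)

def classify(s, isas=ISAS):
    """Map the score to the three outcomes listed on the slide."""
    score = phi(s, isas)
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"
```

Taking the maximum means a string only has to match one format well to be accepted, which is what lets a single tope cover several formats.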
11 Common kinds of topes: enumerations and proper nouns
Multi-format enumerations, e.g.: US states
– “New York”, “CA”, maybe “Guam”
Open-set proper nouns, e.g.: company names
– Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
– Augmented with a pattern for promising inputs that are not yet on the whitelist
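Both kinds can be sketched in Python as scoring functions. For brevity each function below folds all formats of its category into one score, rather than using one isa per format. The tiny state table, the reading of “maybe Guam” as a questionable value, the company pattern, and the 0.5 scores are assumptions made for illustration.

```python
import re

# Multi-format enumeration: a small illustrative subset of US states.
STATE_NAMES = {"New York": "NY", "California": "CA"}
STATE_CODES = set(STATE_NAMES.values())
QUESTIONABLE_STATES = {"Guam", "GU"}  # assumption: treated as questionable, not valid

def isa_state(s):
    """1.0 for a listed name or code, 0.5 for a borderline entry, else 0.0."""
    if s in STATE_NAMES or s in STATE_CODES:
        return 1.0
    if s in QUESTIONABLE_STATES:
        return 0.5
    return 0.0

# Open-set proper noun: whitelist plus a pattern for promising unknown names.
COMPANY_WHITELIST = {"Google", "Google Corp", "GOOG"}
COMPANY_PATTERN = re.compile(r"[A-Z][\w&.-]*( [A-Z][\w&.-]*)*")

def isa_company(s):
    """1.0 if whitelisted, 0.5 if it merely looks like a company name, else 0.0."""
    if s in COMPANY_WHITELIST:
        return 1.0
    if COMPANY_PATTERN.fullmatch(s):
        return 0.5
    return 0.0
```

Here an unlisted but capitalized name such as "Acme Widgets" scores 0.5, so it is flagged for double-checking rather than rejected.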
12 Two more common kinds of topes: numeric and hierarchical
Numeric, e.g.: human masses
– Numeric and in a certain range
– Values slightly outside range might be questionable
– (Very rarely) labeled with an explicit unit
– Transformation usually by multiplication
Hierarchical, e.g.: address lines
– Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps.
– Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
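A rough Python sketch of one numeric and one hierarchical tope format. The mass bounds, the 0.3 and 0.5 scores, the street-type list, and the address regular expression are hypothetical values chosen for the example.

```python
import re

def isa_mass_kg(s):
    """Numeric isa: 1.0 inside a typical range, 0.3 slightly outside it, else 0.0."""
    try:
        value = float(s)
    except ValueError:
        return 0.0
    if 2.0 <= value <= 200.0:   # hypothetical "certain range"
        return 1.0
    if 0.0 < value <= 400.0:    # hypothetical questionable band
        return 0.3
    return 0.0

def kg_to_lb(s):
    """Numeric trf: transformation by multiplication (kilograms to pounds)."""
    return f"{float(s) * 2.20462:.1f}"

# Hierarchical isa: number + proper noun + enumerated street type.
STREET_TYPES = {"St.", "Ave.", "Rd.", "Blvd."}
ADDRESS_LINE = re.compile(r"(\d{1,6}) ([A-Z][a-z]+(?: [A-Z][a-z]+)*) (\S+)")

def isa_address_line(s):
    """1.0 if all parts fit, 0.5 if only the street type is unfamiliar, else 0.0."""
    m = ADDRESS_LINE.fullmatch(s)
    if not m:
        return 0.0
    return 1.0 if m.group(3) in STREET_TYPES else 0.5
```

A fuller hierarchical implementation would delegate each captured part to the isa of its own tope; the regular expression here stands in for that composition.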
13 Formal tool demonstration on Friday
Features:
– Format inference
– Format/part names
– Soft constraints
– Testing features
– Format reusability
14 Formal tool demonstration on Friday
Microsoft Excel: buttons and menus
Visual Studio: drag-and-drop code generation
15 Evaluating accuracy, reusability, and usefulness for data cleaning
Implemented topes for spreadsheet data
– 32 topes based on 720 online spreadsheets
– Tested accuracy
Reused topes on web application data
– 8 data categories in Google Base and 5 data categories in Hurricane Katrina site
– Tested accuracy
Used transformations to reformat data
– 5 data categories in Hurricane Katrina site
– Measured increase in number of duplicates identified
16 Extracting spreadsheet test data
Clustered spreadsheet columns based on data category
– EUSES spreadsheet corpus “database” section
– Hierarchical agglomerative clustering
– Manual inspection
– Result = 1713 columns in 246 clusters (1 cluster per data category)
Created 1 tope for each of the 32 most common categories
– Yielding 32 topes
– Covered 70% of clustered columns
17 We considered 5 validation strategies
Strategy 1: Current spreadsheet practice (accept all inputs)
Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs)
– 36 regexps + 35 fixed lists, in 7 categories
Strategy 3A: Tope rejecting questionable (accept when φ(str) = 1)
Strategy 3B: Tope accepting questionable (accept when φ(str) > 0)
Strategy 4: Tope warn on questionable (simulate double-check by user when 0 < φ(str) < 1)
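The five strategies differ only in how they turn a score into an accept/reject decision, which the following Python sketch makes explicit. The function name and its parameters (`phi_score`, `baseline_ok`, `user_says_valid`) are hypothetical, and the user's double-check in Strategy 4 is modeled as a boolean supplied by the caller.

```python
def accept(strategy, phi_score=None, baseline_ok=None, user_says_valid=None):
    """Return True if the input is accepted under the given strategy.

    phi_score       -- tope validation score in [0, 1] (strategies 3A, 3B, 4)
    baseline_ok     -- regexp / fixed-list verdict, or None if none is available (strategy 2)
    user_says_valid -- simulated answer from the user's double-check (strategy 4)
    """
    if strategy == "1":    # current spreadsheet practice: accept everything
        return True
    if strategy == "2":    # current webapp practice: use the check only when one exists
        return True if baseline_ok is None else baseline_ok
    if strategy == "3A":   # tope, rejecting questionable inputs
        return phi_score == 1
    if strategy == "3B":   # tope, accepting questionable inputs
        return phi_score > 0
    if strategy == "4":    # tope, asking the user about questionable inputs
        if 0 < phi_score < 1:
            return bool(user_says_valid)
        return phi_score == 1
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With a questionable score of 0.5, strategy 3A rejects, 3B accepts, and 4 defers to the simulated user.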
18 Measurements
Based on 100 random values per category
Used F1 to measure accuracy
– Standard measure of accuracy for classifiers: F1 = (precision * recall) / avg(precision, recall)
Considered topes with 1, 2, 3, 4, or 5 formats
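The F1 formula above is the harmonic mean of precision and recall, and can be computed as in this Python sketch. Treating “invalid input” as the positive class and returning 0.0 when a ratio is undefined are assumptions of the sketch; the function names are hypothetical.

```python
def precision_recall(predicted_invalid, actually_invalid):
    """Precision and recall for flagging invalid inputs (positive class = invalid)."""
    true_pos = sum(p and a for p, a in zip(predicted_invalid, actually_invalid))
    flagged = sum(predicted_invalid)
    invalid = sum(actually_invalid)
    precision = true_pos / flagged if flagged else 0.0   # convention when nothing is flagged
    recall = true_pos / invalid if invalid else 0.0      # convention when nothing is invalid
    return precision, recall

def f1(precision, recall):
    """F1 = (precision * recall) / avg(precision, recall)."""
    if precision + recall == 0:
        return 0.0
    return (precision * recall) / ((precision + recall) / 2)
```

An accept-all validator flags nothing, so its recall on invalid inputs is 0 and its F1 is 0 under these conventions.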
19 Recognizing multiple formats and questionable inputs raises accuracy
Strategy 4: Hypothetical user has to help on ~3% of inputs
Strategy 1: Recall = 0 (fails to identify any invalid inputs)
20 Topes based on spreadsheet data were accurate on web application data.
[Chart panels: Hurricane Katrina; Google Base]
21 Putting data in a consistent format improves duplicate identification.
Randomly extracted values for each of 5 Hurricane Katrina data categories
Implemented transformations for each 5-format tope, from the less commonly used formats to the most commonly used format
Found approximately 8% more duplicates after transformation
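The effect of reformatting on duplicate detection can be illustrated with a toy Python example. The sample values, the `NAME_TO_CODE` table, and the helper names are made up for illustration and are not the Hurricane Katrina data or the authors' transformations.

```python
from collections import Counter

# Hypothetical transformation table: less common format -> most common format.
NAME_TO_CODE = {"Louisiana": "LA", "Mississippi": "MS"}

def to_most_common_format(value):
    """Rewrite a value into the most common format; leave it unchanged otherwise."""
    return NAME_TO_CODE.get(value, value)

def count_duplicates(values):
    """Count values that exactly repeat an earlier value."""
    return sum(n - 1 for n in Counter(values).values())

raw = ["LA", "Louisiana", "MS", "Mississippi", "LA"]   # made-up sample
normalized = [to_most_common_format(v) for v in raw]

duplicates_before = count_duplicates(raw)
duplicates_after = count_duplicates(normalized)
```

Exact string matching finds only the repeated "LA" in the raw list; after normalization, the spelled-out names collapse onto their codes and more duplicates become visible.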
22 Topes improve data validation
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
Contributions:
– Support for ambiguous data categories
– Support for transforming values
– Platform-independent validation
23 Future Work: Sharing topes
Repository search mechanisms based on
– Relevance to new applications
– Quality criteria
Integrate with more programming platforms
– Microsoft Excel
– Microsoft Visual Studio.NET
– A simple XML processing API
– Univ. Nebraska’s Robofox
– IBM’s CoScripter
– Your tool or platform?
24 Thank You…
To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret Burnett, and others at ICSE 2007 for early feedback
To NSF for funding
To ICSE 2008 for this opportunity to present