CACheck: Detecting and Repairing Cell Arrays in Spreadsheets

Presentation transcript:

CACheck: Detecting and Repairing Cell Arrays in Spreadsheets Wensheng Dou 2016-05-08

Spreadsheets are widely used: Spreadsheets are among the most widely used end-user development tools today (Microsoft Excel, WPS, Google Sheets). They are used for data storage, decision support, financial reporting, quality control, and more.

Spreadsheet errors matter! KPMG and Coopers & Lybrand have reported finding errors in more than 90% of spreadsheets. The European Spreadsheet Risks Group lists 72 publicly reported errors of up to $1 billion due to inadequate spreadsheets and/or spreadsheet controls [1]. [1] EuSpRIG Horror Stories, http://www.eusprig.org/horror-stories.htm

Motivating example: The spreadsheet contains incorrect formulas. An update on the incorrect formulas could cause faulty values in the spreadsheet. (Figure annotations: 4 → 6, 4 → 6, ...; should be 18.) This is a real example extracted from the EUSES spreadsheet corpus.

Problems: (Screenshots of the spreadsheet before and after the change.) No warning is issued by Excel. Q1: Which cells contain incorrect formulas? Q2: Which cells’ values are incorrect?

Key challenge: no oracle! It is hard to identify which cells contain incorrect formulas or values. Doing so requires human judgment or specifications.

Methodology: Cells are often grouped in a row or column with the same intended computation, for example Total Fruit = Apple + Orange or Total Price = Total Fruit * Price. We call such a group a cell array.

Methodology: The intended computation is ambiguous when not all the cells in a cell array follow the same formula pattern. Such a cell array suffers from an ambiguous computation smell.
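
As a rough illustration of what "following the same formula pattern" can mean, the sketch below rewrites each formula's references as offsets from the cell that hosts it and then compares the results. The helper names (`col_index`, `relative_pattern`, `distinct_patterns`) and the sample column are hypothetical and not taken from CACheck; absolute references (`$`) are treated like relative ones for simplicity.

```python
import re

# Matches A1-style references such as D6 or $E$7. Simplification: sheet names
# and function names that end in digits (e.g. LOG10) are not handled.
CELL_REF = re.compile(r"\$?([A-Z]+)\$?([0-9]+)")


def col_index(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def relative_pattern(formula, host_col, host_row):
    """Rewrite every reference as a row/column offset from the host cell."""
    def to_offset(match):
        d_col = col_index(match.group(1)) - col_index(host_col)
        d_row = int(match.group(2)) - host_row
        return f"R[{d_row}]C[{d_col}]"
    return CELL_REF.sub(to_offset, formula)


def distinct_patterns(cells):
    """cells: iterable of (column, row, content). Returns the set of patterns."""
    return {relative_pattern(content, col, row) for col, row, content in cells}


# Hypothetical column F in which one formula was overwritten by a constant.
column_f = [("F", 5, "=D5*E5"), ("F", 6, "=D6*E6"),
            ("F", 7, "12"), ("F", 8, "=D8*E8")]
patterns = distinct_patterns(column_f)
is_smelly = len(patterns) > 1  # more than one pattern: ambiguous computation
```

Here the three formulas collapse to one relative pattern while the constant stays distinct, so the column would be flagged under this simplified reading.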

CACheck overview: Spreadsheets → Cell Array Identification → Formula Pattern Recovery → Cell Array Filtering → Annotated spreadsheets (smells and errors). CACheck statically analyzes ambiguous computation smells.

How to identify cell arrays? Spreadsheets keep no record of cell arrays. What is the boundary of a cell array?

Cell array: Cells reference their input cells in a similar way. Data cells could reference any other cells. A cell array contains at least one formula.
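
One simplified way to read these criteria is sketched below for a single column: each formula cell is represented by its reference pattern, each data cell by `None` (compatible with any pattern), and a candidate is a maximal run of compatible cells containing at least one formula. The function name `candidate_cell_arrays`, the string patterns, and the "maximal run" rule are assumptions made for illustration; CACheck's actual identification rules may differ.

```python
def candidate_cell_arrays(column):
    """Return maximal index ranges (start, end), inclusive, of compatible cells.

    column: list where a formula cell is its reference pattern (any hashable
    value) and a data cell is None. A run is compatible when all of its
    formula cells share one pattern; it must contain at least one formula
    and at least two cells.
    """
    runs = []
    n = len(column)
    for start in range(n):
        pattern = None
        end = start
        for i in range(start, n):
            p = column[i]
            if p is not None:
                if pattern is None:
                    pattern = p
                elif p != pattern:
                    break  # a formula with a different pattern ends the run
            end = i
        if pattern is not None and end > start:
            runs.append((start, end))
    # Drop runs that are fully contained in another run.
    return [r for r in runs
            if not any(o != r and o[0] <= r[0] and r[1] <= o[1] for o in runs)]


# Hypothetical column: pattern "P", a data cell, then two cells with pattern "Q".
candidates = candidate_cell_arrays(["P", None, "Q", "Q"])
```

In this example the data cell at index 1 is compatible with both neighbours, so the two candidates `(0, 1)` and `(1, 3)` overlap. At most one of them can be a real cell array, which is how a relaxed criterion of this kind produces false positives.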

Possible false positives: Our relaxed cell array detection approach could possibly introduce false positives.

How to get the intended computation?

Finding candidates from existing formulas: = Di*Ei

Gaining confidence: Candidate pattern = Di*Ei. Example: 20 = D6*E6, with input values 5 and 4. Q: Is it likely the intended computation? A: Yes, if it computes the values of the majority of cells.

Conformance error detection: With the pattern = Di*Ei, a cell whose value is 12 ≠ D7*E7 (input values 6 and 3) is likely an error. Assumption: The values of cells are more likely correct than not.
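
The majority check and the conformance check can be sketched together as follows. The function name `check_candidate` and the last two rows of sample data are made up for illustration; only the values 5, 4, 20 and 6, 3, 12 come from the slides.

```python
def check_candidate(pattern, rows):
    """Return (coverage, indices of rows whose value the pattern does not compute).

    pattern: function taking a row's input values and returning the expected value.
    rows: list of (inputs, value) pairs, one per cell in the cell array.
    """
    bad = [i for i, (inputs, value) in enumerate(rows)
           if pattern(*inputs) != value]
    coverage = (len(rows) - len(bad)) / len(rows)
    return coverage, bad


# (D, E) inputs and the value currently stored in the output cell.
rows = [((5, 4), 20), ((6, 3), 12), ((2, 7), 14), ((3, 3), 9)]
coverage, suspects = check_candidate(lambda d, e: d * e, rows)

# The candidate is accepted when it computes the majority of the cells;
# the remaining cells are then reported as likely errors.
is_likely_intended = coverage > 0.5
```

With this data the pattern computes three of the four cells, so it is accepted, and the cell holding 12 (where 6 * 3 = 18 is expected) is the only suspect.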

What if we find multiple formula patterns? Pattern 1: = SUM(X2:X5), when X6, X7 = 0. Pattern 2: = X5+X6+X7, when X2, X3, X4 = 0. Here, X = {B, C}.

Constraints for intended formula pattern: The existing formula patterns (= SUM(X2:X5), when X6, X7 = 0; = X5+X6+X7, when X2, X3, X4 = 0) serve as likely specifications, and their operators (SUM, +) serve as likely computational components. The values in the cell array (input cells and output cells) serve as likely input-output pairs.

Synthesizing intended formula pattern: Adapt component-based program synthesis [1][2] to find the intended formula pattern. Basic idea: compose the given computational components (e.g., SUM and +) and generate a program that satisfies the specifications and input-output pairs. For our example, we can generate SUM(X2:X5) + X6 + X7. Candidates: c1: ret = SUM(X2:X5); c2: ret = SUM(X2:X5)+X6; c3: ret = X2+X3+X4; c4: ret = SUM(X2:X5)+X6+X7; cn: ret = …… [1] S. Jha, S. Gulwani, S.A. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. In ACM/IEEE 32nd International Conference on Software Engineering (ICSE), pages 215–224. 2010. [2] S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan. Synthesis of loop-free programs. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 62–73. 2011.
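
To make the role of the constraints concrete, the sketch below tests the four candidates named on the slide against the two likely specifications and a pair of input-output examples. It is a plain generate-and-test check, not the oracle-guided synthesis of [1]: the candidates are listed by hand, the specifications are only tested on a few probe inputs rather than proven, and all cell values and helper names are made up for illustration.

```python
def total(x, first_row, last_row):
    """SUM over rows first_row..last_row, where x[0] holds row 2."""
    return sum(x[first_row - 2:last_row - 1])


def cell(x, row):
    """Value of the cell in the given row, where x[0] holds row 2."""
    return x[row - 2]


CANDIDATES = {
    "SUM(X2:X5)": lambda x: total(x, 2, 5),
    "SUM(X2:X5)+X6": lambda x: total(x, 2, 5) + cell(x, 6),
    "X2+X3+X4": lambda x: cell(x, 2) + cell(x, 3) + cell(x, 4),
    "SUM(X2:X5)+X6+X7": lambda x: total(x, 2, 5) + cell(x, 6) + cell(x, 7),
}

# Likely specifications: (rows assumed to be 0, existing formula pattern).
SPECS = [
    ((6, 7), lambda x: total(x, 2, 5)),
    ((2, 3, 4), lambda x: cell(x, 5) + cell(x, 6) + cell(x, 7)),
]

# Hypothetical values of rows 2..7 in columns B and C, with the stored outputs.
IO_PAIRS = [([1, 2, 3, 4, 0, 0], 10), ([0, 0, 0, 5, 6, 7], 18)]
PROBES = [[3, 1, 4, 1, 5, 9]] + [inputs for inputs, _ in IO_PAIRS]


def zeroed(x, rows):
    """Copy of x with the given row numbers set to 0."""
    return [0 if i + 2 in rows else v for i, v in enumerate(x)]


def satisfies(candidate):
    io_ok = all(candidate(inputs) == out for inputs, out in IO_PAIRS)
    spec_ok = all(candidate(zeroed(p, rows)) == ref(zeroed(p, rows))
                  for rows, ref in SPECS for p in PROBES)
    return io_ok and spec_ok


consistent = [name for name, f in CANDIDATES.items() if satisfies(f)]
```

Only `SUM(X2:X5)+X6+X7` agrees with both existing patterns under their conditions and reproduces both columns' outputs, which matches the pattern the slide reports.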

Filter out FPs: Our relaxed cell array detection approach could report many false positives. Select a subset of all detected cell arrays that has more true positives and fewer false positives. (The figure's example contains 3 FPs.)

Filter out FPs, Rule 1: Cell arrays rarely overlap. An empirical study on EUSES [1] and Enron [2] shows that only 0.6% of cell arrays overlap, so only one of two overlapping candidates is selected. Rule 1: If two cell arrays overlap, only one could be true. [1] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms,” SIGSOFT Softw Eng Notes, vol. 30, no. 4, pp. 1–5, May 2005. [2] F. Hermans and E. Murphy-Hill, “Enron’s Spreadsheets and Related Emails: A Dataset and Analysis.” ICSE 2015.

Filter out FPs, Rule 2: To avoid missing true cell arrays, the set of selected cell arrays should also be maximized. In the figure's example, all of them are selected. Rule 2: The set of selected cell arrays is maximized.

Filter out FPs, Rule 3: Because the cells of a false-positive cell array are grouped in an unreasonable way, they cannot easily be covered by the formula pattern, which shows up as wrong data. Rule 3: The set of selected cell arrays has minimal errors. (Figure labels: no error; 2 errors.)
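
The three rules can be combined in a small brute-force sketch. How the rules are prioritized here is an assumption: overlapping candidates are never selected together (Rule 1), only selections that cannot be extended are kept (Rule 2), and among those the one with the fewest errors wins (Rule 3). The function name, the candidate data, and the exhaustive search are illustrative only; the search is exponential in the number of candidates.

```python
from itertools import combinations


def select_cell_arrays(candidates):
    """candidates: dict name -> (frozenset of cell addresses, error count)."""
    names = sorted(candidates)

    def disjoint(subset):
        # Rule 1: no two selected cell arrays may share a cell.
        return all(candidates[a][0].isdisjoint(candidates[b][0])
                   for a, b in combinations(subset, 2))

    valid = [set(s) for k in range(len(names) + 1)
             for s in combinations(names, k) if disjoint(s)]
    # Rule 2: keep only selections that no other valid selection extends.
    maximal = [s for s in valid if not any(s < t for t in valid)]
    # Rule 3: prefer the selection with the fewest detected errors.
    return min(maximal, key=lambda s: sum(candidates[n][1] for n in s))


# Hypothetical candidates: A and B overlap on F3, C is independent.
candidates = {
    "A": (frozenset({"F1", "F2", "F3"}), 0),
    "B": (frozenset({"F3", "F4", "F5"}), 2),
    "C": (frozenset({"F6", "F7", "F8"}), 0),
}
selected = select_cell_arrays(candidates)
```

Both `{A, C}` and `{B, C}` satisfy the first two rules here; the error count decides in favour of `{A, C}`, so `B` is treated as the false positive.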

CACheck implementation: Annotate the smells in the resulting spreadsheets.

Evaluation: Experimental subject: EUSES [1]. RQ1: How common are ambiguous computation smells in real-life spreadsheets? RQ2: Can CACheck detect ambiguous computation smells precisely? RQ3: Are ambiguous computation smells harmful? [1] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms,” SIGSOFT Softw Eng Notes, vol. 30, no. 4, pp. 1–5, May 2005.

How common? (RQ1) 15.5% of cell arrays suffer from ambiguous computation smells.

Category     Cell arrays (CA)   Smelly cell arrays (SCA)   SCA / CA
cs101                      39                         12      30.8%
database                3,271                        448      13.7%
filby                    n.a.
financial               7,008                      1,259      18.0%
forms3                    150                         16      10.7%
grades                  2,955                        666      22.5%
homework                2,702                        343      12.7%
inventory               3,903                        517      13.2%
jackson
modeling                2,018                        182       9.0%
personal                  131                          0       0.0%
Total                  22,177                      3,443      15.5%

Is CACheck precise? (RQ2) Coverage gives the percentage of cells that can be computed by the intended formula pattern. For a coverage threshold of 70%, the experimental precision is 86.8%.

Coverage        SCA      TP    TP/SCA
100%          1,184   1,092    92.23%
[90%, 100%)     152     112    73.68%
[80%, 90%)      164     117    71.34%
[70%, 80%)       97      65    67.01%
[60%, 70%)      406      74    18.23%
[50%, 60%)    1,042      76     7.29%
[0%, 50%)       398      50    12.56%
Total         3,443   1,586    46.06%
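
The 86.8% figure can be reproduced from the table by pooling every coverage bucket at or above the threshold. The sketch below does that arithmetic; the bucket encoding (each bucket represented by its lower bound) and the function name are choices made here for illustration.

```python
# (lower bound of the coverage bucket in %, SCA, TP), copied from the table.
BUCKETS = [
    (100, 1184, 1092),
    (90, 152, 112),
    (80, 164, 117),
    (70, 97, 65),
    (60, 406, 74),
    (50, 1042, 76),
    (0, 398, 50),
]


def precision_at(threshold):
    """Precision over all smelly cell arrays whose coverage bucket starts at or above threshold."""
    sca = sum(s for low, s, _ in BUCKETS if low >= threshold)
    tp = sum(t for low, _, t in BUCKETS if low >= threshold)
    return tp / sca


precision_70 = round(precision_at(70) * 100, 1)   # 1,386 / 1,597
precision_all = round(precision_at(0) * 100, 2)   # 1,586 / 3,443
```

At the 70% threshold this pools 1,386 true positives out of 1,597 reported smelly cell arrays, and with no threshold it gives the 46.06% in the Total row.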

Are ambiguous computation smells harmful? (RQ3) CACheck detects 5,553 cells with wrong data. 1,458 cells were confirmed.

Coverage       Detected wrong cells   Confirmed wrong cells
100%                              0                       0
[90%, 100%)                     226                     131
[80%, 90%)                      398                     323
[70%, 80%)                      262                     201
[60%, 70%)                      749                     294
[50%, 60%)                    1,496                     218
[0%, 50%)                     2,422                     291
Total                         5,553                   1,458

Summary: The cells in a cell array have the same computational semantics. Ad-hoc modification introduces ambiguous computation smells. CACheck detects and repairs ambiguous computation smells. Evaluation on EUSES: ambiguous computation smells are common and harmful.

Thank you!