VEnron A Versioned Spreadsheet Corpus and Related Evolution Analysis

Slides:



Advertisements
Similar presentations
MOSS 2007 Document Management Adam McCarthy 1 st April 2009.
Advertisements

XP New Perspectives on Microsoft Office Excel 2003, Second Edition- Tutorial 7 1 Microsoft Office Excel 2003 Tutorial 7 – Working With Excel’s Editing.
Microsoft Excel 2003 Illustrated Complete Excel Files and Incorporating Web Information Sharing.
1 of 4 This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. © 2007 Microsoft Corporation.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Industrial Application.
1 New : Create your own message starting from scratch 2 New From Template: add professionally designed templates provided exclusively by Gorilla Contact.
Chapter 5 Application Software.
Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall.
ORGANIZING AND STRUCTURING DATA FOR DIGITAL PROJECTS Suzanne Huffman Digital Resources Librarian Simpson Library.
Solutions Summit 2014 Discrepancy Processing & Resolution Terri Sullivan.
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice SISP Training Documentation Template.
Karen Herter (HMG) Mike Langley (DGS) April 15, 2008 Portfolio Manager for California State Buildings Meeting the Requirements of Executive Order S
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Applying Clone.
© Paradigm Publishing Inc. 5-1 Chapter 5 Application Software.
IPortal Bringing your company and your business partners together through customized WEB-based portal software. SanSueB Software Presents iPortal.
Process Refactoring Michael L. Collard, Ph.D.. Real World Often ad hoc with no process Different levels of developers knowledge, experience, and capabilities.
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice SISP 6.1 Delta Training Documentation.
90% U.S. corporations currently engaged in litigation 147 Average number of active lawsuits for $1B+ companies $1M A verage per case cost of eDiscovery.
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
Is Spreadsheet Ambiguity Harmful? Detecting and Repairing Spreadsheet Smells due to Ambiguous Computation Wensheng Dou 1, Shing-Chi Cheung 2, Jun Wei 1.
CIS-NG CASREP Information System Next Generation Shawn Baugh Amy Ramirez Amy Lee Alex Sanin Sam Avanessians.
Orders – Create Responses Boeing Supply Chain Platform (BSCP) Detailed Training July 2016.
CACheck: Detecting and Repairing Cell Arrays in Spreadsheets
Software Application Overview
Mail Merge for Lotus Notes and Excel User Guide
GovDelivery® & Digital Subscription Management:
Using E-Business Suite Attachments
UPS – Tradeshift Supplier Training
Version Control.
Detecting Table Clones and Smells in Spreadsheets
Module 4.2: Calculations.
CSE 303 Concepts and Tools for Software Development
Statistical Analysis with Excel
Good Morning  Place your bags under the dry erase board at the front left side of the room and be seated. Turn around and face the back ready for announcements.
Lesson 9 Sharing Documents
Good Morning  Place your bags under the dry erase board at the front left side of the room and be seated. Turn around and face the back ready for announcements.
Materials Engineering Product Data Management (ePDM)
Introduction to SharePoint
Facilitator: Aiman Saleem
RELATIONAL DATABASE MODEL
Lesson 9 Sharing Documents
CSM System ( Customer Service Management System)
Science Fair Project Due:
What is a Database and Why Use One?
Statistical Analysis with Excel
Overview of Managing Access for a Contract
Automation in an XML Authoring Environment
PPS/OPTRS Departmental Roles Structure System
Statistical Analysis with Excel
Chapter 1 Database Systems
Global Enterprise Search
SDMX: A brief introduction
A Comprehensive Study on Real World Concurrency Bugs in Node.js
Detecting Faulty Empty Cells in Spreadsheets
Health-e Claims July 2007.
Software Metrics “How do we measure the software?”
Sharing of Eurostat predefined tables
Chapter 13 Quality Management
Expandable Group Identification in Spreadsheets
How Are Spreadsheet Templates Used in Practice: A Case Study on Enron
Microsoft Office Excel 2003
Tracking Status Monitor System
Unit# 5: Internet and Worldwide Web
Sharing of Eurostat predefined tables
COMP 208/214/215/216 – Lecture 7 Documenting Design.
Lesson 14 Sharing Documents
Chapter 1 Database Systems
Unit J: Creating a Database
Purchase Document Management
Day 1: Getting Started with Microsoft Excel 2010
Presentation transcript:

VEnron A Versioned Spreadsheet Corpus and Related Evolution Analysis ICSE 2016-Software Engineering in Practice VEnron A Versioned Spreadsheet Corpus and Related Evolution Analysis Wensheng Dou, Liang Xu, Shing-Chi Cheung, Chushu Gao, Jun Wei, Tao Huang

A spreadsheet usage scenario Service price in May Enter data Write formula Alice Share Bob

Create a spreadsheet for June Service price in May Update data Modify formulas Bob Service price in June Share Alice

Similar to software development! Retrieve Edit Share Developers Users

Version management is important Software development Developer Developer Software code is well maintained by version control tools.

Version management is missing Spreadsheet development User User However, spreadsheets are rarely maintained by version control tools, like SVN and Git.

Do spreadsheet versions matter? Average lifetime Average users on a spreadsheet 1 F. Hermans, M. Pinzger, and A. van Deursen, ICSE’11

Do spreadsheet versions matter? Find refactoring opportunities 31 days in May 31 & 30 can be changed to a cell A1, which has the number of days 30 days in June … real spreadsheets from VEnron

Do spreadsheet versions matter? Find inconsistency Similar Inconsistency The formula is missing … real spreadsheets from VEnron

Facilitate future scientific studies on spreadsheet evolution Our goal VEnron Build an industrial-scale and publicly available spreadsheet evolution corpus Facilitate future scientific studies on spreadsheet evolution

Why VEnron? Independent spreadsheets! No version information! Large spreadsheet corpora so far EUSES1 ~4,500 spreadsheets Obtained by searching on Google FUSE2 ~250,000 spreadsheets Extracted from ~27 billion web pages Enron3 ~15,000 spreadsheets Extracted from an email archive in the Enron corporation Independent spreadsheets! No version information! 1 M. Fisher and G. Rothermel, WEUSE’05 2 T. Barik, K. Lubick, J. Smith, J. Slankas, and E. Murphy-Hill, MSR’15 3 F. Hermans, M. Pinzger, and A. van Deursen, ICSE’15

Why VEnron? The change history of spreadsheets is rarely documented 90+ million end-user programmers with no comment tracking or version control1 Few companies may use version management tools (e.g., SharePoint, SpreadGit, Github) for spreadsheets, they are unwilling to share them due to business confidentiality 1 P. Durusau and S. Hunting, The Markup Conference’15

Version information is not available in spreadsheets Version information is not available in spreadsheets! Why can we build VEnron?

How are spreadsheets exchanged? Spreadsheet development User User Emails are a common way to exchange spreadsheets

Emails provide version information! Emails can (partially) provide the history of spreadsheets Sender Time Receivers Previous version Spreadsheet

How can we build VEnron?

Why choose the Enron email archive? We choose the Enron email archive as our subject1 Publicly available Real emails and spreadsheets in the Enron corporation Large ~750,000 emails ~15,000 unique spreadsheets 1 http://info.nuix.com/Enron.html

Approach overview evolution group Enron emails Cluster spreadsheets evolution group A group of spreadsheets that are originated from the same spreadsheet Recover version orders VEnron

How to cluster spreadsheets? Manually check ~15,000 spreadsheets and their related emails? Unclear semantics in these spreadsheets Unknown social network among users Mission Impossible!

A viable way to cluster spreadsheets Cluster spreadsheets into different groups according to spreadsheet filename similarity Spreadsheets in an evolution group often have the same shortened filename by deleting version-related substrings. Version id Spreadsheet filename v1 May00_FOM_Req2.xls v2 Jun00_FOM_Req.xls v3 July00_FOM_Req.xls v4 July00_FOM_Req02.xls v5 Aug00_FOM_Req.xls ID for versions Date for versions Shortened names

A viable way to cluster spreadsheets The clustering approach may misjudge situations, and wrongly cluster spreadsheets Manually validated each group, and adjusted each group accordingly Key idea: check whether all spreadsheets in a group share similar table structures Similarity on worksheet names? Similarity on the structure of corresponding worksheets?

Recover version orders (1) Manually recover version orders by using spreadsheet-related version information Timestamps indicate version orders Spreadsheet filenames, worksheet names, spreadsheet tables Version id Spreadsheet filename Worksheet names v1 May00_FOM_Req2.xls FOM May Storage v2 Jun00_FOM_Req.xls FOM Jun Storage v3 July00_FOM_Req.xls FOM Jul Storage v4 July00_FOM_Req02.xls v5 Aug00_FOM_Req.xls FOM Aug Storage

Recover version orders (2) The email sending history S S’ Alice Bob Spreadsheets S and S’ belong to the same evolution group

Recover version orders (3) Email contents may clearly describe where the current spreadsheet come from “The attached file is an update of CEF FOM June'00 request that was transmitted on 5/23/00” Previous version

Recover version orders (4) Timestamps v1 v2 v3 v4 Email sending history Total order Email contents

How about VEnron?

Statistics of evolution groups 7,294 spreadsheets and 35,373 worksheets in total 83% groups have more than 5 versions #group #spreadsheet

Statistics - More 78% groups are maintained by more than 1 user #group

Find more statistics in the paper… Statistics - More Count the number of Excel built-in errors E.g., #Div/0!, #N/A!, etc. 16.9% groups introduce new errors during evolution #error #version id Find more statistics in the paper…

Takeaway Have a try! http://sccpu2.cse.ust.hk/venron/ VEnron: The first industrial-scale and publicly available spreadsheet evolution corpus 360 evolution groups, including 7,294 spreadsheets and 35,373 worksheets Initial statistics (evidence) have shown that VEnron contains interesting evolution, and could be a basis for future studies on spreadsheet evolution Have a try! http://sccpu2.cse.ust.hk/venron/

Thank you!