Planted-model evaluation of algorithms for identifying differences between spreadsheets
Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi

Presentation transcript:

1 Planted-model evaluation of algorithms for identifying differences between spreadsheets Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi School of Electrical Engineering and Computer Science Oregon State University

2 Spreadsheets as a hub for work
Collecting, organizing, analyzing, and visualizing data
Frequently shared among people in the organization
– Who then edit the spreadsheets
And then share the new versions
– To other people who then reuse and edit them…
⇒ Proliferation of spreadsheets
– People choose among which spreadsheets to reuse
– Auditors may need to determine who made changes to which cells (that contain errors)
Background → Algorithm → Evaluation → Conclusions

3 Should I reuse Spreadsheet A or B?
[Diagram: Spreadsheet X is edited by Bob and by Alice, producing Spreadsheet A and Spreadsheet B]

4 Existing features for understanding spreadsheet differences
TellTable, as well as Excel change tracking
– Show differences between X and direct descendant A
– We need to compare A vs B
DiffEngineX, Synkronizer, Suntrap, SheetDiff
– Direct comparison of any A vs any B
– Somewhat inaccurate at recovering intervening edits (error rates of 2-12% at the cell level, and even higher at the row/column level, on 8 real spreadsheet pairs from the EUSES corpus)

5 Example of an error (Synkronizer)
Actual edits: insert B’s second column (“c”, “g”, …), insert B’s second row (“d”, “d”, “d”), change B’s A3 from “d” to “e”
Note and apologies: This figure is referenced but missing in the printed proceedings. (It’s my fault: I accidentally deleted it during the final round of edits.)

6 Outline of this talk
Background
Algorithm
Evaluation
Conclusions

7 New algorithm concept
Find a “target alignment” of cells that are nearly identical
– i.e., find what A and B have in common
All remaining differences are attributable to edits
– Specifically, row/column insertions in A or B, or cell-level edits within the target alignment cells

8 Target alignment concept
An alignment with only 1 cell-level edit out of 14 cells

9 Starting point for a specific algorithm: LCS (longest common subsequence) in 1D
[Figure: example letter sequences; cell contents as extracted: fcadbaefcadbaed]
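
As a rough illustration of the 1D building block named on this slide, here is a minimal sketch of the standard dynamic program for the length of a longest common subsequence. It is not the authors' code; the function name `lcs_length` and the example strings are assumptions chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Standard O(len(a) * len(b)) dynamic program, keeping one row at a time.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before x and y.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip x or skip y, whichever leaves the longer subsequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Illustrative strings (not taken from the slide's figure).
print(lcs_length("ABCBDAB", "BDCABA"))
```

Each cell of a spreadsheet row plays the role of one symbol, so the same function applies to a row given as a list of cell values.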

10 Let’s think in terms of aligning rows (put off thinking about columns for a moment)

11 Insight: Match up rows based on the length of their LCS (1D)
[Figure: two example spreadsheets; cell contents as extracted: dfdcbafdabaaee / dcfegcbaafadafbagaegeddd]
A good alignment: ∑ equals 12

12 Insight: Match up rows based on the length of their LCS (1D)
[Figure: two example spreadsheets; cell contents as extracted: dfdcbafdabaaee / dcfegcbaafadafbagaegeddd]
A better alignment (maximal, actually): ∑ equals 13

13 Summary of algorithm
Given spreadsheets A and B, compute target alignment, then generate a list of edits A → B

14 Summary of algorithm
Given spreadsheets A and B, compute target alignment, then generate a list of edits A → B
1. Use dynamic programming to choose which rows to include in the target alignment
– Argmax ∑ LCS1D(rows retained in A, rows retained in B), where the ∑ is over rows
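
One plausible reading of step 1 can be sketched as a weighted alignment: pair up rows of A and B in order, so that the sum of per-pair 1D LCS lengths is as large as possible. This is a sketch under that assumption, not the authors' RowColAlign implementation; the names `lcs_length` and `align_rows` and the tiny example sheets are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def align_rows(A, B):
    """Choose order-preserving row pairs (i, j) maximizing the summed LCS length.

    Returns (total, pairs). Rows left out of `pairs` are the ones not retained.
    """
    n, m = len(A), len(B)
    # w[i][j]: how well row i of A matches row j of B.
    w = [[lcs_length(A[i], B[j]) for j in range(m)] for i in range(n)]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j],                        # drop row i-1 of A
                score[i][j - 1],                        # drop row j-1 of B
                score[i - 1][j - 1] + w[i - 1][j - 1],  # pair the two rows
            )
    # Trace back through the table to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        gain = w[i - 1][j - 1]
        if gain > 0 and score[i][j] == score[i - 1][j - 1] + gain:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Illustrative sheets: B is A with one extra row inserted at index 1.
A = ["abcd", "efgh", "ijkl"]
B = ["abcd", "xxxx", "efgh", "ijkl"]
total, pairs = align_rows(A, B)
print(total, pairs)
```

Transposing both sheets and calling the same function gives the corresponding column choice.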

15 Summary of algorithm
Given spreadsheets A and B, compute target alignment, then generate a list of edits A → B
1. Use dynamic programming to choose which rows to include in the target alignment
2. Do the same with A and B to choose columns
– Argmax ∑ LCS1D(cols retained in A, cols retained in B), where the ∑ is over columns

16 Summary of algorithm
Given spreadsheets A and B, compute target alignment, then generate a list of edits A → B
1. Use dynamic programming to choose which rows to include in the target alignment
2. Do the same with A and B to choose columns
3. For each row or column not chosen for target alignment
– If it’s in B (i.e., not A), then represent as an insert
– Else (it’s in A, not B), represent as a delete

17 Summary of algorithm
Given spreadsheets A and B, compute target alignment, then generate a list of edits A → B
1. Use dynamic programming to choose which rows to include in the target alignment
2. Do the same with A and B to choose columns
3. For each row or column not chosen for target alignment
4. For each aligned row or column
– If it has virtually no differences between A and B, then represent any remaining differences as cell-level edits
– Else, represent the entire row/column as a delete+insert
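
Steps 3 and 4 can be sketched as a pass over an alignment that has already been computed. This is a simplified illustration, not the authors' code: the function name `edits_from_alignment`, the tuple format of the edits, and the `max_diff_fraction` threshold (standing in for "virtually no differences", which the slide does not quantify) are all assumptions. For brevity, step 4 is applied to aligned rows only.

```python
def edits_from_alignment(A, B, row_pairs, col_pairs, max_diff_fraction=0.5):
    """Turn a row/column alignment into a list of edits from A to B.

    row_pairs / col_pairs: lists of (index_in_A, index_in_B) for aligned lines.
    max_diff_fraction: hypothetical cutoff for "virtually no differences".
    """
    edits = []
    n_cols_a = len(A[0]) if A else 0
    n_cols_b = len(B[0]) if B else 0

    # Step 3: lines outside the target alignment become deletes or inserts.
    rows_a = {i for i, _ in row_pairs}
    rows_b = {j for _, j in row_pairs}
    cols_a = {i for i, _ in col_pairs}
    cols_b = {j for _, j in col_pairs}
    edits += [("delete_row", i) for i in range(len(A)) if i not in rows_a]
    edits += [("insert_row", j) for j in range(len(B)) if j not in rows_b]
    edits += [("delete_col", i) for i in range(n_cols_a) if i not in cols_a]
    edits += [("insert_col", j) for j in range(n_cols_b) if j not in cols_b]

    # Step 4 (rows only in this sketch): compare aligned cells.
    for ra, rb in row_pairs:
        diffs = [(ca, cb) for ca, cb in col_pairs if A[ra][ca] != B[rb][cb]]
        if len(diffs) <= max_diff_fraction * len(col_pairs):
            # Few differences: report each one as a cell-level edit.
            for ca, cb in diffs:
                edits.append(("edit_cell", (rb, cb), A[ra][ca], B[rb][cb]))
        else:
            # Too different: treat the whole row as delete + insert.
            edits.append(("delete_row", ra))
            edits.append(("insert_row", rb))
    return edits


# Illustrative input: B has one inserted row and one changed cell.
A = ["abcd", "efgh", "ijkl"]
B = ["abcd", "xxxx", "efGh", "ijkl"]
rows = [(0, 0), (1, 2), (2, 3)]
cols = [(0, 0), (1, 1), (2, 2), (3, 3)]
for edit in edits_from_alignment(A, B, rows, cols):
    print(edit)
```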

18 Three investigations we conducted to evaluate RowColAlign
Tested on 10 manually-created spreadsheet pairs previously used to test an older algorithm (SheetDiff)
– Won’t discuss today (due to time) – see paper
– Bottom line: RowColAlign made no errors
Tested on >500 automatically-generated cases
– Discussed below
– Bottom line: RowColAlign made no errors
Formally analyzed expected behavior of RowColAlign
– Summarized below
– Bottom line: RowColAlign will rarely if ever make errors in practice; runtime is O((spreadsheet area)^2)

19 Evaluation based on planted model
Planted model = generative model
Automatically generates test cases
– For which we know the correct answer
Worth trying because this way of thinking about evaluation might also be useful for evaluating other algorithms that this community creates

20 Planted model / generating test cases
1. Create a blank spreadsheet O of size n x n
2. Randomly fill O with letters from alphabet of size s
3. Copy O twice to create A and B
4. For each row and each column in A and in B
– With probability p, delete that row or column
5. For each cell in B
– With probability q, replace with new random letter
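
The five steps above can be sketched as a small generator. This is an illustrative sketch rather than the authors' test harness: the function name `planted_pair`, the use of integers 0..s-1 as "letters", and the shape of the returned ground-truth dictionary are assumptions.

```python
import random


def planted_pair(n, s, p, q, rng=None):
    """Generate spreadsheets A and B from a common random original O.

    n: side length of O; s: alphabet size;
    p: per-row/column deletion probability; q: per-cell edit probability.
    Returns (A, B, truth), where truth records what was actually done.
    """
    if rng is None:
        rng = random.Random()

    # Steps 1-2: an n x n sheet filled with random letters (integers 0..s-1).
    original = [[rng.randrange(s) for _ in range(n)] for _ in range(n)]

    # Steps 3-4: copy the original, dropping each row/column with probability p.
    def copy_with_deletions(sheet):
        keep_rows = [i for i in range(n) if rng.random() >= p]
        keep_cols = [j for j in range(n) if rng.random() >= p]
        copy = [[sheet[i][j] for j in keep_cols] for i in keep_rows]
        return copy, keep_rows, keep_cols

    A, a_rows, a_cols = copy_with_deletions(original)
    B, b_rows, b_cols = copy_with_deletions(original)

    # Step 5: with probability q, overwrite a cell of B with a random letter.
    # The new letter may coincide with the old one; only real changes are logged.
    edited_cells = []
    for i, row in enumerate(B):
        for j, old in enumerate(row):
            if rng.random() < q:
                row[j] = rng.randrange(s)
                if row[j] != old:
                    edited_cells.append((i, j))

    truth = {
        "a_rows": a_rows, "a_cols": a_cols,   # lines of O kept in A
        "b_rows": b_rows, "b_cols": b_cols,   # lines of O kept in B
        "edited_cells": edited_cells,         # (row, col) positions in B
    }
    return A, B, truth


A, B, truth = planted_pair(n=10, s=50, p=0.1, q=0.05, rng=random.Random(0))
print(len(A), len(B), len(truth["edited_cells"]))
```

Because the kept rows and columns are recorded, the correct alignment is known for every generated case: a row deleted from A but kept in B should be reported as an insert when diffing A → B, and vice versa.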

21 Parameter values based on 8 real spreadsheet pairs from prior work
Parameter | Real range observed | Range tested
Spreadsheet area | 90 to 3212 cells (equiv. n= ) | n=10 to 50
Alphabet size (s) | 50 to | to 450
Row & col insertion rate (p) | to | to 0.41
Cell-level edit rate (q) | to | to
For each parameter setting, we generated 25 test cases.

22 Result: RowColAlign made no errors
Parameter | Real range observed | Range tested
Spreadsheet area | 90 to 3212 cells (equiv. n= ) | n=10 to 50
Alphabet size (s) | 50 to | to 450
Row & col insertion rate (p) | to | to 0.41
Cell-level edit rate (q) | to | to
For comparison: The existing SheetDiff algorithm made errors at a rate of up to 28% as p and q increased.

23 Pushing the algorithm further: Huge spreadsheets with many edits
Parameter | For comparison | Range tested
(Top quartile of all EUSES corpus spreadsheets)
Width and height (n) | 961 cells (n=31) | 10000 cells (n=100)
(8 pairs from prior work)
Alphabet size (s) | 50 to | to 1000
Row & col insertion rate (p) | to
Cell-level edit rate (q) | to

24 Results: Still no errors
Parameter | For comparison | Range tested
(Top quartile of all EUSES corpus spreadsheets)
Width and height (n) | 961 cells (n=31) | n=100
(8 pairs from prior work)
Alphabet size (s) | 50 to | to 1000
Row & col insertion rate (p) | to
Cell-level edit rate (q) | to

25 In brief: Why?
An incorrect alignment would arise when rows happen to be similar by chance, which is less and less likely when…
- The alphabet is large
  - Because the probability that two cells have the same value by chance is ~ 1/s
- The spreadsheet is large
  - Because the probability that n cells have matching values by chance is ~ (1/s)^n
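
To get a feel for how quickly ~ (1/s)^n shrinks, here is a small sketch that evaluates it for a few alphabet sizes and row lengths. It assumes cells are independent and uniformly distributed over the alphabet, as in the planted model; the function name `chance_match_probability` and the sample values of s and n are chosen for illustration only.

```python
def chance_match_probability(s, n):
    """Approximate probability that n cells all match by chance.

    Assumes each cell is drawn independently and uniformly from s letters,
    so a single pair of cells matches with probability 1/s.
    """
    return (1.0 / s) ** n


# Larger alphabets and longer rows both drive the probability toward zero.
for s in (2, 50, 450):
    for n in (1, 10, 50):
        print(f"s={s:4d}  n={n:3d}  p ~ {chance_match_probability(s, n):.3e}")
```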

26 Conclusions
The subsequence of rows and columns that two spreadsheets have in common can be computed using a dynamic programming algorithm
The error rate of such an algorithm can be evaluated using a planted model
Our specific dynamic programming algorithm
– Is unlikely to make errors when recovering edits
  Except on spreadsheets that are small or have small alphabets

27 Future research opportunities
Develop tools based on this algorithm
– To help people understand and manage versions
– To choose among multiple versions
Develop enhanced algorithms
– For simultaneous diff of more than 2 spreadsheets
– For clustering collections of spreadsheets based on similarity

28 Thank you
For this opportunity to present
For funding from Google and NSF
For your questions and ideas