Expandable Group Identification in Spreadsheets

Slides:



Advertisements
Similar presentations
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma.
Advertisements

Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute INTRODUCTION TO KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING.
Commission on Information and Communications Technology Electronic Spreadsheet (Hands-on Module) Learning Technologies Division HUMAN CAPITAL DEVELOPMENT.
Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.
Microsoft Excel Computers Week 4.
ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.
John Porter Why this presentation? The forms data take for analysis are often different than the forms data take for archival storage Spreadsheets are.
Funding Networks Abdullah Sevincer University of Nevada, Reno Department of Computer Science & Engineering.
Calculations, Visualization, and Simulation 6.  2001 Prentice Hall6.2 Chapter Outline The Spreadsheet: Software for Simulation and Speculation Statistical.
Sensemaking and Ground Truth Ontology Development Chinua Umoja William M. Pottenger Jason Perry Christopher Janneck.
ICT Homework Zak Barwell. Spreadsheets A computer program used chiefly for accounting, in which figures are arranged in the rows and columns of a grid.
SIEVE—Search Images Effectively through Visual Elimination Ying Liu, Dengsheng Zhang and Guojun Lu Gippsland School of Info Tech,
XP Practical PC, 3e Chapter 11 1 Making Spreadsheets and Presentations.
Unit 5 Spreadsheets 5.10 Formatting & Printing. Introduction Now that you have completed the tasks associated with creating spreadsheets, formulas, functions,
Alert Correlation for Extracting Attack Strategies Authors: B. Zhu and A. A. Ghorbani Source: IJNS review paper Reporter: Chun-Ta Li ( 李俊達 )
Excel / 1 Electronic Spreadsheets What is an electronic spreadsheet or worksheet ? It is a Computer Software package which allows a user to Manipulate.
Spreadsheets Chapter 7. Objectives Define and describe spreadsheets, and their features and functions Describe how a spreadsheet might be integrated into.
Excel. BCSII-5: The student will utilize spreadsheet software a) Identify uses of spreadsheet software and careers related to spreadsheet. b) Identify.
An Automated Approach to Predict Effectiveness of Fault Localization Tools Tien-Duy B. Le, and David Lo School of Information Systems Singapore Management.
CSCI 1101 Intro to Computers 3. Common Productivity Software.
1 Document Production Software Assists you with composing, editing, designing, printing, and electronically publishing documents –Word processing –Desktop.
BeginningQuiz 1. Excel is: A. Part of Microsoft Office B. Application Software C. Spreadsheet Software D. None of the above E. A, B, and C.
1 Performing Spreadsheet What-If Analysis Applications of Spreadsheets.
Definitions. Cell: Cell: Space in the intersection of a column (vertical division) and a row (horizontal division). Row: Row: A row runs horizontally.
Excel. MSBCS-BCSI-9: Students will develop and apply basic spreadsheet skills. a) Identify and explain basic spreadsheet terminology (cell, column, row,
UNIT 7: Using Excel in the Law Office. This Week’s Assignment You should be working on your three-part assignment Part 1 deals with the things you learned.
Spreadsheet Applications Calculations, Visualization, and Simulation.
The introduction of Microsoft Excel. Spreadsheet Basic.
A Spreadsheet for Balance Sheets. Balance Sheets What are balance sheets for??
Microsoft Excel Study Guide Test 6 & 7 7 th grade MSBCS-BCSI-9 Students will develop and apply basic spreadsheet skills MSBCS-BCSII-5 The student will.
Presented by: Ashgan Fararooy Referenced Papers and Related Work on:
Introduction to Excel The Basics of Microsoft Word 2007 Excel.
Introduction to Excel The Basics of Microsoft Excel 2010.
Alattin: Mining Alternative Patterns for Detecting Neglected Conditions Suresh Thummalapenta and Tao Xie Department of Computer Science North Carolina.
Chapter Six Calculation, Visualization, and Simulation.
Chapter Six Calculation, Visualization, and Simulation.
Using linked data to interpret tables Varish Mulwad September 14,
Semi-Automatic Image Annotation Liu Wenyin, Susan Dumais, Yanfeng Sun, HongJiang Zhang, Mary Czerwinski and Brent Field Microsoft Research.
Excel 2007 Part (3) Dr. Susan Al Naqshbandi
Is Spreadsheet Ambiguity Harmful? Detecting and Repairing Spreadsheet Smells due to Ambiguous Computation Wensheng Dou 1, Shing-Chi Cheung 2, Jun Wei 1.
Facial Smile Detection Based on Deep Learning Features Authors: Kaihao Zhang, Yongzhen Huang, Hong Wu and Liang Wang Center for Research on Intelligent.
CACheck: Detecting and Repairing Cell Arrays in Spreadsheets
Migrating CSS to Preprocessors by Introducing Mixins
Automatically Labeled Data Generation for Large Scale Event Extraction
INF1060: SPREADSHEET1.
Superintendence Day April 18,2016
VEnron A Versioned Spreadsheet Corpus and Related Evolution Analysis
CS4311 Spring 2011 Process Improvement Dr
Detecting Table Clones and Smells in Spreadsheets
INTRODUCTION TO SPREADSHEET APPLICATIONS
MS-EXCEL SUMMARY.
Teacher Resource Idea - Paul
Microsoft Access 2003 Illustrated Complete
Machine Learning Ali Ghodsi Department of Statistics
Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning Shizhu He, Cao liu, Kang Liu and Jun Zhao.
Project Implementation for ITCS4122
Liang Zheng and Yuzhong Qu
Report on Data Cleaning Framework
Detecting Faulty Empty Cells in Spreadsheets
How Are Spreadsheet Templates Used in Practice: A Case Study on Enron
Do you know what these buttons do?
Ying Dai Faculty of software and information science,
Lesson 1 - Visualizing Dependencies
Do you know what these buttons do?
Computer Science 10 & ICT 9 EXCEL
Spreadsheet Tools Concept Lesson.
Lab 08 Introduction to Spreadsheets MS Excel
Calculations, Visualization, and Simulation
Introduction to Spreadsheets
Presentation transcript:

Expandable Group Identification in Spreadsheets International Conference on Automated Software Engineering (ASE 2018) Expandable Group Identification in Spreadsheets Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, Jun Wei University of Chinese Academy of Sciences Institute of Software Chinese Academy of Sciences Microsoft Research Asia

Spreadsheet example Header: Describe other data Data for 24 hours Statistics … real example extracted from Enron spreadsheet corpus

Expandable group Expandable group: A basic unit and all its repeating units form an expandable group All units have similar formats and semantics Different semantics

Expandable group – Vex group Expandable group: A basic unit and all its repeating units form an expandable group Vex group: The basic unit is vertically expanded

Expandable group – Hex group Hex group: The basic unit is horizontally expanded

Structure information is missing Structure information is not stored with spreadsheets

Structure information is missing Spreadsheet data cannot be consumed by analysis tools What data in it? Data integration Insights in Excel Data mining

How can we use expandable groups? Non-Relational Data Relational Data Analysis-friendly

Automatically identify expandable groups in spreadsheets Our goal ExpCheck Automatically identify expandable groups in spreadsheets

Existing approach Identify expandable groups based on formula patterns[1] [1] R. Abraham, et, al., “Inferring Templates from Spreadsheets”, ICSE 2006.

Existing approach - Issues (1) Formulas may be missing

Existing approach - Issues (2) Same formula patterns don’t mean they are expandable Not expandable

Key observation (1) Corresponding cells have the same format

Key observation (2) Corresponding headers have the similar semantics

How to compare two cells in format? Compare two cells with their type features Cell type Label Data

How to compare two cells in format? Compare two cells with their format features Merge: Whether a cell is merged? Merged Not merged

How to compare two cells in format? Compare two cells with their format features Merge: Whether a cell is merged? Font: Font name, size, color Different color

How to compare two cells in format? Compare two cells with their format features Merge: Whether a cell is merged? Font: Font name, size, color Data format: #.## vs. $ #.## 2.00 $ 1.00

Compare two columns in format Can we require that all corresponding cells use the same format? We require that a majority (60%) of corresponding cells use the same format

Header semantics Headers in spreadsheets represent cells’ semantics Different semantics

How to compare headers? In Word2Vec, two words with similar semantics are usually located closely in the vector space New York – Chicago: similar New York – Fruit: dissimilar We utilize Word2Vec to compute the semantic similarity among two headers Use pre-trained Word2Vec model [1] on Google News dataset [1] GoogleNews-vectors-negative300.bin.gz. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM.

How to detect expandable groups? Basic unit with 1 column Basic unit

How to find expandable groups? Basic unit with 1 column Basic unit

How to find expandable groups? Basic unit with 2 columns Basic unit

How to find expandable groups? Basic unit with 3 columns Basic unit

How to find expandable groups? Basic unit with 4 columns Basic unit

Evaluation RQ1: Can ExpCheck detect expandable groups effectively? RQ2: How is ExpCheck compared with existing techniques?

Build ground truth Experimental subject EUSES [1] VEnron [2] Sample spreadsheets and manually extract all expandable groups Keep one for similar spreadsheets Keep one for similar worksheets Diversity [1] M. Fisher et al., “The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms,” SIGSOFT Softw Eng Notes, 2005. [2] W. Dou et al., “VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis,” ICSE SEIP, 2016

Ground truth Corpus Category Spreadsheet Worksheet Expandable Group EUSES cs101 1 3 database 11 12 27 financial 10 17 forms3 grades 5 homework 13 15 36 inventory 41 modeling VEnron 50 66 165 Total 120 139 313

Is ExpCheck effective? (RQ1) Precision: 69.3% Recall: 77.3% Corpus Category Ground Truth Detected Confirmed EUSES cs101 3 4 database 27 29 23 financial 17 18 6 forms3 1 grades 10 9 homework 36 38 inventory 41 44 35 modeling 13 VEnron 165 187 130 Total 313 349 242

Compare with others (RQ2) Formula pattern based approach [1] [1] R. Abraham et al., “Inferring Templates from Spreadsheets,” ICSE 2006.

Summary http://www.tcse.cn/~wsdou/project/ExpCheck/ Expandable groups are the fundamental structure in spreadsheets. The corresponding cells in an expandable group have the similar format and semantics ExpCheck: automatically identify expandable groups in spreadsheets Result ExpCheck performs much better than existing techniques http://www.tcse.cn/~wsdou/project/ExpCheck/

Thank you!