Lecture 12: Data Wrangling

Announcements
- Discussion leaders
- Project meetings
- Meeting tomorrow: https://doodle.com/poll/c2p4vha97tqsi8e6
- If tomorrow does not work, email me with your availability.

Today's Agenda
1. Interactive Data Cleaning
2. Data Transformations
3. Suggesting Transformations

1. Interactive Data Cleaning

Section 1: Wrangling with Data
Before analysis, data must be brought into a usable form.
Data wrangling: restructuring data, identifying and correcting erroneous or missing values, and finding outliers.

Section 1: Example Scenario

Section 1: Challenges
- Transforms for restructuring and validating data can be complicated to specify in scripts, e.g. regular expressions to split strings or validate a data format.
- Reuse and revision of transforms for data updates and changing schemas is not possible with scripts or manual editing.
- Data wrangling should also output a record of the transforms used.
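To make the first challenge concrete, here is a minimal Python sketch of the kind of script-level transform the slide alludes to: a string split followed by regular-expression extraction and validation. The sample value, the fixed prefix, and the helper `is_valid_year` are hypothetical and are not taken from the lecture.

```python
import re

# Hypothetical raw cell: a label and a year packed into one string.
raw = "Reported crime in Alabama, 2004"

# Split on the first comma; this hard-codes an assumption about the layout.
label, year = [part.strip() for part in raw.split(",", 1)]

# Extract the state name that follows the fixed prefix.
match = re.fullmatch(r"Reported crime in (?P<state>[A-Z][A-Za-z ]+)", label)
state = match.group("state") if match else None


def is_valid_year(text: str) -> bool:
    """Validate the data format: exactly four digits."""
    return re.fullmatch(r"\d{4}", text) is not None


print(state, year, is_valid_year(year))
```

Even this small script bakes in a prefix, a delimiter, and a field format, which is why revising it after a schema change tends to be error-prone.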

Section 1: Proposed Solution
- Provide a transformation language with a handful of operators.
- A mixed-initiative method: instead of mapping an interaction to a single transform, surface likely transforms as an ordered list of suggestions.
- Then use visual interfaces that let users navigate (prune, refine, and evaluate) these suggestions to find the desired transform.

2. Data Transformations

Section 2: Transformations over relational data

Section 2: Additional transformations
- Map: transforms that map one input data row to zero, one, or multiple output rows.
  - One-to-zero transform: delete.
  - One-to-one transforms: extracting, cutting, splitting.
  - One-to-many transforms: splitting data into multiple rows.
- Lookups and joins: incorporate data from external tables. Two types of join: equi-joins and approximate joins using string edit distance.
- Reshape: transforms that manipulate table structure and schema. Two operators: fold and unfold.
- Positional transforms: functions for fill and lag operations. Fill operations generate values based on neighboring values in a row or column, and so depend on the sort order of the table. The lag operator shifts the values of a column up or down by a specified number of rows.
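The reshape and positional operators above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the lecture's implementation: tables are lists of dicts, and the function names `fold`, `unfold`, `fill_down`, and `lag`, along with the `name`/`value` output columns, are chosen here for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def fold(rows: List[Row], key: str, columns: List[str]) -> List[Row]:
    """Fold: collapse several columns into (name, value) pairs, one row each."""
    return [
        {key: row[key], "name": col, "value": row[col]}
        for row in rows
        for col in columns
    ]


def unfold(rows: List[Row], key: str) -> List[Row]:
    """Unfold: inverse of fold; spread 'name' values back out into columns."""
    wide: Dict[Any, Row] = {}
    for row in rows:
        wide.setdefault(row[key], {key: row[key]})[row["name"]] = row["value"]
    return list(wide.values())


def fill_down(values: List[Optional[Any]]) -> List[Optional[Any]]:
    """Fill: replace each missing value with the nearest value above it."""
    filled, last = [], None
    for value in values:
        last = value if value is not None else last
        filled.append(last)
    return filled


def lag(values: List[Any], n: int) -> List[Optional[Any]]:
    """Lag: shift values down by n rows (up if n is negative), padding with None."""
    size = len(values)
    n = max(-size, min(n, size))  # clamp so the output keeps the input length
    if n >= 0:
        return [None] * n + values[: size - n]
    return values[-n:] + [None] * (-n)


wide_table = [
    {"state": "AL", "2004": 10, "2005": 12},
    {"state": "AK", "2004": 7, "2005": 8},
]
long_table = fold(wide_table, "state", ["2004", "2005"])
print(long_table)
print(unfold(long_table, "state"))
print(fill_down(["AL", None, None, "AK", None]))
print(lag([1, 2, 3], 1), lag([1, 2, 3], -1))
```

Note that `fill_down` and `lag` depend on row order, which mirrors the point that positional transforms depend on the sort order of the table.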

Section 2: User-friendly transformations
The goal is to enable analysts to author expressive transformations with minimal difficulty.
Approach:
- Provide more than a spreadsheet interface.
- Suggest data transforms, using natural language descriptions and visual transform previews.
- Verify on a sample to help users discover data quality issues.

Section 2: User-friendly transformations
- Direct manipulation of and interaction with the data.
- Menu-based transform selection.
- Manual editing of transform parameters.
Six basic interactions between the user and the data table:
1. Select a row.
2. Select a column.
3. Click bars in the data quality meter.
4. Select text in a cell.
5. Edit a value in the table.
6. Assign a column name, data type, or semantic role.

3. Suggesting Transformations

Section 3: An inference engine
Inputs to the engine consist of user interactions; the current working transform; data descriptions such as column data types, semantic roles, and summary statistics; and a corpus of historical usage statistics.
Transform suggestion proceeds in three phases:
1. Inferring transform parameters from user interaction
2. Generating candidate transforms from the inferred parameters
3. Ranking the results
To generate transformations, the engine relies on a corpus of usage statistics.

Section 3: Usage corpus and transform equivalence
The corpus consists of frequency counts of transform descriptors and initiating interactions.
Two transforms are considered equivalent if:
- they have an identical transform type, and
- they have equivalent parameters, as defined below.
Four types of parameters: row, column, text selection, and enumerable.
- Row selections are equivalent if both contain filtering conditions or both match all rows in the table.
- Column selections are equivalent if they refer to columns with the same data type or semantic role.
- Text selections are equivalent if both are index-based selections or both contain regular expressions.
- Enumerable parameters are equivalent if they match exactly.
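The equivalence rules can be written down as a small predicate. The sketch below is one possible encoding, assuming hypothetical classes `RowSelection`, `ColumnSelection`, `TextSelection`, and `Transform`; none of these names come from the lecture, and enumerable parameters are represented simply as plain values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RowSelection:
    has_predicate: bool  # True: filters rows; False: matches all rows


@dataclass(frozen=True)
class ColumnSelection:
    data_type: str
    semantic_role: Optional[str] = None


@dataclass(frozen=True)
class TextSelection:
    kind: str  # "index" or "regex"


@dataclass(frozen=True)
class Transform:
    ttype: str
    params: Tuple[object, ...]


def params_equivalent(a: object, b: object) -> bool:
    """Apply the per-parameter equivalence rules listed on the slide."""
    if type(a) is not type(b):
        return False
    if isinstance(a, RowSelection):
        return a.has_predicate == b.has_predicate
    if isinstance(a, ColumnSelection):
        same_role = a.semantic_role is not None and a.semantic_role == b.semantic_role
        return a.data_type == b.data_type or same_role
    if isinstance(a, TextSelection):
        return a.kind == b.kind
    return a == b  # enumerable parameters must match exactly


def transforms_equivalent(x: Transform, y: Transform) -> bool:
    """Same transform type and pairwise-equivalent parameters."""
    return (
        x.ttype == y.ttype
        and len(x.params) == len(y.params)
        and all(params_equivalent(a, b) for a, b in zip(x.params, y.params))
    )


split_a = Transform("split", (ColumnSelection("string"), TextSelection("regex")))
split_b = Transform("split", (ColumnSelection("string", "name"), TextSelection("regex")))
split_c = Transform("split", (ColumnSelection("string"), TextSelection("index")))
print(transforms_equivalent(split_a, split_b), transforms_equivalent(split_a, split_c))
```

Because equivalence ignores concrete values such as the exact regular expression, many distinct transforms fall into one class, which presumably lets a modest corpus still yield useful frequency counts.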

Section 3: Inferring transform parameters from user interaction
- Infer three types of transform parameter: row, column, or text selection.
- For each type, enumerate possible parameter values, resulting in a collection of inferred parameter sets.
- Parameter values are inferred independently of each other.
- Row selection is based on row indices and predicate matching.
- Column selection returns the columns users have interacted with.
- Text selection is either a simple index range or an inferred regular expression.

Section 3: Regular expression inference
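As a rough illustration of how a regular expression might be inferred from a text selection, the sketch below generalises the selected characters into character-class tokens and also keeps the plain index range as a second candidate. This is a deliberately simplified stand-in, not the inference procedure from the lecture; the token classes and the function names `infer_regex` and `text_selection_candidates` are assumptions.

```python
import re
from typing import List, Tuple

# Assumed token classes; runs of the same class collapse into one token.
CLASS_TOKENS = {r"\d+", "[A-Za-z]+", r"\s+"}


def char_token(ch: str) -> str:
    """Map one character to a class token, or to an escaped literal."""
    if ch.isascii() and ch.isdigit():
        return r"\d+"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]+"
    if ch.isspace():
        return r"\s+"
    return re.escape(ch)


def infer_regex(selected: str) -> str:
    """Generalise the selected text into a sequence of tokens."""
    tokens: List[str] = []
    for ch in selected:
        token = char_token(ch)
        # Only class tokens absorb repeats; literals are kept one by one.
        if token in CLASS_TOKENS and tokens and tokens[-1] == token:
            continue
        tokens.append(token)
    return "".join(tokens)


def text_selection_candidates(cell: str, start: int, end: int) -> List[Tuple[str, object]]:
    """Enumerate both kinds of text selection: index range and inferred regex."""
    return [("index", (start, end)), ("regex", infer_regex(cell[start:end]))]


pattern = infer_regex("Alabama, 2004")
print(pattern)
print(text_selection_candidates("Alabama, 2004", 9, 13))
```

A pattern inferred from one cell can then be tested against other cells, for example `re.fullmatch(pattern, "Alaska, 2011")`, to see whether it generalises beyond the example the user selected.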

Section 3: Generating suggested transforms
- For each parameter set, loop over each transform type in the language, emitting the types that can accept all parameters in the set.
- To determine values for missing parameters, query the corpus for the top-k parameterisations that co-occur most frequently with the provided parameter set.
- Filter the suggestion set to remove degenerate transforms that would have no effect on the data.
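The three generation steps can be sketched as follows. This is a simplified illustration under several assumptions: the `ACCEPTS` table is a made-up miniature transform language, the corpus lookup conditions only on the transform type rather than on the full provided parameter set, and `toy_apply` is a hypothetical executor that implements only `drop`.

```python
from collections import Counter
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Table = List[Dict[str, str]]
Params = Dict[str, object]             # parameter kind -> inferred value
Candidate = Tuple[str, Optional[str]]  # (transform type, enumerable option)

# Hypothetical language: each transform type lists the parameter kinds it accepts.
ACCEPTS: Dict[str, FrozenSet[str]] = {
    "delete": frozenset({"row"}),
    "drop": frozenset({"column"}),
    "fill": frozenset({"column", "row"}),
    "split": frozenset({"column", "text"}),
}


def candidate_types(params: Params) -> List[str]:
    """Emit the transform types that can accept every parameter in the set."""
    return [t for t, kinds in ACCEPTS.items() if set(params) <= kinds]


def top_k_options(corpus: Counter, ttype: str, k: int) -> List[Optional[str]]:
    """Most frequent options recorded for this type; [None] if none were seen."""
    seen = Counter({opt: n for (t, opt), n in corpus.items() if t == ttype})
    return [opt for opt, _ in seen.most_common(k)] or [None]


def toy_apply(ttype: str, params: Params, option: Optional[str], table: Table) -> Table:
    """Hypothetical executor: only 'drop' changes the table, the rest are no-ops."""
    if ttype == "drop":
        return [{c: v for c, v in row.items() if c != params["column"]} for row in table]
    return table


def suggest(params: Params, corpus: Counter, table: Table,
            apply: Callable[..., Table] = toy_apply, k: int = 2) -> List[Candidate]:
    """Generate candidates and filter out degenerate (no-effect) transforms."""
    suggestions: List[Candidate] = []
    for ttype in candidate_types(params):
        for option in top_k_options(corpus, ttype, k):
            if apply(ttype, params, option, table) != table:
                suggestions.append((ttype, option))
    return suggestions


corpus = Counter({("fill", "down"): 5, ("fill", "up"): 2})
table = [{"state": "AL", "year": "2004"}]
print(candidate_types({"column": "year"}))
print(suggest({"column": "year"}, corpus, table))
```

Testing for degeneracy by running each candidate against the data is the simplest option; a real system might instead check it on a sample to keep suggestion latency low.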

Section 3: Ranking suggestions
Suggestions are ranked according to five criteria: the first three rank transforms by their type, and the remaining two rank transforms within a type.
Criteria by transform type:
1. Explicit interactions: if a user chooses a transform from the menu, assign a higher rank to that transform.
2. Specification difficulty: label row and text selections as hard and all other parameters as easy, then sort by the count of hard parameters.
3. Corpus frequency: rank transform types by their corpus frequency, conditioned on the initiating user interaction.
Criteria within a transform type:
4. Sort by the frequency of equivalent transforms in the corpus.
5. Sort transforms in ascending order of a simple complexity measure. Transform complexity is the sum of the complexity scores of its parameters: the complexity of a row selection predicate is the number of clauses it contains, and the complexity of a regular expression is its number of tokens.
Finally, surface diverse transform types in the final suggestion list: no type accounts for more than 1/3 of the suggestions.
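One way to read the five criteria is as a single lexicographic sort key followed by a diversity pass. The sketch below makes that reading explicit; the `Suggestion` fields, the choice to rank types with more hard parameters first, and the interpretation of the 1/3 cap relative to a fixed list size `limit` are all assumptions rather than details confirmed by the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Suggestion:
    ttype: str
    hard_params: int  # count of row/text selection parameters (per type)
    type_freq: int    # corpus frequency of the type, given the interaction
    equiv_freq: int   # corpus frequency of equivalent transforms
    complexity: int   # predicate clauses plus regex tokens


def rank(suggestions: List[Suggestion], menu_choice: Optional[str] = None,
         limit: int = 6) -> List[Suggestion]:
    def key(s: Suggestion):
        return (
            s.ttype != menu_choice,  # 1. menu-selected type first
            -s.hard_params,          # 2. assumed: more hard parameters first
            -s.type_freq,            # 3. more frequent types first
            s.ttype,                 # keep each type's suggestions together
            -s.equiv_freq,           # 4. frequent equivalent transforms first
            s.complexity,            # 5. simpler transforms first
        )

    cap = max(1, limit // 3)  # no type may fill more than a third of the list
    counts: Dict[str, int] = {}
    final: List[Suggestion] = []
    for s in sorted(suggestions, key=key):
        if counts.get(s.ttype, 0) < cap:
            counts[s.ttype] = counts.get(s.ttype, 0) + 1
            final.append(s)
        if len(final) == limit:
            break
    return final


candidates = [
    Suggestion("split", 1, 10, 5, 3),
    Suggestion("split", 1, 10, 9, 4),
    Suggestion("split", 1, 10, 9, 2),
    Suggestion("extract", 1, 20, 1, 1),
    Suggestion("cut", 1, 5, 1, 1),
]
top = rank(candidates, menu_choice="cut", limit=3)
print([(s.ttype, s.complexity) for s in top])
```

With `limit=3` the cap is one suggestion per type, so only the best-ranked `split` survives even though three were generated.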

Section 3: Summary
Data wrangler: a mixed-initiative interface that maps user interactions to suggested data transforms and presents natural language descriptions and visual transform previews to help assess each suggestion.
Take-aways:
- Simplicity is power.
- Visual interfaces are crucial for data cleaning.
Limitation: the approach is limited to formatting and alignment errors, whereas real systems contain many different types of errors.