Data quality Stefano Grazioli.

Critical Thinking
Last SQL homework due
Fixed demo issues. Added a note to the homework text
Easy Meter

What is Data Quality? The degree to which data is suitable for a business purpose Accuracy, precision

The quality of the data stored in organizational databases is often poor
10-25% of the records have inaccuracies or missing elements
Data frequently misinterpreted
Known data loss and theft
Most databases implement inconsistent definitions
50% of the stored data is never used
10x duplication of data
Source: T. Redman, Data Driven, 2008

Why is Data Bad?
No one gets up in the morning and says “I’m going to make lots of errors today” - Cathy Bessant
Source: T. Redman, Data Driven, 2008

Find the Data Quality issues
Columns: Cust ID, Name, Addr1, Addr2, City, State, Zip, Phone (only the non-empty values of each row are listed)
0345 | Daniel Steeper | 765 Spider Cove | New York | NY | 10012 | 875-3253
0346 | Mr. Bigg | Mr. Bigg’s Wigs, Inc. | Cville | Virginia | 22901 | 434-567-3455
0467 | MJ Watson | 753 45th St | Apt 45 | 10024 | 999-9
0488 | Carl Zeithaml | 34 Sprigg Lane | Charlottesville | VA | 22904 | (434)-453-3556
0499 | Danny Steeper | #875-3253
0722 | Ben Grimm | Broad and Main | Staunton | 24403
null | Sue Storm | 8564 Carver Dr. | NYC | 212-450-3556
0853 | 2345 Benson Rd | Los Angeles | CA | 90210
State table (StateID, State):
VA | Virginia
NY | New York
WY | null

Approaches to Data Quality
Find and Fix
Prevent at the source
Do nothing
3m case

Homework
Business Scenario: Google’s Daily CAGR

You are a financial analyst at a fintech firm
"Many of our customers invest for short amounts of time on Google. They sell their shares within a few weeks... I wonder: do they make any money out of it?"
"I am on it..."
"While you are at it... clean the data first."
"Consider it done."

Daily CAGR for Google
You get a file with ~1000 customers who recently bought and sold GOOG.
Three steps (and two homeworks):
1. Clean data: phones, dates
2. Compute $\text{Daily CAGR} = \left(\frac{\text{final price}}{\text{initial price}}\right)^{1/\text{days}} - 1$
3. Report the average Daily CAGR across all customers.
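
The compute and report steps can be sketched as follows. This is a minimal illustration, not the homework solution: the function name `DailyCagr`, the module name, and the two sample trades are assumptions made for the example and are not real GOOG data.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module DailyCagrSketch

    ' Daily CAGR = (final price / initial price)^(1 / days) - 1
    Function DailyCagr(initialPrice As Double, finalPrice As Double, days As Integer) As Double
        If initialPrice <= 0 OrElse days <= 0 Then
            Throw New ArgumentException("initialPrice and days must be positive")
        End If
        Return Math.Pow(finalPrice / initialPrice, 1.0 / days) - 1.0
    End Function

    Sub Main()
        ' Hypothetical trades: (buy price, sell price, days held).
        Dim trades = New List(Of Tuple(Of Double, Double, Integer)) From {
            Tuple.Create(100.0, 121.0, 2),
            Tuple.Create(100.0, 100.0, 10)
        }

        ' Average the per-customer daily CAGR across all trades.
        Dim average As Double = trades.Average(Function(t) DailyCagr(t.Item1, t.Item2, t.Item3))
        Console.WriteLine("Average daily CAGR: " & average.ToString("P2"))
    End Sub

End Module
```

A price that goes from 100 to 121 in two days grows by about 10% per day, because 1.1 x 1.1 = 1.21. The guard clause rejects non-positive prices and holding periods, which would otherwise produce meaningless results.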

Cleaning Phone Numbers From: #2345348565 To: (234)-534-8565

UML Activity Diagram - Daily Compound Average Growth of a Security (part I)
When the user presses a button, a file selection window pops up. The user selects a file. The file is shown starting at cell “A1”. The start button becomes invisible. Three more buttons appear: “Clean phone numbers”, “Format Dates”, and “Compute Daily CAGR”.
Flow from node A:
[Compute]: next homework
[Format Dates]: next homework
[Clean ph.no]: select the next phone no. and count its digits
    [Exactly 10 digits]: format as (xxx)-xxx-xxxx & clear highlight if any
    otherwise: highlight the cell in red
    repeat until [No More Ph.No], then return to A
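
The phone-cleaning branch of the diagram can be sketched as a plain function, leaving the Excel cell handling aside. The name `CleanPhone` and the convention of returning `Nothing` for a number that should be highlighted are assumptions made for this illustration.

```vbnet
Imports System
Imports System.Text

Module PhoneCleaningSketch

    ' Returns the number formatted as (xxx)-xxx-xxxx, or Nothing when the
    ' input does not contain exactly 10 digits (the "highlight in red" case).
    Function CleanPhone(raw As String) As String
        If raw Is Nothing Then Return Nothing

        ' Keep only the characters 0-9.
        Dim digits As New StringBuilder()
        For Each c As Char In raw
            If c >= "0"c AndAlso c <= "9"c Then digits.Append(c)
        Next

        If digits.Length <> 10 Then Return Nothing

        Dim d As String = digits.ToString()
        Return String.Format("({0})-{1}-{2}", d.Substring(0, 3), d.Substring(3, 3), d.Substring(6, 4))
    End Function

    Sub Main()
        For Each raw As String In {"#2345348565", "999-9"}
            Dim cleaned As String = CleanPhone(raw)
            If cleaned Is Nothing Then
                Console.WriteLine(raw & " -> needs review")
            Else
                Console.WriteLine(raw & " -> " & cleaned)
            End If
        Next
    End Sub

End Module
```

In the Excel version, the caller would loop over the phone column, write the formatted value back to the cell when one is returned, and color the cell red otherwise.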

WINIT
What Is New In Technology?

Text manipulation
used in data quality and beyond

Strings and Characters
    Dim myString As String = "This is a sample string"
    Dim myString2 As String = "s"
    Dim myChar As Char = "s"c

Testing Numbers
    Dim myString As String = "#2344-234-33-3"
    Dim temp As String = ""
    ' keep only the characters that are digits
    For Each x As Char In myString
        If IsNumeric(x) Then
            temp = temp + x
        End If
    Next
    ' temp now holds "2344234333"

Inserting and Removing
    Dim myStr As String = "This is a sample string"
    myStr = myStr.Insert(4, "xyz")         ' "Thisxyz is a sample string"
    myStr = myStr.Remove(4, 3)             ' starting where, how many: "This is a sample string"
    myStr = myStr.Replace(" is", " was")   ' "This was a sample string"

Composing text
    Dim s As String = "4344562456"
    Dim temp As String
    temp = "Ph. " + s.Substring(0, 3) + " / " + s.Substring(3, 3) + " " + s.Substring(6, 4)
    ' The above is an example that produces this:
    ' Ph. 434 / 456 2456
    ' or - same thing -
    temp = String.Format("Ph. {0} / {1} {2}", s.Substring(0, 3), s.Substring(3, 3), s.Substring(6, 4))

Trimming and Padding
    myLength = myString.Length
    myNewString = myString.Trim()
    myNewString = myString.TrimEnd()
    myNewString = myString.TrimStart()
    myNewString = myString.PadLeft(50)    ' 50 = total length of the result
    myNewString = myString.PadRight(20)   ' 20 = total length of the result
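
A short sketch of how these calls behave on a concrete value may help. The sample string and the bracket markers are assumptions chosen only to make the whitespace visible.

```vbnet
Imports System

Module TrimPadSketch

    Sub Main()
        Dim s As String = "  GOOG  "   ' two spaces on each side

        Console.WriteLine("[" & s.Trim() & "]")        ' [GOOG]
        Console.WriteLine("[" & s.TrimStart() & "]")   ' [GOOG  ]
        Console.WriteLine("[" & s.TrimEnd() & "]")     ' [  GOOG]

        ' The argument is the total width of the result, not the number of spaces added.
        Console.WriteLine("[" & "GOOG".PadLeft(8) & "]")    ' [    GOOG]
        Console.WriteLine("[" & "GOOG".PadRight(8) & "]")   ' [GOOG    ]

        ' Padding never truncates: a width smaller than the string leaves it unchanged.
        Console.WriteLine("GOOG".PadLeft(2).Length)         ' 4
    End Sub

End Module
```

Trimming before counting digits or comparing values avoids treating "GOOG" and " GOOG " as different entries, a common source of duplicate records.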

Reading a File into Excel
    ' store the address of the current active sheet (the 'target')
    Dim myActiveS As Excel.Worksheet = Application.ActiveSheet
    ' select a file
    Dim myFile As String = Application.GetOpenFilename()
    ' get the data in a new temporary workbook
    Application.Workbooks.OpenText(myFile, , , Excel.XlTextParsingType.xlDelimited, , , , , True)
    ' store the address of the temporary workbook
    Dim myActiveWB As Excel.Workbook = Application.ActiveWorkbook
    ' copy the content from the temporary to the 'target' sheet
    myActiveS.Range("A1:J1000").Value = Application.ActiveSheet.Range("A1:J1000").Value
    ' close the temp workbook
    myActiveWB.Close()

Finding the last non-empty row
    Dim lastRow As Integer
    lastRow = Cells(Rows.Count, 1).End(Excel.XlDirection.xlUp).Row

Suggestions Give yourself plenty of time