Data quality Stefano Grazioli.

1 Data quality Stefano Grazioli

2 Critical Thinking
Last SQL homework due. Fixed demo issues. Added a note to the homework text. Easy Meter.

3 What is Data Quality?
The degree to which data is suitable for a business purpose. Accuracy, precision.

4 The quality of the data stored in organizational databases is often poor
10-25% of the records have inaccuracies or missing elements
Data frequently misinterpreted
Known data loss and theft
Most databases implement inconsistent definitions
50% of the stored data is never used
10x duplication of data
Source: T. Redman, Data Driven, 2008

5 Why is Data Bad?
“No one gets up in the morning and says ‘I’m going to make lots of errors today’” - Cathy Bessant
Source: T. Redman, Data Driven, 2008

6 Find the Data Quality issues
Cust ID Name Addr1 Addr2 City State Zip Phone
0345 Daniel Steeper 765 Spider Cove New York NY 10012
0346 Mr. Bigg Mr. Bigg’s Wigs, Inc. Cville Virginia 22901
0467 MJ Watson 753 45th St Apt 45 10024 999-9
0488 Carl Zeithaml 34 Sprigg Lane Charlottesville VA 22904 (434)
0499 Danny Steeper #
0722 Ben Grimm Broad and Main Staunton 24403
null Sue Storm 8564 Carver Dr. NYC
0853 2345 Benson Rd Los Angeles CA 90210
StateID State
VA Virginia
NY New York
WY null
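One way to approach the exercise programmatically is to write a few explicit checks per record. The sketch below is only an illustration: the rule set, the function name `FindIssues`, and the contents of `ValidStates` are assumptions (the lookup simply mirrors the three codes in the StateID table), not part of the original slides.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module RecordCheckSketch
    ' Assumed lookup of valid state codes, copied from the StateID table above.
    Private ReadOnly ValidStates As New HashSet(Of String) From {"VA", "NY", "WY"}

    ' Returns a list of human-readable issues found in one customer record.
    ' The rules are hypothetical examples, not an exhaustive check.
    Function FindIssues(custId As String, name As String,
                        state As String, zip As String) As List(Of String)
        Dim issues As New List(Of String)
        If String.IsNullOrWhiteSpace(custId) Then issues.Add("missing customer ID")
        If String.IsNullOrWhiteSpace(name) Then issues.Add("missing name")
        If Not ValidStates.Contains(If(state, "")) Then
            issues.Add("state is not a code in the lookup table")
        End If
        If zip Is Nothing OrElse zip.Length <> 5 OrElse
           Not zip.All(Function(c) Char.IsDigit(c)) Then
            issues.Add("zip is not 5 digits")
        End If
        Return issues
    End Function

    Sub Main()
        ' "Virginia" is spelled out instead of using the code "VA".
        For Each issue As String In FindIssues("0346", "Mr. Bigg", "Virginia", "22901")
            Console.WriteLine(issue)
        Next
    End Sub
End Module
```

Checks like these catch only format problems; duplicates such as "Daniel Steeper" and "Danny Steeper" need fuzzy matching or human review.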

8 Approaches to Data Quality
Find and fix
Prevent at the source
Do nothing
3m case

9 Homework
Business Scenario: Google’s Daily CAGR

10 You are a financial analyst at a fintech firm
"Many of our customers invest for short amounts of time on Google. They sell their shares within a few weeks... I wonder: do they make any money out of it?"
"I am on it..."
"While you are at it... clean the data, first."
"Consider it done."

11 Daily CAGR for Google
You get a file with ~1000 customers who recently bought and sold GOOG.
Three steps (and two homework):
1. Clean data: phones, dates
2. Compute Daily CAGR: $\text{Daily CAGR} = \left(\frac{\text{final price}}{\text{initial price}}\right)^{1/\text{days}} - 1$
3. Report the Average Daily CAGR across all customers.
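The formula and the averaging step can be sketched in a few lines. This is a minimal illustration, not the homework solution: the function name `DailyCagr` and the sample prices and holding periods are made up for the example.

```vbnet
Imports System

Module DailyCagrSketch
    ' Daily CAGR = (finalPrice / initialPrice) ^ (1 / days) - 1
    Function DailyCagr(initialPrice As Double, finalPrice As Double,
                       days As Integer) As Double
        If initialPrice <= 0 OrElse days <= 0 Then
            Throw New ArgumentException("initialPrice and days must be positive.")
        End If
        Return (finalPrice / initialPrice) ^ (1.0 / days) - 1.0
    End Function

    Sub Main()
        ' Hypothetical trades: buy price, sell price, days held.
        Dim buy() As Double = {100.0, 200.0}
        Dim sell() As Double = {121.0, 200.0}
        Dim held() As Integer = {2, 10}

        ' Average the per-customer daily CAGR values.
        Dim total As Double = 0
        For i As Integer = 0 To buy.Length - 1
            total += DailyCagr(buy(i), sell(i), held(i))
        Next
        Dim average As Double = total / buy.Length
        Console.WriteLine(average)   ' roughly 0.05, i.e. about 5% per day
    End Sub
End Module
```

Note that step 3 averages the individual daily rates; it does not compute one CAGR from pooled prices.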

12 Cleaning Phone Numbers
From: # To: (234)

13 UML Activity Diagram - Daily Compound Average Growth of a Security (part I)
When the user presses a button, a file selection window pops up. The user selects a file. The file is shown starting at cell "A1". The start button becomes invisible. Three more buttons appear: "Clean phone numbers", "Format Dates", and "Compute Daily CAGR".
Diagram flow (connector A waits for a button press):
[Compute]: next homework
[Format Dates]: next homework
[Clean ph.no]: Select the next phone no. Count its digits.
[Exactly 10 digits]: Format as (xxx)-xxx-xxxx & clear highlight if any. Otherwise: highlight the cell in red.
[No More Ph.No]: return to A.
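The phone-cleaning branch of the diagram can be sketched as a pure function, leaving the Excel cell handling aside. This is an assumption-laden sketch: the name `CleanPhone` and the convention of returning `Nothing` for numbers that should be highlighted in red are choices made for the example.

```vbnet
Imports System
Imports System.Text

Module PhoneCleaningSketch
    ' Returns the number formatted as (xxx)-xxx-xxxx, or Nothing when the
    ' input does not contain exactly 10 digits (the "highlight in red" case).
    Function CleanPhone(raw As String) As String
        If raw Is Nothing Then Return Nothing

        ' Keep only the digits, dropping "#", "-", spaces, and so on.
        Dim digits As New StringBuilder()
        For Each c As Char In raw
            If Char.IsDigit(c) Then digits.Append(c)
        Next
        If digits.Length <> 10 Then Return Nothing

        Dim d As String = digits.ToString()
        Return String.Format("({0})-{1}-{2}",
                             d.Substring(0, 3), d.Substring(3, 3), d.Substring(6, 4))
    End Function

    Sub Main()
        For Each raw As String In {"#2344-234-33-3", "999-9"}
            Dim cleaned As String = CleanPhone(raw)
            If cleaned Is Nothing Then
                Console.WriteLine(raw & " -> highlight in red")
            Else
                Console.WriteLine(raw & " -> " & cleaned)
            End If
        Next
    End Sub
End Module
```

In the Excel add-in, the caller would write the returned string back to the cell and clear or set the cell's highlight depending on whether the result is `Nothing`.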

14 WINIT
What Is New In Technology?

15 Text manipulation
used in data quality and beyond

16 Strings and Characters
Dim myString As String = "This is a sample string"
Dim myString2 As String = "s"
Dim myChar As Char = "s"c   ' the c suffix marks a Char literal

17 Testing Numbers
Dim myString As String = "#2344-234-33-3"
Dim temp As String = ""
' keep only the numeric characters
For Each x As Char In myString
    If IsNumeric(x) Then
        temp = temp & x
    End If
Next
' temp is now "2344234333"

18 Inserting and Removing
Dim myStr As String = "This is a sample string"
myStr = myStr.Insert(4, "xyz")          ' "Thisxyz is a sample string"
myStr = myStr.Remove(4, 3)              ' starting where, how many: "This is a sample string"
myStr = myStr.Replace(" is", " was")    ' "This was a sample string"

19 Composing text
Dim s As String = "4344562456"
Dim temp As String
temp = "Ph. " + s.Substring(0, 3) + " / " + s.Substring(3, 3) + " " + s.Substring(6, 4)
' The above is an example that produces this:
' Ph. 434 / 456 2456
' or - same thing -
temp = String.Format("Ph. {0} / {1} {2}", s.Substring(0, 3), s.Substring(3, 3), s.Substring(6, 4))

20 Trimming and Padding
myLength = myString.Length
myNewString = myString.Trim()
myNewString = myString.TrimEnd()
myNewString = myString.TrimStart()
myNewString = myString.PadLeft(50)    ' 50 = total length of the result
myNewString = myString.PadRight(20)   ' 20 = total length of the result
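A short runnable sketch makes the effect of each call visible. The sample value `"  GOOG  "` is made up for illustration, and the brackets are printed only to show where the whitespace ends up.

```vbnet
Imports System

Module TrimPadSketch
    Sub Main()
        Dim raw As String = "  GOOG  "   ' hypothetical value: 2 spaces on each side

        Console.WriteLine(raw.Length)                        ' 8
        Console.WriteLine("[" & raw.Trim() & "]")            ' [GOOG]
        Console.WriteLine("[" & raw.TrimStart() & "]")       ' [GOOG  ]
        Console.WriteLine("[" & raw.TrimEnd() & "]")         ' [  GOOG]

        ' The argument is the total length of the result, not the number
        ' of spaces added.
        Console.WriteLine("[" & "GOOG".PadLeft(8) & "]")     ' [    GOOG]
        Console.WriteLine("[" & "GOOG".PadRight(8) & "]")    ' [GOOG    ]

        ' Padding never truncates: a shorter target leaves the string as is.
        Console.WriteLine("[" & "GOOG".PadLeft(2) & "]")     ' [GOOG]
    End Sub
End Module
```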

21 Reading a File into EXCEL
' store the address of the current active sheet (the 'target')
Dim myActiveS As Excel.Worksheet = Application.ActiveSheet
' select a file
Dim myFile As String = Application.GetOpenFilename()
' get the data in a new temporary workbook
Application.Workbooks.OpenText(myFile, , , Excel.XlTextParsingType.xlDelimited, , , , , True)
' store the address of the temporary workbook
Dim myActiveWB As Excel.Workbook = Application.ActiveWorkbook
' copy the content from the temporary to the 'target' sheet
myActiveS.Range("A1:J1000").Value = Application.ActiveSheet.Range("A1:J1000").Value
' close the temp workbook
myActiveWB.Close()

22 Finding the last non-empty row
Dim lastRow As Integer
lastRow = Cells(Rows.Count, 1).End(Excel.XlDirection.xlUp).Row

23 Suggestions
Give yourself plenty of time.
