

Why clean data?
- Data quality is important.

Cleaning data
- Makes the data fit for purpose/plausible
- Reduces the negative impact of errors
- Improves the data quality
- Improves the quality of the outputs

What to look for
- Non-response
  - Item non-response, e.g. missing data
- Erroneous data
  - Can negatively affect the data and the quality of the results
- Suspicious data

Data gathering problems
- Manual entry (input error)
  - For example, switching numbers around, missing numbers, inputting two responses into one field
- Duplicates
  - For example, submit button hit more than once
- Measurement errors
  - For example, using inches instead of cm, reading scales incorrectly
- Non-uniform standards for content and format
  - For example, people using different units: some giving index finger lengths in cm, some in mm
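Two of the problems above, duplicates and mixed units, can be sketched in a few lines of Python. This is only an illustration: the rows, the unit labels and the helper `to_cm` are made up for the example, and treating identical rows as duplicate submissions is an assumption that would need checking against the real survey.

```python
# Conversion factors to centimetres (unit labels are assumed for this sketch).
FACTORS_TO_CM = {"cm": 1.0, "mm": 0.1, "in": 2.54}


def to_cm(value, unit):
    """Convert a length in the given unit to centimetres."""
    return value * FACTORS_TO_CM[unit]


# Hypothetical survey rows: (respondent_id, index_finger_length, unit)
rows = [
    (1, 7.2, "cm"),
    (2, 68, "mm"),
    (2, 68, "mm"),  # same submission twice, e.g. submit button hit twice
    (3, 2.9, "in"),
]

seen = set()
cleaned = []
for row in rows:
    if row in seen:  # drop exact repeats of an earlier row
        continue
    seen.add(row)
    respondent_id, length, unit = row
    cleaned.append((respondent_id, round(to_cm(length, unit), 2)))

print(cleaned)
```

After this step every length is in cm, so values from different respondents can be compared directly.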

Process of cleaning data
- Detect
- Resolve
- Treat

Detect
- Identify erroneous or suspicious data
  - Graph or sort data and look at outliers
- I have a student who throws ten dice and records the number of sixes. They recorded: (2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4).
  - What is wrong?
  - What do you think is the cause of it?
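One simple way to detect a problem in the dice data is a range check: with ten dice, the number of sixes has to lie between 0 and 10. The sketch below is illustrative only; the names `NUM_DICE`, `recorded` and `impossible` are chosen for the example.

```python
NUM_DICE = 10
recorded = [2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4]

# Sorting puts the extreme values at the ends, where they are easy to spot.
print(sorted(recorded))

# Flag (position, value) pairs that cannot occur with NUM_DICE dice.
impossible = [
    (position, value)
    for position, value in enumerate(recorded)
    if not 0 <= value <= NUM_DICE
]
print(impossible)
```

A range check only shows that a value is wrong. It does not say what caused it, so that question still has to be answered by thinking about how the data were recorded.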

Detect
- Consider the data points
  - 3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
  - “92” is suspicious: an outlier
- Outliers:
  - are potentially legitimate (correct)
  - can be data or model glitches
  - can be a data miner's dream, for example, a highly profitable customer
- Outlier: “departure from the expected”
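The slide does not say how “departure from the expected” should be measured. One common convention, used here purely as an assumption, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch with Python's standard library:

```python
import statistics

data = [3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92]

# Quartiles of the data (statistics.quantiles with n=4 returns Q1, Q2, Q3).
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles: a convention, not a law.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

A flagged value is only suspicious. As the slide notes, it may still be legitimate, so the decision about what to do with it belongs to the Resolve step.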

Resolve
- Deciding if erroneous or suspicious data should be corrected or amended
- Deciding on the action to “treat” the data

Treat
- Leave as is
- Change
  - Impute: determine replacement value
  - Replacement value is obtained from a similar record from the “clean” respondents from the data at hand
- Remove
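The three treatments can be sketched as one small function. Everything here is hypothetical: the record fields (`group`, `value`, `flagged`), the function name `treat`, and the rule that a “similar” record is the first clean record in the same group. Real imputation methods choose donors more carefully.

```python
def treat(records, action, similar_on="group"):
    """Return a new list with flagged records left as is, imputed, or removed."""
    if action == "leave":
        return [dict(r) for r in records]
    if action == "remove":
        return [dict(r) for r in records if not r["flagged"]]
    if action == "impute":
        clean = [r for r in records if not r["flagged"]]
        treated = []
        for record in records:
            record = dict(record)  # copy so the input is not modified
            if record["flagged"]:
                # Donors: clean records that match on the chosen field.
                donors = [c for c in clean if c[similar_on] == record[similar_on]]
                if donors:
                    record["value"] = donors[0]["value"]
                    record["flagged"] = False
            treated.append(record)
        return treated
    raise ValueError(f"unknown action: {action}")


# Hypothetical index finger lengths in cm; record 3 was flagged during Detect.
records = [
    {"id": 1, "group": "adult", "value": 7.1, "flagged": False},
    {"id": 2, "group": "child", "value": 4.9, "flagged": False},
    {"id": 3, "group": "adult", "value": 71.0, "flagged": True},
]

imputed = treat(records, "impute")
removed = treat(records, "remove")
print(imputed[2]["value"], len(removed))
```

The function returns copies, so the original records are kept and the choice of treatment can be revisited later.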