Published by Tomás Molinari Barreto. Modified over 6 years ago.
1
DATA CLEANING
Presented to: Dr. Dibyojyoti Bhattacharjee
Presented by: Ali Akbar Mazumder (15), Mahbobul Hoque Barlaskar (17), Mritunjoy Deb (28)
DBA-SMS, Assam University
2
DATA CLEANING
3
Content:
• Definition
• Need for Data Cleaning
• Data quality
• Major Heads
• Data consistency
• Treatment of Missing numbers
• Reference
11/17/2018 Data Cleaning
4
Definition: A process used to identify inaccurate, incomplete, or unreasonable data and then improve its quality by correcting the detected errors and omissions.
5
Data cleaning, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate.
6
During data cleaning, records are checked for accuracy and consistency and are either corrected or deleted as necessary. Data cleaning can occur within a single set of records, or between multiple sets of data that need to be merged or that will work together.
7
What is the need for data cleaning?
The need for data cleaning centers on improving the quality of data to make them “fit for use” by users, by reducing errors in the data and improving their documentation and presentation.
8
Moreover, the data collected should be subjected to thorough scrutiny to see whether they can be considered correct.
• Data is often of low quality.
• Data collected may have missing information.
• Data collected may be contradictory.
9
Data cleaning is very important to the efficiency of any data-dependent business.
If some of the clients within a database do not have accurate phone numbers, then the company cannot easily contact them.
10
If clients' addresses are not formatted correctly, an automated system would be unable to send out the latest coupons and special deals.
11
The job of data cleaning is to ensure that the data within a system is correct, so that the system is able to use the data. Inaccurate or incomplete records are not much use to anyone.
12
High-quality data needs to pass a set of quality criteria. These include:
• Accuracy: the data should be exact or precise.
• Integrity: the data should be reliable.
• Completeness: the wholeness, totality, or entirety of the data.
13
• Validity: the data should be legitimate, valid, and sound.
• Consistency: the stability, steadiness, and regularity of the data.
• Uniformity: the homogeneity of the data.
• Density: the compactness and bulk of the data collected.
• Uniqueness: the number of duplicates in the data.
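Some of these criteria can be measured directly. The following is a minimal sketch, assuming records are stored as Python dictionaries; the function names and the sample client records are hypothetical and only illustrate how completeness and uniqueness might be quantified.

```python
def completeness(rows, fields):
    """Share of field slots that hold a value (1.0 means nothing is missing)."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for field in fields if row.get(field) not in (None, "")
    )
    return filled / total


def duplicate_count(rows, fields):
    """Number of rows that repeat an earlier row on the given fields."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(field) for field in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Hypothetical client records, invented for illustration only.
clients = [
    {"name": "A", "phone": "1"},
    {"name": "A", "phone": "1"},
    {"name": "B", "phone": None},
]
print(completeness(clients, ["name", "phone"]))
print(duplicate_count(clients, ["name", "phone"]))
```

Accuracy, integrity, and validity usually need an external source of truth, so they are not covered by this sketch.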
14
Five major heads of data cleaning
• Semi-structure
• Standardize
• Local consistency check
• Global consistency check
• Document
15
Example
Example adapted from Dealing with Dirty Data, by Juthika Konwar, DBMS, September 2010
Address entry from unstructured file:
Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach
16
Semi-structuring
17
Also known as parsing:
Addressee First Name(1): Juthika
Addressee Last Name(1): Konwar
Addressee First Name(2): Debomalya
Addressee Last Name(2): Ghosh
Street Name: Hailakd Rd
Post Office Box Number: 1234
City: Sil
State: Cach
Pin Number:
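The parsing step above could be sketched as follows. This is only an illustration: the regular expression and the field names (`first1`, `po_box`, and so on) are assumptions tailored to this single example entry, and real address parsing needs far more robust rules.

```python
import re

# Hypothetical pattern written for the one example entry in the slides.
ADDRESS_PATTERN = re.compile(
    r"^(?P<first1>\w+) (?P<last1>\w+) and (?P<first2>\w+) (?P<last2>\w+) "
    r"(?P<street>.+?) Box (?P<po_box>\d+) (?P<city>\w+) (?P<state>\w+)"
    r"(?: (?P<pin>\w+))?$"
)


def parse_address(entry):
    """Split an unstructured address line into named fields, or return None."""
    match = ADDRESS_PATTERN.match(entry.strip())
    if match is None:
        return None
    return match.groupdict()


record = parse_address(
    "Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach"
)
print(record)
```

A missing Pin Number comes back as `None`, so the gap stays visible for the later checks.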
18
Standardize
We should replace synonyms with one standard term:
• Hailakd Rd → Hailakandi Road
• Sil → Silchar
• Cach → Cachar
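One simple way to apply such replacements is a lookup table. The sketch below uses the three mappings from the slide; the function name `standardize` and the dictionary layout of the record are assumptions made for illustration.

```python
# Mappings taken from the slide: abbreviation -> standard term.
STANDARD_TERMS = {
    "Hailakd Rd": "Hailakandi Road",
    "Sil": "Silchar",
    "Cach": "Cachar",
}


def standardize(record, mapping=STANDARD_TERMS):
    """Replace every known synonym with its standard term; leave other values alone."""
    return {field: mapping.get(value, value) for field, value in record.items()}


parsed = {"street": "Hailakd Rd", "po_box": "1234", "city": "Sil", "state": "Cach"}
print(standardize(parsed))
```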
19
Local consistency check
Does each piece of data make sense on its own?
• Silchar and the Pin Number are in Assam
• State was listed as Cachar
• 2/3 attributes point to Assam as the correct state. Change state to Assam.
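The "2/3 attributes" reasoning is a majority vote, which might be sketched as below. The lookup tables, the function name `resolve_state`, and the Pin value `788001` are all assumptions for illustration; the slides do not give the actual Pin Number.

```python
from collections import Counter

# Illustrative lookup tables (assumptions, not part of the slides).
CITY_TO_STATE = {"Silchar": "Assam"}
PIN_PREFIX_TO_STATE = {"788": "Assam"}


def resolve_state(record):
    """Let the listed state, the city, and the Pin each vote for a state."""
    pin = record.get("pin") or ""
    votes = [
        record.get("state"),
        CITY_TO_STATE.get(record.get("city")),
        PIN_PREFIX_TO_STATE.get(pin[:3]),
    ]
    counts = Counter(vote for vote in votes if vote is not None)
    if not counts:
        return None
    state, n = counts.most_common(1)[0]
    # Only override the listed state when a strict majority agrees.
    return state if n * 2 > len(votes) else record.get("state")


# Hypothetical record: the Pin value is invented for the example.
print(resolve_state({"city": "Silchar", "state": "Cachar", "pin": "788001"}))
```

When no strict majority exists, the sketch keeps the listed state rather than guessing.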
20
Global consistency check
Find the data given by Juthika Konwar and Debomalya Ghosh in the administrative records and ensure that all of the elements of all the addresses are identical.
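A global check of this kind compares the same person's record across sources. The following is a minimal sketch, assuming each source yields a dictionary with the same field names; the function name and the second record are hypothetical.

```python
def find_mismatches(records):
    """Return {field: set of distinct values} for fields that differ across records."""
    fields = set().union(*(record.keys() for record in records))
    mismatches = {}
    for field in sorted(fields):
        values = {record.get(field) for record in records}
        if len(values) > 1:
            mismatches[field] = values
    return mismatches


cleaned = {"city": "Silchar", "po_box": "1234"}
# Hypothetical administrative record, invented to show a mismatch.
administrative = {"city": "Silchar", "po_box": "1243"}
print(find_mismatches([cleaned, administrative]))
```

An empty result means every element is identical across the sources.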
21
Documenting
Document the results of semi-structuring, standardizing, and consistency checking in data
• Important for users of the integrated database
• Important for doing future updates of the
22
Data consistency
It identifies data that are:
• Out of range
• Logically inconsistent
• Having extreme values
23
Out of range
Suppose that, while entering the data, Juthika Konwar gave a Pin Number that is out of range: a Pin should contain exactly 6 digits, but the Pin No. above contains 7 digits, which is not acceptable.
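A range check like this is easy to automate. The sketch below only tests the length rule stated on the slide; the function name and the sample Pin values are hypothetical, since the slides do not show the actual number.

```python
PIN_LENGTH = 6  # a Pin should contain exactly 6 digits, per the slide


def pin_length_ok(pin):
    """Return True when the Pin has exactly PIN_LENGTH characters."""
    return len(str(pin).strip()) == PIN_LENGTH


# Hypothetical values, invented for illustration.
print(pin_length_ok("788001"))   # six characters
print(pin_length_ok("7880011"))  # seven characters, out of range
```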
24
Logically inconsistent
Logical inconsistency arises when the responses to two or more variables do not match. E.g., suppose that while entering the Pin Number, Juthika entered letters along with the digits (Pin No.: 7880AB), which is logically incorrect.
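Both kinds of problem, letters inside a numeric field and disagreement between two variables, can be flagged with a small rule set. This is a sketch under stated assumptions: the prefix lookup table and the function name `logical_problems` are hypothetical and not part of the slides.

```python
# Illustrative lookup (an assumption): Pin prefix -> expected state.
PIN_PREFIX_TO_STATE = {"788": "Assam"}


def logical_problems(record):
    """List the logical inconsistencies found in one record."""
    problems = []
    pin = str(record.get("pin", "")).strip()
    if not (pin.isascii() and pin.isdigit()):
        problems.append("pin is not purely numeric")
    expected_state = PIN_PREFIX_TO_STATE.get(pin[:3])
    if expected_state is not None and record.get("state") != expected_state:
        problems.append("pin and state do not agree")
    return problems


print(logical_problems({"pin": "7880AB", "state": "Assam"}))
print(logical_problems({"pin": "788001", "state": "Cachar"}))
```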
25
Having extreme values
Extreme values are very large data values, which should be handled carefully. When dealing with extreme data, keep in mind that an extreme value may seem incorrect, but it should be properly verified before the data are actually used.
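The slides do not name a method for spotting extreme values. One common choice, used here purely as an illustration, is the interquartile range (IQR) rule. The sketch flags values for verification rather than deleting them; the function name and the sample numbers are hypothetical.

```python
from statistics import quantiles


def flag_extreme_values(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    if len(values) < 4:
        return []  # too few observations to estimate quartiles sensibly
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements, invented for illustration.
print(flag_extreme_values([10, 12, 11, 13, 12, 11, 10, 200]))
```

Flagged values still need to be checked against the source before any correction is made.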
26
Treatment of Missing numbers:
Missing data is a common problem in research analysis. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation.
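These thresholds translate into a small helper. In the sketch below, missing values are assumed to be represented by `None`, and the boundary cases (exactly 5% and exactly 15%) are assigned to the lower band, which is an assumption because the slide leaves them ambiguous. The function names are hypothetical.

```python
def missing_rate(values):
    """Percentage of entries that are missing (represented here as None)."""
    if not values:
        raise ValueError("cannot compute a missing rate for an empty column")
    return 100 * sum(v is None for v in values) / len(values)


def classify_missing_rate(percent):
    """Map a missing-data percentage onto the bands given on the slide."""
    if percent < 1:
        return "trivial"
    if percent <= 5:
        return "manageable"
    if percent <= 15:
        return "requires sophisticated methods"
    return "may severely impact interpretation"


column = [1, None, 3, 4]  # hypothetical column with one missing value
rate = missing_rate(column)
print(rate, classify_missing_rate(rate))
```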
27
Missing Data occurs in a data set when an observation is missing a value on a variable.
28
Treatment of Missing numbers
29
Elimination
• If a missing value occurs on any of the p variables, eliminate the entire observation.
• This is the default method for most procedures.
• Consequence: A data set with even a modest amount of missing values scattered throughout can result in a substantial reduction in sample size.
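Elimination (often called listwise deletion) can be sketched as below, assuming each observation is a dictionary and missing values are `None`. The function name and the sample observations are hypothetical; the example also shows how quickly the sample shrinks.

```python
def eliminate_incomplete(observations):
    """Keep only the observations with no missing value on any variable."""
    return [
        obs for obs in observations if all(v is not None for v in obs.values())
    ]


# Hypothetical observations: each one misses at most a single value.
data = [
    {"x1": 1, "x2": 2, "x3": 3},
    {"x1": None, "x2": 5, "x3": 6},
    {"x1": 7, "x2": None, "x3": 9},
    {"x1": 10, "x2": 11, "x3": None},
]
complete = eliminate_incomplete(data)
print(len(data), "->", len(complete))
```

Only 3 of the 12 values are missing here, yet three of the four observations are dropped.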
30
Imputation
• Imputation describes the process of filling in the missing values of a variable.
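The slides do not say which imputation method to use. Mean imputation is one of the simplest options and is shown here only as an illustration; the function name is hypothetical and missing values are assumed to be `None`.

```python
from statistics import mean


def impute_mean(values):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([2, None, 4]))
```

Mean imputation keeps the sample size but reduces the variable's variance, so it is a trade-off rather than a free fix.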
31
Reference
http://en.wikipedia.org
http://www.wisegeek.com
32
Thank you for your Precious TIME….