1 DATA CLEANING
Presented to: Dr. Dibyojyoti Bhattacharjee
Presented by: Ali Akbar Mazumder (15), Mahbobul Hoque Barlaskar (17), Mritunjoy Deb (28)
DBA-SMS, Assam University

2 DATA CLEANING

3 Content:
• Definition
• Need for Data Cleaning
• Data quality
• Major Heads
• Data consistency
• Treatment of Missing numbers
• Reference
11/17/2018 Data Cleaning

4 Definition: A process used to identify inaccurate, incomplete, or unreasonable data and then improve its quality by correcting the detected errors and omissions.

5 Data cleaning, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate.

6 During data cleaning, records are checked for accuracy and consistency, and either corrected or deleted as necessary. Data cleaning can occur within a single set of records, or between multiple sets of data which need to be merged or which will work together.

7 What is the need for data cleaning?
The need for data cleaning centers on improving the quality of data to make it “fit for use” by reducing errors in the data and improving its documentation and presentation.

8 Data is often of low quality.
The data collected should therefore be subjected to thorough scrutiny to see whether it can be considered correct.
• Data collected may have missing information.
• Data collected may be contradictory.

9 Data cleaning is very important to the efficiency of any data-dependent business.
If some of the clients in a database do not have accurate phone numbers, then the company cannot easily contact them.

10 If clients' addresses are not formatted correctly, an automated system would be unable to send out the latest coupons and special deals.

11 The job of data cleaning is to ensure that the data within a system is correct, so that the system is able to use the data. Inaccurate or incomplete records are of little use to anyone.

12 High-quality data needs to pass a set of quality criteria. These include:
• Accuracy: the data should be exact or precise.
• Integrity: the data should be reliable.
• Completeness: wholeness, totality, or entireness of the data.

13 Validity: the data should be legitimate and valid.
• Consistency: concerns the stability, steadiness, and regularity of the data.
• Uniformity: directly related to homogeneous data.
• Density: refers to the compactness and bulk of the data collected.
• Uniqueness: related to the number of duplicates in the data.
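Two of these criteria, completeness and uniqueness, are easy to put a number on. The following is a minimal sketch, assuming records are plain Python dictionaries and that a missing value is stored as None or an empty string. The field names and sample records are hypothetical and only serve as illustration.

```python
def completeness(records, fields):
    """Share of field values that are present (neither None nor empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(
        1
        for rec in records
        for field in fields
        if rec.get(field) not in (None, "")
    )
    return present / total


def duplicate_count(records, key_fields):
    """Number of records that repeat an earlier record on the key fields."""
    seen = set()
    duplicates = 0
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Hypothetical client records (not taken from the presentation).
clients = [
    {"name": "A", "phone": "111", "city": "Silchar"},
    {"name": "B", "phone": "", "city": "Silchar"},
    {"name": "A", "phone": "111", "city": "Silchar"},
    {"name": "C", "phone": None, "city": None},
]

print(completeness(clients, ["name", "phone", "city"]))  # 9 of 12 values present
print(duplicate_count(clients, ["name", "phone"]))       # one repeated record
```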

14 Five major heads of data cleaning
• Semi-structure
• Standardize
• Local consistency check
• Global consistency check
• Document

15 Example
Example adapted from Dealing with Dirty Data, by Juthika Konwar, DBMS, September 2010.
Address entry from unstructured file:
Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach

16 Semi-structuring

17 Semi-structuring, also known as parsing:
Addressee First Name(1): Juthika
Addressee Last Name(1): Konwar
Addressee First Name(2): Debomalya
Addressee Last Name(2): Ghosh
Street Name: Hailakd Rd
Post Office Box Number: 1234
City: Sil
State: Cach
Pin Number:
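The parsing step can be sketched with a regular expression. This is only an illustration, assuming every entry follows the exact layout of the example (two addressees joined by "and", a street, "Box" plus a number, then city and state). The pattern and the field names are hypothetical, and real address data would need a far more tolerant parser.

```python
import re

# Hypothetical pattern that fits the layout of the example entry only.
ADDRESS_PATTERN = re.compile(
    r"^(?P<first_name_1>\w+) (?P<last_name_1>\w+) and "
    r"(?P<first_name_2>\w+) (?P<last_name_2>\w+) "
    r"(?P<street>.+?) Box (?P<po_box>\d+) "
    r"(?P<city>\w+) (?P<state>\w+)$"
)


def parse_address(entry):
    """Split an unstructured address entry into named fields.

    Returns None when the entry does not fit the assumed layout.
    """
    match = ADDRESS_PATTERN.match(entry.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The example entry carries no Pin Number, so the field stays empty.
    fields["pin"] = None
    return fields


parsed = parse_address(
    "Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach"
)
for name, value in parsed.items():
    print(f"{name}: {value}")
```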

18 Standardize
We should replace synonyms with one standard term:
• Hailakd Rd → Hailakandi Road
• Sil → Silchar
• Cach → Cachar
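A lookup table is one simple way to carry out this replacement. The sketch below assumes the record is a dictionary of already parsed fields. The table holds only the three synonyms from the slide, and the field names are hypothetical.

```python
# Synonym table built from the three replacements on the slide.
STANDARD_TERMS = {
    "Hailakd Rd": "Hailakandi Road",
    "Sil": "Silchar",
    "Cach": "Cachar",
}


def standardize(record):
    """Replace every known synonym with its standard term.

    Values that are not in the table are returned unchanged.
    """
    return {
        field: STANDARD_TERMS.get(value, value)
        for field, value in record.items()
    }


record = {"street": "Hailakd Rd", "po_box": "1234", "city": "Sil", "state": "Cach"}
print(standardize(record))
```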

19 Local consistency check
Does each piece of data make sense on its own?
• Silchar and the Pin Number are both in Assam.
• The state was listed as Cachar.
• Two of the three attributes point to Assam as the correct state, so change the state to Assam.
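The "2 out of 3" reasoning above is a majority vote. A minimal sketch follows, assuming two small lookup tables that map a city and a Pin prefix to a state. Both tables and the Pin value used in the call are hypothetical stand-ins, since the transcript does not show the actual Pin Number. A real check would draw on a postal directory.

```python
from collections import Counter

# Hypothetical lookup tables for illustration only.
CITY_TO_STATE = {"Silchar": "Assam"}
PIN_PREFIX_TO_STATE = {"788": "Assam"}


def vote_on_state(city, pin, listed_state):
    """Return the state supported by at least two of the three attributes.

    Falls back to the listed state when no value wins a majority.
    """
    votes = [
        CITY_TO_STATE.get(city),
        PIN_PREFIX_TO_STATE.get(pin[:3]) if pin else None,
        listed_state,
    ]
    counts = Counter(vote for vote in votes if vote is not None)
    if not counts:
        return listed_state
    state, support = counts.most_common(1)[0]
    return state if support >= 2 else listed_state


# City and Pin both point to Assam, so the listed state is overruled.
print(vote_on_state("Silchar", "788001", "Cachar"))
```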

20 Global consistency check
Find the data given by Juthika Konwar and Debomalya Ghosh in administrative records and ensure that all elements of all the addresses are identical.
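Such a comparison could look like the sketch below. It assumes the cleaned entry and the administrative record are both dictionaries with the same field names. The administrative record shown here is invented purely for illustration.

```python
def find_mismatches(record, reference, fields):
    """Return the fields on which two records disagree.

    The result maps each differing field to a (record, reference) pair.
    """
    return {
        field: (record.get(field), reference.get(field))
        for field in fields
        if record.get(field) != reference.get(field)
    }


cleaned = {"street": "Hailakandi Road", "city": "Silchar", "state": "Assam"}
# Hypothetical administrative record for the same addressees.
admin = {"street": "Hailakandi Road", "city": "Silchar", "state": "Cachar"}

mismatches = find_mismatches(cleaned, admin, ["street", "city", "state"])
print(mismatches)  # only the state differs between the two sources
```

An empty result means the two sources agree on every compared element.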

21 Documenting
Document the results of semi-structuring, standardizing, and consistency checking in the data.
• Important for users of the integrated database
• Important for doing future updates of the

22 Data consistency
Data consistency checks identify data that are:
• Out of range
• Logically inconsistent
• Having extreme values

23 Out of range
Suppose that, while entering the data, Juthika Konwar gives a Pin Number containing 7 digits. A Pin should contain only 6 digits, so this value is out of range and not acceptable.

24 Logically inconsistent
Logical inconsistency arises when the responses to two or more variables do not match. E.g., suppose that while entering the Pin Number, Juthika enters letters along with the digits, Pin No.: 7880AB, which is logically incorrect.
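Both Pin checks, the 6-digit length rule and the digits-only rule, can be sketched in one small validator. The function name and the wording of the messages are hypothetical. The 7-digit value in the usage lines is made up, because the transcript does not show the one from the slide.

```python
def check_pin(pin):
    """Return a list of problems found in a Pin Number given as a string.

    An empty list means the Pin passed both checks.
    """
    problems = []
    if not (pin.isascii() and pin.isdigit()):
        problems.append("contains characters other than digits")
    if len(pin) != 6:
        problems.append("does not have exactly 6 characters")
    return problems


print(check_pin("788001"))   # passes both checks
print(check_pin("7880011"))  # hypothetical 7-digit Pin: wrong length
print(check_pin("7880AB"))   # the example from the slide: letters present
```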

25 Having extreme values
Extreme values are values that are very large compared with the rest of the data and should be handled carefully. An extreme value may seem incorrect, but it should be properly verified before the data is actually used.
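The presentation does not name a method for spotting extreme values. One common choice, used here as an assumption, is the interquartile range (IQR) rule: flag anything more than 1.5 IQRs outside the quartiles. The sketch only flags values for verification and does not delete them, in line with the advice above. The sample numbers are hypothetical.

```python
from statistics import quantiles


def flag_extreme(values, k=1.5):
    """Return the values lying more than k IQRs outside the quartiles.

    Flagged values are candidates for verification, not for deletion.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious entry.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(flag_extreme(readings))
```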

26 Treatment of Missing numbers:
Missing data is a common problem in research analysis. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation.
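These thresholds translate directly into a small classifier. In the sketch below a missing value is assumed to be stored as None, and the boundary cases (exactly 5% and exactly 15%) are assigned to the lower band, which is a choice the slide leaves open. The labels and the sample data are hypothetical.

```python
def missing_rate(values):
    """Percentage of values that are missing (stored as None)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


def classify_missing_rate(rate_percent):
    """Map a missing-data rate to the bands given on the slide."""
    if rate_percent < 1:
        return "trivial"
    if rate_percent <= 5:
        return "manageable"
    if rate_percent <= 15:
        return "needs sophisticated methods"
    return "may severely impact interpretation"


# Hypothetical variable with 3 missing values out of 100.
variable = [1] * 97 + [None] * 3
rate = missing_rate(variable)
print(rate, classify_missing_rate(rate))
```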

27 Missing Data occurs in a data set when an observation is missing a value on a variable.

28 Treatment of Missing numbers

29 Elimination
• If a missing value occurs on any of the p variables, eliminate the entire observation.
• This is the default method for most procedures.
• Consequence: a data set with even a modest amount of missing values scattered throughout can result in a substantial reduction in sample size.
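The consequence described above is easy to reproduce. The sketch below assumes each observation is a tuple of p = 3 values with None marking a missing value. The data are hypothetical.

```python
def listwise_delete(observations):
    """Keep only observations with no missing (None) value on any variable."""
    return [obs for obs in observations if all(v is not None for v in obs)]


# Hypothetical data set: 5 observations on p = 3 variables.
data = [
    (23, 54.0, "M"),
    (31, None, "F"),
    (None, 61.5, "F"),
    (45, 70.2, None),
    (28, 58.3, "M"),
]

complete = listwise_delete(data)
print(len(data), "->", len(complete))  # prints "5 -> 2"
```

Here only 3 of the 15 values (20%) are missing, yet 3 of the 5 observations (60%) are dropped because the missing values fall in different rows.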

30 Imputation
• Imputation describes the process of filling in the missing values of a variable.
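The slide does not say which value should be filled in. A simple option, chosen here purely as an assumption, is to replace each missing value with the mean of the observed values of the same variable. The sketch treats None as missing.

```python
from statistics import mean


def impute_mean(values):
    """Fill missing (None) entries with the mean of the observed entries.

    Returns the list unchanged when no value has been observed.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with one missing entry.
print(impute_mean([2, None, 4]))
```

Unlike elimination, this keeps every observation, at the cost of understating the variability of the variable.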

31 Reference http://en.wikipedia.org http://www.wisegeek.com

32 Thank you for your Precious TIME….

