DATA CLEANING
Presented to: Dr. Dibyojyoti Bhattacharjee
Presented by: Ali Akbar Mazumder (15), Mahbobul Hoque Barlaskar (17), Mritunjoy Deb (28)
DBA-SMS, Assam University
Content:
• Definition
• Need for Data Cleaning
• Data quality
• Major Heads
• Data consistency
• Treatment of Missing numbers
• Reference
11/17/2018 Data Cleaning
Definition: A process used to identify inaccurate, incomplete, or unreasonable data and then improve its quality by correcting the detected errors and omissions.
Data cleaning, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate.
During data cleaning, records are checked for accuracy and consistency, and are either corrected or deleted as necessary. Data cleaning can occur within a single set of records, or between multiple sets of data that need to be merged or that will work together.
What is the need for data cleaning? Data cleaning aims to improve the quality of data and make it “fit for use” by users, by reducing errors in the data and improving its documentation and presentation.
• Data is often of low quality.
• Collected data may have missing information.
• Collected data may be contradictory.
The data collected should therefore be subjected to thorough scrutiny to see whether it can be considered correct.
Data cleaning is very important to the efficiency of any data-dependent business. If some of the clients in a database do not have accurate phone numbers, the company cannot easily contact them.
If clients' email addresses are not formatted correctly, an automated email system would be unable to send out the latest coupons and special deals.
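A badly formatted email address can often be caught before it reaches the mailing system. The following is a minimal sketch, assuming a deliberately simple pattern (real email validation is more involved); the function name `is_plausible_email` and the sample addresses are hypothetical.

```python
import re

# Simplified shape check: local@domain.tld with no spaces.
# This pattern is an assumption for illustration, not a full email grammar.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_plausible_email(address: str) -> bool:
    """Return True if the address has a basic local@domain.tld shape."""
    return EMAIL_PATTERN.fullmatch(address.strip()) is not None

# Hypothetical client addresses
clients = [
    "client.one@example.com",
    "client.two(at)example.com",
    "client three@example.com",
]

# Addresses that need correction before any email is sent
badly_formatted = [c for c in clients if not is_plausible_email(c)]
```

Here `badly_formatted` collects the second and third addresses, which would be corrected or removed during cleaning.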
The job of data cleaning is to ensure that the data within a system is correct, so that the system is able to use the data. Inaccurate or incomplete records are not of much use to anyone.
High quality data needs to pass a set of quality criteria. These include:
• Accuracy: the data should be exact or precise.
• Integrity: the data should be reliable.
• Completeness: wholeness, totality or entireness.
• Validity: relates to legal, valid and forceful data.
• Consistency: concerns stability, steadiness and regularity.
• Uniformity: relates directly to homogeneous data.
• Density: refers to the compactness and bulk of the data collected.
• Uniqueness: relates to the number of duplicates in the data.
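Completeness and uniqueness, two of the criteria listed, can be measured with simple counts. The sketch below assumes records stored as dictionaries with `None` marking a missing value; the sample records and the function names `completeness` and `duplicate_count` are hypothetical.

```python
# Hypothetical records; None marks a missing value
records = [
    {"name": "A", "phone": "111", "city": "Silchar"},
    {"name": "B", "phone": None, "city": "Silchar"},
    {"name": "A", "phone": "111", "city": "Silchar"},  # exact repeat of the first record
    {"name": "C", "phone": "333", "city": None},
]

def completeness(rows):
    """Fraction of field values that are not missing."""
    values = [v for row in rows for v in row.values()]
    return sum(v is not None for v in values) / len(values)

def duplicate_count(rows):
    """Number of records that exactly repeat an earlier record."""
    seen = set()
    repeats = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the record
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats

print(f"completeness: {completeness(records):.2f}")
print(f"duplicates: {duplicate_count(records)}")
```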
Five major heads of data cleaning:
• Semi-structure
• Standardize
• Local consistency check
• Global consistency check
• Document
Example
Example adapted from Dealing with Dirty Data by Juthika Konwar, DBMS, September 2010.
Address entry from unstructured file:
Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach 788001
Semi-structuring
Also known as parsing:
Addressee First Name(1): Juthika
Addressee Last Name(1): Konwar
Addressee First Name(2): Debomalya
Addressee Last Name(2): Ghosh
Street Name: Hailakd Rd
Post Office Box Number: 1234
City: Sil
State: Cach
Pin Number: 788001
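The parsing step can be sketched with a regular expression. This is only an illustration: it assumes every entry follows the fixed layout of the example (two names joined by "and", a street, "Box" and a number, then city, state and pin), and the group names are hypothetical.

```python
import re

raw = "Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach 788001"

# Assumed layout:
# <first> <last> and <first> <last> <street ...> Box <number> <city> <state> <pin>
LAYOUT = re.compile(
    r"^(?P<first1>\w+) (?P<last1>\w+) and (?P<first2>\w+) (?P<last2>\w+) "
    r"(?P<street>.+) Box (?P<box>\d+) (?P<city>\w+) (?P<state>\w+) (?P<pin>\d+)$"
)

match = LAYOUT.match(raw)
# An entry that does not fit the assumed layout yields an empty result
parsed = match.groupdict() if match else {}

for field, value in parsed.items():
    print(f"{field}: {value}")
```

Entries that do not match the layout come back empty and would need manual review or a more flexible parser.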
Standardize
We should replace synonyms with one standard term:
• Hailakd Rd → Hailakandi Road
• Sil → Silchar
• Cach → Cachar
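Standardizing can be sketched as a lookup in a table of known abbreviations. The three entries below are the ones from this example; the table name `STANDARD_TERMS` and the function `standardize` are hypothetical.

```python
# Lookup table of abbreviations and their standard terms
STANDARD_TERMS = {
    "Hailakd Rd": "Hailakandi Road",
    "Sil": "Silchar",
    "Cach": "Cachar",
}

def standardize(value: str) -> str:
    """Return the standard term if one is known, otherwise the value unchanged."""
    return STANDARD_TERMS.get(value, value)

record = {"street": "Hailakd Rd", "city": "Sil", "state": "Cach", "pin": "788001"}
standardized = {field: standardize(value) for field, value in record.items()}
```

Values without an entry in the table, such as the pin, pass through unchanged.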
Local consistency check
Does each piece of data make sense on its own?
• Silchar and Pin Number 788001 are in Assam
• State was listed as Cachar
• 2 of 3 attributes point to Assam as the correct state. Change state to Assam.
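The "2 of 3 attributes" reasoning is a majority vote, which can be sketched as follows. The lookup tables `CITY_TO_STATE` and `PIN_TO_STATE` hold only the facts stated in this example, and the function name `resolve_state` is hypothetical.

```python
from collections import Counter

# Reference facts from the example: Silchar and pin 788001 are in Assam
CITY_TO_STATE = {"Silchar": "Assam"}
PIN_TO_STATE = {"788001": "Assam"}

def resolve_state(record):
    """Pick the state that at least two of the three attributes agree on."""
    votes = [
        record["state"],                    # state as listed
        CITY_TO_STATE.get(record["city"]),  # state implied by the city
        PIN_TO_STATE.get(record["pin"]),    # state implied by the pin
    ]
    counts = Counter(v for v in votes if v is not None)
    state, count = counts.most_common(1)[0]
    # Without a majority, keep the listed state rather than guess
    return state if count >= 2 else record["state"]

record = {"city": "Silchar", "state": "Cachar", "pin": "788001"}
record["state"] = resolve_state(record)
```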
Global consistency check
Find the data given by Juthika Konwar and Debomalya Ghosh in the administrative records and ensure that all the elements of all the addresses are identical.
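A global check compares the cleaned record with the same person's record in another source. The sketch below assumes both records are dictionaries with shared field names; the administrative record and its differing pin are hypothetical values chosen to show a mismatch.

```python
def inconsistent_fields(record_a, record_b):
    """Return the shared fields whose values differ between the two records."""
    shared = record_a.keys() & record_b.keys()
    return sorted(f for f in shared if record_a[f] != record_b[f])

integrated = {"city": "Silchar", "state": "Assam", "pin": "788001"}
# Hypothetical administrative record for the same person
administrative = {"city": "Silchar", "state": "Assam", "pin": "788005"}

differences = inconsistent_fields(integrated, administrative)
```

Any field listed in `differences` needs to be resolved before the addresses can be treated as identical.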
Documenting
Document the results of semi-structuring, standardizing, and consistency checking in data
• Important for users of the integrated database
• Important for doing future updates of the
Data consistency
Consistency checks identify data that are:
• Out of range
• Logically inconsistent
• Extreme in value
Out of range
Suppose that while entering the data, Juthika Konwar gave the Pin Number as 7880011. This is out of range: a Pin Number should contain only 6 digits, and this one contains 7, which is not acceptable.
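A check for this rule can be sketched in a few lines, assuming the pin is stored as text; the function name `is_valid_pin` is hypothetical.

```python
def is_valid_pin(pin: str) -> bool:
    """Accept a pin only if it consists of exactly six ASCII digits."""
    return len(pin) == 6 and pin.isascii() and pin.isdigit()

for candidate in ["788001", "7880011"]:
    status = "accepted" if is_valid_pin(candidate) else "rejected"
    print(candidate, status)
```

The seven-digit entry 7880011 is rejected, while 788001 is accepted.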
Logically inconsistent
Logical inconsistency arises when the responses to two or more variables do not match. E.g.: Suppose that while entering the Pin Number, Juthika entered it with letters, Pin No.: 7880AB, which is logically incorrect…
Having extreme values
Extreme values are values that are very large (or very small) compared with the rest of the data, and they should be handled carefully. An extreme value may seem incorrect, but it should be properly verified before the data is actually used.
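One common way to flag extreme values for verification is the interquartile-range rule. The slides do not name a method, so this choice, the conventional multiplier of 1.5, the function name `flag_extremes` and the sample figures are all assumptions made for illustration.

```python
import statistics

def flag_extremes(values, k=1.5):
    """Return the values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one unusually large entry
measurements = [21, 23, 24, 25, 26, 27, 29, 250]
to_verify = flag_extremes(measurements)
```

The flagged values are candidates for verification, not automatic deletion, in line with the advice above.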
Treatment of Missing numbers: Missing data is a common problem in research analysis. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation.
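These bands can be applied to a variable by computing its share of missing values. In the sketch below, `None` marks a missing value, the handling of the exact boundaries (1%, 5%, 15%) is an assumption because the bands above overlap at those points, and the function names and sample column are hypothetical.

```python
def missing_rate(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def assess(rate):
    """Map a missing-data rate to the bands quoted above."""
    if rate < 0.01:
        return "trivial"
    if rate <= 0.05:
        return "manageable"
    if rate <= 0.15:
        return "requires sophisticated methods"
    return "may severely impact interpretation"

# Hypothetical column: 97 observed values and 3 missing ones
column = [1] * 97 + [None] * 3
verdict = assess(missing_rate(column))
```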
Missing data occurs in a data set when an observation is missing a value on a variable.
Treatment of Missing numbers
Elimination
• If a missing value occurs on any of the p variables, eliminate the entire observation.
• This is the default method for most procedures.
• Consequence: a data set with even a modest amount of missing values scattered throughout can suffer a substantial reduction in sample size.
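The consequence is easy to see in a small sketch of elimination. The observations below are hypothetical, with `None` marking a missing value.

```python
# Hypothetical observations on p = 3 variables; None marks a missing value
rows = [
    {"age": 34, "income": 52, "score": 7},
    {"age": None, "income": 48, "score": 6},
    {"age": 29, "income": None, "score": 8},
    {"age": 41, "income": 60, "score": None},
    {"age": 37, "income": 55, "score": 9},
]

# Keep only the observations with a value on every variable
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

print(f"{len(complete_rows)} of {len(rows)} observations remain")
```

Only 3 of the 15 values are missing, yet 3 of the 5 observations are eliminated.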
Imputation
• Imputation describes the process of filling in the missing values of a variable.
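One simple form of imputation replaces each missing value with the mean of the observed values. The slides do not specify a method, so mean imputation is an assumption chosen for illustration, as are the function name `impute_mean` and the sample values.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing values
ages = [34, None, 29, 41, None, 36]
imputed = impute_mean(ages)
```

Unlike elimination, this keeps every observation, at the cost of treating the filled-in values as if they had been observed.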
Reference
http://en.wikipedia.org
http://www.wisegeek.com
http://wiki.answers.com
http://logic.stanford.edu/classes/cs246/lectures/lecture13.pdf
Thank you for your Precious TIME….