
DATA CLEANING
Presented to: Dr. Dibyojyoti Bhattacharjee
Presented by: Ali Akbar Mazumder (15), Mahbobul Hoque Barlaskar (17), Mritunjoy Deb (28)
DBA-SMS, Assam University

Content:
• Definition
• Need for Data Cleaning
• Data quality
• Major Heads
• Data consistency
• Treatment of Missing numbers
• Reference
11/17/2018 Data Cleaning

Definition: Data cleaning is a process used to identify inaccurate, incomplete, or unreasonable data and then to improve the quality of the data by correcting the detected errors and omissions.

Data cleaning, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate.

During data cleaning, records are checked for accuracy and consistency, and are either corrected or deleted as necessary. Data cleaning can occur within a single set of records, or between multiple sets of data that need to be merged or that will work together.

What is the need for data cleaning? The need for data cleaning centers on improving the quality of data to make them “fit for use” by users, by reducing errors in the data and improving their documentation and presentation.

Data is often of low quality, so the data collected should be subjected to thorough scrutiny to see whether they can be considered correct.
• Data collected may have missing information.
• Data collected may be contradictory.

Data cleaning is very important to the efficiency of any data-dependent business. If some of the clients within a database do not have accurate phone numbers, then the company cannot easily contact them.

If clients' email addresses are not formatted correctly, an automated email system would be unable to send out the latest coupons and special deals.
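As a minimal sketch of the kind of format check these two examples imply, the Python below flags client records whose phone number or email address is badly formatted. The field names, the 10-digit phone rule, and the deliberately simple email pattern are assumptions made for illustration, not rules taken from the slides.

```python
import re

# Assumed rules for illustration only: a 10-digit phone number and a
# deliberately simple "something@domain.tld" email pattern.
PHONE_RE = re.compile(r"\d{10}")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def find_contact_problems(client):
    """Return the names of the contact fields that fail the format checks."""
    problems = []
    if not PHONE_RE.fullmatch(client.get("phone", "")):
        problems.append("phone")
    if not EMAIL_RE.fullmatch(client.get("email", "")):
        problems.append("email")
    return problems


# Hypothetical client records.
clients = [
    {"name": "A", "phone": "9876543210", "email": "a@example.com"},
    {"name": "B", "phone": "98765", "email": "b.example.com"},
]

for client in clients:
    print(client["name"], find_contact_problems(client))
```

A check like this only catches badly formatted values. A well-formed phone number can still be the wrong number, so format checks complement verification rather than replace it.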

The job of data cleaning is to ensure that the data within a system is correct, so that the system is able to use the data. Inaccurate or incomplete records are of little use to anyone.

High-quality data needs to pass a set of quality criteria. These include:
• Accuracy: the data should be exact or precise.
• Integrity: the data should be reliable.
• Completeness: the data should be whole, with nothing missing.

• Validity: the data should be legal and valid.
• Consistency: the data should be stable, steady and regular.
• Uniformity: the data should be homogeneous.
• Density: the compactness and bulk of the data collected.
• Uniqueness: the number of duplicates in the data.
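Two of these criteria, completeness and uniqueness, are easy to measure. The sketch below shows one possible way to do so in Python. The record layout and the choice of metrics (fraction of non-empty values, count of exact duplicates) are assumptions for illustration.

```python
def completeness(records, field):
    """Fraction of records that have a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_count(records):
    """Number of records that exactly repeat an earlier record."""
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(sorted(r.items()))  # hashable form of the record
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Hypothetical records: one empty Pin, one missing Pin, one exact duplicate.
records = [
    {"name": "A", "pin": "788001"},
    {"name": "B", "pin": ""},
    {"name": "A", "pin": "788001"},
    {"name": "C", "pin": None},
]

print(completeness(records, "pin"))  # share of records with a Pin
print(duplicate_count(records))      # exact repeats of earlier records
```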

Five major heads of data cleaning
• Semi-structure
• Standardize
• Local consistency check
• Global consistency check
• Document

Example (adapted from "Dealing with Dirty Data" by Juthika Konwar, DBMS, September 2010)
Address entry from unstructured file:
Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach 788001

Semi-structuring

Also known as parsing:
Addressee First Name(1): Juthika
Addressee Last Name(1): Konwar
Addressee First Name(2): Debomalya
Addressee Last Name(2): Ghosh
Street Name: Hailakd Rd
Post Office Box Number: 1234
City: Sil
State: Cach
Pin Number: 788001
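The parsing step can be sketched with a regular expression. The pattern below is an assumption that fits only the single layout of this example (two addressees joined by "and", a street, "Box" plus a number, then city, state and Pin). Real address files would need more patterns or a dedicated parser.

```python
import re

# Assumed layout:
# "<first> <last> and <first> <last> <street> Box <number> <city> <state> <pin>"
ADDRESS_RE = re.compile(
    r"(?P<first1>\w+) (?P<last1>\w+) and "
    r"(?P<first2>\w+) (?P<last2>\w+) "
    r"(?P<street>.+?) Box (?P<box>\d+) "
    r"(?P<city>\w+) (?P<state>\w+) (?P<pin>\d+)"
)


def parse_address(line):
    """Split one unstructured address line into named fields, or return None."""
    match = ADDRESS_RE.fullmatch(line.strip())
    return match.groupdict() if match else None


raw = "Juthika Konwar and Debomalya Ghosh Hailakd Rd Box 1234 Sil Cach 788001"
parsed = parse_address(raw)
for field, value in parsed.items():
    print(f"{field}: {value}")
```

Returning None for lines that do not match keeps unparsed entries visible, so they can be reviewed by hand instead of being silently mangled.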

Standardize
We should replace synonyms with one standard term:
• Hailakd Rd -> Hailakandi Road
• Sil -> Silchar
• Cach -> Cachar
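A lookup table is one simple way to carry out this replacement. In the sketch below, the three mappings come from the slide. The record layout and the function name are assumptions for illustration.

```python
# Mappings taken from the example; a real table would be much larger.
STANDARD_TERMS = {
    "Hailakd Rd": "Hailakandi Road",
    "Sil": "Silchar",
    "Cach": "Cachar",
}


def standardize(record):
    """Replace each known abbreviation with its standard term."""
    return {field: STANDARD_TERMS.get(value, value) for field, value in record.items()}


record = {"street": "Hailakd Rd", "city": "Sil", "state": "Cach", "pin": "788001"}
print(standardize(record))
```

Values that are not in the table, such as the Pin Number, pass through unchanged.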

Local consistency check
Does each piece of data make sense on its own?
• Silchar and Pin Number 788001 are in Assam.
• State was listed as Cachar.
• Two of the three attributes (city and Pin Number) point to Assam as the correct state. Change state to Assam.
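The "two out of three" reasoning can be sketched as a majority vote. The two lookup tables below hold only the facts stated on the slide (Silchar and 788001 are in Assam). The function name and the rule of overriding only on a strict majority are assumptions for illustration.

```python
from collections import Counter

# Reference facts from the example; real tables would cover every city and Pin.
CITY_TO_STATE = {"Silchar": "Assam"}
PIN_TO_STATE = {"788001": "Assam"}


def resolve_state(record):
    """Pick the state that most attributes agree on, else keep the listed one."""
    votes = [
        CITY_TO_STATE.get(record["city"]),  # state implied by the city
        PIN_TO_STATE.get(record["pin"]),    # state implied by the Pin Number
        record["state"],                    # state as entered
    ]
    votes = [v for v in votes if v is not None]
    state, count = Counter(votes).most_common(1)[0]
    # Override the listed state only when a strict majority agrees.
    return state if count * 2 > len(votes) else record["state"]


record = {"city": "Silchar", "state": "Cachar", "pin": "788001"}
print(resolve_state(record))
```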

Global consistency check
Find the data given by Juthika Konwar and Debomalya Ghosh in administrative records and ensure that all of the elements of all the addresses are identical.
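One way to picture this check is a field-by-field comparison between the cleaned address and the copy held in administrative records. The sketch below assumes both are available as dictionaries. The administrative record shown is hypothetical, not data from the slides.

```python
def address_mismatches(address, admin_address):
    """Return {field: (our_value, admin_value)} for every field that differs."""
    return {
        field: (value, admin_address.get(field))
        for field, value in address.items()
        if value != admin_address.get(field)
    }


cleaned = {"street": "Hailakandi Road", "city": "Silchar", "state": "Assam", "pin": "788001"}
# Hypothetical administrative record with one differing element.
admin = {"street": "Hailakandi Road", "city": "Silchar", "state": "Assam", "pin": "788005"}

print(address_mismatches(cleaned, admin))
```

An empty result means the two sources agree on every element. Anything else needs to be resolved before the records are merged.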

Documenting
Document the results of semi-structuring, standardizing, and consistency checking in data
• Important for users of the integrated database
• Important for doing future updates of the

Data consistency
Consistency checks identify data that are:
• Out of range
• Logically inconsistent
• Extreme in value

Out of range
Suppose that, while entering the data, Juthika Konwar gave the Pin Number as 7880011. This is out of range: a Pin Number should contain only 6 digits, and this one contains 7, which is not acceptable.

Logically inconsistent
Logical inconsistency arises when the responses to two or more variables do not match. E.g., suppose that while entering the Pin Number Juthika typed letters along with the digits, as in Pin No. 7880AB, which is logically incorrect.
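Both Pin Number problems described above (7 digits instead of 6, and letters mixed with digits) can be caught by a small validation function. This is a minimal sketch. The function name and the wording of the messages are assumptions for illustration.

```python
def check_pin(pin):
    """Classify a Pin Number string as 'ok', 'not numeric', or 'wrong length'."""
    if not (pin.isascii() and pin.isdigit()):
        return "not numeric"   # e.g. letters mixed in with the digits
    if len(pin) != 6:
        return "wrong length"  # a Pin Number has exactly 6 digits
    return "ok"


for pin in ["788001", "7880011", "7880AB"]:
    print(pin, "->", check_pin(pin))
```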

Having extreme values
Extreme values are very large data values and should be handled carefully. An extreme value may seem incorrect, but it should be properly verified before the data are actually used.
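The slides do not say how extreme values should be found. One common convention, used here as an assumption, is the 1.5 x IQR rule: values far outside the interquartile range are flagged for verification rather than deleted. The sample data are hypothetical.

```python
from statistics import quantiles


def flag_extreme_values(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [12, 14, 15, 15, 16, 18, 95]  # hypothetical data with one suspicious value
flagged = flag_extreme_values(ages)
print("Verify before use:", flagged)
```

The function only reports candidates. Whether 95 is a typing error or a genuine observation still has to be checked against the source.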

Treatment of Missing numbers: Missing data is a common problem in research analysis. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation.
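These thresholds can be applied mechanically once the missing rate of a variable is known. The sketch below assumes missing values are stored as None. How the exact boundary values (1%, 5%, 15%) are assigned is also an assumption, since the slide does not say.

```python
def missing_rate(values):
    """Percentage of entries that are missing (represented here as None)."""
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


def classify_missing(rate_percent):
    """Map a missing-data rate to the categories given above."""
    if rate_percent < 1:
        return "trivial"
    if rate_percent <= 5:
        return "manageable"
    if rate_percent <= 15:
        return "requires sophisticated methods"
    return "may severely impact interpretation"


income = [1, None, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical variable, 1 of 10 missing
rate = missing_rate(income)
print(f"{rate:.1f}% missing: {classify_missing(rate)}")
```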

Missing data occurs in a data set when an observation is missing a value on a variable.

Treatment of Missing numbers

Elimination
• If a missing value occurs on any of the p variables, eliminate the entire observation.
• This is the default method for most procedures.
• Consequence: a data set with even a modest amount of missing values scattered throughout can result in a substantial reduction in sample size.
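Elimination of this kind is often called listwise deletion. The sketch below shows it on a small hypothetical data set in which missing values are stored as None. It also illustrates the consequence noted above: three scattered missing values remove three of the five observations.

```python
def eliminate_incomplete(rows):
    """Keep only the observations that have a value for every variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical data: 5 observations, 3 variables, 3 scattered missing values.
rows = [
    {"age": 25, "income": 300, "score": 7},
    {"age": None, "income": 250, "score": 6},
    {"age": 31, "income": None, "score": 8},
    {"age": 40, "income": 410, "score": None},
    {"age": 28, "income": 320, "score": 5},
]

complete = eliminate_incomplete(rows)
print(f"{len(rows)} observations before, {len(complete)} after elimination")
```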

Imputation
• Imputation describes the process of filling in the missing values of a variable.
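The slide does not name a particular imputation method. As one simple illustration, the sketch below uses mean substitution, which fills each missing value (stored as None) with the mean of the observed values. The choice of method and the sample data are assumptions.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


scores = [10, None, 14, 12]  # hypothetical variable with one missing value
print(impute_mean(scores))
```

Unlike elimination, imputation keeps every observation. Mean substitution does, however, understate the variability of the variable, which is one reason more sophisticated methods exist.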

Reference
• http://en.wikipedia.org
• http://www.wisegeek.com
• http://wiki.answers.com
• http://logic.stanford.edu/classes/cs246/lectures/lecture13.pdf

Thank you for your precious time.