Data Quality David Loshin Knowledge Integrity Inc.

Presentation transcript:

Data Quality David Loshin Knowledge Integrity Inc. www.knowledge-integrity.com Knowledge Integrity Incorporated

Course Structure: Overview of Data Quality (Data Ownership and Data Roles; Cost Analysis of Poor Data Quality); Dimensions of Data Quality (Data models, Data values, Presentation); Data Analysis Techniques; Data Analysis Tools

Course Structure (2): Data Quality Improvement; Metadata and Enterprise Reference Data; Domains and Mappings; Data Quality Rules Definition; Data Quality Rule Discovery

Course Structure (3) Data Profiling Using Data Quality Rules Data Transformation Data Cleansing Ongoing Validation

Course Structure (4) Data Correction Data Cleansing Scalability Issues Data Parsing Standardization Linkage Duplicate Elimination Approximate Searching Scalability Issues

Assignments: 4 assignments: “Handy Tools” for data analysis; Domain Analysis; Data Parsing; Data Linkage. Assignments are to be programmed using Perl or Java.

Some Examples: Frequent Flyer Miles and Long-Distance Service; Corporate Credit Card; Direct Marketing Event; CD Club Scam

What is Data? Working definitions: Data: arbitrary values (with their own representation); Information: data within a context; Knowledge: understanding of information within its context; Metadata: data about data

Data Contexts: Static flat file data; Static databases; Dynamic data flows; Message passing

Who Owns Data? An important question, because the answer indicates where responsibility for data quality lies. Data quality can be difficult to effect because of complicating notions. Data processing as an “information factory”. Actors in the information factory and their roles.

Actors and Their Roles: Supplier; Acquirer; Creator; Processor; Packager; Delivery Agent; Consumer; Middle Manager; Senior Manager; Decision-maker

Ownership Responsibilities: Definition of data; Authorization and security; User support; Data packaging and delivery; Maintenance; Data quality; Management of business rules; Management of metadata; Standards management; Supplier management

Ownership Paradigms: Creator; Consumer; Compiler; Enterprise; Funder; Decoder; Packager; Reader; Subject; Purchaser; Everyone

Complicating Notions: Ownership is affected by: The value of data; Privacy; Turf; Fear; Bureaucracy

The Data Ownership Policy: Order of enforcement; Identify stakeholders; Identify data sets; Allocation of ownership; Ownership roles and responsibilities; Dispute resolution

The Data Ownership Policy (2): Maintain a metadata database for data ownership: Parties table; Data set table; Roles and responsibilities; Policies (e.g., dispute resolution, communication)
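One way the ownership metadata database on this slide might be laid out is sketched below with Python's built-in sqlite3 module. The table names, column names, and sample rows are all assumptions made for illustration; the slide only names the four kinds of content (parties, data sets, roles and responsibilities, policies).

```python
import sqlite3

# In-memory database; the schema below is a hypothetical layout, not one
# prescribed by the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE data_set (
    data_set_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE ownership_role (
    party_id       INTEGER NOT NULL REFERENCES party(party_id),
    data_set_id    INTEGER NOT NULL REFERENCES data_set(data_set_id),
    role           TEXT NOT NULL,
    responsibility TEXT NOT NULL
);
CREATE TABLE policy (
    policy_id   INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL,
    description TEXT NOT NULL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO party VALUES (1, 'Customer Service')")
conn.execute("INSERT INTO data_set VALUES (1, 'customer_master')")
conn.execute(
    "INSERT INTO ownership_role VALUES (1, 1, 'Steward', 'Data quality')"
)


def owners_of(connection, data_set_name):
    """Return (party name, role) pairs recorded for the named data set."""
    return connection.execute(
        """
        SELECT p.name, r.role
        FROM ownership_role AS r
        JOIN party AS p ON p.party_id = r.party_id
        JOIN data_set AS d ON d.data_set_id = r.data_set_id
        WHERE d.name = ?
        ORDER BY p.name
        """,
        (data_set_name,),
    ).fetchall()
```

Keeping roles in their own table lets one party hold different roles on different data sets, which is what the ownership policy needs to record.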

Ownership Roles: CIO; CKO; Trustee; Policy Manager; Registrar; Steward; Custodian; Data Administrator; Security Administrator; Information Flow; Information Processing; Application Development; Data Provider; Data Consumer

Map the Flow of Information: Data processing can be likened to an “information factory”. Data sets from multiple sources are used as “raw input”. Final products are created in the form of business processes, information products, strategic reports, etc.

Stages in the Information Map: Data Supply; Data Acquisition; Data Creation; Data Processing; Data Packaging; Decision Making; Decision Implementation; Data Delivery; Data Consumption

Directed Information Channels: A channel indicates the flow of information from one processing stage to another. Example: supplier data is delivered to an acquisition stage through an information channel. “Directed” indicates the direction in which data flows. This effectively maps all points at which a data fault or nonconformance may appear.
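The information map described above can be sketched as a small directed graph. The stage names come from the slides, but the particular chain of channels below is a hypothetical example, as are the function names.

```python
# Hypothetical directed channels (source stage, destination stage).
channels = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]


def build_map(edges):
    """Build an adjacency list: each stage maps to the stages it feeds."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])  # make sure sink stages also appear
    return graph


def inspection_points(edges):
    """List every stage and every channel, since a data fault or
    nonconformance may appear at any of them."""
    stages = []
    for src, dst in edges:
        for stage in (src, dst):
            if stage not in stages:
                stages.append(stage)
    return stages + [f"{src} -> {dst}" for src, dst in edges]
```

Calling `inspection_points(channels)` yields the six stages followed by the five channels, which gives a checklist of places to measure data quality.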

Example: Credit Approval

Example: Hotel Reservation Process

Example: Catalog Sales

What is Data Quality? “Fitness for Use”. Different rules for different data sets. Includes: Data profiling; Domain and cross-attribute analysis; Discovery of business rules; Data cleansing; Standardization; Deduplication and merge-purge

Lather, Rinse, Repeat: Data quality is a process:
1. Assess the current state of the quality of data
2. Determine the area that needs most improvement
3. Determine success criteria
4. Implement the improvement
5. Measure against success threshold
6. If successful: go to 2

Data Quality is Hard to Do: No one wants to admit mistakes; Denial of responsibility; Lack of understanding; “Dirty work”; Lack of recognition

Steps to Data Quality: Training; Data ownership policy; Economic model of data quality; Current state assessment and requirements analysis; Project selection and implementation

Simple Tools: Goal: to look for simple patterns that indicate a problem that needs to be addressed. Grouping and Linking; Frequency Analysis; Pattern Analysis

Grouping: Try to make similar items gravitate together. Joining data instances based on business rules. Simple methods: Attribute selection; Sorting; Hashing
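A minimal sketch of the two grouping methods named on this slide, hashing and sorting, applied to one selected attribute. The sample records and field names are hypothetical; Python is used here for brevity, although the course assignments call for Perl or Java.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records; the field names are assumptions.
records = [
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "Bob Ray", "zip": "20002"},
    {"name": "Anne Lee", "zip": "10001"},
]


def group_by_hashing(rows, key):
    """Group rows on one selected attribute using a hash table (dict)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def group_by_sorting(rows, key):
    """Group rows by sorting on the attribute, then collecting adjacent runs."""
    ordered = sorted(rows, key=itemgetter(key))
    return {k: list(run) for k, run in groupby(ordered, key=itemgetter(key))}


hashed = group_by_hashing(records, "zip")
sorted_groups = group_by_sorting(records, "zip")
```

Both approaches put "Ann Lee" and "Anne Lee" in the same group for ZIP 10001, which is the kind of gravitation of similar items that flags a possible duplicate for review.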

Frequency Analysis: Look for insights in numbers. Simple methods: Counting; Hashing
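Counting with a hash table can be sketched in a few lines with `collections.Counter`. The sample column and the `max_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical column of state codes with a few suspicious entries.
states = ["NY", "NY", "CA", "N.Y.", "CA", "NY", "XX"]


def value_frequencies(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def rare_values(values, max_count=1):
    """Return values seen at most max_count times; these are candidates
    for misspellings or invalid codes, not proven errors."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n <= max_count)
```

Here `rare_values(states)` returns `["N.Y.", "XX"]`: one value is a variant spelling and the other is likely an invalid code, and both surface simply because they are infrequent.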

Pattern Analysis: Looking to distinguish between what is expected and what is not expected. Attempt to find outliers and nonconformities.
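One simple way to separate expected from unexpected values is to reduce each value to a character-class pattern and flag values whose pattern differs from the expected one. This is a sketch of that idea under stated assumptions: the pattern alphabet (letters become `A`, digits become `9`) and the sample phone numbers are choices made here, not taken from the slides.

```python
from collections import Counter


def shape(value):
    """Reduce a string to its pattern: letters -> 'A', digits -> '9',
    all other characters kept as they are."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def nonconforming(values, expected=None):
    """Return (expected pattern, values that do not match it).
    If no expected pattern is given, assume the most common one."""
    shapes = Counter(shape(v) for v in values)
    if expected is None:
        expected = shapes.most_common(1)[0][0]
    return expected, [v for v in values if shape(v) != expected]


# Hypothetical phone numbers; the last one contains a letter O, not a zero.
phones = ["301-555-0101", "301-555-0102", "3015550103", "301-555-01O4"]
```

With this data, `nonconforming(phones)` reports `999-999-9999` as the expected pattern and flags the unformatted number and the one containing a letter.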

Example