Data Quality Class 3. Goals Dimensions of Data Quality Enterprise Reference Data Data Parsing.

Slides:



Advertisements
Similar presentations
Organisation Of Data (1) Database Theory
Advertisements

MSc IT UFCE8K-15-M Data Management Prakash Chatterjee Room 2Q18
The Relational Model Much of the material presented in these slides was developed by Dr. Ramon Lawrence at the University of Iowa.
C6 Databases.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 The Relational Model Chapter 3.
Logical Database Design
By: Mr Hashem Alaidaros MIS 211 Lecture 4 Title: Data Base Management System.
Accounting System Design
Chapter 6 Methodology Logical Database Design for the Relational Model Transparencies © Pearson Education Limited 1995, 2005.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 3 The Basic (Flat) Relational Model.
Data Quality Class 4. Goals Discuss Project Midterm Statistical Process Control Data Quality Rules.
Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality Rules.
1 Relational Model. 2 Relational Database: Definitions  Relational database: a set of relations  Relation: made up of 2 parts: – Instance : a table,
Accounting Databases Chapter 2 The Crossroads of Accounting & IT
Data Quality Class 4. Goals Questions Review of SQL select Data Quality Rules.
Data Quality Class 8. Agenda Tests will be graded by next week Project Discovery –Domain discovery –Domain identification –Nulls –Clustering for rule.
FIS 431/631 Financial Information Systems: Analysis and Design ERD & Normalization Joe Callaghan Oakland University Department of Accounting & Finance.
Data Modeling 1 Yong Choi School of Business CSUB.
Chapter 14 & 15 Conceptual & Logical Database Design Methodology
Entity-Relationship Design
Chapter 3 Relational Model Chapter 4 in Textbook.
® IBM Software Group © IBM Corporation IBM Information Server Understand - Information Analyzer.
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management Dave Salisbury ( )
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
Software School of Hunan University Database Systems Design Part III Section 5 Design Methodology.
Concepts and Terminology Introduction to Database.
MIS 301 Information Systems in Organizations Dave Salisbury ( )
Lecture 7 Integrity & Veracity UFCE8K-15-M: Data Management.
Normalization (Codd, 1972) Practical Information For Real World Database Design.
Lecturer: Gareth Jones. How does a relational database organise data? What are the principles of a database management system? What are the principal.
Common Field Types Primary Key Descriptive Fields Foreign Key.
1 The Relational Model. 2 Why Study the Relational Model? v Most widely used model. – Vendors: IBM, Informix, Microsoft, Oracle, Sybase, etc. v “Legacy.
FALL 2004CENG 351 File Structures and Data Management1 Relational Model Chapter 3.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Relational Database. Database Management System (DBMS)
Slide Chapter 5 The Relational Data Model and Relational Database Constraints.
Data Quality Class 3. Goals Dimensions of Data Quality Data Extraction, Transformation, and Loading Data Cleansing Project.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Database Application Design and Data Integrity AIMS 3710 R. Nakatsu.
1 © Prentice Hall, 2002 Chapter 5: Logical Database Design and the Relational Model Modern Database Management 6 th Edition Jeffrey A. Hoffer, Mary B.
Handling Many to Many Relationships. 2 Handling Many:Many Relationships Aims: To explain why M:M relationships cannot be implemented in relational database.
1 Chapter 17 Methodology - Local Logical Database Design.
Relational Databases. Relational database  data stored in tables  must put data into the correct tables  define relationship between tables  primary.
CSE314 Database Systems Lecture 3 The Relational Data Model and Relational Database Constraints Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson.
Data Modelling and Cleaning CMPT 455/826 - Week 8, Day 2 Sept-Dec 2009 – w8d21.
Information Access Mgt09/12/971 Entity-Relationship Design Information Level Design.
Master Data versus Reference Data
The relational model A data model (in general) : Integrated collection of concepts for describing data (data requirements). Relational model was introduced.
0 / Database Management. 1 / Identify file maintenance techniques Discuss the terms character, field, record, and table Describe characteristics.
Mr C Johnston ICT Teacher
Copyright © 2009 Pearson Education, Inc. Publishing as Prentice Hall Chapter 9 Designing Databases 9.1.
Understand Primary, Foreign, and Composite Keys Database Administration Fundamentals LESSON 4.2.
CS34311 The Relational Model. cs34312 Why Relational Model? Currently the most widely used Vendors: Oracle, Microsoft, IBM Older models still used IBM’s.
Data Profiling 13 th Meeting Course Name: Business Intelligence Year: 2009.
© 2009 Pearson Education, Inc. Publishing as Prentice Hall 1 Chapter 5 (Part a): Logical Database Design and the Relational Model Modern Database Management.
Methodology - Logical Database Design. 2 Step 2 Build and Validate Local Logical Data Model To build a local logical data model from a local conceptual.
1 The Relational Data Model David J. Stucki. Relational Model Concepts 2 Fundamental concept: the relation  The Relational Model represents an entire.
1 CS122A: Introduction to Data Management Lecture #4 (E-R  Relational Translation) Instructor: Chen Li.
CENG 351 File Structures and Data Management1 Relational Model Chapter 3.
COP Introduction to Database Structures
Fundamentals of DBMS Notes-1.
Methodology Logical Database Design for the Relational Model
Chapter 5: The Art of Ensuring Integrity
CIS 207 The Relational Database Model
Lecture 2 The Relational Model
Data Modelling Introduction
Teaching slides Chapter 8.
CHAPTER 4: LOGICAL DATABASE DESIGN AND THE RELATIONAL MODEL
INSTRUCTOR: MRS T.G. ZHOU
Instructor Materials Chapter 5: Ensuring Integrity
Presentation transcript:

Data Quality Class 3

Goals Dimensions of Data Quality Enterprise Reference Data Data Parsing

Dimensions of Data Quality Poor data quality is similar to obscenity- – It seems as if there are no real ways to measure it, but you know it when you see it! In reality, data quality can be measured – The frame of reference for measurement is different

Dimensions of Data Quality 2 Data Models Data Values Data Presentation Data Policy

Example: Sales Database Sales and marketing database – Current customers – Sales leads Name, address, contact data For current customers, sales data

Data Quality of Data Models Clarity of definition Comprehensiveness Flexibility Robustness

Data Quality of Data Models 2 Essentialness Attribute granularity Precision of domains Homogeneity

Data Quality of Data Models 3 Naturalness Identifiability Obtainability Relevance

Data Quality of Data Models 4 Simplicity Semantic Consistency Structural Consistency

Data Quality of Data Values Accuracy Null values Completeness Consistency Currency

Accuracy Agreement with established sources Database of record Other corroborative sources

Null Values Null vs. Missing – Unavailable – Not applicable – No value – Not classified – Truly null

Completeness Mandatory attributes require values Optional attributes may hold values (when and how?) Inapplicable attributes may not have a value (also when and how?) Completeness constraints

Consistency Are values in one set consistent with values in another set? Consistency relations between attributes in the same table Consistency assertions across columns Consistency relationships between tables

Currency/Timeliness What data is current? How is it kept up-to-date? Time expectations for accessibility to data

Data Quality of Data Presentation Appropriateness Correct Interpretation Flexibility Format Precision

Data Quality of Data Presentation 2 Portability Representation Consistency Representation of Null Values

Data Quality of Data Policy Access Metadata Privacy Fault-tolerance Security

Reference Data Relatively static Referred to from within many tables Shared data Examples: – Product catalog – Security type classification – Currency codes

Reference Data 2 Metadata Data Domains Mappings between those domains

Domains A data domain is a subclassed data type Domains can be described using enumerations – Example: states, cities, product codes Domains can be described using functions – Example: formatted trouble ticket ids such as CC- NNNN

Domain Membership Data attributes can be affiliated with data domains Test a value to make sure it is valid within a domain – For enumerated domains, a lookup works – For generated/described domains, check to see if the value could be generated by function

Domains as Metadata The existence of a data domain is metadata Keep track of which tables have attributes making use of which domains Manage domains as enterprise reference data

Domain Tables Domains are sets of values Maintain in two tables: – Domain name table (domain_name, domain_id) – Domain value table (domain_id, value) Example: – (2, “STATES”) – {(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}

Mappings Map values in one domain to values in other domains 1-1, 1-Many, Many-1, Many-Many Represents relations from one set to another set 1-1, Many-1 are functions

Mappings as Metadata Semi-static mappings are also metadata Keep track of which tables refer to mappings Enterprise reference data

Mapping Tables Maintain mapping information in two tables: – Mapping name table (from_domain, to_domain, mapping_id, mapping name) – Mapping table (mapping_id, source_value, target_value)

Example: (“Currency”, “Country”, 65, “CurrencyToCountry”) (65, “USD”, “US)

Tables Inference of data fields from source data In our case, this is straightforward based on SGML tags in the data Determine table structure in context of use of metadata and reference data In other words, try to maintain normal form

Keys Primary key: must be unique across all values in table Primary key may be assigned based on internal increasing value, or maybe extant in the data Foreign key: relates values in one table to values in another table Foreign keys must exist in target table (=referential integrity)

Data Profiling and Parsing Goals: – Identify data domains – Identify mappings – Identify candidate keys

Data Column Profiling For each table – For each column Type inference Subclassed Type inference Count the number of distinct values Enumerate the distinct values Sort the distinct values Look for patterns of usage Document discoveries

Data Table Profiling For each Table – Identify candidate keys – Identify data domains – Identify data mappings – Look for patterns

Multiple Table Profiling Identify foreign key relationships Validate referential integrity Look for patterns

Data Parsing 2 kinds of data domains – Enumerated, consisting of a predefined set of values – Inferred/Functional, consisting of values that are validated based on a set of rules

Type Inference Data value sets all conform to a data type All values within a column must belong to the same type Propose data type by a series of inferences – Initial assumption is that all values belong to strictest type – Test for violations to type restriction – As a violation is discovered, loosen the restriction and test again for violations – Measure conformance at each level

Subclassed Type Inference Once type has been inferred, look for additional restrictions within type Example: – Integer vs. Integer within a range (0..100) For integers and real values, look for ranges For character strings, look for length, context, character patterns

String Parsing Classify characters into groups – Alphabetic – Numeric/Digits – Punctuation (this can be further refined) Transform all values within a column set into their corresponding pattern Example: would change to – DDDPDDPDDDD

String Parsing Given a set of representative strings, perform column analysis again Look for high frequency counts for specific patterns This will provide proposed functional domains We can characterize certain domains by format (e.g., telephone numbers, SSNs, Tax Ids, UPC codes, etc.)

Word Parsing More advanced technique to explore domain types Classify word tokens in terms of predefined characterization – Example: Name words, business words, transaction words, connector words, titles, etc.

Primary Key Discovery Iterative process to find candidate keys A candidate key is one or more attributes whose values, when composed, can uniquely locate a record

Primary Key Discovery 2 For I = 1, number of attributes – For each set S composed of I attributes do Are the composed values unique across the tables? If so, add to candidate key set

Next Assignment Type Inferencer – Propose data types for analyzed columns Format Parser – Transform strings into pattern strings – Propose format for a column – Catalog discovered named formats (telno, SSN, etc.) – More details on posting on web site