Data Modelling and Cleaning CMPT 455/826 - Week 8, Day 2 Sept-Dec 2009 – w8d21.

Conceptual Modelling Solutions for the Data Warehouse
Stefano Rizzi

Definition 1: Facts
A fact is a focus of interest for the decision-making process. Typically, it models a set of events occurring in the enterprise world. A fact is graphically represented by a box with two sections, one for the fact name and one for the measures.

Guideline 1: Facts
Concepts represented in the data source by frequently-updated archives are good candidates for facts. Concepts represented by almost-static archives are not good candidates for facts.

Definition 2: Measure
A measure is a numerical property of a fact, and describes one of its quantitative aspects of interest for analysis. Measures are included in the bottom section of the fact.

Definition 3: Dimension
A dimension is a fact property with a finite domain, and describes one of its analysis coordinates. The set of dimensions of a fact determines its finest representation granularity. Graphically, dimensions are represented as circles attached to the fact by straight lines.

Guideline 2: Dimensions
At least one of the dimensions of the fact should represent time, at any granularity.

Definition 4: Primary Event
A primary event is an occurrence of a fact, and is identified by a tuple of values, one value for each dimension. Each primary event is described by one value for each measure.
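
Definitions 1 to 4 can be made concrete with a minimal Python sketch. The SALE fact, its dimensions (date, product, store), its measures (quantity, receipts) and the `Fact` class itself are hypothetical names chosen for illustration; none of them come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """A fact: a name, its dimensions, its measures, and its primary events."""
    name: str
    dimensions: tuple
    measures: tuple
    # Maps one tuple of dimension values to one tuple of measure values.
    events: dict = field(default_factory=dict)

    def add_event(self, coordinates, values):
        # A primary event needs exactly one value per dimension...
        if len(coordinates) != len(self.dimensions):
            raise ValueError("one value is required for each dimension")
        # ...and is described by exactly one value per measure.
        if len(values) != len(self.measures):
            raise ValueError("one value is required for each measure")
        self.events[tuple(coordinates)] = tuple(values)


# Hypothetical SALE fact with a time dimension, as Guideline 2 recommends.
sale = Fact("SALE", ("date", "product", "store"), ("quantity", "receipts"))
sale.add_event(("2009-10-27", "milk", "Burnaby"), (10, 25.0))
sale.add_event(("2009-10-27", "bread", "Burnaby"), (4, 12.0))
```

Using the tuple of dimension values as the dictionary key mirrors the definition: two primary events of the same fact cannot share the same coordinates.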

Definition 5: Dimension Attributes
A dimension attribute is a property of a dimension, with a finite domain. Like dimensions, it is represented by a circle.

Definition 6: Hierarchy
A hierarchy is a directed tree, rooted in a dimension, whose nodes are all the dimension attributes that describe that dimension, and whose arcs model many-to-one associations between pairs of dimension attributes. Arcs are graphically represented by straight lines.
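
As a rough illustration of Definition 6, the sketch below stores a hierarchy as a set of arcs and rolls a value up along them. The attribute names (store, city, country), the sample values and the `roll_up` helper are assumptions made for this example; the hierarchy shown is the simplest possible tree, a single path.

```python
# Schema level: each arc is a many-to-one association from a finer attribute
# to a coarser one. "store" is the dimension in which the hierarchy is rooted.
ARCS = {"store": "city", "city": "country"}

# Instance level: one dict per arc. A dict is many-to-one by construction,
# since each key maps to exactly one value.
MAPPINGS = {
    ("store", "city"): {"S1": "Vancouver", "S2": "Vancouver", "S3": "Toronto"},
    ("city", "country"): {"Vancouver": "Canada", "Toronto": "Canada"},
}


def roll_up(value, source, target):
    """Follow the arcs from attribute `source` up to attribute `target`."""
    attribute = source
    while attribute != target:
        parent = ARCS.get(attribute)
        if parent is None:
            raise ValueError(f"{target!r} is not reachable from {source!r}")
        value = MAPPINGS[(attribute, parent)][value]
        attribute = parent
    return value
```

For example, `roll_up("S3", "store", "country")` follows two arcs and returns `"Canada"`.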

Definition 8: Descriptive Attribute
A descriptive attribute specifies a property of a dimension attribute, to which it is related by an x-to-one association. Descriptive attributes are not used for aggregation; they are always leaves of their hierarchy and are graphically represented by horizontal lines.

Definition 9: Cross-dimension Attributes
A cross-dimension attribute is an attribute (either dimension or descriptive) whose value is determined by the combination of two or more dimension attributes, possibly belonging to different hierarchies. It is denoted by connecting the arcs that determine it through a curved line.

Definition 10: Convergence
A convergence takes place when two dimension attributes within a hierarchy are connected by two or more alternative paths of many-to-one associations. Convergences are represented by letting two or more arcs converge on the same dimension attribute.

Definition 13: Ragged Hierarchy
A ragged (or incomplete) hierarchy is a hierarchy where, for some instances, the values of one or more attributes are missing (because they are undefined or unknown). A ragged hierarchy is graphically denoted by marking with a dash the attributes whose values may be missing.

Definition 14: Unbalanced Hierarchy
An unbalanced (or recursive) hierarchy is a hierarchy where, though inter-attribute relationships are consistent, the hierarchy instances may have different lengths. Graphically, it is represented by introducing a cycle within the hierarchy.
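
One way to picture an unbalanced hierarchy is a self-referencing association, sketched below in Python. The reporting structure, the employee names and the `path_to_root` helper are hypothetical and serve only to show instances of different lengths.

```python
# Hypothetical recursive hierarchy: each employee reports to at most one
# manager, so the schema has a single attribute with an arc back to itself.
reports_to = {"ann": None, "bob": "ann", "cat": "bob", "dan": "ann"}


def path_to_root(member):
    """Return the chain from `member` up to the top of the hierarchy."""
    path = [member]
    while reports_to[path[-1]] is not None:
        path.append(reports_to[path[-1]])
    return path


# Two instances of the same hierarchy with different lengths.
cat_path = path_to_root("cat")
dan_path = path_to_root("dan")
```

Here the path for "cat" has three levels while the path for "dan" has two, even though every arc is a consistent many-to-one association.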

Definition 15: Additive
A measure is said to be additive along a dimension if its values can be aggregated along the corresponding hierarchy by the sum operator; otherwise it is called nonadditive. A nonadditive measure is nonaggregable if no other aggregation operator can be used on it.
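
The distinction in Definition 15 can be illustrated with a small Python sketch. The measures `quantity_sold` and `stock_level`, the sample rows and the `aggregate` helper are assumptions made for this example. Quantity sold can be summed from days to months, whereas summing daily stock levels over a month gives a meaningless figure, so a different operator (here, the average) is used instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical primary events, already tagged with the month of each day.
events = [
    {"day": "2009-10-01", "month": "2009-10", "quantity_sold": 5, "stock_level": 100},
    {"day": "2009-10-02", "month": "2009-10", "quantity_sold": 7, "stock_level": 93},
    {"day": "2009-11-01", "month": "2009-11", "quantity_sold": 2, "stock_level": 91},
]


def aggregate(rows, group_by, measure, operator):
    """Group rows by one attribute and apply an aggregation operator."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[measure])
    return {key: operator(values) for key, values in groups.items()}


# Additive along time: the sum operator is meaningful.
monthly_sales = aggregate(events, "month", "quantity_sold", sum)

# Nonadditive along time, but still aggregable with another operator.
monthly_stock = aggregate(events, "month", "stock_level", mean)
```

A measure for which no operator at all makes sense along a hierarchy would be nonaggregable in the sense of the definition.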

Open Issues
- Lack of a standard for conceptual models
- Need for design patterns to support modelling
- Need for a method to model security issues

Data Cleaning (based on Rahm)

Single source problems
Lack of appropriate model-specific integrity constraints can lead to:
- Attribute problems: illegal values
- Record problems: uniqueness violations
- Relationship problems: referential integrity violations
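
The three problem classes above correspond to constraints that a relational schema can declare. The following sketch uses Python's built-in `sqlite3` module to show each kind of bad row being rejected once the constraint exists. The `department` and `employee` tables, their columns and the `rejected` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        ssn     TEXT NOT NULL UNIQUE,                    -- record level
        age     INTEGER CHECK (age BETWEEN 16 AND 120),  -- attribute level
        dept_id INTEGER REFERENCES department(dept_id)   -- relationship level
    );
    INSERT INTO department VALUES (1);
    INSERT INTO employee VALUES (1, '123-45-6789', 30, 1);
""")


def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "illegal value": rejected(
        "INSERT INTO employee VALUES (2, '222-22-2222', 300, 1)"),
    "uniqueness violation": rejected(
        "INSERT INTO employee VALUES (3, '123-45-6789', 40, 1)"),
    "referential integrity violation": rejected(
        "INSERT INTO employee VALUES (4, '444-44-4444', 40, 99)"),
}
```

Without the `CHECK`, `UNIQUE` and `REFERENCES` clauses, all three rows would be stored and would have to be found later by cleaning.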

Single source problems
Lack of appropriate application-specific integrity constraints can lead to:
- Attribute problems: missing values, misspellings, cryptic abbreviations, embedded values, misfielded values
- Record problems: violated attribute dependencies, word transpositions, duplicated records, contradicting records
- Relationship problems: wrong references

Multi-source Problems
In addition to single source problems, there can be:
- overlapping or contradicting data
- schema naming and structural conflicts
- different data types / granularities / interpretations / points in time

Data Analysis for cleaning
Using metadata for data profiling
- focuses on the instance analysis of individual attributes
- derives information such as the data type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of null values, and typical string pattern (e.g., for phone numbers)
- provides an exact view of various quality aspects of the attribute
Data mining
- helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes
- focuses on so-called descriptive data mining models, including clustering, summarization, association discovery and sequence discovery
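
A minimal sketch of attribute-level data profiling in Python follows. It covers only part of the information listed above (type, length, null values, distinct values and their frequency, uniqueness, string pattern) and leaves out value range and variance. The function names, the pattern alphabet (9 for digits, A for letters) and the sample phone numbers are assumptions made for this example.

```python
import re
from collections import Counter


def string_pattern(value):
    """Abstract a string: digits become 9, letters become A, the rest is kept."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile(values):
    """Profile one attribute; None stands for a null value."""
    present = [v for v in values if v is not None]
    frequencies = Counter(present)
    lengths = [len(str(v)) for v in present]
    return {
        "types": sorted({type(v).__name__ for v in present}),
        "null_count": len(values) - len(present),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "distinct_count": len(frequencies),
        "is_unique": len(frequencies) == len(present),
        "frequencies": frequencies,
        "patterns": Counter(string_pattern(str(v)) for v in present),
    }


# Hypothetical phone number column with a null, a duplicate and a format variant.
phones = ["604-555-0101", "604-555-0102", "6045550103", None, "604-555-0101"]
report = profile(phones)
```

In this report, the minority pattern `9999999999` stands out against the dominant `999-999-9999`, which is the kind of irregularity profiling is meant to surface.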

Data transformations
Can be done via SQL operations
- which allows tracking of all transformations
- can include:
  - extracting values from free-form attributes (attribute split)
  - validation and correction
  - standardization
  - duplicate elimination
May require considerable human involvement
- some transformations will be more complex than others
- some transformations will apply to more or less data
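
The transformation steps listed above can be sketched as a single SQL statement, run here through Python's built-in `sqlite3` module. The `customer_raw` and `customer_clean` tables, their columns and the sample rows are hypothetical, and the rules used (split on the first comma, upper-case everything) are deliberately simplistic assumptions rather than a recommended cleaning policy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_raw (name TEXT, city TEXT);
    INSERT INTO customer_raw VALUES
        ('Smith, John',  'vancouver'),
        ('SMITH, JOHN ', 'Vancouver'),
        ('Doe, Jane',    'BURNABY');

    -- DISTINCT performs the duplicate elimination after cleaning.
    CREATE TABLE customer_clean AS
    SELECT DISTINCT
        -- attribute split: the text before the comma is the last name
        upper(trim(substr(name, 1, instr(name, ',') - 1))) AS last_name,
        -- attribute split: the text after the comma is the first name
        upper(trim(substr(name, instr(name, ',') + 1)))    AS first_name,
        -- standardization: one letter case, no surrounding spaces
        upper(trim(city))                                  AS city
    FROM customer_raw
    -- validation: skip names that do not contain a comma
    WHERE instr(name, ',') > 0;
""")

rows = conn.execute(
    "SELECT last_name, first_name, city FROM customer_clean ORDER BY last_name"
).fetchall()
```

Because the whole transformation is one SQL script, it can be stored and rerun, which is what makes the transformations trackable. The two spellings of John Smith only collapse into one row because standardization happens before `DISTINCT` is applied.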