Data Integration www.ePowerPoint.com. Data Integration Data integration: Combines data from multiple sources into a coherent store Challenges: Entity.

Slides:



Advertisements
Similar presentations
UNIT – 1 Data Preprocessing
Advertisements

UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
Noise & Data Reduction. Paired Sample t Test Data Transformation - Overview From Covariance Matrix to PCA and Dimension Reduction Fourier Analysis - Spectrum.
1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Sociology 601 Class 17: October 28, 2009 Review (linear regression) –new terms and concepts –assumptions –reading regression computer outputs Correlation.

Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
6/10/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data.
Lecture 3: Chi-Sqaure, correlation and your dissertation proposal Non-parametric data: the Chi-Square test Statistical correlation and regression: parametric.
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Descriptive Data Summarization (Understanding Data)
Data Preprocessing.
Social Research Methods
Chapter 4 Relational Databases Copyright © 2012 Pearson Education, Inc. publishing as Prentice Hall 4-1.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 3 The Relational Database Model.
Brown, Suter, and Churchill Basic Marketing Research (8 th Edition) © 2014 CENGAGE Learning Basic Marketing Research Customer Insights and Managerial Action.
Data Mining: Concepts and Techniques — Chapter 2 —
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 1 Data Preprocessing
CS2032 DATA WAREHOUSING AND DATA MINING
Chapter 4 Relational Databases Copyright © 2012 Pearson Education 4-1.
File and Database Design; Logic Modeling Class 24.
Review Guess the correlation. A.-2.0 B.-0.9 C.-0.1 D.0.1 E.0.9.
1 of 27 PSYC 4310/6310 Advanced Experimental Methods and Statistics © 2013, Michael Kalsher Michael J. Kalsher Department of Cognitive Science Adv. Experimental.
This Week: Testing relationships between two metric variables: Correlation Testing relationships between two nominal variables: Chi-Squared.
Fundamentals of Statistical Analysis DR. SUREJ P JOHN.
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
September 5, 2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning.
Lecture 3: Data Preprocessing Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
2015年9月7日星期一 2015年9月7日星期一 2015年9月7日星期一 Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han Department of.
Data Preprocessing Pertemuan 03 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb
Data Warehousing 資料倉儲 Data Preprocessing: Integration and the ETL process 1001DW03 MI4 Tue. 6,7 (13:10-15:00) B427 Min-Yuh Day 戴敏育 Assistant Professor.
11/7/2012ISC471 - HCI571 Isabelle Bichindaritz 1 Data Preprocessing.
Agenda Review Association for Nominal/Ordinal Data –  2 Based Measures, PRE measures Introduce Association Measures for I-R data –Regression, Pearson’s.
CORE 2: Information systems and Databases NORMALISING DATABASES.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
Data Warehousing 資料倉儲 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University Dept. of Information ManagementTamkang.
Introduction to Correlation Analysis. Objectives Correlation Types of Correlation Karl Pearson’s coefficient of correlation Correlation in case of bivariate.
Slide 26-1 Copyright © 2004 Pearson Education, Inc.
Essential Statistics Chapter 161 Review Part III_A_Chi Z-procedure Vs t-procedure.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.
Database Systems, 9th Edition 1.  In this chapter, students will learn: That the relational database model offers a logical view of data About the relational.
Two Variable Statistics Introduction To Chi-Square Test for Independence.
Stats Probability Theory Summary. The sample Space, S The sample space, S, for a random phenomena is the set of all possible outcomes.
Correlation. Correlation is a measure of the strength of the relation between two or more variables. Any correlation coefficient has two parts – Valence:
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
CTFS Workshop Shameema Esufali Asian data coordinator and technical resource for the network
Chapter 14 Chi-Square Tests.  Hypothesis testing procedures for nominal variables (whose values are categories)  Focus on the number of people in different.
2016年1月17日星期日 2016年1月17日星期日 2016年1月17日星期日 Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
3 1 Database Systems The Relational Database Model.
 Data integration: ◦ Combines data from multiple sources into a coherent store  Schema integration: e.g., A.cust-id  B.cust-# ◦ Integrate metadata from.
Motivation Data in the real world is dirty
Chapter 2 Copyright © 2015, 2011, 2007 Pearson Education, Inc. Chapter 2-1 Solving Linear Equations and Inequalities.
Waqas Haider Bangyal. Classification Vs Clustering In general, in classification you have a set of predefined classes and want to know which class a new.
June 26, C.Eng 714 Spring  Data in the real world is dirty  incomplete: lacking attribute values, lacking certain attributes of interest,
Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.
Noisy Data Noise: random error or variance in a measured variable.
UNIT-2 Data Preprocessing
Chapter 4 Relational Databases
Social Research Methods
Understanding Data Characteristics
Understanding Basic Characteristics of Data
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Data Preprocessing Modified from
Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel
Presentation transcript:

Data Integration

Data Integration Data integration: Combines data from multiple sources into a coherent store Challenges: Entity identification problem How to Match schema and objects from different source? Attributes Redundancy Are the attributes correlation ? Tuple duplication : Redundant data occur often when integration of multiple databases Data value conflicts How to Detecting and resolving

3 Entity identification problem Object identification: The same attribute or object may have different names in different databases e.g., A.cust-id  B.cust-# e.g., Bill Clinton = William Clinton Schema integration and Object matching are tricky. Integrate metadata from different sources: Identify real world entities from multiple data sources, Detecting and resolving tuple duplicate and data value conflicts

Handling attribute Redundancy Attributes Redundancy Inconsistencies in attribute or dimension naming An attribute can be derived from another attributes- redundancy. e.g., annual revenue Redundant attributes may be able to be detected by correlation analysis for numerical attributes, use correlation coefficient for categorical attributes, use Χ 2 (chi-square) test

Correlation Analysis (Numerical Data) Correlation coefficient (also called Pearson’s product moment coefficient, 皮尔逊积矩系数 ) where n is the number of tuples, and are the respective means of A and B, σ A and σ B are the respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross- product. If r A,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation. r A,B = 0: independent; r A,B < 0: negatively correlated

Correlation Analysis (Categorical Data) Χ 2 (chi-square) test The larger the Χ 2 value, the more likely the variables are related The cells that contribute the most to the Χ 2 value are those whose actual count is very different from the expected count Correlation does not imply causality # of hospitals and # of car-theft in a city are correlated Both are causally linked to the third variable: population

Χ 2 ( Chi-Square ) Calculation: An Example Χ 2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) It shows that like_science_fiction and play_chess are correlated in the group Play chessNot play chessSum (row) Like science fiction250(90)200(360)450 Not like science fiction50(210)1000(840)1050 Sum(col.)

Handling Redundancy in Data Integration Redundant data occur often when integration of multiple databases Object identification: The same attribute or object may have different names in different databases Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

resolving data value conflicts Detecting and resolving data value conflicts For the same real world entity, attribute values from different sources are different Possible reasons: different representations, different scales, e.g., metric vs. British units Different abstract level