Data Integration


• Data integration:
  ◦ Combines data from multiple sources into a coherent store
• Schema integration, e.g., A.cust-id ≡ B.cust-#
  ◦ Integrate metadata from different sources
• Entity identification problem:
  ◦ Identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  ◦ For the same real-world entity, attribute values from different sources may differ
  ◦ Possible reasons: different representations, different scales
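The three integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the unified field names (`customer_id`, `balance_usd`), and the cents-versus-dollars scale conflict are all hypothetical and do not come from the slides.

```python
# Hypothetical records from two sources. Source A keys customers by "cust-id"
# and stores balances in dollars; source B uses "cust-#" and stores cents.
source_a = [{"cust-id": 101, "name": "Bill Clinton", "balance_usd": 1250.0}]
source_b = [{"cust-#": 101, "name": "William Clinton", "balance_cents": 125000}]


def normalize_a(rec):
    """Schema integration: map source A's fields onto the unified schema."""
    return {
        "customer_id": rec["cust-id"],
        "name": rec["name"],
        "balance_usd": rec["balance_usd"],
    }


def normalize_b(rec):
    """Schema integration plus value-conflict resolution (cents -> dollars)."""
    return {
        "customer_id": rec["cust-#"],
        "name": rec["name"],
        "balance_usd": rec["balance_cents"] / 100,
    }


def integrate(a_records, b_records):
    """Group normalized records by the shared key (entity identification)."""
    merged = {}
    normalized = [normalize_a(r) for r in a_records]
    normalized += [normalize_b(r) for r in b_records]
    for rec in normalized:
        merged.setdefault(rec["customer_id"], []).append(rec)
    return merged


integrated = integrate(source_a, source_b)
# Both records for customer 101 now share one key and one balance scale,
# even though the name strings still differ.
```

Matching on a shared key is the simplest case. When no common key exists, entity identification usually needs fuzzy matching on names or other attributes, which this sketch does not attempt.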

• Redundant data often occur when multiple databases are integrated
  ◦ Object identification: the same attribute or object may have different names in different databases
  ◦ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality

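• Correlation coefficient (also called Pearson’s product-moment coefficient):

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

  where n is the number of records, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective (sample) standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: A and B are uncorrelated (no linear relationship, which by itself does not guarantee independence); $r_{A,B} < 0$: negatively correlated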
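As a rough illustration of using the correlation coefficient to spot a redundant attribute, the sketch below computes Pearson's r as the cross-product sum minus n times the product of the means, divided by (n − 1) times the two sample standard deviations. The function name `pearson_r` and the revenue figures are hypothetical. They stand in for a derived attribute such as annual revenue.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross_product = sum(x * y for x, y in zip(a, b))  # the sum of AB
    return (cross_product - n * mean_a * mean_b) / ((n - 1) * std_a * std_b)


# Hypothetical attributes: annual revenue is derived as 12 x monthly revenue,
# so one of the two is redundant.
monthly_revenue = [10.0, 12.0, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson_r(monthly_revenue, annual_revenue)
```

Here `r` comes out at essentially 1, which flags the pair as candidates for removal. In practice a threshold on |r| would have to be chosen for the data at hand.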

• χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  ◦ The number of hospitals and the number of car thefts in a city are correlated
  ◦ Both are causally linked to a third variable: population

• χ² (chi-square) calculation: an example (numbers in parentheses are expected counts, calculated from the data distribution in the two categories)

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

• It shows that like_science_fiction and play_chess are correlated in the group
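The calculation for the contingency table above can be sketched in Python as follows. The function name `chi_square` is an illustrative choice. Each expected count is taken as (row sum × column sum) / grand total, which reproduces the parenthesized values in the table.

```python
def chi_square(observed):
    """Return (statistic, expected counts) for a table of observed counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Expected count for each cell under the assumption of independence.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
statistic, expected = chi_square(observed)
# expected is approximately [[90, 360], [210, 840]]
# statistic is approximately 507.94
```

A 2 × 2 table has (2 − 1)(2 − 1) = 1 degree of freedom, for which the critical value at the 0.001 significance level is 10.828. The statistic is far above that, so the hypothesis that the two attributes are independent is rejected for this group.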