

Data Mining Lecture 4

Course Syllabus
Course topics:
Data Management and Data Collection Techniques for Data Mining Applications (Week 3 to Week 4)
–Data Warehouses: gathering raw data from relational databases and transforming it into information
–Information Extraction and Data Processing Techniques
–Data Marts: the need to build highly specialized data stores for data mining applications
Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)

Data Pre-processing: Information Extraction and Data Processing Techniques
Why should we do pre-processing?
Pre-processing takes 80% of the time spent on a data mining project.
Real-world data is not perfect (it is dirty):
–missing values (no data entered)
»e.g. 35% of the Education field is incomplete
»e.g. 20% of the Birth Date field is incomplete
»e.g. 45% of the Work Title field is incomplete
»e.g. 60% of the Income field is incomplete

Data Pre-processing: Information Extraction and Data Processing Techniques
–erroneous (noisy) values
»e.g. Birth Date > current date or Birth Date < 1850 (approx. 10% of the data)
»e.g. the permissible values of the Education field are (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate), but the values X, Q, Y, T may be seen (approx. 10% of the data)
»e.g. the Income field is negative (approx. 15% of the data)
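
The checks on this slide (birth date range, permissible education codes, non-negative income) can be sketched as a small validator. This is only an illustration under assumptions: the record layout (a dict with the keys `birth_date`, `education`, `income`) and the function name `find_errors` are hypothetical and do not come from the lecture.

```python
from datetime import date

# Permissible Education codes listed on the slide.
EDUCATION_CODES = {"C", "U", "H", "D", "M", "S", "P", "I"}

def find_errors(record, today=None):
    """Return the names of the fields that fail the slide's checks.

    `record` is a hypothetical dict layout; missing (None) fields are
    skipped here because they are a separate problem (missing values).
    """
    today = today or date.today()
    errors = []

    birth = record.get("birth_date")
    if birth is not None and (birth > today or birth.year < 1850):
        errors.append("birth_date")

    education = record.get("education")
    if education is not None and education not in EDUCATION_CODES:
        errors.append("education")

    income = record.get("income")
    if income is not None and income < 0:
        errors.append("income")

    return errors
```

Passing `today` explicitly keeps the check reproducible; by default the current date is used.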

Data Pre-processing: Information Extraction and Data Processing Techniques
–inconsistent values: discrepancies in codes or names
»e.g. Birth Date = '01/01/1955' vs. 54 (same information in different forms)
»e.g. the Education field coded with letters (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate) vs. with numbers (5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate)
»e.g. the Income field continuous (3200 K) or interval based ( K)
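
The two codings of the Education field can be reconciled with a lookup table. The sketch below maps the numeric coding onto the letter coding, using only the code pairs given on the slide; the function name `unify_education` and the choice of the letter coding as the target are assumptions made for illustration.

```python
# Numeric coding -> letter coding, paired through the labels on the slide
# (e.g. 5 and C both stand for "college").
NUMERIC_TO_LETTER = {1: "D", 2: "M", 3: "U", 4: "H", 5: "C", 6: "S", 7: "P", 8: "I"}

def unify_education(value):
    """Map either coding of the Education field to the letter coding.

    Returns None when the value belongs to neither coding.
    """
    if isinstance(value, str):
        text = value.strip().upper()
        if text in NUMERIC_TO_LETTER.values():
            return text                 # already a valid letter code
        if not text.isdecimal():
            return None                 # e.g. "X", "Q": not a known code
        value = int(text)               # numeric code stored as text
    return NUMERIC_TO_LETTER.get(value)
```

Values that map to `None` are candidates for the erroneous-value handling rather than for automatic conversion.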

Data Pre-processing: Information Extraction and Data Processing Techniques
–Where dirtiness may come from: the reasons for missing values
»different considerations in coding and analysis (discrepancies over time)
»hardware/software problems
»different sources not aligned with the same data dictionary
[Figure: Source 1, Source 2, Source 3 and a fourth, unlabeled field set, each listing Field 1, Field 2, Field 3]

Data Pre-processing: Information Extraction and Data Processing Techniques
–Where dirtiness may come from: the reasons for erroneous values
»People give incomplete, approximately correct information

Field (original label)        Record 1                          Record 2
Name (AD)                     Metin Ü.                          M.Ulku
Birthplace (DOĞUM YERİ)       GAZİANTEP                         G.ANTEB
Birth date (DOĞUM TARİHİ)     (blank)                           10/04/1965
Surname (SOYAD)               SANRE                             SANER
Address (ADRESİ)              Atatrk Cad. Kemaliye Mah. 25/3    Atatürk Cd. Kemaliye Sok. No.25
Job title (ÇALIŞMA ÜNVANI)    Genel Müdür                       Gen. Müdr.
Workplace (ÇALIŞMA YERİ)      Devlet Su İşleri A.O              G.Antep D.S.İ.

Data Pre-processing: Information Extraction and Data Processing Techniques
–Where dirtiness may come from: the reasons for erroneous values
»People give incomplete, approximately correct information, e.g. addresses written in different ways:
Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst
Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL
Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha.
İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok.

Data Pre-processing: Information Extraction and Data Processing Techniques
–Where dirtiness may come from: the reasons for erroneous values
»insufficient or incapable data collection instruments
»partial matching, fuzzy understanding, syntactic-semantic enrichment
»a continuous flow of data may cause data entry faults
»errors or disruptions in data transmission

Data Pre-processing: Information Extraction and Data Processing Techniques
–Where dirtiness may come from: the reasons for inconsistent values
»insufficient lookup mappings
»incapable transformation infrastructures
»different data sources
Inconsistency is hard to prevent: it needs a highly specialized synchronization and automation infrastructure.
We should also take care of duplicate data (redundancy).

Data Pre-processing: Information Extraction and Data Processing Techniques
–Why pre-processing is so important
»Data quality brings successful data mining
»It is the only way to extract information from data
Major tasks in Data Pre-processing:
–Data cleaning
–Data integration
–Data transformation
–Data reduction
–Data discretization

Data Pre-processing: Information Extraction and Data Processing Techniques
Major tasks in Data Cleaning:
–Fill in missing values
–Identify outliers and smooth out noisy data
–Correct inconsistent data
–Resolve redundancy caused by data integration

Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle missing data
–simply do not accept it
–fill it in manually
–fill it in automatically with:
»a global constant, e.g. “unknown” (a new class?!)
»the attribute mean
»the attribute mean for all samples belonging to the same class (smarter)
»the most probable value: inference-based, e.g. a Bayesian formula or a decision tree
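
As an illustration of the "attribute mean for all samples belonging to the same class" option, here is a minimal sketch. It assumes rows are dicts and that a missing value is stored as `None`; the function name `fill_with_class_mean` and the fallback to the overall attribute mean (for a class with no known values) are choices made for this example, not something the lecture prescribes.

```python
from collections import defaultdict

def fill_with_class_mean(rows, attr, class_attr):
    """Return copies of `rows` where a missing `attr` (None) is replaced
    by the mean of `attr` over rows of the same class.

    Falls back to the overall attribute mean when a class has no known
    values at all.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            sums[row[class_attr]] += row[attr]
            counts[row[class_attr]] += 1

    known = sum(counts.values())
    overall_mean = sum(sums.values()) / known if known else None

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[attr] is None:
            cls = new_row[class_attr]
            if counts[cls]:
                new_row[attr] = sums[cls] / counts[cls]
            else:
                new_row[attr] = overall_mean
        filled.append(new_row)
    return filled
```

For example, with income as the attribute and education as the class, a missing income of a university graduate is filled with the mean income of the other university graduates.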

Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle noisy data
–Binning (discretization) method:
»first sort the data and partition it into (equi-depth) bins
»then smooth by bin means, bin medians, bin boundaries, etc.
»use the data distribution and domain knowledge
–Clustering
»detect and remove outliers
–Combined computer and human inspection
»detect suspicious values and have a human check them (e.g., to deal with possible outliers)
–Regression
»smooth by fitting the data to regression functions
–Model the data and infer the most probable values (difficult)

Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
Equal-width (distance) partitioning:
–divides the range into N intervals of equal size: a uniform grid
–if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B – A)/N
–the most straightforward method, but outliers may dominate the presentation
–skewed data is not handled well
Equal-depth (frequency) partitioning:
–divides the range into N intervals, each containing approximately the same number of samples
–good data scaling
–managing categorical attributes can be tricky
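
The two partitioning schemes can be compared with a short sketch. The function names are hypothetical, and two details are assumptions of this example rather than part of the lecture: the highest value is placed in the last equal-width interval, and when the number of samples is not divisible by N the first bins receive one extra sample.

```python
def equal_width_bins(values, n):
    """Partition `values` into n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        if width > 0:
            # The highest value B sits on the last edge; clamp it into the last bin.
            idx = min(int((v - a) / width), n - 1)
        else:
            idx = 0  # all values are identical
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted `values` into n bins with (almost) equal sample counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins
```

On skewed data the equal-width bins end up with very different sample counts, while the equal-depth bins stay balanced, which is the trade-off described above.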

Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
Sorted data (e.g., by price): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
–Bin 1: 4, 8, 9, 15
–Bin 2: 21, 21, 24, 25
–Bin 3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
–Bin 1: 9, 9, 9, 9
–Bin 2: 23, 23, 23, 23
–Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value moves to the closer boundary):
–Bin 1: 4, 4, 4, 15
–Bin 2: 21, 21, 25, 25
–Bin 3: 26, 26, 26, 34
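
The two smoothing rules in this example can be reproduced with a few lines of code. This is a sketch with hypothetical function names; rounding the bin mean to the nearest integer matches the numbers on the slide, and sending ties to the lower boundary is an assumption (no tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's lowest and highest value.

    A value exactly halfway between the two goes to the lower boundary
    (an assumption of this sketch).
    """
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# The equi-depth bins from the slide.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75 and 29.25, which is why the slide shows 9, 23 and 29.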

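Data Pre-processing: Information Extraction and Data Processing Techniques
Regression
[Figure: plot of y against x with the regression line y = x + 1; at X1, the observed value Y1 is smoothed to the value Y1' on the line]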
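
Smoothing by regression replaces each observed value with the value predicted by a fitted function. The sketch below assumes a simple linear model fitted by ordinary least squares; the lecture does not say which fitting method it uses, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace every observed y by the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]
```

With data lying exactly on y = x + 1 the fit recovers slope 1 and intercept 1; with noisy data each point is pulled onto the fitted line.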

Data Pre-processing: Information Extraction and Data Processing Techniques Clustering

Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle inconsistent data
–systematic conversion, “transformation”
–dynamic and interactive control mechanisms
–redundancy detection and intelligent mapping

Data Pre-processing: Information Extraction and Data Processing Techniques
Transformation
–Smoothing: remove noise from the data
–Aggregation: summarization, data cube construction
–Generalization: concept hierarchy climbing
–Normalization: values scaled to fall within a small, specified range
»min-max normalization
»z-score normalization
»normalization by decimal scaling
–Attribute/feature construction: new attributes constructed from the given ones

Remember Stats Facts
Min:
–What is the big-O cost of finding the min of an n-sized list?
Max:
–What is the minimum number of comparisons needed to find the max of an n-sized list?
Range:
–What about finding the min and max simultaneously?
Value Types:
–Cardinal value: how many (counting numbers)
–Nominal value: names and identifies something
–Ordinal value: the order of things (rank, position)
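
As a hint for the questions above: a single scan finds the min (or the max) of an unsorted list in O(n) time with n - 1 comparisons, and processing the elements in pairs finds both at once with about 3n/2 comparisons instead of 2n - 2. The sketch below counts the comparisons it makes; the function name `min_max` is hypothetical.

```python
def min_max(values):
    """Return (minimum, maximum, comparisons) using pairwise comparison."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")

    comparisons = 0
    if n % 2:
        # Odd length: seed both results with the first element.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: seed with the first pair (one comparison).
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2

    # Each remaining pair costs three comparisons: one inside the pair,
    # one against the current minimum, one against the current maximum.
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        comparisons += 3
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons
```

The range of the attribute is then simply the maximum minus the minimum.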

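Transformation
Min-max normalization to [new_min_A, new_max_A]:
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
–Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu}{\sigma} \]
–Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling:
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1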
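
The three normalizations named on this slide can be sketched as follows, using their standard textbook definitions (assumed here). The function names are hypothetical, and `decimal_scale` restricts j to non-negative integers, which is an assumption of this example.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

def decimal_scale(values):
    """Divide every value by 10**j, where j is the smallest non-negative
    integer such that the largest absolute scaled value is below 1.

    Returns the scaled values together with j.
    """
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# The income example from the slide.
income_min_max = min_max_normalize(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
```

Min-max normalization needs the attribute's true minimum and maximum, so a new value outside that range falls outside the target interval; z-score normalization avoids this at the cost of an unbounded result.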

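Remember Stats Facts
Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
–Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
–Trimmed mean: chopping extreme values
Median: a holistic measure
–Middle value if there is an odd number of values, or the average of the middle two values otherwise
–Estimated by interpolation (for grouped data):
\[ \mathrm{median} = L_1 + \left( \frac{n/2 - \left(\sum f\right)_l}{f_{\mathrm{median}}} \right) c \]
where L_1 is the lower boundary of the median interval, n is the number of values, (Σf)_l is the sum of the frequencies of the intervals below the median interval, f_median is the frequency of the median interval, and c is its width
Mode
–Value that occurs most frequently in the data
–Unimodal, bimodal, trimodal
–Empirical formula:
\[ \mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median}) \]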
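
These measures are easy to check on a small data set. The sketch below uses Python's standard `statistics` module for the mean, median, and mode, and adds two helper functions whose names and signatures are hypothetical; trimming the same fraction from both ends is one common convention, assumed here.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end of the sorted data."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Illustrative data: twelve sorted prices.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
mean = statistics.mean(prices)
median = statistics.median(prices)      # even count: average of the middle two values
modes = statistics.multimode(prices)    # all most frequent values (Python 3.8+)
trimmed = trimmed_mean(prices, 0.25)
```

The median is called holistic because, unlike the mean, it cannot be computed by combining summaries of separate partitions of the data.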

End of Week 4
Read:
–Course textbook, Chapter 2