Data Mining Lecture 4
Course Syllabus
Course topics: Data Management and Data Collection Techniques for Data Mining Applications (Week 3-Week 4)
  –Data Warehouses: gathering raw data from relational databases and transforming it into information
  –Information Extraction and Data Processing Techniques
  –Data Marts: the need to build highly specialized data stores for data mining applications
Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)
Data Pre-processing: Information Extraction and Data Processing Techniques
Why should we do pre-processing?
  Pre-processing takes 80% of the time
  Real-world data is not perfect (it is dirty)
  –missing values (no data entered)
    »e.g. 35% of Education Field is incomplete
    »e.g. 20% of Birth Date is incomplete
    »e.g. 45% of Work Title is incomplete
    »e.g. 60% of Income is incomplete
Data Pre-processing: Information Extraction and Data Processing Techniques
  –erroneous (noisy) values
    »e.g. Birth Date > current date or Birth Date < 1850 (approx. 10% of the data)
    »e.g. the permissible values of Education Field are C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate, but X, Q, Y, T values may also be seen (approx. 10% of the data)
    »e.g. Income field is negative (approx. 15% of the data)
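
The three rules on this slide can be checked mechanically. The following is a minimal sketch, assuming each customer record is a Python dict; the key names `birth_date`, `education` and `income` and the function name `find_errors` are hypothetical, chosen only for illustration.

```python
from datetime import date

# Permissible Education Field codes, as listed on the slide.
VALID_EDUCATION = {"C", "U", "H", "D", "M", "S", "P", "I"}

def find_errors(record, today=None):
    """Return a list of rule violations for one record (a dict).

    The key names are assumptions for this sketch, not part of the lecture.
    Missing (None) fields are skipped here; they are a separate problem.
    """
    today = today or date.today()
    errors = []

    birth = record.get("birth_date")
    if birth is not None and (birth > today or birth.year < 1850):
        errors.append("birth_date out of range")

    education = record.get("education")
    if education is not None and education not in VALID_EDUCATION:
        errors.append("education code not permitted")

    income = record.get("income")
    if income is not None and income < 0:
        errors.append("income is negative")

    return errors
```

Records that return a non-empty list can then be corrected, rejected or passed to a human for inspection.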
Data Pre-processing: Information Extraction and Data Processing Techniques
  –inconsistent values: discrepancies in codes or names
    »e.g. Birth Date = '01/01/1955' vs. 54 (same information in different forms)
    »e.g. Education Field coded with letters (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate) in one source and with numbers (5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate) in another
    »e.g. Income field continuous (3200 K) or interval based ( K)
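
One way to reconcile the two Education Field codings is a lookup table that maps the numeric scheme onto the letter scheme. This is a minimal sketch: the mapping itself comes from the slide, while the function name `unify_education` and the choice of the letter scheme as the target are assumptions.

```python
# Numeric codes from the second source mapped to the letter codes of the first.
NUMERIC_TO_LETTER = {
    "5": "C",  # college
    "3": "U",  # university
    "4": "H",  # high school
    "1": "D",  # doctorate
    "2": "M",  # master
    "6": "S",  # secondary school
    "7": "P",  # primary school
    "8": "I",  # last category on the slide
}

def unify_education(code):
    """Return the letter code for a value from either scheme, or None if unknown."""
    text = str(code).strip().upper()
    if text in NUMERIC_TO_LETTER:
        return NUMERIC_TO_LETTER[text]
    if text in NUMERIC_TO_LETTER.values():
        return text
    return None  # not in either scheme: treat as erroneous, not inconsistent
```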
Data Pre-processing: Information Extraction and Data Processing Techniques
  –Where may dirtiness come from: the reasons for missing values
    »different considerations in coding and analysis (discrepancies over time)
    »hardware/software problems
    »different sources not aligned with the same data dictionary
  [Figure: Source 1, Source 2 and Source 3, each with Field 1, Field 2, Field 3, shown alongside a fourth Field 1, Field 2, Field 3 set]
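Data Pre-processing: Information Extraction and Data Processing Techniques
  –Where may dirtiness come from: the reasons for erroneous values
    »People give incomplete or only approximately correct information
  Example: two records that appear to describe the same person
    Field (label on the form) | Record 1 | Record 2
    First name (AD) | Metin Ü. | M.Ulku
    Place of birth (DOĞUM YERİ) | GAZİANTEP | G.ANTEB
    Date of birth (DOĞUM TARİHİ) | (missing) | 10/04/1965
    Surname (SOYAD) | SANRE | SANER
    Address (ADRESİ) | Atatrk Cad. Kemaliye Mah. 25/3 | Atatürk Cd. Kemaliye Sok. No.25
    Job title (ÇALIŞMA ÜNVANI) | Genel Müdür | Gen. Müdr.
    Workplace (ÇALIŞMA YERİ) | Devlet Su İşleri A.O | G.Antep D.S.İ.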
Data Pre-processing: Information Extraction and Data Processing Techniques
  –Where may dirtiness come from: the reasons for erroneous values
    »People give incomplete or only approximately correct information
  Example: address variants as entered
    »"Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst" vs. "Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL"
    »"Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha." vs. "İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok."
Data Pre-processing: Information Extraction and Data Processing Techniques
  –Where may dirtiness come from: the reasons for erroneous values
    »inadequate data collection instruments
    »partial matching, fuzzy understanding, syntactic-semantic enrichment
    »a continuous flow of data may cause data entry faults
    »errors or disruptions in data transmission
Data Pre-processing: Information Extraction and Data Processing Techniques
  –Where may dirtiness come from: the reasons for inconsistent values
    »insufficient lookup mappings
    »inadequate transformation infrastructure
    »different data sources
    »hard to prevent; needs a highly specialized synchronization and automation infrastructure
    »we should also watch for duplicate data (redundancy)
Data Pre-processing: Information Extraction and Data Processing Techniques
  –Why is pre-processing so important?
    »Data quality leads to successful data mining
    »It is the only way to extract information from data
Major tasks in Data Pre-processing:
  –Data cleaning
  –Data integration
  –Data transformation
  –Data reduction
  –Data discretization
Data Pre-processing: Information Extraction and Data Processing Techniques
Major tasks in Data Cleaning: –Fill in missing values –Identify outliers and smooth out noisy data –Correct inconsistent data –Resolve redundancy caused by data integration
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle missing data
  –simply do not accept it
  –fill it in manually
  –fill it in automatically:
    »a global constant, e.g. "unknown" (a new class?!)
    »the attribute mean
    »the attribute mean for all samples belonging to the same class (smarter)
    »the most probable value: inference-based, such as a Bayesian formula or a decision tree
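
The first three automatic strategies can be sketched in a few lines. This is an illustrative sketch only: missing values are assumed to be represented as `None`, and the function names are hypothetical. The inference-based strategy is not shown because it needs a trained model.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing values with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, classes):
    """Replace missing values with the mean of the known values in the same class.

    Assumes every class has at least one known value; otherwise a KeyError is raised.
    """
    known_by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            known_by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in known_by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]
```

The class-based variant needs a class label for every sample, which is why it only applies when the data is already labelled.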
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle noisy data
  –Binning (discretization) method:
    »first sort the data and partition it into (equi-depth) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
    »use the data distribution and domain knowledge
  –Clustering
    »detect and remove outliers
  –Combined computer and human inspection
    »detect suspicious values and have a human check them (e.g. to deal with possible outliers)
  –Regression
    »smooth by fitting the data to regression functions
  –Model the data and infer the most probable values (difficult)
Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
Equal-width (distance) partitioning:
  –Divides the range into N intervals of equal size: a uniform grid
  –If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  –The most straightforward method, but outliers may dominate the presentation
  –Skewed data is not handled well
Equal-depth (frequency) partitioning:
  –Divides the range into N intervals, each containing approximately the same number of samples
  –Good data scaling
  –Managing categorical attributes can be tricky
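
The two partitioning schemes can be compared side by side with a short sketch. The function names are hypothetical, and the equal-width version assumes the attribute is numeric with B > A (otherwise W would be zero).

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of width W = (B - A) / n."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n  # assumes highest > lowest
    bins = [[] for _ in range(n)]
    for v in values:
        # The highest value would fall at index n, so clamp it into the last bin.
        index = min(int((v - lowest) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins
```

With the twelve values 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 and N = 3, the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each, which shows how skew leaves equal-width bins unevenly filled.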
Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
Sorted data (e.g. by price): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
  –Bin 1: 4, 8, 9, 15
  –Bin 2: 21, 21, 24, 25
  –Bin 3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
  –Bin 1: 9, 9, 9, 9
  –Bin 2: 23, 23, 23, 23
  –Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
  –Bin 1: 4, 4, 4, 15
  –Bin 2: 21, 21, 25, 25
  –Bin 3: 26, 26, 26, 34
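
The two smoothing rules in this example can be reproduced with the sketch below. The function names are hypothetical. Rounding the bin mean to an integer matches the slide's figures, and sending ties to the lower boundary is an assumption, since the slide's data contains no ties.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lowest, highest = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lowest if v - lowest <= highest - v else highest for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```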
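Data Pre-processing: Information Extraction and Data Processing Techniques
Regression
[Figure: data points on x-y axes with the fitted line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1']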
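
Smoothing by regression replaces each observed y with the value predicted by a fitted function. The following is a minimal sketch using an ordinary least-squares line; the function names are hypothetical, and a straight line is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes the xs are not all equal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth_by_regression(xs, ys):
    """Replace each observed y with the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]
```

Points that lie exactly on y = x + 1 are returned unchanged; a noisy point is pulled onto the fitted line.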
Data Pre-processing: Information Extraction and Data Processing Techniques Clustering
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle inconsistent data
  –systematic conversion ("transformation")
  –dynamic and interactive control mechanisms
  –redundancy detection and intelligent mapping
Data Pre-processing: Information Extraction and Data Processing Techniques
Transformation
  –Smoothing: remove noise from the data
  –Aggregation: summarization, data cube construction
  –Generalization: concept hierarchy climbing
  –Normalization: values scaled to fall within a small, specified range
    »min-max normalization
    »z-score normalization
    »normalization by decimal scaling
  –Attribute/feature construction: new attributes constructed from the given ones
Remember Stats Facts
Min:
  –What is the big-O cost of finding the min of an n-sized list?
Max:
  –What is the minimum number of comparisons needed to find the max of an n-sized list?
Range:
  –What about finding the min and max simultaneously?
Value Types:
  –Cardinal value -> how many; counting numbers
  –Nominal value -> names and identifies something
  –Ordinal value -> order of things; rank, position
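
As a hint for the range question: a single scan finds the min (or the max) in O(n) time with n - 1 comparisons, and processing the list in pairs finds both with roughly 3n/2 comparisons instead of about 2n. The sketch below shows the pairwise technique and counts its comparisons; the function name is hypothetical.

```python
def min_max_pairwise(values):
    """Return (minimum, maximum, number_of_comparisons) using pairwise scanning."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    comparisons = 0
    if n % 2:  # odd length: seed both results with the first element
        lowest = highest = values[0]
        start = 1
    else:      # even length: seed with the first pair (one comparison)
        comparisons += 1
        if values[0] < values[1]:
            lowest, highest = values[0], values[1]
        else:
            lowest, highest = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        comparisons += 3  # one within the pair, one against lowest, one against highest
        small, large = (a, b) if a < b else (b, a)
        if small < lowest:
            lowest = small
        if large > highest:
            highest = large
    return lowest, highest, comparisons
```

For a 12-element list this uses 16 comparisons, against 22 for two separate scans.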
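Transformation
Min-max normalization to [new_min_A, new_max_A]:
  \( v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \)
  –Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \( \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716 \)
Z-score normalization (μ: mean, σ: standard deviation):
  \( v' = \frac{v - \mu}{\sigma} \)
  –Ex. Let μ = 54,000 and σ = 16,000. Then \( \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \)
Normalization by decimal scaling:
  \( v' = \frac{v}{10^{j}} \), where j is the smallest integer such that \( \max(|v'|) < 1 \)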
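
The three normalizations can be written directly from their definitions. This is a minimal sketch with hypothetical function names. The decimal-scaling version searches only non-negative j, which is an assumption that holds whenever the largest absolute value is at least 0.1.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Express v as a number of standard deviations from the mean."""
    return (v - mu) / sigma

def decimal_scale(values):
    """Divide all values by 10**j, with the smallest j >= 0 giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # income example
print(round(z_score(73_600, 54_000, 16_000), 3))            # z-score example
```

Min-max keeps the original relationships between values but fails on new data outside [min_A, max_A]; z-score is the usual alternative when the true minimum and maximum are unknown or outliers dominate.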
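Remember Stats Facts
Mean (an algebraic measure), sample vs. population:
  \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \) (sample), \( \mu = \frac{\sum x}{N} \) (population)
  –Weighted arithmetic mean: \( \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \)
  –Trimmed mean: the mean after chopping off extreme values
Median: a holistic measure
  –Middle value if there is an odd number of values, or the average of the middle two values otherwise
  –Estimated by interpolation (for grouped data): \( median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c \), where \( L_1 \) is the lower boundary of the median interval, n is the number of values, \( (\sum f)_l \) is the sum of the frequencies of the intervals below the median interval, \( f_{median} \) is the frequency of the median interval, and c is its width
Mode
  –Value that occurs most frequently in the data
  –Unimodal, bimodal, trimodal
  –Empirical formula: \( mean - mode \approx 3 \times (mean - median) \)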
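
These measures can be tried out with Python's standard `statistics` module. This is a small sketch: `weighted_mean` and `trimmed_mean` are hypothetical helper names, and the trimming rule (drop the same fraction of values from each end) is one common convention, not necessarily the one intended in the lecture.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (fraction < 0.5)."""
    data = sorted(values)
    k = int(len(data) * fraction)  # number of values chopped from each end
    return mean(data[k:len(data) - k])

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(mean(prices))               # arithmetic mean
print(trimmed_mean(prices, 0.1))  # drops the lowest and the highest value
print(median(prices))             # average of the two middle values
print(multimode(prices))          # list of the most frequent value(s)
```

`multimode` returns a list, so bimodal or trimodal data yields two or three entries.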
End of Week 4
Reading:
  –Course textbook, Chapter 2