Data Transformation: Normalization

Data Transformation: Normalization
Useful for classification algorithms involving neural networks and distance measurements (e.g., nearest neighbor). For the backpropagation algorithm (neural networks), normalizing the input values helps speed up the learning phase. For distance-based methods, normalization prevents attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).

Data Transformation: Normalization
Min-max normalization: v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Z-score normalization: v' = (v − mean_A) / stand_dev_A
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Example: Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0] (min-max normalization). Suppose that the mean and standard deviation of the values for income are $54,000 and $16,000, respectively (z-score normalization). Suppose that the recorded values of A range from –986 to 917; for decimal scaling we divide each value by 1,000 (j = 3), so the normalized values range from –0.986 to 0.917.
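The three formulas can be sketched in plain Python against the figures above; the income value v = 73,600 is an illustrative assumption, not given in the text.

```python
# Sketch of the three normalization formulas. The income value
# v = 73600 is an illustrative assumption, not taken from the text.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min"""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """v' = (v - mean_A) / std_A"""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """v' = v / 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling(917, 986))               # 0.917
```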

Data Reduction Strategies
When the data set is too big to work with, complex analysis/mining may take a very long time, or be impractical or infeasible. Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies: data cube aggregation – apply aggregation operations to the data (building a data cube).

Cont’d
Dimensionality reduction – remove unimportant attributes. Data compression – encoding mechanisms used to reduce the data size. Numerosity reduction – the data are replaced or estimated by alternative, smaller data representations: parametric models (store the model parameters instead of the actual data) or non-parametric methods (clustering, sampling, histograms). Discretization and concept hierarchy generation – raw values are replaced by ranges or higher conceptual levels.

Data Cube Aggregation
Store multidimensional aggregated information. Provides fast access to precomputed, summarized data, benefiting on-line analytical processing and data mining. See Fig. 3.4 and 3.5.
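As a minimal sketch of this idea (the quarterly figures below are made-up stand-ins, not the data of Fig. 3.4), quarterly sales can be rolled up to annual totals so that queries about yearly sales hit the much smaller aggregate:

```python
# Roll quarterly sales up to annual totals (data cube aggregation).
# The figures are invented stand-ins, not the textbook's data.
quarterly = [
    ("2002", "Q1", 224), ("2002", "Q2", 408),
    ("2002", "Q3", 350), ("2002", "Q4", 586),
    ("2003", "Q1", 310), ("2003", "Q2", 402),
]

annual = {}
for year, _quarter, amount in quarterly:
    annual[year] = annual.get(year, 0) + amount

print(annual)  # {'2002': 1568, '2003': 712}
```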

Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select a minimum set of attributes (features) that is sufficient for the data mining task. Best/worst attributes are determined using tests of statistical significance, e.g., information gain (as used in building decision trees for classification). Because there are an exponential number of choices (2^d subsets of d attributes), heuristic methods are used: step-wise forward selection; step-wise backward elimination; combining forward selection and backward elimination; etc.
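Step-wise forward selection can be sketched as a greedy loop; the `score` callable below is a stand-in for a real criterion such as information gain, and the attribute names and weights are invented for illustration.

```python
# Greedy step-wise forward selection: start from the empty set and
# repeatedly add the attribute that most improves the scoring function.
def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute improves the score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scoring function: each attribute has a fixed usefulness weight
# (a stand-in for information gain; names and weights are invented).
weights = {"income": 3.0, "age": 2.0, "zip": 0.5, "id": 0.0}
print(forward_selection(weights, lambda s: sum(weights[a] for a in s), 2))
# ['income', 'age']
```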

Decision tree induction
Originally intended for classification. Each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node denotes a class prediction. At each node, the algorithm chooses the 'best' attribute to partition the data into individual classes. In attribute subset selection, a tree is constructed from the given data, and the attributes that appear in it form the selected subset.

Data Compression
Obtain a compressed representation of the original data. If the original data can be reconstructed from the compressed data without loss of information, the compression is lossless; if only an approximation can be reconstructed, it is lossy. Two popular and effective lossy methods: wavelet transforms and principal component analysis (PCA).
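A compact sketch of PCA-style reduction with NumPy (assuming NumPy is available; the tiny 2-D data set is invented): project the centered data onto the top-k eigenvectors of its covariance matrix, so each record is stored as k coordinates plus one shared basis.

```python
import numpy as np

def pca_reduce(X, k):
    """Keep only k principal-component coordinates per record (lossy)."""
    Xc = X - X.mean(axis=0)                      # center each attribute
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    vals, vecs = np.linalg.eigh(cov)             # eigenpairs (ascending)
    basis = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return Xc @ basis, basis                     # reduced data + basis

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])
Z, basis = pca_reduce(X, 1)
print(Z.shape, basis.shape)  # each record now has 1 coordinate, not 2
```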

Numerosity Reduction
Reduce the data volume by choosing alternative, 'smaller' forms of data representation. Two types: parametric – a model is used to estimate the data, so only the model parameters are stored instead of the actual data (regression, log-linear models); nonparametric – store a reduced representation of the data (histograms, clustering, sampling).

Regression
Examples: develop a model to predict the salary of college graduates with 10 years of working experience, or the potential sales of a new product given its price. Regression is used to approximate the given data. In linear regression, the data are modeled as a straight line: a random variable Y (the response variable) is modeled as a linear function of another random variable X (the predictor variable), with the equation Y = α + βX.

Cont’d
The variance of Y is assumed to be constant. α and β (the regression coefficients) specify the Y-intercept and the slope of the line. They can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.

Cont’d
The least-squares estimates are β = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and α = ȳ − β x̄, where x̄ and ȳ are the means of the observed x and y values.
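The method of least squares can be sketched directly from its closed-form estimates; the toy data points below lie exactly on the line y = 2x + 1, so the fit recovers those coefficients.

```python
# Method of least squares for simple linear regression Y = alpha + beta*X:
# beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),
# alpha = mean_y - beta * mean_x.
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Toy data on the line y = 2x + 1.
alpha, beta = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # 1.0 2.0
```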

Multiple regression
An extension of linear regression involving more than one predictor variable: the response variable Y is modeled as a linear function of a multidimensional feature vector. E.g., a multiple regression model based on two predictor variables X1 and X2: Y = b0 + b1·X1 + b2·X2.
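A sketch of fitting such a two-predictor model with NumPy's least-squares solver (assuming NumPy; the data are generated from known coefficients, so the fit recovers them):

```python
import numpy as np

# Multiple regression Y = b0 + b1*X1 + b2*X2, solved by least squares.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = 1.0 + 2.0 * X1 + 3.0 * X2                    # known coefficients

A = np.column_stack([np.ones_like(X1), X1, X2])  # design matrix [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(coef, 6))  # recovers b0=1, b1=2, b2=3
```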

Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket, using binning to approximate data distributions. Buckets lie along the horizontal axis; the height (or area) of a bucket is the average frequency of the values it represents. A bucket that represents a single attribute-value/frequency pair is a singleton bucket; more often, buckets represent continuous ranges of the given attribute.

Example A list of prices of commonly sold items (rounded to the nearest dollar) 1,1,5,5,5,5,5,8,8,10,10,10,10,12, 14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30. Refer Fig. 3.9

Cont’d
How are the buckets determined and the attribute values partitioned? There are many rules: equiwidth (Fig. 3.10) – each bucket spans an equal-width range of values; equidepth – each bucket holds roughly the same number of samples; V-Optimal and MaxDiff – generally the most accurate and practical.
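Equiwidth bucketing over the price list above can be sketched as follows (the choice of three buckets is arbitrary):

```python
# Equiwidth histogram: split the value range into buckets of equal width
# and store only a count per bucket instead of the raw values.
def equiwidth_histogram(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)  # clamp max value
        counts[i] += 1
    return counts

# The price list from the example above (53 values).
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
          14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
          18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
          21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]
print(equiwidth_histogram(prices, 3))  # [13, 26, 14]
```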

Clustering
Partition the data set into clusters so that only a cluster representation need be stored. Can be very effective if the data are naturally clustered, but not if the data are 'smeared' (spread out). There are many choices of clustering definitions and clustering algorithms; we will discuss them later.

Sampling
A data reduction technique that allows a large data set to be represented by a much smaller random sample (subset). 4 types: simple random sampling without replacement (SRSWOR); simple random sampling with replacement (SRSWR); and the adaptive sampling methods cluster sample and stratified sample. Refer to Fig. 3.13, pg 131.
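The two simple random sampling schemes can be sketched with the standard library (`random.sample` draws without replacement, `random.choices` with replacement); the data set and seed below are arbitrary.

```python
import random

random.seed(7)                       # fixed seed so the sketch is reproducible
data = list(range(100))              # stand-in for N tuples

srswor = random.sample(data, 10)     # SRSWOR: 10 distinct tuples
srswr = random.choices(data, k=10)   # SRSWR: duplicates are possible

print(len(set(srswor)), len(srswr))  # SRSWOR never repeats a tuple
```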