Lecture 7: Data Preprocessing


CSE 482 Lecture 7: Data Preprocessing

Overview. Previous lecture: data quality issues. Today's lecture: data preprocessing, i.e., transforming the raw data into a more "useful" representation for subsequent analysis; this includes data cleaning, aggregation, feature extraction, etc.

Data Preprocessing Tasks: data cleaning (noise, outliers, missing values, duplicate data), sampling, aggregation, discretization, and feature extraction.

Sampling. Sampling is a technique for data reduction. The key principle for effective sampling is to find a representative sample: a sample is representative if it has approximately the same property (of interest) as the original set of data. (Figure: the same point set drawn with 8000, 2000, and 500 sampled points.)

Types of Sampling. Simple random sampling: there is an equal probability of selecting any particular item. Stratified sampling: split the data into several partitions, then draw random samples from each partition. Sampling without replacement: as each item is selected, it is removed from the population. Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.

Python Example

DataFrame.sample() draws a random sample of rows from a pandas DataFrame; its replace argument controls whether sampling is done with or without replacement.
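A minimal sketch of how these calls might look; the DataFrame contents, column names, and sample sizes below are illustrative assumptions (loosely based on the Age/Buy example used later in the lecture), not the lecture's own data.

```python
import pandas as pd

# Illustrative data; columns 'age' and 'buy' are assumptions for this sketch
df = pd.DataFrame({'age': [10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64],
                   'buy': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes',
                           'Yes', 'Yes', 'No', 'No', 'No', 'No']})

# Simple random sampling without replacement (the default): each row drawn at most once
sample_wo = df.sample(n=5, random_state=1)

# Simple random sampling with replacement: the same row may be drawn more than once
sample_w = df.sample(n=5, replace=True, random_state=1)

# Stratified sampling: partition the rows by the 'buy' class, then draw the same
# fraction of rows from each partition
stratified = df.groupby('buy').sample(frac=0.5, random_state=1)

print(sample_wo)
print(stratified)
```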

Aggregation. Sometimes, less is more: aggregation combines two or more observations into a single observation. Purpose: data reduction (a smaller data set means less memory and processing time), change of scale (aggregation generates a coarser-level view of the data), and more "stable" data (aggregated data tends to have less variability, i.e., is less noisy).

Aggregation example: precipitation at Maple City, Michigan. At the daily level, trends and patterns are harder to detect; aggregating to monthly or annual precipitation makes long-term trends and cycles easier to discern.
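A sketch of this kind of temporal aggregation in pandas. The daily series below is randomly generated (the slide's Maple City data is not reproduced); the resample rules 'M' and 'A' aggregate to month-end and year-end totals.

```python
import numpy as np
import pandas as pd

# Hypothetical daily precipitation series (random placeholder values)
dates = pd.date_range('2000-01-01', '2009-12-31', freq='D')
daily = pd.Series(np.random.default_rng(0).gamma(0.5, 2.0, len(dates)),
                  index=dates, name='precipitation')

# Aggregate to coarser time scales: total precipitation per month and per year
monthly = daily.resample('M').sum()
annual = daily.resample('A').sum()

print(monthly.head())
print(annual.head())
```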

Discretization. Ordinal attribute: shirt size (small/medium/large), hurricane category (1-5). Numeric attribute: weight, height, salary, # days since Jan 1, 2000. Discretization is used to split the range of a numeric attribute into a discrete number of intervals. For example, age can be discretized into [child, young adult, adult, senior]. There may be no apparent relationship between the age attribute and the tendency to buy a particular product, but a relationship may exist within certain age groups (e.g., young adults).

Unsupervised Discretization. Equal interval width: split the range of the numeric attribute into equal-length intervals (bins). Pros: cheap and easy to implement. Cons: susceptible to outliers. Equal frequency: split the range of the numeric attribute in such a way that each interval (bin) has the same number of points. Pros: robust to outliers. Cons: more expensive (must sort the data) and may not be consistent with the inherent structure of the data.

Python Example: discretize into 5 equal-width bins, and into 5 equal-frequency bins (the bin boundaries are the quantiles of the data).
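Equal-width and equal-frequency binning map naturally onto pandas' pd.cut and pd.qcut. A sketch using the Age values from the lecture's example (the variable name age is ours):

```python
import pandas as pd

# The Age values from the lecture's example (any numeric column would do)
age = pd.Series([10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64])

# Equal width: split the range of values into 5 intervals of equal length
equal_width = pd.cut(age, bins=5)

# Equal frequency: split so that each of the 5 bins holds roughly the same number
# of points; the bin edges are the quantiles of the data
equal_freq = pd.qcut(age, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```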

Supervised Discretization Example. The example data consist of 12 users with ages 10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, and 64, and a class attribute Buy that is No for ages 10 and 15, Yes for ages 18 through 31, and No for ages 40 through 64. Let "Buy" be the class attribute. Suppose we are interested in discretizing the Age attribute. We also want the intervals (bins) to contain data points from the same class (i.e., we want the bins to be as close to homogeneous as possible).

Example. With 3 bins, equal width gives interval = (64-10)/3 = 54/3 = 18, i.e., bins [10, 28], (28, 46], and (46, 64]; equal frequency puts 4 of the 12 points in each bin. Both approaches can produce intervals that contain non-homogeneous classes.

Supervised Discretization. Using the same Age/Buy table: in supervised discretization, our goal is to ensure that each bin contains data points from one class.

Entropy-based Discretization. A widely used supervised discretization method. Entropy is a measure of impurity: Entropy = – Σj pj log2 pj, where pj is the proportion of data points belonging to class j. Higher entropy implies the data points come from a large number of classes (heterogeneous); lower entropy implies most of the data points are from the same class.

Entropy. Suppose you want to discretize the age of users based on whether they buy or don't buy a product (the same Age/Buy table as before). The class here is Yes or No (whether the user buys the product). For each bin, calculate the proportion of data points belonging to each class.

Entropy. Example with bins of six points each: if P(Yes) = 0/6 = 0 and P(No) = 6/6 = 1, then Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0. If P(Yes) = 1/6 and P(No) = 5/6, then Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65. If P(Yes) = 2/6 and P(No) = 4/6, then Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92. As the bin becomes less homogeneous, entropy increases.
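A small helper that reproduces these numbers (the function name bin_entropy is ours, not from the lecture):

```python
import numpy as np

def bin_entropy(labels):
    """Entropy of the class labels falling inside one bin."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()          # proportion p_j of each class j in the bin
    # Entropy is non-negative; max() guards against a floating-point -0.0
    return max(0.0, float(-(p * np.log2(p)).sum()))

print(bin_entropy(['No'] * 6))                          # 0.0  (perfectly homogeneous)
print(round(bin_entropy(['Yes'] * 1 + ['No'] * 5), 2))  # 0.65
print(round(bin_entropy(['Yes'] * 2 + ['No'] * 4), 2))  # 0.92
```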

Entropy-based Discretization. Recursively find the split that minimizes the entropy (weighted by bin size) of the resulting partitions. For the Age example, the first split point found is 35.5.

Entropy-based Discretization. Repeating the search within each resulting bin finds the next best split point, 16.5, giving split points at 16.5 and 35.5. A Python sketch of this split search is shown below.
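A self-contained sketch of one level of this split search, assuming the class labels of the Age table as reconstructed above (No for ages 10-15, Yes for 18-31, No for 40-64). The function names and the size-weighted scoring are our illustration of the standard entropy-based approach, not code from the lecture.

```python
import numpy as np

def bin_entropy(labels):
    """Entropy of the class labels inside one bin (same helper as above)."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return max(0.0, float(-(p * np.log2(p)).sum()))

def best_split(values, labels):
    """Candidate split point minimizing the size-weighted entropy of the two bins."""
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    labels = np.asarray(labels)[order]
    best_point, best_score = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        split = (values[i - 1] + values[i]) / 2     # midpoint between consecutive values
        left, right = labels[:i], labels[i:]
        score = (len(left) * bin_entropy(left)
                 + len(right) * bin_entropy(right)) / len(labels)
        if score < best_score:
            best_point, best_score = split, score
    return best_point

ages = [10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64]
buy = ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No']
first = best_split(ages, buy)
# Apply the same search again to the left-hand bin to find the second split
second = best_split([a for a in ages if a < first],
                    [b for a, b in zip(ages, buy) if a < first])
print(first, second)   # 35.5 16.5, matching the split points on the slides
```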

Curse of Dimensionality. Suppose you want to build a model to predict whether a user will buy an item at an online store. A simple model predicts whether users will buy based on their age; such a model is likely to perform poorly. Can we improve it?

Curse of Dimensionality. A more complicated model that uses a second attribute alongside age is likely to be more accurate, since we can use the two attributes to separate the ones who buy from those who don't. Can we do even better?

Curse of Dimensionality. Can we keep improving the model by adding more features?

Curse of Dimensionality. Given a data set with a fixed number of objects, increasing the number of attributes (i.e., the dimensionality of the data) may actually degrade model performance. As the number of dimensions increases, there is a higher chance for the model to overfit noisy observations, and more examples are needed to figure out which attributes are most relevant for predicting the different classes.

Overcoming the Curse of Dimensionality. Feature subset selection: pick a subset of the attributes to build your prediction model, eliminating the irrelevant and highly correlated ones. Feature extraction: construct a new set of attributes based on (linear or nonlinear) combinations of the original attributes.

Feature Selection Example. Select the non-correlated features for your analysis by inspecting the correlation matrix of the attributes (shown on the slide).
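A sketch of correlation-based feature selection in pandas. The customer table, the 0.9 threshold, and the drop-one-of-each-correlated-pair rule below are illustrative assumptions, not the lecture's actual data or procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with a few redundant attributes
rng = np.random.default_rng(0)
years = rng.integers(1, 10, size=100)
df = pd.DataFrame({
    'membership_years': years,
    'amount_spent': years * 100 + rng.normal(0, 10, size=100),  # strongly correlated with years
    'num_purchases': years * 3 + rng.normal(0, 1, size=100),    # also correlated with years
    'age': rng.integers(18, 70, size=100),
})

# Correlation matrix of all numeric attributes
corr = df.corr()

# Drop one attribute from every pair whose absolute correlation exceeds a threshold
threshold = 0.9
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(corr.round(2))
print(reduced.columns.tolist())   # the non-correlated features that are kept
```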

Feature Extraction. Creation of a new set of attributes from the original raw data. Example: face detection in images. Raw pixels are too fine-grained to enable accurate detection of a face; generating higher-level features, such as those representing the presence or absence of certain facial features (e.g., mouth, eyebrow, etc.), can help improve detection accuracy.

Principal Component Analysis. A widely used (classical) approach for feature extraction. The goal of PCA is to construct a new set of dimensions (attributes) that better captures the variability of the data. The first dimension is chosen to capture as much of the variability as possible; the second dimension is orthogonal to the first and captures as much of the remaining variability as possible, and so on.

Principal Component Analysis. (Diagram: the N x d data frame (table) is multiplied by a d x k matrix of principal components to produce the N x k projected data, where k << d.)

Example

Example Note: membership years, amount spent, and number of purchases are quite correlated

Computing Principal Components. Given a data set D: calculate the covariance matrix C; the PCs are the eigenvectors of the covariance matrix. To calculate the projected data, center each column in the data to obtain D', then calculate the projections (T denotes the transpose operation): projected^T = (PC)^T x (D')^T. Here we want to project the data from 5 features down to k = 2 principal components, so the matrix dimensions are (k x N) = (k x d) x (d x N).

Example. We can use NumPy linear algebra functions to calculate the eigenvectors and perform the matrix multiplication: data.cov() calculates the covariance matrix; data.as_matrix() converts the DataFrame to a NumPy array (as_matrix() is deprecated in current pandas; use data.to_numpy() or data.values); numpy.linalg.eig(cov) calculates the eigenvalues and eigenvectors (the eigenvectors are the PCs); A – mean(A.T, axis=1) centers the columns of the data matrix; and dot(pc.T, M).T multiplies the PCs with the centered data matrix.
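Putting those pieces together, a runnable sketch of the whole computation. The customer table below is made up (column names and values are illustrative, not the lecture's 5-feature dataset), and np.linalg.eigh is the symmetric-matrix variant that could be used in place of eig.

```python
import numpy as np
import pandas as pd

# Toy customer table: N = 6 rows, d = 5 features (values are illustrative only)
data = pd.DataFrame({
    'membership_years': [1, 2, 3, 4, 5, 6],
    'amount_spent':     [120, 190, 310, 420, 480, 620],
    'num_purchases':    [3, 7, 9, 13, 14, 19],
    'age':              [23, 41, 35, 52, 29, 47],
    'items_returned':   [0, 2, 1, 3, 1, 2],
})

# Covariance matrix of the features (d x d); to_numpy() replaces the deprecated as_matrix()
cov = data.cov().to_numpy()

# Eigen-decomposition: the eigenvectors (columns of vecs) are the principal components
# (np.linalg.eigh could be used instead, since a covariance matrix is symmetric)
vals, vecs = np.linalg.eig(cov)

# Keep the k = 2 components with the largest eigenvalues
order = np.argsort(vals)[::-1]
pc = vecs[:, order[:2]]                      # d x k

# Center each column of the data, then project: projected^T = (PC)^T x (D')^T
M = data.to_numpy(dtype=float)
centered = M - M.mean(axis=0)                # N x d, every column now has mean 0
projected = np.dot(pc.T, centered.T).T       # N x k

print(projected)
```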


Example. (Plot: the data projected onto the 1st and 2nd principal components.)

Summary. In this lecture, we discussed data preprocessing approaches and examples of using Python to do data preprocessing. Next lecture: data summarization and visualization.