Preprocessing for Data Mining Vikram Pudi IIIT Hyderabad.

Slides:



Advertisements
Similar presentations
An Introduction to Data Mining
Advertisements

UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Konstanz, Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital.
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.

Data Mining: Concepts and Techniques
Data Preprocessing.
6/10/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data.
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Chapter 4 Data Preprocessing
Data Preprocessing.
Data Mining By Archana Ketkar.
Data Mining – Intro.
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 5 Data mining : A Closer Look.
Chapter 1 Data Preprocessing
CS2032 DATA WAREHOUSING AND DATA MINING
Data Mining: Concepts & Techniques. Motivation: Necessity is the Mother of Invention Data explosion problem –Automated data collection tools and mature.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
Lab2 CPIT 440 Data Mining and Warehouse.
Data Mining Techniques
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
The Knowledge Discovery Process; Data Preparation & Preprocessing
Data Mining – Input: Concepts, instances, attributes Chapter 2.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
2015年11月6日星期五 2015年11月6日星期五 2015年11月6日星期五 Data Mining: Concepts and Techniques1 Data Preprocessing — Chapter 2 —
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
7 Strategies for Extracting, Transforming, and Loading.
Data Mining Spring 2007 Noisy data Data Discretization using Entropy based and ChiMerge.
Introduction to Data Mining by Yen-Hsien Lee Department of Information Management College of Management National Sun Yat-Sen University March 4, 2003.
Data Cleaning Data Cleaning Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data.
Managing Data for DSS II. Managing Data for DS Data Warehouse Common characteristics : –Database designed to meet analytical tasks comprising of data.
Data Mining and Decision Support
Information Design Trends Unit Five: Delivery Channels Lecture 2: Portals and Personalization Part 2.
Classification Vikram Pudi IIIT Hyderabad.
Data Preprocessing: Data Reduction Techniques Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
3/13/2016Data Mining 1 Lecture 1-2 Data and Data Preparation Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB) Bangkok.
Waqas Haider Bangyal. Classification Vs Clustering In general, in classification you have a set of predefined classes and want to know which class a new.
Data Mining What is to be done before we get to Data Mining?
Data Mining – Introduction (contd…) Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Understanding, Cleaning, Transforming. Recall the Data Science Process Data acquisition Data extraction (wrapper, IE) Understand/clean/transform.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.
Data Mining Functionalities
Data Mining – Intro.
Data Transformation: Normalization
Data Mining: Concepts and Techniques
Data Mining: Data Preparation
Noisy Data Noise: random error or variance in a measured variable.
UNIT-2 Data Preprocessing
Classification & Prediction
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Classification and Prediction
Data Preprocessing Modified from
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Chapter 1 Data Preprocessing
Data Transformations targeted at minimizing experimental variance
Data Mining Data Preprocessing
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Data Pre-processing Lecture Notes for Chapter 2
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel
Presentation transcript:

Preprocessing for Data Mining Vikram Pudi IIIT Hyderabad

1 What is Preprocessing? Prepare input data for data mining Clean: noise, inconsistency, missing values Transform: discretize, normalize, generalize,… The kind of transformations to apply depend on our end goals, i.e. what are we trying to achieve by doing data mining. E.g. prediction, summarization, optimization, …

2 Scenario 1 Suppose you want to develop a media player that can provide recommendations to users on what items to play based on your taste.

3 Available data Song Artist Album Genre Rating User profile This data may be split across many tables.

4 Strategies 1. Find similar songs and recommend them. 2. Find similar users and recommend their highly- rated songs. In strategy 1, our preprocessing requires to find features of songs and store them; e.g. artist, song length, frequency spectrum, tempo, etc. In strategy 2, our preprocessing requires to find features of users and store them; e.g. nationality, age, lenient/strict, etc.

5 Scenario 2 Given sales transactions in a supermarket, suppose we want to determine the factors that lead to increase in sales of “Nirma washing powder”.

6 Available data Item-id Amount purchased Price Total User profile (more details in other tables) Date Type of item (more details in other tables)

7 Strategies 1. Determine what kind of users buy Nirma. 2. Determine what other items are purchased frequently with Nirma. 3. Determine what kind of items are purchased frequently with Nirma. 4. Determine on which kind of days people buy more Nirma. What preprocessing is required?

8 Scenario 3 Suppose you want to write an application that searches for faculty web-pages from all over the world, extracts information from these pages and determine which of these faculty do top-class research in the area of “data mining”. What kind of data is available? What preprocessing is needed?

9 Data Cleaning Noise, inconsistency, missing vales

10 Handling noise (inaccuracies) Binning: Sort values of an attribute and partition into ranges/bins. Replace all values within a bin by its mean/median/… Regression: Fit the data into a function such as linear or non-linear regression. Outliers: Treat outliers as noise and ignore them.

11 Inconsistency Avoid by using good integrity constraints when designing database. If inconsistency still arises, cure using same approaches as for handling noise.

12 Missing Values 1. Ignore missing values 2. Most common value 3. Concept most common value 4. All possible values 5. Missing values as special values 6. Use classification techniques

13 Data Transformation Format conversion, discretization, normalization, …

14 Data format conversion Needed usually when combining data from multiple sources. Needs major manual programming effort Remember Y2K problem!

15 Discretization (Numeric  Categorical) Sort values of numeric attribute Divide sorted values into ranges Equi-depth Clustering Information gain

16 Normalization Some attributes may have large ranges. Bring all attributes to common range. Scale values to lie within, say, -1.0 to +1.0

17 Generalization Categorical attributes (like name, location) contain too many values. Attributes like name can be ignored. Attributes like location can be generalized (e.g. instead of using address, use only the city/town name).

18 Feature selection Which features are likely to be relevant for a given task? E.g. for detecting spam s, some words such as “free university degree”, “easy loan”, etc. may be more relevant than others. Approach: Find discriminating features.