Data Mining Waqas Haider Bangyal.



Data Preparation and Preprocessing

Data mining is the extraction of knowledge from data: sorting through large amounts of data and picking out the relevant information.

Database Processing vs. Data Mining Processing

          Database processing            Data mining processing
Query     Well defined; SQL              Poorly defined; no precise query language
Output    Precise; subset of database    Fuzzy; not a subset of database

Query Examples
Database queries:
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
Data mining queries:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)

Data Types and Forms
- Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an entity.
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object.
  - An object is also known as a record, point, case, sample, entity, or instance.

Data Types and Forms
- Data types
  - numeric, categorical (see the hierarchy on the next slide for how they relate)
  - static, dynamic (temporal)
- Other data forms
  - distributed data
  - text, Web, metadata
  - images, audio/video

Data Types and Forms

TYPES OF ATTRIBUTES
There are different types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio. Examples: temperature in Kelvin, length, time, counts

DISCRETE AND CONTINUOUS ATTRIBUTES
Discrete attribute
- Has only a finite or countably infinite set of values.
- Examples: zip codes, counts, or the set of words in a collection of documents.
- Often represented as integer variables.
- Note: binary attributes are a special case of discrete attributes.
Continuous attribute
- Has real numbers as attribute values.
- Examples: temperature, height, or weight.
- In practice, real values can only be measured and represented using a finite number of digits.
- Typically represented as floating-point variables.

Data Preparation
An important and time-consuming task in KDD. Raw data poses several problems:
- Huge size (volume) of data
- Missing data
- Outliers
- Erroneous data (inconsistent, mis-recorded, distorted)

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data.
  - A data warehouse needs consistent integration of quality data.

Data Understanding: Relevance
- What data is available for the task?
- Is this data relevant?
- Is additional relevant data available?
- How much historical data is available?
- Who is the data expert?

Data Understanding: Quantity
- Number of instances (records, objects)
  - Rule of thumb: 5,000 or more desired.
  - If fewer, results are less reliable; use special methods (boosting).
- Number of attributes (fields)
  - Rule of thumb: for each attribute, 10 or more instances.
  - If there are more fields, use feature reduction and selection.
- Number of targets
  - Rule of thumb: >100 for each class.
  - If very unbalanced, use stratified sampling.

Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
- Accuracy: correct or wrong, accurate or not
- Completeness: not recorded, unavailable, …
- Consistency: some modified but some not, dangling, …
- Timeliness: is the data updated in a timely way?
- Believability: how far can the data be trusted to be correct?
- Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization: part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Data Cleaning
Data cleaning tasks:
- Data acquisition and metadata
- Fill in missing values
- Unified date format
- Converting nominal to numeric
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Data Cleaning: Acquisition
- Data can be in a DBMS
  - ODBC, JDBC protocols
- Data in a flat file
  - Fixed-column format
  - Delimited format: tab, comma ",", other
  - E.g. C4.5 and Weka "arff" use comma-delimited data
- Attention: convert field delimiters that appear inside strings
- Verify the number of fields before and after
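The advice to watch for delimiters inside strings and to verify the number of fields can be illustrated with a small Python sketch. The sample records and the helper name `field_counts` are made up for illustration; only the standard `csv` module is used.

```python
import csv
import io

def field_counts(text, delimiter=","):
    """Return the set of field counts seen across all non-empty rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {len(row) for row in reader if row}

# Hypothetical sample: the second record has a comma inside a quoted string.
raw = 'name,city,income\n"Smith, John",Boston,52000\nLee,Austin,48000\n'

# A naive split treats the comma inside the quotes as a field delimiter,
# so the records no longer agree on the number of fields.
naive_counts = {len(line.split(",")) for line in raw.splitlines()}

# A quote-aware parser keeps "Smith, John" as one field.
parsed_counts = field_counts(raw)

print(naive_counts)   # more than one distinct count signals a problem
print(parsed_counts)  # a single count means every record has the same width
```

A file is a reasonable candidate for loading only when the quote-aware parse reports a single field count.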

Data Cleaning: Example

Data Cleaning: Reformatting
Convert data to a standard format (e.g. arff or csv):
- Missing values
- Unified date format
- Binning of numeric data
- Fix errors and outliers
- Convert nominal fields whose values have order to numeric

Missing Data
- Data is not always available.
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
- Missing data may be due to:
  - equipment malfunction
  - values that were inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - data not considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious and often infeasible.
- Use a global constant to fill in the missing value: e.g., "unknown" (which may act as a new class).
- Imputation: use the attribute mean to fill in the missing value, or, smarter, use the attribute mean for all samples belonging to the same class.
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
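As a rough illustration of the two mean-based imputation options, here is a minimal Python sketch. The function names, the `income` values and the `risk` labels are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: mean(vs) for c, vs in by_class.items()}
    return [fills[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income in thousands, with a class label per sample.
income = [30.0, None, 50.0, 90.0, None, 110.0]
risk = ["low", "low", "low", "high", "high", "high"]

global_fill = impute_mean(income)            # both gaps get the overall mean
class_fill = impute_class_mean(income, risk) # each gap gets its class mean
print(global_fill)
print(class_fill)
```

The class-conditional version fills each gap with a value typical of that sample's class, which is why the slide calls it the smarter option.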

Data Missing Example

Position   Original Sample   Position 11 Missing   Preserve Mean   Preserve Variance
1          0.0886            0.0886                0.0886          0.0886
2          0.0684            0.0684                0.0684          0.0684
3          0.3515            0.3515                0.3515          0.3515
4          0.9874            0.9874                0.9874          0.9874
5          0.4713            0.4713                0.4713          0.4713
6          0.6115            0.6115                0.6115          0.6115
7          0.2573            0.2573                0.2573          0.2573
8          0.2914            0.2914                0.2914          0.2914
9          0.1662            0.1662                0.1662          0.1662
10         0.4400            0.4400                0.4400          0.4400
11         0.6939            ?                     0.3731          0.6629

Data Cleaning: Unified Date Format
- We want to transform all dates to the same format internally.
- Some systems accept dates in many formats, e.g. "Sep 24, 2003", 9/24/03, 24.09.03, etc.
  - Dates are transformed internally to a standard value.
- Frequently, just the year (YYYY) is sufficient.
- For more details, we may need the month, the day, the hour, etc.
- Representing a date as YYYYMM or YYYYMMDD can be OK, but has problems.
- Q: What are the problems with YYYYMMDD dates?
- Ans: Ignoring for now the looming Y10K (year 10,000) crisis, YYYYMMDD does not preserve intervals: 20040201 - 20040131 ≠ 20040131 - 20040130 (70 versus 1 as integers), although both pairs of dates are one day apart.
  - This can introduce bias into models.
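The interval problem can be checked directly. The following Python sketch compares integer differences of YYYYMMDD values with differences in a day count; the helper names and the choice of 1970-01-01 as origin are assumptions for illustration, not something the slides prescribe.

```python
from datetime import date

def yyyymmdd_to_date(n):
    """Split an integer such as 20040131 into year, month and day."""
    return date(n // 10000, (n // 100) % 100, n % 100)

def days_since(n, origin=date(1970, 1, 1)):
    """Number of days between the origin (an arbitrary choice) and the date."""
    return (yyyymmdd_to_date(n) - origin).days

a, b, c = 20040130, 20040131, 20040201  # three consecutive calendar days

# Differences of the raw integers: the month boundary produces a jump.
int_gaps = (b - a, c - b)

# Differences of day counts: equal intervals stay equal.
day_gaps = (days_since(b) - days_since(a), days_since(c) - days_since(b))

print(int_gaps)  # (1, 70)
print(day_gaps)  # (1, 1)
```

Any representation that counts elapsed time from a fixed origin preserves intervals in this way; the day count is just one convenient option.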

Conversion: Nominal to Numeric
- Some tools can deal with nominal values internally.
- Other methods (neural nets, regression, nearest neighbor) require only numeric inputs.
- To use nominal fields in such methods, we need to convert them to a numeric value.
- Q: Why not ignore nominal fields altogether?
- Ans: They may contain valuable information.
- Different strategies apply to binary, ordered, and multi-valued nominal fields.

Conversion: Binary to Numeric
- Binary fields, e.g. Gender = M, F
- Convert to Field_0_1 with 0, 1 values, e.g.
  - Gender = M → Gender_0_1 = 0
  - Gender = F → Gender_0_1 = 1

Conversion: Ordered to Numeric
- Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
  - A+ → 4.0
  - A → 3.7
  - B+ → 3.3
  - B → 3.0
- Q: Why is it important to preserve natural order?
- A: To allow meaningful comparisons, e.g. Grade > 3.5
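A minimal Python sketch of the ordered conversion, using the grade values from the slide. The names `GRADE_POINTS` and `grade_to_number` and the sample list of grades are illustrative assumptions.

```python
# Mapping taken from the slide; the order of the grades is preserved
# because a better grade always maps to a larger number.
GRADE_POINTS = {"A+": 4.0, "A": 3.7, "B+": 3.3, "B": 3.0}

def grade_to_number(grade):
    """Convert a letter grade to its numeric value, ignoring stray spaces."""
    return GRADE_POINTS[grade.replace(" ", "")]

grades = ["B", "A+", "B+", "A"]  # hypothetical records
numeric = [grade_to_number(g) for g in grades]

# Once numeric, a threshold comparison such as Grade > 3.5 is meaningful.
above_threshold = [g for g in grades if grade_to_number(g) > 3.5]

print(numeric)
print(above_threshold)
```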

Conversion: Nominal, Few Values
- Multi-valued, unordered attributes with a small number of values (rule of thumb: < 20)
  - e.g. Color = Red, Orange, Yellow, …, Violet
- For each value v, create a binary "flag" variable C_v, which is 1 if Color = v and 0 otherwise.
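The flag-variable scheme can be sketched in a few lines of Python. The function name `one_hot`, the `C` prefix handling and the sample colors are assumptions for illustration.

```python
def one_hot(values, prefix="C"):
    """Create one 0/1 flag per distinct value, named <prefix>_<value>."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{v}": int(x == v) for v in categories}
        for x in values
    ]

colors = ["Red", "Yellow", "Red", "Violet"]  # hypothetical records
flags = one_hot(colors)

for color, row in zip(colors, flags):
    print(color, row)
```

Exactly one flag is set in each record, so no artificial order is imposed on the colors.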

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, by bin median, by bin boundaries, etc.
- Clustering: detect and remove outliers.
- Combined computer and human inspection: detect suspicious values automatically and have a human check them.
- Regression: smooth by fitting the data to regression functions.

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
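The worked example can be reproduced with a short Python sketch. The function names are illustrative, the data is assumed to be already sorted with a length divisible by the number of bins, bin means are rounded to whole dollars as on the slide, and a value equally close to both boundaries is assigned to the lower one (an assumption, since the slide does not cover ties).

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_boundaries)
```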

Cluster Analysis

Regression

QUESTIONS