Course Outline 1. Pengantar Data Mining 2. Proses Data Mining

Slides:



Advertisements
Similar presentations
UNIT – 1 Data Preprocessing
Advertisements

Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.
UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
DATA PREPROCESSING Why preprocess the data?
1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Lecture-19 ETL Detail: Data Cleansing

Data Mining: Concepts and Techniques
Data Preprocessing.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
6/10/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data.
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Chapter 4 Data Preprocessing
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 1 Data Preprocessing
Chapter 2: Data Preprocessing
CS2032 DATA WAREHOUSING AND DATA MINING
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
CS685: Special Topics in Data Mining Jinze Liu September 9,
D ATA P REPROCESSING 1. C HAPTER 3: D ATA P REPROCESSING Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
Descriptive Exploratory Data Analysis III Jagdish S. Gangolly State University of New York at Albany.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
1 Data Mining: Concepts and Techniques Data Preprocessing.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

DATA MINING Handling Missing Attribute Values and Knowledge Discovery Shahzeb Kamal Olov Junker AmsterdamUppsala HASCO SHAH - OLOV.
Data Cleaning Data Cleaning Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data.
January 17, 2016Data Mining: Concepts and Techniques 1 What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting ( non-trivial,
Data Mining: Concepts and Techniques — Chapter 2 —
Managing Data for DSS II. Managing Data for DS Data Warehouse Common characteristics : –Database designed to meet analytical tasks comprising of data.
February 18, 2016Data Mining: Babu Ram Dawadi1 Chapter 3: Data Preprocessing Preprocess Steps Data cleaning Data integration and transformation Data reduction.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Mining What is to be done before we get to Data Mining?
1 Web Mining Faculty of Information Technology Department of Software Engineering and Information Systems PART 4 – Data pre-processing Dr. Rakan Razouk.
Knowledge-based systems Sanaullah Manzoor CS&IT, Lahore Leads University
Data Mining: Data Prepossessing What is to be done before we get to Data Mining?
Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.
Data Mining Modified from
Data Mining: Concepts and Techniques
Data Mining: Data Preparation
Data Mining Techniques and Applications
Noisy Data Noise: random error or variance in a measured variable.
UNIT-2 Data Preprocessing
Data Mining Waqas Haider Bangyal.
©Jiawei Han and Micheline Kamber Department of Computer Science
Data Mining – Intro.
©Jiawei Han and Micheline Kamber Department of Computer Science
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Classification and Prediction
Data Preprocessing Modified from
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Chapter 1 Data Preprocessing
©Jiawei Han and Micheline Kamber
Data Mining: Concepts and Techniques
Lecture 1: Descriptive Statistics and Exploratory
Data Mining Data Preprocessing
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
By Sandeep Patil, Department of Computer Engineering, I²IT
Resource: J. Han and other books
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel
Data Proprocessing — Chapter 2 —
Presentation transcript:

Course Outline 1. Pengantar Data Mining 2. Proses Data Mining 8. Text Mining 7. Algoritma Estimasi dan Forecasting 6. Algoritma Asosiasi 5. Algoritma Klastering 4. Algoritma Klasifikasi 3. Persiapan Data 2. Proses Data Mining 1. Pengantar Data Mining

3. Persiapan Data 3.1 Data Preprocessing 3.2 Data Cleaning 3.3 Data Reduction 3.4 Data Transformation and Data Discretization 3.5 Data Integration

3.1 Data Preprocessing

CRISP-DM romi@romisatriawahono.net Object-Oriented Programming http://romisatriawahono.net

Why Preprocess the Data? Measures for data quality: A multidimensional view Accuracy: correct or wrong, accurate or not Completeness: not recorded, unavailable, … Consistency: some modified but some not, … Timeliness: timely update? Believability: how trustable the data are correct? Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing Data cleaning Fill in missing values Smooth noisy data Identify or remove outliers Resolve inconsistencies Data reduction Dimensionality reduction Numerosity reduction Data compression Data transformation and data discretization Normalization Concept hierarchy generation Data integration Integration of multiple databases or files

3.2 Data Cleaning

Data Cleaning Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., Occupation=“ ” (missing data) Noisy: containing noise, errors, or outliers e.g., Salary=“−10” (an error) Inconsistent: containing discrepancies in codes or names e.g., Age=“42”, Birthday=“03/07/2010” Was rating “1, 2, 3”, now rating “A, B, C” Discrepancy between duplicate records Intentional (e.g., disguised missing data) Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry not register history or changes of the data Missing data may need to be inferred

Contoh Missing Data Dataset: MissingDataSet.csv

MissingDataSet.csv Jerry is the marketing manager for a small Internet design and advertising firm Jerry’s boss asks him to develop a data set containing information about Internet users The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market their services to this group of users To accomplish his assignment, Jerry creates an online survey and places links to the survey on several popular Web sites Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his data needs to be denormalized He also notes that some observations in the set are missing values or they appear to contain invalid values Jerry realizes that some additional work on the data needs to take place before analysis begins.

Relational Data

View of Data (Denormalized Data)

Contoh Missing Data Dataset: MissingDataSet.csv

How to Handle Missing Data? Ignore the tuple: Usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably Fill in the missing value manually: Tedious + infeasible? Fill in it automatically with A global constant: e.g., “unknown”, a new class?! The attribute mean The attribute mean for all samples belonging to the same class: smarter The most probable value: inference-based such as Bayesian formula or decision tree