
Lab4 CPIT 440 Data Mining and Warehouse

Lab4: Outline
- Data Mining Process
- Data Gathering and Preparation (Preprocessing)
  - Techniques of Data Preprocessing
    - Data Integration Techniques
    - Data Cleaning Techniques
    - Data Transformation Techniques
    - Data Discretization Techniques
  - Definitions and Exercises

Data Mining Process

Data Gathering and Preparation
The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. Data preparation can significantly improve the information that can be discovered through data mining.

Data Gathering and Preparation
The data preparation phase covers all the tasks involved in creating the case table you will use to build the model. Tasks include data cleansing, binning, and transformation. For example:
- you might transform a DATE_OF_BIRTH column to AGE;
- you might insert the average income in cases where the INCOME column is null.
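The two example tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ODM builds a case table: the sample rows, the reference date, and the helper names age_on and prepare_cases are all hypothetical.

```python
from datetime import date
from statistics import mean


def age_on(date_of_birth, today):
    """Return the age in whole years on the given reference date."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def prepare_cases(rows, today):
    """Derive AGE from DATE_OF_BIRTH and replace null INCOME with the average."""
    known_incomes = [r["INCOME"] for r in rows if r["INCOME"] is not None]
    average_income = mean(known_incomes)
    prepared = []
    for r in rows:
        prepared.append({
            "AGE": age_on(r["DATE_OF_BIRTH"], today),
            "INCOME": average_income if r["INCOME"] is None else r["INCOME"],
        })
    return prepared


# Made-up sample rows, for illustration only.
rows = [
    {"DATE_OF_BIRTH": date(1980, 5, 17), "INCOME": 40000},
    {"DATE_OF_BIRTH": date(1995, 12, 1), "INCOME": None},
    {"DATE_OF_BIRTH": date(1970, 1, 30), "INCOME": 60000},
]
prepared = prepare_cases(rows, today=date(2015, 6, 10))
```

The null INCOME in the second row is replaced by the average of the two known incomes, and every row gains an AGE value in place of the date of birth.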

Data Preprocessing Techniques
- Data Integration Techniques:
  - Correlation (numerical data) using Excel
  - Correlation (categorical data, chi-square test) using Excel
- Data Cleaning Techniques:
  - Fill the missing values using ODM
  - Outlier treatment for reducing noise using ODM
- Data Transformation Technique:
  - Normalization using ODM
- Data Discretization Technique:
  - Discretization using ODM

Data Integration Technique
Definition: Sometimes too much information can reduce the effectiveness of data mining. Data sets with many attributes may contain groups of attributes that are:
- Irrelevant attributes, which simply add noise to the data and affect model accuracy.
  - Noise increases the size of the model and the time and system resources needed for model building and scoring.

Data Integration Technique
- Or correlated attributes that may actually be measuring the same underlying feature.
  - Their presence together in the build data can skew the logic of the algorithm and affect the accuracy of the model.
To minimize the effects of noise, a technique such as correlation analysis is sometimes a desirable preprocessing step for data mining.

Data Integration Technique
Exercise: Correlation (numerical data) using Excel.
- Open the Excel file Corr.xlsx.
- Correlation results will always be between -1 and 1:
  - 1 = perfect positive correlation
  - 0 = no linear correlation
  - -1 = perfect negative correlation
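For readers who want to check the Excel result by hand, the following is a minimal sketch of the Pearson correlation coefficient, which is what Excel's CORREL function computes. The slides do not say which Excel function the lab uses, so treating it as Pearson correlation is an assumption; the function name pearson and the sample columns are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (dev_x * dev_y)


# Made-up columns, for illustration only.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])     # moves together
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # moves in opposite directions
unrelated = pearson([1, 2, 3], [1, 2, 1])        # no linear relationship
```

Two attributes whose coefficient is close to 1 or -1 are candidates for measuring the same underlying feature, so one of them can often be dropped before model building.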

Data Cleaning Technique
1. Fill the Missing Values using ODM:
- When building or applying a model, Oracle Data Mining automatically replaces missing values of
  - numerical attributes with the mean, the maximum or minimum, a specific value, or zero;
  - categorical attributes with the mode.
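The replacement rules above can be sketched as follows. This is an illustrative sketch only, not ODM's implementation: missing values are assumed to be represented as None, and the function names and strategy labels are hypothetical.

```python
from collections import Counter
from statistics import mean


def fill_numerical(values, strategy="mean", constant=0):
    """Replace None in a numerical column using the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "min":
        fill = min(present)
    elif strategy == "max":
        fill = max(present)
    elif strategy == "value":
        fill = constant  # a specific value; the default of 0 gives zero-fill
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [fill if v is None else v for v in values]


def fill_categorical(values):
    """Replace None in a categorical column with the most frequent value (mode)."""
    present = [v for v in values if v is not None]
    mode_value = Counter(present).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Made-up columns, for illustration only.
filled_mean = fill_numerical([10, None, 20, 30])
filled_zero = fill_numerical([10, None, 20, 30], strategy="value", constant=0)
filled_mode = fill_categorical(["own", "rent", None, "own"])
```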

Exercise
- Open ODM and import the file demo_missing.csv. Look at the attribute length_of_residence in this file: some of its values are missing.
- Now we will apply a data cleaning technique to fill in the missing data. In ODM, open Data → Transform → Missing Value.

Exercise
- This will open the Missing Value Transformation Wizard.

Exercise
- In the 4th step of the wizard, select the column (attribute) to which you are going to apply the missing value technique, then press the Transform button.
- You will see three options; select Replace With - Mean.
- Continue with the Next button until you finish.

Exercise
Use a histogram to see the difference between the data with missing values and the data after the missing values are filled in.
[Figure: two histograms, labeled "With Missing" and "After solving Missing"]

Data Cleaning Technique
2. Outlier Treatment for Reducing Noise using ODM:
- A value is considered an outlier if it deviates significantly from most other values in the column.
- The presence of outliers can skew the data and result in an inaccurate model.
- Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
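A minimal sketch of standard-deviation-based outlier treatment is shown below. It assumes that an outlier is any value more than k standard deviations from the mean, that clipping replaces an outlier with the nearest edge value, and that trimming replaces it with a null (None). The function name, the cutoff k, and the sample data are hypothetical; this is not ODM's implementation.

```python
from statistics import mean, pstdev


def treat_outliers(values, k=3.0, method="clip"):
    """Clip or trim values lying more than k standard deviations from the mean."""
    if method not in ("clip", "trim"):
        raise ValueError("method must be 'clip' or 'trim'")
    center = mean(values)
    spread = pstdev(values)  # population standard deviation of the column
    low, high = center - k * spread, center + k * spread
    treated = []
    for v in values:
        if low <= v <= high:
            treated.append(v)                          # ordinary value, kept as is
        elif method == "clip":
            treated.append(low if v < low else high)   # replace with edge value
        else:
            treated.append(None)                       # replace with null
    return treated


# Made-up column with one value far from the rest, for illustration only.
years = [2, 3, 4, 3, 2, 4, 3, 3, 50]
clipped = treat_outliers(years, k=2.0, method="clip")
trimmed = treat_outliers(years, k=2.0, method="trim")
```

In a column this small the outlier itself inflates the standard deviation, which is why the example uses k=2.0; with k=3.0 the value 50 would fall inside the limits and be left unchanged.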

Exercise
- Import the file demo_outliers.csv. Look at the attribute years_details_listed in this file: it contains some outliers (noise), that is, values that are very far from the others.

Exercise
- Now we will apply a data cleaning technique to reduce this noise in the data.
- Open Data → Transform → Outlier Treatment.

Exercise
- This will open the Outlier Treatment Transformation Wizard.
- In the 4th step of the wizard, select the column (attribute) to which you are going to apply the outlier treatment technique.
- Then press the Std. Deviation button and select whether outliers are replaced with edge values or null values.

Exercise
- Continue with the Next button until you finish.

Exercise
Use a histogram to see the difference between the noisy data and the data after outlier treatment is applied.

Data Transformation Technique: Normalization using ODM
- Normalization is a technique that transforms numerical values into a specific range, such as [-1.0, 1.0] or [0.0, 1.0].
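Min-max scaling is one common way to perform this kind of normalization: each value v is mapped to new_min + (v - min) / (max - min) * (new_max - new_min). The sketch below illustrates that formula; the function name and sample values are hypothetical, and it is not ODM's implementation.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize a column whose values are all equal")
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]


# Made-up column, for illustration only.
incomes = [10, 20, 30]
unit_range = min_max_normalize(incomes)                 # target range [0.0, 1.0]
signed_range = min_max_normalize(incomes, -1.0, 1.0)    # target range [-1.0, 1.0]
```

Because the mapping is linear, the shape of the histogram is unchanged; only the scale of the horizontal axis differs.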

Exercise
- Import the file demo_original.csv. Look at the attribute family_income_indicator in this file; we will apply the normalization technique to it.

Exercise
- Open Data → Transform → Normalize.
- This will open the Normalize Transformation Wizard.
- In the 3rd step of the wizard, select the column (attribute) to which you are going to apply the normalization technique.
- Then press the Define button and select the min-max transformation algorithm.

Exercise
- Continue with the Next button until you finish.

Exercise
Use a histogram to compare the data before and after normalization.

Data Discretization Technique: Discretization using ODM
- Discretization, also called binning, is a technique for reducing the cardinality of continuous and discrete data.
- It groups related values together in bins to reduce the number of distinct values.
- Discretization can improve resource utilization and model build response time dramatically without significant loss in model quality.
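Equal-width binning is one simple way to discretize a numerical column: the range from the minimum to the maximum is divided into bins of the same width, and each value is replaced by the index of its bin. The sketch below is illustrative only; the function name and sample values are hypothetical, and it is not ODM's implementation.

```python
def equal_width_bins(values, n_bins=10):
    """Return, for each value, the index (0 .. n_bins - 1) of its equal-width bin."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)  # a constant column fits in a single bin
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would land in bin n_bins, so fold it into the last bin.
        bins.append(min(index, n_bins - 1))
    return bins


# Made-up column, for illustration only.
incomes = [0, 5, 10, 55, 99, 100]
binned = equal_width_bins(incomes, n_bins=10)
```

After binning, the column has at most n_bins distinct values, which is the reduction in cardinality described above.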

Exercise
- Import the file demo_original.csv. Look at the attribute family_income_indicator in this file; we will apply the discretization technique to it.
- Open Data → Transform → Discretize.
- This will open the Discretize Transformation Wizard.
- In the 4th step of the wizard, select the column (attribute) to which you are going to apply the discretization technique.
- Then press the Equal Width button and enter 10 as the number of bins.

Exercise
- Continue with the Next button until you finish.

Exercise
Use a histogram to see the difference before and after discretization.
[Figure: two histograms, labeled "Before" and "After"]