Data Preprocessing: Data Reduction Techniques Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

Slides:



Advertisements
Similar presentations
UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
Advertisements

Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)

Data Mining: Concepts and Techniques
Data Preprocessing.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
6/10/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data.
1 Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge Brief introduction to lectures.
Data Preprocessing CS 536 – Data Mining These slides are adapted from J. Han and M. Kamber’s book slides (
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
Business Systems Intelligence: 2. Data Preparation
Basic Data Mining Techniques
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Chapter 4 Data Preprocessing
A dynamic-programming algorithm for hierarchical discretization of continuous attributes Amit Goyal (15 st April 2008) Department of Computer Science The.
Techniques and Data Structures for Efficient Multimedia Similarity Search.
Spatial and Temporal Data Mining
Data Preprocessing.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
2015年7月2日星期四 2015年7月2日星期四 2015年7月2日星期四 Data Mining: Concepts and Techniques1 Data Transformation and Feature Selection/Extraction Qiang Yang Thanks: J.
Data Mining: Concepts and Techniques — Chapter 2 —
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 1 Data Preprocessing
The Knowledge Discovery Process; Data Preparation & Preprocessing
Chapter 3: Cluster Analysis  3.1 Basic Concepts of Clustering  3.2 Partitioning Methods  3.3 Hierarchical Methods The Principle Agglomerative.
Evaluating Performance for Data Mining Techniques
8/24/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation.
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
September 5, 2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
Lecture 3: Data Preprocessing Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 3 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign &
Data Preprocessing Pertemuan 03 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb
D ATA P REPROCESSING 1. C HAPTER 3: D ATA P REPROCESSING Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization.
Preprocessing of Lifelog September 18, 2008 Sung-Bae Cho 0.
The Knowledge Discovery Process; Data Preparation & Preprocessing
Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data Complex data analysis/mining may take a very long time.
Classification and Prediction Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot Readings: Chapter 6 – Han and Kamber.
2015年11月6日星期五 2015年11月6日星期五 2015年11月6日星期五 Data Mining: Concepts and Techniques1 Data Preprocessing — Chapter 2 —
November 24, Data Mining: Concepts and Techniques.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Mining Lecture 6. Course Syllabus Case Study 1: Working and experiencing on the properties of The Retail Banking Data Mart (Week 4 – Assignment1)
2016年1月17日星期日 2016年1月17日星期日 2016年1月17日星期日 Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining and Decision Support
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Waqas Haider Bangyal. Classification Vs Clustering In general, in classification you have a set of predefined classes and want to know which class a new.
Bzupages.comData Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 3 — ©Jiawei Han and Micheline.
1 Web Mining Faculty of Information Technology Department of Software Engineering and Information Systems PART 4 – Data pre-processing Dr. Rakan Razouk.
June 26, C.Eng 714 Spring  Data in the real world is dirty  incomplete: lacking attribute values, lacking certain attributes of interest,
Data Mining – Introduction (contd…) Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Mining: Concepts and Techniques
Data Transformation: Normalization
Chapter 7. Classification and Prediction
Noisy Data Noise: random error or variance in a measured variable.
UNIT-2 Data Preprocessing
The BIRCH Algorithm Davitkov Miroslav, 2011/3116
Data Preprocessing Modified from
©Jiawei Han and Micheline Kamber
Dedicated to the memory of
Data Mining: Concepts and Techniques
Data Transformations targeted at minimizing experimental variance
Data Mining Data Preprocessing
Data Mining: Concepts and Techniques
Resource: J. Han and other books
Presentation transcript:

Data Preprocessing: Data Reduction Techniques Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot

2 Data Reduction Method (1): Regression  Regression Linear regression: Data are modeled to fit a straight line Often uses the least-square method to fit the line Y =  +  X Two parameters,  and  specify the line and are to be estimated by using the data at hand. using the least squares criterion to the known values of Y 1, Y 2, …, X 1, X 2, …. Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector Y = b0 + b1 X1 + b2 X2

3 Data Reduction Method (2): Histograms  Divide data into buckets and store average (sum) for each bucket  Singleton Represent a single attribute/frequency pair  Continuous ranges may also be represented  Partitioning rules: Equal-width: equal bucket range Equal-frequency (or equal-depth)

4 Data Reduction Method (3): Clustering  Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only  Can have hierarchical clustering and be stored in multi-dimensional index tree structures  There are many choices of clustering definitions and clustering algorithms

5 Data Reduction Method (3): Clustering..  consider the root of a B+-tree, with pointers to the data keys 986, 3396, 5411, 8392, and  Suppose that the tree contains 10,000 tuples with keys ranging from 1 to  The data in the tree can be approximated by an equal-frequency histogram of six buckets for the key ranges 1 to 985, 986 to 3395, 3396 to 5410, 5411 to 8391, 8392 to 9543, and 9544 to  Each bucket contains roughly 10,000/6 items.

6 Sampling  Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data  Possible samples Simple random sample without replacement (size n)  All tuples are equally likely Simple random sample with replacement (size n) Cluster sample  Retrieving clusters samples of tuples Stratified samples  Approximate the percentage of each class (or subpopulation of interest) in the overall database  Used in conjunction with skewed data

7 Sampling…

8 Discretization and Concept Hierarchy  Discretization reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.  Concept hierarchies reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

9 Discretization and Concept Hierarchy Generation for Numeric Data  Typical methods: All the methods can be applied recursively Binning  Top-down split, unsupervised Histogram analysis  Top-down split, unsupervised Clustering analysis  Either top-down split or bottom-up merge, unsupervised

10 Concept Hierarchy Generation for Categorical Data  Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts street < city < state < country  Specification of a hierarchy for a set of values by explicit data grouping {Urbana, Champaign, Chicago} < Illinois  Specification of only a partial set of attributes E.g., only street < city, not others  Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values E.g., for a set of attributes: {street, city, state, country}

11 Automatic Concept Hierarchy Generation  Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is placed at the lowest level of the hierarchy Exceptions, e.g., weekday, month, quarter, year country province_or_ state city street 15 distinct values 365 distinct values 3567 distinct values 674,339 distinct values