
1 Data Mining: Concepts and Techniques — Chapter 4 — Classification

2 What Is Classification? The goal of data classification is to organize and categorize data in distinct classes –A model is first created based on the data distribution –The model is then used to classify new data –Given the model, a class can be predicted for new data Classification = prediction for discrete and nominal values –With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball

3 Prediction, Clustering, Classification What is Prediction? –The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes –models continuous-valued functions, i.e., predicts unknown or missing values –A model is first created based on the data distribution –The model is then used to predict future or unknown values Supervised vs. Unsupervised Classification –Supervised Classification = Classification We know the class labels and the number of classes –Unsupervised Classification = Clustering We do not know the class labels and may not know the number of classes

4 Classification and Prediction Typical applications –Credit approval –Target marketing –Medical diagnosis –Fraud detection

5 Classification: 3 Step Process 1. Model construction (Learning): –Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label –The set of all records used for construction of the model is called the training set –The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees 2. Model Evaluation (Accuracy): –Estimate the accuracy rate of the model based on a test set –The known label of each test sample is compared with the classified result from the model –Accuracy rate: percentage of test set samples correctly classified by the model –The test set is independent of the training set; otherwise over-fitting will occur 3. Model Use (Classification): –The model is used to classify unseen instances (assigning class labels) –Predict the value of an actual attribute

6 Process (1): Model Construction. Training data is fed to classification algorithms, which produce a classifier (model), e.g. the rule IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
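As a rough sketch of how such a learned rule would be applied in code, the snippet below hard-codes the rule from this slide; the record fields and sample records are illustrative, not taken from a real dataset.

```python
# Minimal sketch: applying the learned rule
#   IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
# Field names and sample records are illustrative only.

def classify_tenured(record):
    """Return the predicted class label ('yes'/'no') for one record."""
    if record["rank"] == "professor" or record["years"] > 6:
        return "yes"
    return "no"

sample_records = [
    {"name": "Mike", "rank": "assistant prof", "years": 3},
    {"name": "Mary", "rank": "professor", "years": 2},
    {"name": "Jim", "rank": "associate prof", "years": 7},
]

for r in sample_records:
    print(r["name"], "->", classify_tenured(r))
```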

7 Process (2): Using the Model in Prediction. The classifier is applied to testing data to estimate accuracy, and then to unseen data, e.g. (Jeff, Professor, 4): Tenured?

8 Accuracy of Supervised Learning Holdout Approach –A certain amount of data (usually one-third) is reserved for testing and the remainder (two-thirds) for training K-Fold Cross Validation –Split the data into k subsets of equal size; each subset in turn is used for testing and the remainder for training
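A minimal sketch of k-fold cross validation, assuming the classifier is exposed through two hypothetical callables, train_fn and predict_fn; it only illustrates how the accuracy estimate is obtained.

```python
import random

def k_fold_accuracy(records, labels, k, train_fn, predict_fn, seed=0):
    """Estimate classifier accuracy with k-fold cross validation.

    train_fn(train_records, train_labels) -> model   (assumed interface)
    predict_fn(model, record) -> predicted label     (assumed interface)
    """
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized subsets

    accuracies = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, records[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```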

9 Issues: Data Preparation Data cleaning –Preprocess data in order to reduce noise and handle missing values Relevance analysis (feature selection) –Remove the irrelevant or redundant attributes Data transformation –Normalize data

10 Issues: Evaluating Classification Methods Accuracy –classifier accuracy: predicting class label –predictor accuracy: guessing value of predicted attributes Speed –time to construct the model (training time) –time to use the model (classification/prediction time) Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability –understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

11 Other Issues Function complexity and amount of training data –If the true function is simple, then a learning algorithm will be able to learn it from a small amount of data; otherwise the function will only be learnable from a very large amount of training data Dimensionality of the input space Noise in the output values Heterogeneity of the data –If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others Dependency of features

12 Decision Tree - Review of Basics What exactly is a Decision Tree? A tree where each branching node represents a choice between two or more alternatives, with every branching node being part of a path to a leaf node (the bottom of the tree). The leaf node represents a decision derived from the tree for the given input. ● How can Decision Trees be used to classify instances of data? Instead of representing decisions, leaf nodes represent a particular classification of a data instance, based on the given set of attributes (and their discrete values) that define the instance of data, which can be thought of as something like a relational tuple.

13 Review of Basics (cont’d) For a basic understanding of ID3 it is important that data instances have Boolean or discrete values for their attributes, although there are extensions of ID3 that deal with continuous data.

14 How does ID3 relate to Decision Trees, then? ID3, or Iterative Dichotomiser 3 Algorithm, is a Decision Tree learning algorithm. The name is correct in that it creates Decision Trees for “dichotomizing” data instances, or classifying them discretely through branching nodes until a classification “bucket” is reached (leaf node). By using ID3 and other machine-learning algorithms from Artificial Intelligence, expert systems can engage in tasks usually done by human experts, such as doctors diagnosing diseases by examining various symptoms (the attributes) of patients (the data instances) in a complex Decision Tree. Of course, accurate Decision Trees are fundamental to Data Mining and Databases.

15 Other Decision Trees: ID3, J48, C4.5, SLIQ, SPRINT, PUBLIC, RainForest, BOAT, Data Cube-based Decision Tree

16 Tree Presentation

17 Decision Tree Induction: Training Dataset

18 Output: A Decision Tree for “buys_computer” [figure: the root tests age?; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch is a “yes” leaf, and the >40 branch tests credit rating? (fair → yes, excellent → no)]

19 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) –Tree is constructed in a top-down recursive divide-and-conquer manner –At start, all the training examples are at the root –Attributes are categorical (if continuous-valued, they are discretized in advance) –Examples are partitioned recursively based on selected attributes –Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning –All samples for a given node belong to the same class –There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf –There are no samples left

20 Entropy Let S be the training set, p_+ the proportion of positive examples in S, and p_- the proportion of negative examples in S. Entropy measures the impurity of S: Entropy(S) = −p_+ log2(p_+) − p_- log2(p_-) For non-Boolean classification with c classes: Entropy(S) = −Σ_{i=1..c} p_i log2(p_i)
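A small sketch of the entropy formula above, computed from a list of class labels (it works for Boolean and non-Boolean classification alike); the function name is my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i), over the class proportions in S."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# Example: 9 positive and 5 negative examples -> about 0.940 bits
print(entropy(["yes"] * 9 + ["no"] * 5))
```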

21 Attribute Selection Measure: Information Gain (ID3/C4.5) Gain(S, A) = expected reduction in entropy due to sorting (partitioning) on attribute A: Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
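Continuing the sketch, information gain for a categorical attribute can be computed with the entropy() helper defined above; records are assumed to be dicts mapping attribute names to values.

```python
from collections import defaultdict

def information_gain(records, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    partitions = defaultdict(list)  # value of A -> labels of matching records
    for record, label in zip(records, labels):
        partitions[record[attribute]].append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```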

22 Attribute Selection: Information Gain. Class P: buys_computer = “yes”; Class N: buys_computer = “no” [worked computation of Gain(age) on the training dataset]

23 Attribute Selection: Information Gain

24 ID3 Algorithm
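The ID3 pseudocode on this slide appeared as a figure; as a rough stand-in, the sketch below builds a nested-dict tree recursively, reusing the entropy() and information_gain() helpers from the earlier blocks (tie-breaking and empty-branch handling are simplified).

```python
from collections import Counter

def id3(records, labels, attributes):
    """Return a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    # Stopping conditions: pure node, or no attributes left -> majority class.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Greedy step: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(records, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in records):
        subset = [(r, l) for r, l in zip(records, labels) if r[best] == value]
        sub_records = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_records, sub_labels, remaining)
    return tree
```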

25 Computing Information-Gain for Continuous-Valued Attributes Let attribute A be a continuous-valued attribute Must determine the best split point for A –Sort the values of A in increasing order –Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1} –The point with the minimum expected information requirement for A is selected as the split-point for A Split: –D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
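A sketch of the split-point search described above for one continuous attribute: sort the values, try each adjacent-pair midpoint, and keep the midpoint with the minimum expected information requirement. It reuses entropy() from the earlier block; the function name is mine.

```python
def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing the information requirement."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values yield no usable midpoint
        midpoint = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= midpoint]
        right = [l for v, l in pairs if v > midpoint]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_point, best_info = midpoint, info
    return best_point, best_info
```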

26 Gain Ratio for Attribute Selection and Cost of Attributes What happens if you choose Day as the root? An attribute like Day has a distinct value for nearly every example, so information gain is biased toward it; the gain ratio corrects for this by normalizing the gain by the attribute's split information: GainRatio(A) = Gain(A) / SplitInfo(A) Attributes with cost: attribute selection can also take the measurement cost of each attribute into account

27 Overfitting and Tree Pruning Overfitting: An induced tree may overfit the training data –Too many branches, some may reflect anomalies due to noise or outliers –Poor accuracy for unseen samples Two approaches to avoid overfitting –Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold –Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees Use a set of data different from the training data to decide which is the “best pruned tree”

28 Enhancements to Basic Decision Tree Induction Allow for continuous-valued attributes –Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values –Assign the most common value of the attribute –Assign probability to each of the possible values Attribute construction –Create new attributes based on existing ones that are sparsely represented –This reduces fragmentation, repetition, and replication
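As a small illustration of the first missing-value strategy above (assign the most common value of the attribute), assuming missing entries are encoded as None:

```python
from collections import Counter

def fill_missing_with_mode(records, attribute):
    """Replace None values of `attribute` with its most common observed value."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    for r in records:
        if r[attribute] is None:
            r[attribute] = mode
    return records
```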

29 Instance Based Learning Lazy learning (e.g., instance-based learning): Simply stores the training data (or does only minor processing) and waits until it is given a test tuple Eager learning (e.g., decision trees, SVM, NN): Given a training set, constructs a classification model before receiving new (e.g., test) data to classify

30 Instance-based Learning It's very similar to a Desktop!!

31 Example: Image Scene Classification

32 Instance-based Learning When To Consider IBL: Instances map to points Fewer than 20 attributes per instance Lots of training data Advantages: Learn complex target functions (classes) Don't lose information Disadvantages: Slow at query time Easily fooled by irrelevant attributes

33 K-Nearest Neighbor Learning (k-NN) k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). –If k = 1, then the object is simply assigned to the class of its nearest neighbor.

34 K-Nearest Neighbor Learning (k-NN) - Cont. Usually Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric such as the overlap metric (or Hamming distance: the minimum number of substitutions required to change one string into the other, i.e., the number of errors that transformed one string into the other) can be used.

35 Distance or Similarity Measures Common Distance Measures: –Manhattan distance: d(x, y) = Σ_i |x_i − y_i| –Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)² ) –Cosine similarity: sim(x, y) = (x · y) / (‖x‖ ‖y‖)
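Minimal sketches of the three measures above for numeric feature vectors (plain Python lists); the function names are mine.

```python
import math

def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```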

36 KNN - Algorithm Key idea: just store all the training examples Nearest Neighbour: –Given query instance x_q, first locate the nearest training example x_n, then estimate f(x_q) ← f(x_n) K Nearest Neighbour: –Given x_q, take a vote among its k nearest neighbours (if the class value is discrete) –Take the mean of the f values of its k nearest neighbours (if f is real-valued)
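A minimal k-NN sketch following the steps above: store the training examples, then either take a majority vote (discrete class) or the mean of the k nearest targets (real-valued f). It reuses euclidean() from the previous block.

```python
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    distances = sorted((euclidean(p, query), label)
                       for p, label in zip(train_points, train_labels))
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

def knn_regress(train_points, train_targets, query, k):
    """Mean of the targets of the k nearest neighbours (real-valued f)."""
    distances = sorted((euclidean(p, query), t)
                       for p, t in zip(train_points, train_targets))
    return sum(t for _, t in distances[:k]) / k
```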

37 Figure: K Nearest Neighbors Example [figure: stored training set patterns marked X, an input pattern for classification, and dashed lines showing the Euclidean distance to the nearest three patterns]

38 [figure: eight example images, numbered one through eight] Which one belongs to Mondrian?

39 Training data and test instance [tables]

40 Normalization Min-max normalization to [new_min_A, new_max_A]: v′ = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A Z-score normalization (μ: mean, σ: standard deviation): v′ = (v − μ_A) / σ_A Normalization by decimal scaling: v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1
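Sketches of the three normalization formulas above, applied to a single numeric column held in a plain Python list; the function names are mine.

```python
import math

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """v' = (v - mean) / standard_deviation"""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling_normalize(values):
    """v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]
```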

41 Normalised training data and test instance [tables]

42 Distances of the test instance from the training data Classification: 1-NN → Yes, 3-NN → Yes, 5-NN → No, 7-NN → No

43 Example: Classify “Theatre”?

44 Nearest neighbors algorithms: illustration [figure: the 1-nearest neighbor assigns query q1 the concept represented by its nearest example e1; with 5 nearest neighbors, q1 is classified as negative]

45 Voronoi diagram [figure: query point q_f and its nearest neighbor q_i]

46 Variant of kNN: Distance-Weighted kNN We might want to weight nearer neighbors more heavily, e.g. with weight w_i = 1 / d(x_q, x_i)² Then it makes sense to use all training examples instead of just k
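A sketch of the distance-weighted variant using the common 1/d² weighting (an assumption on my part, since the slide does not give the weight explicitly); every training example votes, and a query that coincides with a stored example simply takes that example's label. It reuses euclidean() from the earlier block.

```python
from collections import defaultdict

def weighted_knn_classify(train_points, train_labels, query):
    """All training examples vote, each with weight 1 / distance^2."""
    votes = defaultdict(float)
    for point, label in zip(train_points, train_labels):
        d = euclidean(point, query)
        if d == 0:
            return label  # exact match with a stored example
        votes[label] += 1.0 / (d ** 2)
    return max(votes, key=votes.get)
```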

47 Difficulties with k-nearest neighbour algorithms Have to calculate the distance of the test case from all training cases There may be irrelevant attributes amongst the attributes – curse of dimensionality For instance, if we have 20 attributes but only 2 of them are relevant, the irrelevant attributes distort the results Solution: assign an appropriate weight to each attribute, e.g. chosen by cross-validation

48 How to choose “k” Large k: –less sensitive to noise (particularly class noise) –better probability estimates for discrete classes –larger training sets allow larger values of k Small k: –captures fine structure of the problem space better –may be necessary with small training sets A balance must be struck between large and small k As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

49 Assignment 1. Use Weka to classify and evaluate the Letter Recognition dataset with instance-based learning (IBL) 2. Write down 3 new challenges in kNN