1
Classification and Prediction (Data Mining: Concepts and Techniques)
2
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Other Classification Methods · Prediction · Performance evaluation · Summary
3
An Example of Classification (Fruit Classifier)
A classifier maps input features to a class label, e.g.: shape = round, color = red → Apple; shape = round, color = orange → Orange; shape = oval, color = red/orange/yellow → Mango.
4
A Graphical Model for Classifier
Diagram: input features x1, x2, …, xn → Classifier → output class label y.
5
Model Representation for Classifier
The model has the form y = f(x1, x2, …, xn): input features x1, …, xn → Classifier → class label y. Main problems in classifier model construction:
- What are x1, …, xn such that an effective f can be constructed?
- How do we obtain the model f given x1, …, xn?
- How do we collect training data with class labels y for building f?
6
Main Data Mining Techniques: Supervised Learning
Use a training set to construct a model for forecasting the outcome of future events. Two main types:
- Classification: constructs models that distinguish classes for future prediction. Applications: loan approval, customer classification, fingerprint recognition. Model representations: decision tree, neural network.
- Prediction: constructs models that predict unknown numerical values. Applications: price prediction for securities and assets. Model representations: neural network, linear regression.
7
Classification vs. Prediction
Use a training set to construct a model for forecasting the outcome of future events.
- Classification: predicts categorical class labels; constructs a classification model to classify new data.
- Prediction: predicts numerical values; constructs a continuous-valued function to predict unknown or missing values.
Typical applications: credit card approval, medical diagnosis and treatment, pattern recognition.
8
Classification vs. Prediction
Classifier: input features → output class label (categorical/nominal value). Predictor: input features → output predicted value (continuous value).
9
Classification—A Two-Step Process
Model construction
- Training set: the data set used for model construction
- Class label: each tuple/sample is assumed to belong to a predefined class (determined by the class label attribute)
- Model representation: classification rules, decision trees, or mathematical formulae
Model usage: classifying future unknown objects
- Test set: a data set independent of the training set
- Performance evaluation: the known class label y of each test sample is compared with the class y' predicted by the classification model
- Accuracy rate: the percentage of test samples correctly classified by the model (the proportion with y = y')
10
Classification—A Two-Step Process (Diagram)
Model construction: training data (I, O) → classification learning algorithm → classifier model. Model usage: input features → classifier model → output class label.
11
Classification Process (1): Example for Model Construction
Diagram: training data → learning algorithm → classifier (model). Input features: rank, years; class label: tenured, i.e., tenured = f(rank, years). Example learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
12
Classification Process (2): Example for Model Use
Diagram: testing data → classifier → performance evaluation (accuracy); unseen data, e.g., (Jeff, Professor, 4) → classifier → predict tenured?
13
Supervised vs. Unsupervised Learning
Supervised learning (classification)
- Aim: establish a classifier model
- Supervision: the training data (observations, measurements, etc.) are accompanied by class labels indicating the class of each observation
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Aim: establish the classes or clusters in the data
14
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Classification based on concepts from association rule mining · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
15
Issues regarding classification & prediction
1) Data preparation (data preprocessing)
- Data cleaning: preprocess data to reduce noise and handle missing values
- Feature relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data
16
Issues regarding classification and prediction
2) Performance evaluation of classification methods
- Predictive accuracy
- Speed scalability (for big data analysis): time to construct the model; time to use the model
- Space scalability (for big data analysis): memory/disk required to construct/use the model
- Robustness: handling noise and missing values
- Interpretability: understanding and insight provided by the model
- Goodness of rules: size and compactness of the classification rules
17
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Classification based on concepts from association rule mining · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
18
Decision Tree Induction Algorithm (A Learning Algorithm for the Classification Model)
Decision tree induction
- Given: a set of training data (<I, O> = <x1, …, xn, y>)
- Aim: find f, where y = f(x1, x2, …, xn) and f is in decision-tree form; construct a minimal decision tree that effectively classifies future unknown samples
Decision tree representation: a flow-chart-like tree structure
- An internal node denotes a test on an attribute of a sample
- A branch represents an outcome of the test
- A leaf node represents a class label or a class distribution
19
Classification in Decision Tree Induction
Generation of a decision tree consists of two phases:
- Tree construction: the tree is built node by node in a top-down manner using the training examples
- Tree pruning: identify and remove branches or subtrees that reflect noise or outliers
Use of the decision tree: to classify an unknown sample, test the sample's attribute values against the decision tree.
20
An Example of Training Dataset ( For buys_PC )
The table shows the input features and the class label; the data follow an example from Quinlan's ID3.
21
A Decision Tree for Predicting buys_PC
Buys_PC = f(age, student, credit_rating). The tree tests age at the root (<=30, 31..40, >40); the <=30 branch then tests student (yes/no) and the >40 branch tests credit_rating (excellent/fair); each leaf gives the class label Buys_PC = yes/no.
22
Extracting Classification Rules from Trees
Rules are easier for humans to understand. Represent the knowledge as IF-THEN rules:
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a condition
- The leaf node holds the class prediction
Examples of extracted rules:
- IF age = "<=30" AND student = "no" THEN buys_computer = "no"
- IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
- IF age = "31…40" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
23
ID3 Algorithm for Decision Tree Induction
Assumption: attributes are categorical (continuous-valued attributes are discretized in advance).
Idea of the decision tree induction algorithm:
- The tree is constructed node by node in a top-down, recursive, divide-and-conquer manner
- Key: find the most discriminating attribute at each node
- At the start, all training samples are located at the root node
- At each node, the training samples at that node are used to select the most discriminating attribute on the basis of a heuristic or statistical measure (attribute selection measure)
- The training samples are then partitioned into branches according to their values of the selected attribute
24
Partitioning of Training Data at a Node of a Decision Tree
age, income, student, and credit_rating are tested; age is the most discriminating attribute. The samples are split by age into the branches <=30, 31..40, and >40, each of which is then partitioned further.
25
Stopping Conditions (for Decision Tree Induction)
Conditions for stopping the recursive data partitioning:
- All samples at a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
- There are no samples left
26
Attribute Selection Measures (Finding the most discriminating attribute at each node)
Information gain (ID3/C4.5)
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
- All attributes are assumed to be continuous-valued
- Assumes there are several possible split values for each attribute
- May need other tools, such as clustering, to obtain the candidate split values
- Can be modified for categorical attributes
27
Information Gain (of ID3/C4.5) (A Measure for Attribute Selection)
How do we find the most discriminating attribute for each node? Idea: choose the attribute with the highest information gain at each node.
Assume there are two classes, P and N, in the training examples, and let the set of training examples S contain p elements of class P and n elements of class N. The information (entropy) needed to decide whether an arbitrary example in S belongs to P or N is
I(p, n) = -(p/(p+n)) · log2(p/(p+n)) - (n/(p+n)) · log2(n/(p+n))
28
Information Gain (in Decision Tree Induction)
Assume that using attribute Ak the set S is partitioned into sets {S1, S2, …, Sv} (i.e., Ak has v values). If Si contains pi examples of P and ni examples of N, the split entropy (the expected information needed to classify objects across all subtrees of S) is
E(Ak) = Σ_{i=1..v} ((pi + ni)/(p + n)) · I(pi, ni)
The information gained by branching on Ak is
Gain(Ak) = I(p, n) - E(Ak)
Select the attribute with the maximal gain for this node (example on the next slide).
29
An Example (Attribute Selection by Information Gains)
Data: class P: buys_computer = "yes"; class N: buys_computer = "no"; I(p, n) = I(9, 5) = 0.940.
Compute the split entropy and information gain for each of the attributes age, income, student, and credit_rating at the root node. The gain for age is the largest, so attribute age is selected at the root node and the data are split on age.
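To make the measure concrete, here is a minimal Python sketch (not from the slides) that computes the entropy I(p, n) and the information gain of a categorical split; the per-branch counts passed to info_gain below are illustrative values, not the buys_PC table.

```python
import math

def entropy(p, n):
    """I(p, n): bits needed to decide the class of a sample drawn from p P's and n N's."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def info_gain(p, n, branches):
    """Gain(A) = I(p, n) - weighted sum of branch entropies.
    `branches` is a list of (p_i, n_i) counts, one pair per attribute value."""
    total = p + n
    split_entropy = sum((pi + ni) / total * entropy(pi, ni) for pi, ni in branches)
    return entropy(p, n) - split_entropy

print(round(entropy(9, 5), 3))                        # 0.940, as on the slide
# Illustrative 3-way split of the 9 P / 5 N samples:
print(round(info_gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))
```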
30
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the Gini index gini(T) is defined as
gini(T) = 1 - Σ_{j=1..n} pj²
where pj is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively by attribute Ak, the Gini index after splitting is
gini_k(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)
The attribute providing the smallest gini_k(T) is chosen to split the node (all possible split points for attribute Ak need to be enumerated).
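A matching sketch for the Gini measure, assuming the two-way split described above; the class counts used in the example calls are hypothetical.

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the fraction of class j in T."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(t1_counts, t2_counts):
    """Weighted Gini index after splitting T into T1 and T2."""
    n1, n2 = sum(t1_counts), sum(t2_counts)
    n = n1 + n2
    return n1 / n * gini(t1_counts) + n2 / n * gini(t2_counts)

# Hypothetical example: a 9 "yes" / 5 "no" node split into (6, 1) and (3, 4)
print(round(gini([9, 5]), 3))            # Gini of the unsplit node
print(round(gini_split([6, 1], [3, 4]), 3))
```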
31
Avoid Overfitting in ID3/C4.5
The generated tree may overfit the training data, resulting in poor accuracy on unseen samples; if there are too many branches, some may reflect anomalies due to noise or outliers. Two pruning approaches avoid overfitting:
- Prepruning: halt tree construction early, i.e., do not split a node if the split would cause the goodness measure to fall below a threshold (choosing an appropriate threshold is difficult)
- Postpruning: obtain a sequence of progressively pruned trees from a fully grown tree, then use a data set separate from the training data to decide which is the best pruned tree
32
Enhancements to basic decision tree induction
Allow continuous-valued attributes: define new discrete-valued attributes by dynamically partitioning the continuous attribute values into a set of discrete intervals.
Handle missing attribute values by assigning the most common value of the attribute, or by assigning a probability to each of the possible values.
33
Why decision tree induction
Decision tree induction is a classification learning algorithm; classification is a typical problem extensively studied by statisticians and machine learning researchers. Why decision tree induction for classification?
- Convertible to simple, easy-to-understand classification rules (IF-THEN rules)
- Can use SQL queries to access databases for each rule to find its associated data and coverage rate
- Classification accuracy comparable with other methods
- Relatively fast learning speed (compared with other classification methods)
34
Presentation of Classification Results
35
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Classification based on concepts from association rule mining · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
36
Bayesian Classification: Why?
- Probabilistic prediction and learning: predicts multiple hypotheses and calculates a probability for each
- Incremental learning: each training example incrementally increases or decreases the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they provide a benchmark against which other methods can be compared
37
Bayes Theorem
Given training data D, the posterior probability of a hypothesis h, denoted P(h|D), follows Bayes' theorem:
P(h|D) = P(D|h) · P(h) / P(D)
So the value of P(h|D) can be obtained from the values of P(h), P(D), and P(D|h).
38
Naive Bayes Classification for m classes
Find f for D, where y = f(x1, x2, …, xn) and f is in naive-Bayes-classifier form.
Definitions:
- X = <x1, x2, …, xn>: an n-dimensional sample
- Ci|X: the hypothesis "sample X is of class Ci"
- P(Ci|X): the probability that sample X is of class Ci; there are m such probabilities P(C1|X), P(C2|X), …, P(Cm|X)
- f: assign sample X to class Ci if P(Ci|X) is maximal among P(C1|X), …, P(Cm|X)
The naive Bayesian classifier f assigns an unknown sample X to class Ci if and only if P(Ci|X) > P(Cj|X) for all j ≠ i (X is most likely of class Ci according to the probabilities).
39
Estimating a-posteriori probabilities
How do we find the maximal one among P(C1|X), P(C2|X), …, P(Cm|X)?
- By Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X) for 1 ≤ i ≤ m
- P(X) is difficult to obtain, but it is the same constant when computing P(Ci|X) for all m classes
- So only the values P(X|Ci) · P(Ci) are required, for 1 ≤ i ≤ m
- P(Ci) = relative frequency of samples of class Ci, which can be computed from the training data
- Remaining problem: how to compute P(X|Ci)?
40
Naïve Bayesian Classification
Naïve assumption: attribute independence, i.e., P(X|Ci) = P(<x1, …, xn>|Ci) = P(x1|Ci) · … · P(xn|Ci).
Why this assumption? It makes P(X|Ci) computable: X itself may not appear in class Ci, so P(X|Ci) cannot be estimated directly, and the assumption requires only a minimal amount of training data.
- If the k-th attribute is categorical: P(xk|Ci) is estimated as the relative frequency of samples in class Ci having value xk for the k-th attribute (Ak = xk), 1 ≤ k ≤ n
- If the k-th attribute is continuous: P(xk|Ci) is estimated via a Gaussian density function (normal distribution, modeled by mean and variance) fitted to the k-th attribute values of the samples in class Ci
41
Play-tennis example (Predict playing tennis or not on a given day)
Given the training data below, predict whether to play tennis on a particular day. Classes: P (play), 9 records; N (not play), 5 records. Given an unseen input sample <rain, hot, high, false>, will tennis be played?
42
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>; we want to predict "play tennis or not?"
The problem is to test whether P(p|X) > P(n|X), i.e., whether P(X|p)·P(p) / P(X) > P(X|n)·P(n) / P(X) (Bayes' theorem).
Under naïve Bayesian classification we test whether P(X|p)·P(p) > P(X|n)·P(n), where
P(X|p)·P(p) = P(<rain, hot, high, false>|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
P(X|n)·P(n) = P(<rain, hot, high, false>|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
We need the values of these probabilities.
43
Play-tennis example: estimating P(p), P(n), and the conditional probabilities
Step 1: P(p) = 9/14, P(n) = 5/14
Step 2: conditional probabilities
- outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
- temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
- humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
- windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
For X = <rain, hot, high, false>, is P(X|p)·P(p) > P(X|n)·P(n)?
44
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>; predict "play tennis or not?"
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 ≈ 0.0106
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 ≈ 0.0183
Since P(X|p)·P(p) < P(X|n)·P(n), sample X is classified into class n (do not play).
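A minimal naive Bayes sketch in Python that reproduces this computation from the conditional probabilities on the previous slide; it is a sketch of this specific calculation, not a general-purpose classifier.

```python
# Class priors and conditional probabilities taken from the play-tennis slides.
prior = {"p": 9 / 14, "n": 5 / 14}
cond = {
    "p": {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "false": 6 / 9},
    "n": {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "false": 2 / 5},
}

def score(sample, cls):
    """P(X|class) * P(class) under the attribute-independence assumption."""
    result = prior[cls]
    for value in sample:
        result *= cond[cls][value]
    return result

x = ["rain", "hot", "high", "false"]
scores = {cls: score(x, cls) for cls in ("p", "n")}
print(scores)                          # ~0.0106 for p, ~0.0183 for n
print(max(scores, key=scores.get))     # "n": do not play
```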
45
The Independence Assumption
- Makes computation possible
- Yields optimal classifiers when the assumption is satisfied
- But it is seldom satisfied in practice, as attributes (variables) are often correlated
- Bayesian (belief) networks attempt to overcome this limitation by combining Bayesian reasoning with causal relationships between attributes; they require a larger sample size than a naive Bayes classifier
46
Bayesian Belief Networks (I)
Assumption: the variables FamilyHistory (FH) and Smoker (S) are correlated; the network also contains the variables LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea, with FamilyHistory and Smoker as the parents of LungCancer.
Conditional probability table for the variable LungCancer:
- (FH, S): P(LC) = 0.8, P(~LC) = 0.2
- (FH, ~S): P(LC) = 0.5, P(~LC) = 0.5
- (~FH, S): P(LC) = 0.7, P(~LC) = 0.3
- (~FH, ~S): P(LC) = 0.1, P(~LC) = 0.9
47
Bayesian Belief Networks (II)
- A Bayesian belief network allows class conditional independencies between subsets of the variables
- It is a graphical model of causal relationships
- Several classes of problems in learning Bayesian belief networks: given the network structure and all related variables => easy; given the network structure but only some related variables => hard; when the network structure is unknown => harder
48
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Discriminative Classification · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
49
Classification: A Mathematical Mapping
Binary classification as a mathematical mapping:
- Computes a 2-class categorical label; it is a binary function f: X → Y
- Mathematically, y = f(X), X ∈ X ≡ Rⁿ, y ∈ Y = {+1, –1} (or {0, 1}); X is the input, y the output
Example: classification of personal homepages or spam mail (an application of automatic document classification)
- y = +1 or –1 (1/0; yes/no; true/false; positive/negative)
- X = <x1, x2, …, xn>, a keyword-frequency vector for a Web page: x1 = count of keyword 1 (e.g., "homepage"), x2 = count of keyword 2 (e.g., "welcome"), …
50
Linear Classification Problems
In linear classification problems, classification is accomplished by a linear hyperplane.
Example: a 2-D binary classification problem in which each sample is a 2-D point; the data above the separating line belong to class 'x' and the data below it belong to class 'o', so the data can be linearly separated by that line.
Classifier examples: SVM, perceptron (an ANN).
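As an illustration of this kind of linear classifier, here is a minimal perceptron training sketch in Python/NumPy; the toy 2-D points and learning rate are hypothetical, not taken from the slide's figure.

```python
import numpy as np

# Hypothetical linearly separable 2-D data: +1 above the line x2 = x1, -1 below it.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.5], [1.0, 0.0], [2.0, 1.0], [3.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(2)
b = 0.0
eta = 0.1                                   # learning rate

for _ in range(100):                        # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi              # perceptron update rule
            b += eta * yi

print(w, b)
print(np.sign(X @ w + b))                   # matches y once the data are separated
```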
51
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by Backpropagation · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
52
Artificial Neural Networks (A Network of Artificial Neurons)
Advantages
- Prediction accuracy is generally high
- Robust: works even when training examples contain errors
- Output may be discrete, real-valued, or a vector of discrete or real-valued attributes
- Fast evaluation of the learned target function
Criticisms
- (Somewhat) long training time to obtain an optimal model
- The learned function (the weights) is difficult to interpret
- Prior domain knowledge is not easy to incorporate
53
Architecture of a Typical Artificial Neural Network
(Multi-layer Perceptron)
54
A Neuron as a Simple Computing Element
Figure: a neuron with connection weights on its inputs and a threshold (bias) θ applied before the activation function.
55
A Neuron as a Simple Computing Element
Steps of a neuron's computation:
- Compute the weighted sum of the input signals
- Compare the result with a threshold value θ
- Produce an output using a transfer (activation) function, e.g. a hard threshold: output +1 if the weighted sum ≥ θ, and –1 otherwise
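A small sketch of a single neuron's computation, assuming a hard-threshold (sign) activation; the weights, threshold, and inputs are hypothetical.

```python
import numpy as np

def neuron_output(inputs, weights, theta):
    """Weighted sum of inputs compared against the threshold theta."""
    net = np.dot(weights, inputs)       # step 1: weighted sum
    return 1 if net >= theta else -1    # steps 2-3: threshold + sign activation

x = np.array([0.5, -1.0, 2.0])          # hypothetical input signals
w = np.array([0.4, 0.7, 0.1])           # hypothetical connection weights
print(neuron_output(x, w, theta=0.2))   # net = -0.3 < 0.2, so the output is -1
```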
56
Various Activation Functions f of a Neuron
Figure: activation functions used for hidden neurons, output neurons (classification), and output neurons (function approximation).
57
Construction of Classification Model via Network Training
The objective of network training: obtain a set of connection weights that classifies almost all training tuples correctly.
Steps:
1. Initialize the weights with random values
2. Feed one of the training samples into the network
3. For each neuron, layer by layer: compute the net input to the neuron as the weighted sum of all its inputs, then compute its output value using the activation function
4. Compute the error by the backpropagation algorithm
5. Adjust the weights and biases according to the error
6. Go to step 2 until convergence
58
Backpropagation Training Algorithm
I) The input is propagated forward from the input nodes (outputs O_i) through the hidden layer (inputs I_j, outputs O_j) to the output nodes (outputs O_k). II) The weights are updated according to the errors propagated backward from the output layer (Err_k at output node k, Err_j at hidden node j), scaled by a learning rate. (I denotes a node's input, O its output.) A minimal training sketch follows.
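A compact backpropagation sketch for one hidden layer with sigmoid activations, in Python/NumPy; the network size, learning rate, and XOR-style toy data are assumptions for illustration, not the slide's example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # toy targets (XOR)

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden weights, biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights, biases
eta = 0.5                                       # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    # I) forward propagation
    h = sigmoid(X @ W1 + b1)                        # hidden-layer outputs O_j
    out = sigmoid(h @ W2 + b2)                      # output-layer outputs O_k
    # II) backward propagation of errors
    err_out = (y - out) * out * (1 - out)           # Err_k at output nodes
    err_hid = (err_out @ W2.T) * h * (1 - h)        # Err_j at hidden nodes
    # weight and bias updates
    W2 += eta * h.T @ err_out
    b2 += eta * err_out.sum(axis=0)
    W1 += eta * X.T @ err_hid
    b1 += eta * err_hid.sum(axis=0)

print(np.round(out, 2))   # approaches [[0], [1], [1], [0]] for most initializations
```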
59
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian classification · Classification by Backpropagation · Other classification methods · Prediction · Estimating classification accuracy · Summary
60
Other Classification Methods
- SVM (support vector machines)
- k-nearest-neighbor classifier
- Case-based reasoning
- Rough set approach
- Fuzzy set approaches
61
SVM—Support Vector Machines
- A classification method for both linear and nonlinear data
- For nonlinear data, a nonlinear mapping is used to transform the training data into a higher dimension
- In the new dimensionality, it searches for the optimal linear separating hyperplane (the "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this separating hyperplane using support vectors ("essential" training tuples) and margins (the margin width is defined by the support vectors)
62
SVM—History and Applications
- Vapnik and colleagues (1992): groundwork from Vapnik and Chervonenkis' statistical learning theory of the 1960s
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
- Used for both classification and prediction
- Applications: handwritten digit recognition, object recognition, speaker identification
63
SVM — General Concept (Find a decision boundary with a maximal margin)
Figure: decision boundary 2 (the one with the larger margin) is better than decision boundary 1.
64
SVM— Margins and Support Vectors
Figure: support vectors that define a small margin are worse; support vectors that define a large margin are better.
65
SVM — Case 1 When Data Is Linearly Separable
- Let D = {(X1, y1), …, (X|D|, y|D|)} be the set of training tuples, where each Xi is associated with the class label yi
- There are infinitely many lines (hyperplanes) separating the two classes; we want the best one, the one that minimizes classification error on unseen data
- SVM searches for the separating hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
66
SVM – Case 1 : Linearly Separable
- A separating hyperplane can be written as W · X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
- For a 2-D space, a line L: ax + by + c = 0 can be written as w0 + w1·x1 + w2·x2 = 0
- The hyperplanes defining the two sides of the margin: H1: w0 + w1·x1 + w2·x2 ≥ 1 for yi = +1, and H2: w0 + w1·x1 + w2·x2 ≤ –1 for yi = –1
- Support vectors: training tuples that fall on hyperplane H1 or H2 (the sides defining the margin)
- This becomes a constrained (convex) quadratic optimization problem: linear constraints with a quadratic objective function, solved by quadratic programming (QP) with Lagrange multipliers
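A short NumPy sketch of how a given hyperplane W·X + b = 0 classifies points and how the margin constraints from H1/H2 are checked; the weight vector, bias, and points are hypothetical (a real SVM would obtain W and b by solving the QP problem).

```python
import numpy as np

W = np.array([1.0, 1.0])     # hypothetical weight vector
b = -3.0                     # hypothetical bias

X = np.array([[1.0, 1.0], [1.0, 3.0], [4.0, 2.0], [1.5, 0.5]])
y = np.array([-1, +1, +1, -1])

scores = X @ W + b                   # signed distance (up to ||W||) from the hyperplane
print(np.sign(scores))               # predicted class labels
print(np.all(y * scores >= 1))       # True iff every point satisfies the H1/H2 margin constraints
print(2 / np.linalg.norm(W))         # margin width 2/||W|| for this hyperplane
```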
67
Why Is SVM Effective on High Dimensional Data?
- The complexity of an SVM classifier is characterized by the number of support vectors rather than by the dimensionality or size of the data
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
- If all other training examples were removed and SVM training repeated, the same separating hyperplane would be found
- The set of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
- Thus an SVM with a small number of support vectors can generalize well even when the dimensionality of the data is high
68
SVM — Case 2 Linearly Inseparable
Transform the original input data into a higher-dimensional space and search for a linear separating hyperplane there. Example: a 3-D input vector <x1, x2, x3> is mapped into a new 6-D space <z1, z2, z3, z4, z5, z6>, where z1 = x1, z2 = x2, z3 = x3 and z4–z6 are nonlinear combinations of the original coordinates (e.g., squares and cross products); the linear hyperplane is then sought in the 6-D space.
69
k-NN (k-Nearest Neighbor) Algorithm
- A sample X is represented as <x1, x2, …, xn>; each training sample corresponds to a point in an n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance: d(X, Z) = sqrt(Σ_{k=1..n} (xk – zk)²)
- Given an unknown sample xq, a k-NN classifier searches the sample space for the k training samples nearest to xq, and then decides the class of xq by majority vote among those k neighbors
- The value of k is chosen heuristically
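A minimal k-NN classifier sketch in Python following the description above (Euclidean distance plus majority vote); the tiny training set and query point are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, xq, k=3):
    """train: list of (feature_vector, class_label); xq: query sample."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority vote among the k nearest

# Hypothetical 2-D training samples
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((3.0, 3.5), "-"),
         ((4.0, 4.0), "-"), ((3.5, 2.5), "-")]
print(knn_classify(train, (2.0, 2.0), k=3))   # "+": two of the three nearest are "+"
```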
70
Discussion on the k-NN Algorithm
- The basic k-NN algorithm works only for numeric-valued data
- Enhancement: the distance-weighted k-NN algorithm weights the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors (again for numeric-valued data)
- Robust to noisy data, because it averages over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors can be dominated by many irrelevant attributes; to overcome this, stretch the axes or eliminate the least relevant attributes
71
Rough Set Approach
- Rough sets are used to approximately, or "roughly", define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Rough sets can also be used for feature reduction: a discernibility matrix is used to detect redundant attributes
72
Fuzzy Set Approaches
Example application: credit approval
- Credit approval rule: IF (years_employed >= 2) AND (income >= 50K) THEN credit = "approved"
- Problem: what if a customer has held a job for at least two years but her income is $49K? Should she be approved or not?
- Solution (fuzzy set approach): IF (years_employed is medium) AND (income is high) THEN credit is approved
73
Fuzzy Set Approaches (cont.)
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy membership values computed from a membership graph (an income of 49K is transformed into partial memberships in these categories)
- Each applicable rule of the rule set contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed with weights to make the decision
74
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
75
What Is Prediction?
- Prediction is similar to classification: step 1, construct a model; step 2, use the model to predict unknown values
- The major method for prediction is regression: linear and multiple regression, and nonlinear regression
- Other methods: artificial neural networks
- Main difference between prediction and classification: classification predicts categorical class labels, whereas prediction models continuous-valued functions
76
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + βX
- The two parameters α and β specify the line and are estimated from the training samples
- They are estimated by applying the least-squares criterion to the training samples (X1, Y1), (X2, Y2), …, (Xn, Yn)
Multiple regression: Y = b0 + b1·X1 + b2·X2 + … + bn·Xn
- Many nonlinear functions can be transformed into this form
Log-linear models, e.g., estimating a joint probability as a product of lower-order terms:
p(a, b, c, d) = α_abc · β_abd · γ_acd · δ_bcd, so log p(a, b, c, d) = log α_abc + log β_abd + log γ_acd + log δ_bcd
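A small least-squares sketch in NumPy that estimates α and β for Y = α + βX; the sample points are hypothetical.

```python
import numpy as np

# Hypothetical training samples (X_i, Y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: beta = cov(x, y) / var(x), alpha = mean(y) - beta * mean(x)
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)              # fitted intercept and slope

# Predict Y for a new X using the fitted line
print(alpha + beta * 6.0)
```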
77
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Classification based on concepts from association rule mining · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
78
Classifier Accuracy Measures
Confusion matrix (2 classes): for actual class P, predictions are true positives (TP, computed P') and false negatives (FN, computed N'); for actual class N, false positives (FP, computed P') and true negatives (TN, computed N').
Example (buy_computer):
classes | yes (computed) | no (computed) | total | recognition (%)
buy_computer = yes | 6954 | 46 | 7000 | 99.34
buy_computer = no | 412 | 2588 | 3000 | 86.27
total | 7366 | 2634 | 10000 | 95.42
- Accuracy of a classifier M, acc(M): the percentage of test samples correctly classified by M (built from the training set)
- Error rate (misclassification rate) of M = 1 – acc(M)
- Given m classes, the confusion-matrix entry CM_{i,j} is the number of samples of class i labeled by the classifier as class j
- Alternative measures (e.g., for cancer diagnosis): sensitivity = TP/P (true positive recognition rate); specificity = TN/N (true negative recognition rate); precision = TP/(TP + FP); accuracy = (TP + TN)/(P + N) = sensitivity·P/(P + N) + specificity·N/(P + N)
- This model can also be used for cost-benefit analysis
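The measures above, computed in Python from the buy_computer confusion matrix on this slide.

```python
TP, FN = 6954, 46      # buy_computer = yes  (P = 7000 actual positives)
FP, TN = 412, 2588     # buy_computer = no   (N = 3000 actual negatives)
P, N = TP + FN, FP + TN

sensitivity = TP / P                   # true positive recognition rate
specificity = TN / N                   # true negative recognition rate
precision = TP / (TP + FP)
accuracy = (TP + TN) / (P + N)

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f} "
      f"precision={precision:.4f} accuracy={accuracy:.4f}")
# sensitivity=0.9934 specificity=0.8627 precision=0.9441 accuracy=0.9542
```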
79
Error Measures for Prediction
Measure how far the predicted value is from the actual known value.
- Loss function: measures the error between yi and yi' (the predicted value); absolute error: |yi – yi'|; squared error: (yi – yi')²
- Test error: the average loss over the test set of d samples
- Mean absolute error (MAE): (1/d) Σ |yi – yi'|
- Mean squared error (MSE): (1/d) Σ (yi – yi')²
- Relative absolute error (RAE): Σ |yi – yi'| / Σ |yi – ȳ|, where ȳ is the mean of the actual values
- Relative squared error (RSE): Σ (yi – yi')² / Σ (yi – ȳ)²
- The mean squared error exaggerates the presence of outliers, so the (square) root mean squared error and, similarly, the root relative squared error are commonly used
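A short sketch computing these error measures for a hypothetical set of actual and predicted values.

```python
import math

actual = [3.0, 5.0, 2.5, 7.0]       # hypothetical y_i
pred = [2.5, 5.5, 2.0, 8.0]         # hypothetical y_i'
d = len(actual)
mean_y = sum(actual) / d

mae = sum(abs(a - p) for a, p in zip(actual, pred)) / d
mse = sum((a - p) ** 2 for a, p in zip(actual, pred)) / d
rmse = math.sqrt(mse)
rae = sum(abs(a - p) for a, p in zip(actual, pred)) / sum(abs(a - mean_y) for a in actual)

print(mae, mse, rmse, rae)
```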
80
Performance Evaluation of Classification (Methods for Estimating Average Classification Accuracy)
Partition (training-and-testing)
- Use two independent data sets: a training set (about 2/3) and a test set (about 1/3)
- Used for data sets with a large number of samples
k-fold cross-validation
- Randomly divide the data set into k subsets S1, S2, …, Sk
- At iteration i, subset Si is used as the test set and the remaining k – 1 subsets are used as training data
- Repeat k times in total and average the accuracies
- Used for data sets of moderate size
Leave-one-out
- Similar to k-fold cross-validation with k set to s, where s is the number of initial samples
- Used, like bootstrapping, for small data sets
81
10-fold Cross-Validation
The data set is randomly divided into 10 subsets (1–10). At iteration 10, subsets 1–9 are used to train the classifier and subset 10 is used as the test set; iterations 1 to 10 each hold out a different subset as the test set.
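A minimal k-fold cross-validation sketch in Python; `train_and_evaluate` stands in for whatever classifier training and accuracy measurement is being used, so the dummy evaluator below is a hypothetical placeholder.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition sample indices into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, train_and_evaluate, k=10):
    """Average accuracy over k iterations; fold i is the test set in iteration i."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds if f is not folds[i] for j in f]
        accuracies.append(train_and_evaluate(train, test))
    return sum(accuracies) / k

# Trivial usage with a dummy evaluator that just reports the test fold's fraction of class "+"
data = [("sample", "+")] * 60 + [("sample", "-")] * 40
dummy = lambda train, test: sum(1 for _, lbl in test if lbl == "+") / len(test)
print(cross_validate(data, dummy, k=10))
```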
82
Model Comparison by ROC Curves
- ROC (receiver operating characteristic) curves are used for visual comparison of the performance of classification models
- The vertical axis represents the TP (true positive) rate; the horizontal axis represents the FP (false positive) rate
- The plot also shows a diagonal line (the behavior of a random, coin-flipping model)
- In the figure, model 1 is better than model 2
83
Model Comparison by ROC Curves
- ROC analysis originated in signal detection theory
- It shows the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve (AUC) is a measure of model performance: a model with perfect accuracy has an area of 1.0
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
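A sketch that traces out an ROC curve and its AUC from classifier scores by sweeping the decision threshold; the scores and labels are hypothetical (ties between scores are not handled).

```python
def roc_points(labels, scores):
    """Return (FP rate, TP rate) points obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)   # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0, 1, 0]                    # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # hypothetical classifier scores
pts = roc_points(labels, scores)
print(pts)
print(auc(pts))   # 1.0 = perfect model, ~0.5 = the diagonal (random) model
```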
84
Chapter 6 Classification and Prediction
What is classification? What is prediction? · Issues regarding classification and prediction · Classification by decision tree induction · Bayesian Classification · Classification by backpropagation · Classification based on concepts from association rule mining · Other Classification Methods · Prediction · Estimating classification accuracy · Summary
85
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and AI)
- The classification issue: how to create a classifier, i.e., find f where y = f(x1, x2, …, xn) and f is in decision-tree, naive-Bayes, ANN, … form
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is an important issue for applications related to big data and clouds