
Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

What is Bayesian Classification?
- Bayesian classifiers are statistical classifiers.
- For each new sample they provide, for every class, the probability that the sample belongs to that class.
- Example: sample John (age=27, income=high, student=no, credit_rating=fair)
    P(John, buys_computer=yes) = 20%
    P(John, buys_computer=no) = 80%

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem
- Given a data sample X, the posterior probability of a hypothesis h, P(h|X), follows from Bayes' theorem:
    P(h|X) = P(X|h) · P(h) / P(X)
- Example: given that John (X) has age=27, income=high, student=no, credit_rating=fair, we would like to find:
    P(John, buys_computer=yes)
    P(John, buys_computer=no)
- For P(John, buys_computer=yes) we are going to use:
    P(age=27 ∧ income=high ∧ student=no ∧ credit_rating=fair | buys_computer=yes) · P(buys_computer=yes)
    divided by P(age=27 ∧ income=high ∧ student=no ∧ credit_rating=fair)
- Practical difficulty: requires initial knowledge of many probabilities and significant computational cost.
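To make the mechanics of the theorem concrete, here is a minimal Python sketch of the posterior computation; the prior, likelihood and evidence values are made-up illustrative numbers, not taken from the buys_computer example.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(h|X) = P(X|h) * P(h) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers, chosen only to illustrate the formula.
p_h = 0.3          # P(h): prior probability of the hypothesis
p_x_given_h = 0.5  # P(X|h): likelihood of the observed sample under h
p_x = 0.25         # P(X): overall probability of observing the sample

print(posterior(p_h, p_x_given_h, p_x))  # P(h|X) = 0.5 * 0.3 / 0.25 = 0.6
```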

Naïve Bayesian Classifier
- A simplifying assumption: attributes are conditionally independent given the class.
- Notice that the class label Cj plays the role of the hypothesis.
- The denominator is removed because the probability of a data sample, P(X), is constant for all classes.
- The probability P(X|Cj) of a sample X given a class Cj is replaced by:
    P(X|Cj) = Π P(vi|Cj),   where X = v1 ∧ v2 ∧ … ∧ vn
- This is the naive hypothesis (attribute independence assumption).

Naïve Bayesian Classifier
- Example: given that John (X) has age=27, income=high, student=no, credit_rating=fair:
    P(John, buys_computer=yes) = P(buys_computer=yes) *
                                 P(age=27 | buys_computer=yes) *
                                 P(income=high | buys_computer=yes) *
                                 P(student=no | buys_computer=yes) *
                                 P(credit_rating=fair | buys_computer=yes)
- Greatly reduces the computation cost: only the class distributions need to be counted.
- Sensitive to cases where there are strong correlations between attributes,
  e.g. P(age=27 ∧ income=high) >> P(age=27) * P(income=high)

Naive Bayesian Classifier Example
[Table: the "play tennis?" training set of 14 samples, with attributes outlook, temperature, humidity, windy and the class label P (play) / N (don't play).]

Naive Bayesian Classifier Example
[Table: the same training set split by class, with 9 samples of class P (play) and 5 samples of class N (don't play).]

- Given the training set, we compute the conditional probabilities P(attribute value | class) for every attribute value and class; these are the values (e.g. P(sunny|P) = 2/9) used in the examples below.
- We also have the prior probabilities: P = 9/14 and N = 5/14.

Naive Bayesian Classifier Example
- The classification problem is formalized using a-posteriori probabilities:
    P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C,
  e.g. P(class=N | outlook=sunny, windy=true, …).
- Assign to sample X the class label C such that P(C|X) is maximal.
- Naïve assumption (attribute independence):
    P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)

Naive Bayesian Classifier Example
- To classify a new sample X: outlook = sunny, temperature = cool, humidity = high, windy = false.
- Prob(X|P)·Prob(P) = Prob(P) * Prob(sunny|P) * Prob(cool|P) * Prob(high|P) * Prob(false|P)
                    = 9/14 * 2/9 * 3/9 * 3/9 * 6/9 ≈ 0.0106
- Prob(X|N)·Prob(N) = Prob(N) * Prob(sunny|N) * Prob(cool|N) * Prob(high|N) * Prob(false|N)
                    = 5/14 * 3/5 * 1/5 * 4/5 * 2/5 ≈ 0.0137
- Since 0.0137 > 0.0106, X takes class label N.

Naive Bayesian Classifier Example
- Second example: X = <outlook=rain, temperature=hot, humidity=high, windy=false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 ≈ 0.0106
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 ≈ 0.0183
- Since 0.0183 > 0.0106, sample X is classified in class N (don't play).
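As a sanity check on the arithmetic above, here is a minimal Python sketch of the naive Bayes scoring step. It only hard-codes the conditional probabilities and priors quoted on these slides; it is not a full classifier trained from the data set.

```python
# Minimal naive Bayes scoring sketch: score(class) = P(class) * prod_i P(value_i | class)
cond_prob = {
    "P": {"sunny": 2/9, "rain": 3/9, "cool": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9},
    "N": {"sunny": 3/5, "rain": 2/5, "cool": 1/5, "hot": 2/5, "high": 4/5, "false": 2/5},
}
prior = {"P": 9/14, "N": 5/14}

def score(sample, label):
    """Unnormalized posterior: P(label) * product of P(value | label)."""
    s = prior[label]
    for value in sample:
        s *= cond_prob[label][value]
    return s

x = ["rain", "hot", "high", "false"]        # second example from the slide
scores = {label: score(x, label) for label in ("P", "N")}
print(scores)                               # P: ~0.0106, N: ~0.0183
print(max(scores, key=scores.get))          # 'N' -> don't play
```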

Categorical and Continuous Attributes
- Naïve assumption (attribute independence): P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C.
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function.
- Computationally easy in both cases.
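For the continuous case, one common realization (sketched below; the per-class statistics are hypothetical, not from the slides) is to estimate a mean and standard deviation per class and evaluate the Gaussian density at the observed value:

```python
import math

def gaussian_density(x, mean, std):
    """N(x; mean, std): used as the estimate of P(x_i | C) for a continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Hypothetical per-class statistics for a continuous attribute such as age.
age_mean_yes, age_std_yes = 35.0, 8.0
print(gaussian_density(27, age_mean_yes, age_std_yes))  # contribution of age=27 to class "yes"
```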

The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated.
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first

Bayesian Belief Networks (I)
- A directed acyclic graph which models dependencies between variables (values).
- If an arc is drawn from node Y to node Z, then Z depends on Y:
  - Z is a child (descendant) of Y
  - Y is a parent (ancestor) of Z
- Each variable is conditionally independent of its nondescendants given its parents.

Bayesian Belief Networks (II)
[Figure: a Bayesian belief network over the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay and Dyspnea, together with the conditional probability table for the variable LungCancer, giving P(LC) and P(~LC) for each combination (FH, S), (FH, ~S), (~FH, S), (~FH, ~S).]

Bayesian Belief Networks (III)
- Using Bayesian belief networks, the joint probability factorizes as:
    P(v1, …, vn) = Π P(vi | Parents(vi))
- Example:
    P(LC = yes ∧ FH = yes ∧ S = yes)
      = P(FH = yes) · P(S = yes) · P(LC = yes | FH = yes ∧ S = yes)
      = P(FH = yes) · P(S = yes) · 0.8
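A minimal sketch of this factorization in Python. The CPT entry P(LC=yes | FH=yes, S=yes) = 0.8 comes from the slide; the priors for FamilyHistory and Smoker are hypothetical placeholders.

```python
# Belief-network factorization: P(v1,...,vn) = prod_i P(vi | Parents(vi))
p_fh_yes = 0.1                      # hypothetical P(FamilyHistory = yes)
p_s_yes = 0.3                       # hypothetical P(Smoker = yes)
p_lc_given = {("yes", "yes"): 0.8}  # P(LungCancer=yes | FH, S); only the entry quoted on the slide

p_joint = p_fh_yes * p_s_yes * p_lc_given[("yes", "yes")]
print(p_joint)  # P(LC=yes, FH=yes, S=yes) = 0.1 * 0.3 * 0.8 = 0.024
```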

Bayesian Belief Networks (IV)
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- A graphical model of causal relationships.
- Several cases of learning Bayesian belief networks:
  - Given both the network structure and all the variables: easy
  - Given the network structure but only some of the variables
  - When the network structure is not known in advance

Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

Associative classification
- General idea: discover association rules of the form X → Y, where:
  - X is a set of attribute values that appear frequently together
  - Y is a class label
- Rank these rules in decreasing confidence/support order.
- Add at the end of this list a default rule with the most common class label.
- A new tuple to be classified is tested against the rules in order until a rule whose X it satisfies is found; the class label Y of that rule is then selected (see the sketch below).
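A minimal sketch of the rule-application step only; mining and ranking the rules (e.g. with an Apriori-style algorithm) are assumed to have been done already, and the rules shown are hypothetical examples rather than rules mined from a real data set.

```python
# Ranked associative-classification rule list: (antecedent X, class label Y).
rules = [
    ({"income=high", "student=no"}, "buys_computer=no"),
    ({"age=youth", "student=yes"}, "buys_computer=yes"),
]
default_label = "buys_computer=yes"  # default rule: most common class label

def classify(tuple_values, rules, default_label):
    """Return the class of the first rule whose antecedent X is satisfied by the tuple."""
    for antecedent, label in rules:
        if antecedent <= tuple_values:   # all conditions of X appear in the tuple
            return label
    return default_label

print(classify({"age=27", "income=high", "student=no", "credit_rating=fair"},
               rules, default_label))    # -> buys_computer=no
```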

Instance-Based Methods
- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function may be discrete- or real-valued.
- For a discrete-valued target function, k-NN returns the most common value among the k training examples nearest to the query point xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point xq surrounded by training examples labeled + and -.]
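A minimal Python sketch of the discrete-valued case described above: a plain majority vote over the k training points nearest to the query under Euclidean distance. The tiny 2-D data set is made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training_data, k=3):
    """Return the most common label among the k training points nearest to `query`."""
    neighbors = sorted(training_data, key=lambda pl: euclidean(query, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up 2-D training points: (point, label).
training_data = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((4.0, 4.0), "-"),
                 ((3.8, 4.2), "-"), ((0.9, 1.1), "+")]
print(knn_classify((1.1, 1.0), training_data, k=3))  # -> "+"
```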

Discussion on the k-NN Algorithm
- Distance-weighted nearest neighbor algorithm:
  - Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors (see the weighted-vote sketch below).
  - The same idea applies to real-valued target functions.
- Robust to noisy data by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome this, stretch the axes or eliminate the least relevant attributes.
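The distance-weighted variant as a small sketch, reusing the euclidean helper from the previous block. Inverse squared distance is used as the weight, which is one common choice rather than the only one.

```python
from collections import defaultdict

def weighted_knn_classify(query, training_data, k=3, eps=1e-9):
    """Like knn_classify, but each neighbor votes with weight 1/d^2 instead of a single vote."""
    neighbors = sorted(training_data, key=lambda pl: euclidean(query, pl[0]))[:k]
    weights = defaultdict(float)
    for point, label in neighbors:
        weights[label] += 1.0 / (euclidean(query, point) ** 2 + eps)  # eps avoids division by zero
    return max(weights, key=weights.get)
```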

Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

What Is Prediction?
- Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict unknown values.
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification predicts categorical class labels.
  - Prediction models continuous-valued functions.

Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data.
- One can only predict value ranges or category distributions.
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction: data relevance analysis via uncertainty measurement, entropy analysis, expert judgement, etc.

Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = α + β X
  - The two parameters α and β specify the line and are estimated from the data at hand, using the least squares criterion on the known values (x1, y1), (x2, y2), …, (xs, ys).
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into this form, e.g. Y = b0 + b1 X + b2 X² + b3 X³ with X1 = X, X2 = X², X3 = X³.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g. p(a, b, c, d) = αab · βac · χad · δbcd.
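A minimal sketch of fitting the simple linear model Y = α + β X with the closed-form least squares estimates; the (experience, salary) pairs are made up for illustration.

```python
# Least-squares estimates for Y = alpha + beta * X:
#   beta  = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   alpha = y_mean - beta * x_mean

def fit_linear(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Made-up (years of experience, salary) pairs, only to illustrate the fit.
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 41, 44, 50]
alpha, beta = fit_linear(xs, ys)
print(alpha, beta)   # fitted intercept and slope
```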

Regression
[Figure: example of linear regression, a scatter plot of salary (Y) against years of experience (X) with the fitted line y = x + 1 drawn through the points.]

Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing (the holdout method)
  - Use two independent data sets, e.g. a training set (2/3) and a test set (1/3).
  - Used for data sets with a large number of samples.
  - Variation: random subsampling.
- Cross-validation (see the sketch below)
  - Divide the data set into k subsamples.
  - Use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation.
  - For data sets of moderate size; in practice, 10-fold cross-validation is applied.
- Leave-one-out and bootstrapping: for data sets of small size.
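A minimal sketch of k-fold cross-validation as described above; `train` and `evaluate` are hypothetical placeholders for whatever classifier and error measure are being assessed.

```python
# k-fold cross-validation: split the data into k folds, train on k-1 of them,
# test on the remaining one, and average the k error rates.
def k_fold_cv(data, k, train, evaluate):
    fold_size = len(data) // k           # any leftover samples are ignored in this sketch
    errors = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(training)
        errors.append(evaluate(model, test))   # error rate on the held-out fold
    return sum(errors) / k
```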

Boosting and Bagging
- Boosting increases classification accuracy.
- Applicable to decision trees or Bayesian classifiers.
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
- Boosting requires only linear time and constant space.

Boosting Technique (II) — Algorithm
- Assign every example an equal weight 1/N.
- For t = 1, 2, …, T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t).
  - Calculate the error of h(t) and re-weight the examples based on the error.
  - Normalize w(t+1) to sum to 1.
- Output a weighted combination of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (a code sketch of this loop follows below).
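A sketch of this loop in Python, using AdaBoost-style re-weighting as one concrete choice since the slide does not fix the exact formula; `train_weighted` is a hypothetical weak-learner interface, and labels and predictions are assumed to be in {-1, +1}.

```python
import math

def boost(examples, labels, T, train_weighted):
    """AdaBoost-style sketch: train_weighted(examples, labels, w) returns a classifier h
    with h(x) in {-1, +1}; labels are also assumed to be in {-1, +1}."""
    n = len(examples)
    w = [1.0 / n] * n                       # equal initial weights 1/N
    hypotheses = []
    for t in range(T):
        h = train_weighted(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:          # degenerate weak learner: stop early
            hypotheses.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight from its error
        hypotheses.append((alpha, h))
        # Increase the weights of misclassified examples, decrease the others.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]        # normalize w(t+1) to sum to 1
    def ensemble(x):
        return 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
    return ensemble
```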

Example of Bagging
[Figure: T samples S1, …, ST are drawn from the data; a classifier C1, …, CT is trained on each; their votes (with weights w1, …, wT) are combined into the final class prediction.]
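A minimal sketch of the process in the figure, with bootstrap sampling and a plain (equal-weight) majority vote; `train` is a hypothetical base learner returning a callable classifier.

```python
import random
from collections import Counter

def bagging(examples, labels, T, train):
    """Train T classifiers on bootstrap samples and combine their votes."""
    n = len(examples)
    classifiers = []
    for _ in range(T):
        idx = [random.randrange(n) for _ in range(n)]     # bootstrap sample S_t
        classifiers.append(train([examples[i] for i in idx], [labels[i] for i in idx]))
    def ensemble(x):
        votes = Counter(c(x) for c in classifiers)        # combine the T votes
        return votes.most_common(1)[0][0]
    return ensemble
```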

Classification Accuracy
- Is the accuracy measure appropriate enough?
- If the class labels do not have the same probability, accuracy may be a misleading measure.
- Example:
  - Two class labels: cancer / not cancer.
  - Only 3-4% of the training samples have cancer.
  - A classification accuracy of 90% is not acceptable: always predicting "not cancer" already achieves 96-97%.

Alternative Measures for Accuracy
- Assume two class labels (positive, negative).
- sensitivity = t_pos / pos
  - the number of true positive samples divided by the total number of positive samples
- specificity = t_neg / neg
  - the number of true negative samples divided by the total number of negative samples
- precision = t_pos / (t_pos + f_pos)
  - the number of true positive samples divided by the number of true positive plus false positive samples
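The same definitions as a small Python sketch computing the three measures from confusion-matrix counts; the counts in the example call are made up.

```python
def accuracy_measures(t_pos, f_neg, t_neg, f_pos):
    """Sensitivity, specificity and precision from confusion-matrix counts."""
    pos = t_pos + f_neg                   # all truly positive samples
    neg = t_neg + f_pos                   # all truly negative samples
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    return sensitivity, specificity, precision

# Made-up counts: 30 true positives, 10 false negatives, 940 true negatives, 20 false positives.
print(accuracy_measures(30, 10, 940, 20))   # (0.75, ~0.979, 0.6)
```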

Cases with Samples Belonging to More Than One Class
- In some cases a sample may belong to more than one class (overlapping classes).
- In this case plain accuracy cannot be used to rate the classifier.
- Instead of a single class label, the classifier returns a probability distribution over the classes.
- A class prediction is considered correct if it agrees with the classifier's first or second guess.