
Algorithms for Classification: The Basic Methods
(Slides based on Chapter 4 of "Data Mining: Practical Machine Learning Tools and Techniques" by I. H. Witten and E. Frank)

Slide 2: Outline
- Simplicity first: 1R
- Naïve Bayes

Slide 3: Classification
- Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.
- Supervised learning: classes are known for the examples used to build the classifier.
- A classifier can be a set of rules, a decision tree, a neural network, etc.
- Typical applications: credit approval, direct marketing, fraud detection, medical diagnosis, ...

Slide 4: Simplicity first
- Simple algorithms often work very well!
- There are many kinds of simple structure, e.g.:
  - One attribute does all the work
  - All attributes contribute equally and independently
  - A weighted linear combination might do
  - Instance-based: use a few prototypes
  - Use simple logical rules
- Success of a method depends on the domain

Slide 5: Inferring rudimentary rules
- 1R: learns a 1-level decision tree, i.e. a set of rules that all test one particular attribute
- Basic version:
  - One branch for each attribute value
  - Each branch assigns the most frequent class
  - Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  - Choose the attribute with the lowest error rate (assumes nominal attributes)

Slide 6: Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is treated as a separate attribute value
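The pseudo-code maps almost line for line onto Python. The sketch below is a minimal illustrative implementation for nominal attributes; function and variable names are our own, not Weka's.

```python
# Minimal 1R sketch for nominal attributes (illustrative, not Weka's implementation).
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="Play"):
    """Return (best attribute, its value -> class rules, number of errors)."""
    best = None
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        rules, errors = {}, 0
        for value, class_counts in counts.items():
            majority_class, majority_count = class_counts.most_common(1)[0]
            rules[value] = majority_class                    # assign the most frequent class
            errors += sum(class_counts.values()) - majority_count
        if best is None or errors < best[2]:                 # keep the lowest-error attribute
            best = (attr, rules, errors)
    return best
```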

Slide 7: Evaluating the weather attributes

Attribute     Rules                     Errors   Total errors
Outlook       Sunny -> No               2/5      4/14
              Overcast -> Yes           0/4
              Rainy -> Yes              2/5
Temp          Hot -> No*                2/4      5/14
              Mild -> Yes               2/6
              Cool -> Yes               1/4
Humidity      High -> No                3/7      4/14
              Normal -> Yes             1/7
Windy         False -> Yes              2/8      5/14
              True -> No*               3/6

* indicates a tie

The weather data:

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No
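Running the sketch above on the 14 instances reproduces the table: Outlook and Humidity both make 4/14 errors, and with ties broken by attribute order Outlook wins. The dictionary encoding of the data is just one convenient choice; later sketches reuse this `weather` list.

```python
# The weather data as a list of dicts (one per row of the table above).
rows = [
    ("Sunny", "Hot", "High", "False", "No"),       ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),   ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),   ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),   ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rainy", "Mild", "High", "True", "No"),
]
names = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
weather = [dict(zip(names, r)) for r in rows]

attr, rules, errors = one_r(weather, ["Outlook", "Temp", "Humidity", "Windy"])
print(attr, rules, f"{errors}/14")
# Outlook {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'} 4/14
```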

Slide 8: Dealing with numeric attributes
- Discretize numeric attributes: divide each attribute's range into intervals
  - Sort instances according to the attribute's values
  - Place breakpoints where the (majority) class changes
  - This minimizes the total error
- Example: temperature from the weather data (sorted values with their classes):

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
...        ...           ...        ...     ...

Slide 9: The problem of overfitting
- This procedure is very sensitive to noise: one instance with an incorrect class label will probably produce a separate interval
- Also: a time-stamp attribute will have zero errors (it uniquely identifies each training instance)
- Simple solution: enforce a minimum number of instances in the majority class per interval

Slide 10: Discretization example
- Example (with min = 3 instances of the majority class per interval):

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

- Final result for the temperature attribute:

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes   No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No
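One simplified way to implement this partitioning is sketched below: sort the values, grow each interval until its majority class has at least min_bucket instances, then extend it over any following run of that class, never splitting between identical values. Merging adjacent intervals that end up with the same majority class (which leads to the single 77.5 breakpoint in the rule on the next slide) is left out for brevity; the code and names are illustrative, not Weka's.

```python
from collections import Counter

def discretize(values, classes, min_bucket=3):
    """Sketch of 1R-style discretization; values/classes are parallel lists sorted by value."""
    breakpoints, i, n = [], 0, len(values)
    while i < n:
        counts = Counter()
        # Grow the interval until some class has at least min_bucket instances.
        while i < n and (not counts or counts.most_common(1)[0][1] < min_bucket):
            counts[classes[i]] += 1
            i += 1
        majority = counts.most_common(1)[0][0]
        # Keep absorbing instances of the majority class, and never split
        # between two identical attribute values.
        while i < n and (classes[i] == majority or values[i] == values[i - 1]):
            counts[classes[i]] += 1
            i += 1
        if i < n:
            breakpoints.append((values[i - 1] + values[i]) / 2)   # halfway between neighbours
    return breakpoints

temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
           "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(discretize(temps, classes))   # [70.5, 77.5]
```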

Slide 11: With overfitting avoidance

Resulting rule set:

Attribute     Rules                        Errors   Total errors
Outlook       Sunny -> No                  2/5      4/14
              Overcast -> Yes              0/4
              Rainy -> Yes                 2/5
Temperature   ≤ 77.5 -> Yes                3/10     5/14
              > 77.5 -> No*                2/4
Humidity      ≤ 82.5 -> Yes                1/7      3/14
              > 82.5 and ≤ 95.5 -> No      2/6
              > 95.5 -> Yes                0/1
Windy         False -> Yes                 2/8      5/14
              True -> No*                  3/6

Slide 12: Discussion of 1R
- 1R was described in a paper by Holte (1993): "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Robert C. Holte, Computer Science Department, University of Ottawa
- The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
- The minimum number of instances per interval was set to 6 after some experimentation
- 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!

Slide 13: Bayesian (statistical) modeling
- "Opposite" of 1R: use all the attributes
- Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value), i.e. knowing the value of one attribute says nothing about the value of another (if the class is known)
- The independence assumption is almost never correct!
- But ... this scheme works well in practice

Slide 14: Probabilities for the weather data

Counts and relative frequencies per class (weather data as on slide 7):

Outlook         Yes       No           Temperature     Yes       No
  Sunny         2  (2/9)  3  (3/5)       Hot           2  (2/9)  2  (2/5)
  Overcast      4  (4/9)  0  (0/5)       Mild          4  (4/9)  2  (2/5)
  Rainy         3  (3/9)  2  (2/5)       Cool          3  (3/9)  1  (1/5)

Humidity        Yes       No           Windy           Yes       No
  High          3  (3/9)  4  (4/5)       False         6  (6/9)  2  (2/5)
  Normal        6  (6/9)  1  (1/5)       True          3  (3/9)  3  (3/5)

Play:  Yes 9  (9/14),  No 5  (5/14)
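A sketch of how these counts can be computed (it reuses the `weather` list of dicts from the 1R usage example above; the names are illustrative):

```python
# Build class counts and per-attribute-value class counts from the weather data.
from collections import Counter, defaultdict

class_counts = Counter(inst["Play"] for inst in weather)    # Counter({'Yes': 9, 'No': 5})
cond_counts = defaultdict(Counter)                          # (attribute, value) -> class counts
for inst in weather:
    for attr in ("Outlook", "Temp", "Humidity", "Windy"):
        cond_counts[(attr, inst[attr])][inst["Play"]] += 1

# Relative frequency, e.g. Pr[Outlook = Sunny | yes] = 2/9
print(cond_counts[("Outlook", "Sunny")]["Yes"] / class_counts["Yes"])   # 0.222...
```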

Slide 15: Probabilities for the weather data

A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Likelihood of the two classes (using the relative frequencies from slide 14):
  For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
  P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
  P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
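The same arithmetic in a few lines of Python, using the relative frequencies from slide 14:

```python
from math import prod

like_yes = prod([2/9, 3/9, 3/9, 3/9, 9/14])                    # 0.0053
like_no  = prod([3/5, 1/5, 4/5, 3/5, 5/14])                    # 0.0206
total = like_yes + like_no
print(round(like_yes / total, 3), round(like_no / total, 3))   # 0.205 0.795
```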

Slide 16: Bayes's rule
- Probability of event H given evidence E:  Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
- A priori probability of H, Pr[H]: probability of the event before the evidence is seen
- A posteriori probability of H, Pr[H | E]: probability of the event after the evidence is seen

Thomas Bayes (born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England), from Bayes' "Essay towards solving a problem in the doctrine of chances" (1763)

Slide 17: Naïve Bayes for classification
- Classification learning: what's the probability of the class given an instance?
  - Evidence E = instance
  - Event H = class value for the instance
- Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class:
  Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Slide 18: Weather data example

Evidence E (a new day):

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class "yes":
  Pr[yes | E] = Pr[Outlook=Sunny | yes] × Pr[Temp=Cool | yes] × Pr[Humidity=High | yes] × Pr[Windy=True | yes] × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

Slide 19: The "zero-frequency problem"
- What if an attribute value never occurs with a particular class value? (e.g. "Outlook = Overcast" never occurs with class "no", so Pr[Overcast | no] = 0/5)
- That probability will be zero, so the a posteriori probability will also be zero, no matter how likely the other values are!
- Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator)
- Result: probabilities will never be zero (this also stabilizes probability estimates)
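Putting slides 14-19 together, a compact naïve Bayes learner for nominal attributes might look like the sketch below. It is illustrative code, not Weka's implementation; it reuses the `weather` list from the 1R example and applies the Laplace estimator to every value-class count.

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, attributes, class_attr="Play"):
    """Return a classify() function built on Laplace-smoothed conditional probabilities."""
    class_counts = Counter(inst[class_attr] for inst in instances)
    cond = defaultdict(Counter)                      # (attribute, value) -> class counts
    values = defaultdict(set)                        # attribute -> observed values
    for inst in instances:
        for attr in attributes:
            cond[(attr, inst[attr])][inst[class_attr]] += 1
            values[attr].add(inst[attr])

    def classify(new_instance):
        scores = {}
        for cls, n_cls in class_counts.items():
            score = n_cls / len(instances)           # prior Pr[class]
            for attr in attributes:
                count = cond[(attr, new_instance[attr])][cls]
                # Laplace estimator: add 1 to every attribute value-class count.
                score *= (count + 1) / (n_cls + len(values[attr]))
            scores[cls] = score
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}   # normalize

    return classify

classify = train_naive_bayes(weather, ["Outlook", "Temp", "Humidity", "Windy"])
print(classify({"Outlook": "Sunny", "Temp": "Cool", "Humidity": "High", "Windy": "True"}))
# Roughly {'Yes': 0.28, 'No': 0.72}; smoothing shifts the unsmoothed 0.205 / 0.795 somewhat.
```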

Slide 20: *Modified probability estimates
- In some cases adding a constant different from 1 might be more appropriate
- Example: attribute Outlook for class "yes", with a constant μ split equally over the three values:
    Sunny:    (2 + μ/3) / (9 + μ)
    Overcast: (4 + μ/3) / (9 + μ)
    Rainy:    (3 + μ/3) / (9 + μ)
- Weights don't need to be equal, e.g. (2 + μp1)/(9 + μ), (4 + μp2)/(9 + μ), (3 + μp3)/(9 + μ), but they must sum to 1

Slide 21: Missing values
- Training: the instance is not included in the frequency count for that attribute value-class combination
- Classification: the attribute is omitted from the calculation
- Example:

  Outlook   Temp.   Humidity   Windy   Play
  ?         Cool    High       True    ?

  Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
  Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
  P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
  P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%

Slide 22: Numeric attributes
- Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
- The probability density function for the normal distribution is defined by two parameters:
  - sample mean μ
  - standard deviation σ
- The density function f(x) is then

  f(x) = 1 / (√(2π) σ) × exp(-(x - μ)² / (2σ²))

(Carl Friedrich Gauss, the great German mathematician)
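The density function in Python, handy for checking the numbers on the next two slides (a plain implementation, nothing Weka-specific):

```python
from math import sqrt, pi, exp

def gaussian_density(x, mean, stdev):
    """Normal probability density f(x) with parameters mean (mu) and stdev (sigma)."""
    return (1 / (sqrt(2 * pi) * stdev)) * exp(-((x - mean) ** 2) / (2 * stdev ** 2))

print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034, i.e. f(temperature = 66 | yes)
```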

Slide 23: Statistics for the weather data

Outlook         Yes    No      Windy           Yes    No      Play      Yes     No
  Sunny         2/9    3/5       False         6/9    2/5               9/14    5/14
  Overcast      4/9    0/5       True          3/9    3/5
  Rainy         3/9    2/5

Temperature     Yes            No              Humidity        Yes            No
  values        64, 68, 69,    65, 71, 72,       values        65, 70, 70,    70, 85, 90,
                70, 72, ...    80, 85, ...                      75, 80, ...    91, 95, ...
  mean μ        73             75                mean μ         79             86
  std dev σ     6.2            7.9               std dev σ      10.2           9.7

Example density value:
  f(temperature = 66 | yes) = 1 / (√(2π) × 6.2) × exp(-(66 - 73)² / (2 × 6.2²)) = 0.0340

Slide 24: Classifying a new day
- A new day:

  Outlook   Temp.   Humidity   Windy   Play
  Sunny     66      90         true    ?

  Likelihood of "yes" = 2/9 × f(temperature = 66 | yes) × f(humidity = 90 | yes) × 3/9 × 9/14
  Likelihood of "no"  = 3/5 × f(temperature = 66 | no)  × f(humidity = 90 | no)  × 3/5 × 5/14
  P("yes") = 20.9%
  P("no")  = 79.1%

- Missing values during training are not included in the calculation of the mean and standard deviation
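The same classification sketched in Python, reusing gaussian_density() from the previous sketch. It plugs in the rounded means and standard deviations from slide 23, so the percentages come out close to, but not exactly, the slide's 20.9% / 79.1%.

```python
# Likelihoods for the new day (Sunny, 66, 90, true), with rounded parameters from slide 23.
like_yes = (2/9) * gaussian_density(66, 73, 6.2) * gaussian_density(90, 79, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * gaussian_density(66, 75, 7.9) * gaussian_density(90, 86, 9.7) * (3/5) * (5/14)
total = like_yes + like_no
print(f"P(yes) = {like_yes / total:.1%}, P(no) = {like_no / total:.1%}")   # about 22% / 78%
```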

Slide 25: *Probability densities
- Relationship between probability and density:
    Pr[c - ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)
- But this doesn't change the calculation of a posteriori probabilities, because ε cancels out
- Exact relationship:
    Pr[a ≤ x ≤ b] = ∫ from a to b of f(t) dt

Slide 26: Naïve Bayes: discussion
- Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class
- However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed (→ kernel density estimators)

Slide 27: Naïve Bayes extensions
- Improvements:
  - select the best attributes (e.g. with greedy search)
  - often works as well or better with just a fraction of all attributes
- Bayesian networks

Slide 28: Summary
- OneR: uses rules based on just one attribute
- Naïve Bayes: uses all attributes and Bayes's rule to estimate the probability of the class given an instance
- Simple methods frequently work well, but ...
- Complex methods can be better (as we will see)