Data Mining Chapter 4 Algorithms: The Basic Methods Reporter: Yuen-Kuei Hsueh

Outline
–Inferring rudimentary rules
–Statistical modeling

Simplicity first
Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
–One attribute does all the work
–All attributes contribute equally & independently
–A weighted linear combination might do
–Instance-based: use a few prototypes
–Use simple logical rules
Success of a method depends on the domain

Inferring rudimentary rules
1R: learns a 1-level decision tree, i.e., a set of rules that all test one particular attribute
Basic version:
–One branch for each value
–Each branch assigns the most frequent class
–Error rate: proportion of instances that don't belong to the majority class of their branch
–Choose the attribute with the lowest error rate (assumes nominal attributes)

Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Evaluating the weather attributes

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

Table 1. Weather data (nominal); the first four columns are attributes, Play is the class

Evaluating the weather attributes
The attribute Outlook has three values (Sunny, Overcast and Rainy)
There are 14 instances in the data set

Evaluating the weather attributes

Attribute  Rules            Errors  Total errors
Outlook    Sunny -> No      2/5     4/14
           Overcast -> Yes  0/4
           Rainy -> Yes     2/5
Temp       Hot -> No*       2/4     5/14
           Mild -> Yes      2/6
           Cool -> Yes      1/4
Humidity   High -> No       3/7     4/14
           Normal -> Yes    1/7
Windy      False -> Yes     2/8     5/14
           True -> No*      3/6

Table 2. Rules for the weather data (nominal); * marks a tie broken at random
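
For concreteness, here is a minimal Python sketch of 1R on nominal attributes. This is not code from the original slides; the one_r function name is an assumption, and the data list is just Table 1 transcribed. It reproduces the error counts of Table 2 and picks Outlook (Humidity ties at 4/14; the tie is broken by attribute order here).

```python
from collections import Counter, defaultdict

# Weather data from Table 1: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(data, attributes):
    """Return (attribute, rules, errors) for the attribute with the fewest errors."""
    best = None
    for i, attr in enumerate(attributes):
        # Count how often each class appears for each value of this attribute
        counts = defaultdict(Counter)
        for row in data:
            counts[row[i]][row[-1]] += 1
        # Each value predicts its most frequent class (ties broken arbitrarily)
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(data, attributes)
print(attr, rules, f"{errors}/{len(data)}")
# Outlook {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'} 4/14
```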

Dealing with numeric attributes
Discretize numeric attributes: divide each attribute's range into intervals
–Sort instances according to the attribute's values
–Place breakpoints where the class changes
–This minimizes the total error

Dealing with numeric attributes
The temperature values from the weather data, sorted, with their classes and breakpoints placed where the class changes:
64 | 65 | 68 69 70 | 71 72 72 | 75 75 | 80 | 81 83 | 85
Y  | N  | Y  Y  Y  | N  N  Y  | Y  Y  | N  | Y  Y  | N

Attribute    Rules                      Errors  Total errors
Temperature  <= 64.5 -> Yes             0/1     1/14
             > 64.5 and <= 66.5 -> No   0/1
             > 66.5 and <= 70.5 -> Yes  0/3
             > 70.5 and <= 73.5 -> No   1/3
             > 73.5 and <= 77.5 -> Yes  0/2
             > 77.5 and <= 80.5 -> No   0/1
             > 80.5 and <= 84 -> Yes    0/2
             > 84 -> No                 0/1

Table 5. Rules for temperature from the weather data (overfitting)

The problem of overfitting
This procedure is very sensitive to noise
–One instance with an incorrect class label will probably produce a separate interval
Simple solution:
–Enforce a minimum number of instances in the majority class per interval

The problem of overfitting
Example (with min = 3):
Before:  Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
After:   Y N Y Y Y | N N Y Y Y | N Y Y N

With overfitting avoidance

Attribute    Rules            Errors  Total errors
Temperature  <= 77.5 -> Yes   3/10    5/14
             > 77.5 -> No*    2/4

Table 6. Rules for temperature from the weather data (with overfitting avoidance)
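
The discretization step with a minimum bucket size can be sketched in Python as follows. This is an assumed implementation, not the authors' code: the discretize_1r name, the exact stopping rule, and the tie handling are my own choices, and the input is the sorted temperature/class sequence shown above. It reproduces the single 77.5 breakpoint of Table 6.

```python
from collections import Counter

def discretize_1r(values, classes, min_bucket=3):
    """Sketch of 1R-style discretization (assumed details, not the book's exact code).

    Walk through the instances sorted by value; keep extending the current
    interval until its majority class occurs at least `min_bucket` times and
    the next instance has a different class (never splitting equal values);
    then merge adjacent intervals that predict the same class.
    Returns (breakpoint, predicted_class) pairs; breakpoints are midpoints,
    and the last interval is open-ended (breakpoint None).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    vals = [values[i] for i in order]
    labs = [classes[i] for i in order]

    intervals = []                      # (index of last instance, majority class)
    start = 0
    for i in range(len(vals)):
        counts = Counter(labs[start:i + 1])
        majority, count = counts.most_common(1)[0]   # ties broken arbitrarily
        last = i == len(vals) - 1
        if last or (count >= min_bucket and labs[i + 1] != majority
                    and vals[i + 1] != vals[i]):
            intervals.append((i, majority))
            start = i + 1

    # Merge adjacent intervals that predict the same class
    merged = []
    for end, cls in intervals:
        if merged and merged[-1][1] == cls:
            merged[-1] = (end, cls)
        else:
            merged.append((end, cls))

    rules = []
    for end, cls in merged:
        if end == len(vals) - 1:
            rules.append((None, cls))
        else:
            rules.append(((vals[end] + vals[end + 1]) / 2, cls))
    return rules

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(discretize_1r(temps, play, min_bucket=3))
# [(77.5, 'Yes'), (None, 'No')]  -- matches Table 6 (the final tie falls to 'No')
```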

Statistical modeling
"Opposite" of 1R: use all the attributes
Two assumptions: attributes are
–equally important
–statistically independent (given the class value)
The independence assumption is never correct! But… this scheme works well in practice.

Probabilities for the weather data

Outlook:      Sunny     Yes 2 (2/9)   No 3 (3/5)
              Overcast  Yes 4 (4/9)   No 0 (0/5)
              Rainy     Yes 3 (3/9)   No 2 (2/5)
Temperature:  Hot       Yes 2 (2/9)   No 2 (2/5)
              Mild      Yes 4 (4/9)   No 2 (2/5)
              Cool      Yes 3 (3/9)   No 1 (1/5)
Humidity:     High      Yes 3 (3/9)   No 4 (4/5)
              Normal    Yes 6 (6/9)   No 1 (1/5)
Windy:        False     Yes 6 (6/9)   No 2 (2/5)
              True      Yes 3 (3/9)   No 3 (3/5)
Play:         Yes 9 (9/14)   No 5 (5/14)

Table 7. Probabilities for the weather data

Probabilities for the weather data
A new day:

Outlook  Temperature  Humidity  Windy  Play
Sunny    Cool         High      True   ?

Likelihood of the two classes:
For "Yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
For "No"  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
Conversion into a probability by normalization:
P("Yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("No")  = 0.0206 / (0.0053 + 0.0206) = 0.795
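
The same calculation can be reproduced with a short Python sketch; the counts below are read off the probabilities table above, while the dictionary layout and variable names are illustrative.

```python
# Conditional counts from the probabilities table (weather data), for the new day
counts = {
    "Yes": {"Outlook=Sunny": 2, "Temperature=Cool": 3,
            "Humidity=High": 3, "Windy=True": 3},
    "No":  {"Outlook=Sunny": 3, "Temperature=Cool": 1,
            "Humidity=High": 4, "Windy=True": 3},
}
class_counts = {"Yes": 9, "No": 5}
total = sum(class_counts.values())

# Naive Bayes likelihood: prior times the product of conditional probabilities
likelihood = {}
for cls, attr_counts in counts.items():
    p = class_counts[cls] / total                  # prior, e.g. 9/14 for Yes
    for attr, n in attr_counts.items():
        p *= n / class_counts[cls]                 # e.g. Pr[Sunny | Yes] = 2/9
    likelihood[cls] = p

# Normalize the likelihoods into probabilities
evidence = sum(likelihood.values())
for cls, p in likelihood.items():
    print(cls, round(p, 4), round(p / evidence, 3))
# Yes 0.0053 0.205
# No  0.0206 0.795
```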

Bayes's rule
Probability of event H given evidence E:
  Pr[H | E] = Pr[E | H] · Pr[H] / Pr[E]
A priori probability of H, Pr[H]:
–Probability of the event before evidence is seen
A posteriori probability of H, Pr[H | E]:
–Probability of the event after evidence is seen

Naïve Bayes for classification
Classification learning: what's the probability of the class given an instance?
–Evidence E = instance; event H = class value of the instance
Naïve assumption: the evidence splits into parts (i.e. the attributes) that are conditionally independent given the class, so
  Pr[H | E] = Pr[E1 | H] · Pr[E2 | H] · … · Pr[En | H] · Pr[H] / Pr[E]

Weather data example

Outlook  Temperature  Humidity  Windy  Play
Sunny    Cool         High      True   ?

The "zero-frequency problem"
What if an attribute value does not occur with every class value? (e.g. "Humidity = High" for class "Yes")
–The conditional probability will be zero
–The a posteriori probability will also be zero, no matter how likely the other values are!
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
Result: probabilities will never be zero!

Modified probability estimates
In some cases adding a constant μ different from 1 might be more appropriate
Example: attribute Outlook for class Yes, with μ split into weights p1, p2, p3:
  Sunny: (2 + μp1)/(9 + μ)   Overcast: (4 + μp2)/(9 + μ)   Rainy: (3 + μp3)/(9 + μ)
Weights don't need to be equal (but they must sum to 1)
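
A small Python sketch of these smoothed estimates, assuming the Outlook counts for class Yes from the weather data; the smoothed_probabilities helper and the example weights are illustrative, not from the original slides.

```python
def smoothed_probabilities(counts, mu=1.0, priors=None):
    """Estimate Pr[value | class] as (count + mu * p_i) / (total + mu).

    With mu = k * number_of_values and equal weights this reduces to the
    classic add-k estimator (Laplace for k = 1); unequal weights must sum to 1.
    """
    total = sum(counts.values())
    if priors is None:
        priors = {v: 1.0 / len(counts) for v in counts}   # equal weights
    return {v: (n + mu * priors[v]) / (total + mu) for v, n in counts.items()}

# Outlook counts for class Yes in the weather data: Sunny 2, Overcast 4, Rainy 3
outlook_yes = {"Sunny": 2, "Overcast": 4, "Rainy": 3}

print(smoothed_probabilities(outlook_yes, mu=3.0))   # add-one: (2+1)/12, (4+1)/12, (3+1)/12
print(smoothed_probabilities(outlook_yes, mu=2.0,
                             priors={"Sunny": 0.5, "Overcast": 0.3, "Rainy": 0.2}))
```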

Missing values
Training: the instance is not included in the frequency count for that attribute value-class combination
Classification: the attribute is simply omitted from the calculation
Example:

Outlook  Temp  Humidity  Windy  Play
?        Cool  High      True   ?

Likelihood of the two classes:
For "Yes" = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
For "No"  = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343
Conversion into a probability by normalization:
P("Yes") = 0.0238 / (0.0238 + 0.0343) = 0.41
P("No")  = 0.0343 / (0.0238 + 0.0343) = 0.59

Numeric attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
–Sample mean: μ = (1/n) Σ xi
–Standard deviation: σ = √( Σ (xi − μ)² / (n − 1) )
–The density function is then f(x) = 1/(√(2π) σ) · e^( −(x − μ)² / (2σ²) )

Statistics for weather data

Outlook:      Sunny     Yes 2 (2/9)   No 3 (3/5)
              Overcast  Yes 4 (4/9)   No 0 (0/5)
              Rainy     Yes 3 (3/9)   No 2 (2/5)
Temperature:  Yes: 64, 68, 70, 72, 75, …   (mean 73, std dev 6.2)
              No:  65, 71, 72, 80, 85, …   (mean 74.6, std dev 7.9)
Humidity:     Yes: 65, 70, 70, 75, 80, …   (mean 79.1, std dev 10.2)
              No:  70, 85, 90, 91, 95, …   (mean 86.2, std dev 9.7)
Windy:        False  Yes 6 (6/9)   No 2 (2/5)
              True   Yes 3 (3/9)   No 3 (3/5)
Play:         Yes 9 (9/14)   No 5 (5/14)

Example density value: f(temperature = 66 | Yes) = 1/(√(2π) · 6.2) · e^( −(66 − 73)² / (2 · 6.2²) ) = 0.0340
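
A minimal sketch of the density computation, assuming the mean 73 and standard deviation 6.2 for temperature given Yes shown above; the normal_density name is illustrative.

```python
import math

def normal_density(x, mean, std):
    """Gaussian probability density f(x) for the given mean and standard deviation."""
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Density of temperature = 66 given Play = Yes (mean 73, std dev 6.2)
print(round(normal_density(66, 73, 6.2), 4))   # 0.034
```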

Classifying a new day
A new day:

Outlook  Temperature  Humidity  Windy  Play
Sunny    66           90        True   ?

Likelihood of the two classes:
For "Yes" = 2/9 * 0.0340 * 0.0221 * 3/9 * 9/14 = 0.000036
For "No"  = 3/5 * 0.0221 * 0.0381 * 3/5 * 5/14 = 0.000108
Conversion into a probability by normalization:
P("Yes") = 0.000036 / (0.000036 + 0.000108) = 0.25
P("No")  = 0.000108 / (0.000036 + 0.000108) = 0.75

Multinomial naïve Bayes I
Version of naïve Bayes used for document classification with the bag-of-words model
n1, n2, …, nk: number of times word i occurs in the document
P1, P2, …, Pk: probability of obtaining word i when sampling from documents of class H
Probability of observing document E given class H (based on the multinomial distribution):
  Pr[E | H] = N! · ∏ ( Pi^ni / ni! )   where N = n1 + n2 + … + nk
Ignores the probability of generating a document of the right length (that probability is assumed constant for each class)

Multinomial naïve Bayes II
Suppose the dictionary has two words, yellow and blue
Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
Suppose E is the document "blue yellow blue"
Probability of observing the document:
  Pr[blue yellow blue | H] = 3! · (0.75^1 / 1!) · (0.25^2 / 2!) = 9/64 ≈ 0.14
Suppose there is another class H' that has Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:
  Pr[blue yellow blue | H'] = 3! · (0.1^1 / 1!) · (0.9^2 / 2!) ≈ 0.24, so the document is more likely under H'
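
A short Python check of these figures; the multinomial_likelihood helper is illustrative, not part of the original material.

```python
import math

def multinomial_likelihood(word_counts, word_probs):
    """Pr[document | class] = N! * prod_i (p_i^n_i / n_i!) for bag-of-words counts."""
    n_total = sum(word_counts.values())
    likelihood = math.factorial(n_total)
    for word, n in word_counts.items():
        likelihood *= word_probs[word] ** n / math.factorial(n)
    return likelihood

doc = {"blue": 2, "yellow": 1}                                        # "blue yellow blue"
print(multinomial_likelihood(doc, {"yellow": 0.75, "blue": 0.25}))    # ~0.141 (class H)
print(multinomial_likelihood(doc, {"yellow": 0.10, "blue": 0.90}))    # ~0.243 (class H')
```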

Naïve Bayes: discussion
Naïve Bayes works surprisingly well, even though the independence assumption is clearly violated
Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
However: adding too many redundant attributes will cause problems
Note also: many numeric attributes are not normally distributed!