Classification Algorithms – Continued

2 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

3 Generating Rules
- A decision tree can be converted into a rule set.
- Straightforward conversion: each path from the root to a leaf becomes a rule. This produces an overly complex rule set.
- More effective conversions are not trivial (e.g. C4.8 tests each node on the root-to-leaf path to see if it can be eliminated without loss in accuracy).

4 Covering algorithms
- Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (and excludes instances not in the class).
- This is called a covering approach because at each stage a rule is identified that covers some of the instances.

5 Example: generating a rule
If true then class = a

6 Example: generating a rule, II
If x > 1.2 then class = a
If true then class = a

7 Example: generating a rule, III
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
If true then class = a

8 Example: generating a rule, IV
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
If true then class = a
- Possible rule set for class "b":
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
- More rules could be added for a "perfect" rule set.
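To make the rule-set idea concrete, here is a tiny Python sketch that applies these rules in order, first match wins; the conditions and test points are illustrative, not part of the original example data.

```python
# Tiny sketch: the rule sets from the example, applied as an ordered list
# where the first matching rule fires. Conditions/points are illustrative.
rules = [
    (lambda x, y: x > 1.2 and y > 2.6, "a"),   # refined rule for class a
    (lambda x, y: x <= 1.2, "b"),              # rules for class b
    (lambda x, y: x > 1.2 and y <= 2.6, "b"),
]

def classify(x, y):
    for condition, cls in rules:
        if condition(x, y):
            return cls

print(classify(2.0, 3.0))   # "a"
print(classify(0.5, 1.0))   # "b"
```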

9 Rules vs. trees
- Corresponding decision tree: (produces exactly the same predictions)
- But: rule sets can be clearer when decision trees suffer from replicated subtrees.
- Also: in multi-class situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account.

10 A simple covering algorithm
- Generates a rule by adding tests that maximize the rule's accuracy.
- Similar to the situation in decision trees: the problem of selecting an attribute to split on.
- But: a decision tree inducer maximizes overall purity.
- Each new test reduces the rule's coverage.

11 Selecting a test
- Goal: maximize accuracy
  - t: total number of instances covered by the rule
  - p: positive examples of the class covered by the rule
  - t – p: number of errors made by the rule
- Select the test that maximizes the ratio p/t.
- We are finished when p/t = 1 or the set of instances can't be split any further.
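A small sketch of this selection step in Python, assuming instances are dictionaries of attribute values and candidate tests are (attribute, value) pairs; the function names are illustrative.

```python
# Score a candidate test "attribute = value" by the ratio p/t.
def count_pt(instances, attribute, value, target_class, class_attr="class"):
    """Return (p, t) for the instances covered by the test."""
    covered = [inst for inst in instances if inst[attribute] == value]
    t = len(covered)
    p = sum(1 for inst in covered if inst[class_attr] == target_class)
    return p, t

def best_test(instances, candidate_tests, target_class):
    """Pick the test maximizing p/t, breaking ties by the larger p."""
    def score(test):
        p, t = count_pt(instances, test[0], test[1], target_class)
        return (p / t if t else 0.0, p)
    return max(candidate_tests, key=score)
```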

12 Example: contact lens data, 1
- Rule we seek: If ? then recommendation = hard
- Possible tests:
  Age = Young                            2/8
  Age = Pre-presbyopic
  Age = Presbyopic
  Spectacle prescription = Myope
  Spectacle prescription = Hypermetrope
  Astigmatism = no
  Astigmatism = yes
  Tear production rate = Reduced
  Tear production rate = Normal

13 Example: contact lens data, 2
- Rule we seek: If ? then recommendation = hard
- Possible tests:
  Age = Young                            2/8
  Age = Pre-presbyopic                   1/8
  Age = Presbyopic                       1/8
  Spectacle prescription = Myope         3/12
  Spectacle prescription = Hypermetrope  1/12
  Astigmatism = no                       0/12
  Astigmatism = yes                      4/12
  Tear production rate = Reduced         0/12
  Tear production rate = Normal          4/12

14 Modified rule and resulting data
- Rule with best test added: If astigmatism = yes then recommendation = hard
- Instances covered by the modified rule:
  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Reduced               None
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Reduced               None
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Reduced               None
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Reduced               None
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Reduced               None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Reduced               None
  Presbyopic      Hypermetrope            Yes          Normal                None

15 Further refinement, 1
- Current state: If astigmatism = yes and ? then recommendation = hard
- Possible tests:
  Age = Young                            2/4
  Age = Pre-presbyopic
  Age = Presbyopic
  Spectacle prescription = Myope
  Spectacle prescription = Hypermetrope
  Tear production rate = Reduced
  Tear production rate = Normal

16 Further refinement, 2
- Current state: If astigmatism = yes and ? then recommendation = hard
- Possible tests:
  Age = Young                            2/4
  Age = Pre-presbyopic                   1/4
  Age = Presbyopic                       1/4
  Spectacle prescription = Myope         3/6
  Spectacle prescription = Hypermetrope  1/6
  Tear production rate = Reduced         0/6
  Tear production rate = Normal          4/6

17 Modified rule and resulting data
- Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard
- Instances covered by the modified rule:
  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Normal                None

18 Further refinement, 3
- Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
- Possible tests:
  Age = Young
  Age = Pre-presbyopic
  Age = Presbyopic
  Spectacle prescription = Myope
  Spectacle prescription = Hypermetrope

19 Further refinement, 4
- Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
- Possible tests:
  Age = Young                            2/2
  Age = Pre-presbyopic                   1/2
  Age = Presbyopic                       1/2
  Spectacle prescription = Myope         3/3
  Spectacle prescription = Hypermetrope  1/3
- Tie between the first and the fourth test (both have p/t = 1): we choose the one with greater coverage.

20 The result
- Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
- Second rule for recommending "hard lenses" (built from the instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
- These two rules cover all "hard lenses".
- The process is repeated with the other two classes.

21 Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E
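Below is a minimal runnable Python sketch of the same loop (not Weka's implementation); it assumes instances are dictionaries of nominal attribute values plus a class attribute, and all helper names are illustrative.

```python
# Minimal PRISM sketch. Instances are dicts of nominal values plus a class key.
def prism(instances, class_attr="class"):
    attributes = [a for a in instances[0] if a != class_attr]
    rules = []
    for cls in sorted({inst[class_attr] for inst in instances}):
        E = list(instances)                          # working set for this class
        while any(inst[class_attr] == cls for inst in E):
            rule, covered = [], E
            # Grow the rule until it is perfect or no attributes remain.
            while (any(i[class_attr] != cls for i in covered)
                   and len(rule) < len(attributes)):
                best, best_key = None, (-1.0, -1)
                for attr in attributes:
                    if any(a == attr for a, _ in rule):
                        continue                     # attribute already used
                    for val in {i[attr] for i in covered}:
                        subset = [i for i in covered if i[attr] == val]
                        t = len(subset)
                        p = sum(1 for i in subset if i[class_attr] == cls)
                        key = (p / t, p)             # accuracy first, then coverage
                        if key > best_key:
                            best, best_key = (attr, val), key
                rule.append(best)
                covered = [i for i in covered if i[best[0]] == best[1]]
            rules.append((cls, rule))
            E = [i for i in E if i not in covered]   # separate: drop covered instances
    return rules
```

Ties on p/t are broken in favor of the larger p, as in the pseudo-code.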

22 Rules vs. decision lists
- PRISM with the outer loop removed generates a decision list for one class.
  - Subsequent rules are designed for instances that are not covered by previous rules.
  - But: order doesn't matter because all rules predict the same class.
- The outer loop considers all classes separately.
  - No order dependence implied.
- Problems: overlapping rules, default rule required.

23 Separate and conquer
- Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  - First, a rule is identified.
  - Then, all instances covered by the rule are separated out.
  - Finally, the remaining instances are "conquered".
- Difference to divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further.

24 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

25 Linear models
- Work most naturally with numeric attributes.
- Standard technique for numeric prediction: linear regression.
  - The outcome is a linear combination of the attributes: x = w0*a0 + w1*a1 + w2*a2 + ... + wk*ak  (with a0 = 1)
  - The weights wj are calculated from the training data.
- Predicted value for the first training instance a(1): w0*a0(1) + w1*a1(1) + w2*a2(1) + ... + wk*ak(1)

26 Minimizing the squared error
- Choose the k+1 coefficients w0, ..., wk to minimize the squared error on the training data.
- Squared error: the sum over all training instances i of ( x(i) - (w0*a0(i) + w1*a1(i) + ... + wk*ak(i)) )^2
- Derive the coefficients using standard matrix operations.
- Can be done if there are more instances than attributes (roughly speaking).
- Minimizing the absolute error is more difficult.
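As a sketch of those matrix operations, the coefficients can be obtained with an off-the-shelf least-squares solver; the NumPy call is standard, but the data values below are made up for illustration.

```python
# Least-squares fit of the linear model via standard matrix operations.
import numpy as np

A = np.array([[2.0, 3.0], [1.0, 5.0], [4.0, 2.0], [3.0, 3.0]])  # attribute values
x = np.array([8.0, 12.0, 7.0, 9.0])                              # numeric targets

A1 = np.column_stack([np.ones(len(A)), A])      # prepend a0 = 1 for the bias w0
w, *_ = np.linalg.lstsq(A1, x, rcond=None)      # minimizes the squared error

print("weights:", w)
print("prediction for first instance:", A1[0] @ w)
```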

27 Regression for Classification
- Any regression technique can be used for classification.
  - Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't.
  - Prediction: predict the class corresponding to the model with the largest output value (membership value).
- For linear regression this is known as multi-response linear regression.
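A sketch of multi-response linear regression along those lines, one least-squares model per class with 0/1 targets; the data and function names are illustrative.

```python
# Multi-response linear regression sketch: one model per class, predict argmax.
import numpy as np

def fit_multiresponse(A, labels):
    A1 = np.column_stack([np.ones(len(A)), A])           # add bias column a0 = 1
    weights = {}
    for c in sorted(set(labels)):
        y = np.array([1.0 if lab == c else 0.0 for lab in labels])
        weights[c], *_ = np.linalg.lstsq(A1, y, rcond=None)
    return weights

def predict(weights, instance):
    a1 = np.concatenate([[1.0], instance])
    return max(weights, key=lambda c: a1 @ weights[c])    # largest membership value

A = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
labels = ["a", "a", "b", "b"]
w = fit_multiresponse(A, labels)
print(predict(w, np.array([1.5, 1.2])))   # expected: "a"
```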

28 *Theoretical justification
- For a single instance a with observed target value y (either 0 or 1) and true class probability Pr[1|a], the expected squared error of a model f(a) decomposes as
  E[ (f(a) - y)^2 ] = (f(a) - Pr[1|a])^2 + constant
- We want to minimize the first term on the right (the distance between the model and the true class probability); the scheme minimizes the left-hand side, and therefore minimizes that term as well.

29 *Pairwise regression
- Another way of using regression for classification:
  - Build a regression function for every pair of classes, using only the instances from these two classes.
  - Assign an output of +1 to one member of the pair, –1 to the other.
- Prediction is done by voting: the class that receives the most votes is predicted.
  - Alternative: predict "don't know" if there is no agreement.
- More likely to be accurate but more expensive.
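A corresponding sketch of the pairwise (one-vs-one) scheme; again the names and data are illustrative, with no claim to match any particular implementation.

```python
# Pairwise regression sketch: one least-squares model per pair of classes with
# +1/-1 targets; prediction by voting over all pairwise models.
from itertools import combinations
import numpy as np

def fit_pairwise(A, labels):
    A1 = np.column_stack([np.ones(len(A)), A])
    models = {}
    for c1, c2 in combinations(sorted(set(labels)), 2):
        idx = [i for i, lab in enumerate(labels) if lab in (c1, c2)]
        y = np.array([1.0 if labels[i] == c1 else -1.0 for i in idx])
        w, *_ = np.linalg.lstsq(A1[idx], y, rcond=None)
        models[(c1, c2)] = w
    return models

def predict(models, instance):
    a1 = np.concatenate([[1.0], instance])
    votes = {}
    for (c1, c2), w in models.items():
        winner = c1 if a1 @ w >= 0 else c2
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)          # class with the most votes
```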

30 Logistic regression
- Problem: some assumptions are violated when linear regression is applied to classification problems.
- Logistic regression: an alternative to linear regression.
  - Designed for classification problems.
  - Tries to estimate class probabilities directly.
  - Does this using the maximum likelihood method.
  - Uses a linear model for the log-odds of the class probability P:
    log( P / (1 - P) ) = w0 + w1*a1 + w2*a2 + ... + wk*ak,
    i.e. P = 1 / (1 + exp(-(w0 + w1*a1 + ... + wk*ak)))
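A sketch of the maximum-likelihood fit using plain gradient ascent on the log-likelihood; this is only illustrative (it is not the optimizer Weka uses), and the data is made up.

```python
# Logistic regression sketch: maximize the log-likelihood by gradient ascent.
import numpy as np

def fit_logistic(A, y, lr=0.1, steps=2000):
    """A: instances (rows), y: 0/1 class labels."""
    A1 = np.column_stack([np.ones(len(A)), A])       # bias term a0 = 1
    w = np.zeros(A1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-A1 @ w))            # P(class = 1 | instance)
        w += lr * A1.T @ (y - p)                      # gradient of the log-likelihood
    return w

A = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(A, y)
print(1.0 / (1.0 + np.exp(-(np.array([1.0, 2.0]) @ w))))   # estimated P(class=1 | a1=2.0)
```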

31 Discussion of linear models
- Not appropriate if the data exhibits non-linear dependencies.
- But: can serve as building blocks for more complex schemes (e.g. model trees).
- Example: multi-response linear regression defines a hyperplane for any two given classes, namely the set of instances where the two membership functions are equal:
  (w0(1) - w0(2)) + (w1(1) - w1(2))*a1 + ... + (wk(1) - wk(2))*ak = 0,
  where w(1) and w(2) are the weight vectors learned for the two classes.

32 Comments on basic methods
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they can't learn XOR.
- But: combinations of them can (→ Neural Nets).

33 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

34 Instance-based representation
- Simplest form of learning: rote learning.
  - The training instances are searched for the instance that most closely resembles the new instance.
  - The instances themselves represent the knowledge.
  - Also called instance-based learning.
- The similarity function defines what is "learned".
- Instance-based learning is lazy learning.
- Methods: nearest-neighbor, k-nearest-neighbor, ...

35 The distance function
- Simplest case: one numeric attribute.
  - Distance is the difference between the two attribute values involved (or a function thereof).
- Several numeric attributes: normally, Euclidean distance is used and the attributes are normalized.
- Nominal attributes: distance is set to 1 if the values are different, 0 if they are equal.
- Are all attributes equally important? Weighting the attributes might be necessary.

36 Instance-based learning
- The distance function defines what is learned.
- Most instance-based schemes use Euclidean distance:
  sqrt( (a1(1) - a1(2))^2 + (a2(1) - a2(2))^2 + ... + (ak(1) - ak(2))^2 )
  where a(1) and a(2) are two instances with k attributes.
- Taking the square root is not required when comparing distances.
- Another popular metric: the city-block (Manhattan) metric, which adds the differences without squaring them.
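A sketch of such a distance function for mixed numeric and nominal attributes; it assumes the numeric attributes are already normalized, and the names are illustrative.

```python
# Distance sketch: squared differences for numeric values, 0/1 mismatch for
# nominal values; the square root is optional when only ranking distances.
import math

def distance(inst1, inst2, take_sqrt=True):
    total = 0.0
    for v1, v2 in zip(inst1, inst2):
        if isinstance(v1, str) or isinstance(v2, str):   # nominal attribute
            total += 0.0 if v1 == v2 else 1.0
        else:                                            # numeric attribute
            total += (v1 - v2) ** 2
    return math.sqrt(total) if take_sqrt else total

print(distance([0.2, 0.9, "myope"], [0.5, 0.9, "hypermetrope"]))
```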

37 Normalization and other issues
- Different attributes are measured on different scales, so they need to be normalized:
  ai = (vi - min vi) / (max vi - min vi)
  or ai = (vi - mean of vi) / (standard deviation of vi),
  where vi is the actual value of attribute i.
- Nominal attributes: distance is either 0 or 1.
- Common policy for missing values: assumed to be maximally distant (given normalized attributes).
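A sketch of the min-max normalization above in plain Python; the column layout and names are illustrative.

```python
# Min-max normalization sketch: rescale each numeric attribute to [0, 1]
# using the training data's minimum and maximum.
def normalize_columns(rows):
    """rows: list of equal-length numeric attribute lists."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0     # guard constant columns
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

print(normalize_columns([[64, 65], [70, 96], [85, 80]]))
```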

38 Discussion of 1-NN
- Often very accurate...
- ...but slow: the simple version scans the entire training data to derive a prediction.
- Assumes all attributes are equally important; remedy: attribute selection or attribute weights.
- Possible remedies against noisy instances:
  - Take a majority vote over the k nearest neighbors.
  - Remove noisy instances from the dataset (difficult!).
- Statisticians have used k-NN since the early 1950s: if n → ∞ and k/n → 0, the error approaches the minimum.
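A sketch of k-NN with a majority vote over the k nearest neighbors, under the same assumptions as before (normalized numeric attributes, illustrative data).

```python
# k-NN sketch: Euclidean distance, majority vote over the k nearest neighbors.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (attribute_list, class_label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([0.1, 0.2], "a"), ([0.2, 0.1], "a"), ([0.9, 0.8], "b"), ([0.8, 0.9], "b")]
print(knn_predict(train, [0.15, 0.15], k=3))   # expected: "a"
```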

39 Summary
- Simple methods frequently work well and are robust against noise and errors.
- Advanced methods, if properly used, can improve on simple methods.
- No method is universally best.

40 Exploring simple ML schemes with WEKA
- 1R (evaluate on the training set)
  - Weather data (nominal)
  - Weather data (numeric), bucket size B=3 (and B=1)
- Naïve Bayes: same datasets
- J4.8 (and visualize the tree)
  - Weather data (nominal)
- PRISM: contact lens data
- Linear regression: CPU data