Classification Algorithms – Continued (based on slides for "Data Mining" by I. H. Witten and E. Frank)

2 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

3 Generating Rules
- A decision tree can be converted into a rule set
- Straightforward conversion: each path from the root to a leaf becomes a rule – this produces an overly complex rule set
- More effective conversions are not trivial (e.g. C4.8 tests each condition on the root-to-leaf path to see whether it can be eliminated without loss of accuracy)
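For concreteness, a small Python sketch of the straightforward path-to-rule conversion (not from the original slides; the nested-dict tree representation and the toy weather-style tree are assumptions made purely for illustration):

def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: the antecedent is the list of tests on the path."""
    if not isinstance(tree, dict):                 # reached a leaf: emit the accumulated path
        return [(list(conditions), tree)]
    rules = []
    for test, subtree in tree.items():             # each branch adds one test to the path
        rules.extend(tree_to_rules(subtree, conditions + (test,)))
    return rules

# Hypothetical toy tree, keyed by "attribute = value" tests, with class labels at the leaves
toy_tree = {"outlook = sunny": {"humidity = high": "no", "humidity = normal": "yes"},
            "outlook = overcast": "yes",
            "outlook = rainy": {"windy = true": "no", "windy = false": "yes"}}

for conds, label in tree_to_rules(toy_tree):
    print("If " + " and ".join(conds) + " then class = " + label)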

4 Covering algorithms
- Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances of that class (and excludes instances not in the class)
- This is called a covering approach because at each stage a rule is identified that covers some of the instances

5 Example: generating a rule
If true then class = a

6 Example: generating a rule, II
If x > 1.2 then class = a
If true then class = a

7 Example: generating a rule, III
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
If true then class = a

8 Example: generating a rule, IV
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
If true then class = a
- Possible rule set for class “b”:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
- More rules could be added for a “perfect” rule set

9 Rules vs. trees
- The corresponding decision tree produces exactly the same predictions
- But: rule sets can be clearer when decision trees suffer from replicated subtrees
- Also: in multi-class situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account

10 A simple covering algorithm
- Generates a rule by adding tests that maximize the rule's accuracy
- Similar to the situation in decision trees: the problem of selecting an attribute to split on
- But: a decision tree inducer maximizes overall purity
- Each new test reduces the rule's coverage

11 Selecting a test
- Goal: maximize accuracy
  t = total number of instances covered by the rule
  p = positive examples, i.e. instances of the class covered by the rule
  t – p = number of errors made by the rule
- Select the test that maximizes the ratio p/t
- We are finished when p/t = 1 or the set of instances can't be split any further
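As an illustration of the p/t criterion (not part of the original slides; the dict-based instance representation and the function name are assumptions), a small Python sketch that scores every candidate test A = v and returns the best one:

def best_test(instances, target_class, used_attributes=()):
    """Return the (attribute, value) test maximizing p/t; ties broken by the larger p."""
    best, best_key = None, (-1.0, -1)
    attributes = [a for a in instances[0] if a != "class" and a not in used_attributes]
    for a in attributes:
        for v in {inst[a] for inst in instances}:
            covered = [inst for inst in instances if inst[a] == v]              # instances the test covers
            t = len(covered)
            p = sum(1 for inst in covered if inst["class"] == target_class)     # positives among them
            if t and (p / t, p) > best_key:
                best, best_key = (a, v), (p / t, p)
    return best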

12 Example: contact lens data
- Rule we seek: If ? then recommendation = hard
- Possible tests:
  Age = Young                              2/8
  Age = Pre-presbyopic                     1/8
  Age = Presbyopic                         1/8
  Spectacle prescription = Myope           3/12
  Spectacle prescription = Hypermetrope    1/12
  Astigmatism = no                         0/12
  Astigmatism = yes                        4/12
  Tear production rate = Reduced           0/12
  Tear production rate = Normal            4/12

13 Modified rule and resulting data
- Rule with best test added: If astigmatism = yes then recommendation = hard
- Instances covered by the modified rule:
  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Reduced               None
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Reduced               None
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Reduced               None
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Reduced               None
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Reduced               None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Reduced               None
  Presbyopic      Hypermetrope            Yes          Normal                None

14 Further refinement
- Current state: If astigmatism = yes and ? then recommendation = hard
- Possible tests:
  Age = Young                              2/4
  Age = Pre-presbyopic                     1/4
  Age = Presbyopic                         1/4
  Spectacle prescription = Myope           3/6
  Spectacle prescription = Hypermetrope    1/6
  Tear production rate = Reduced           0/6
  Tear production rate = Normal            4/6

15 Modified rule and resulting data
- Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard
- Instances covered by the modified rule:
  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Normal                None

16 Further refinement
- Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
- Possible tests:
  Age = Young                              2/2
  Age = Pre-presbyopic                     1/2
  Age = Presbyopic                         1/2
  Spectacle prescription = Myope           3/3
  Spectacle prescription = Hypermetrope    1/3
- Tie between the first and the fourth test (both have p/t = 1)
- We choose the one with greater coverage

17 The result
- Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
- Second rule for recommending “hard” lenses (built from the instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
- These two rules cover all “hard” lenses
- The process is then repeated with the other two classes

18 Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E
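A hedged Python rendering of the PRISM pseudo-code above (the dict-based instance representation with a "class" key is an assumption for illustration; rules are lists of attribute = value conditions that all predict one class):

def covers(rule, inst):
    return all(inst[a] == v for a, v in rule)

def learn_one_rule(instances, target_class):
    """Grow one rule for target_class by repeatedly adding the test that maximizes p/t."""
    rule, candidates = [], list(instances)
    while True:
        if all(inst["class"] == target_class for inst in candidates):
            return rule                                          # rule is perfect
        used = {a for a, _ in rule}
        best, best_key = None, (-1.0, -1)
        for a in (a for a in instances[0] if a != "class" and a not in used):
            for v in {inst[a] for inst in candidates}:
                covered = [inst for inst in candidates if inst[a] == v]
                p = sum(1 for inst in covered if inst["class"] == target_class)
                if covered and (p / len(covered), p) > best_key:   # maximize p/t, break ties on p
                    best, best_key = (a, v), (p / len(covered), p)
        if best is None:
            return rule                                          # no more attributes to use
        rule.append(best)
        candidates = [inst for inst in candidates if covers(rule, inst)]

def prism(instances):
    rules = []
    for target_class in sorted({inst["class"] for inst in instances}):   # for each class C
        E = list(instances)                                              # initialize E
        while any(inst["class"] == target_class for inst in E):
            rule = learn_one_rule(E, target_class)
            rules.append((rule, target_class))
            E = [inst for inst in E if not covers(rule, inst)]           # remove covered instances
    return rules

Run on the contact lens data above, this sketch should reproduce the two "hard" rules derived by hand.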

19 Rules vs. decision lists
- PRISM with the outer loop removed generates a decision list for one class
- Subsequent rules are designed for instances that are not covered by previous rules
- But: order doesn't matter, because all rules predict the same class
- The outer loop considers all classes separately
- No order dependence is implied
- Problems: overlapping rules, default rule required

20 Separate and conquer
- Methods like PRISM (dealing with one class at a time) are separate-and-conquer algorithms:
  First, a rule is identified
  Then, all instances covered by the rule are separated out
  Finally, the remaining instances are “conquered”
- Difference from divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further

21 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

22 Linear models
- Work most naturally with numeric attributes
- Standard technique for numeric prediction: linear regression
- Outcome is a linear combination of the attributes
- Weights are calculated from the training data
- Predicted value for the first training instance a(1):
  w0·a0(1) + w1·a1(1) + w2·a2(1) + … + wk·ak(1)   (with a0(1) = 1, so w0 acts as a bias weight)

23 Minimizing the squared error
- Choose the k + 1 coefficients w0 … wk to minimize the squared error on the training data
- Squared error: Σi=1..n ( x(i) − Σj=0..k wj·aj(i) )², where x(i) is the actual value of the i-th training instance
- Derive the coefficients using standard matrix operations
- Can be done if there are more instances than attributes (roughly speaking)
- Minimizing the absolute error is more difficult
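A hedged NumPy sketch of deriving the coefficients with standard matrix operations, here via NumPy's least-squares solver (A is assumed to be an n × k array of attribute values and y a length-n vector; the function names are illustrative):

import numpy as np

def linear_regression(A, y):
    """Weights w minimizing the squared error sum_i (y_i - w . a_i)^2 over the training data."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # prepend a0 = 1 so w[0] is the bias weight w0
    w, *_ = np.linalg.lstsq(A1, y, rcond=None)      # least-squares solution of A1 w = y
    return w

def predict(A, w):
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    return A1 @ w                                   # linear combination of the attributes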

24 Regression for Classification
- Any regression technique can be used for classification
- Training: perform a separate regression for each class, setting the output to 1 for training instances that belong to the class and to 0 for those that don't
- Prediction: predict the class whose model gives the largest output value (membership value)
- For linear regression this is known as multi-response linear regression
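A hedged sketch of multi-response linear regression built in the same style as the least-squares sketch above (one regression per class on a 0/1 membership target; A is again an n × k attribute array, and the names are illustrative):

import numpy as np

def fit_multiresponse(A, labels):
    """Fit one least-squares regression per class on a 0/1 class-membership target."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    weights = {}
    for c in sorted(set(labels)):
        y = np.array([1.0 if lab == c else 0.0 for lab in labels])   # membership target
        weights[c], *_ = np.linalg.lstsq(A1, y, rcond=None)
    return weights

def predict_class(A, weights):
    """Predict the class whose regression model produces the largest output (membership value)."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    classes = list(weights)
    outputs = np.column_stack([A1 @ weights[c] for c in classes])
    return [classes[i] for i in np.argmax(outputs, axis=1)]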

25 *Theoretical justification
- For an instance x with observed target value y ∈ {0, 1}, the expected squared error of a model f(x) decomposes as
  E[(f(x) − y)²] = (f(x) − P(y = 1 | x))² + constant
- The scheme minimizes the left-hand side (the expected squared error); the constant does not depend on the model
- We actually want to minimize the squared difference between the model output and the true class probability, so minimizing the squared error drives the model output towards P(y = 1 | x)

26 *Pairwise regression
- Another way of using regression for classification:
- Build a regression function for every pair of classes, using only the instances from those two classes
- Assign an output of +1 to the members of one class and –1 to the members of the other
- Prediction is done by voting: the class that receives the most votes is predicted
- Alternative: output “don't know” if there is no agreement
- More likely to be accurate, but more expensive

27 Logistic regression
- Problem: some assumptions are violated when linear regression is applied to classification problems
- Logistic regression: an alternative to linear regression
- Designed for classification problems
- Tries to estimate class probabilities directly
- Does this using the maximum likelihood method
- Uses a linear model for the log-odds of the class probability P:
  log( P / (1 − P) ) = w0·a0 + w1·a1 + … + wk·ak,  i.e.  P = 1 / (1 + e^−(w0·a0 + w1·a1 + … + wk·ak))
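A hedged sketch of the logistic model with a simple maximum-likelihood fit by gradient ascent (practical implementations use more sophisticated optimization; the learning rate, iteration count, and names are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # maps the linear model to a probability in (0, 1)

def fit_logistic(A, y, lr=0.1, epochs=5000):
    """Gradient ascent on the average log-likelihood of 0/1 targets y."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    w = np.zeros(A1.shape[1])
    for _ in range(epochs):
        p = sigmoid(A1 @ w)                    # current class-probability estimates
        w += lr * A1.T @ (y - p) / len(y)      # gradient of the average log-likelihood
    return w

def predict_proba(A, w):
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    return sigmoid(A1 @ w)                     # estimated P(class = 1 | instance)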

28 Discussion of linear models
- Not appropriate if the data exhibits non-linear dependencies
- But: can serve as building blocks for more complex schemes (e.g. model trees)
- Example: multi-response linear regression defines a hyperplane between any two given classes – class 1 is predicted over class 2 when
  (w0(1) − w0(2))·a0 + (w1(1) − w1(2))·a1 + … + (wk(1) − wk(2))·ak > 0

29 Comments on basic methods
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they can't learn XOR
- But: combinations of them can (→ neural nets)

30 Outline
- Rules
- Linear Models (Regression)
- Instance-based (Nearest-neighbor)

31 Instance-based representation
- Simplest form of learning: rote learning
- The training instances are searched for the instance that most closely resembles the new instance
- The instances themselves represent the knowledge
- Also called instance-based learning
- The similarity function defines what is “learned”
- Instance-based learning is lazy learning
- Methods: nearest-neighbor, k-nearest-neighbor, …

32 The distance function
- Simplest case: one numeric attribute – the distance is the difference between the two attribute values involved (or a function thereof)
- Several numeric attributes: normally, Euclidean distance is used and the attributes are normalized
- Nominal attributes: the distance is set to 1 if the values are different, 0 if they are equal
- Are all attributes equally important? Weighting the attributes might be necessary

33 Instance-based learning
- The distance function defines what is learned
- Most instance-based schemes use Euclidean distance between two instances a(1) and a(2) with k attributes:
  sqrt( (a1(1) − a1(2))² + (a2(1) − a2(2))² + … + (ak(1) − ak(2))² )
- Taking the square root is not required when comparing distances
- Other popular metric: the city-block (Manhattan) metric, which adds the absolute differences without squaring them
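A hedged sketch of these distance functions (instances are assumed to be plain lists of attribute values, possibly mixing numeric and nominal; nominal values contribute 0/1 as on the previous slide):

import math

def squared_euclidean(a1, a2):
    """Sum of squared differences; numeric attributes differ by value, nominal ones by 0/1."""
    total = 0.0
    for x, y in zip(a1, a2):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0    # nominal attributes: 0 if equal, 1 otherwise
    return total

def euclidean(a1, a2):
    return math.sqrt(squared_euclidean(a1, a2))   # the square root is optional for ranking neighbours

def manhattan(a1, a2):
    """City-block metric: add the absolute differences without squaring them."""
    return sum(abs(x - y) if isinstance(x, (int, float)) and isinstance(y, (int, float))
               else (0.0 if x == y else 1.0)
               for x, y in zip(a1, a2))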

34 Normalization and other issues
- Different attributes are measured on different scales → they need to be normalized:
  ai = (vi − min vi) / (max vi − min vi), where vi is the actual value of attribute i
- Nominal attributes: distance is either 0 or 1
- Common policy for missing values: assume they are maximally distant (given normalized attributes)
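A hedged sketch of the min–max normalization formula above (purely numeric instances are assumed; the column handling is illustrative):

def normalize(instances):
    """Rescale each attribute to [0, 1]: a_i = (v_i - min v_i) / (max v_i - min v_i)."""
    columns = list(zip(*instances))                  # one tuple of values per attribute
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled.append([(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]       # back to one list per instance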

35 Discussion of 1-NN
- Often very accurate
- … but slow: the simple version scans the entire training data to derive each prediction
- Assumes all attributes are equally important – remedy: attribute selection or attribute weights
- Possible remedies against noisy instances:
  Take a majority vote over the k nearest neighbors
  Remove noisy instances from the dataset (difficult!)
- Statisticians have used k-NN since the early 1950s
- If n → ∞ and k/n → 0, the error approaches the theoretical minimum
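A hedged k-NN sketch that takes a majority vote over the k nearest neighbors (the (attribute_vector, label) training representation and the default squared-Euclidean distance are assumptions for illustration; k = 1 gives plain 1-NN):

from collections import Counter

def knn_predict(train, query, k=3, dist=None):
    """Classify `query` by a majority vote over its k nearest training instances."""
    if dist is None:
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Tiny illustrative call: the two nearby "a" instances outvote the single "b"
print(knn_predict([([0.0, 0.0], "b"), ([1.0, 1.0], "a"), ([0.9, 1.2], "a")], [1.0, 0.9], k=3))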

36 Summary
- Simple methods frequently work well and are robust against noise and errors
- Advanced methods, if properly used, can improve on simple methods
- No method is universally best