Classification Algorithms


Classification Algorithms: Covering, Nearest-Neighbour

Covering algorithms
Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all examples of that class (and excludes examples not in the class). This is called a covering approach: at each stage a rule is identified that covers some of the examples.

Example: generating a rule
[Figure: the entire dataset]

Example: generating a rule
[Figure] We now split along the x axis, so that the rule covers most of the cases of one class.

Example: generating a rule
[Figure] A single rule describes a convex concept.

Example: generating a rule
Class 'b' is not convex, so extra rules are needed for the 'else' part. [Figure: a possible rule set for class 'b'.] More rules could be added for a 'perfect' rule set.

Rules vs. trees
[Figure: the corresponding decision tree, which produces exactly the same predictions]

A simple covering algorithm: PRISM
Generates a rule by adding tests that maximize the rule's accuracy. Each additional test reduces the rule's coverage. This is similar to the situation in decision trees (the problem of selecting an attribute to split on), but a decision tree inducer maximizes overall purity, whereas PRISM maximizes the accuracy of the single rule being built.

Selecting a test
Goal: maximize accuracy. Let t be the total number of examples covered by the rule and p the number of those that are positive examples of the class. Select the test that maximizes the ratio p/t. We are finished when p/t = 1 or the set of examples cannot be split any further. (PRISM searches for maximal accuracy; alternatives are possible, such as preferring tests that cover a larger part of the dataset.)
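
A minimal sketch of this scoring step in Python; the function name `score_test` and the tiny toy dataset are illustrative assumptions, not part of the original slides:

    # Score a candidate test "attribute == value" for a rule predicting target_class.
    # Returns (p, t): positives covered and total covered, so accuracy is p / t.
    def score_test(examples, attribute, value, target_class):
        covered = [(x, c) for x, c in examples if x[attribute] == value]
        t = len(covered)
        p = sum(1 for _, c in covered if c == target_class)
        return p, t

    # Toy examples: each is (attribute values, class label).
    examples = [
        ({"astigmatism": "yes", "tear_rate": "normal"},  "hard"),
        ({"astigmatism": "yes", "tear_rate": "reduced"}, "none"),
        ({"astigmatism": "no",  "tear_rate": "normal"},  "soft"),
        ({"astigmatism": "no",  "tear_rate": "reduced"}, "none"),
    ]

    p, t = score_test(examples, "astigmatism", "yes", "hard")
    print(f"astigmatism = yes: p/t = {p}/{t}")   # 1/2 on this toy data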

Contact lens data
Attributes and their values:
Age: young, pre-presbyopic, presbyopic
Spectacle prescription: myope, hypermetrope
Astigmatism: no, yes
Tear production rate: reduced, normal
Recommendation (class): none, soft, hard

Example: contact lens data
Rule we seek: if ? then recommendation = hard
Possible tests (p/t = number of 'hard' examples with that attribute value / number of examples with that value, e.g. p/t for Age = Young is the number of hard cases among the young divided by the number of young cases):
Age = Young 2/8
Age = Pre-presbyopic 1/8
Age = Presbyopic 1/8
Spectacle prescription = Myope 3/12
Spectacle prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
Tear production rate = Reduced 0/12
Tear production rate = Normal 4/12

Modified rule and resulting data
Rule with best test added (4/12): if astigmatism = yes then recommendation = hard
[Table: the 12 examples covered by the modified rule, i.e. all examples with astigmatism = yes; 4 of them have recommendation = hard]

Further refinement
Current state: if astigmatism = yes and ? then recommendation = hard
Possible tests (p/t ratios are now computed over the examples covered by the rule so far, so the order in which tests are added matters):
Age = Young 2/4
Age = Pre-presbyopic 1/4
Age = Presbyopic 1/4
Spectacle prescription = Myope 3/6
Spectacle prescription = Hypermetrope 1/6
Tear production rate = Reduced 0/6
Tear production rate = Normal 4/6

Modified rule and resulting data
Rule with best test added: if astigmatism = yes and tear production rate = normal then recommendation = hard
[Table: the 6 examples covered by the modified rule; 4 of them have recommendation = hard]

Further refinement
Current state: if astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
There is a tie between the first test (Age = Young, 2/2) and the fourth test (Spectacle prescription = Myope, 3/3); we choose the one with greater coverage.

The result
Final rule: if astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
Second rule for recommending 'hard' lenses, built from the examples not covered by the first rule: if age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all 'hard' lenses. The process is then repeated for the other two classes. The rules can be applied directly as nested conditions, as in the sketch below.
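
Read literally, the two rules are just conjunctions of tests; a minimal sketch of how they could be written as code (the function name is an illustrative assumption):

    def recommend_hard(age, spectacle_prescription, astigmatism, tear_production_rate):
        # Rule 1: astigmatism = yes and tear production rate = normal
        #         and spectacle prescription = myope
        if (astigmatism == "yes" and tear_production_rate == "normal"
                and spectacle_prescription == "myope"):
            return True
        # Rule 2: age = young and astigmatism = yes and tear production rate = normal
        if (age == "young" and astigmatism == "yes"
                and tear_production_rate == "normal"):
            return True
        return False

    print(recommend_hard("young", "hypermetrope", "yes", "normal"))  # True (rule 2)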

Pseudo-code for PRISM
For each class C
  Initialize D to the dataset
  While D contains examples in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the examples covered by R from D
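
A runnable Python sketch of this pseudo-code, learning the rules for one class; it assumes examples are (attribute-dict, class) pairs, and the helper names and the tiny dataset are illustrative, not the full 24-instance contact lens data:

    def covers(rule, x):
        """A rule is a dict of attribute -> value tests; an empty rule covers everything."""
        return all(x[a] == v for a, v in rule.items())

    def prism_for_class(examples, target_class):
        """Learn a list of rules (conjunctions of attribute = value tests) for one class."""
        rules, remaining = [], list(examples)
        while any(c == target_class for _, c in remaining):
            rule = {}                                  # empty left-hand side
            covered = list(remaining)
            # Refine until the rule is perfect or no attributes are left to add.
            while (covered and any(c != target_class for _, c in covered)
                    and len(rule) < len(covered[0][0])):
                best, best_score = None, (-1.0, -1)    # (accuracy p/t, positives p)
                for a in covered[0][0]:
                    if a in rule:
                        continue
                    for v in {x[a] for x, _ in covered}:
                        subset = [(x, c) for x, c in covered if x[a] == v]
                        p = sum(1 for _, c in subset if c == target_class)
                        t = len(subset)
                        if (p / t, p) > best_score:    # break ties on larger p
                            best_score, best = (p / t, p), (a, v)
                a, v = best
                rule[a] = v
                covered = [(x, c) for x, c in covered if x[a] == v]
            rules.append(rule)
            remaining = [(x, c) for x, c in remaining if not covers(rule, x)]
        return rules

    data = [
        ({"astigmatism": "yes", "tear_rate": "normal",  "prescription": "myope"},        "hard"),
        ({"astigmatism": "yes", "tear_rate": "normal",  "prescription": "hypermetrope"}, "none"),
        ({"astigmatism": "yes", "tear_rate": "reduced", "prescription": "myope"},        "none"),
        ({"astigmatism": "no",  "tear_rate": "normal",  "prescription": "myope"},        "soft"),
    ]
    print(prism_for_class(data, "hard"))
    # -> [{'astigmatism': 'yes', 'tear_rate': 'normal', 'prescription': 'myope'}]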

Separate and conquer
Methods like PRISM (dealing with one class at a time) are separate-and-conquer algorithms: first a rule is identified, then all examples covered by the rule are separated out, and finally the remaining examples are "conquered". The difference to divide-and-conquer methods is that the subset covered by a rule does not need to be explored any further.

Instance-based Classification: k-nearest neighbours

Instance-based representation
Simplest form of learning: rote learning (learning by repetition). The training set is searched for the example that most closely resembles the new example; the examples themselves represent the knowledge. This is also called instance-based learning and is a form of lazy learning: the similarity function defines what is "learned". Methods: nearest neighbour, k-nearest neighbours, ... [Figure: visualisation of k-NN]

1-NN example
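
A minimal 1-NN sketch in Python: classifying a query is just a lookup of the closest stored example. The data and function name are illustrative assumptions.

    import math

    def one_nn(train, query):
        """train: list of (feature_vector, label); returns the label of the closest example."""
        best_label, best_dist = None, math.inf
        for features, label in train:
            dist = math.dist(features, query)      # Euclidean distance
            if dist < best_dist:
                best_dist, best_label = dist, label
        return best_label

    train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 4.8), "b")]
    print(one_nn(train, (4.5, 5.0)))   # "b"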

The distance function
Simplest case: one numeric attribute; the distance is the difference between the two attribute values involved (or a function thereof). With several numeric attributes, Euclidean distance is normally used and the attributes are normalized. Are all attributes equally important? Weighting the attributes might be necessary.

The distance function
Most instance-based schemes use the Euclidean distance between two instances a^{(1)} and a^{(2)} with k attributes:
d(a^{(1)}, a^{(2)}) = \sqrt{\sum_{i=1}^{k} \bigl(a_i^{(1)} - a_i^{(2)}\bigr)^2}
Taking the square root is not required when comparing distances. Another popular metric is the city-block (Manhattan) metric, which adds the differences without squaring them. [Speaker note: relate the distance measure to the earlier example; give an example of Hamming / Manhattan distance.]
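
A small sketch of both metrics in Python, assuming plain numeric feature tuples:

    def euclidean(a, b):
        # The square root can be dropped when only comparing distances (monotone transform).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def manhattan(a, b):
        # City-block metric: sum of absolute differences, no squaring.
        return sum(abs(x - y) for x, y in zip(a, b))

    print(euclidean((1, 2), (4, 6)))   # 5.0
    print(manhattan((1, 2), (4, 6)))   # 7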

Normalization and other issues
Different attributes are measured on different scales and need to be normalized, e.g. by min-max scaling:
a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}
where v_i is the actual value of attribute i, or alternatively by standardizing to zero mean and unit variance. For nominal attributes the distance is either 0 or 1. A common policy for missing values is to assume them to be maximally distant (given normalized attributes).
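
A min-max normalization sketch for one numeric attribute (a column of values); the function name is an illustrative assumption:

    def min_max_normalize(values):
        """Rescale a list of numeric attribute values to the range [0, 1]."""
        lo, hi = min(values), max(values)
        if hi == lo:                      # constant attribute: map everything to 0
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    print(min_max_normalize([10, 20, 15, 30]))   # [0.0, 0.5, 0.25, 1.0]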

Nearest Neighbour & Voronoi diagram

k-NN example
k-NN approach: perform a majority vote over the k nearest neighbours to derive the label.
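
A k-NN sketch with a majority vote over the k closest training examples; the value of k, the data, and the function name are illustrative assumptions:

    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, label). Majority vote over the k nearest."""
        neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    train = [((1.0, 1.0), "a"), ((1.1, 0.8), "a"), ((5.0, 5.2), "b"),
             ((4.8, 5.0), "b"), ((5.1, 4.9), "b")]
    print(knn_predict(train, (4.0, 4.0), k=3))   # "b"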

Discussion of 1-NN
Can be very accurate, but slow: the simple version scans the entire training data to make a prediction; better training set representations exist (kD-tree, ball tree, ...).
Suffers from the curse of dimensionality: every added dimension increases distances, and exponentially more training data is needed.
Assumes all attributes are equally important; a remedy is attribute selection or attribute weights.
A possible remedy against noisy examples is to take a majority vote over the k nearest neighbours.
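
Where scanning the whole training set is too slow, a tree-based index can answer nearest-neighbour queries much faster; a minimal sketch using scikit-learn's KDTree, assuming NumPy and scikit-learn are available (the random data is purely illustrative):

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(0)
    X = rng.random((10_000, 5))           # 10,000 training instances, 5 numeric attributes
    tree = KDTree(X)                      # built once, then reused for all queries

    query = rng.random((1, 5))
    dist, ind = tree.query(query, k=3)    # distances and indices of the 3 nearest neighbours
    print(ind[0], dist[0])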

Summary
Simple methods frequently work well and are robust against noise and errors. Advanced methods, if properly used, can improve on simple methods. No method is universally best.