Presentation on theme: "School of Computer Science & Engineering"— Presentation transcript:

1 School of Computer Science & Engineering
Artificial Intelligence: Nearest Neighbor Classifier. Dae-Won Kim, School of Computer Science & Engineering, Chung-Ang University

2 In the last class, we learned Bayesian classification approaches.

3 However, these approaches have some limitations.

4 Limitation 1: We rarely know the class-conditional probability P(x|class), so it must be estimated from data.

5 The available samples are often too few for reliable class-conditional estimation.

6 Limitation 2: Samples and features are assumed to be independent of each other.

7 They are often not independent.

8 Limitation 3: Parametric densities are uni-modal (e.g., the normal distribution).

9 Many practical problems involve multi-modal densities.

10 We want to start without the assumption that the forms of the underlying densities are known.

11 These methods are called non-parametric approaches.

12 There are two types of non-parametric methods.

13 Parzen window vs. Nearest neighbor

14 The Parzen window method estimates a possibly multi-modal P(x|class) directly from sample patterns (a small sketch follows below).
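To make the idea concrete, here is a minimal sketch (not from the slides) of Parzen-window density estimation in one dimension with a Gaussian window; the function name parzen_density, the window width h, and the sample values are all illustrative.

```python
# A minimal sketch of Parzen-window density estimation with a Gaussian window,
# assuming 1-D sample patterns; values are illustrative only.
import math

def parzen_density(x, samples, h):
    """Estimate p(x) as the average of Gaussian windows of width h
    centered on each training sample."""
    n = len(samples)
    total = 0.0
    for xi in samples:
        u = (x - xi) / h
        total += math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))
    return total / n

# A bimodal sample set yields a multi-modal estimate of p(x|class).
samples = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
print(parzen_density(1.0, samples, h=0.5))   # high density near the first mode
print(parzen_density(3.0, samples, h=0.5))   # low density between the modes
```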

16 Nearest Neighbor methods try to solve the problem of the unknown “best” window function.

17 NN algorithms bypass probability density estimation and go directly to the posterior probability P(class|x).

18 Place a cell of volume V around x and capture k samples, of which ki turn out to belong to class ωi; the posterior P(ωi|x) is then estimated as ki/k.

20 Let x’ be the training pattern closest to a test pattern x; the NN rule assigns x the label of x’.

21 If the number of training samples is large, the error rate of NN is never worse than twice the Bayes rate.

22 k-NN classifies x by assigning it the most frequent label among its k nearest samples, i.e., by majority voting, as sketched below.
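As a concrete illustration, here is a minimal from-scratch sketch of this voting rule, assuming Euclidean distance and small in-memory training data; the function names and data are illustrative, not from the slides.

```python
# A minimal sketch of the k-NN voting rule, assuming Euclidean distance
# over 2-D feature vectors; names and data are illustrative.
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x, train_X, train_y, k=3):
    """Return the most frequent label among the k training samples nearest to x."""
    # Linear scan over all training samples (the scalability issue noted below).
    order = sorted(range(len(train_X)), key=lambda i: euclidean(x, train_X[i]))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (4.8, 5.2)]
train_y = ["w1", "w1", "w2", "w2"]
print(knn_classify((1.1, 1.0), train_X, train_y, k=3))   # -> "w1"
```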

23 Therefore, NN-type methods are called instance-based classifiers; they classify without an explicit learning phase.

24 Issues: the choice of k, the distance measure, feature weighting, and scalability (a naive linear scan over all training samples).

25 Nonparametric Approach
All parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.
Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
There are two types of nonparametric methods:
- Estimate the density function P(x | ωj): Parzen window
- Bypass probability and go directly to the a-posteriori probability P(ωj | x): k-NN

26 K-Nearest-Neighbor Estimation
Motivation: solve the problem of the unknown “best” window function.
- Let the cell volume be a function of the training data.
- Center a cell about x and let it grow until it captures kn samples; these kn samples are called the kn nearest neighbors of x.
- Two possibilities can occur:
  - If the density is high near x, the cell will be small, which provides good resolution.
  - If the density is low, the cell will grow large, stopping only when it reaches higher-density regions.

27 K-Nearest-Neighbor Estimation
Estimate the a-posteriori probabilities P(ωi | x) from a set of n labeled samples.
- Place a cell of volume V around x and capture k samples; ki samples among the k turn out to be labeled ωi. Then the joint estimate is pn(x, ωi) = (ki / n) / V.
- An estimate for pn(ωi | x) is pn(ωi | x) = pn(x, ωi) / Σj pn(x, ωj) = ki / k.
- ki/k is the fraction of the samples within the cell that are labeled ωi.
- If k is large and the cell sufficiently small, the performance will approach the best possible. (A small numerical sketch follows below.)
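Here is a minimal sketch of the ki/k estimate, assuming 1-D samples: ranking the training samples by distance to x implicitly grows the cell until it holds k samples, and the fraction of each label among them estimates P(ωi|x). Names and data are illustrative.

```python
# A minimal sketch of the ki/k posterior estimate, assuming 1-D data.
def knn_posterior(x, samples, labels, k):
    """Return {label: ki/k} for the k samples nearest to x."""
    order = sorted(range(len(samples)), key=lambda i: abs(samples[i] - x))
    nearest = [labels[i] for i in order[:k]]
    return {c: nearest.count(c) / k for c in set(nearest)}

samples = [1.0, 1.1, 1.3, 4.0, 4.2, 4.1, 1.2]
labels  = ["w1", "w1", "w1", "w2", "w2", "w2", "w2"]
print(knn_posterior(1.05, samples, labels, k=5))
# -> {'w1': 0.6, 'w2': 0.4}: ki/k is the fraction of cell samples labeled wi
```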

28 Nearest-Neighbor Rule
Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes.
- Let x’ ∈ Dn be the prototype closest to a test point x; the nearest-neighbor rule classifies x by assigning it the label associated with x’.
- If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate.
- If n → ∞, it is always possible to find x’ sufficiently close to x so that P(ωi | x’) ≈ P(ωi | x).
- k-NN rule: classify x by assigning it the label most frequently represented among the k nearest samples, i.e., by voting. (A library-based sketch follows below.)
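For comparison, the same rule can be run through scikit-learn's KNeighborsClassifier, assuming that library is available; this is only a usage sketch, not part of the original slides, and the data are illustrative.

```python
# A usage sketch of k-NN via scikit-learn (assumed installed).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]]
y_train = ["w1", "w1", "w2", "w2"]

clf = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
clf.fit(X_train, y_train)
print(clf.predict([[1.1, 1.0]]))            # -> ['w1']
```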

29 Other simple classifiers.

30 Linear Discriminant Classifier

31 Linear Classifier
1. In previous classifiers
- The underlying probability densities were known (or given).
- The training samples were used to estimate the parameters of those probabilities.
2. Linear classifier
- Instead, we assume the proper forms of the discriminant functions and use the samples to estimate the values of the classifier's parameters.
- They may not be optimal, but they are very simple to use, making them attractive candidates for initial, trial classifiers.

32 Linear Discriminant Functions
“Finding a linear discriminant function is formulated as a problem of minimizing a criterion function, i.e., the training error.”
1. Definition
- A linear discriminant function is a linear combination of the components of x: g(x) = w^t x + w0, where w is the weight vector and w0 is the bias.
- A two-category classifier with this discriminant function uses the following rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0 (a sketch of this rule follows below).
- When g(x) is linear, the decision surface is a hyperplane.
- The decision regions of a linear machine are convex; this restriction limits the flexibility and accuracy of the classifier, making it most applicable to unimodal problems.
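A minimal sketch of the two-category rule above; the weight values w and w0 are illustrative and not taken from the slides.

```python
# A minimal sketch of a two-category linear discriminant g(x) = w^t x + w0.
def g(x, w, w0):
    """Linear discriminant: inner product of w and x, plus the bias w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def decide(x, w, w0):
    """Decide w1 if g(x) > 0, w2 if g(x) < 0; g(x) = 0 is the hyperplane boundary."""
    return "w1" if g(x, w, w0) > 0 else "w2"

w, w0 = [1.0, -1.0], 0.5
print(decide([2.0, 1.0], w, w0))   # g = 2 - 1 + 0.5 = 1.5 > 0 -> "w1"
print(decide([0.0, 2.0], w, w0))   # g = 0 - 2 + 0.5 = -1.5 < 0 -> "w2"
```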

33 Linear Discriminant Functions
2. Training
- Samples x1, …, xn with labels ω1, ω2, and g(x) = w^t x.
- Goal: find a weight vector w that classifies all of the samples correctly, i.e., decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
- By normalization (negating the samples of one class), this becomes w^t x > 0 for every sample; such a w is called a solution vector or separating vector, and it is not unique.
3. Gradient descent procedure
- Unconstrained optimization problem: find the w that minimizes a criterion J(w).
- Start with some w(1) and compute the gradient vector ∇J(w(1)); w(2) is obtained by moving some distance from w(1) in the direction of steepest descent.
- In general, w(k+1) = w(k) - η(k) ∇J(w(k)), where η is the learning rate; if it is too large, the process will overshoot and diverge. (A sketch of this update follows below.)
- An alternative method is Newton's method (second-order).
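Here is a minimal sketch of the basic gradient-descent update on a toy differentiable criterion; the quadratic J chosen here is illustrative only, standing in for a real classification criterion.

```python
# A minimal sketch of the update w(k+1) = w(k) - eta * grad J(w(k)),
# shown on the toy criterion J(w) = (w1 - 3)^2 + (w2 + 1)^2 (illustrative only).
def grad_J(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w, eta=0.1, steps=100):
    for _ in range(steps):
        grad = grad_J(w)
        # Too large an eta overshoots and diverges; too small converges slowly.
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

print(gradient_descent([0.0, 0.0]))   # approaches the minimizer [3, -1]
```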

34 Neural Network Classifier, SVMs

35 Generalized perceptron training procedure
1. Perceptron criterion function
- Simplest choice: J(w) is the number of samples misclassified by w.
- A better example: J(w) = Σ (-w^t y), where y ranges over Y, the set of samples misclassified by w.
- J(w) is never negative, being zero if w is a solution vector.
2. Minimizing the perceptron criterion
- The gradient is ∇J(w) = Σ (-y), giving the updating rule w(k+1) = w(k) + η(k) Σ y, where the sums run over the misclassified samples. (A sketch of this batch update follows below.)
3. Relaxation procedures
- These generalize the perceptron training procedure to broader criterion functions and minimization methods.
- Examples: 1) J(w) = Σ (-w^t y)^2 makes J continuous and gives a smoother search; 2) relaxation with a margin avoids the useless solution w = 0.
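A minimal sketch of the batch perceptron update on already-normalized, augmented samples (patterns of class ω2 are negated so the goal is w^t y > 0 for every y); the data and learning rate are illustrative.

```python
# A minimal sketch of the batch perceptron rule w(k+1) = w(k) + eta * sum(y)
# over the currently misclassified (normalized, augmented) samples y.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def batch_perceptron(Y, eta=1.0, max_iter=100):
    w = [0.0] * len(Y[0])
    for _ in range(max_iter):
        misclassified = [y for y in Y if dot(w, y) <= 0]
        if not misclassified:
            return w                      # a separating (solution) vector
        # Gradient of J(w) = sum(-w^t y) is sum(-y), so step in +sum(y).
        for j in range(len(w)):
            w[j] += eta * sum(y[j] for y in misclassified)
    return w

# Class w1 samples kept as-is, class w2 samples negated; a leading 1 absorbs the bias.
Y = [[1, 2.0, 2.0], [1, 1.5, 2.5], [-1, -0.5, -0.5], [-1, -1.0, -0.2]]
print(batch_perceptron(Y))   # a weight vector with w^t y > 0 for all y
```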

36 Multilayer Neural Networks
1. Classifying objects by learning a nonlinearity
- There are many problems for which linear discriminants are insufficient.
- The central difficulty is the choice of the appropriate nonlinear functions.
- A “brute-force” approach such as polynomial functions requires too many parameters.
- There is no automatic method for determining the nonlinearities.
2. Multilayer neural networks, or multilayer perceptrons
- A layered topology of linear discriminants provides a nonlinear mapping.
- The form of the nonlinearity is learned from the training data.
- ‘Backpropagation’ is the most popular learning method.
- The optimal topology depends on the problem at hand.

37 A Three-Layer Neural Network
1. Net activation
- Each hidden unit computes its net activation as the inner product of the inputs with its weights.
- Each hidden unit emits an output that is a nonlinear function of its activation.
- Each output unit similarly computes its net activation based on the hidden-unit outputs.
- An output unit then computes the nonlinear function of its net: zk = f(netk). (A sketch of this forward pass follows below.)
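A minimal sketch of this forward pass for a network with two inputs, two hidden units, and one output, assuming a sigmoid as the nonlinearity f; all weights and biases are illustrative, not the values on the next slide.

```python
# A minimal sketch of a three-layer (one hidden layer) forward pass,
# assuming sigmoid activations; weights are illustrative.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    # Hidden layer: each unit's net is the inner product of inputs and weights plus a bias.
    y = [sigmoid(sum(wji * xi for wji, xi in zip(w_j, x)) + b_j)
         for w_j, b_j in zip(W_hidden, b_hidden)]
    # Output unit: net activation over the hidden outputs, then the nonlinearity.
    net_z = sum(wk * yk for wk, yk in zip(w_out, y)) + b_out
    return sigmoid(net_z), y

z, y = forward([1.0, 0.0],
               W_hidden=[[1.0, -1.0], [0.5, 0.5]],
               b_hidden=[0.0, -0.5],
               w_out=[1.0, -1.0],
               b_out=0.2)
print(z)   # a single number in (0, 1): zk = f(netk)
```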

38 A Three-Layer Neural Network
2. An example: net_y1 = x1 + x2 + 0.5, net_y2 = x1 + x2 - 1.5, net_z = 0.7*y1 … *y2 - 1

39 A Three-Layer Neural Network
3. Expressive power
- Q: Can every decision be implemented by a three-layer network?
- A: Any continuous function can be implemented, given a sufficient number of hidden units.
- A two-layer network classifier can only implement a linear decision boundary.

40 Backpropagation Algorithm
Network learning:
- Learn the interconnection weights from the training patterns and their desired outputs.
- Backpropagation computes an error for each hidden unit and derives a learning rule from it.
- The procedure alternates “feedforward” and “learning” phases. (A single-step sketch follows below.)
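A minimal sketch of a single backpropagation step for the small network sketched earlier, assuming sigmoid units and a squared-error criterion E = 0.5*(t - z)^2; the names, data, and learning rate are illustrative and not the slides' notation.

```python
# A minimal sketch of one backpropagation step (sigmoid units, squared error,
# one output unit). Gradients follow the chain rule; values are illustrative.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, t, W_hidden, b_hidden, w_out, b_out, eta=0.5):
    # --- Feedforward ---
    y = [sigmoid(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
         for w_j, b_j in zip(W_hidden, b_hidden)]
    z = sigmoid(sum(w * yk for w, yk in zip(w_out, y)) + b_out)

    # --- Learning: error at the output unit (dE/dnet_z) ---
    delta_z = (z - t) * z * (1.0 - z)

    # Error driven back to each hidden unit (dE/dnet_yj)
    delta_y = [delta_z * w_out[j] * y[j] * (1.0 - y[j]) for j in range(len(y))]

    # Gradient-descent weight updates
    w_out = [w_out[j] - eta * delta_z * y[j] for j in range(len(y))]
    b_out -= eta * delta_z
    W_hidden = [[W_hidden[j][i] - eta * delta_y[j] * x[i] for i in range(len(x))]
                for j in range(len(W_hidden))]
    b_hidden = [b_hidden[j] - eta * delta_y[j] for j in range(len(b_hidden))]
    return W_hidden, b_hidden, w_out, b_out

params = ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5], [1.0, -1.0], 0.2)
W_h, b_h, w_o, b_o = backprop_step([1.0, 0.0], 1.0, *params)
print(w_o)   # output weights nudged so that z moves toward the target t = 1
```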

41 Decision Trees

42 Decision Trees
- In previous classifiers, feature vectors were real-valued numbers and there were distance measures between them.
- How can we use nominal data for classification? How can we efficiently learn categories from such nonmetric data?
- Example: classification of fruits based on their color and shape, e.g. Apple = (red, small_sphere), Watermelon = (green, big_sphere). A hand-built tree for this example is sketched below.
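A minimal sketch of the fruit example written as a hand-built tree of nested questions over nominal features; the particular questions, feature values, and fallback labels are illustrative.

```python
# A minimal sketch of a hand-built decision tree over nominal features:
# a sequence of questions written as nested conditions.
def classify_fruit(color, shape):
    # Root node: ask for the value of the 'shape' property first.
    if shape == "big_sphere":
        if color == "green":
            return "watermelon"
        return "other big fruit"
    elif shape == "small_sphere":
        if color == "red":
            return "apple"
        return "other small fruit"
    return "unknown"

print(classify_fruit("red", "small_sphere"))    # -> apple
print(classify_fruit("green", "big_sphere"))    # -> watermelon
```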

43 Decision Trees
1. Decision tree
- It is natural and intuitive to classify a pattern through a sequence of questions.
- The sequence of questions is displayed in a directed decision tree.
- Classification begins at the root node, which asks for the value of a property of the pattern.
- The links leaving a node must be mutually distinct and exhaustive.
- Each leaf node bears a category label.
- Advantages: easy interpretation, rapid classification, easy incorporation of prior knowledge, and a natural mapping to rule-based classification.
2. Issues for tree-growing algorithms
- How many decision outcomes or splits will there be at a node?
- Which property should be tested at a node?
- When should a node be declared a leaf?

44 The desire to find reliable answers demands more powerful classification algorithms with a better understanding of the data (see Pattern Recognition and Oracle Mining in the spring class).

45 Imbalance and Sampling; Ensemble, Bootstrapping, Bagging; Cross Validation

