Bayesian Decision Theory

Outline: Basic Concepts; Discriminant Functions; The Normal Density; ROC Curves

Bayes Formula
How do we combine a priori and class-conditional probabilities to obtain the probability of a state of nature? We use Bayes formula:
$P(c_j \mid x) = \dfrac{p(x \mid c_j)\, P(c_j)}{p(x)}$
Here $P(c_j \mid x)$ is the posterior probability, $p(x \mid c_j)$ is the likelihood, $P(c_j)$ is the prior probability, and $p(x)$ is the evidence.
Bayes decision: choose $c_1$ if $P(c_1 \mid x) > P(c_2 \mid x)$; otherwise choose $c_2$.

Simplifying Bayes Rule
Bayes formula: $P(c_j \mid x) = \dfrac{p(x \mid c_j)\, P(c_j)}{p(x)}$
The evidence $p(x)$ is the same for all states or classes, so we can dispense with it to make a decision.
Rule: choose $c_1$ if $p(x \mid c_1) P(c_1) > p(x \mid c_2) P(c_2)$; otherwise choose $c_2$.
Observations:
If $p(x \mid c_1) = p(x \mid c_2)$, then the decision depends on the prior probabilities only.
If $P(c_1) = P(c_2)$, then the decision depends on the likelihoods only.
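The simplified rule can be sketched in a few lines of Python. This is only an illustration: the likelihood and prior values are hypothetical, and the function names are not from the slides. It shows that comparing $p(x \mid c_j) P(c_j)$ gives the same decision as comparing the full posteriors.

```python
def class_scores(likelihoods, priors):
    """Unnormalized posteriors p(x|c_j) * P(c_j), one per class."""
    return [lik * prior for lik, prior in zip(likelihoods, priors)]


def posteriors(likelihoods, priors):
    """Posteriors P(c_j|x): the scores divided by the evidence p(x)."""
    scores = class_scores(likelihoods, priors)
    evidence = sum(scores)
    return [score / evidence for score in scores]


def bayes_decide(likelihoods, priors):
    """Index of the class with the largest score; p(x) is never needed."""
    scores = class_scores(likelihoods, priors)
    return max(range(len(scores)), key=scores.__getitem__)


likelihoods = [0.6, 0.3]  # hypothetical p(x|c1), p(x|c2)
priors = [0.2, 0.8]       # hypothetical P(c1), P(c2)
decision = bayes_decide(likelihoods, priors)  # index 1, i.e. class c2
```

Here the likelihood favors c1, but the much larger prior of c2 decides the outcome.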

Loss Function
Let $\{c_1, c_2, \ldots\}$ be the possible states of nature.
Let $\{a_1, a_2, \ldots\}$ be the possible actions.
The loss function $\lambda(a_i \mid c_j)$ is the loss incurred for taking action $a_i$ when the state of nature is $c_j$.
Suppose we observe a particular $x$ and think about taking action $a_i$. If the true state is $c_j$, the loss will be $\lambda(a_i \mid c_j)$.
Expected loss: $R(a_i \mid x) = \sum_j \lambda(a_i \mid c_j)\, P(c_j \mid x)$ (this is also called the conditional risk).
Decision: select the action that minimizes the conditional risk (** best possible performance **).
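As a minimal sketch of the minimum-risk decision, the code below evaluates the conditional risk of every action and picks the smallest. The loss table and the posteriors are made-up numbers chosen for illustration.

```python
def conditional_risk(loss_row, posteriors):
    """R(a_i|x) = sum_j lambda(a_i|c_j) * P(c_j|x) for one action a_i."""
    return sum(loss * p for loss, p in zip(loss_row, posteriors))


def min_risk_action(loss, posteriors):
    """Return (index of the best action, list of all conditional risks)."""
    risks = [conditional_risk(row, posteriors) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Hypothetical loss table: loss[i][j] is lambda(a_i | c_j).
loss = [
    [0.0, 10.0],  # action a1: free if c1 is true, very costly if c2 is true
    [1.0, 0.0],   # action a2: small cost if c1 is true, free if c2 is true
]
posterior = [0.8, 0.2]  # hypothetical P(c1|x), P(c2|x)
best_action, risks = min_risk_action(loss, posterior)
```

With these numbers the risks are 2.0 for a1 and 0.8 for a2, so a2 is chosen even though c1 is the more probable state. This is the effect of an asymmetric loss.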

Discriminant Functions
How do we represent pattern classifiers? The most common way is through discriminant functions. Remember we use $\{c_1, c_2, \ldots\}$ to be the possible states of nature. For each class $c_i$ we create a discriminant function $g_i(x)$.
The classifier assigns class $c_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$.
Our classifier is a network or machine that computes $c$ discriminant functions.

Signal Detection Theory
Suppose we want to detect a single pulse from a signal. We assume the signal has some random noise.
When the signal is present we observe a normal distribution with mean $\mu_2$.
When the signal is not present we observe a normal distribution with mean $\mu_1$.
We assume the same standard deviation $\sigma$ in both cases.
Can we measure the discriminability of the problem? Can we do this independently of the threshold $x^*$?
Discriminability: $d' = |\mu_2 - \mu_1| / \sigma$

Figure 2.19

Signal Detection Theory
How do we find $d'$ if we do not know the means $\mu_1$, $\mu_2$, or the threshold $x^*$? From the data we can compute:
$P(x > x^* \mid c_2)$: a hit.
$P(x > x^* \mid c_1)$: a false alarm.
$P(x < x^* \mid c_2)$: a miss.
$P(x < x^* \mid c_1)$: a correct rejection.
Each threshold gives one point in the space of hit and false alarm rates. Plotting these points for all thresholds gives a ROC (receiver operating characteristic) curve. With it we can distinguish between discriminability and bias.
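Under the equal-variance Gaussian assumption stated earlier, the discriminability can be recovered from a single (hit rate, false alarm rate) pair as the difference of their standard-normal quantiles. The sketch below checks this numerically. The means, standard deviation, and thresholds are hypothetical, and the function name `d_prime` is introduced here only for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit) - z(false alarm), with z the standard normal quantile."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def rates(mu1, mu2, sigma, threshold):
    """Hit and false alarm rates for a given threshold x*."""
    hit = 1.0 - NormalDist(mu2, sigma).cdf(threshold)          # P(x > x* | c2)
    false_alarm = 1.0 - NormalDist(mu1, sigma).cdf(threshold)  # P(x > x* | c1)
    return hit, false_alarm


# Hypothetical problem: noise mean 0, signal mean 2, shared sigma 1.
estimate_a = d_prime(*rates(0.0, 2.0, 1.0, threshold=0.5))
estimate_b = d_prime(*rates(0.0, 2.0, 1.0, threshold=1.5))
```

Both thresholds give the same estimate, |2 - 0| / 1 = 2. The threshold moves the point along the ROC curve (bias) without changing the discriminability.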

Figure 2.20

Linear Discriminant Functions
Outline: Least Squares Method; Fisher's Linear Discriminant; Probabilistic Generative Models

Linear Discriminant Functions
A discriminant function is a linear combination of the components of $x$:
$g(x) = w^T x + w_0$
where $w$ is the weight vector and $w_0$ is the bias or threshold weight.
For the two-class problem we can use the following decision rule: decide $c_1$ if $g(x) > 0$ and $c_2$ if $g(x) < 0$.
For the general case we will have one discriminant function for each class.

Geometry for Linear Models

Figure 5.3

Linear Machines
To avoid the problem of ambiguous regions we can use linear machines: we define $c$ linear discriminant functions and choose the one with the highest value for a given $x$:
$g_k(x) = w_k^T x + w_{k0}, \quad k = 1, \ldots, c$
In this case the decision regions are convex and thus are limited in flexibility and accuracy.
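A linear machine is just an argmax over $c$ linear functions. The following sketch uses three hypothetical weight vectors in two dimensions; none of the numbers come from the slides.

```python
def linear_machine(x, weights, biases):
    """Evaluate g_k(x) = w_k . x + w_k0 for every class and return
    (index of the winning class, list of all discriminant values)."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


# Hypothetical three-class machine in a 2-D feature space.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

class_a, scores_a = linear_machine([2.0, 1.0], weights, biases)
class_b, scores_b = linear_machine([-1.0, -2.0], weights, biases)
```

Every point receives exactly one label (ties aside), which is why no ambiguous regions remain.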

Figure 5.4

Generalized Linear Discriminant Functions
We could even add more terms $w_{ijk} x_i x_j x_k$ and obtain the class of polynomial discriminant functions. The generalized form is
$g(x) = \sum_i w_i\, y_i(x)$, or in vector form $g(x) = w^T y$,
where the summation goes over all functions $y_i(x)$. The $y_i(x)$ functions are called the phi or $\varphi$ functions. The function is now linear in the $y_i(x)$.
The functions map a $d$-dimensional $x$-space into a $d'$-dimensional $y$-space.
Example: $g(x) = w_1 + w_2 x + w_3 x^2$ with $y = (1, x, x^2)^T$.
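The quadratic example can be sketched directly: map the scalar $x$ to $y = (1, x, x^2)$ and apply a linear function in $y$-space. The weight values below are hypothetical and chosen so that $g(x) = x^2 - 1$.

```python
def phi(x):
    """Map a scalar x into the 3-dimensional y-space (1, x, x^2)."""
    return [1.0, x, x * x]


def g(x, w):
    """Discriminant that is linear in y = phi(x) but quadratic in x."""
    return sum(w_i * y_i for w_i, y_i in zip(w, phi(x)))


w = [-1.0, 0.0, 1.0]  # hypothetical weights: g(x) = x^2 - 1
values = [g(x, w) for x in (-2.0, 0.0, 2.0)]
```

The positive region is $|x| > 1$, two disconnected intervals in $x$-space, even though the decision boundary is a single hyperplane in $y$-space.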

Figure 5.5

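Least Squares
And how do we compute y(x)? How do we find the values of $w_0, w_1, w_2, \ldots, w_d$?
We can simply find the $w$ that minimizes an error function $E(w)$:
$E(w) = \frac{1}{2} \sum_n \left( g(x_n, w) - t_n \right)^2$
where the sum goes over the training examples.
Problems: the method lacks robustness (it is sensitive to outliers), and it implicitly assumes that the target vector has a Gaussian distribution.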
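For a one-dimensional input the minimizer of the squared error has a closed form, which makes a compact illustration. The data set below is hypothetical (it lies exactly on the line $t = 1 + 2x$), and the sketch covers only the simple case $g(x, w) = w_0 + w_1 x$.

```python
def fit_line(xs, ts):
    """Closed-form least squares for g(x) = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    w0 = mean_t - w1 * mean_x
    return w0, w1


def squared_error(xs, ts, w0, w1):
    """E(w) = 1/2 * sum of squared differences between g(x) and t."""
    return 0.5 * sum((w0 + w1 * x - t) ** 2 for x, t in zip(xs, ts))


xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]  # hypothetical targets on the line t = 1 + 2x
w0, w1 = fit_line(xs, ts)
```

Because the error is quadratic in $w$, a single outlying target can pull the fitted line a long way, which is the robustness problem mentioned above.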

Least Squares
[Figure: results of least squares compared with logistic regression.]

Fisher’s Linear Discriminant
The idea is to project the data onto one single dimension. We choose a projection that maximizes class separation and minimizes the variance within each class.
Find the $w$ that maximizes the function
$J(w) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$
where $m_1, m_2$ are the projected class means and $s_1^2, s_2^2$ the projected within-class variances. In matrix form:
$J(w) = \dfrac{w^T S_B\, w}{w^T S_W\, w}$
where $S_B$ is the between-class covariance matrix and $S_W$ is the within-class covariance matrix.
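To make the criterion concrete, the sketch below evaluates $J(w)$ for two candidate directions on a tiny hypothetical data set. It does not search for the optimal $w$. It only shows that a direction along which the classes overlap scores lower than one that separates them. The within-class spread is measured here as the sum of squared deviations of the projected points, which is an assumption about the normalization.

```python
def project(points, w):
    """Project each point onto the direction w."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, p)) for p in points]


def fisher_criterion(class1, class2, w):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) computed on the projections."""
    p1, p2 = project(class1, w), project(class2, w)
    m1 = sum(p1) / len(p1)
    m2 = sum(p2) / len(p2)
    s1 = sum((v - m1) ** 2 for v in p1)  # within-class scatter, class 1
    s2 = sum((v - m2) ** 2 for v in p2)  # within-class scatter, class 2
    return (m2 - m1) ** 2 / (s1 + s2)


# Hypothetical data: two elongated clusters lying side by side.
class1 = [(0.0, 0.0), (1.0, 1.5), (2.0, 2.0)]
class2 = [(2.0, 0.0), (3.0, 1.5), (4.0, 2.0)]

j_axis = fisher_criterion(class1, class2, (1.0, 0.0))   # project on x-axis
j_diag = fisher_criterion(class1, class2, (1.0, -1.0))  # project across clusters
```

For this data the second direction gives a criterion of 12 against 1 for the first, so it is by far the better projection.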

Fisher's Linear Discriminant
[Figure: two projections of the same data, one labeled "Wrong" and one labeled "Right".]

Probabilistic Generative Models
We first compute $g(x) = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d + w_0$. But instead we wish to have $P(C_k \mid x)$. To get conditional probabilities we compute a logistic function:
$L(g(x)) = \dfrac{1}{1 + \exp(-g(x))}$
and $L(g(x)) = P(C_k \mid x)$ if the two classes can be modeled as Gaussian distributions with equal covariance matrices.
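A short sketch of this two-step computation follows: a linear function, then the logistic squashing. The weights and the input are hypothetical.

```python
import math


def logistic(a):
    """L(a) = 1 / (1 + exp(-a)), maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def class_probability(x, w, w0):
    """L(g(x)) with g(x) = w . x + w0."""
    g = sum(w_i * x_i for w_i, x_i in zip(w, x)) + w0
    return logistic(g)


# Hypothetical weights; this x lies exactly on the boundary g(x) = 0.
p_boundary = class_probability([1.0, 2.0], w=[0.5, -0.25], w0=0.0)
p_positive = class_probability([4.0, 0.0], w=[0.5, -0.25], w0=0.0)
```

Points on the decision boundary get probability 0.5, and the probability approaches 1 or 0 as $g(x)$ grows in either direction.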

Decision Trees
Outline: Definition; Mechanism; Splitting Functions; Issues in Decision-Tree Learning (avoiding overfitting through pruning; numeric and missing attributes)

Information Gain
$IG(A) = H(S) - \sum_v \dfrac{|S_v|}{|S|}\, H(S_v)$
$H(S)$ is the entropy of all examples. $H(S_v)$ is the entropy of one subsample after partitioning $S$ based on all possible values $v$ of attribute $A$.
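The formula can be sketched as follows. The logarithm is taken in base 2 (entropy in bits), which is an assumption since the slide does not fix the base, and the attribute values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum p log2 p over the class proportions in S."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(values, labels):
    """IG(A) = H(S) - sum_v |S_v|/|S| * H(S_v) for one attribute A."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [c for a, c in zip(values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
perfect = information_gain(["a", "a", "b", "b"], labels)  # splits classes cleanly
useless = information_gain(["a", "b", "a", "b"], labels)  # tells us nothing
```

An attribute that separates the classes perfectly recovers the full entropy of one bit, and an attribute independent of the class gains nothing.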

Gain Ratio
Let's define the entropy of the attribute:
$H(A) = -\sum_j p_j \log p_j$
where $p_j$ is the probability that attribute $A$ takes value $V_j$. Then
$\mathrm{GainRatio}(A) = IG(A) / H(A)$
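A minimal sketch of the gain ratio follows, taking the information gain as a given number so that the example stays short. Base-2 logarithms and the guard against a zero denominator are assumptions of this sketch.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """H(A) = -sum p_j log2 p_j over the observed values of attribute A."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def gain_ratio(information_gain, values):
    """GainRatio(A) = IG(A) / H(A); returns 0.0 if A takes a single value."""
    split = attribute_entropy(values)
    return information_gain / split if split > 0 else 0.0


# Two hypothetical attributes with the same information gain of 1 bit.
ratio_two_values = gain_ratio(1.0, ["a", "a", "b", "b"])
ratio_four_values = gain_ratio(1.0, ["a", "b", "c", "d"])
```

With the same information gain, the attribute that fragments the data into four values scores half as much as the one with two values. Dividing by $H(A)$ penalizes many-valued attributes.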

How deep should the tree be? Overfitting the Data
A tree overfits the data if we let it grow deep enough so that it begins to capture "aberrations" in the data that harm the predictive power on unseen examples.
[Figure: examples plotted over the attributes humidity and size, with thresholds t2 and t3. A few examples are possibly just noise, but the tree is grown larger to capture them.]

Overfitting the Data: Definition
Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h' in H where h has better classification accuracy than h' on D but worse classification accuracy than h' on D', where D' is data not used for training.
[Figure: curves for training data and testing data plotted against the size of the tree, with the overfitting region marked.]

Causes for Overfitting the Data
What causes a hypothesis to overfit the data?
Random errors or noise: examples have an incorrect class label or incorrect attribute values.
Coincidental patterns: by chance, examples seem to deviate from a pattern due to the small size of the sample.
Overfitting is a serious problem that can cause strong performance degradation.

Decision Tree Pruning
1.) Grow the tree to learn the training data.
2.) Prune the tree to avoid overfitting the data.

Methods to Validate the New Tree
Training and Validation Set Approach
Divide dataset D into a training set TR and a validation set TE.
Build a decision tree on TR.
Test pruned trees on TE to decide the best final tree.

Reduced Error Pruning
Main idea:
1) Consider all internal nodes in the tree.
2) For each node, check if removing it (along with the subtree below it) and assigning the most common class to it does not harm accuracy on the validation set.
3) Pick the node n* that yields the best performance and prune its subtree.
4) Go back to (2) until no more improvements are possible.
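The procedure can be sketched on a toy tree. The nested-dictionary tree format, the attribute names, and the validation examples are all hypothetical choices for this illustration. Each internal node stores the attribute it tests, its children, and the majority class that would be used if it were turned into a leaf.

```python
import copy


def predict(node, x):
    """Follow the tree down to a leaf; leaves carry a 'label'."""
    while "label" not in node:
        node = node["children"][x[node["attr"]]]
    return node["label"]


def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def internal_paths(node, path=()):
    """Yield the path (sequence of attribute values) to every internal node."""
    if "label" in node:
        return
    yield path
    for value, child in node["children"].items():
        yield from internal_paths(child, path + (value,))


def pruned_copy(tree, path):
    """Copy of the tree with the node at `path` replaced by a majority leaf."""
    if not path:
        return {"label": tree["majority"]}
    new_tree = copy.deepcopy(tree)
    parent = new_tree
    for value in path[:-1]:
        parent = parent["children"][value]
    target = parent["children"][path[-1]]
    parent["children"][path[-1]] = {"label": target["majority"]}
    return new_tree


def reduced_error_pruning(tree, validation):
    best_acc = accuracy(tree, validation)
    while "label" not in tree:
        candidates = [pruned_copy(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < best_acc:
            break  # every possible prune would harm validation accuracy
        tree, best_acc = best, accuracy(best, validation)
    return tree


tree = {
    "attr": "outlook", "majority": "yes",
    "children": {
        "sunny": {"attr": "windy", "majority": "yes",
                  "children": {True: {"label": "no"}, False: {"label": "yes"}}},
        "rainy": {"label": "no"},
    },
}
validation = [
    ({"outlook": "sunny", "windy": True}, "yes"),
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": False}, "no"),
]
pruned = reduced_error_pruning(tree, validation)
```

In this example the test on `windy` is replaced by a leaf because doing so raises validation accuracy from 2/3 to 1, while pruning the root afterwards would lower it again, so the loop stops.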

Rule Post-Pruning
Main idea:
1) Convert the tree into a rule-based system.
2) Prune every single rule first by removing redundant conditions.
3) Sort rules by accuracy.

Example
[Figure: original tree that tests x1 at the root and then x2 or x3, with leaves A, B, A, C.]
Rules:
~x1 & ~x2 -> Class A
~x1 & x2 -> Class B
x1 & ~x3 -> Class A
x1 & x3 -> Class C
Possible rules after pruning (based on validation set):
~x? -> Class A
~x1 & x2 -> Class B
~x? -> Class A
x1 & x3 -> Class C

Discretizing Continuous Attributes
Example: attribute temperature.
1) Order all values in the training set.
2) Consider only those cut points where there is a change of class.
3) Choose the cut point that maximizes information gain.
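The three steps can be sketched as below. Placing each candidate cut halfway between two adjacent values is an assumption of this sketch, as are the base-2 entropy and the temperature data.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def candidate_cuts(values, labels):
    """Steps 1 and 2: sort, then keep midpoints where the class changes."""
    pairs = sorted(zip(values, labels))
    return [
        (v1 + v2) / 2
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
        if c1 != c2 and v1 != v2
    ]


def gain_of_cut(values, labels, cut):
    """Information gain of the binary split value <= cut versus value > cut."""
    left = [c for v, c in zip(values, labels) if v <= cut]
    right = [c for v, c in zip(values, labels) if v > cut]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_cut(values, labels):
    """Step 3: the candidate cut with maximal information gain."""
    return max(candidate_cuts(values, labels),
               key=lambda cut: gain_of_cut(values, labels, cut))


temperature = [40, 48, 60, 72, 80, 90]            # hypothetical readings
play = ["no", "no", "yes", "yes", "yes", "no"]    # hypothetical class labels
cuts = candidate_cuts(temperature, play)
chosen = best_cut(temperature, play)
```

Only two of the five possible midpoints need to be evaluated, because the class changes only between 48 and 60 and between 80 and 90.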

Perceptron as one Type of Linear Discriminants
Outline: Introduction; Design of Primitive Units; Perceptrons

Design of Primitive Units Perceptrons
Definition: a perceptron is a step function based on a linear combination of real-valued inputs. If the combination is above a threshold it outputs a 1, otherwise it outputs a -1.
[Figure: perceptron unit with inputs x0 = 1, x1, …, xn, weights w0, w1, …, wn, a summation node Σ, and output 1 or -1.]

Design of Primitive Units Learning Perceptrons
$O(x_1, x_2, \ldots, x_n) = 1$ if $w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n > 0$, and $-1$ otherwise.
To simplify our notation we can represent the function as follows:
$O(X) = \mathrm{sgn}(WX)$, where $\mathrm{sgn}(y) = 1$ if $y > 0$ and $-1$ otherwise.
Learning a perceptron means finding the right values for $W$. The hypothesis space of a perceptron is the space of all weight vectors.

Design of Primitive Units
Design a two-input perceptron that implements the Boolean function X1 OR ~X2 (X1 OR not X2). Assume the independent weight is always +0.5 (assume W0 = +0.5 and X0 = 1). You simply have to provide adequate values for W1 and W2.
[Figure: perceptron with inputs x0 = 1, x1, x2, weights W0 = +0.5, W1 = ?, W2 = ?, and a summation node Σ.]
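A candidate answer to this exercise can be checked mechanically against the truth table. The sketch assumes that the inputs are encoded as 0 and 1 and that "true" corresponds to an output of +1. The pair W1 = 1, W2 = -1 is one candidate that passes, not necessarily the intended solution.

```python
def perceptron_output(x1, x2, w1, w2, w0=0.5):
    """Step output of the unit: +1 if w0 + w1*x1 + w2*x2 > 0, else -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1


def implements_x1_or_not_x2(w1, w2):
    """Check the candidate weights on all four input combinations."""
    for x1 in (0, 1):
        for x2 in (0, 1):
            target = 1 if (x1 == 1 or x2 == 0) else -1
            if perceptron_output(x1, x2, w1, w2) != target:
                return False
    return True


candidate_ok = implements_x1_or_not_x2(1.0, -1.0)
candidate_bad = implements_x1_or_not_x2(1.0, 1.0)
```

The second candidate fails on the input (0, 1), where the function should be false but the weighted sum 1.5 is positive.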

Design of Primitive Units Perceptron Algorithms
How do we learn the weights of a single perceptron? There are two approaches:
A. Perceptron rule
B. Delta rule
Algorithm for learning using the perceptron rule:
1) Assign random values to the weight vector.
2) Apply the perceptron rule to every training example.
3) Are all training examples correctly classified? Yes: quit. No: go back to Step 2.

Design of Primitive Units A. Perceptron Rule
The perceptron training rule: for a new training example $X = (x_1, x_2, \ldots, x_n)$, update each weight according to this rule:
$w_i = w_i + \Delta w_i$, where $\Delta w_i = \eta\,(t - o)\, x_i$
$t$: target output
$o$: output generated by the perceptron
$\eta$: constant called the learning rate (e.g., 0.1)
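The training rule can be sketched as a short loop. Two choices here are assumptions made for reproducibility rather than part of the slides: the weights start at zero instead of random values, and a maximum number of epochs guards against data that is not linearly separable. The Boolean OR data set is only an example.

```python
def sgn(y):
    return 1 if y > 0 else -1


def perceptron_predict(w, x):
    """Output of the unit; w[0] is the weight of the constant input x0 = 1."""
    return sgn(sum(w_i * x_i for w_i, x_i in zip(w, [1.0] + list(x))))


def train_perceptron(examples, eta=0.1, max_epochs=100):
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = perceptron_predict(w, x)
            if o != t:
                mistakes += 1
            # Perceptron rule: w_i <- w_i + eta * (t - o) * x_i
            w = [w_i + eta * (t - o) * x_i
                 for w_i, x_i in zip(w, [1.0] + list(x))]
        if mistakes == 0:  # every example was classified correctly
            break
    return w


# Boolean OR with targets encoded as +1 / -1 (linearly separable).
or_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_perceptron(or_data)
```

When the output is already correct, $(t - o)$ is zero and the weights do not change, so the update line can run unconditionally.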

Design of Primitive Units B. The Delta Rule
What happens if the examples are not linearly separable? To address this situation we try to approximate the real concept using the delta rule. The key idea is to use a gradient descent search. We will try to minimize the following error:
$E = \frac{1}{2} \sum_d (t_d - o_d)^2$
where the sum goes over all training examples $d$. Here $o_d$ is the inner product $WX$ and not $\mathrm{sgn}(WX)$ as with the perceptron algorithm.

The idea is to find a minimum of the error function E in the space of weights.
[Figure: error surface E(W) plotted over the weights w1 and w2.]

Design of Primitive Units Derivation of the Rule
The gradient of $E$ with respect to the weight vector $W$ is denoted $\nabla E(W)$:
$\nabla E(W) = \left[ \dfrac{\partial E}{\partial w_0}, \dfrac{\partial E}{\partial w_1}, \ldots, \dfrac{\partial E}{\partial w_n} \right]$
$\nabla E(W)$ is a vector with the partial derivatives of $E$ with respect to each weight $w_i$.
Key concept: the gradient vector points in the direction of steepest increase in $E$.

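For a new training example $X = (x_1, x_2, \ldots, x_n)$, update each weight according to this rule: $w_i = w_i + \Delta w_i$, where $\Delta w_i = -\eta\, \partial E / \partial w_i$ and $\eta$ is the learning rate (e.g., 0.1). The minus sign moves the weights against the gradient, toward lower error.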

Design of Primitive Units The gradient
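How do we compute $\nabla E(W)$? It is easy to see that
$\dfrac{\partial E}{\partial w_i} = -\sum_d (t_d - o_d)\, x_{id}$
where $d$ ranges over the training examples and $x_{id}$ is the $i$-th input of example $d$. So that gives us the following equation:
$\Delta w_i = \eta \sum_d (t_d - o_d)\, x_{id}$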

Design of Primitive Units The algorithm using the Delta Rule
Algorithm for learning using the delta rule:
1) Assign random values to the weight vector.
2) Repeat until the stopping condition is met (until the error is small):
   a) Initialize each $\Delta w_i$ to zero.
   b) For each example, update $\Delta w_i$: $\Delta w_i = \Delta w_i + \eta\,(t - o)\, x_i$
   c) Update $w_i$: $w_i = w_i + \Delta w_i$
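A minimal sketch of this batch algorithm follows. It starts from zero weights rather than random ones and runs for a fixed number of epochs instead of testing the error, both of which are simplifications. The data set (targets on the line $t = 1 + 2x$) and the learning rate are hypothetical.

```python
def linear_output(w, x):
    """Unthresholded output o = W . X, with w[0] the weight of x0 = 1."""
    return sum(w_i * x_i for w_i, x_i in zip(w, [1.0] + list(x)))


def train_delta_rule(examples, eta=0.05, epochs=500):
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)  # initialize each delta_w_i to zero
        for x, t in examples:
            o = linear_output(w, x)
            # Accumulate: delta_w_i <- delta_w_i + eta * (t - o) * x_i
            delta = [d_i + eta * (t - o) * x_i
                     for d_i, x_i in zip(delta, [1.0] + list(x))]
        # One weight update per pass over the data.
        w = [w_i + d_i for w_i, d_i in zip(w, delta)]
    return w


def total_error(w, examples):
    """E = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)


data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = train_delta_rule(data)
```

Unlike the perceptron rule, the weights change only after all examples have been seen, and the update uses the linear output rather than its sign.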

Design of Primitive Units Difference between Perceptron and Delta Rule
The perceptron rule is based on an output from a step function, whereas the delta rule uses the linear combination of inputs directly.
The perceptron rule is guaranteed to converge to a consistent hypothesis, assuming the data is linearly separable.
The delta rule converges in the limit toward the minimum-error hypothesis, but it does not need the condition of linearly separable data.

Artificial Neural Networks
Outline: Introduction; Design of Primitive Units; Perceptrons; The Backpropagation Algorithm

One Single Unit
To make nonlinear partitions on the space we need to define each unit as a nonlinear function (unlike the perceptron). One solution is to use the sigmoid unit:
$O = \sigma(g(x)) = \dfrac{1}{1 + e^{-g(x)}}$
[Figure: sigmoid unit with inputs x0 = 1, x1, …, xn, weights w0, w1, …, wn, a summation node Σ that computes g(x), and output O.]

More Precisely
$O(x_1, x_2, \ldots, x_n) = \sigma(WX)$, where $\sigma(WX) = \dfrac{1}{1 + e^{-WX}}$.
Function $\sigma$ is called the sigmoid or logistic function. It has the following property:
$\dfrac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$
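The derivative property is easy to confirm numerically. The sketch compares the closed form with a central finite difference at an arbitrary point. The step size and the test point are arbitrary choices.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y):
    """Closed form: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


def numeric_derivative(f, y, h=1e-5):
    """Central finite-difference approximation of f'(y)."""
    return (f(y + h) - f(y - h)) / (2.0 * h)


closed_form = sigmoid_derivative(0.7)
approximation = numeric_derivative(sigmoid, 0.7)
```

The property matters in practice because the derivative needed during training can be computed from the unit's output alone.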

Backpropagation Algorithm
Goal: to learn the weights for all links in an interconnected multilayer network.
We begin by defining our measure of error:
$E(W) = \frac{1}{2} \sum_m \sum_k (t_{mk} - o_{mk})^2$
where $k$ varies along the output nodes and $m$ over the training examples.
The idea is to use gradient descent again over the space of weights to search for a minimum of the error. There is no guarantee that the minimum found is the global one.

Algorithm
1) Create a network with $n_{in}$ input nodes, $n_{hidden}$ internal nodes, and $n_{out}$ output nodes.
2) Initialize all weights to small random numbers.
3) Until the error is small, do: for each example X,
   a) propagate example X forward through the network;
   b) propagate errors backward through the network.

Propagating Error Backward
For each output node $k$ compute the error:
$\delta_k = O_k (1 - O_k)(t_k - O_k)$
For each hidden unit $h$, calculate the error:
$\delta_h = O_h (1 - O_h) \sum_k W_{kh}\, \delta_k$
Update each network weight:
$W_{ji} = W_{ji} + \Delta W_{ji}$, where $\Delta W_{ji} = \eta\, \delta_j X_{ji}$
($W_{ji}$ and $X_{ji}$ are the weight and the input from node $i$ to node $j$.)
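The forward and backward passes can be sketched for a network with one hidden layer of sigmoid units. The class name `TinyNetwork`, the initialization range, the learning rate, and the OR training data are all assumptions for illustration. The hidden-layer errors are computed before any weight is changed, so that they use the same weights as the forward pass.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


class TinyNetwork:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; index 0 is the weight of the bias input.
        self.w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xs = [1.0] + list(x)
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xs)))
                  for row in self.w_hidden]
        hs = [1.0] + hidden
        out = [sigmoid(sum(w * v for w, v in zip(row, hs)))
               for row in self.w_out]
        return xs, hs, out

    def train_example(self, x, t, eta=0.5):
        xs, hs, out = self.forward(x)
        # delta_k = O_k (1 - O_k) (t_k - O_k)
        delta_out = [o * (1 - o) * (t_k - o) for o, t_k in zip(out, t)]
        # delta_h = O_h (1 - O_h) sum_k W_kh delta_k  (hs[0] is the bias input)
        delta_hidden = [
            hs[h + 1] * (1 - hs[h + 1])
            * sum(self.w_out[k][h + 1] * delta_out[k] for k in range(len(out)))
            for h in range(len(self.w_hidden))
        ]
        # W_ji <- W_ji + eta * delta_j * X_ji
        for k, row in enumerate(self.w_out):
            for i in range(len(row)):
                row[i] += eta * delta_out[k] * hs[i]
        for h, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += eta * delta_hidden[h] * xs[i]


def total_error(net, data):
    return 0.5 * sum((t_k - o_k) ** 2
                     for x, t in data
                     for t_k, o_k in zip(t, net.forward(x)[2]))


data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]
net = TinyNetwork(n_in=2, n_hidden=2, n_out=1)
error_before = total_error(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_example(x, t)
error_after = total_error(net, data)
```

The weights are updated after every example, which corresponds to the loop "for each example X" in the algorithm.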

Generalization and Overfitting
One obvious stopping point for backpropagation is to continue iterating until the error is below some threshold; this can lead to overfitting.
[Figure: training set error and validation set error plotted against the number of weight updates.]

Solutions
Use a validation set and stop when the error on this set is small.
Use 10-fold cross validation.
Use weight decay; the weights are decreased slowly on each iteration.

Learning Rates
Different learning rates significantly affect the performance of a neural network.
Optimal learning rate: leads to the error minimum in one learning step.
It has been found that a principled method to set the learning rate is to assign a value separately for each weight.


Adding Momentum
The weight update rule can be modified so as to depend on the last iteration. At iteration $s$ we have the following:
$\Delta W_{ji}(s) = \eta\, \delta_j X_{ji} + \alpha\, \Delta W_{ji}(s-1)$
where $\alpha$ ($0 \le \alpha \le 1$) is a constant called the momentum.
It helps the search keep moving through small local minima.
It increases the speed along flat regions.
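The effect of the momentum term can be seen on a single weight. In the sketch below the product $\delta_j X_{ji}$ is held constant at 1, a hypothetical stand-in for a region where the gradient does not change, and the values of $\eta$ and $\alpha$ are also just examples.

```python
def momentum_step(weight, gradient_term, previous_delta, eta=0.1, alpha=0.9):
    """One update: delta(s) = eta * gradient_term + alpha * delta(s - 1).

    gradient_term stands for the product delta_j * X_ji of the update rule.
    Returns the new weight and the step that was taken.
    """
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta


weight, delta = 0.0, 0.0
steps = []
for _ in range(3):
    weight, delta = momentum_step(weight, gradient_term=1.0,
                                  previous_delta=delta)
    steps.append(delta)
```

The steps grow from 0.1 to 0.19 to 0.271. With a constant gradient term they approach $\eta / (1 - \alpha) = 1.0$, ten times the step taken without momentum.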

Recurrent Networks (Time Series Analysis)
Recurrent networks have found application in time series prediction. Main ideas:
The output units are "fed back" and duplicated as auxiliary inputs.
During classification a pattern is presented to the input units.
The feedforward flow is done, and the outputs serve as auxiliary input nodes.
This produces new activations and new outputs.


Deep learning
Outline: Introduction; Classes of Deep Learning Networks (Unsupervised or Generative Learning; Supervised Learning; Hybrid Networks); References
Images obtained from multiple sources, including Wikipedia, articles, and blogs from the internet.

Introduction General Setting
For a long time deep networks were too difficult to train because of the credit-assignment problem: when the output is wrong, which of the many units is to blame? Recent discoveries have tackled this barrier, making it possible to learn deep network architectures.

Introduction
We want to capture compact, high-level representations in an efficient and iterative manner.
Learning takes place at several levels of representation.
Think about a hierarchy of concepts of increasing complexity.
Low-level concepts are the foundation for high-level concepts.

Deep Networks for Unsupervised Learning
There are no class labels during the learning process. There are many types of generative or unsupervised deep networks. Energy-based deep networks are very popular. Example: Deep Auto Encoder.

Deep Learning: Auto Encoder

Auto Encoder
The number of output features equals the number of input features. The encoder maps the input to the intermediate nodes, which encode the original data, and the decoder maps them back to the output.

"Deep" Auto Encoder Using a Step-Wise Mechanism
Key idea: pre-train each layer as an auto-encoder.

Deep Networks for Supervised Learning
The network can be trained directly using backpropagation if the right activation function is used.
ReLU: O.K.
Sigmoid: Wrong

Rectified Linear Unit: ReLU
The function is defined as follows: $f(x) = \max(0, x)$
It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.
In contrast, sigmoid units can saturate: for inputs of large magnitude their derivative is close to zero, which slows learning.
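The difference between the two activation functions shows up in their derivatives. The following sketch evaluates both at a large input. The input value 10 is arbitrary, and setting the ReLU derivative to 0 at exactly $x = 0$ is a convention assumed here.

```python
import math


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (the value at 0 is a convention)."""
    return 1.0 if x > 0 else 0.0


def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


relu_slope = relu_derivative(10.0)        # stays at 1 for any positive input
sigmoid_slope = sigmoid_derivative(10.0)  # about 4.5e-5: the unit has saturated
```

Error signals that pass backward through many saturated sigmoid units are multiplied by such small factors at every layer, whereas an active ReLU passes them through unchanged.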

Deep Networks for Unsupervised Learning
The idea is to disentangle factors of variation and to attain high-level representations. From the highest to the lowest level of the hierarchy:
Commercial Planes, Military Planes
Engine, Main Fuselage
Small Object Parts
Edges and Contours
Pixel Information

Deep Stacking Network: General Architecture
Networks are trained separately but in sequence.
The output layer is the input to a new network.
The original input can be re-used as input.

Deep Learning for Supervised Learning
Design of a Convolutional Neural Network (CNN)
A CNN has input, output, and hidden units. Hidden units can be of 3 types:
Convolutional
Pooling
Fully Connected
[Figure: network with an input layer, hidden convolutional, pooling, and fully connected layers, and an output layer.]

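Design of a Convolutional Neural Network (CNN)
Why are CNNs important when dealing with images? A fully connected network needs one weight per pixel for every hidden unit, which quickly becomes unmanageable. Instead, a CNN will have a local set of weights only. Each neuron will be connected to a few nearby neurons only (the idea of a receptive field).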

Convolutional Neural Networks: Convolution Operation
We need to learn the kernel K and share those parameters across the entire image.
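Parameter sharing means that the same small kernel is applied at every image position. The sketch below slides a kernel over an image without padding and without flipping the kernel (strictly a cross-correlation, which is what CNN layers usually compute). The image and the vertical-edge kernel are hypothetical, and in a real network the kernel values would be learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image ('valid' positions only) and return
    the weighted sum at each position. The kernel is not flipped."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1],
                 [-1, 1]]
feature_map = convolve2d(image, vertical_edge)
```

The feature map responds only in the middle column, where the brightness changes, and the four kernel weights are the only parameters, regardless of the image size.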

Convolutional Neural Networks Vertical and horizontal filters:

Design of a Convolutional Neural Network CNN Layers alternate between convolutional layers and pooling layers:

Design of a Convolutional Neural Network (CNN)
Pooling aggressively reduces the dimensionality of the feature space. The idea is as follows: we partition the image into a set of non-overlapping rectangles. For each region we simply output the maximum value of that region (set of pixels). This is called "Max Pooling".
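Max pooling over non-overlapping square regions can be sketched as follows. The region size of 2 and the input values are hypothetical, and rows or columns that do not fill a complete region are ignored in this sketch.

```python
def max_pool(image, size=2):
    """Split the image into non-overlapping size x size regions and keep
    the maximum value of each region."""
    return [
        [
            max(image[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(image[0]) - size + 1, size)
        ]
        for r in range(0, len(image) - size + 1, size)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool(feature_map)
```

A 4x4 map becomes a 2x2 map, so three quarters of the values are discarded while the strongest response in each region is kept.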

Design of a Convolutional Neural Network CNN Full convolutional neural network : Apply convolution, pooling (or subsampling) iteratively. Finally apply fully connected neural network:

Kernel Methods: Support Vector Machines

Support Vector Machines
What are support vector machines (SVMs)? A very popular classifier that is based on the concepts previously discussed on linear discriminants and the new concept of margins. To begin, SVMs preprocess the data by representing all examples in a higher dimensional space. With sufficiently high dimensions the classes can be separated by a hyperplane.

The Margin

The Goal in Support Vector Machines
Now, let t be 1 or – 1 depending on the example x being of class positive or negative. A separating hyperplane ensures that: t g(x) >= 0 The goal in support vector machines is to find the separating hyperplane with the “largest” margin. Margin is the distance between the hyperplane and the closest example to it.

The Support Vectors
Now the distance from a pattern $x$ to the hyperplane is $|g(x)| / \|w\|$. So let's change our objective to finding a vector $w$ that maximizes the margin $m$ in the equation:
$t\, g(x) / \|w\| \ge m$
Because we can rescale the $w$ vector (together with $w_0$) and leave the hyperplane in the same place, we can fix the scale so that the support vectors are those patterns $x$ for which $t\, g(x) = 1$. The margin is then $m = 1 / \|w\|$.
Support vectors are equally close to the hyperplane. These are the patterns that are most difficult to separate. These are the most "informative" patterns.

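The Support Vectors
We said we want to find a vector $w$ that maximizes the margin $m$ in the equation $t\, g(x) / \|w\| \ge m$. Once $w$ is scaled so that $t\, g(x) \ge 1$ for every example, with equality for the closest ones, the margin equals $1 / \|w\|$. This means all we really need to do is to maximize $\|w\|^{-1}$ under these constraints. So we have the following optimization problem:
$\arg\min_w \tfrac{1}{2} \|w\|^2$ subject to $t_i\, g(x_i) \ge 1$ for every training example $x_i$.
This can be solved using Lagrange multipliers.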

The Support Vectors
What happens when there are unavoidable errors?
$\arg\min_w \tfrac{1}{2} \|w\|^2 + \lambda \sum_i e_i$ subject to $t_i\, g(x_i) \ge 1 - e_i$ and $e_i \ge 0$,
where $e_i$ is the error incurred by example $x_i$. The $e_i$ are known as slack variables.

The Support Vectors
We can write this in a dual form (Karush-Kuhn-Tucker construction):
$\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j\, (x_i \cdot x_j)$
subject to $0 \le \alpha_i \le \lambda$ and $\sum_i \alpha_i t_i = 0$.

The Support Vectors
The final result is a set of $\alpha_i$, one for each training example. The optimal hyperplane can be expressed in the dual representation as:
$f(x) = \sum_i t_i \alpha_i \langle x_i, x \rangle + b$, where $w = \sum_i t_i \alpha_i x_i$.
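The dual representation can be evaluated directly once the multipliers are known. The sketch below does not solve the optimization problem. It uses a hypothetical two-point data set with multipliers chosen by hand so that the sum of $\alpha_i t_i$ is zero and both points satisfy $t_i f(x_i) = 1$.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def dual_decision(x, examples, alphas, targets, b):
    """f(x) = sum_i t_i * alpha_i * <x_i, x> + b."""
    return sum(t * a * dot(x_i, x)
               for x_i, a, t in zip(examples, alphas, targets)) + b


def primal_weights(examples, alphas, targets):
    """w = sum_i t_i * alpha_i * x_i."""
    dim = len(examples[0])
    return [sum(t * a * x_i[d]
                for x_i, a, t in zip(examples, alphas, targets))
            for d in range(dim)]


# Hypothetical data: one positive and one negative example.
examples = [(1.0, 0.0), (-1.0, 0.0)]
targets = [1, -1]
alphas = [0.5, 0.5]  # hand-picked multipliers, not the output of a solver
b = 0.0

w = primal_weights(examples, alphas, targets)
value = dual_decision((2.0, 3.0), examples, alphas, targets, b)
```

Evaluating $f$ through the dual form only ever touches dot products between examples, which is what later allows the dot product to be replaced by a kernel.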

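The Support Vectors
We can use kernel functions to map from the original space to a new space through a mapping $\phi$:
$\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j\, (\phi(x_i) \cdot \phi(x_j))$
subject to $0 \le \alpha_i \le \lambda$ and $\sum_i \alpha_i t_i = 0$.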

The Support Vectors
Computing the dot product is simplified. For polynomial kernels:
$\phi(x_i) \cdot \phi(x_j) = \sum x_i x_j + \sum x_i^2 x_j^2 + \ldots$
But fortunately that is equal to:
$(1 + x_i \cdot x_j)^2 = K(x_i, x_j)$
In general, all we need is to compute the dot product of all pairs of examples in the original space. This results in the Gram matrix $K$.
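For two-dimensional inputs the identity can be verified with an explicit feature map. The six-dimensional map below, with its factors of $\sqrt{2}$, is one standard choice for the kernel $(1 + x \cdot z)^2$, and the two test points are arbitrary.

```python
import math


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def phi(x):
    """Explicit feature map for the kernel (1 + x . z)^2 on 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


def kernel(x, z):
    """Same quantity computed in the original 2-D space."""
    return (1.0 + dot(x, z)) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
in_feature_space = dot(phi(x), phi(z))  # six multiplications after mapping
in_input_space = kernel(x, z)           # one 2-D dot product and a square
```

Both routes give 4 for these points, and the kernel route never constructs the six-dimensional vectors.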

The Support Vectors
The final formulation is as follows:
$\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j\, K(x_i, x_j)$
subject to $0 \le \alpha_i \le \lambda$ and $\sum_i \alpha_i t_i = 0$.

Kernels and the Gram Matrix
The final formulation is as follows:
$\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$
subject to $0 \le \alpha_i$ and $\sum_i \alpha_i y_i = 0$, where $y_i$ is the class label of example $x_i$.

The Gram Matrix
Figure extracted from a tutorial based on the book "Support Vector Machines" by John Shawe-Taylor and Nello Cristianini, 2000, Cambridge University Press.

The Gram Matrix
Mercer's Theorem establishes the conditions under which a continuous and symmetric kernel corresponds to an inner product in a feature space.
The kernel matrix $K$ is symmetric positive semi-definite (it has non-negative eigenvalues, or equivalently $z^T K z \ge 0$ for every vector $z$).
Every symmetric positive semi-definite matrix can be seen as a kernel matrix (an inner product in a space).

The Gram Matrix
Figure extracted from a tutorial based on the book "Support Vector Machines" by John Shawe-Taylor and Nello Cristianini, 2000, Cambridge University Press.

Polynomial Kernel
The general form is defined as follows:
$K(x_i, x_j) = \left[ (x_i \cdot x_j) + c \right]^q$
where $c$ is a constant and $q$ is the degree of the polynomial. It can be formally shown that a polynomial kernel satisfies Mercer's conditions.
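As a small illustration, the code below builds the Gram matrix of a polynomial kernel for three hypothetical points and evaluates the quadratic form $z^T K z$ for two test vectors. Checking a few vectors does not prove that the matrix is positive semi-definite. It only illustrates the property.

```python
def polynomial_kernel(x_i, x_j, c=1.0, q=2):
    """K(x_i, x_j) = (x_i . x_j + c)^q."""
    return (sum(a * b for a, b in zip(x_i, x_j)) + c) ** q


def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(a, b) for b in points] for a in points]


def quadratic_form(K, z):
    """z^T K z."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


points = [(1, 0), (0, 1), (1, 1)]  # hypothetical training examples
K = gram_matrix(points, polynomial_kernel)
form_a = quadratic_form(K, [1, -1, 0])
form_b = quadratic_form(K, [1, 1, -1])
```

The matrix is symmetric because the kernel is symmetric in its arguments, and both quadratic forms are non-negative (6 and 3 for these vectors).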

Polynomial kernel of degree 2.

Polynomial kernel of degree 3.
