Bayesian Learning, Regression-based learning

Slides from: Doug Gray, David Poole
Presentation transcript:

Bayesian Learning, Regression-based learning

Overview
• Bayesian Learning
  • Full
  • MAP learning
  • Maximum Likelihood Learning
• Learning Bayesian Networks (Fully observable)
• Regression and Logistic Regression

Full Bayesian Learning
• In Decision Trees (and in all non-Bayesian learning methods) the idea is always to find the best model that explains some observations
• In contrast, full Bayesian learning sees learning as Bayesian updating of a probability distribution over the hypothesis space, given data
  • H is the hypothesis variable
  • Possible hypotheses (values of H): h_1, ..., h_n
  • P(H) = prior probability distribution over the hypothesis space
• The j-th observation d_j gives the outcome of random variable D_j
  • Training data d = d_1, ..., d_k

Full Bayesian Learning
• Given the data so far, each hypothesis h_i has a posterior probability:
  P(h_i | d) = α P(d | h_i) P(h_i)   (Bayes theorem)
  where P(d | h_i) is called the likelihood of the data under each hypothesis and α is the normalization constant
• Predictions over a new entity X are a weighted average over the prediction of each hypothesis:
  P(X | d) = ∑_i P(X, h_i | d)
           = ∑_i P(X | h_i, d) P(h_i | d)
           = ∑_i P(X | h_i) P(h_i | d)      (given a hypothesis, the data does not add anything to the prediction)
           ∝ ∑_i P(X | h_i) P(d | h_i) P(h_i)
  • The weights are given by the data likelihood and prior of each h
• No need to pick one best-guess hypothesis!

Example
• Suppose we have 5 types of candy bags
  • 10% are 100% cherry candies (h_100, P(h_100) = 0.1)
  • 20% are 75% cherry + 25% lime candies (h_75, P(h_75) = 0.2)
  • 40% are 50% cherry + 50% lime candies (h_50, P(h_50) = 0.4)
  • 20% are 25% cherry + 75% lime candies (h_25, P(h_25) = 0.2)
  • 10% are 100% lime candies (h_0, P(h_0) = 0.1)
• Then we observe candies drawn from some bag
• Let’s call θ the parameter that defines the fraction of cherry candy in a bag, and h_θ the corresponding hypothesis
• Which of the five kinds of bag has generated my 10 observations? P(h_θ | d)
• What flavour will the next candy be? Prediction P(X | d)

Example
• If we re-wrap each candy and return it to the bag, our 10 observations are independent and identically distributed (i.i.d.), so
  P(d | h_θ) = ∏_j P(d_j | h_θ)   for j = 1, ..., 10
• For a given h_θ, the value of P(d_j | h_θ) is
  P(d_j = cherry | h_θ) = θ;   P(d_j = lime | h_θ) = (1 - θ)
• And given N observations, of which c are cherry and l = N - c lime:
  P(d | h_θ) = θ^c (1 - θ)^l
  • Binomial distribution: probability of the number of successes in a sequence of N independent trials with binary outcome, each of which yields success with probability θ (the binomial coefficient that counts orderings does not depend on θ and is omitted here)
• For instance, after observing 3 lime candies in a row:
  P([lime, lime, lime] | h_50) = 0.5^3 = 0.125
  because the probability of seeing lime for each observation is 0.5 under this hypothesis
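
The likelihood above can be sketched in a few lines of Python. This is only an illustration: the function name `likelihood` and its argument names are assumptions made for the sketch, not something taken from the slides.

```python
def likelihood(theta, cherries, limes):
    """P(d | h_theta) for i.i.d. draws: theta^c * (1 - theta)^l."""
    return theta ** cherries * (1.0 - theta) ** limes


# Three lime candies in a row under h_50 (theta = 0.5)
three_limes_h50 = likelihood(0.5, cherries=0, limes=3)
print(three_limes_h50)  # 0.125
```

Under the all-cherry hypothesis (theta = 1) the same data has likelihood 0, which is why h_100 is ruled out by a single lime candy.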

Posterior Probability of H
P(h_i | d) = α P(d | h_i) P(h_i)
[Figure: plot of P(h_100 | d), P(h_75 | d), P(h_50 | d), P(h_25 | d) and P(h_0 | d) as data comes in]
• Initially, the hypotheses with higher priors dominate (h_50 with prior = 0.4)
• As data comes in, the true hypothesis (h_0) starts dominating, as the probability of seeing this data given the other hypotheses gets increasingly smaller
  • After seeing three lime candies in a row, the probability that the bag is the all-lime one starts taking off
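
A minimal Python sketch of this posterior update, assuming the five candy hypotheses and priors from the example. The names `THETAS`, `PRIORS` and `posterior` are illustrative choices for this sketch only.

```python
# theta = fraction of cherry candies under h_100, h_75, h_50, h_25, h_0
THETAS = [1.0, 0.75, 0.5, 0.25, 0.0]
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]


def posterior(observations, thetas=THETAS, priors=PRIORS):
    """Return [P(h_i | d)] for a list of 'cherry' / 'lime' observations."""
    weights = []
    for theta, prior in zip(thetas, priors):
        like = 1.0
        for flavour in observations:
            like *= theta if flavour == "cherry" else 1.0 - theta
        weights.append(like * prior)      # P(d | h_i) P(h_i)
    total = sum(weights)                  # 1 / alpha
    return [w / total for w in weights]


for n in range(4):
    print(n, [round(p, 3) for p in posterior(["lime"] * n)])
```

With zero observations the function returns the priors; after three lime candies the all-lime hypothesis h_0 has the largest posterior, in line with the slide.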

Prediction Probability
• The probability that the next candy is lime increases with the probability that the bag is an all-lime one:
  P(next candy is lime | d) = ∑_i P(next candy is lime | h_i) P(h_i | d)
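
The weighted sum above can be sketched as follows, again assuming the candy hypotheses and priors of the example and a data set consisting only of lime candies. The function name `predict_lime` is an assumption of this sketch.

```python
THETAS = [1.0, 0.75, 0.5, 0.25, 0.0]   # cherry fraction per hypothesis
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]


def predict_lime(num_limes_seen):
    """P(next candy is lime | d) after observing only lime candies."""
    # Unnormalised posterior: P(d | h_i) P(h_i) with P(lime | h_i) = 1 - theta_i
    weights = [p * (1.0 - t) ** num_limes_seen for t, p in zip(THETAS, PRIORS)]
    total = sum(weights)
    # Weighted average of each hypothesis' own prediction
    return sum((1.0 - t) * w / total for t, w in zip(THETAS, weights))


for n in range(4):
    print(n, round(predict_lime(n), 3))
```

Before any data the prediction is 0.5, and it rises to roughly 0.8 after three lime candies.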

Overview
• Full Bayesian Learning
• MAP learning
• Maximum Likelihood Learning
• Learning Bayesian Networks
  • Fully observable
  • With hidden (unobservable) variables

MAP approximation
• Full Bayesian learning seems like a very safe bet, but unfortunately it does not work well in practice
  • Summing over the hypothesis space is often intractable (e.g., there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
• Very common approximation: Maximum a posteriori (MAP) learning
  • Instead of doing prediction by considering all possible hypotheses, as in
    P(X | d) = ∑_i P(X | h_i) P(h_i | d)
  • make predictions based on the hypothesis h_MAP that maximises P(h_i | d), i.e. that maximises P(d | h_i) P(h_i):
    P(X | d) ≈ P(X | h_MAP)

MAP approximation
• MAP is a good approximation when P(X | d) ≈ P(X | h_MAP)
  • In our example, h_MAP is the all-lime bag after only 3 candies, predicting that the next candy will be lime with p = 1
  • The Bayesian learner gave a prediction of about 0.8, which is safer after seeing only 3 candies
[Figure: plot of P(h_100 | d), P(h_75 | d), P(h_50 | d), P(h_25 | d) and P(h_0 | d)]
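
To make the comparison concrete, here is a small Python sketch that computes both predictions for the candy example after n lime candies. It assumes the hypotheses and priors given earlier; the function name `map_vs_bayes` is illustrative.

```python
THETAS = [1.0, 0.75, 0.5, 0.25, 0.0]   # cherry fraction per hypothesis
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]


def map_vs_bayes(num_limes_seen):
    """Return (P(lime | h_MAP), P(lime | d)) after seeing only lime candies."""
    weights = [p * (1.0 - t) ** num_limes_seen for t, p in zip(THETAS, PRIORS)]
    total = sum(weights)
    # MAP: keep only the single hypothesis maximising P(d | h_i) P(h_i)
    best = max(range(len(THETAS)), key=lambda i: weights[i])
    map_prediction = 1.0 - THETAS[best]
    # Full Bayesian: average the predictions of all hypotheses
    bayes_prediction = sum((1.0 - t) * w / total for t, w in zip(THETAS, weights))
    return map_prediction, bayes_prediction


for n in range(4):
    print(n, map_vs_bayes(n))
```

After three lime candies the MAP prediction jumps to 1.0 while the full Bayesian prediction stays near 0.8, which is the gap the slide points out.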

Bias
• As more data arrive, MAP and Bayesian predictions become closer, as MAP’s competing hypotheses become less likely
• It is often easier to find h_MAP (an optimization problem) than to deal with a large summation problem
• P(H) plays an important role in both MAP and Full Bayesian Learning
  • It defines the learning bias, i.e. which hypotheses are favoured
• It is used to define a tradeoff between model complexity and the model's ability to fit the data
  • More complex models can explain the data better => higher P(d | h_i), with the danger of overfitting
  • But simpler models have higher prior probability
  • I.e. a common learning bias is to penalize complexity

Overview
• Full Bayesian Learning
• MAP learning
• Maximum Likelihood Learning
• Learning Bayesian Networks
  • Fully observable
  • With hidden (unobservable) variables

Maximum Likelihood (ML) Learning
• Further simplification over Full Bayesian and MAP learning
  • Assume a uniform prior over the space of hypotheses
  • MAP learning (maximize P(d | h_i) P(h_i)) reduces to maximizing P(d | h_i)
• When is ML appropriate?
  • Used in statistics as the standard (non-Bayesian) statistical learning method by those who distrust the subjective nature of hypothesis priors
  • When the competing hypotheses are indeed equally likely (e.g. have the same complexity)
  • With very large datasets, for which P(d | h_i) tends to overcome the influence of P(h_i)

Overview
• Bayesian Learning
  • Full
  • MAP learning
  • Maximum Likelihood Learning
• Learning Bayesian Networks (Fully observable)

Learning BNets: Complete Data
• We will start by applying ML to the simplest type of BNet learning:
  • Known structure
  • Data containing observations for all variables
  • All variables are observable, no missing data
• The only things that we need to learn are the network’s parameters

ML learning: example
• Back to the candy example:
  • A new candy manufacturer does not provide data on the probability of different types of bags, i.e. on the fraction θ of cherry candy
  • Any θ is possible: continuum of hypotheses h_θ
  • Reasonable to assume that all θ are equally likely (we have no evidence to the contrary): uniform distribution P(h_θ)
  • θ is a parameter for this simple family of models, which we need to learn
• Simple network to represent this problem
  • Flavor represents the event of drawing a cherry vs. lime candy from the bag
  • P(F = cherry), or P(cherry) for brevity, is equivalent to the fraction θ of cherry candies in the bag
• We want to infer θ by unwrapping N candies from the bag

ML learning: example (cont’d)
• Unwrap N candies, c cherries and l = N - c lime (and return each candy to the bag after observing its flavor)
• As we saw earlier, this is described by a binomial distribution:
  P(d | h_θ) = ∏_j P(d_j | h_θ) = θ^c (1 - θ)^l
• With ML we want to find the θ that maximizes this expression, or equivalently its log likelihood L:
  L(P(d | h_θ)) = log(∏_j P(d_j | h_θ)) = log(θ^c (1 - θ)^l) = c log θ + l log(1 - θ)

ML learning: example (cont’d)
• To maximise, we differentiate L(P(d | h_θ)) with respect to θ and set the result to 0:
  ∂L/∂θ = c/θ - l/(1 - θ) = 0   =>   θ = c/(c + l) = c/N
• Doing the math shows that the estimated proportion of cherries in the bag is equal to the proportion (frequency) of cherries in the data

General ML procedure
• Express the likelihood of the data as a function of the parameters to be learned
• Take the derivative of the log likelihood with respect to each parameter
• Find the parameter value that makes the derivative equal to 0
• The last step can be computationally very expensive in real-world learning tasks
• There is one more example for you to look at in the next few slides
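
As a sanity check on this procedure for the single-parameter candy model, the sketch below maximises the log likelihood numerically over a grid and compares the result with the frequency c/N. The grid search, the function names and the counts (7 cherry, 3 lime) are assumptions made for illustration; they are not part of the slides.

```python
import math


def log_likelihood(theta, c, l):
    """L = c log(theta) + l log(1 - theta) for c cherry and l lime candies."""
    return c * math.log(theta) + l * math.log(1.0 - theta)


def ml_theta_grid(c, l, steps=1000):
    """Brute-force maximiser over theta in (0, 1); endpoints excluded so the logs are defined."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, c, l))


c, l = 7, 3                      # hypothetical counts
print(ml_theta_grid(c, l), c / (c + l))
```

The grid maximiser and the closed-form frequency agree, which is what setting the derivative to 0 predicts.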

Another example
• The manufacturer chooses the color of the wrapper probabilistically for each candy based on flavor, following an unknown distribution
  • If the flavour is cherry, it chooses a red wrapper with probability θ_1
  • If the flavour is lime, it chooses a red wrapper with probability θ_2
• The Bayesian network for this problem includes 3 parameters to be learned: θ, θ_1, θ_2

Another example (cont’d)
• P(W = green, F = cherry | h_{θ,θ_1,θ_2})   (*)
  = P(W = green | F = cherry, h_{θ,θ_1,θ_2}) P(F = cherry | h_{θ,θ_1,θ_2})
  = θ (1 - θ_1)
• We unwrap N candies
  • c are cherry and l are lime
  • r_c cherry with red wrapper, g_c cherry with green wrapper
  • r_l lime with red wrapper, g_l lime with green wrapper
  • Every trial is a combination of wrapper and candy flavor similar to event (*) above, so
• P(d | h_{θ,θ_1,θ_2}) = ∏_j P(d_j | h_{θ,θ_1,θ_2}) = θ^c (1 - θ)^l (θ_1)^{r_c} (1 - θ_1)^{g_c} (θ_2)^{r_l} (1 - θ_2)^{g_l}

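Another example (cont’d)
• I want to maximize the log of this expression:
  L = c log θ + l log(1 - θ) + r_c log θ_1 + g_c log(1 - θ_1) + r_l log θ_2 + g_l log(1 - θ_2)
• Take the derivative with respect to each of θ, θ_1, θ_2
  • The terms not containing the derivation variable disappear
  ∂L/∂θ   = c/θ - l/(1 - θ) = 0           =>   θ = c/(c + l)
  ∂L/∂θ_1 = r_c/θ_1 - g_c/(1 - θ_1) = 0   =>   θ_1 = r_c/(r_c + g_c)
  ∂L/∂θ_2 = r_l/θ_2 - g_l/(1 - θ_2) = 0   =>   θ_2 = r_l/(r_l + g_l)
• Frequencies again!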
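
Since the ML estimates are just frequencies, they can be read off a table of counts. The sketch below assumes each observation is a (flavor, wrapper) pair and that both flavors occur at least once; the function name `ml_parameters` and the sample data are hypothetical.

```python
from collections import Counter


def ml_parameters(samples):
    """Return (theta, theta_1, theta_2) from (flavor, wrapper) observations."""
    counts = Counter(samples)
    r_c, g_c = counts[("cherry", "red")], counts[("cherry", "green")]
    r_l, g_l = counts[("lime", "red")], counts[("lime", "green")]
    c, l = r_c + g_c, r_l + g_l
    theta = c / (c + l)          # P(F = cherry)
    theta_1 = r_c / (r_c + g_c)  # P(W = red | F = cherry)
    theta_2 = r_l / (r_l + g_l)  # P(W = red | F = lime)
    return theta, theta_1, theta_2


# Hypothetical data: 3 red + 1 green cherry, 2 red + 4 green lime
data = ([("cherry", "red")] * 3 + [("cherry", "green")] +
        [("lime", "red")] * 2 + [("lime", "green")] * 4)
print(ml_parameters(data))
```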

ML parameter learning in Bayesian nets

Very Popular Application
• Naïve Bayes models: very simple Bayesian networks for classification
  • Class variable (to be predicted) is the root node
  • Attribute variables X_i (observations) are the leaves
• Naïve because it assumes that the attributes are conditionally independent of each other given the class
• Deterministic prediction can be obtained by picking the most likely class
• Scales up really well: with n Boolean attributes (and a Boolean class) we just need 2n+1 parameters
[Figure: network with root node C and leaf nodes X_1, X_2, ..., X_i]
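
A minimal sketch of prediction in a Naïve Bayes model with a Boolean class and n Boolean attributes. The 2n+1 parameters are P(C = true) plus, for each attribute, P(X_i = true | C = true) and P(X_i = true | C = false). The function name and the numbers in the usage example are made up for illustration.

```python
def naive_bayes_posterior(p_class, p_attr_given_true, p_attr_given_false, observed):
    """P(C = true | observed attributes), assuming conditional independence given C."""
    score_true, score_false = p_class, 1.0 - p_class
    for p_t, p_f, x in zip(p_attr_given_true, p_attr_given_false, observed):
        score_true *= p_t if x else 1.0 - p_t
        score_false *= p_f if x else 1.0 - p_f
    return score_true / (score_true + score_false)


# Hypothetical parameters for n = 2 attributes (2n + 1 = 5 numbers in total)
p = naive_bayes_posterior(0.5, [0.9, 0.8], [0.2, 0.3], [True, True])
print(round(p, 3))
```

A deterministic prediction is obtained by reporting class true when the returned probability is above 0.5.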

Example
• Naïve Bayes classifier for the newsgroup reading example

Problem with ML parameter learning
• With small datasets, some of the frequencies may be 0 just because we have not observed the relevant data
• This generates very strong incorrect predictions
  • Simple common fix: initialize the count of every relevant event to 1 before counting the observations
  • For more sophisticated strategies see textbook (p. 296)
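
The "start every count at 1" fix can be sketched as follows. It assumes a variable with `num_values` possible outcomes, each of which receives one initial pseudo-count; the function name and the example counts are illustrative.

```python
def smoothed_estimate(count_event, count_total, num_values=2):
    """Frequency estimate after initialising the count of every outcome to 1."""
    return (count_event + 1) / (count_total + num_values)


# Hypothetical data: 3 candies unwrapped, none of them cherry
print(0 / 3)                     # plain ML estimate: 0.0
print(smoothed_estimate(0, 3))   # smoothed estimate: 0.2
```

The smoothed estimate never reaches exactly 0 or 1, so an unseen event is treated as unlikely rather than impossible.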

Probability from Experts
• As we mentioned in previous lectures, an alternative to learning probabilities from data is to get them from experts
• Problems
  • Experts may be reluctant to commit to specific probabilities that cannot be refined
  • How to represent the confidence in a given estimate
  • Getting the experts and their time in the first place
• One promising approach is to leverage both sources when they are available
  • Get initial estimates from experts
  • Refine them with data

Combining Experts and Data
• Get the expert to express her belief on event A as the pair ⟨n, m⟩, i.e. how many observations n of A she has seen (or expects to see) in m trials
• Combine the pair with actual data
  • If A is observed, increment both n and m
  • If ¬A is observed, increment m alone
• The absolute values in the pair can be used to express the expert’s level of confidence in her estimate
  • Small values represent low confidence, as they are quickly dominated by data
  • The larger the values, the higher the confidence, as it takes more and more data to dominate the initial estimate
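
The update rule for the expert's pair can be sketched in Python. The two starting pairs below (a low-confidence ⟨1, 2⟩ and a high-confidence ⟨100, 200⟩) and the observation sequence are hypothetical values chosen only to show the effect; they do not come from the slides.

```python
def update_belief(n, m, observations):
    """Update the pair (n, m) with a sequence of booleans (True = A observed)."""
    for a_observed in observations:
        if a_observed:
            n += 1      # A observed: increment n ...
        m += 1          # ... and m is incremented on every trial
    return n, m


data = [True] * 9 + [False]          # hypothetical: A observed 9 times out of 10

for start in [(1, 2), (100, 200)]:   # both pairs encode an initial estimate of 0.5
    n, m = update_belief(*start, data)
    print(start, "->", (n, m), round(n / m, 3))
```

Both experts start at 0.5, but the small pair moves to about 0.83 after ten observations while the large pair barely moves (about 0.52).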

Overview
• Bayesian Learning
  • Full
  • MAP learning
  • Maximum Likelihood Learning
• Learning Bayesian Networks (Fully observable)
• Regression and Logistic Regression

One more set of techniques for supervised learning
• Naïve Bayes and Decision Trees (in their basic incarnation) provide classification in terms of discrete labels y:
  y = f(attr_1, attr_2, ..., attr_n)
• We will briefly see two techniques that go beyond this
  • Regression: y is continuous
  • Logistic Regression: y is continuous but reduced to a binary classification

Linear Regression
• h_w(x) = w_1 x + w_0
• Problem of fitting a linear function to a set of training examples: input/output pairs with numeric values

Regression
• Linear regression: problem of fitting a linear function h_w(x) to a set of training examples (x_j, y_j): input/output pairs with numeric values
  y = m x + b, written here as h_w(x) = w_1 x + w_0
• Find the best values for the parameters, i.e. those that “maximize goodness of fit” or “minimize loss”
• Assuming normally distributed noise on the y_j, the most probable values of the parameters are then found by minimizing the squared-error loss:
  Loss(h_w) = Σ_j (y_j - h_w(x_j))^2

Regression: Minimizing Loss
• Choose weights to minimize the sum of squared errors

Regression: Minimizing Loss
• y = w_1 x + w_0
• Loss(h_w) = Σ_j (y_j - h_w(x_j))^2
• Algebra gives an exact solution to the minimization problem
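
The exact solution is not spelled out in the transcript. The sketch below uses the standard least-squares formulas for one input variable, obtained by setting the derivatives of the loss with respect to w_0 and w_1 to 0. The function name `fit_line` and the sample points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w1 * x + w0; returns (w0, w1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1


# Hypothetical noise-free points on the line y = 2x + 1
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```

The formula requires at least two distinct x values, otherwise the denominator is 0.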

Multivariate Regression
• Suppose we have features X_1, ..., X_n. A linear function of these features is a function of the form
  f_w(X_1, ..., X_n) = w_0 + w_1 × X_1 + ... + w_n × X_n
• We are given a set E of examples, where each example e ∈ E has a value x_i for each feature X_i and an observed value o_e
• The predicted value is thus
  h_w^e = w_0 + w_1 × x_1 + ... + w_n × x_n = ∑_{i=0}^{n} w_i × x_i, where x_0 is defined to be 1
• The sum-of-squares error on examples E for target Y is
  Error_E(w) = ∑_{e ∈ E} (o_e - h_w^e)^2
• In this linear case, as for the univariate version, the weights that minimize the error can be computed analytically, by equating the derivatives with respect to each w_i to 0
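
For reference, the condition "derivative with respect to each w_i equal to 0" can be written out as below. The superscript notation x_i^{(e)} for the value of feature X_i in example e is introduced here only for clarity and is not used in the slides.

```latex
\frac{\partial\, \mathrm{Error}_E(w)}{\partial w_i}
  = -2 \sum_{e \in E} \left( o_e - h_w^e \right) x_i^{(e)} = 0,
  \qquad i = 0, \dots, n
```

Because h_w^e is linear in the weights, this is a system of n+1 linear equations in the n+1 unknowns w_0, ..., w_n.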

Squashed linear functions for classification
• We can use a linear function for discrete classification
• Let’s consider a binary classification task, where the domain of the target variable is {0, 1}
• For classification, use a squashed linear function of the form
  f_w(X_1, ..., X_n) = f(w_0 + w_1 × X_1 + ... + w_n × X_n)
  where f is an activation function, from real numbers into [0, 1]

Activation Functions
• Commonly used ones are applied to the linear combination: f(w_0 + w_1*X_1 + w_2*X_2 + ... + w_k*X_k)
• Examples:
  f(in) = 1 when in > t
        = 0 otherwise

Step Function
• A step function represents a linear classifier with hard threshold
  • It implements a linear decision boundary (aka linear separator) that separates the classes involved in the classification
• A step function was the basis for the perceptron, one of the first methods for supervised classification and the basis for Neural Networks
• We can’t use the derivative to compute the weights that minimize loss, because the step function is not differentiable at the threshold (and has zero slope everywhere else)
[Figure: decision boundary]

Expressiveness of Step Function
• Step function can only represent linearly separable functions
[Figure: Boolean functions plotted as TRUE/FALSE points, with panels labelled linearly separable or non-linearly separable; Majority(I_1, I_2, I_3) is among the functions shown]
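
A small Python experiment illustrating this limitation for two Boolean inputs. It searches a coarse, arbitrarily chosen grid of weights for a step unit (threshold folded into the bias w_0) that reproduces a target function. Finding weights on the grid proves representability; failing to find any is only consistent with non-separability, not a proof. All names are illustrative.

```python
from itertools import product


def step_unit(weights, inputs):
    """Output 1 when w0 + w1*x1 + w2*x2 > 0, else 0 (weights[0] is the bias w0)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else 0


def find_weights(target, grid):
    """Return the first (w0, w1, w2) on the grid that reproduces target, else None."""
    cases = list(product([0, 1], repeat=2))
    for weights in product(grid, repeat=3):
        if all(step_unit(weights, xs) == target(*xs) for xs in cases):
            return weights
    return None


GRID = [k / 2 for k in range(-8, 9)]     # -4.0, -3.5, ..., 4.0

and_weights = find_weights(lambda a, b: a & b, GRID)
xor_weights = find_weights(lambda a, b: a ^ b, GRID)
print(and_weights is not None, xor_weights)
```

AND is found on the grid, while no weight setting reproduces XOR, matching the fact that XOR is not linearly separable.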

Logistic Regression
• Classification that uses a sigmoid function, f(x) = 1/(1 + e^(-x)), as activation function
• Learning with logistic regression: we still want to minimize the sum of squared errors over the training set of examples E
  Err_E(w) = ∑_{e ∈ E} (o_e - h_w^e)^2 = ∑_{e ∈ E} (o_e - f(∑_i w_i × x_i))^2
• Because f(x) is a non-linear function, it is hard to find the solution analytically
  • Iterative computation of the weights using Gradient Descent

Gradient Descent Search
• Alter each weight by an amount proportional to the slope in that direction
• Repeat until the termination condition is met (error is “small enough”)
• Look at the partial derivative of the error term on each example e with respect to each weight w_j:
  w_j ← w_j - η × ∂(o_e - h_w^e)^2/∂w_j
  • η is a constant called the learning rate that determines how fast the algorithm moves toward the minimum of the error function

Gradient Descent Search
• Each set of weights defines a point on the error surface
• Search toward the minimum of the surface that describes the sum-of-squares errors as a function of all the weights in the logistic regression
  • Given a point on the surface, look at the slope of the surface along the axis formed by each weight
  • This slope is the partial derivative of the surface Err with respect to each weight w_j

Weight Update for Logistic Regression
• The predicted value for example e is h_w^e = f(∑_i w_i × x_i)
• The actual observed value is o_e
• Partial derivative with respect to weight w_j:
  ∂(o_e - h_w^e)^2/∂w_j = -2 × (o_e - h_w^e) × f'(∑_i w_i × x_i) × x_j

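Weight Update for Logistic Regression
• The sigmoid activation function is used because its derivative is easy to compute analytically:
  f'(x) = f(x) (1 - f(x))
• From this result and the partial derivative of the error (obtained with the chain rule), each weight w_j in gradient descent for Logistic Regression is updated as follows:
  w_j ← w_j + η × (o_e - h_w^e) × h_w^e × (1 - h_w^e) × x_j
  where η is the learning rate (the constant factor 2 is absorbed into η)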
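
The update rule can be sketched as a short training loop. This is an illustration under stated assumptions: per-example updates, weights initialised to 0, a learning rate of 0.5, a fixed number of epochs instead of a "small enough error" test, and the Boolean OR function as hypothetical training data.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(w, xs):
    """h_w = f(w0 + sum_i w_i * x_i), with x_0 = 1 for the bias weight."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], xs)))


def train(examples, eta=0.5, epochs=5000):
    """Gradient descent on squared error with a sigmoid activation."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        for xs, o in examples:
            p = predict(w, xs)
            delta = o - p
            x = [1.0] + list(xs)
            for j in range(len(w)):
                # w_j <- w_j + eta * (o - p) * p * (1 - p) * x_j
                w[j] += eta * delta * p * (1.0 - p) * x[j]
    return w


OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train(OR_EXAMPLES)
print([round(predict(weights, xs), 2) for xs, _ in OR_EXAMPLES])
```

Thresholding the outputs at 0.5 recovers the OR labels; the same loop run on XOR data would not converge to a correct classifier, for the expressiveness reasons discussed for the step function.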

Expressiveness of Logistic Regression
• The sigmoid activation function has similar limitations in expressiveness as the step function
[Figure: output of a two-input logistic regression]
• Adjusting the weights changes the orientation, location and steepness of the cliff, but it still can’t learn functions like XOR

Learning Goals for Bayesian learning and regression-based learning
• For each of Bayesian learning, MAP and ML learning
  • Define how it works
  • Apply it to compute P(h | data) and P(X | data)
• Explain how parameters can be learned from data in fully observable Bayesian networks
  • Compute the parameters for a specific network
• Explain how parameters can be learned in partially observable Bayesian networks (see slides from next class)