Machine Learning: Bayes Learning (Bai Xiao)

Bayes Learning. Outline: what Bayes learning is (with an example); Bayes rule; Bayes learning and concept learning; maximum likelihood and minimum squared error; the Bayes optimal classifier; Naïve Bayes; Bayesian belief networks; conclusion.

Bayesian Learning. Provides practical learning algorithms (Naïve Bayes learning, Bayesian belief network learning) and combines prior knowledge (prior probabilities) with observed data. It also provides foundations for machine learning: evaluating learning algorithms, guiding the design of new algorithms, and learning from models (meta-learning).

Bayesian Classification: Why? Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data. Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Basic Formulas for Probabilities. Product rule: the probability P(A ∧ B) of a conjunction of two events A and B is P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A). Sum rule: the probability of a disjunction of two events A and B is P(A ∨ B) = P(A) + P(B) - P(A ∧ B). Theorem of total probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai)P(Ai).

Basic Approach. Bayes rule: P(h|D) = P(D|h)P(h) / P(D), where P(h) is the prior probability of hypothesis h, P(D) is the prior probability of the training data D, P(h|D) is the probability of h given D (the posterior), and P(D|h) is the probability of D given h (the likelihood of D given h). The goal of Bayesian learning: find the most probable hypothesis given the training data (the maximum a posteriori, MAP, hypothesis).

Bayes Rule. P(h) = prior probability of hypothesis h. P(D) = prior probability of the training data D. P(D|h) = probability of observing data D given that hypothesis h holds. P(h|D) = probability that h holds given the training data D, the posterior probability of h.

Bayes Rule: how to use it? Use Bayes rule to choose the hypothesis h from the hypothesis space H that is most probable given the training data D (maximum a posteriori, MAP): hMAP = argmax_h P(h|D) = argmax_h P(D|h)P(h). We drop P(D) because it does not depend on h.

Bayes Rule: Maximum Likelihood. In many cases all hypotheses have the same prior probability, P(hi) = P(hj), so we only need to maximize P(D|h), the likelihood of the data D given h. The hypothesis that maximizes P(D|h) is the maximum likelihood hypothesis, written hML: hML = argmax_h P(D|h).

Bayes Learning Example: what is Bayes learning? Does a patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer. Can probability and statistics help us solve this problem? P(cancer) = 0.008, P(no cancer) = 0.992, P(positive|cancer) = 0.98, P(negative|cancer) = 0.02, P(positive|no cancer) = 0.03, P(negative|no cancer) = 0.97.

Bayes Rule Example: use MAP to solve the patient problem. Does the patient have cancer or not, given the positive test result? P(cancer) = P(h1) = 0.008, P(no cancer) = P(h2) = 0.992, P(positive|cancer) = P(D|h1) = 0.98, P(negative|cancer) = 0.02, P(positive|no cancer) = P(D|h2) = 0.03, P(negative|no cancer) = 0.97. Then P(D|h1)P(h1) = P(positive|cancer)P(cancer) = 0.98 × 0.008 ≈ 0.0078 and P(D|h2)P(h2) = P(positive|no cancer)P(no cancer) = 0.03 × 0.992 ≈ 0.0298, so hMAP = h2 = no cancer.
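
As a quick sanity check of this computation, here is a minimal Python sketch of the MAP decision using the numbers from the slide (variable names are illustrative only):

```python
# MAP hypothesis for the cancer test example (numbers from the slide).
priors = {"cancer": 0.008, "no_cancer": 0.992}
likelihood_positive = {"cancer": 0.98, "no_cancer": 0.03}  # P(positive | h)

# Unnormalized posteriors P(D|h) * P(h); P(D) is omitted since it does not depend on h.
scores = {h: likelihood_positive[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

print(scores)   # {'cancer': 0.00784, 'no_cancer': 0.02976}
print(h_map)    # 'no_cancer'
```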

Bayes Learning: Minimum Risk. Still the patient problem, but we make the situation more complex by bringing the idea of "risk" into it: the cost of making different wrong decisions is different. Consider two mistakes: "the patient is healthy, but the system says he has cancer" and "the patient has cancer, but the system says he is healthy". The risk of the latter is far more serious. Solution: we make decisions not only using MAP (maximum a posteriori) but also considering risk, combining the two.

Bayes Learning: Minimum Risk. To quantify risk we use a risk (loss) function: for N hypotheses h1, ..., hn, λij is the loss incurred when we choose hypothesis i but the true hypothesis is j.

Bayes Learning: Minimum Risk. Using Bayes rule we obtain the posterior probabilities P(hj|x), j = 1, 2, ..., c. For a decision ai from the decision space, the loss λ(ai|hj) can take c different values, one for each possible true hypothesis hj, each weighted by the corresponding posterior probability. So the expected loss (conditional risk) of taking decision ai is R(ai|x) = Σ_{j=1..c} λ(ai|hj) P(hj|x).

Bayes Learning: Minimum Risk. To use the minimum-risk Bayes decision rule, we take the decision with the minimum expected loss: choose ak such that R(ak|x) = min_i R(ai|x). Procedure: (1) compute the posterior probabilities P(hj|x); (2) using the risk function (the risk table, in our case), compute the expected loss R(ai|x) of each decision ai; (3) take the decision with the minimum expected loss.

Bayes Learning: Minimum Risk. Again, the patient problem, now with a risk table λij for the two decisions.

Bayes Learning: Minimum Risk. Without considering risk, hMAP = h2 = no cancer. After introducing the risk function and computing the expected loss of each decision, we should take decision a2: no cancer.
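
A minimal Python sketch of the minimum-risk decision follows. The posterior values are derived from the slide's numbers, but the loss table λ is not reproduced in the transcript, so the values below are illustrative assumptions; the resulting decision depends on them.

```python
# Minimum-risk Bayes decision: choose the action with the lowest expected loss.
# Posteriors follow from the slide: P(cancer|positive) = 0.0078 / (0.0078 + 0.0298).
posterior = {"cancer": 0.208, "no_cancer": 0.792}

# loss[a][h]: loss of taking action a when the true state is h (illustrative values,
# with a false negative penalized more heavily than a false positive).
loss = {
    "say_cancer":    {"cancer": 0.0, "no_cancer": 1.0},
    "say_no_cancer": {"cancer": 3.0, "no_cancer": 0.0},
}

def expected_loss(action):
    return sum(loss[action][h] * posterior[h] for h in posterior)

risks = {a: expected_loss(a) for a in loss}
print(risks)                       # {'say_cancer': 0.792, 'say_no_cancer': 0.624}
print(min(risks, key=risks.get))   # 'say_no_cancer' with these illustrative losses
```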

Bayes Learning and Concept Learning. Concept learning, basic idea: find a hypothesis from the hypothesis space H that is consistent with the training data D. Under some constraints, concept learning algorithms such as Find-S and the List-Then-Eliminate (version space) algorithm also output a MAP (maximum a posteriori) hypothesis.

Bayes Learning and Concept Learning. The three constraints: (1) there is no noise in the training data D; (2) the target concept c is contained in the hypothesis space H; (3) each hypothesis has the same prior probability. Brute-Force MAP algorithm: for each h in H, compute its posterior probability P(h|D); output the hypothesis with the highest posterior probability.

Bayes Learning and Concept Learning. For each h in H, the likelihood is P(D|h) = 1 if h is consistent with D and P(D|h) = 0 if h is inconsistent with D. Therefore, if h is inconsistent with D, P(h|D) = 0; if h is consistent with D, P(h|D) = (1 × 1/|H|) / P(D) = 1/|VS_{H,D}|.

Bayes Learning and Concept Learning. Here P(D) = Σ_h P(D|h)P(h) = |VS_{H,D}| / |H|, where VS_{H,D} is the version space, the subset of H consistent with D. So P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise.

Bayes Learning and Concept Learning. At the beginning, all hypotheses have the same probability. As training data gradually arrive, the probability of inconsistent hypotheses drops to 0, while the total probability still sums to 1: it is redistributed uniformly over the remaining consistent hypotheses.
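
To make the brute-force MAP idea concrete, here is a small Python sketch over a toy hypothesis space of threshold classifiers (the hypothesis space and data are invented for illustration); consistent hypotheses end up sharing the posterior mass 1/|VS|:

```python
# Brute-force MAP over a tiny hypothesis space: P(h) is uniform,
# P(D|h) = 1 if h is consistent with D, else 0.
data = [(1.0, False), (2.0, False), (3.0, True), (4.0, True)]   # (x, label)
hypotheses = [0.5, 1.5, 2.5, 3.5, 4.5]                          # h(x) = (x > threshold)

def consistent(threshold, examples):
    return all((x > threshold) == label for x, label in examples)

prior = 1.0 / len(hypotheses)
unnormalized = {h: (prior if consistent(h, data) else 0.0) for h in hypotheses}
p_data = sum(unnormalized.values())                             # equals |VS| / |H|
posterior = {h: (u / p_data if p_data else 0.0) for h, u in unnormalized.items()}
print(posterior)   # only threshold 2.5 is consistent here, so it gets posterior 1.0
```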

Maximum Likelihood and Minimum Squared Error. In this part we are still doing Bayes learning, but we must tackle two problems: first, the training data are not discrete but continuous; second, there is noise (error) in the training data. This setting is more practical and useful. A hypothesis whose outputs minimize the squared error on the training data is the maximum likelihood (ML) hypothesis.

Maximum Likelihood and Minimum Squared Error. Problem: m training examples <xi, di>, where di = f(xi) + ei and ei is Gaussian noise. Each hypothesis h is a function h: X → R, and learning is to find the target function f from H; the target function is the maximum likelihood hypothesis. In concept learning, h had to be consistent with D (no noise); here the solution is to minimize the squared error.

Maximum Likelihood and Minimum Squared Error. Example: for training examples <xi, di> where di = f(xi) + ei, the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors: hML = argmin_h Σ_{i=1..m} (di - h(xi))².

Maximum Likelihood and Minimum Squared Error. Why does minimum squared error equal maximum likelihood in this case? All training instances are independent and di = f(xi) + ei, so hML = argmax_h p(D|h) = argmax_h Π_{i=1..m} p(di|h). The noise is Gaussian with mean 0, so each di is Gaussian with mean f(xi) = h(xi), giving hML = argmax_h Π_{i=1..m} (1/√(2πσ²)) exp(-(di - h(xi))² / (2σ²)).

Maximum Likelihood and Minimum Squared Error. Take the logarithm (log is a monotonic function): hML = argmax_h Σ_{i=1..m} [-½ ln(2πσ²) - (di - h(xi))² / (2σ²)]. The first term is a constant, so hML = argmax_h Σ_{i=1..m} -(di - h(xi))², which also equals hML = argmin_h Σ_{i=1..m} (di - h(xi))².
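
A minimal sketch of this result in Python, assuming a linear hypothesis class purely for illustration: minimizing the sum of squared errors (here via numpy's least-squares solver) recovers the ML parameters under Gaussian noise.

```python
# Under the Gaussian-noise assumption, the ML hypothesis is the least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)   # d_i = f(x_i) + e_i

# Minimizing sum_i (d_i - h(x_i))^2 over linear hypotheses h(x) = w*x + b:
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w, b)   # close to the true parameters 2.0 and 1.0
```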

Bayes Optimal Classifier. Until now, every problem has been "given the training data, what is the most probable hypothesis?". But we are also interested in "given a new instance, what is its most probable classification?". Note that hMAP(x) is not necessarily the most probable classification. The Bayes optimal classifier combines all hypotheses: the most probable classification is argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D).

Bayes Optimal Classifier. Example: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3. Given a new instance x, h1(x) = +, h2(x) = -, h3(x) = -. What is the most probable classification of x? Using the Bayes optimal classification with P(-|h1) = 0, P(+|h1) = 1, P(-|h2) = 1, P(+|h2) = 0, P(-|h3) = 1, P(+|h3) = 0, we get Σi P(+|hi)P(hi|D) = 0.4 and Σi P(-|hi)P(hi|D) = 0.6. Therefore the most probable classification is -, even though hMAP = h1 predicts +.
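
The same calculation as a short Python sketch, using the posteriors and predictions from the slide:

```python
# Bayes optimal classification: weight each hypothesis's vote by its posterior P(h|D).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # from the slide
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # h_i(x) for the new instance x

votes = {}
for h, p in posteriors.items():
    label = predictions[h]
    votes[label] = votes.get(label, 0.0) + p        # sum_i P(v|h_i) P(h_i|D)

print(votes)                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))  # '-' : differs from h_MAP(x) = h1(x) = '+'
```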

Naïve Bayes. The Naïve Bayes classifier classifies a new instance: we are again interested in "given a new instance described by attribute values, what is its classification?". Assume a target function f: X → V, where each instance x is described by attributes <a1, a2, ..., an>. The most probable value of f(x) is vMAP = argmax_{vj∈V} P(vj | a1, ..., an) = argmax_{vj∈V} P(a1, ..., an | vj) P(vj).

Naïve Bayes. The Naïve Bayes assumption is that the attributes are conditionally independent given the target value: P(a1, ..., an | vj) = Πi P(ai | vj). The Naïve Bayes classifier is therefore vNB = argmax_{vj∈V} P(vj) Πi P(ai | vj). So we need to know P(vj) and P(ai|vj), which we estimate from the training data by counting frequencies.

Naïve Bayes Example. New instance <sunny, cool, high, strong>: is PlayTennis yes or no?

Naïve Bayes. Estimate P(vj) and P(ai|vj): P(PlayTennis = yes) = 9/14 = 0.64, P(PlayTennis = no) = 5/14 = 0.36, P(Wind = strong|PlayTennis = yes) = 3/9 = 0.33, P(Wind = strong|PlayTennis = no) = 3/5 = 0.60, and so on. Compute vNB: P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = 0.0053 and P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = 0.0206, so PlayTennis = no.
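
A compact Python sketch of this Naive Bayes computation on the PlayTennis data (frequency counting only, no smoothing; the data layout is taken from the table on the later slide):

```python
# Naive Bayes by frequency counting for the PlayTennis data
# (attribute order: Outlook, Temperature, Humidity, Wind).
from collections import Counter, defaultdict

data = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in data)
attr_counts = defaultdict(Counter)          # (attribute index, class) -> Counter of values
for row in data:
    for i, value in enumerate(row[:-1]):
        attr_counts[(i, row[-1])][value] += 1

def score(instance, label):
    p = class_counts[label] / len(data)                               # P(v_j)
    for i, value in enumerate(instance):
        p *= attr_counts[(i, label)][value] / class_counts[label]     # P(a_i | v_j)
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(x, label) for label in class_counts}
print(scores)                       # approximately {'No': 0.0206, 'Yes': 0.0053}
print(max(scores, key=scores.get))  # 'No'
```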

Naive Bayesian Classifier (II) Given a training set, we can compute the probabilities

Play-tennis example: estimating P(xi|C).
Outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5.
Temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5.
Humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 1/5.
Windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5.
Class priors: P(p) = 9/14, P(n) = 5/14.

Example: Naïve Bayes. Predict playing tennis on a day with the conditions <sunny, cool, high, strong> (i.e. P(v | o=sunny, t=cool, h=high, w=strong)) using the following training data:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

We have: P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053 and P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206, so the prediction is PlayTennis = No.

The independence hypothesis... makes computation possible, yields optimal classifiers when satisfied, but is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; and decision trees, which reason on one attribute at a time, considering the most important attributes first.

Naïve Bayes Algorithm. Naive_Bayes_Learn(examples): for each target value vj, estimate P(vj); for each attribute value ai of each attribute a, estimate P(ai|vj). Classify_New_Instance(x): vNB = argmax_{vj∈V} P(vj) Π_{ai∈x} P(ai|vj). A typical (m-)estimate of P(ai|vj) is P(ai|vj) = (nc + m·p) / (n + m), where n is the number of training examples with v = vj, nc is the number of those with a = ai, p is a prior estimate for P(ai|vj), and m is the weight given to the prior.
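
The m-estimate can be written as a one-line helper; the function name and arguments below are illustrative, not from the slides:

```python
# m-estimate of P(a_i | v_j), as described above.
def m_estimate(n_c, n, p, m):
    """n_c: examples with class v_j and attribute value a_i
       n:   examples with class v_j
       p:   prior estimate of P(a_i | v_j), e.g. 1/k for k attribute values
       m:   equivalent sample size (weight given to the prior)"""
    return (n_c + m * p) / (n + m)

# Example: P(Wind=strong | PlayTennis=yes) with a uniform prior over 2 wind values and m = 2:
print(m_estimate(n_c=3, n=9, p=0.5, m=2))   # (3 + 1) / 11 ≈ 0.364
```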

Bayesian Belief Network. Naïve Bayes assumes conditional independence of the attributes, so P(a1, ..., an | vj) = Πi P(ai|vj). But that is too restrictive, because a1, a2, ..., an are not always independent of each other, especially in realistic problems. A Bayesian belief network describes conditional independence among subsets of variables, and allows combining prior knowledge about (in)dependencies among variables with observed data.

Bayesian Belief Network: Conditional Independence. Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if P(X | Y, Z) = P(X | Z) for all values of X, Y and Z. Example: Thunder is conditionally independent of Rain, given Lightning: P(Thunder | Rain, Lightning) = P(Thunder | Lightning). In Naïve Bayes, P(a1, a2 | v) = P(a1 | a2, v) P(a2 | v) = P(a1 | v) P(a2 | v).

Bayesian Belief Network. A Bayesian belief network represents a set of conditional independence assertions: each node is conditionally independent of its non-descendants, given its immediate predecessors (parents). The network is a directed acyclic graph.

Bayesian Belief Network. The network has two parts: the arcs represent the (in)dependence relationships between variables, and each node (variable) has a conditional probability table giving its distribution given its parents.

Bayesian Belief Network. In general, the joint probability factorizes as P(y1, ..., yn) = Πi P(yi | Parents(Yi)), where Parents(Yi) denotes the immediate predecessors of Yi in the graph.
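
A minimal sketch of this factorization in Python, using a toy two-node network (Lightning → Thunder) assumed purely for illustration; the joint probability is the product of the local conditional probabilities:

```python
# Joint probability from a Bayesian network: product of P(y_i | Parents(Y_i)).
p_lightning = {True: 0.02, False: 0.98}
p_thunder_given_lightning = {True: {True: 0.9, False: 0.1},
                             False: {True: 0.01, False: 0.99}}   # P(thunder | lightning)

def joint(lightning, thunder):
    return p_lightning[lightning] * p_thunder_given_lightning[lightning][thunder]

print(joint(True, True))    # 0.02 * 0.9 = 0.018
print(sum(joint(l, t) for l in (True, False) for t in (True, False)))  # sums to 1.0
```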

Bayesian Belief Network. Inference in a Bayesian network: how can one infer the (probabilities of) values of one or more network variables, given observed values of others? The Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer; in the general case the problem is NP-hard. Learning of Bayesian networks: the network structure might be known or unknown, and the training examples might provide values of all or only some network variables. If the structure is known and all variables are observed, learning is easy; otherwise, see Russell et al. (1995).

Maximum Likelihood Estimation. MLE principle: we try to learn the parameters that maximize the likelihood function. It is one of the most commonly used estimators in statistics and is intuitively appealing.

What is a Bayesian Network ? A graphical model that efficiently encodes the joint probability distribution for a large set of variables

Definition. A Bayesian network for a set of variables X = {X1, ..., Xn} contains: a network structure S encoding conditional independence assertions about X, and a set P of local probability distributions. The network structure S is a directed acyclic graph whose nodes are in one-to-one correspondence with the variables in X; the lack of an arc denotes a conditional independence.

Some conventions: variables are depicted as nodes, arcs represent probabilistic dependence between variables, and conditional probabilities encode the strength of the dependencies.

An example: detecting credit-card fraud, with a network over the variables Fraud, Age, Sex, Gas, and Jewelry.

Tasks: correctly identify the goals of modeling; identify the many possible observations that may be relevant to the problem; determine which subset of those observations is worthwhile to model; organize the observations into variables having mutually exclusive and collectively exhaustive states; and finally build a directed acyclic graph that encodes the assertions of conditional independence.

A technique for constructing a Bayesian network. The approach is based on the following observations: people can often readily assert causal relationships among the variables, and causal relations typically correspond to assertions of conditional dependence. To construct a Bayesian network we simply draw arcs, for a given set of variables, from the cause variables to their immediate effects. In the final step we determine the local probability distributions.

Problems: the steps are often intermingled in practice; judgments of conditional independence and/or cause and effect can influence problem formulation; and assessments of probability may lead to changes in the network structure.

Bayesian inference. Once a Bayesian network is constructed, we need to determine the various probabilities of interest from the model: given observed data x1, ..., x[m], we query the probability of x[m+1]. Computation of a probability of interest given a model is probabilistic inference.

Learning Probabilities in a Bayesian Network. Problem: using data to update the probabilities of a given network structure. Thumbtack problem: we do not learn "the" probability of heads; we update the posterior distribution of the variable that represents the physical probability of heads. The problem restated: given a random sample D, compute the posterior probability.

Assumptions to compute the posterior probability: there is no missing data in the random sample D, and the parameters are independent.

But what if data are missing? How do we proceed?

Obvious concerns: why was the data missing? There may be missing values or hidden variables. Is the absence of an observation dependent on the actual states of the variables? Here we deal with missing data that are independent of state.

Incomplete data (contd). For any interesting set of local likelihoods and priors, the exact computation of the posterior distribution is intractable, so we require approximations for incomplete data.

The various methods of approximation for incomplete data: Monte Carlo sampling methods, the Gaussian approximation, MAP and ML approximations, and the EM algorithm.

Gibbs Sampling. The steps involved: Start: choose an initial state for each of the variables in X at random. Iterate: unassign the current state of X1, compute the probability of each of its states given the states of the other n-1 variables, and resample it; repeat this procedure for every variable in X, creating a new sample of X. After a "burn-in" phase, the configurations of X are sampled with probability p(x).
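
A minimal Gibbs sampling sketch in Python for two binary variables; the joint distribution below is an illustrative assumption, and in a Bayesian network the conditionals would instead come from the local probability tables:

```python
# Gibbs sampling for two binary variables with a known joint p(x1, x2).
import random

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_conditional(index, other_value):
    # Sample p(x_index | x_other = other_value), obtained by normalizing the joint.
    if index == 0:
        weights = [joint[(v, other_value)] for v in (0, 1)]
    else:
        weights = [joint[(other_value, v)] for v in (0, 1)]
    return random.choices((0, 1), weights=weights)[0]

random.seed(0)
state = [random.randint(0, 1), random.randint(0, 1)]     # random initial state
samples = []
for step in range(20000):
    state[0] = sample_conditional(0, state[1])            # resample x1 given x2
    state[1] = sample_conditional(1, state[0])            # resample x2 given x1
    if step >= 2000:                                       # discard burn-in samples
        samples.append(tuple(state))

est = {k: samples.count(k) / len(samples) for k in joint}
print(est)   # approaches the joint probabilities 0.3, 0.2, 0.1, 0.4
```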

Problem with the Monte Carlo method: it becomes intractable when the sample size is large. Gaussian approximation idea: for large amounts of data, the posterior distribution can often be approximated by a multivariate Gaussian distribution.

Criteria for Model Selection Some criterion must be used to determine the degree to which a network structure fits the prior knowledge and data Some such criteria include Relative posterior probability Local criteria

Relative posterior probability. A criterion for model selection is the logarithm of the relative posterior probability: log p(D, Sh) = log p(Sh) + log p(D | Sh), i.e. the log prior plus the log marginal likelihood.

Local criteria. An example: a Bayesian network structure for medical diagnosis, with an Ailment node whose children are Finding 1, Finding 2, ..., Finding n.

To compute the relative posterior probability we assess the priors: structure priors p(Sh) and parameter priors p(θs | Sh).

Priors on network parameters Key concepts : Independence Equivalence Distribution Equivalence

Illustration of independence equivalence. Independence assertion: X and Z are conditionally independent given Y. The structures X → Y → Z, X ← Y ← Z, and X ← Y → Z all encode this same assertion.

Priors on structures. Various methods: assume that every hypothesis is equally likely (usually for convenience); order the variables and treat the presence or absence of arcs as mutually independent; use prior networks; or use imaginary data from domain experts.

Benefits of learning structures: efficient learning (more accurate models with less data; compare estimating P(A) and P(B) versus P(A,B): the former requires less data); discovering structural properties of the domain; helping to order events that occur sequentially, and aiding sensitivity analysis and inference; predicting the effects of actions.

Search Methods. Problem: find the best network from the set of all networks in which each node has no more than k parents. Search techniques: greedy search, greedy search with restarts, best-first search, Monte Carlo methods.

Bayesian Networks for Supervised and Unsupervised Learning. Supervised learning: a natural representation in which to encode prior knowledge. Unsupervised learning: apply the learning technique to select a model with no hidden variables; look for sets of mutually dependent variables in the model; create a new model with a hidden variable; score the new models, possibly finding one better than the original.

What is all this good for? Implementations in real life: Microsoft products (Microsoft Office), medical applications and biostatistics (BUGS), the NASA AutoClass project for data analysis, collaborative filtering (Microsoft MSBN), fraud detection (AT&T), and speech recognition (UC Berkeley).

Limitations of Bayesian Networks. They typically require initial knowledge of many probabilities, so the quality and extent of prior knowledge play an important role. Inference has significant computational cost (it is NP-hard). Events whose probabilities were not anticipated in the model are not accounted for.

Conclusion. The foundation is Bayes rule. Bayes rule is used to find the maximum a posteriori (MAP) hypothesis given the training data. Given a new instance, the Bayes optimal classifier combines the results of the different hypotheses, weighted by their posterior probabilities, to output the classification. If all attributes are independent given the class, we can use Naïve Bayes to give the MAP classification. If the attributes are not all independent, we can use a Bayesian belief network.