1
Machine Learning Bayes Learning Bai Xiao
2
Bayes Learning
Outline:
Example: what is Bayes learning; Bayes rule
Bayes learning and concept learning
Maximum likelihood and minimum error square
Bayes optimal classifier
Naïve Bayes
Bayesian Belief Network
Conclusion
3
Bayesian Learning Provides practical learning algorithms
Provides practical learning algorithms: Naïve Bayes learning, Bayesian belief network learning, combining prior knowledge (prior probabilities) with observed data.
Provides foundations for machine learning: evaluating learning algorithms, guiding the design of new algorithms, learning from models (meta-learning).
4
Bayesian Classification: Why?
Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
5
Basic Formulas for Probabilities
Product rule: the probability P(A ∧ B) of a conjunction of two events A and B: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A).
Sum rule: the probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
Theorem of total probability: if events A1, …, An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B | Ai) P(Ai).
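A quick numeric sanity check of these three rules; the joint distribution over the binary events A and B is a made-up example, not data from the slides:

```python
# A made-up joint distribution P(A=a, B=b) over two binary events.
P = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

P_A = sum(p for (a, b), p in P.items() if a)        # P(A)
P_B = sum(p for (a, b), p in P.items() if b)        # P(B)
P_AB = P[(True, True)]                              # P(A and B)
P_B_given_A = P_AB / P_A                            # P(B|A)

# Product rule: P(A and B) = P(B|A) P(A)
assert abs(P_AB - P_B_given_A * P_A) < 1e-12
# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
P_AorB = sum(p for (a, b), p in P.items() if a or b)
assert abs(P_AorB - (P_A + P_B - P_AB)) < 1e-12
# Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
P_B_given_notA = P[(False, True)] / (1 - P_A)
assert abs(P_B - (P_B_given_A * P_A + P_B_given_notA * (1 - P_A))) < 1e-12
print(P_A, P_B, P_AorB)   # 0.5 0.3 0.6
```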
6
Basic Approach
Bayes rule: P(h|D) = P(D|h) P(h) / P(D), where
P(h) = prior probability of hypothesis h
P(D) = prior probability of the training data D
P(h|D) = probability of h given D (posterior probability)
P(D|h) = probability of D given h (the likelihood of D given h)
The goal of Bayesian learning: find the most probable hypothesis given the training data (the maximum a posteriori hypothesis).
7
Bayes Rule
P(h) = prior probability of hypothesis h
P(D) = prior probability of the training data D
P(D|h) = probability of observing the data D given that hypothesis h holds
P(h|D) = probability of h given the training data D, i.e. the posterior probability of h.
8
Bayes Rule: how do we use it? Use Bayes rule to choose the hypothesis h from H (the hypothesis space) that is most probable given the training data D (the maximum a posteriori, MAP, hypothesis): h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h). We can drop P(D) because it does not depend on h.
9
Bayes Rule Maximum Likelihood
In many cases all hypotheses have the same prior probability, P(hi) = P(hj), so we only need to find the hypothesis that maximizes P(D|h), the likelihood of the data D given h. The hypothesis that maximizes P(D|h) is the maximum likelihood hypothesis, denoted h_ML: h_ML = argmax_{h∈H} P(D|h).
10
Bayes Learning Example, what is Bayes Learning
Does a patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer. Can probability and statistics help us solve this problem?
P(cancer) = 0.008, P(no cancer) = 0.992
P(positive | cancer) = 0.98, P(negative | cancer) = 0.02
P(positive | no cancer) = 0.03, P(negative | no cancer) = 0.97
11
Bayes Rule Example, use MAP to solve the patient problem
Does a patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) = P(h1) = 0.008, P(no cancer) = P(h2) = 0.992
P(positive | cancer) = P(D|h1) = 0.98, P(negative | cancer) = 0.02
P(positive | no cancer) = P(D|h2) = 0.03, P(negative | no cancer) = 0.97
P(D|h1) P(h1) = P(positive | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(D|h2) P(h2) = P(positive | no cancer) P(no cancer) = 0.03 × 0.992 = 0.0298
MAP: h_MAP = h2 = no cancer.
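A minimal sketch of this MAP computation, using the numbers above (the dictionary keys are just illustrative labels):

```python
# Priors and likelihoods from the slide.
priors = {"cancer": 0.008, "no cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "no cancer": 0.03}   # P(positive | h)

# Unnormalized posteriors P(D|h) P(h); P(D) is the same for every h,
# so it can be ignored when picking the MAP hypothesis.
scores = {h: likelihood_pos[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)
print(scores)            # roughly {'cancer': 0.00784, 'no cancer': 0.02976}
print("h_MAP =", h_map)  # no cancer
```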
12
Bayes Learning – Minimum Risk
Still the patient problem, but now we make the situation more complex by bringing the idea of "risk" into it: the risk attached to different decisions is also different. Consider two kinds of mistake: "the patient is healthy, but the system says he has cancer" and "the patient has cancer, but the system says he is healthy". The latter mistake is far more dangerous. Solution: we make the decision not only by MAP but also by considering risk, combining the two.
13
Bayes Learning – Minimum Risk
Consider risk via a risk (loss) function: for N hypotheses h1, …, hN, λij denotes the loss incurred when we choose hypothesis i but the true hypothesis is actually j.
14
Bayes Learning – Minimum Risk
For a decision ai from the decision space, the loss λij can take one of c values, j = 1, 2, …, c, one for each possible true hypothesis hj with corresponding posterior probability P(hj|D). So the expected loss (conditional risk) of taking decision ai is R(ai) = Σ_{j=1}^{c} λij P(hj|D).
15
Bayes Learning – Minimum Risk
To use the minimum-risk Bayes decision rule, we take the decision with the minimum expected loss, i.e. the decision ak with R(ak) = min_i R(ai). Procedure: compute the posterior probabilities P(hj|D); based on the risk function (a risk table, in our case) compute the expected loss R(ai) of each decision ai; choose the decision with the minimum expected loss, as in the sketch below.
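A minimal sketch of this procedure on the patient example. The loss values λij in the risk table are assumed for illustration, since the slide's actual risk table is not reproduced here; with these particular values the decision stays a2 (no cancer), consistent with the later slide, though a large enough penalty for missing a cancer would flip it:

```python
# Posteriors normalized from the MAP slide: 0.00784 and 0.02976.
posterior = {"cancer": 0.00784 / 0.0376, "no cancer": 0.02976 / 0.0376}

# risk[decision][true hypothesis] = loss lambda_ij (assumed numbers).
risk = {
    "say cancer":    {"cancer": 0.0, "no cancer": 1.0},
    "say no cancer": {"cancer": 3.0, "no cancer": 0.0},
}

# Expected loss R(a_i) = sum_j lambda_ij P(h_j | D); choose the minimizing decision.
expected_loss = {
    a: sum(risk[a][h] * posterior[h] for h in posterior) for a in risk
}
best = min(expected_loss, key=expected_loss.get)
print(expected_loss, "->", best)   # roughly {say cancer: 0.79, say no cancer: 0.63} -> say no cancer
```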
16
Bayes Learning – Minimum Risk
Again, the patient problem:
17
Bayes Learning – Minimum Risk
Without a risk function, h_MAP = h2 = no cancer. After introducing the risk function and comparing the expected losses, we should still take decision a2: no cancer.
18
Bayes Learning and Concept Learning
Concept learning – the basic idea: find a hypothesis from the hypothesis space H that is consistent with the training data D. Under certain constraints, concept learning algorithms such as Find-S, Version Space, and List-Then-Eliminate also output a MAP (maximum a posteriori) hypothesis.
19
Bayes Learning and Concept Learning
The three constraints: there is no noise in the training data D; the target concept c is contained in the hypothesis space H; every hypothesis has the same prior probability. Brute-force MAP algorithm: for each h in H, compute its posterior probability P(h|D) = P(D|h) P(h) / P(D); output the hypothesis with the highest posterior probability.
20
Bayes Learning and Concept Learning
For each h in H: P(D|h) = 1 if h is consistent with D, and P(D|h) = 0 if h is inconsistent with D. Therefore, if h is inconsistent with D, P(h|D) = 0 · P(h) / P(D) = 0. If h is consistent with D, P(h|D) = (1 · 1/|H|) / P(D) = 1 / |VS_H,D|, using the value of P(D) derived on the next slide.
21
Bayes Learning and Concept Learning
P(D) = Σ_{h∈H} P(D|h) P(h) = |VS_H,D| / |H|, where VS_H,D is the subset of H that is consistent with D (the version space), since P(D|h) = 1 when h is consistent with D and 0 otherwise.
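A minimal sketch of the brute-force MAP computation under these constraints; the hypothesis space and training data below are toy assumptions, with each hypothesis represented by the set of instances it labels positive:

```python
# Toy hypothesis space: each hypothesis is the set of instances it labels positive.
H = {
    "h1": {"a", "b"},
    "h2": {"a"},
    "h3": {"a", "b", "c"},
    "h4": {"c"},
}
D = [("a", True), ("b", True)]           # noise-free training data

def consistent(h, data):
    return all((x in H[h]) == label for x, label in data)

prior = 1.0 / len(H)                                              # P(h) = 1/|H|
likelihood = {h: (1.0 if consistent(h, D) else 0.0) for h in H}   # P(D|h)
P_D = sum(likelihood[h] * prior for h in H)                       # = |VS_H,D| / |H|
posterior = {h: likelihood[h] * prior / P_D for h in H}
print(posterior)   # consistent hypotheses each get 1/|VS_H,D| = 0.5, others get 0
```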
22
Bayes Learning and Concept Learning
Initially all hypotheses have the same probability. As more and more training data become available, the probability of each inconsistent hypothesis drops to 0, while the total probability still sums to 1 and is distributed evenly over the remaining consistent hypotheses.
23
Maximum Likelihood and Minimum Error Square
In this part we are still doing Bayesian learning, but we must tackle two problems: first, the training data are not discrete but continuous; second, there is noise (error) in the training data. This setting is more practical and useful. If a hypothesis's outputs minimize the squared error with respect to the training data, then that hypothesis is the maximum likelihood (ML) hypothesis.
24
Maximum Likelihood and Minimum Error Square
Problem: m training examples <xi, di>, where di = f(xi) + ei and ei is noise drawn from a Gaussian distribution. Each hypothesis h is a function from X to R. Learning is finding the target function f from H; the target function f is the maximum likelihood hypothesis. (In concept learning, h had to be consistent with D because there was no noise.) Solution: minimize the squared error.
25
Maximum Likelihood and Minimum Error Square
Example: given training examples <xi, di> where di = f(xi) + ei, the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors: h_ML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))².
26
Maximum Likelihood and Minimum Error Square
Why is minimizing squared error the same as maximizing likelihood in this case? All training instances are independent and di = f(xi) + ei, so h_ML = argmax_{h∈H} p(D|h) = argmax_{h∈H} Π_{i=1}^{m} p(di|h). Because the noise is Gaussian with mean 0, each di follows a Gaussian distribution with mean f(xi) = h(xi) and variance σ², so h_ML = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²)).
27
Maximum Likelihood and Minimum Error Square
Taking the log (log is a monotonic function): h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²)]. The first term is a constant, so h_ML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))² / (2σ²), which also equals h_ML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))².
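A small sketch illustrating this equivalence: under assumed Gaussian noise, a grid search over a simple hypothesis class of lines h(x) = w·x picks the same w whether we minimize the sum of squared errors or maximize the log-likelihood. The data, noise level, and hypothesis class are assumptions for illustration:

```python
import math
import random

random.seed(0)
true_w, sigma = 2.0, 0.5
data = [(x, true_w * x + random.gauss(0.0, sigma)) for x in [0.1 * i for i in range(30)]]

def sum_sq_error(w):
    return sum((d - w * x) ** 2 for x, d in data)

def log_likelihood(w):
    # sum_i log N(d_i ; w*x_i, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in data)

candidates = [0.01 * k for k in range(100, 301)]   # grid over w in [1, 3]
w_mse = min(candidates, key=sum_sq_error)
w_ml = max(candidates, key=log_likelihood)
print(w_mse, w_ml)   # the same w: the two criteria rank hypotheses identically
```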
28
Bayes Optimal Classifier
So far every problem has been "given the training data, what is the most probable hypothesis?". However, we are also interested in "given a new instance, what is its most probable classification?". Note that h_MAP(x) is not necessarily the most probable classification; the answer is the Bayes optimal classifier.
29
Bayes Optimal Classifier
Example: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3. Given a new instance x, h1(x) = +, h2(x) = −, h3(x) = −. What is the most probable classification of x? Bayes optimal classification: v = argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D). In the example: P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1; P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0; P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0. Therefore Σ_{hi} P(+|hi) P(hi|D) = 0.4 and Σ_{hi} P(−|hi) P(hi|D) = 0.6, so the Bayes optimal classification of x is −.
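A minimal sketch of this weighted vote:

```python
# Posteriors and per-hypothesis predictions from the example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h_i | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}       # h_i(x) for the new instance x

# P(v|D) = sum_i P(v|h_i) P(h_i|D), where P(v|h_i) is 1 if h_i predicts v, else 0.
votes = {}
for h, p in posterior.items():
    votes[prediction[h]] = votes.get(prediction[h], 0.0) + p
best = max(votes, key=votes.get)
print(votes, "->", best)   # {'+': 0.4, '-': 0.6} -> '-'
```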
30
Naïve Bayes Naïve Bayes classifier – classify a new instance
We are also interested in "given a new instance described by attribute values, what is its classification?". Assume a target function f: X → V, where each instance x is described by attribute values <a1, a2, …, an>. The most probable value of f(x) is v_MAP = argmax_{vj∈V} P(vj | a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj).
31
Naïve Bayes
Naïve Bayes assumption: P(a1, a2, …, an | vj) = Πi P(ai | vj).
Naïve Bayes classifier: v_NB = argmax_{vj∈V} P(vj) Πi P(ai | vj).
So we need to know P(vj) and P(ai | vj); we estimate them from the training data by counting their frequencies.
32
Naïve Bayes Naïve Bayes Example
New instance <sunny, cool, high, strong>: is PlayTennis yes or no?
33
Naïve Bayes Estimate P(vj) and P(ai|vj), then compute v_NB
P(PlayTennis = yes) = 9/14 = 0.64; P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33; P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
Compute v_NB:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
So PlayTennis = no.
34
Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities
35
Play-tennis example: estimating P(xi|C)
Outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
Temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
Humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
Windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
Class priors: P(p) = 9/14, P(n) = 5/14
36
Example: Naïve Bayes. Predict playing tennis on a day with the conditions <sunny, cool, high, strong>, i.e. compute P(v | Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong), using the following training data:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

We have: v_NB = argmax_{v∈{yes,no}} P(v) P(sunny|v) P(cool|v) P(high|v) P(strong|v) = no.
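A minimal sketch of this computation, estimating the required probabilities by counting over the table above and classifying the new instance (the tuple layout is an assumption of the sketch):

```python
# The 14 training examples, as (Outlook, Temperature, Humidity, Wind, PlayTennis).
rows = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
new = ("Sunny", "Cool", "High", "Strong")

def p_value(v):                        # P(v_j) by frequency
    return sum(r[-1] == v for r in rows) / len(rows)

def p_attr(i, a, v):                   # P(a_i | v_j) by frequency
    rows_v = [r for r in rows if r[-1] == v]
    return sum(r[i] == a for r in rows_v) / len(rows_v)

scores = {}
for v in ("Yes", "No"):
    s = p_value(v)
    for i, a in enumerate(new):
        s *= p_attr(i, a, v)
    scores[v] = s
print(scores)                                          # roughly {'Yes': 0.0053, 'No': 0.0206}
print("PlayTennis =", max(scores, key=scores.get))     # No
```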
37
The independence hypothesis…
… makes computation possible; … yields optimal classifiers when satisfied; … but is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; and decision trees, which reason on one attribute at a time, considering the most important attributes first.
38
Naïve Bayes Algorithm
Naïve_Bayes_Learn(examples):
  for each target value vj: estimate P(vj)
  for each attribute value ai of each attribute a: estimate P(ai | vj)
Classify_New_Instance(x):
  v_NB = argmax_{vj∈V} P(vj) Π_{ai∈x} P(ai | vj)
Typical estimation of P(ai | vj): the m-estimate P(ai | vj) = (nc + m·p) / (n + m), where n is the number of examples with v = vj, nc is the number of examples with v = vj and a = ai, p is a prior estimate for P(ai | vj), and m is the weight given to the prior (the equivalent sample size).
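A minimal sketch of this m-estimate; the example call estimates P(Wind = strong | PlayTennis = no) from the counts above, with an assumed uniform prior p = 1/2 over the two Wind values and an assumed weight m = 3:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(a_i | v_j) = (n_c + m*p) / (n + m).
    n_c: examples with v = v_j and a = a_i; n: examples with v = v_j;
    p: prior estimate for P(a_i | v_j); m: equivalent sample size (weight of the prior)."""
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=3, n=5, p=0.5, m=3))   # 0.5625, pulled toward the prior 0.5
print(m_estimate(n_c=3, n=5, p=0.5, m=0))   # 0.6, the raw frequency estimate
```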
39
Bayesian Belief Network
The Naïve Bayes assumption of conditional independence gives P(a1, a2, …, an | vj) = Πi P(ai | vj). But that is too restrictive, because a1, a2, …, an are not always independent of each other, especially in realistic problems. A Bayesian belief network describes conditional independence among subsets of variables, and allows combining prior knowledge about (in)dependencies among variables with observed data.
40
Bayesian Belief Network
Conditional independence. Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk) for all xi, yj, zk. More compactly, we write P(X | Y, Z) = P(X | Z). Example: Thunder is conditionally independent of Rain, given Lightning. In Naïve Bayes: P(a1, a2 | vj) = P(a1 | a2, vj) P(a2 | vj) = P(a1 | vj) P(a2 | vj).
41
Bayesian Belief Network
Explanation of a Bayesian belief network: the network represents a set of conditional independence assertions. Each node is conditionally independent of its nondescendants, given its immediate predecessors. The network is a directed acyclic graph.
42
Bayesian Belief Network
Explanation of a Bayesian belief network – two parts: the arcs in the network represent the (in)dependence relationships between variables, and each node (variable) in the network has a corresponding conditional probability table.
43
Bayesian Belief Network
Explanation of a Bayesian belief network: in general, the joint probability factorizes as P(y1, …, yn) = Πi P(yi | Parents(Yi)), where Parents(Yi) denotes the immediate predecessors of Yi in the graph.
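A minimal sketch of this factorization on a small assumed network, Storm → Lightning → Thunder, with made-up conditional probability tables:

```python
# parents[var] lists the immediate predecessors of var in the assumed DAG.
parents = {"Storm": [], "Lightning": ["Storm"], "Thunder": ["Lightning"]}
cpt = {
    "Storm":     {(): 0.1},                          # P(Storm=True)
    "Lightning": {(True,): 0.8, (False,): 0.05},     # P(Lightning=True | Storm)
    "Thunder":   {(True,): 0.9, (False,): 0.01},     # P(Thunder=True | Lightning)
}

def joint(assignment):
    """P(assignment) as the product of the local conditional probabilities."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p_true = cpt[var][key]
        p *= p_true if assignment[var] else (1.0 - p_true)
    return p

print(joint({"Storm": True, "Lightning": True, "Thunder": True}))   # 0.1*0.8*0.9 = 0.072
```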
44
Bayesian Belief Network
Inference in a Bayesian network: how can one infer the (probabilities of) values of one or more network variables, given observed values of others? The Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer; in the general case the problem is NP-hard. Learning of Bayesian networks: the network structure might be known or unknown, and the training examples might provide values of all or only some network variables. If the structure is known and all variables are observed, learning is easy; otherwise, see Russell et al. (1995).
45
Maximum Likelihood Estimation
MLE principle: we learn the parameters that maximize the likelihood function. It is one of the most commonly used estimators in statistics and is intuitively appealing.
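A minimal sketch of the MLE principle on the Bernoulli (thumbtack) case; the outcome counts are assumed, and a grid search over the parameter recovers the closed-form estimate heads / (heads + tails):

```python
import math

heads, tails = 7, 3                      # assumed outcome counts

def log_likelihood(theta):
    # log of L(theta) = theta^heads * (1 - theta)^tails
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
print(theta_mle, heads / (heads + tails))   # both ~0.7
```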
46
What is a Bayesian Network ?
A graphical model that efficiently encodes the joint probability distribution for a large set of variables
47
Definition: a Bayesian network for a set of variables X = {X1, …, Xn} contains a network structure S encoding conditional independence assertions about X, and a set P of local probability distributions. The network structure S is a directed acyclic graph whose nodes are in one-to-one correspondence with the variables in X. The lack of an arc denotes a conditional independence.
48
Some conventions: variables are depicted as nodes, arcs represent probabilistic dependence between variables, and conditional probabilities encode the strength of the dependencies.
49
An example: detecting credit-card fraud. The network variables are Fraud, Age, Sex, Gas, and Jewelry.
50
Tasks Correctly identify the goals of modeling
Identify many possible observations that may be relevant to the problem; determine what subset of those observations is worthwhile to model; organize the observations into variables having mutually exclusive and collectively exhaustive states. Finally, build a directed acyclic graph that encodes the assertions of conditional independence.
51
A technique of constructing a Bayesian Network
The approach is based on the following observations: people can often readily assert causal relationships among variables, and causal relations typically correspond to assertions of conditional dependence. To construct a Bayesian network we simply draw arcs, for a given set of variables, from cause variables to their immediate effects. In the final step we determine the local probability distributions.
52
Problems Steps are often intermingled in practice
Judgments of conditional independence and/or cause and effect can influence problem formulation, and assessments of probability may lead to changes in the network structure.
53
Bayesian inference
Once a Bayesian network is constructed, we need to determine the various probabilities of interest from the model. Computing a probability of interest (a query), given a model and observed data, is probabilistic inference.
54
Learning Probabilities in a Bayesian Network
Problem: using data to update the probabilities of a given network structure. Thumbtack problem: we do not learn "the probability of heads"; rather, we update the posterior distribution of the variable that represents the physical probability of heads. The problem restated: given a random sample D, compute the posterior probability of that variable.
55
Assumptions to compute the posterior probability
There is no missing data in the random sample D, and the parameters are independent.
56
But what if data are missing? How do we proceed?
57
Obvious concerns: why was the data missing? Missing values, hidden variables. Is the absence of an observation dependent on the actual states of the variables? Here we deal with missing data that are independent of the state.
58
Incomplete data (contd)
For any interesting set of local likelihoods and priors, exact computation of the posterior distribution is intractable, so we require approximations for incomplete data.
59
The various methods of approximations for Incomplete Data
Monte Carlo sampling methods; Gaussian approximation; MAP and ML approximations and the EM algorithm.
60
Gibbs Sampling
Start: choose an initial state for each of the variables in X at random. Iterate: for each variable Xi in turn, unassign its current state and resample it according to its probability given the current states of the other n−1 variables; doing this for every variable in X creates a new sample of X. After a "burn-in" phase, the possible configurations of X will be sampled with probability p(x).
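A minimal sketch of this loop for two binary variables with an assumed joint distribution; after burn-in, the empirical frequencies approach p(x):

```python
import random

random.seed(0)
joint = {  # p(x1, x2), an arbitrary small example
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def resample(i, state):
    """Resample variable i from p(x_i | current value of the other variable)."""
    other = state[1 - i]
    weights = []
    for v in (0, 1):
        key = (v, other) if i == 0 else (other, v)
        weights.append(joint[key])
    total = weights[0] + weights[1]
    state[i] = 0 if random.random() < weights[0] / total else 1

state = [random.randint(0, 1), random.randint(0, 1)]   # Start: random initial state
counts = {k: 0 for k in joint}
burn_in, n_samples = 1000, 20000
for t in range(burn_in + n_samples):
    for i in (0, 1):                                   # Iterate over every variable
        resample(i, state)
    if t >= burn_in:
        counts[tuple(state)] += 1

print({k: counts[k] / n_samples for k in counts})      # close to the true joint
```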
61
Problem in Monte Carlo method
Monte Carlo methods become intractable when the sample size is large. Gaussian approximation idea: with large amounts of data, the posterior can be approximated by a multivariate Gaussian distribution.
62
Criteria for Model Selection
Some criterion must be used to determine the degree to which a network structure fits the prior knowledge and data. Such criteria include the relative posterior probability and local criteria.
63
Relative posterior probability
A criterion for model selection is the logarithm of the relative posterior probability: log p(D, S^h) = log p(S^h) + log p(D | S^h), i.e. the log prior plus the log marginal likelihood.
64
Local Criteria
An example: a Bayesian network structure for medical diagnosis, with an Ailment node whose children are Finding 1, Finding 2, …, Finding n.
65
Priors
To compute the relative posterior probability we assess the structure priors p(S^h) and the parameter priors p(θ_s | S^h).
66
Priors on network parameters
Key concepts : Independence Equivalence Distribution Equivalence
67
Illustration of independence equivalence
Independence assertion: X and Z are conditionally independent given Y. The structures X → Y → Z, X ← Y → Z, and X ← Y ← Z all encode this assertion, so they are independence equivalent.
68
Priors on structures Various methods….
Assume every hypothesis is equally likely (usually for convenience); order the variables and treat the presence or absence of arcs as mutually independent; use prior networks; use imaginary data from domain experts.
69
Benefits of learning structures
Efficient learning: more accurate models with less data (comparing P(A) and P(B) versus estimating P(A,B) directly, the former requires less data). Discovering structural properties of the domain. Helping to order events that occur sequentially, and in sensitivity analysis and inference. Predicting the effects of actions.
70
Search Methods. Problem: find the best network from the set of all networks in which each node has no more than k parents. Search techniques: greedy search, greedy search with restarts, best-first search, Monte Carlo methods.
71
Bayesian Networks for Supervised and Unsupervised learning
Supervised learning: a natural representation in which to encode prior knowledge. Unsupervised learning: apply the learning technique to select a model with no hidden variables; look for sets of mutually dependent variables in the model; create a new model with a hidden variable; score the new models, possibly finding one better than the original.
72
What is all this good for?
Implementations in real life: Microsoft products (Microsoft Office); medical applications and biostatistics (BUGS); the NASA AutoClass project for data analysis; collaborative filtering (Microsoft MSBN); fraud detection (AT&T); speech recognition (UC Berkeley).
73
Limitations Of Bayesian Networks
They typically require initial knowledge of many probabilities, so the quality and extent of prior knowledge play an important role. They carry a significant computational cost (the inference task is NP-hard). The probability of an unanticipated event is not accounted for.
74
Conclusion
The foundation is Bayes rule. Bayes rule is used to find the maximum a posteriori (MAP) hypothesis given the training data. Given a new instance, the Bayes optimal classifier combines the predictions of the different hypotheses, weighted by their posterior probabilities, to output the classification. If all attributes are independent given the class, we can use Naïve Bayes to give the MAP classification; if not, we can use a Bayesian belief network.