Bayesian Learning Provides practical learning algorithms

Slides:



Advertisements
Similar presentations
Data Mining in Micro array Analysis
Advertisements

Data Pre-processing Data Cleaning : Sampling:
Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
Making Simple Decisions
Classification. Introduction A discriminant is a function that separates the examples of different classes. For example – IF (income > Q1 and saving >Q2)
Decision Tree Learning - ID3
Bayesian Learning Provides practical learning algorithms
1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005.
INC 551 Artificial Intelligence Lecture 11 Machine Learning (Continue)
Classification Algorithms
Oliver Schulte Machine Learning 726
Bayesian Learning Rong Jin. Outline MAP learning vs. ML learning Minimum description length principle Bayes optimal classifier Bagging.
1 Chapter 12 Probabilistic Reasoning and Bayesian Belief Networks.
2D1431 Machine Learning Bayesian Learning. Outline Bayes theorem Maximum likelihood (ML) hypothesis Maximum a posteriori (MAP) hypothesis Naïve Bayes.
Introduction to Bayesian Learning Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)
Probability and Bayesian Networks
Machine Learning CMPT 726 Simon Fraser University
1er. Escuela Red ProTIC - Tandil, de Abril, Bayesian Learning 5.1 Introduction –Bayesian learning algorithms calculate explicit probabilities.
Introduction to Bayesian Learning Ata Kaban School of Computer Science University of Birmingham.
Bayesian Learning Rong Jin.
Rutgers CS440, Fall 2003 Introduction to Statistical Learning Reading: Ch. 20, Sec. 1-4, AIMA 2 nd Ed.
Bayes Classification.
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
Jeff Howbert Introduction to Machine Learning Winter Classification Bayesian Classifiers.
Oct 17, 2006Sudeshna Sarkar, IIT Kharagpur1 Machine Learning Sudeshna Sarkar IIT Kharagpur.
NAÏVE BAYES CLASSIFIER 1 ACM Student Chapter, Heritage Institute of Technology 10 th February, 2012 SIGKDD Presentation by Anirban Ghose Parami Roy Sourav.
Fall 2004 TDIDT Learning CS478 - Machine Learning.
Midterm Review Rao Vemuri 16 Oct Posing a Machine Learning Problem Experience Table – Each row is an instance – Each column is an attribute/feature.
A survey on using Bayes reasoning in Data Mining Directed by : Dr Rahgozar Mostafa Haghir Chehreghani.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Naive Bayes Classifier
Machine Learning Chapter 6. Bayesian Learning Tom M. Mitchell.
Kansas State University Department of Computing and Information Sciences CIS 730: Introduction to Artificial Intelligence Lecture 25 Wednesday, 20 October.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
1 Instance-Based & Bayesian Learning Chapter Some material adapted from lecture notes by Lise Getoor and Ron Parr.
CS464 Introduction to Machine Learning1 Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease.
Computing & Information Sciences Kansas State University Wednesday, 22 Oct 2008CIS 530 / 730: Artificial Intelligence Lecture 22 of 42 Wednesday, 22 October.
Kansas State University Department of Computing and Information Sciences CIS 730: Introduction to Artificial Intelligence Lecture 25 of 41 Monday, 25 October.
Bayesian Classification. Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
1 Chapter 12 Probabilistic Reasoning and Bayesian Belief Networks.
Chapter 6 Bayesian Learning
Review: Probability Random variables, events Axioms of probability Atomic events Joint and marginal probability distributions Conditional probability distributions.
Classification Algorithms Decision trees Rule-based induction Neural networks Memory(Case) based reasoning Genetic algorithms Bayesian networks Basic Principle.
1 Bayesian Learning. 2 Bayesian Reasoning Basic assumption –The quantities of interest are governed by probability distribution –These probability + observed.
Bayesian Optimization Algorithm, Decision Graphs, and Occam’s Razor Martin Pelikan, David E. Goldberg, and Kumara Sastry IlliGAL Report No May.
1 Machine Learning: Lecture 6 Bayesian Learning (Based on Chapter 6 of Mitchell T.., Machine Learning, 1997)
Bayesian Learning Bayes Theorem MAP, ML hypotheses MAP learners
Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Wednesday, 28 February 2007.
Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Classification COMP Seminar BCB 713 Module Spring 2011.
Bayesian Learning Evgueni Smirnov Overview Bayesian Theorem Maximum A Posteriori Hypothesis Naïve Bayes Classifier Learning Text Classifiers.
Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability Primer Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Bayesian Learning. Probability Bayes Rule Choosing Hypotheses- Maximum a Posteriori Maximum Likelihood - Bayes Concept Learning Maximum Likelihood of.
Naive Bayes Classifier. REVIEW: Bayesian Methods Our focus this lecture: – Learning and classification methods based on probability theory. Bayes theorem.
Bayesian Learning. Uncertainty & Probability Baye's rule Choosing Hypotheses- Maximum a posteriori Maximum Likelihood - Baye's concept learning Maximum.
1 1)Bayes’ Theorem 2)MAP, ML Hypothesis 3)Bayes optimal & Naïve Bayes classifiers IES 511 Machine Learning Dr. Türker İnce (Lecture notes by Prof. T. M.
Bayesian Classification 1. 2 Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership.
CMPT 310 Simon Fraser University Oliver Schulte Learning.
CS479/679 Pattern Recognition Dr. George Bebis
Naive Bayes Classifier
Computer Science Department
Data Mining Lecture 11.
CSCI 5822 Probabilistic Models of Human and Machine Learning
Machine Learning Bayes Learning Bai Xiao.
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Parametric Methods Berlin Chen, 2005 References:
Machine Learning: Lecture 6
Machine Learning: UNIT-3 CHAPTER-1
Naive Bayes Classifier
Presentation transcript:

Bayesian Learning Provides practical learning algorithms Naïve Bayes learning Bayesian belief network learning Combine prior knowledge (prior probabilities) Provides foundations for machine learning Evaluating learning algorithms Guiding the design of new algorithms Learning from models : meta learning

Basic Approach Bayes Rule: P(h) = prior probability of hypothesis h P(D) = prior probability of training data D P(h|D) = probability of h given D (posterior density ) P(D|h) = probability of D given h (likelihood of D given h) The Goal of Bayesian Learning: the most probable hypothesis given the training data (Maximum A Posteriori hypothesis )

Basic Formulas for Probabilities Product Rule : probability P(AB) of a conjunction of two events A and B: Sum Rule: probability of a disjunction of two events A and B: Theorem of Total Probability : if events A1, …., An are mutually exclusive with

An Example Does patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

MAP Learner For each hypothesis h in H, calculate the posterior probability Output the hypothesis hmap with the highest posterior probability Comments: Computational intensive Providing a standard for judging the performance of learning algorithms Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task

Minimum Description Length Principle Occam’s razor : “ choose the shortest explanation for the observed data: short trees; minimal rules….’’ MML: An abstraction of syntactical simplicity: Given a data set D Given a set of hypothesis {h1, h2, …, hn} that describe D MML Principle : always choose Hi for which that quantity: Mlength (hi) + Mlength (D | hi) is minimised MML: Best hypothesis is one which allows for most compact encoding of given data Example: Represent the set {2, 4, 6, ….., 100} Solution 1: list all 49 integers Solution 2: “All even numbers <= 100 and >= 2 MML: Solution2 is worth somet5hing As the size of set increases, the cost of solution 1 increases much more rapidly than cost of solution 1 If the set is random then the cost of solution 1 is minimum

Bayesian Interpretation of MML From Information Theory: The optimal (shortest expected coding length) code for an object (event) with probability p is -log2p bits So: Think about the decision tree example: prefer the hypothesis that minimizes: mlength(h) + mlength (misclassification)

Bayes Optimal Classifier Question: Given new instance x, what is its most probable classification? Hmap(x) is not the most probable classification! Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3 Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = - What is the most probable classification of x ? Bayes optimal classification: Example: P(h1| D) =.4, P(-|h1)=0, P(+|h1)=1 P(h2|D) =.3, P(-|h2)=1, P(+|h2)=0 P(h3|D)=.3, P(-|h3)=1, P(+|h3)=0

Naïve Bayes Learner Assume target function f: X-> V, where each instance x described by attributes <a1, a2, …., an>. Most probable value of f(x) is: Naïve Bayes assumption: (attributes are conditionally independent)

Example : Naïve Bayes Predict playing tennis in the day with the condition <sunny, cool, high, strong> (P(v| o=sunny, t= cool, h=high w=strong)) using the following training data: Day Outlook Temperature Humidity Wind Play Tennis 1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 5 Rain Cool Normal Weak Yes 6 Rain Cool Normal Strong No 7 Overcast Cool Normal Strong Yes 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes 10 Rain Mild Normal Weak Yes 11 Sunny Mild Normal Strong Yes 12 Overcast Mild High Strong Yes 13 Overcast Hot Normal Weak Yes 14 Rain Mild High Strong No we have :

Naïve Bayes Algorithm Naïve_Bayes_Learn (examples) for each target value vj estimate P(vj) for each attribute value ai of each attribute a estimate P(ai | vj ) Classify_New_Instance (x) Typical estimation of P(ai | vj) Where n: examples with v=v; p is prior estimate for P(ai|vj) nc: examples with a=ai, m is the weight to prior

Bayesian Belief Networks Naïve Bayes assumption of conditional independence too restrictive But it is intractable without some such assumptions Bayesian Belief network (Bayesian net) describe conditional independence among subsets of variables (attributes): combining prior knowledge about dependencies among variables with observed training data. Bayesian Net Node = variables Arc = dependency DAG, with direction on arc representing causality

Bayesian Networks: Multi-variables with Dependency Bayesian Belief network (Bayesian net) describe conditional independence among subsets of variables (attributes): combining prior knowledge about dependencies among variables with observed training data. Bayesian Net Node = variables and each variable has a finite set of mutually exclusive states Arc = dependency DAG, with direction on arc representing causality To each variables A with parents B1, …., Bn there is attached a conditional probability table P (A | B1, …., Bn)

Bayesian Belief Networks Age, Occupation and Income determine if customer will buy this product. Given that customer buys product, whether there is interest in insurance is now independent of Age, Occupation, Income. P(Age, Occ, Inc, Buy, Ins ) = P(Age)P(Occ)P(Inc) P(Buy|Age,Occ,Inc)P(Int|Buy) Current State-of-the Art: Given structure and probabilities, existing algorithms can handle inference with categorical values and limited representation of numerical values Occ Age Income Buy X Interested in Insurance

General Product Rule

Nodes as Functions A node in BN is a conditional distribution function P(X|A=a, B=b) 0.1 0.3 0.6 l m h a ab ~ab a~b ~a~b A 0.1 0.3 0.6 0.7 0.2 0.1 0.4 0.2 0.2 0.5 0.3 l m h b B X input: parents state values output: a distribution over its own value

Special Case : Naïve Bayes h e1 e2 …………. en P(e1, e2, ……en, h ) = P(h) P(e1 | h) …….P(en | h)

Inference in Bayesian Networks Age Income How likely are elderly rich people to buy Sun? P( paper = Sun | Age>60, Income > 60k) House Owner Living Location Newspaper Preference EU Voting Pattern

Inference in Bayesian Networks Age Income How likely are elderly rich people who voted labour to buy Daily Mail? P( paper = DM | Age>60, Income > 60k, v = labour) House Owner Living Location Newspaper Preference EU Voting Pattern

Bayesian Learning Burglary Earthquake B E A C N ~b e a c n ………………... Alarm Newscast Call Input : fully or partially observable data cases Output : parameters AND also structure Learning Methods: EM (Expectation Maximisation) using current approximation of parameters to estimate filled in data using filled in data to update parameters (ML) Gradient Ascent Training Gibbs Sampling (MCMC)