Bayesian Learning


Bayesian Learning
Provides practical learning algorithms:
- Naïve Bayes learning
- Bayesian belief network learning
- Combines prior knowledge (prior probabilities) with observed data
Provides foundations for machine learning:
- Evaluating learning algorithms
- Guiding the design of new algorithms
- Learning from models: meta-learning

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Basic Formulas for Probabilities
- Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
- Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
- Theorem of total probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then
  P(B) = Σi P(B | Ai) P(Ai)
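A quick numerical sanity check of these three rules on a toy joint distribution (the numbers below are invented for illustration):

```python
# Toy sanity check of the product, sum, and total-probability rules.
# The joint distribution over two binary events A and B is invented.
joint = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}

def p(pred):
    """Probability of the set of (a, b) outcomes satisfying pred."""
    return sum(pr for (a, b), pr in joint.items() if pred(a, b))

p_a, p_b = p(lambda a, b: a), p(lambda a, b: b)
p_ab = p(lambda a, b: a and b)

# Product rule: P(A ^ B) = P(A|B) P(B)
assert abs(p_ab - (p_ab / p_b) * p_b) < 1e-12
# Sum rule: P(A v B) = P(A) + P(B) - P(A ^ B)
assert abs(p(lambda a, b: a or b) - (p_a + p_b - p_ab)) < 1e-12
# Total probability: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_b_given_a = p(lambda a, b: a and b) / p_a
p_b_given_not_a = p(lambda a, b: (not a) and b) / (1 - p_a)
assert abs(p_b - (p_b_given_a * p_a + p_b_given_not_a * (1 - p_a))) < 1e-12
print("all three rules hold on the toy joint distribution")
```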

Basic Approach
Bayes rule:
  P(h | D) = P(D | h) P(h) / P(D)
where
- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h | D) = probability of h given D (posterior probability)
- P(D | h) = probability of D given h (likelihood of D given h)
The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori, or MAP, hypothesis):
  h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)

An Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population has this cancer.
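A worked application of Bayes rule to these numbers, written as a short script (the conclusion follows directly from the figures above):

```python
# Worked MAP computation for the cancer example above.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer = 0.98         # correct positive rate
p_pos_given_not_cancer = 0.03     # 1 - 0.97 correct negative rate

# Unnormalized posteriors P(D|h) P(h) for D = "positive test result".
post_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.00784
post_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 = 0.02976

print(post_cancer, post_not_cancer)
# h_MAP = "no cancer", since 0.02976 > 0.00784.
# Normalizing: P(cancer | +) = 0.00784 / (0.00784 + 0.02976) ≈ 0.21.
print(post_cancer / (post_cancer + post_not_cancer))
```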

MAP Learner
For each hypothesis h in H, calculate the posterior probability
  P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability
  h_MAP = argmax_{h ∈ H} P(h | D)
Comments:
- Computationally intensive
- Provides a standard for judging the performance of learning algorithms
- Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
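A minimal brute-force sketch of this loop; the hypothesis set, prior, and likelihood functions are supplied by the caller, and the example values simply repeat the cancer slide:

```python
# Brute-force MAP learner: scores every hypothesis and returns the best one.
def map_hypothesis(hypotheses, prior, likelihood, data):
    """hypotheses: iterable of candidates; prior(h) -> P(h); likelihood(data, h) -> P(data | h)."""
    # P(D) is the same for every h, so the unnormalized score suffices.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Example with the two hypotheses from the cancer slide:
hyps = ["cancer", "no cancer"]
prior = {"cancer": 0.008, "no cancer": 0.992}.get
lik = lambda d, h: {"cancer": 0.98, "no cancer": 0.03}[h]  # d = positive test result
print(map_hypothesis(hyps, prior, lik, data="positive"))   # -> "no cancer"
```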

Bayes Optimal Classifier
Question: given a new instance x, what is its most probable classification?
h_MAP(x) is not necessarily the most probable classification!
Example: let P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3.
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -.
What is the most probable classification of x?
Bayes optimal classification:
  v_OB = argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)
Example:
  P(h1|D) = 0.4, P(-|h1) = 0, P(+|h1) = 1
  P(h2|D) = 0.3, P(-|h2) = 1, P(+|h2) = 0
  P(h3|D) = 0.3, P(-|h3) = 1, P(+|h3) = 0
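Carrying the numbers through (a short script that just evaluates the sum above for each class):

```python
# Bayes optimal classification for the three-hypothesis example above.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_class_given_h = {"h1": {"+": 1, "-": 0},
                   "h2": {"+": 0, "-": 1},
                   "h3": {"+": 0, "-": 1}}

scores = {v: sum(p_class_given_h[h][v] * p_h_given_D[h] for h in p_h_given_D)
          for v in ("+", "-")}
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' is the Bayes optimal classification,
                                    # even though h_MAP = h1 predicts '+'.
```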

Naïve Bayes Learner
Assume a target function f: X -> V, where each instance x is described by attributes <a1, a2, ..., an>. The most probable value of f(x) is:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an) = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
Naïve Bayes assumption (attributes are conditionally independent given the class):
  P(a1, a2, ..., an | vj) = Πi P(ai | vj)
giving the Naïve Bayes classifier:
  v_NB = argmax_{vj ∈ V} P(vj) Πi P(ai | vj)

Bayesian Classification
The classification problem may be formalized using a-posteriori probabilities:
  P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C,
  e.g. P(class = N | outlook = sunny, windy = true, ...).
Idea: assign to sample X the class label C such that P(C|X) is maximal.

Estimating A-Posteriori Probabilities
Bayes theorem: P(C|X) = P(X|C) · P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- The C such that P(C|X) is maximum is the C such that P(X|C) · P(C) is maximum
Problem: computing P(X|C) is infeasible!

Naïve Bayesian Classification
Naïve assumption: attribute independence
  P(x1, ..., xk | C) = P(x1|C) · ... · P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C.
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function.
- Computationally easy in both cases.
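A minimal sketch of these two per-attribute estimators, assuming a simple list-of-values representation of the class-C samples (the example values are invented):

```python
# Sketch of the two per-attribute estimators mentioned above.
import math
from collections import Counter

def categorical_likelihood(values_in_class, xi):
    """Relative frequency of value xi among the class-C samples."""
    counts = Counter(values_in_class)
    return counts[xi] / len(values_in_class)

def gaussian_likelihood(values_in_class, xi):
    """Gaussian density fitted to the class-C samples, evaluated at xi."""
    n = len(values_in_class)
    mean = sum(values_in_class) / n
    var = sum((v - mean) ** 2 for v in values_in_class) / n
    return math.exp(-(xi - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(categorical_likelihood(["sunny", "rain", "sunny"], "sunny"))  # 2/3
print(gaussian_likelihood([20.0, 22.0, 19.0, 21.0], 20.5))
```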

Naïve Bayesian Classifier (II)
Given a training set, we can compute these probabilities directly from frequency counts, as in the play-tennis example that follows.

Play-Tennis Example: Estimating P(xi | C)

P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5

Example: Naïve Bayes
Predict playing tennis on a day with the conditions <sunny, cool, high, strong>, i.e. P(v | outlook = sunny, temperature = cool, humidity = high, wind = strong), using the following training data:

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

Using the conditional probability estimates from the previous slide, we have:
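(Worked out as a short script here, in place of the slide's original figure; the numbers come straight from the table of estimates above.)

```python
# Naïve Bayes prediction for <outlook=sunny, temp=cool, humidity=high, wind=strong>,
# using the conditional probabilities estimated from the 14-day table.
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(no)  P(sunny|no)  P(cool|no)  P(high|no)  P(strong|no)

print(p_yes, p_no)                     # ≈ 0.0053 vs ≈ 0.0206
print("PlayTennis =", "Yes" if p_yes > p_no else "No")   # -> No
# Normalized: P(no | x) ≈ 0.0206 / (0.0053 + 0.0206) ≈ 0.80
```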

The Independence Hypothesis...
- ... makes computation possible
- ... yields optimal classifiers when satisfied
- ... but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first

Naïve Bayes Algorithm

Naive_Bayes_Learn(examples):
  for each target value vj
    estimate P(vj)
    for each attribute value ai of each attribute a
      estimate P(ai | vj)

Classify_New_Instance(x):
  v_NB = argmax_{vj ∈ V} P(vj) Π_{ai ∈ x} P(ai | vj)

Typical estimation of P(ai | vj) (the m-estimate):
  P(ai | vj) = (nc + m·p) / (n + m)
where
- n: number of training examples with v = vj
- nc: number of examples with v = vj and a = ai
- p: prior estimate for P(ai | vj)
- m: the weight given to the prior
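A compact sketch of this algorithm with the m-estimate; the data layout (a list of (attribute-dict, label) pairs) and the default choice of p as a uniform prior over observed values are my own assumptions:

```python
# Minimal Naïve Bayes learner/classifier with m-estimate smoothing.
from collections import Counter, defaultdict

def naive_bayes_learn(examples, m=1.0, p=None):
    """examples: list of (attrs, label), where attrs is a dict {attr_name: value}."""
    labels = Counter(label for _, label in examples)
    priors = {v: n / len(examples) for v, n in labels.items()}
    counts = defaultdict(Counter)   # (label, attr) -> Counter of attribute values
    for attrs, label in examples:
        for a, val in attrs.items():
            counts[(label, a)][val] += 1

    def cond_prob(a, val, v):
        n, nc = labels[v], counts[(v, a)][val]
        # Assumed default: uniform prior p over the observed values of attribute a.
        prior = p if p is not None else 1 / max(1, len(counts[(v, a)]))
        return (nc + m * prior) / (n + m)      # the m-estimate

    def classify(attrs):
        def score(v):
            s = priors[v]
            for a, val in attrs.items():
                s *= cond_prob(a, val, v)
            return s
        return max(priors, key=score)
    return classify

# Usage on two toy examples (made up, not the play-tennis data):
clf = naive_bayes_learn([({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")])
print(clf({"outlook": "sunny"}))   # -> "no"
```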

Bayesian Belief Networks
- The Naïve Bayes assumption of conditional independence is too restrictive.
- But inference is intractable without some such assumptions.
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
Bayesian net:
- Node = variable
- Arc = dependency
- DAG, with the direction on an arc representing causality

Bayesian Networks: Multiple Variables with Dependency
A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
Bayesian net:
- Node = variable, where each variable has a finite set of mutually exclusive states
- Arc = dependency
- DAG, with the direction on an arc representing causality
- To each variable A with parents B1, ..., Bn there is attached a conditional probability table P(A | B1, ..., Bn)

Bayesian Belief Networks (Example)
Age, Occupation, and Income determine whether a customer will buy this product (Buy X). Given that the customer buys the product, whether there is interest in insurance (Interested in Insurance) is independent of Age, Occupation, and Income:
  P(Age, Occ, Inc, Buy, Ins) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Ins | Buy)
Current state of the art: given the structure and probabilities, existing algorithms can handle inference with categorical values and limited representations of numerical values.
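As a sketch of how this factorization is used to evaluate the joint probability of one complete assignment (all CPT numbers below are invented placeholders, not values from the slides):

```python
# Joint probability of one full assignment via the factorization above.
# All CPT entries below are illustrative placeholders.
p_age = {"young": 0.3, "old": 0.7}
p_occ = {"tech": 0.4, "other": 0.6}
p_inc = {"high": 0.2, "low": 0.8}
p_buy = {("old", "tech", "high"): 0.9}   # P(Buy=yes | Age, Occ, Inc); one row shown
p_ins = {True: 0.6, False: 0.1}          # P(Ins=yes | Buy)

def joint(age, occ, inc, buy, ins):
    p_b = p_buy[(age, occ, inc)] if buy else 1 - p_buy[(age, occ, inc)]
    p_i = p_ins[buy] if ins else 1 - p_ins[buy]
    return p_age[age] * p_occ[occ] * p_inc[inc] * p_b * p_i

print(joint("old", "tech", "high", buy=True, ins=True))  # 0.7 * 0.4 * 0.2 * 0.9 * 0.6
```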

General Product Rule
For a Bayesian network over variables X1, ..., Xn, the joint probability factors as
  P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Parents(Xi))

Nodes as Functions
A node in a Bayesian network is a conditional distribution function P(X | A = a, B = b): the input is the parents' state values, and the output is a distribution over the node's own values. For example, a node X with values {l, m, h} and binary parents A and B has a conditional probability table with one row per parent-state combination (ab, a~b, ~ab, ~a~b), each row giving a distribution over l, m, h.
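One way to picture such a node as a lookup function; the probability values are placeholders, since the slide's own table is only partially legible:

```python
# A node X with values {l, m, h} and binary parents A, B, represented as a
# conditional probability table keyed by the parents' states.
# The numbers are illustrative placeholders.
cpt_X = {
    (True,  True):  {"l": 0.1, "m": 0.3, "h": 0.6},
    (True,  False): {"l": 0.7, "m": 0.2, "h": 0.1},
    (False, True):  {"l": 0.4, "m": 0.2, "h": 0.4},
    (False, False): {"l": 0.5, "m": 0.3, "h": 0.2},
}

def p_x_given_parents(a, b):
    """Input: the parents' state values. Output: a distribution over X's values."""
    return cpt_X[(a, b)]

print(p_x_given_parents(True, False))   # {'l': 0.7, 'm': 0.2, 'h': 0.1}
```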

Special Case: Naïve Bayes
Naïve Bayes is the Bayesian network in which a single class node h is the parent of every evidence node e1, e2, ..., en, so
  P(e1, e2, ..., en, h) = P(h) P(e1 | h) ... P(en | h)

Inference in Bayesian Networks
Network nodes: Age, Income, House Owner, Living Location, Newspaper Preference, EU Voting Pattern.
How likely are elderly rich people to buy the Sun?
  P(paper = Sun | Age > 60, Income > 60k)

Inference in Bayesian Networks (continued)
Using the same network: how likely are elderly rich people who voted Labour to buy the Daily Mail?
  P(paper = DM | Age > 60, Income > 60k, v = Labour)
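A generic sketch of answering such a query by enumeration over the factorized joint; the tiny Age -> Location -> Paper network and its numbers are invented, since the slides do not give CPTs for their newspaper network:

```python
# Inference by enumeration on a tiny invented network: Age -> Location -> Paper.
p_age = {"old": 0.2, "young": 0.8}
p_loc_given_age = {"old": {"city": 0.3, "rural": 0.7},
                   "young": {"city": 0.6, "rural": 0.4}}
p_paper_given_loc = {"city": {"Sun": 0.4, "DM": 0.3, "other": 0.3},
                     "rural": {"Sun": 0.2, "DM": 0.5, "other": 0.3}}

def joint(age, loc, paper):
    return p_age[age] * p_loc_given_age[age][loc] * p_paper_given_loc[loc][paper]

def p_paper_given_age(paper, age):
    """P(Paper = paper | Age = age), summing out the hidden Location variable."""
    locs = ("city", "rural")
    num = sum(joint(age, l, paper) for l in locs)
    den = sum(joint(age, l, p) for l in locs for p in ("Sun", "DM", "other"))
    return num / den

print(p_paper_given_age("DM", "old"))   # 0.3*0.3 + 0.7*0.5 = 0.44
```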

Learning Bayesian Networks
Example network over the variables Burglary, Earthquake, Alarm, Call, and Newscast; training data are cases over B, E, A, C, N (e.g. one case ~b, e, a, c, n), and so on.
- Input: fully or partially observable data cases
- Output: parameters, and possibly also structure
Learning methods:
- EM (Expectation Maximisation): use the current approximation of the parameters to estimate the filled-in data, then use the filled-in data to update the parameters (ML)
- Gradient ascent training
- Gibbs sampling (MCMC)
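For fully observed data cases, the ML parameter estimates are just conditional relative frequencies; a minimal sketch (the variable names follow the slide, but the data cases themselves are invented):

```python
# ML parameter estimation for one node (Alarm given Burglary, Earthquake)
# from fully observed data cases. The data values below are invented.
from collections import Counter

# Each case is a dict over the network's variables (True/False states).
cases = [
    {"B": False, "E": True,  "A": True,  "C": True,  "N": True},
    {"B": False, "E": False, "A": False, "C": False, "N": False},
    {"B": True,  "E": False, "A": True,  "C": True,  "N": False},
    {"B": False, "E": False, "A": False, "C": False, "N": False},
]

def ml_cpt(child, parents, cases):
    """Estimate P(child=True | parents) by relative frequency for each parent configuration."""
    totals, trues = Counter(), Counter()
    for case in cases:
        key = tuple(case[p] for p in parents)
        totals[key] += 1
        trues[key] += case[child]
    return {key: trues[key] / totals[key] for key in totals}

print(ml_cpt("A", ("B", "E"), cases))
# -> {(False, True): 1.0, (False, False): 0.0, (True, False): 1.0}
```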