Quiz 3: Mean: 9.2 Median: 9.75 Go over problem 1

Go over Adaboost examples

Fix to C4.5 data formatting problem?

Quiz 4

Alternative simple (but effective) discretization method (Yang & Webb, 2001): Let n = the number of training examples. For each continuous attribute A_i, create approximately √n bins: sort the values of A_i in ascending order and put approximately √n of them in each bin (equal-frequency binning). With bins this size, add-one smoothing of the probabilities is not needed. This gives a good balance between discretization bias and variance.

Example. Humidity: 25, 38, 50, 80, 93, 98, 98, …, 99
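
A minimal sketch of this equal-frequency scheme (assuming ~√n bins of ~√n values each, as described above; the function name and NumPy-based implementation are my own illustration, not code from the lecture):

```python
import numpy as np

def equal_frequency_bins(values):
    """Discretize one continuous attribute into ~sqrt(n) equal-frequency bins,
    following the Yang & Webb (2001) rule of ~sqrt(n) bins with ~sqrt(n) values each.
    Returns the bin index assigned to each value."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    n_bins = max(1, int(np.sqrt(n)))              # ~sqrt(n) bins
    order = np.argsort(values)                    # indices that sort the values ascending
    bins = np.empty(n, dtype=int)
    for b, chunk in enumerate(np.array_split(order, n_bins)):
        bins[chunk] = b                           # consecutive sorted chunks share a bin
    return bins

humidity = [25, 38, 50, 80, 93, 98, 98, 99]
print(equal_frequency_bins(humidity))             # [0 0 0 0 1 1 1 1] with 2 bins for n = 8
```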

Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani). The naive Bayes classifier is called "naive" because it assumes the attributes are independent of one another given the class.

This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

Experiments: Compare five classification methods on 30 data sets from the UCI ML database.
– SBC = Simple Bayesian Classifier
– Default = choose the class with the most representatives in the data
– C4.5 = Quinlan's decision tree induction system
– PEBLS = an instance-based learning system
– CN2 = a rule-induction system

For the SBC, numeric values were discretized into ten equal-length intervals.

Results table (numeric entries not preserved in this transcript); its rows report:
– Number of domains in which the SBC was more accurate versus less accurate than the corresponding classifier
– The same count, but including only differences significant at 95% confidence
– Average rank over all domains (1 is best in each domain)

Measuring Attribute Dependence: They used a simple, pairwise mutual-information measure. For attributes A_m and A_n, dependence is defined as

D(A_m, A_n | C) = H(A_m | C) + H(A_n | C) − H(A_m A_n | C)

where A_m A_n is a "derived attribute" whose values consist of the possible combinations of values of A_m and A_n, and H(· | C) is the class-conditional entropy. Note: if A_m and A_n are independent given the class, then D(A_m, A_n | C) = 0.
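
A small sketch of how this measure could be computed from data (a plain Python implementation of the formula above; the function names are my own, not from the paper):

```python
import math
from collections import Counter

def cond_entropy(xs, cs):
    """H(X | C) = sum_c P(c) * H(X | C = c), with X and C given as parallel lists."""
    n = len(cs)
    h = 0.0
    for c, nc in Counter(cs).items():
        counts = Counter(x for x, ci in zip(xs, cs) if ci == c)
        h_c = -sum((k / nc) * math.log2(k / nc) for k in counts.values())
        h += (nc / n) * h_c
    return h

def dependence(am, an, cs):
    """D(A_m, A_n | C) = H(A_m|C) + H(A_n|C) - H(A_m A_n|C); 0 iff conditionally independent."""
    derived = list(zip(am, an))          # the "derived attribute" A_m A_n
    return cond_entropy(am, cs) + cond_entropy(an, cs) - cond_entropy(derived, cs)

# Toy check: A_n is an exact copy of A_m, so the dependence is well above 0.
am = [0, 0, 1, 1, 0, 1]
an = list(am)
cs = ['+', '+', '+', '-', '-', '-']
print(dependence(am, an, cs))
```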

Results: (1) The SBC is more successful than the more complex methods, even when there is substantial dependence among attributes. (2) There is no correlation between the degree of attribute dependence and the SBC's rank. But why?

An Example: Let the classes be {+, −} and the attributes be {A, B, C}, with P(+) = P(−) = 1/2. Suppose A and C are completely independent, and A and B are completely dependent (e.g., A = B). Compare the optimal classification procedure with the SBC.

This leads to the following conditions.
Optimal classifier: if P(A|+) P(C|+) > P(A|−) P(C|−) then class = +, else class = −.
SBC: if P(A|+)^2 P(C|+) > P(A|−)^2 P(C|−) then class = +, else class = −.
(The SBC counts the evidence from A twice, since B = A.)

In the paper, the authors use Bayes' theorem to rewrite these conditions in terms of p = P(+ | A) and q = P(+ | C), and plot the decision boundaries of the optimal classifier and of the SBC over the (p, q) space.

Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the (p, q) space. But why?
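
A quick numerical sketch of this comparison (my own illustration of the A = B example, not code from the paper: it scans the (p, q) square and checks where the two rules from the slide above agree, using the fact that with equal priors P(A|+)/P(A|−) = p/(1−p) and P(C|+)/P(C|−) = q/(1−q)):

```python
import numpy as np

# p = P(+ | A), q = P(+ | C); with equal class priors, Bayes' theorem gives
# P(A|+)/P(A|-) = p/(1-p) and P(C|+)/P(C|-) = q/(1-q).
p, q = np.meshgrid(np.linspace(0.01, 0.99, 199), np.linspace(0.01, 0.99, 199))

odds_a = p / (1 - p)
odds_c = q / (1 - q)

optimal_plus = odds_a * odds_c > 1          # optimal rule: counts the evidence from A once
sbc_plus = odds_a**2 * odds_c > 1           # SBC rule: counts it twice, since B = A

agreement = np.mean(optimal_plus == sbc_plus)
print(f"SBC matches the optimal classification on {agreement:.1%} of the (p, q) square")
```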

Explanation: Suppose the possible classes are + and −, and let x be a new example with attribute values x_1, ..., x_n. The naive Bayes classifier calculates two probability estimates, one per class, and returns the class with the maximum estimated probability given x.

The probability calculations are correct only if the independence assumption is correct. However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct! The latter covers a lot more cases than the former. Thus, the SBC is effective in many cases in which the independence assumption does not hold.

More on Bias and Variance

Bias (from eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Variance (from eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Noise (from eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Sources of Bias and Variance: Bias arises when the classifier cannot represent the true function, that is, when the classifier underfits the data. Variance arises when the classifier overfits the data. There is often a tradeoff between bias and variance. From eecs.oregonstate.edu/~tgd/talks/BV.ppt
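
As a concrete illustration of this tradeoff (a self-contained simulation I added, not part of the original slides: it fits low- and high-degree polynomials to many noisy samples of a known function and estimates squared bias and variance at one test point):

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)       # known target function
x_train = np.linspace(0, 1, 20)
x_test = 0.37                                  # single test point, for simplicity

def estimate_bias_variance(degree, n_repeats=500, noise=0.3):
    """Estimate squared bias and variance of polynomial regression at x_test."""
    preds = []
    for _ in range(n_repeats):
        y = true_f(x_train) + rng.normal(0, noise, size=x_train.size)  # fresh noisy training set
        coeffs = np.polyfit(x_train, y, degree)                        # fit a degree-d polynomial
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2
    variance = preds.var()
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = estimate_bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions, the degree-1 fit shows high bias and low variance, while the degree-9 fit shows the reverse.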

Bias-Variance Tradeoff: As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is. But why? From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

SVM Bias and Variance: The bias-variance tradeoff is controlled by the kernel parameter. A biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary! From eecs.oregonstate.edu/~tgd/talks/BV.ppt

Effect of Boosting: In the early iterations, boosting is primarily a bias-reducing method; in later iterations, it appears to be primarily a variance-reducing method. From eecs.oregonstate.edu/~tgd/talks/BV.ppt

Bayesian Networks. Reading: S. Wooldridge, Bayesian belief networks (linked from class website)

A patient comes into a doctor's office with a fever and a bad cough.
Hypothesis space H: h_1: patient has flu; h_2: patient does not have flu.
Data D: coughing = true, fever = true, smokes = true

Naive Bayes (diagram): flu is the cause node, with smokes, cough, and fever as its effect nodes.

Full joint probability distribution: In principle, the full joint distribution can be used to answer any question about probabilities over these variables. However, the size of the full joint distribution scales exponentially with the number of variables, so it is expensive to store and to compute with.

For the four Boolean variables, the table has 16 entries p1, ..., p16, which sum to 1:

smokes:
                cough             ¬cough
          fever   ¬fever     fever   ¬fever
  flu      p1      p2          p3      p4
  ¬flu     p5      p6          p7      p8

¬smokes:
                cough             ¬cough
          fever   ¬fever     fever   ¬fever
  flu      p9      p10         p11     p12
  ¬flu     p13     p14         p15     p16
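
For instance, a query such as P(flu | fever) can be answered by summing entries of this table. A small sketch (the p1, ..., p16 values are unspecified placeholders on the slide, so made-up numbers are used here purely to make the mechanics concrete):

```python
import random
from itertools import product

random.seed(0)

# Full joint over four Boolean variables, keyed by (flu, smokes, cough, fever).
# The slide's p1..p16 are placeholders, so arbitrary values are generated and normalized.
entries = {key: random.random() for key in product([True, False], repeat=4)}
total = sum(entries.values())
joint = {key: value / total for key, value in entries.items()}   # entries sum to 1

NAMES = ("flu", "smokes", "cough", "fever")

def prob(**conditions):
    """Marginal probability of the given settings, obtained by summing joint entries."""
    return sum(p for assignment, p in joint.items()
               if all(assignment[NAMES.index(var)] == val for var, val in conditions.items()))

# Any conditional follows from marginals, e.g. P(flu | fever) = P(flu, fever) / P(fever)
print(prob(flu=True, fever=True) / prob(fever=True))
```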

Bayesian networks ("graphical models"): The idea is to represent the dependencies (or causal relations) among all the variables so that space and computation-time requirements are minimized. (Diagram: smokes and flu are parents of cough; flu is also the parent of fever.)

Conditional probability tables for each node:
  flu:    P(flu = true) = 0.01,  P(flu = false) = 0.99
  smoke:  P(smoke = true) = 0.2, P(smoke = false) = 0.8
  cough:  P(cough | smoke, flu), one row for each of the four combinations of smoke and flu (entries not recoverable here)
  fever:  P(fever | flu), one row each for flu = true and flu = false (entries not recoverable here)

Semantics of Bayesian networks: If the network is correct, the full joint probability distribution can be calculated from the network:

P(x_1, ..., x_n) = ∏_i P(x_i | parents(X_i))

where parents(X_i) denotes the specific values taken by the parents of X_i.

Example: Calculate the probability of one complete assignment, e.g. P(flu = true, smoke = false, cough = true, fever = true), by multiplying the corresponding entries of the conditional probability tables.
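
A sketch of that computation in code (the flu and smoke priors come from the slide above; the cough and fever conditional probabilities were not preserved in the transcript, so the values below are assumed placeholders):

```python
# Priors from the slide; the conditional tables for cough and fever use assumed
# placeholder values, since the transcript does not preserve them.
p_flu = {True: 0.01, False: 0.99}
p_smoke = {True: 0.2, False: 0.8}
p_cough_given = {  # (smoke, flu) -> P(cough = True | smoke, flu); assumed values
    (True, True): 0.95, (True, False): 0.6,
    (False, True): 0.8, (False, False): 0.1,
}
p_fever_given = {True: 0.9, False: 0.05}  # P(fever = True | flu); assumed values

def joint(flu, smoke, cough, fever):
    """P(flu, smoke, cough, fever) via the chain rule over the network."""
    p = p_flu[flu] * p_smoke[smoke]
    p *= p_cough_given[(smoke, flu)] if cough else 1 - p_cough_given[(smoke, flu)]
    p *= p_fever_given[flu] if fever else 1 - p_fever_given[flu]
    return p

print(joint(flu=True, smoke=False, cough=True, fever=True))
# 0.01 * 0.8 * 0.8 * 0.9 = 0.00576 under the assumed values
```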

Another (famous, though weird) Example: a two-node network, Rain → Wet grass. Question: If you observe that the grass is wet, what is the probability that it rained?

Now with a Sprinkler node as well: Sprinkler → Wet grass ← Rain. Question: If you observe that the sprinkler is on, what is the probability that the grass is wet? (Predictive inference.)

Question: If you observe that the grass is wet, what is the probability that the sprinkler is on? (Diagnostic inference.) Note that the prior P(S) = 0.2, so knowing that the grass is wet increases the probability that the sprinkler is on.

Now assume the grass is wet and it rained. What is the probability that the sprinkler was on? Knowing that it rained decreases the probability that the sprinkler was on, given that the grass is wet ("explaining away").

Now add a Cloudy node as a parent of both Sprinkler and Rain (Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass ← Rain). Question: Given that it is cloudy, what is the probability that the grass is wet?

In general: if the network is correct, the full joint probability distribution can be calculated from the network as

P(x_1, ..., x_n) = ∏_i P(x_i | parents(X_i))

where parents(X_i) denotes the specific values of the parents of X_i. But we need efficient algorithms to answer queries with it (e.g., "belief propagation", "Markov chain Monte Carlo").
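
Before turning to those algorithms, the brute-force baseline is inference by enumeration: sum the chain-rule joint over all settings of the unobserved variables. A sketch on the cloudy/sprinkler/rain/wet-grass network (the CPT numbers below are the commonly used textbook values, e.g. in Russell & Norvig, not numbers taken from these slides):

```python
from itertools import product

# Textbook CPTs for the Cloudy -> {Sprinkler, Rain} -> WetGrass network
# (assumed standard values; the slides' own tables are not shown in the transcript).
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: 0.1, False: 0.5}                      # P(Sprinkler=T | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}                      # P(Rain=T | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,    # P(WetGrass=T | Sprinkler, Rain)
                (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """Chain-rule joint P(c, s, r, w)."""
    p = P_C[c]
    p *= P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p *= P_R_given_C[c] if r else 1 - P_R_given_C[c]
    p *= P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return p

def query(target, **evidence):
    """P(target = True | evidence) by enumerating all variable assignments."""
    names = ("c", "s", "r", "w")
    num = den = 0.0
    for values in product([True, False], repeat=4):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(**assignment)
        den += p
        if assignment[target]:
            num += p
    return num / den

print(query("s", w=True))          # diagnostic: P(Sprinkler | grass wet)
print(query("s", w=True, r=True))  # explaining away: lower, since rain already explains the wet grass
```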

Complexity of Bayesian Networks: For n random Boolean variables:
Full joint probability distribution: 2^n entries.
Bayesian network with at most k parents per node:
– Each conditional probability table: at most 2^k entries
– Entire network: at most n·2^k entries
For example, with n = 20 and k = 3, that is 2^20 ≈ 1,000,000 entries for the full joint versus at most 20·8 = 160 for the network.

What are the advantages of Bayesian networks?
– Intuitive, concise representation of the joint probability distribution (i.e., the conditional dependencies) of a set of random variables
– Represents "beliefs and knowledge" about a particular class of situations
– Efficient (?) (approximate) inference algorithms
– Efficient, effective learning algorithms

Issues in Bayesian Networks:
– Building / learning the network topology
– Assigning / learning the conditional probability tables
– Approximate inference via sampling

Real-World Example: The Lumière Project at Microsoft Research. A Bayesian network approach to answering user queries about Microsoft Office. "At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently." "As an example, users working with the Excel spreadsheet might have required assistance with formatting 'a graph'. Unfortunately, Excel has no knowledge about the common term, 'graph,' and only considered in its keyword indexing the term 'chart'."

Networks were developed by experts from user modeling studies.

An offspring of the project was the Office Assistant in Office 97, otherwise known as "Clippy".