Slide 1: What is Confidence? How to Handle Overfitting When Given Few Examples
Top Changwatchai, AIML seminar, 27 February 2001 (based on my ongoing research, Fall 2000)

Slide 2: Overview
– The problem of overfitting
– Bayesian network models
– Defining confidence

Slide 3: A Learning Problem
– Say we want to learn a classifier:
  – there is a fixed distribution of examples, each drawn independently
  – each example consists of a set of features
  – we are given a set of labeled examples drawn from the distribution
– Is our primary goal to fit these examples as well as possible?
  – No! We want to fit the underlying distribution.
– Overfitting: finding a hypothesis that fits the training examples better than some other hypothesis, but fits the underlying distribution worse than that hypothesis.
– Moral: focusing only on fitting the training data exposes you to the danger of overfitting.

Slide 4: Handling Overfitting
– Common approaches:
  – Assume you have enough examples, so that statistical anomalies are minimized
  – Smoothing (handling unlikely examples)
  – Other statistical methods, such as partitioning the data into training/test sets
  – Other heuristics/assumptions
– My constraints:
  – Very small number of examples (Shh! The ultimate goal is incorporating domain knowledge...)
  – If approximations are necessary, make only those that can be directly, quantitatively justified
  – Want a quantitative measure of overfitting

Slide 5: Bayesian Networks
– Notes:
  – BN's are simply an example application
  – Focusing on inverted-tree BN's used as classifiers
– Quick overview:
  – Nodes represent features (assume Boolean); the bottom node represents the label
  – Links represent direct dependence; absence of a link represents lack of direct dependence
  – Conditional Probability Tables (CPT's): each node has 2^Pa entries (Pa = number of parents)
  – It is fairly straightforward to learn the CPT entries and to perform inference (sketched below)
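
As a concrete illustration of the last bullet (not from the talk), here is a minimal Python sketch of both operations for the simplest inverted-tree classifier: two Boolean features that are both parents of a Boolean label node. The training examples, the Laplace smoothing constant, and the function names are illustrative assumptions.

    from collections import Counter

    # Hypothetical training data: (feature1, feature2, label), all Boolean.
    examples = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 1, 1)]

    # Learn the label's CPT: one entry per parent configuration (2^Pa = 4 here).
    counts = Counter()         # (f1, f2, label) -> count
    parent_counts = Counter()  # (f1, f2) -> count
    for f1, f2, y in examples:
        counts[(f1, f2, y)] += 1
        parent_counts[(f1, f2)] += 1

    def cpt(f1, f2, y, alpha=1.0):
        # P(label = y | f1, f2), with Laplace smoothing for unseen configurations.
        return (counts[(f1, f2, y)] + alpha) / (parent_counts[(f1, f2)] + 2 * alpha)

    # Inference in this structure is just a lookup in the learned CPT.
    def predict(f1, f2):
        return 1 if cpt(f1, f2, 1) >= 0.5 else 0

    print(cpt(1, 1, 1))   # learned P(label=1 | f1=1, f2=1)
    print(predict(1, 0))  # predicted label for a new example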

Slide 6: Bayesian Network Structures
– Structure determines expressiveness:
  – More parents per node = more expressive
  – Expressiveness is directly related to the total number of CPT entries in the BN (see the sketch below)
– "Bayesian networks don't overfit":
  – Given a BN structure and a set of training examples, there is a way of choosing CPT entries that fits the training examples as well as possible
  – Since we're given the structure, we must assume that the best fit to the training examples is also the best fit to the underlying distribution
– However:
  – Manually building BN structures is a lot of work
  – We'd like to not only learn the CPT entries, but also learn the correct structures
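
Since expressiveness is measured by the total number of CPT entries, it can be computed directly from a structure. A small sketch, assuming the structure is represented as a mapping from each Boolean node to its list of parents (the example structure is hypothetical):

    # Hypothetical structure: each node maps to its parents; all nodes are Boolean.
    structure = {"f1": [], "f2": [], "f3": [], "label": ["f1", "f2", "f3"]}

    def total_cpt_entries(structure):
        # Each Boolean node has 2^(number of parents) CPT entries.
        return sum(2 ** len(parents) for parents in structure.values())

    print(total_cpt_entries(structure))  # 1 + 1 + 1 + 8 = 11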

Slide 7: BN's and Overfitting
– Choosing the "correct" structure is where overfitting becomes a problem
– If the goal is to maximize accuracy on the training data, then we always prefer more expressive networks
  – In our inverted-tree classifiers, this would be a naïve Bayes structure
– Unfortunately, the more expressive the network, the greater the tendency to overfit for a fixed number of training examples
– Intuition:
  – Fitting curves to data points
  – "Spending" examples to increase confidence
– Current approaches to addressing overfitting (BIC, AIC, MDL, etc.):
  – Each network structure is given a two-part "score": accuracy (the more accurate, the better) and expressiveness (the fewer CPT entries, the better); see the sketch below
  – I think these rely on the assumption that we have sufficiently many examples
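
To make the two-part score concrete, here is a sketch of one common instance of the idea, a BIC-style score (a standard formulation, not necessarily the speaker's): training-set log-likelihood as the accuracy term, with a penalty that grows with the number of free CPT parameters. The numbers passed in are made up.

    import math

    def bic_score(log_likelihood, num_cpt_params, num_examples):
        # Higher is better: accuracy term minus an expressiveness penalty.
        return log_likelihood - 0.5 * num_cpt_params * math.log(num_examples)

    # Hypothetical comparison: a more expressive structure fits the N = 20 training
    # examples slightly better but pays a larger penalty for its extra CPT entries.
    print(bic_score(log_likelihood=-12.0, num_cpt_params=3,  num_examples=20))
    print(bic_score(log_likelihood=-10.5, num_cpt_params=11, num_examples=20))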

Slide 8: Confidence
– Recall our needs: given very few examples, we want a quantitative measure of overfitting that is as exact as possible
– Intuitive definition of the "confidence" of a given BN structure:
  – the probability that we have seen enough examples to either accept or reject this structure
– Confidence and accuracy:
  – Low confidence: need more examples
  – High confidence, low accuracy: reject this structure
  – High confidence, high accuracy: accept this structure
– Sadly, there is not enough time to cover my definition of confidence for an inverted-tree Bayesian network classifier
  – Coincidentally, I have run into certain technical difficulties in realizing a practical algorithm for evaluating this confidence
  – See me afterward for discussion

Slide 9: A New Problem Domain
– Goal at the end of this section: motivate a quantitative definition of confidence for a single-node "Bayesian network" (each example has no features other than its Boolean label)
– Coin-flipping domain:
  – There are k coins; coin i has weight w_i (its probability of coming up heads)
  – One of these coins is picked at random (the prior probability of picking coin i is p_i)
  – The picked coin is flipped N times, and we observe heads H times
  – Assuming we know the w_i's, p_i's, H, N, and the experimental setup, we can calculate the probability that the next toss of the coin is heads (sketched below)
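
A sketch of that calculation under the stated setup, applying Bayes' rule over the k coins; the particular weights, priors, and counts below are made up for illustration.

    # Hypothetical coins: weights w_i (probability of heads) and priors p_i.
    weights = [0.3, 0.5, 0.8]
    priors  = [0.2, 0.5, 0.3]
    H, N = 7, 10  # observed 7 heads in 10 flips

    # Posterior over coins: P(coin i | H, N) is proportional to p_i * w_i^H * (1 - w_i)^(N - H).
    unnorm = [p * (w ** H) * ((1 - w) ** (N - H)) for w, p in zip(weights, priors)]
    posterior = [u / sum(unnorm) for u in unnorm]

    # Probability that the next toss is heads: p_heads = sum_i w_i * P(coin i | H, N).
    p_heads = sum(w * q for w, q in zip(weights, posterior))
    print(posterior, p_heads)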

Slide 10: Confidence in Coin Flipping
– How do we define confidence? First, we need to define our decision algorithm. In this case it's easy:
  – if p_heads ≥ 0.5, then we predict "heads"
  – if p_heads < 0.5, then we predict "tails"
– Define confidence as follows:
  – Our confidence in our decision is the probability that, if we saw an arbitrarily large (infinite) number of tosses, we would still make the same decision
  – Seeing an infinite number of tosses is tantamount to knowing the weight of the coin (w_i)
  – In other words, confidence = Prob(make the same decision | know the coin's weight)
– Subtle point: we don't know the coin's weight, but we speculate that we do
  – Alternative point of view: say Tasha is in the next room. She knows everything we know (the w_i's, p_i's, H, N, and the experimental setup). In addition, she knows the weight w_i of the coin that was picked. Her decision is likewise simple: if w_i ≥ 0.5, predict "heads"; if w_i < 0.5, predict "tails"
  – Then we can restate the definition: confidence = Prob(we make the same decision as Tasha)

Slide 11: An Equation for Confidence
– To repeat: confidence = Prob(we make the same decision as Tasha)
– WLOG, let's say that after calculating p_heads = Prob(heads | H, N) we pick heads (i.e., p_heads ≥ 0.5)
  – Then confidence = Prob(Tasha also picked heads)
  – In other words, confidence = sum over all coins i with w_i ≥ 0.5 of P(coin i | H, N)
  – where, by Bayes' rule, P(coin i | H, N) = p_i · w_i^H · (1 - w_i)^(N-H) / Σ_j p_j · w_j^H · (1 - w_j)^(N-H)
  – Recall that p_heads = Σ_i w_i · P(coin i | H, N)
– Thus if we define a random variable X that takes the value w_i with probability P(coin i | H, N), then p_heads = E(X) and confidence = Prob(X ≥ 0.5)
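
The same quantities expressed as code (a sketch; the coin weights, priors, and counts are again made up): X takes the value w_i with probability P(coin i | H, N), so p_heads is its mean and confidence is the posterior mass on the same side of 0.5 as our decision.

    weights = [0.3, 0.5, 0.8]   # hypothetical w_i
    priors  = [0.2, 0.5, 0.3]   # hypothetical p_i
    H, N = 7, 10

    # Posterior P(coin i | H, N) via Bayes' rule.
    unnorm = [p * (w ** H) * ((1 - w) ** (N - H)) for w, p in zip(weights, priors)]
    posterior = [u / sum(unnorm) for u in unnorm]

    p_heads = sum(w * q for w, q in zip(weights, posterior))   # = E(X)
    decision = "heads" if p_heads >= 0.5 else "tails"

    # Confidence = probability that Tasha, who knows the picked coin's weight,
    # makes the same decision we do.
    if decision == "heads":
        confidence = sum(q for w, q in zip(weights, posterior) if w >= 0.5)  # Prob(X >= 0.5)
    else:
        confidence = sum(q for w, q in zip(weights, posterior) if w < 0.5)   # Prob(X < 0.5)
    print(decision, confidence)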

Slide 12: Returning to the Single-Node Network
– Coin-flipping is a discrete domain (a discrete set of coins); the results generalize to the continuous case
– Consider our Boolean-valued labeled examples:
  – Underlying distribution (which we are trying to learn): a single number w_0, the probability that a given example will be labeled true
  – We observe N examples, with H of them labeled true
– Let W be a random variable with the prior distribution of the weight w_0, and let X be a random variable with the posterior distribution of w_0 given H and N
– It can be shown that if W has a beta distribution, then X also has a beta distribution. In particular, if W is uniform (we have no information about the prior), then X ~ Beta(H+1, N-H+1)
– From the properties of the beta distribution, the predicted probability that a new example is labeled true is E(X) = (H+1)/(N+2). Note this is not H/N!
– As before, confidence = Prob(X ≥ 0.5)
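
A sketch of this continuous, uniform-prior case using scipy.stats; the counts H and N are illustrative.

    from scipy.stats import beta

    H, N = 3, 4  # e.g., 3 of 4 examples labeled true

    # Posterior over the unknown weight w_0 under a uniform prior: Beta(H+1, N-H+1).
    X = beta(H + 1, N - H + 1)

    p_true = X.mean()            # = (H + 1) / (N + 2), not H / N
    decision = "true" if p_true >= 0.5 else "false"

    # Confidence = Prob(X >= 0.5) if we predicted "true", else Prob(X < 0.5).
    confidence = X.sf(0.5) if decision == "true" else X.cdf(0.5)
    print(p_true, decision, confidence)

With H = 3 and N = 4 this gives E(X) = 4/6 ≈ 0.67 (not 3/4) and a confidence of about 0.81, illustrating how few examples leave the decision relatively uncertain.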

Slide 13: Final Notes
– We have made few assumptions about the data (for example, N can be small)
– We have come up with an exact, quantitative expression for confidence (although it may be difficult to evaluate)
– The analysis extends (not trivially) to the multivariate case (more than one node in the BN)
– Defining confidence can be an important first step toward dealing with overfitting when given few examples (I haven't shown the next few steps)

Slide 14: Summary
– Overfitting is bad
  – Overfitting is an issue any time we learn from examples
  – Often we make assumptions that allow us to assume we don't overfit
  – At the very least, we should be aware of these assumptions when we do learning
– Too much expressiveness is bad
  – Limiting expressiveness (introducing bias) not only helps to reduce the number of examples needed to learn, but also reduces the tendency to overfit
– You can quantify overfitting
  – I'm not aware of any other efforts in this direction, but it is doable and may prove useful, especially in reducing reliance on assumptions
  – To do so, you must clearly define your learning goals (not just the concept to be learned)
  – In this presentation, we define and use "confidence"