Published by Ashlynn Sims. Modified over 9 years ago.
Slide 1: What is Confidence? How to Handle Overfitting When Given Few Examples
Top Changwatchai
AIML seminar, 27 February 2001
(based on my ongoing research, Fall 2000)
Slide 2: Overview
The problem of overfitting
Bayesian network models
Defining confidence
Slide 3: A Learning Problem
Say we want to learn a classifier
  – fixed distribution of examples, each drawn independently
  – example consists of set of features
  – given a set of labeled examples drawn from the distribution
Is our primary goal to fit these examples as well as possible?
  – No!
  – We want to fit the underlying distribution
Overfitting
  – finding a hypothesis which fits the training examples better than some other hypothesis, but which fits the underlying distribution worse than that hypothesis
Moral
  – focusing only on fitting the training data exposes you to the danger of overfitting
Slide 4: Handling Overfitting
Approaches
  – Assume you have enough examples
    – Statistical anomalies are minimized
  – Smoothing (handling unlikely examples)
  – Other statistical methods, such as partitioning into training/test data
  – Other heuristics/assumptions
My constraints
  – Very small number of examples
    – Shh! Ultimate goal is incorporating domain knowledge...
  – If approximations are necessary, make only those that can be directly, quantitatively justified
  – Want a quantitative measure of overfitting
Slide 5: Bayesian Networks
Notes
  – BN’s are simply an example application
  – Focusing on inverted-tree BN’s used as classifiers
Quick overview
  – Nodes represent features (assume Boolean)
    – Bottom node represents label
  – Links represent direct dependence
    – Absence of link represents lack of direct dependence
  – Conditional Probability Tables (CPT’s)
    – Each node has $2^{Pa}$ entries (Pa = # parents)
  – It is fairly straightforward to:
    – Learn CPT entries
    – Perform inference
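The CPT-size rule above can be sketched in a few lines. This is only an illustration: the dictionary representation (node name mapped to its list of parents) and the example structure are assumptions made here, not part of the talk.

```python
def cpt_entries(parents):
    """Total number of CPT entries in a Boolean Bayesian network.

    `parents` maps each node name to the list of its parent nodes
    (a hypothetical representation chosen for this sketch).
    A Boolean node with Pa parents contributes 2**Pa entries.
    """
    return sum(2 ** len(pa) for pa in parents.values())


# Hypothetical inverted-tree classifier: three feature nodes feed the label node.
structure = {"f1": [], "f2": [], "f3": [], "label": ["f1", "f2", "f3"]}
total = cpt_entries(structure)  # 1 + 1 + 1 + 8 entries
print(total)
```

Adding a parent to a node doubles that node's table, which is why the entry count is a natural proxy for expressiveness.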
Slide 6: Bayesian Network Structures
Structure determines expressiveness
  – More parents per node = more expressive
  – Directly related to the total number of CPT entries in the BN
“Bayesian networks don’t overfit”
  – Given:
    – A BN structure
    – A set of training examples
  – There is a way of choosing CPT entries which fits the training examples as well as possible
  – Since we’re given the structure, we must assume that the best fit to the training examples is also the best fit to the underlying distribution
However:
  – Manually building BN structures is a lot of work
  – We’d like to not only learn the CPT entries, but also learn the correct structures
Slide 7: BN’s and Overfitting
Choosing the “correct” structure is where overfitting becomes a problem
If the goal is to maximize accuracy on the training data, then we always prefer more expressive networks
  – In our inverted-tree classifiers, it would be a naïve Bayes structure
Unfortunately, the more expressive the network, the greater the tendency to overfit for a fixed number of training examples
Intuition:
  – Fitting curves to data points
  – “Spending” examples to increase confidence
Current approaches in addressing overfitting
  – BIC, AIC, MDL, etc.
  – Each network structure is given a two-part “score”
    – Accuracy (the more accurate, the better)
    – Expressiveness (the fewer CPT entries, the better)
  – I think these rely on assumption that we have sufficiently many examples
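A minimal sketch of such a two-part score, using one common form of BIC (log-likelihood minus a penalty of (d/2) ln N). Treating d as the number of CPT entries follows the slide's description and is an assumption of this sketch; the log-likelihood values below are made-up inputs for illustration, not results from the talk.

```python
import math


def bic_score(log_likelihood, num_cpt_entries, num_examples):
    """One common form of the BIC score: higher is better.

    Accuracy term: the training log-likelihood.
    Expressiveness term: a penalty that grows with the number of parameters
    (taken here, as an assumption, to be the number of CPT entries).
    """
    penalty = 0.5 * num_cpt_entries * math.log(num_examples)
    return log_likelihood - penalty


# Hypothetical inputs: the larger structure fits the 20 examples better,
# but its penalty outweighs the gain.
simple = bic_score(log_likelihood=-10.0, num_cpt_entries=3, num_examples=20)
expressive = bic_score(log_likelihood=-8.0, num_cpt_entries=11, num_examples=20)
print(simple > expressive)
```

The penalty term comes from a large-sample approximation, which is consistent with the slide's concern that these scores presume sufficiently many examples.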
Slide 8: Confidence
Recall our needs:
  – Given very few examples, we want a
  – Quantitative measure of overfitting that is
  – As exact as possible
Intuitive definition of “confidence” of a given BN structure
  – Probability that we have seen enough examples to either accept or reject this structure
Confidence and accuracy
  – Low confidence: need more examples
  – High confidence, low accuracy: reject this structure
  – High confidence, high accuracy: accept this structure
Sadly, there is not enough time to cover my definition of confidence for an inverted-tree Bayesian network classifier
  – Coincidentally, I have run into certain technical difficulties in realizing a practical algorithm for evaluating this confidence
  – See me afterward for discussion
Slide 9: A New Problem Domain
Goal at the end of this section:
  – Motivate a quantitative definition of confidence of a single-node “Bayesian network” (each example has no other features except for its Boolean label)
Coin-flipping domain
  – k coins
    – Coin i has weight $w_i$ (probability of getting heads)
    – One of these coins is picked at random (prior probability of picking coin i is $p_i$)
  – This coin is flipped N times, and we observe heads H times
  – Assuming we know the $w_i$’s, $p_i$’s, H, N, and the experimental setup, we can calculate the probability that the next toss of the coin is heads
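The calculation mentioned in the last bullet can be sketched as follows, assuming the standard Bayes'-rule treatment: compute the posterior probability of each coin given H heads in N tosses, then average the weights under that posterior. The function names and the three-coin example are illustrative assumptions, not taken from the talk.

```python
def coin_posterior(weights, priors, heads, tosses):
    """P(coin i | H, N) by Bayes' rule.

    The binomial coefficient is the same for every coin, so it cancels.
    Assumes at least one coin gives the observed data nonzero likelihood.
    """
    unnorm = [p * w ** heads * (1 - w) ** (tosses - heads)
              for w, p in zip(weights, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def prob_next_heads(weights, priors, heads, tosses):
    """Probability that the next toss is heads: sum_i w_i * P(coin i | H, N)."""
    post = coin_posterior(weights, priors, heads, tosses)
    return sum(w * q for w, q in zip(weights, post))


# Hypothetical setup: three coins, equal priors, 3 heads observed in 4 tosses.
weights = [0.2, 0.5, 0.8]
priors = [1 / 3, 1 / 3, 1 / 3]
post = coin_posterior(weights, priors, heads=3, tosses=4)
p_heads = prob_next_heads(weights, priors, heads=3, tosses=4)
print(post, p_heads)
```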
Slide 10: Confidence in Coin Flipping
How do we define confidence?
First, we need to define our decision algorithm
In this case it’s easy:
  – if $p_{heads} \ge 0.5$, then we predict “heads”
  – if $p_{heads} < 0.5$, then we predict “tails”
Define confidence as follows:
  – Our confidence in our decision is the probability that if we saw an arbitrarily large (infinite) number of tosses, we would still make the same decision
  – Seeing an infinite number of tosses is tantamount to knowing what the weight of the coin ($w_i$) is
  – In other words, confidence = Prob(make the same decision | know the coin’s weight)
Subtle point: we don’t know the coin’s weight, but we speculate that we do
  – Alternative POV: say Tasha is in the next room. She knows everything we know ($w_i$’s, $p_i$’s, H, N, experimental setup). In addition, she knows the weight of the coin ($w_i$) that was picked. Her decision is likewise simple:
    – if $w_i \ge 0.5$, then predict “heads”
    – if $w_i < 0.5$, then predict “tails”
  – Then we can restate the definition: confidence = Prob(we make the same decision as Tasha)
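Slide 11: An Equation for Confidence
To repeat:
  – confidence = Prob(we make the same decision as Tasha)
WLOG, let’s say, after calculating $p_{heads}$ = Prob(heads | H, N), we pick heads (i.e., $p_{heads} \ge 0.5$)
  – Then confidence = Prob(Tasha also picked heads)
  – In other words: $\text{confidence} = \sum_{i:\, w_i \ge 0.5} P(\text{coin } i \mid H, N)$
  – where $P(\text{coin } i \mid H, N) = \dfrac{p_i \, w_i^{H} (1 - w_i)^{N-H}}{\sum_j p_j \, w_j^{H} (1 - w_j)^{N-H}}$
  – Recall $p_{heads} = \sum_i w_i \, P(\text{coin } i \mid H, N)$
Thus if we define a random variable X that takes the value $w_i$ with probability P(coin i | H, N), then $p_{heads}$ = E(X) and confidence = Prob($X \ge 0.5$)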
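A minimal sketch of this definition for the discrete coin domain, assuming the Bayes'-rule posterior over coins. It also covers the "tails" case that the slide sets aside with WLOG, by summing the posterior over the coins for which Tasha would agree with our prediction. The function name and example values are hypothetical.

```python
def coin_confidence(weights, priors, heads, tosses):
    """Return (p_heads, confidence) for the coin-flipping domain.

    confidence = Prob(we make the same decision as someone who knows w_i),
    i.e. the posterior mass of coins on the same side of 0.5 as p_heads.
    Assumes at least one coin gives the observed data nonzero likelihood.
    """
    unnorm = [p * w ** heads * (1 - w) ** (tosses - heads)
              for w, p in zip(weights, priors)]
    total = sum(unnorm)
    post = [u / total for u in unnorm]           # P(coin i | H, N)

    p_heads = sum(w * q for w, q in zip(weights, post))
    we_predict_heads = p_heads >= 0.5
    # Tasha predicts heads exactly when w_i >= 0.5.
    confidence = sum(q for w, q in zip(weights, post)
                     if (w >= 0.5) == we_predict_heads)
    return p_heads, confidence


# Hypothetical setup: three coins with equal priors.
weights = [0.2, 0.5, 0.8]
priors = [1 / 3, 1 / 3, 1 / 3]
p_h, conf_h = coin_confidence(weights, priors, heads=3, tosses=4)  # predicts heads
p_t, conf_t = coin_confidence(weights, priors, heads=0, tosses=4)  # predicts tails
print(p_h, conf_h, p_t, conf_t)
```

Note that the confidence depends on the whole posterior, not only on its mean: two posteriors with the same p_heads can give very different confidence values.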
Slide 12: Returning to Single-Node Network
Coin-flipping is a discrete domain (discrete set of coins)
Results generalize to the continuous case
Consider our Boolean-valued labeled examples
  – Underlying distribution (which we are trying to learn): a single number $w_0$, which is the probability that a given example will be labeled true
  – We observe N examples, with H of them labeled true
Let W be a random variable corresponding to the prior probability of the weight $w_0$
Let X be a random variable representing the posterior probability of the weight $w_0$ given H and N
It can be shown that if W has a beta distribution, then X also has a beta distribution. In particular, if W is uniform (we have no information about the prior probability), then $X \sim \mathrm{Beta}(H+1,\; N-H+1)$
From the properties of a beta distribution, we see that $p_{heads} = E(X) = \dfrac{H+1}{N+2}$
Note this is not H/N!
As before, confidence = Prob($X \ge 0.5$)
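For the uniform-prior case, the confidence can be evaluated exactly with the standard library alone. The sketch below relies on a known identity for beta distributions with integer parameters: for X ~ Beta(H+1, N-H+1), Prob(X >= 0.5) equals the probability that a Binomial(N+1, 0.5) variable is at most H. The function name is hypothetical, and handling the "tails" case by taking the complement is an assumption that mirrors the coin-flipping definition.

```python
from math import comb


def beta_confidence(heads, tosses):
    """Return (p_heads, confidence) under a uniform prior on the weight.

    The posterior is Beta(H+1, N-H+1), with mean (H+1)/(N+2).
    """
    p_heads = (heads + 1) / (tosses + 2)

    # Prob(X >= 0.5) for X ~ Beta(H+1, N-H+1), via the beta/binomial identity:
    # it equals Prob(Binomial(N+1, 0.5) <= H).
    p_ge_half = sum(comb(tosses + 1, j) for j in range(heads + 1)) / 2 ** (tosses + 1)

    # If we predict heads, we agree with the true weight when X >= 0.5;
    # if we predict tails, when X < 0.5.
    confidence = p_ge_half if p_heads >= 0.5 else 1 - p_ge_half
    return p_heads, confidence


p1, c1 = beta_confidence(heads=1, tosses=1)   # posterior Beta(2, 1)
p3, c3 = beta_confidence(heads=3, tosses=3)   # posterior Beta(4, 1)
print(p1, c1, p3, c3)
```

With a single example labeled true, the prediction is "true" with p_heads = 2/3 rather than H/N = 1, and the confidence is 3/4; three out of three raises the confidence to 15/16.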
Slide 13: Final Notes
We have made few assumptions about the data (for example, N can be small)
We have come up with an exact, quantitative expression for confidence (although it may be difficult to evaluate)
Analysis extends (not trivially) to multivariate case (more than one node in BN)
Defining confidence can be an important first step to dealing with overfitting when given few examples (I haven’t shown the next few steps)
Slide 14: Summary
Overfitting is bad
  – Overfitting is an issue any time we do learning from examples
  – Often we make assumptions which allow us to assume we don't overfit
  – At the very least, we should be aware of these assumptions when we do learning
Too much expressiveness is bad
  – Limiting expressiveness (introducing bias) not only helps to reduce the number of examples needed to learn, but also reduces tendency to overfit
You can quantify overfitting
  – I'm not aware of any other efforts in this direction, but it is doable and may prove useful, especially in reducing reliance on assumptions
  – To do so, you must clearly define your learning goals (not just the concept to be learned)
  – In this presentation, we define and use "confidence"