Slide 1: Score Functions for Data Mining Algorithms
Based on Chapter 7 of Hand, Mannila & Smyth
David Madigan
Slide 2: Introduction
- Example: how to pick the "best" a and b in Y = aX + b; the usual score in this case is the sum of squared errors (see the sketch below)
- Scores for patterns versus scores for models
- Predictive versus descriptive scores
- Typical scores are a poor substitute for utility-based scores
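A minimal sketch of this example (not from the slides; numpy and the synthetic data are my assumptions), showing the sum-of-squared-errors score and the least-squares choice of a and b:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from Y = 2X + 1 plus noise (values are illustrative).
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)

def sse(a, b):
    """Sum-of-squared-errors score for a candidate line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

# The "best" a and b minimize the SSE; for a straight line this is
# ordinary least squares, computed here via polyfit.
a_hat, b_hat = np.polyfit(x, y, deg=1)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, SSE = {sse(a_hat, b_hat):.1f}")
```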
Slide 3: Predictive Model Scores
- Assume all observations are equally important
- Depend on differences rather than values
- Symmetric
- Proper scoring rules (a numerical check follows below)
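Propriety can be checked numerically. A small sketch (my own illustration, assuming numpy; the true probability 0.7 is arbitrary) showing that both the Brier (squared-error) score and the log score are minimized in expectation by reporting the true probability:

```python
import numpy as np

# A scoring rule is "proper" if reporting the true probability
# minimizes the expected score.  Check this on a grid for the
# Brier (squared-error) and log scores with true p = 0.7.
p_true = 0.7
q = np.linspace(0.01, 0.99, 99)  # candidate reported probabilities

# Expected score under the truth: E[S(q, Y)] with Y ~ Bernoulli(p_true).
brier = p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2
logsc = -(p_true * np.log(q) + (1 - p_true) * np.log(1 - q))

print("Brier score minimized at q =", q[np.argmin(brier)])
print("Log score minimized at q =", q[np.argmin(logsc)])  # both ~0.7
```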
Slide 4: Descriptive Model Scores
- "Pick the model that assigns highest probability to what actually happened" (illustrated below)
- Many different scoring rules for non-probabilistic models
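A sketch of this idea (my own example, assuming numpy and scipy; the data and the two crudely fitted candidate densities are illustrative), scoring each model by the log-probability it assigns to the observed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.standard_t(df=3, size=200)  # heavy-tailed data

# Score each candidate model by the log-probability it assigns to
# what actually happened, i.e. sum of log densities (higher is better).
models = {
    "normal": stats.norm(loc=data.mean(), scale=data.std()),
    "t(df=3)": stats.t(df=3, loc=np.median(data), scale=1.0),
}
for name, m in models.items():
    print(name, "log p(D|M) =", m.logpdf(data).sum().round(1))
```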
Slide 5: Scoring Models with Different Complexities
- Bias-variance tradeoff: MSE = variance + bias^2
- score(model) = error(model) + penalty-function(model)
- AIC: score = 2 S_L + 2d, where S_L is the negative log-likelihood at the MLE and d is the number of parameters (sketch below)
- Selects overly complex models?
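A sketch of AIC-based selection (my own example, assuming numpy), comparing polynomial regression models of increasing degree on data that are truly linear; S_L is the Gaussian negative log-likelihood at the MLE:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)  # truly linear

def aic(degree):
    coef = np.polyfit(x, y, degree)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid ** 2)                    # MLE of noise variance
    neg_loglik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # S_L
    d = degree + 2                                  # coefficients + variance
    return 2 * neg_loglik + 2 * d

for deg in range(1, 7):
    print(f"degree {deg}: AIC = {aic(deg):.1f}")
# AIC should favor a low degree here, though it is known to sometimes
# pick models more complex than the truth.
```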
Slide 6: Bayesian Criterion
- The Bayesian score for a model M is its marginal likelihood, p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ
- Typically impossible to compute analytically
- All sorts of Monte Carlo approximations
Slide 7: Laplace Method for p(D|M)
Write p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ = ∫ exp{n f(θ)} dθ, where f(θ) = (1/n) log[p(D|θ, M) p(θ|M)] (i.e., the log of the integrand divided by n).
Laplace's method expands f around its mode θ̃ (the posterior mode):
p(D|M) ≈ p(D|θ̃, M) p(θ̃|M) (2π)^{d/2} n^{-d/2} |−f''(θ̃)|^{-1/2},
where d = dim(θ) and f''(θ̃) is the Hessian of f at the mode.
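A sketch of the Laplace approximation (my own example, assuming numpy and scipy) for a Beta-Bernoulli model, chosen because the exact marginal likelihood is available for comparison; here h(θ) is the log of the integrand itself (n f(θ) in the slide's notation):

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(3)
n, theta_true = 50, 0.3
s = rng.binomial(n, theta_true)        # number of successes
a, b = 2.0, 2.0                        # Beta(2,2) prior (mode exists)

# h(theta) = log p(D|theta) + log p(theta), the log of the integrand.
A, B = s + a - 1, n - s + b - 1
def h(t):
    return A * np.log(t) + B * np.log(1 - t) - betaln(a, b)

t_mode = A / (A + B)                        # posterior mode
h2 = -A / t_mode**2 - B / (1 - t_mode)**2   # h''(t_mode)

# Laplace: integral of exp(h) ~ exp(h(mode)) * sqrt(2*pi / |h''(mode)|).
log_laplace = h(t_mode) + 0.5 * np.log(2 * np.pi / -h2)
log_exact = betaln(a + s, b + n - s) - betaln(a, b)
print(f"Laplace log p(D|M) = {log_laplace:.4f}")
print(f"exact   log p(D|M) = {log_exact:.4f}")
```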
Slide 8: Laplace cont.
- Tierney & Kadane (1986, JASA) show the approximation has error O(n^{-1})
- Using the MLE instead of the posterior mode is also O(n^{-1})
- Using the expected information matrix in place of the observed Hessian is O(n^{-1/2}), but convenient since it is often computed by standard software
- Raftery (1993) suggested approximating the posterior mode by a single Newton step starting at the MLE
- Note that the prior is explicit in these approximations
Slide 9: Monte Carlo Estimates of p(D|M)
Draw θ_1, …, θ_m iid from the prior p(θ) and estimate
p̂(D|M) = (1/m) Σ_{i=1}^m p(D|θ_i).
In practice this estimator has large variance (see the sketch below).
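A sketch of this naive estimator (my own example, assuming numpy and scipy) on the same Beta-Bernoulli setup, where the exact answer is known; repeated runs show the large variance:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(4)
n, s = 50, 15                      # data: 15 successes in 50 trials
a, b = 2.0, 2.0                    # Beta(2,2) prior

def loglik(t):
    return s * np.log(t) + (n - s) * np.log(1 - t)

# Naive estimator: average the likelihood over draws from the prior.
def naive_estimate(m):
    t = rng.beta(a, b, size=m)
    return np.mean(np.exp(loglik(t)))

exact = np.exp(betaln(a + s, b + n - s) - betaln(a, b))
reps = np.array([naive_estimate(1000) for _ in range(20)])
print("exact p(D|M):", exact)
print("run-to-run spread:", reps.min(), "to", reps.max())
```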
Slide 10: Monte Carlo Estimates of p(D|M) (cont.)
"Importance sampling": draw θ_1, …, θ_m iid from an importance density g*(θ), ideally close to the posterior p(θ|D), and estimate
p̂(D|M) = Σ_i w_i p(D|θ_i) / Σ_i w_i, where w_i = p(θ_i)/g*(θ_i).
g* need only be known up to a normalizing constant; taking g* = p(θ|D) gives the harmonic mean estimator on the next slide.
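A sketch (my own example, assuming numpy and scipy) using an importance density chosen by hand to roughly match the posterior; the spread across repeated runs is much smaller than for prior sampling:

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

rng = np.random.default_rng(5)
n, s = 50, 15                      # same Beta-Bernoulli setup as above
a, b = 2.0, 2.0

prior = stats.beta(a, b)
g = stats.beta(12, 28)             # hand-picked importance density roughly
                                   # matching the posterior, Beta(17, 37)

def is_estimate(m):
    t = g.rvs(size=m, random_state=rng)
    # g is fully normalized here, so the plain IS average suffices:
    # p_hat(D) = mean of p(D|t) * p(t) / g(t).
    log_w = (s * np.log(t) + (n - s) * np.log(1 - t)
             + prior.logpdf(t) - g.logpdf(t))
    return np.exp(log_w).mean()

exact = np.exp(betaln(a + s, b + n - s) - betaln(a, b))
reps = np.array([is_estimate(1000) for _ in range(20)])
print("exact:", exact)
print("IS run-to-run spread:", reps.min(), "to", reps.max())
```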
Slide 11: Monte Carlo Estimates of p(D|M) (cont.)
Newton and Raftery's "harmonic mean estimator": with θ_1, …, θ_m drawn from the posterior p(θ|D),
p̂(D|M) = [ (1/m) Σ_{i=1}^m p(D|θ_i)^{-1} ]^{-1}.
Unstable in practice and needs modification.
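A sketch of the harmonic mean estimator (my own example, assuming numpy and scipy), computed on the log scale with log-sum-exp for numerical stability; the run-to-run spread illustrates the instability:

```python
import numpy as np
from scipy.special import betaln, logsumexp

rng = np.random.default_rng(6)
n, s = 50, 15
a, b = 2.0, 2.0

def harmonic_mean_estimate(m):
    t = rng.beta(a + s, b + n - s, size=m)   # exact posterior draws
    loglik = s * np.log(t) + (n - s) * np.log(1 - t)
    # log of [ (1/m) sum_i 1/p(D|theta_i) ]^(-1), via log-sum-exp.
    return -(logsumexp(-loglik) - np.log(m))

exact = betaln(a + s, b + n - s) - betaln(a, b)
reps = np.array([harmonic_mean_estimate(1000) for _ in range(20)])
print("exact log p(D|M):", round(exact, 3))
print("harmonic-mean estimates:", np.round(reps, 3))
# The spread is driven by rare draws with tiny likelihood: the
# estimator can have infinite variance, hence "unstable in practice".
```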
Slide 12: p(D|M) from Gibbs Sampler Output
Chib (1995): suppose we decompose θ into (θ_1, θ_2) such that p(θ_1|D, θ_2) and p(θ_2|D, θ_1) are available in closed form.
First note the following identity, a rearrangement of Bayes' theorem that holds for any θ*:
p(D) = p(D|θ*) p(θ*) / p(θ*|D).
p(D|θ*) and p(θ*) are usually easy to evaluate. What about p(θ*|D)?
Slide 13: p(D|M) from Gibbs Sampler Output (cont.)
The Gibbs sampler gives (dependent) draws from p(θ_1, θ_2|D), and hence marginally from p(θ_2|D), so
p̂(θ_1*|D) = (1/m) Σ_{i=1}^m p(θ_1*|D, θ_2^(i)).
Together with the closed-form factor p(θ_2*|D, θ_1*), this yields p(θ*|D) = p(θ_1*|D) p(θ_2*|D, θ_1*), completing the identity.
Slide 14: What about 3 parameter blocks?
Factor p(θ*|D) = p(θ_1*|D) p(θ_2*|D, θ_1*) p(θ_3*|D, θ_1*, θ_2*). The first factor is estimated from the main Gibbs output as before, and the last is available in closed form. The middle factor is estimated by averaging p(θ_2*|D, θ_1*, θ_3) over draws θ_3 ~ p(θ_3|D, θ_1*); to get these draws, continue the Gibbs sampler, sampling in turn from p(θ_2|D, θ_1*, θ_3) and p(θ_3|D, θ_1*, θ_2). OK? (A sketch of the two-block case follows below.)
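A sketch of Chib's method in the two-block case (my own example, assuming numpy and scipy): a normal model with unknown mean mu and precision tau under conjugate priors, so both full conditionals are closed form; the prior parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(1.0, 2.0, size=100)            # data
n, ybar = len(y), y.mean()
m0, p0 = 0.0, 0.01                            # N(m0, 1/p0) prior on mu
a0, b0 = 2.0, 2.0                             # Gamma(a0, rate b0) prior on tau

def mu_cond(tau):                             # p(mu | D, tau), closed form
    prec = p0 + n * tau
    return stats.norm((p0 * m0 + tau * n * ybar) / prec, np.sqrt(1 / prec))

def tau_cond(mu):                             # p(tau | D, mu), closed form
    return stats.gamma(a0 + n / 2,
                       scale=1 / (b0 + 0.5 * np.sum((y - mu) ** 2)))

# Gibbs sampler (burn-in omitted for brevity).
G, mu, tau = 5000, ybar, 1.0
mus, taus = np.empty(G), np.empty(G)
for g in range(G):
    mu = mu_cond(tau).rvs(random_state=rng)
    tau = tau_cond(mu).rvs(random_state=rng)
    mus[g], taus[g] = mu, tau

mu_s, tau_s = np.median(mus), np.median(taus)  # any point theta* will do

# Chib's identity: log p(D) = log p(D|th*) + log p(th*) - log p(th*|D).
log_lik = stats.norm(mu_s, np.sqrt(1 / tau_s)).logpdf(y).sum()
log_prior = (stats.norm(m0, np.sqrt(1 / p0)).logpdf(mu_s)
             + stats.gamma(a0, scale=1 / b0).logpdf(tau_s))
# p(mu*|D): average the closed-form conditional over the Gibbs draws
# of tau, which are marginally from p(tau|D).
prec = p0 + n * taus
post_mu = stats.norm.pdf(mu_s, (p0 * m0 + taus * n * ybar) / prec,
                         np.sqrt(1 / prec)).mean()
log_post = np.log(post_mu) + tau_cond(mu_s).logpdf(tau_s)
print("Chib log p(D|M):", log_lik + log_prior - log_post)
```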
Slide 15: Bayesian Information Criterion
- BIC is an O(1) approximation to log p(D|M): log p(D|M) ≈ log p(D|θ̂) − (d/2) log n, equivalently BIC = 2 S_L + d log n (a numerical check follows below)
- Circumvents an explicit prior
- The approximation is O(n^{-1/2}) for a class of priors called "unit information priors"
- No free lunch (Weakliem (1998) example)
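A sketch comparing the BIC approximation with an exact log marginal likelihood (my own example, assuming numpy and scipy), using the Beta-Bernoulli model from earlier; the Beta(2,2) prior is illustrative and enters only the exact answer, not BIC:

```python
import numpy as np
from scipy.special import betaln

n, s = 50, 15
theta_hat = s / n                    # MLE
d = 1                                # one free parameter

loglik_hat = s * np.log(theta_hat) + (n - s) * np.log(1 - theta_hat)
bic_approx = loglik_hat - 0.5 * d * np.log(n)   # approx. log p(D|M)
exact = betaln(2 + s, 2 + n - s) - betaln(2, 2) # under a Beta(2,2) prior

print("BIC approximation:", round(bic_approx, 3))
print("exact log p(D|M) :", round(exact, 3))    # agree up to an O(1) error
```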
Slide 16: Score Functions on Hold-Out Data
- Instead of penalizing complexity, look at performance on hold-out data
- Note: even using hold-out data, performance results can be optimistically biased
Slide 17: [Figure: classification performance on hold-out data when the data are in fact noise; simulated below]
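A minimal simulation of this phenomenon (my own construction, assuming numpy, not the slide's exact experiment): with pure-noise labels, selecting the best of many candidate classifiers by hold-out accuracy yields an optimistic score that evaporates on fresh data:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, K = 200, 10, 500
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)          # labels are pure noise

# K random linear classifiers, all scored on the same hold-out set.
hold, test = slice(0, 100), slice(100, 200)
W = rng.normal(size=(K, p))             # K candidate models
pred = (X @ W.T > 0).astype(int)        # predictions of all K models

acc_hold = (pred[hold] == y[hold, None]).mean(axis=0)
best = np.argmax(acc_hold)              # select by hold-out accuracy
acc_test = (pred[test, best] == y[test]).mean()

print("best hold-out accuracy:", acc_hold[best])   # well above 0.5
print("same model, fresh data:", acc_test)         # back near 0.5
```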