A Prediction Interval for the Misclassification Rate. E.B. Laber & S.A. Murphy.


1 A Prediction Interval for the Misclassification Rate. E.B. Laber & S.A. Murphy

2 Outline
– Review
– Three challenges in constructing PIs
– Combining a statistical approach with a learning theory approach to constructing PIs
– Relevance to confidence measures for the value of a dynamic treatment regime

3 Review
– X is the vector of features in R^q; Y is the binary label in {-1, 1}
– Misclassification rate of a classifier f: τ(f) = P(Y ≠ sign(f(X)))
– Data: N iid observations of (Y, X)
– Given a space of classifiers F and the data, use some method to construct a classifier f̂
– The goal is to provide a PI for τ(f̂)

4 Review
– Since the 0-1 loss is not smooth, one commonly uses a smooth surrogate loss to estimate the classifier
– Surrogate loss: L(Y, f(X)), e.g. the squared-error surrogate L(Y, f(X)) = (1 - Y f(X))^2
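A minimal sketch of this contrast, using the squared-error surrogate (1 - Y f(X))^2 as the smooth stand-in for the 0-1 loss (the function names here are illustrative, not from the talk):

```python
import numpy as np

def zero_one_loss(y, fx):
    """Non-smooth 0-1 loss: 1 when sign(f(x)) disagrees with the label y."""
    return (y * fx <= 0).astype(float)

def squared_surrogate(y, fx):
    """Smooth squared-error surrogate L(y, f(x)) = (1 - y*f(x))^2."""
    return (1.0 - y * fx) ** 2

y = np.array([1, -1, 1])
fx = np.array([0.8, 0.3, -0.2])
print(zero_one_loss(y, fx))      # [0. 1. 1.]
print(squared_surrogate(y, fx))  # [0.04 1.69 1.44]
```

The surrogate is differentiable in f, so it can be minimized by standard methods; the 0-1 loss jumps at the decision boundary.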

5 Review
General approach to providing a PI:
– Estimate τ(f̂) using the data, resulting in an estimator τ̂
– Derive an approximate distribution for τ̂ - τ(f̂)
– Use this approximate distribution to construct a prediction interval for τ(f̂)

6 Review
A common choice for τ̂ is the resubstitution error, or training error: the empirical misclassification rate evaluated at f̂, τ̂ = (1/N) Σ_{i=1}^N 1{Y_i ≠ sign(f̂(X_i))}; e.g. if f̂(x) = sign(β̂_0 + β̂_1 x), then τ̂ is the fraction of training points this rule misclassifies.
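A concrete (illustrative) computation of the resubstitution error with a one-dimensional score function:

```python
import numpy as np

def resubstitution_error(f, X, Y):
    """Training (resubstitution) error: the fraction of the fitting data
    misclassified by the plug-in rule sign(f(x))."""
    return np.mean(np.sign(f(X)) != Y)

# hypothetical score f(x) = x on a 3-point training set
X = np.array([[1.0], [-1.0], [2.0]])
Y = np.array([1, 1, -1])
err = resubstitution_error(lambda x: x[:, 0], X, Y)
print(err)  # 2 of 3 points misclassified
```

Because the error is computed on the same data used for fitting, it is negatively biased for the true misclassification rate, which is exactly the over-fitting problem listed among the challenges.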

7 Three challenges
1) F is too large, leading to over-fitting and negative bias in τ̂
2) τ̂ is a non-smooth function of f
3) τ(f̂) may behave like an extreme quantity
No assumption that f̂ is close to optimal.

8 A Challenge
2) τ̂ is non-smooth. Example: the unknown optimal classifier has a quadratic decision boundary. We fit, by least squares, a linear decision boundary f(x) = sign(β_0 + β_1 x).
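A small simulation in this spirit (the data-generating details below are invented for illustration and are not the three-point distribution from the talk): the true boundary is quadratic, but we fit a linear rule by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1.0, 1.0, n)
y = np.where(x**2 > 0.25, 1.0, -1.0)  # quadratic truth: label by x^2 vs 0.25

# least-squares fit of the misspecified linear working model b0 + b1*x
X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

yhat = np.sign(b0 + b1 * x)
err = np.mean(yhat != y)
print("resubstitution error of the linear fit:", err)
```

Since a linear rule in x is monotone while the true labels are not, no choice of (b0, b1) classifies this sample perfectly; the fitted rule's error stays bounded away from zero.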

9 [Figures: density plots for the three-point distribution, n = 30 and n = 100]

10 Coverage of Bootstrap PI in Three Point Example (goal = 95%)

11 Coverage of Correctly Centered Bootstrap PI (goal = 95%)

12 Coverage of 95% PI (Three Point Example)

Sample Size | Bootstrap Percentile | Yang CV | CUD-Bound
         30 |                  .72 |     .75 |       .91
         50 |                  .82 |     .62 |       .92
        100 |                  .91 |     .46 |       .94
        200 |                  .97 |     .35 |       .95

13 Non-smooth
In general the distribution of τ̂ - τ(f̂) may not converge as the training set size increases (the variance never settles down).

14 Intuition
Consider the large-sample variance of τ̂. If, in place of f̂, we put a classifier whose decision function is close to 0 on a set of positive probability, then, due to the non-smoothness of the 0-1 loss at 0, the variance can jitter.

15 PIs from Learning Theory
Given a result of the form P( sup_{f in F} |τ̂(f) - τ(f)| ≥ ε_δ ) ≤ δ for all N, where f̂ is known to belong to F, the interval [τ̂(f̂) - ε_δ, τ̂(f̂) + ε_δ] forms a conservative 1-δ PI for τ(f̂).
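For a finite class, one concrete bound of this form comes from Hoeffding's inequality plus a union bound; a sketch (the helper name is mine, and the finite-class setting is an assumption for illustration):

```python
import math

def uniform_eps(n, card_F, delta=0.05):
    """Hoeffding + union bound over a finite class F of size card_F:
    with probability >= 1 - delta, |tau_hat(f) - tau(f)| < eps
    simultaneously for all f in F, for a sample of size n."""
    return math.sqrt(math.log(2.0 * card_F / delta) / (2.0 * n))

eps = uniform_eps(n=100, card_F=10, delta=0.05)
# conservative 1 - delta PI: [tau_hat - eps, tau_hat + eps]
print(round(eps, 4))
```

Note the conservatism: eps depends on the whole class F, not on how the data actually behaved, which is what the CUD-bound construction later tries to reduce.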

16 Combine statistical ideas with learning theory ideas
– Construct a prediction interval for sup_{f in F̂} τ(f), where F̂ is chosen to be small yet contain f̂
– From this PI, deduce a conservative PI for τ(f̂)
– Use the surrogate loss to perform estimation and to construct F̂

17 Construct a prediction interval for sup_{f in F̂} τ(f)
– F̂ should contain all f that are close to f*, the "limiting value" of f̂

18 Prediction Interval
Construct a prediction interval for sup_{f in F̂} τ(f).

19 Prediction Interval

20 Bootstrap
We use the bootstrap to obtain an estimate of an upper percentile of the distribution of the bound, giving b_U; the PI is then built from b_U.
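A generic sketch of this percentile step (in the talk, the statistic would be the centered CUD-bound quantity; here `stat` is a placeholder and the usage data are toy values):

```python
import numpy as np

def bootstrap_upper_percentile(stat, data, level=0.95, B=1000, seed=0):
    """Estimate the `level` percentile of the bootstrap distribution of
    stat(data*), where data* resamples the rows of `data` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    draws = np.array([stat(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return np.quantile(draws, level)

# toy usage: upper percentile of the bootstrap distribution of the mean
data = np.arange(10, dtype=float)
b_U = bootstrap_upper_percentile(np.mean, data)
print(b_U)
```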

21 Implementation
The approximation space for the classifier is linear. The surrogate loss is least squares; τ̂ is the resubstitution error.

22 Implementation [equation slide: the bound after substituting the linear class and least-squares surrogate]

23 Implementation
Bootstrap version: E_b denotes expectation with respect to the bootstrap distribution.

24 [Figure: CUD-bound level sets, three-point distribution, n = 30]

25 Computational Issues
Partition R^q into equivalence classes defined by the 2N possible values of the first term. Each equivalence class can be written as a set of β satisfying linear constraints. The first term is constant on each equivalence class.

26 Computational Issues
On each equivalence class, the supremum can be rewritten, since g is non-decreasing.

27 Computational Issues
This reduces the problem to the computation of at most 2N mixed-integer quadratic programming problems. Using commercial solvers (e.g. CPLEX), the CUD bound can be computed for moderately sized data sets in a few minutes on a standard desktop (2.8 GHz processor, 2 GB RAM).

28 Comparisons, 95% PI (sample size = 30, 1000 data sets)

Data    | CUD | BS  | M   | Y
Magic   | .99 | .92 | .98 | .99
Mamm.   | 1.0 | .68 | .43 | .98
Ion.    | 1.0 | .61 | .78 | .99
Donut   | 1.0 | .88 | .63 | .94
3-Pt    | .98 | .83 | .90 | .75
Balance | .95 | .91 | .61 | .99
Liver   | 1.0 | .96 | 1.0 |

29 Comparisons, Length of PI (sample size = 30, 1000 data sets)

Data    | CUD | BS  | M   | Y
Magic   | .58 | .31 | .28 | .46
Mamm.   | .42 | .53 | .32 | .42
Ion.    | .51 | .43 | .30 | .50
Donut   | .46 | .59 | .32 | .41
3-Pt    | .40 | .48 | .32 | .46
Balance | .38 | .09 | .29 | .48
Liver   | .62 | .37 | .33 | .49

30 Intuition
In large samples, the bound behaves like its limiting counterpart.

31 Intuition
The large-sample distribution is the same as the distribution of the limiting quantity.

32 Intuition
In one case the distribution is approximately the limiting distribution for a binomial, as expected.

33 Intuition
In the other case the distribution is approximately that of a different limiting quantity.

34 Discussion
Further reduce the conservatism of the CUD-bound:
– Replace components of the bound by other quantities
– Other surrogates (exponential, logit)
Can we construct a principled way to minimize the length of the conservative PI? The real goal is to produce PIs for the Value of a policy.

35 The simplest dynamic treatment regime (i.e. policy) is a decision rule, when there is only one stage of treatment. For each individual:
– Observation available at the j-th stage
– Action at the j-th stage (usually a treatment)
– Primary outcome: Y

36 Goal: Construct decision rules that input patient information and output a recommended action; these decision rules should lead to a maximal mean Y. In the future, one selects the action according to the rule.

37 Single Stage (k = 1)
Find a confidence interval for the mean outcome if a particular estimated policy (here, one decision rule) is employed. The action A is randomized in {-1, 1}. Suppose the decision rule is of the form d(X) = sign(f(X)). We do not assume the optimal decision boundary is linear.

38 Single Stage (k = 1)
The mean outcome following this policy is a weighted mean of Y, where the weight involves p, the randomization probability.
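The standard estimator matching this description is inverse probability weighting; a sketch under the assumption that A is uniform on {-1, 1} (so the randomization probability is 1/2), with a hypothetical rule and toy data:

```python
import numpy as np

def ipw_value(Y, A, O, rule, p=0.5):
    """Inverse-probability-weighted estimate of the mean outcome under
    decision rule `rule`; A was randomized with known probability `p`
    of each action (here uniform on {-1, 1})."""
    follows = (A == rule(O)).astype(float)  # 1 if the observed action matches the rule
    return np.mean(Y * follows / p)

# hypothetical rule: treat (A = 1) when the observation is positive
rule = lambda o: np.sign(o)
Y = np.array([1.0, 2.0, 3.0, 4.0])
A = np.array([1, -1, 1, -1])
O = np.array([0.5, -0.5, 1.0, -1.0])
print(ipw_value(Y, A, O, rule))  # 5.0
```

Dividing by the randomization probability re-weights the subjects whose observed actions happened to agree with the rule, so the average is unbiased for the mean outcome had everyone followed the rule.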

39 [figure slide]

40 Oslin ExTENd
[SMART design diagram: random assignment to an early vs. late trigger for nonresponse; each arm receives Naltrexone for 8 weeks; responders and nonresponders are then re-randomized among CBI, CBI + Naltrexone, and TDM + Naltrexone]

41 This seminar can be found at: http://www.stat.lsa.umich.edu/~samurphy/seminars/Emory11.11.08.ppt
Email Eric or me with questions or if you would like a copy of the associated paper: laber@umich.edu or samurphy@umich.edu

42 [Figure: bias of common estimators on the three-point example]

