Scoring Rules, Generalized Entropy, and Utility Maximization. Victor Richmond R. Jose, Robert F. Nau, Robert L. Winkler. The Fuqua School of Business, Duke University.


Scoring Rules, Generalized Entropy, and Utility Maximization. Victor Richmond R. Jose, Robert F. Nau, Robert L. Winkler. The Fuqua School of Business, Duke University, Durham, NC, USA. FUR XII Presentation, June 23, 2006.

The general problem: how to measure information value. Suppose there is uncertainty about which of n states of the world will occur, either in a single occurrence or in repeated trials. An initial description of the uncertainty is represented by a "baseline" probability distribution q, while a forecaster or decision maker possesses a "true" distribution p based on additional information (e.g., experimental data or expert judgment). What is an appropriate measure of the value of the information that changes q to p?

Three strands of information-value literature
1. Decision analysis: information value = increase in expected utility obtained by using p rather than q to select among available acts.
2. Information theory: information value = decrease in the expected number of bits needed to communicate the state which has occurred (Kullback-Leibler divergence between p and q).
3. Scoring rules: information value = expected score obtained when the distribution p is elicited via a proper scoring rule whose expectation is minimized at the baseline distribution q.

Historical perspective. All three strands of literature date back to pioneering work on subjective probability, expected utility, and information theory in the 1940s and 1950s (Shannon, Savage, Brier, Good...). In recent years, scoring rules have received new attention in experimental economics and neuroeconomics, while generalized divergence measures have found application in machine learning, robust Bayesian statistics, and mathematical psychology. A number of recent papers have explored other aspects of their interconnections (e.g., Grünwald-Dawid 2004 Ann. Stat., Gneiting-Raftery 2005 w.p.).

This paper presents a unification of the three approaches. We introduce generalized (weighted) versions of the power and pseudospherical scoring rules with power parameter β and baseline distribution q. These scoring rule families are shown to correspond to two generalized divergence measures which converge to the KL divergence at β = 1. The cases β = 0, β = ½, and β = 2 are also of special interest. They also correspond exactly to canonical decision analysis problems involving a risk averse decision maker whose risk tolerance coefficient is equal to β.

Part 1: scoring rules. Notation:
p = (p_1, ..., p_n) is the forecaster's true distribution
r = (r_1, ..., r_n) is her reported distribution
q = (q_1, ..., q_n) is a baseline (reference) distribution
e_i = the i-th unit vector (point mass on state i)
A scoring rule is a function S with arguments r and p, which is linear in p, and has parameter q, such that:
S(r, p; q) is the expected score for reporting r when the true distribution is p
S(r, e_i; q) is the actual score yielded by r in state i
S(p, p; q) is minimized at p = q

Proper scoring rules. S is [strictly] proper if it [strictly] encourages honesty in the sense that S(p, p; q) ≥ [>] S(r, p; q) for all r ≠ p. Classic proper scoring rules (for which q is uniform):
Quadratic (Brier) score: $S(r, e_i) = 2r_i - \sum_j r_j^2$
Spherical score: $S(r, e_i) = r_i / \big(\sum_j r_j^2\big)^{1/2}$
Logarithmic score: $S(r, e_i) = \ln r_i$
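As a numerical illustration (not from the slides), here is a minimal Python sketch of the three classic rules; the distributions p and r are arbitrary examples, and the check confirms that honest reporting maximizes the expected score:

```python
import math

def quadratic(r, i):
    # Brier/quadratic score in state i: 2*r_i - sum_j r_j^2
    return 2 * r[i] - sum(rj ** 2 for rj in r)

def spherical(r, i):
    # Spherical score: r_i divided by the Euclidean norm of r
    return r[i] / math.sqrt(sum(rj ** 2 for rj in r))

def logarithmic(r, i):
    # Logarithmic score: ln r_i
    return math.log(r[i])

def expected_score(score, r, p):
    # Expected score for reporting r when the true distribution is p
    return sum(p[i] * score(r, i) for i in range(len(p)))

p = [0.5, 0.3, 0.2]   # forecaster's true distribution (example)
r = [0.4, 0.4, 0.2]   # a dishonest report (example)
for score in (quadratic, spherical, logarithmic):
    # Propriety: honesty yields a strictly higher expected score
    assert expected_score(score, p, p) > expected_score(score, r, p)
```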

Generalized families of scoring rules. The three classic scoring rules can be generalized by introducing a non-uniform baseline distribution q and by substituting an arbitrary real number β for the power of 2 in the quadratic and spherical rules. This leads to the weighted power score and the weighted pseudospherical score. The weighted power and pseudospherical scores depend on the state i as affine functions of $(r_i/q_i)^{\beta-1}$. They both converge to the weighted logarithmic score at β = 1.

Weighted scoring rules.
Weighted power score: $S_\beta(r, e_i; q) = \dfrac{\beta (r_i/q_i)^{\beta-1} - (\beta-1)\sum_j q_j (r_j/q_j)^\beta - 1}{\beta(\beta-1)}$
Weighted pseudospherical score: $S_\beta(r, e_i; q) = \dfrac{1}{\beta-1}\left[\dfrac{(r_i/q_i)^{\beta-1}}{\big(\sum_j q_j (r_j/q_j)^\beta\big)^{(\beta-1)/\beta}} - 1\right]$
At β = 1 both converge to the weighted log score: $S(r, e_i; q) = \ln(r_i/q_i)$
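A hedged Python sketch of these two families, assuming the standard normalization in which both rules are affine in (r_i/q_i)^(β−1) and reduce to ln(r_i/q_i) at β = 1; the convergence is checked numerically just above β = 1:

```python
import math

def weighted_power(r, q, i, beta):
    # Weighted power score in state i (sketch of the generalized rule)
    x = [rj / qj for rj, qj in zip(r, q)]
    s = sum(qj * xj ** beta for qj, xj in zip(q, x))
    return (beta * x[i] ** (beta - 1) - (beta - 1) * s - 1) / (beta * (beta - 1))

def weighted_pseudospherical(r, q, i, beta):
    # Weighted pseudospherical score in state i
    x = [rj / qj for rj, qj in zip(r, q)]
    s = sum(qj * xj ** beta for qj, xj in zip(q, x))
    return (x[i] ** (beta - 1) / s ** ((beta - 1) / beta) - 1) / (beta - 1)

def weighted_log(r, q, i):
    # Common limit of both families at beta = 1
    return math.log(r[i] / q[i])

r = [0.5, 0.3, 0.2]          # reported distribution (example)
q = [1 / 3, 1 / 3, 1 / 3]    # uniform baseline (example)
for i in range(3):
    # Both families approach the weighted log score as beta -> 1
    assert abs(weighted_power(r, q, i, 1 + 1e-6) - weighted_log(r, q, i)) < 1e-4
    assert abs(weighted_pseudospherical(r, q, i, 1 + 1e-6) - weighted_log(r, q, i)) < 1e-4
```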

Corresponding expected score functions.
Weighted power expected score: $S_\beta(p, p; q) = \dfrac{1}{\beta(\beta-1)}\left[\sum_j q_j (p_j/q_j)^\beta - 1\right]$
Weighted pseudospherical expected score: $S_\beta(p, p; q) = \dfrac{1}{\beta-1}\left[\big(\sum_j q_j (p_j/q_j)^\beta\big)^{1/\beta} - 1\right]$
At β = 1 both converge to the weighted log expected score: $\sum_j p_j \ln(p_j/q_j)$

Part 2: information-theoretic measures of divergence and distance.
Kullback-Leibler divergence between p and q: $D_{KL}(p\|q) = \sum_i p_i \ln(p_i/q_i)$
Affinity between p and q: $A(p, q) = \sum_i \sqrt{p_i q_i}$
Squared Hellinger distance: $H^2(p, q) = \tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2 = 1 - A(p, q)$
Chi-square divergence: $\chi^2(p\|q) = \sum_i (p_i - q_i)^2 / q_i$
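These four measures are straightforward to compute; the sketch below (with an arbitrary example pair p, q) also verifies the identity H² = 1 − A(p, q) under the normalization used here (Hellinger conventions vary by a constant factor across the literature):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence of p from q
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def affinity(p, q):
    # Bhattacharyya affinity between p and q
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    # Squared Hellinger distance, normalized so that H^2 = 1 - affinity
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def chi_square(p, q):
    # Chi-square divergence of p from q
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]    # example "true" distribution
q = [0.25, 0.25, 0.5]  # example baseline distribution
assert abs(hellinger_sq(p, q) - (1 - affinity(p, q))) < 1e-12
assert kl(p, q) > 0 and chi_square(p, q) > 0   # both vanish only at p = q
```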

Parametric families of generalized divergence.
Havrda-Charvát (1967) & others: $D_\beta(p\|q) = \dfrac{1}{\beta(\beta-1)}\left[\sum_i p_i^\beta q_i^{1-\beta} - 1\right]$
Arimoto (1971) & others: $D_\beta(p\|q) = \dfrac{1}{\beta-1}\left[\big(\sum_i p_i^\beta q_i^{1-\beta}\big)^{1/\beta} - 1\right]$
Both converge to the KL divergence at β = 1: $\sum_i p_i \ln(p_i/q_i)$
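A small Python check, using the forms written above, that both parametric families approach the KL divergence as β → 1; the distributions are arbitrary examples:

```python
import math

def havrda_charvat(p, q, beta):
    # Havrda-Charvat divergence of order beta (weighted form)
    s = sum(qi * (pi / qi) ** beta for pi, qi in zip(p, q))
    return (s - 1) / (beta * (beta - 1))

def arimoto(p, q, beta):
    # Arimoto divergence of order beta (weighted form)
    s = sum(qi * (pi / qi) ** beta for pi, qi in zip(p, q))
    return (s ** (1 / beta) - 1) / (beta - 1)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
for beta in (1 - 1e-6, 1 + 1e-6):
    # Approaching beta = 1 from either side recovers KL divergence
    assert abs(havrda_charvat(p, q, beta) - kl(p, q)) < 1e-4
    assert abs(arimoto(p, q, beta) - kl(p, q)) < 1e-4
```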

First main result: the scoring-rule/entropy link. Theorem 1: The Havrda-Charvát and Arimoto divergences of order β are identical to the weighted power and pseudospherical expected-score functions of order β, respectively, for all real β. The special cases β = 0, β = ½, β = 1, and β = 2 are of particular interest.
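Theorem 1 can be spot-checked numerically: with honest reporting (r = p), the expected weighted power and pseudospherical scores should match the Havrda-Charvát and Arimoto divergences exactly, for any β. A sketch, assuming the score forms given earlier:

```python
def expected_power_score(p, q, beta):
    # E_p of the weighted power score under honest reporting r = p
    x = [pi / qi for pi, qi in zip(p, q)]
    s = sum(qi * xi ** beta for qi, xi in zip(q, x))
    return sum(pi * (beta * xi ** (beta - 1) - (beta - 1) * s - 1)
               for pi, xi in zip(p, x)) / (beta * (beta - 1))

def expected_pseudospherical_score(p, q, beta):
    # E_p of the weighted pseudospherical score under honest reporting
    x = [pi / qi for pi, qi in zip(p, q)]
    s = sum(qi * xi ** beta for qi, xi in zip(q, x))
    return sum(pi * (xi ** (beta - 1) / s ** ((beta - 1) / beta) - 1)
               for pi, xi in zip(p, x)) / (beta - 1)

def havrda_charvat(p, q, beta):
    s = sum(qi * (pi / qi) ** beta for pi, qi in zip(p, q))
    return (s - 1) / (beta * (beta - 1))

def arimoto(p, q, beta):
    s = sum(qi * (pi / qi) ** beta for pi, qi in zip(p, q))
    return (s ** (1 / beta) - 1) / (beta - 1)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
for beta in (0.5, 2.0, -1.0, 3.0):
    # The identities hold for every real beta (away from 0 and 1)
    assert abs(expected_power_score(p, q, beta) - havrda_charvat(p, q, beta)) < 1e-12
    assert abs(expected_pseudospherical_score(p, q, beta) - arimoto(p, q, beta)) < 1e-12
```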

Part 3: decision-analytic information value with exponential/log/power utility, i.e., linear risk tolerance. Standard LRT (HARA) utility function with risk tolerance coefficient β: $g_\beta(y) = \dfrac{1}{\beta-1}\left[(1+\beta y)^{(\beta-1)/\beta} - 1\right]$
Special cases: $g_0(y) = 1 - e^{-y}$ (exponential), $g_1(y) = \ln(1+y)$ (logarithmic), $g_{1/2}(y) = y/(1+y/2)$ (reciprocal), $g_2(y) = \sqrt{1+2y} - 1$ (square root).

Properties of standard LRT utility functions. The graphs of the utility functions {g_β} are mutually tangent at the origin for all β: $g_\beta(0) = 0$ and $g_\beta'(0) = 1$. The risk tolerance function (the reciprocal of the Pratt-Arrow measure) is a linear function with slope β and intercept 1: $\tau_\beta(y) = 1 + \beta y$. Also, $g_\beta(y)$ and $g_{1-\beta}(y)$ are power utility functions whose exponents are reciprocal to each other.
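A numerical sketch of these properties, using the LRT form g_β(y) = ((1+βy)^((β−1)/β) − 1)/(β−1) with its β = 0 and β = 1 limits taken explicitly; the tangency at the origin and the linearity of the risk tolerance function are checked by finite differences:

```python
import math

def g(y, beta):
    # Normalized linear-risk-tolerance utility; beta = 0 and 1 are limit cases
    if beta == 0:
        return 1 - math.exp(-y)          # exponential utility
    if beta == 1:
        return math.log(1 + y)           # logarithmic utility
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)

h = 1e-6
for beta in (0, 0.5, 1, 2, -1):
    assert g(0, beta) == 0                                 # common value at origin
    slope = (g(h, beta) - g(-h, beta)) / (2 * h)
    assert abs(slope - 1) < 1e-6                           # common slope: mutual tangency

def risk_tolerance(y, beta, h=1e-5):
    # Numerical -u'/u'' via central differences
    u1 = (g(y + h, beta) - g(y - h, beta)) / (2 * h)
    u2 = (g(y + h, beta) - 2 * g(y, beta) + g(y - h, beta)) / h ** 2
    return -u1 / u2

for beta in (0, 0.5, 1, 2):
    # Risk tolerance is linear: 1 + beta*y (checked at y = 0.3)
    assert abs(risk_tolerance(0.3, beta) - (1 + beta * 0.3)) < 1e-3
```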

Canonical decision models for determining the information value of p over q.
Model Y: a risk averse decision maker with utility function $g_\beta(y)$ and probability distribution p bets so as to maximize her own expected utility against a risk-neutral, non-strategic opponent with distribution q.
Model Z: equivalently, a risk neutral decision maker with probability distribution p bets so as to maximize her own expected utility against a risk-averse, non-strategic opponent with utility function $g_{1-\beta}(y)$ and distribution q.

Second main result: the decision analysis/scoring rule/entropy link. Theorem 2(a): The solution of Models Y and Z yields the same optimal utility payoffs as the weighted pseudospherical scoring rule with parameters q and β, and its expected utility is the Arimoto divergence of order β between p and q. Note that risk tolerance is non-decreasing in both Models Y and Z (only) when β is between 0 and 1. The interesting special case β = ½ corresponds to reciprocal utility in both models, and the special cases β = 0 and β = 1 correspond to exponential utility in one model and logarithmic utility in the other.
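The transcript states Model Y only in words (the betting formulas were lost). Assuming it amounts to choosing a payoff vector y with zero expected value under q so as to maximize expected g_β-utility under p, the first-order conditions give 1 + βy_i = (p_i/q_i)^β / Σ_j q_j (p_j/q_j)^β. The sketch below checks that this bet is fair under q, that its state-by-state utilities equal the weighted pseudospherical scores, and that its expected utility equals the Arimoto divergence (here at β = ½):

```python
def g(y, beta):
    # Normalized LRT utility (beta away from 0 and 1)
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)

def pseudospherical(p, q, i, beta):
    x = [pj / qj for pj, qj in zip(p, q)]
    s = sum(qj * xj ** beta for qj, xj in zip(q, x))
    return (x[i] ** (beta - 1) / s ** ((beta - 1) / beta) - 1) / (beta - 1)

def arimoto(p, q, beta):
    s = sum(qj * (pj / qj) ** beta for pj, qj in zip(p, q))
    return (s ** (1 / beta) - 1) / (beta - 1)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
beta = 0.5
x = [pj / qj for pj, qj in zip(p, q)]
s = sum(qj * xj ** beta for qj, xj in zip(q, x))
# Candidate optimal bet from the first-order conditions of Model Y
y = [(xi ** beta / s - 1) / beta for xi in x]
assert abs(sum(qj * yj for qj, yj in zip(q, y))) < 1e-12   # fair bet under q
for i in range(3):
    # State-by-state utility equals the weighted pseudospherical score
    assert abs(g(y[i], beta) - pseudospherical(p, q, i, beta)) < 1e-12
eu = sum(p[i] * g(y[i], beta) for i in range(3))
assert abs(eu - arimoto(p, q, beta)) < 1e-12               # expected utility = Arimoto divergence
```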

Alternative decision models that maximize the sum of expected utilities.
Model Y′: a risk averse decision maker with utility function $g_\beta(y)$ and distribution p bets against a risk-neutral, non-strategic opponent with distribution q so as to maximize the sum of their expected utilities.
Model Z′: equivalently, a risk neutral decision maker with distribution p bets against a risk-averse, non-strategic opponent with utility function $g_{1-\beta}(y)$ and distribution q so as to maximize the sum of their expected utilities.

Second main result, continued. Theorem 2(b): The solution of Models Y′ and Z′ yields the same optimal utility payoffs as the weighted power scoring rule with parameters q and β, and its expected utility is the Havrda-Charvát divergence of order β between p and q. Note that Models Y′ and Z′ are more "cooperative" in spirit than Y and Z, but also somewhat less natural: the sum of two persons' expected utilities is maximized, each computed according to a different probability distribution over the same states.
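Similarly, assuming the cooperative model amounts to choosing y to maximize Σ_i p_i g_β(y_i) − Σ_i q_i y_i (the risk-neutral opponent's expected utility of the opposite side of the bet added to the decision maker's), the first-order condition p_i g_β′(y_i) = q_i gives 1 + βy_i = (p_i/q_i)^β. The sketch checks that the resulting objective value equals the Havrda-Charvát divergence and that random perturbations of the bet only lower it:

```python
import random

def g(y, beta):
    # Normalized LRT utility (beta away from 0 and 1)
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)

def havrda_charvat(p, q, beta):
    s = sum(qj * (pj / qj) ** beta for pj, qj in zip(p, q))
    return (s - 1) / (beta * (beta - 1))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
beta = 0.5
# Stationary bet from the first-order condition p_i * g'(y_i) = q_i
y = [((pj / qj) ** beta - 1) / beta for pj, qj in zip(p, q)]
total = (sum(pj * g(yj, beta) for pj, yj in zip(p, y))
         - sum(qj * yj for qj, yj in zip(q, y)))
# Sum of expected utilities at the optimum = Havrda-Charvat divergence
assert abs(total - havrda_charvat(p, q, beta)) < 1e-12

# By concavity of g, perturbing the bet can only lower the objective
random.seed(0)
for _ in range(5):
    z = [yj + random.uniform(-0.1, 0.1) for yj in y]
    perturbed = (sum(pj * g(zj, beta) for pj, zj in zip(p, z))
                 - sum(qj * zj for qj, zj in zip(q, z)))
    assert perturbed <= total + 1e-12
```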

Observations and conclusions
1. The pseudospherical rule & Arimoto divergence have a more compelling decision-theoretic basis than the power rule & Havrda-Charvát divergence, insofar as they arise from a more natural utility-maximization problem.
2. The most appropriate values of β for either rule appear to be those in the closed unit interval, rather than the more commonly used β = 2.
3. The special case β = ½ is of interest because of its symmetry properties and its connection with the Hellinger distance measure (reciprocal utility!).
4. A well-chosen, not-necessarily-uniform baseline distribution q is the most important parameter of the scoring rule in any case.