INFORMS Annual Meeting, San Diego, October 11, 2009. HIERARCHICAL KNOWLEDGE GRADIENT FOR SEQUENTIAL SAMPLING. Martijn Mes, Department of Operational Methods for Production and Logistics.


INFORMS Annual Meeting San Diego 1 HIERARCHICAL KNOWLEDGE GRADIENT FOR SEQUENTIAL SAMPLING Martijn Mes Department of Operational Methods for Production and Logistics University of Twente, The Netherlands Warren Powell Department of Operations Research and Financial Engineering Princeton University, USA Peter Frazier Department of Operations Research and Information Engineering Cornell University, USA Sunday, October 11, 2009 INFORMS Annual Meeting San Diego

OPTIMAL LEARNING  Problem  Find the best alternative from a set of alternatives  Before choosing, you have the option to measure the alternatives  But measurements are noisy  How should you sequence your measurements to produce the best answer in the end?  For problems with a finite number of alternatives  On-line learning (learn as you earn): the multi-armed bandit problem  Off-line learning: the ranking and selection problem Let's illustrate the problem…

WHAT WOULD BE THE BEST PLACE TO GO FISHING?

WHAT WOULD BE THE BEST PLACE TO BUILD A WIND FARM?

WHAT WOULD BE THE BEST CHEMICAL COMPOUND IN A DRUG TO FIGHT A PARTICULAR DISEASE?

WHAT PARAMETER SETTINGS WOULD PRODUCE THE BEST MANUFACTURING CONFIGURATION IN A SIMULATED SYSTEM? Simulation Optimization

WHERE IS THE MAX OF SOME MULTI-DIMENSIONAL FUNCTION WHEN THE SURFACE IS MEASURED WITH NOISE? Stochastic Search

BASIC MODEL  We have a set X of distinct alternatives. Each alternative x ∈ X is characterized by an independent normal distribution with unknown mean θ_x and known variance λ_x.  We have a sequence of N measurement decisions, x^0, x^1, …, x^{N-1}. The decision x^n selects an alternative to sample at time n, resulting in an observation y_x^{n+1}.  After the N measurements, we make an implementation decision x^N, which is given by the alternative with the highest expected reward.
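This setup can be sketched in a few lines of Python; the five alternatives, their true means, and the noise variances below are hypothetical values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: five alternatives with unknown true means theta_x
# and known measurement variances lambda_x (theta is hidden from the learner).
theta = np.array([1.0, 2.0, 1.5, 0.5, 1.8])
lam = np.full(5, 0.25)

def measure(x):
    """One noisy observation of alternative x: y ~ N(theta_x, lambda_x)."""
    return rng.normal(theta[x], np.sqrt(lam[x]))

# Spend the measurement budget (here: a naive round-robin policy), then make
# the implementation decision x^N = alternative with the highest estimate.
samples = {x: [measure(x) for _ in range(500)] for x in range(5)}
estimates = np.array([np.mean(samples[x]) for x in range(5)])
x_impl = int(np.argmax(estimates))
```

The whole question of the talk is how to allocate the measurement budget more cleverly than this round-robin.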

OBJECTIVE  Our goal is to choose a sampling policy that maximizes the expected value of the implementation decision x^N.  Let π ∈ Π be a policy that produces a sequence of measurement decisions x^n, n = 0, …, N-1.  Objective: sup_{π∈Π} E^π[θ_{x^N}], where the expectation is taken both with respect to the measurement outcomes and with respect to the policy π.

MEASUREMENT POLICIES [1/2]  Optimal policies:  Dynamic programming (computational challenge)  Special case: the multi-armed bandit problem, which can be solved using the Gittins index (Gittins and Jones, 1974).  Heuristic measurement policies:  Pure exploitation: always make the choice that appears to be the best.  Pure exploration: make choices at random so that you are always learning more, but without regard to the cost of the decision.  Hybrid: explore with probability ρ and exploit with probability 1-ρ.  Epsilon-greedy exploration: explore with probability p_n = c/n, which goes to zero as n → ∞, but not too quickly.
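As a sketch (with hypothetical estimates), the epsilon-greedy rule can be written as:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(estimates, n, c=1.0):
    """Explore with probability p_n = min(1, c/n), otherwise exploit.

    p_n goes to zero as n -> infinity, but slowly enough that every
    alternative keeps being sampled infinitely often.
    """
    p_n = min(1.0, c / max(n, 1))
    if rng.random() < p_n:
        return int(rng.integers(len(estimates)))   # explore: uniform choice
    return int(np.argmax(estimates))               # exploit: current best

# The exploration probability decays harmonically: 1, 1/2, 1/3, ...
probs = [min(1.0, 1.0 / n) for n in range(1, 6)]
```

Setting c = 0 recovers pure exploitation; a very large c recovers pure exploration.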

MEASUREMENT POLICIES [2/2]  Heuristic measurement policies, continued:  Boltzmann exploration  Interval estimation  Approximate policies for off-line learning:  Optimal computing budget allocation (Chen et al., 1996)  LL(s): batch linear loss (Chick et al., 2009)  Maximizing the expected value of a single measurement:  (R_1, R_1, …, R_1) policy (Gupta and Miescke, 1996)  EVI (Chick et al., 2009)  "Knowledge gradient" (Frazier and Powell, 2008)

THE KNOWLEDGE-GRADIENT POLICY [1/2]  Updating beliefs  We assume we start with a distribution of belief about the true mean θ_x (a Bayesian prior): θ_x ~ N(μ_x^n, 1/β_x^n), where β_x^n is the precision (inverse variance) of our belief after n measurements.  Next, we observe y_x^{n+1} ~ N(θ_x, λ_x).  Using Bayes' theorem, we can show that our new distribution (posterior belief) about the true mean is N(μ_x^{n+1}, 1/β_x^{n+1}), with μ_x^{n+1} = (β_x^n μ_x^n + β^ε y_x^{n+1}) / (β_x^n + β^ε) and β_x^{n+1} = β_x^n + β^ε, where β^ε = 1/λ_x is the measurement precision.  We perform these updates with each observation.
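The conjugate normal updating step is a one-liner per quantity; a minimal sketch working with precisions (beta = 1/variance):

```python
import numpy as np

def bayes_update(mu, beta, x, y, beta_eps):
    """Update the belief on alternative x after observing y.

    mu, beta: arrays of posterior means and precisions (1/variance);
    beta_eps = 1/lambda_x is the measurement precision. Returns copies.
    """
    mu, beta = mu.copy(), beta.copy()
    mu[x] = (beta[x] * mu[x] + beta_eps * y) / (beta[x] + beta_eps)
    beta[x] = beta[x] + beta_eps
    return mu, beta

# Example: prior N(0, 1) on each of three alternatives; one observation
# y = 2.0 of alternative 0 with measurement precision beta_eps = 4.
mu1, beta1 = bayes_update(np.zeros(3), np.ones(3), x=0, y=2.0, beta_eps=4.0)
# mu1[0] = (1*0 + 4*2) / (1 + 4) = 1.6 and beta1[0] = 5
```

Note that the posterior mean is a precision-weighted average of the prior mean and the observation, and precisions simply add.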

THE KNOWLEDGE-GRADIENT POLICY [2/2]  Measurement decisions  The knowledge gradient is the expected value of a single measurement: ν_x^{KG,n} = E[ max_{x'} μ_{x'}^{n+1} - max_{x'} μ_{x'}^n | x^n = x ].  Knowledge-gradient policy: measure x^n = argmax_x ν_x^{KG,n}.
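For independent normal beliefs this expectation has a known closed form, ν_x = σ̃_x f(ζ_x) with f(z) = z Φ(z) + φ(z) (Frazier and Powell, 2008); a sketch:

```python
import math
import numpy as np

def knowledge_gradient(mu, sigma2, lam):
    """KG factors nu_x for independent normal beliefs.

    mu: posterior means; sigma2: posterior variances; lam: known
    measurement variances. sigma_tilde is the standard deviation of the
    change in mu_x caused by one more measurement of x.
    """
    mu, sigma2, lam = (np.asarray(a, float) for a in (mu, sigma2, lam))
    sigma_tilde = sigma2 / np.sqrt(sigma2 + lam)
    nu = np.empty_like(mu)
    for x in range(len(mu)):
        best_other = np.delete(mu, x).max()
        zeta = -abs(mu[x] - best_other) / sigma_tilde[x]
        Phi = 0.5 * (1.0 + math.erf(zeta / math.sqrt(2.0)))
        phi = math.exp(-0.5 * zeta * zeta) / math.sqrt(2.0 * math.pi)
        nu[x] = sigma_tilde[x] * (zeta * Phi + phi)
    return nu

# The KG policy measures x^n = argmax_x nu_x.
```

This small example also reproduces the behavior illustrated later in the talk: with equal means the alternative with the lowest precision wins, and with equal precisions a higher mean beats a lower one.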

PROPERTIES OF THE KNOWLEDGE-GRADIENT POLICY  Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.  Myopically optimal: the best single measurement you can make (by construction).  Asymptotically optimal: as the measurement budget grows, we get the optimal solution.  The knowledge gradient is the only stationary policy with both properties. Many policies are asymptotically optimal (e.g., pure exploration, epsilon-greedy) but are not myopically optimal.  But what if the number of alternatives is large relative to the measurement budget?

CORRELATIONS  There are many problems where making one measurement tells us something about what we might observe from other measurements.  Fishing: nearby locations have similar properties (depth, bottom structure, plants, current, etc.).  Wind farm: nearby locations often share similar wind patterns.  Chemical compounds: structurally similar chemicals often behave similarly.  Simulation optimization: a small adjustment in parameter settings might result in a relatively small performance change.  Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the function is continuous).

KNOWLEDGE GRADIENT FOR CORRELATED BELIEFS  The knowledge-gradient policy for correlated normal beliefs (Frazier, Powell, and Dayanik, 2009)  Belief is multivariate normal  Significantly outperforms methods that ignore correlations  Computing the expectation is more challenging  Assumption: the covariance matrix is known (or we first have to learn it).
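With a multivariate normal belief, one measurement shifts the beliefs about every correlated alternative. A minimal sketch of the belief update (standard multivariate normal conditioning, with an illustrative two-alternative example):

```python
import numpy as np

def mvn_update(mu, Sigma, x, y, lam_x):
    """Correlated belief update: prior N(mu, Sigma), observe alternative x
    with noise variance lam_x. One measurement of x shifts the beliefs
    about every alternative correlated with x."""
    e = np.zeros(len(mu)); e[x] = 1.0
    gain = Sigma @ e / (lam_x + Sigma[x, x])
    mu_new = mu + (y - mu[x]) * gain
    Sigma_new = Sigma - np.outer(gain, Sigma @ e)
    return mu_new, Sigma_new

# Two positively correlated alternatives: observing a high value at x=0
# also raises the belief about x=1.
mu1, Sigma1 = mvn_update(np.zeros(2),
                         np.array([[1.0, 0.8], [0.8, 1.0]]),
                         x=0, y=1.0, lam_x=1.0)
# gain = [0.5, 0.4], so mu1 = [0.5, 0.4]
```

The posterior covariance shrinks for both alternatives, which is exactly why exploiting correlations is so much more sample-efficient than treating the alternatives independently.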

STATISTICAL AGGREGATION [1/2]  Instead of using a given covariance matrix, we might work with statistical aggregation to allow generalization across alternatives.  Examples: binary tree aggregation for continuous functions; geographical aggregation.

STATISTICAL AGGREGATION [2/2]  Examples continued: aggregation of vector-valued data (multi-attribute vectors) by ignoring dimensions, e.g., the hierarchy V(a_1,…,a_10), V(a_1,…,a_5), V(a_1,…,a_4), V(a_1,…,a_3), V(a_1,a_2), V(a_1,f(a_2)), V(a_1), V(f(a_1)) at aggregation levels g = 0, 1, …, 7. Here V is the value of a driver with attributes: a_1 = location, a_2 = domicile, a_3 = capacity type, a_4 = scheduled time at home, a_5 = days away from home, a_6 = available time, a_7 = geographical constraints, a_8 = DOT road hours, a_9 = DOT duty hours, a_10 = eight-day duty hours.

AGGREGATION FUNCTIONS  Aggregation is performed using a set of aggregation functions G^g: X → X^g, where X^g represents the g-th level of aggregation of the original set X.  We use μ_x^{g,n} as the estimate of the aggregated alternative G^g(x) on the g-th aggregation level after n measurements.  Using aggregation, we express μ̄_x^n (our estimate of θ_x) as a weighted combination μ̄_x^n = Σ_g w_x^{g,n} μ_x^{g,n}.  We use a Bayesian adaptation of the weights proposed in (George, Powell, and Kulkarni, 2008), with w_x^{g,n} inversely proportional to the sum of the variance and the squared bias of the estimate at level g. Intuition: highest weight to the levels with the lowest sum of variance and bias.
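A minimal sketch of the weighting idea (not the paper's exact Bayesian weights): weight each level inversely to its variance plus squared bias, then normalize.

```python
import numpy as np

def combined_estimate(mu_g, sigma2_g, delta_g):
    """Combine per-level estimates of one alternative across levels g.

    mu_g: estimate at each aggregation level g; sigma2_g: variance of
    that estimate; delta_g: estimated bias of level g relative to the
    disaggregate level. Weights ~ 1 / (variance + bias^2), normalized.
    """
    w = 1.0 / (np.asarray(sigma2_g, float) + np.asarray(delta_g, float) ** 2)
    w = w / w.sum()
    return float(np.dot(w, mu_g)), w

# A level with low variance and low bias dominates the combination.
est, w = combined_estimate(mu_g=[2.0, 1.0],
                           sigma2_g=[0.1, 1.0],   # level 0 measured often
                           delta_g=[0.0, 1.0])    # level 1 is biased
# raw weights ~ [10, 0.5], so normalized w is roughly [0.95, 0.05]
```

Early on, the high aggregation levels (few measurements needed, but biased) dominate; as measurements accumulate, the weight shifts toward the disaggregate level.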

HIERARCHICAL KNOWLEDGE GRADIENT (HKG)  Idea: combine the knowledge gradient with statistical aggregation.  The weighting equation can be seen as a form of linear regression, so we might use Bayesian regression here. However, this approach requires an informative prior.  Instead, we choose separate beliefs on the values at each aggregation level.  So, instead of working with a multivariate normal, we have a series of independent normal distributions for each value at each aggregation level. These beliefs are combined using the weighting equation.  In the paper we provide a Bayesian justification of this combination of beliefs.

HKG IN A NUTSHELL  Compute the knowledge gradients ν_x^{KG,n} for all x ∈ X, splitting each into two terms, one of which depends on the unknown measurement value; finding the expectation of the maximum of a set of lines follows (Frazier et al., 2009).  Measurement decision: x^n = argmax_x ν_x^{KG,n}.  After observing y_x^{n+1}, compute μ_x^{g,n+1}, β_x^{g,n+1}, δ_x^{g,n+1}, w_x^{g,n+1}, and σ_x^{g,n+1,ε} for all x ∈ X and g ∈ G, using the expected weight after observing y_x^{n+1} and the variance of μ_x^{g,n+1}.

ILLUSTRATION OF HKG  The knowledge-gradient policies prefer to measure alternatives with a high mean and/or low precision:  Equal means → measure the lowest precision  Equal precisions → measure the highest mean  Some MS Excel demos…  Statistical aggregation  Sampling decisions

NUMERICAL EXPERIMENTS  One-dimensional continuous functions generated by a Gaussian process with zero mean and a power exponential covariance function. We vary the measurement variance and the length-scale parameter ρ.  Multi-dimensional functions: a transportation application where the value of a driver depends on his location, domicile, and fleet.
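Test functions of this kind can be reproduced in spirit as follows; the grid size, ρ, and the exponent p are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def gp_sample(grid, rho=0.2, p=2.0):
    """Sample a function on a grid from a zero-mean Gaussian process with
    power exponential covariance k(x, x') = exp(-(|x - x'| / rho)**p)."""
    d = np.abs(grid[:, None] - grid[None, :])
    K = np.exp(-((d / rho) ** p))
    K += 1e-8 * np.eye(len(grid))          # jitter for numerical stability
    return np.linalg.cholesky(K) @ rng.standard_normal(len(grid))

grid = np.linspace(0.0, 1.0, 100)
f = gp_sample(grid)                        # one smooth random test function
```

Smaller ρ produces wigglier functions with more local maxima, which makes the learning problem harder.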

ONE-DIMENSIONAL FUNCTIONS

MULTI-DIMENSIONAL FUNCTIONS  HKG finds the best out of 2725 aggregated alternatives in fewer than 1200 measurements in all 25 replications.

CONCLUSIONS. HKG…  extends the knowledge-gradient policy, in a computationally feasible way, to problems where an alternative is described by a multi-dimensional vector.  estimates functions using an appropriately weighted sum of estimates at different levels of aggregation.  exploits the aggregation structure and similarity between alternatives without requiring an explicit covariance matrix for our belief (which also avoids the computational challenge of working with large matrices).  is optimal in the limit, i.e., eventually it always discovers the best alternative.  efficiently maximizes various functions (continuous and discrete). Besides the aggregation structure, it makes no specific assumptions about the structure of the function or the set of alternatives, and it does not require tuning.

FURTHER RESEARCH [1/2]  Hierarchical sampling  HKG requires us to scan all possible measurements before making a decision.  As an alternative, we can use HKG to choose regions to measure at successively finer levels of aggregation.  Because aggregated sets have fewer elements than the disaggregated set, we might gain some computational advantage.  Challenge: what measures to use in an aggregated sampling decision?  Knowledge gradient for approximate dynamic programming  To cope with the exploration-versus-exploitation problem.  Challenge…

FURTHER RESEARCH [2/2]  The challenge is to cope with bias in downstream values  A decision has an impact on the downstream path  A decision has an impact on the value of states in the upstream path (off-policy Monte Carlo learning)

QUESTIONS? Martijn Mes, Assistant Professor, University of Twente, School of Management and Governance, Operational Methods for Production and Logistics, The Netherlands. Contact Phone: Web: