1
Math 6330: Statistical Consulting Class 11
Tony Cox University of Colorado at Denver Course web site:
2
Course schedule
April 14: Draft of project/term paper due
April 18, 25, May 2: In-class presentations
May 2: Last class
May 4: Final project/paper due by 8:00 PM
3
MAB Thompson sampling (cont.)
4
Thompson sampling and adaptive Bayesian control: Bernoulli trials
Basic idea: Choose each of the k actions according to the probability that it is best
Estimate each arm's success probability via Bayes’ rule; the point estimate is the mean of the posterior distribution
Use beta conjugate prior updating for the “Bernoulli bandit” (0-1 reward, fail/succeed)
Sample from the posterior for each arm, 1, …, k; choose the arm with the highest sampled value. Update and repeat.
S = success, F = failure
Agrawal and Goyal, 2012
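The sample-choose-update loop can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own code: it assumes a uniform Beta(1, 1) prior on every arm, and the function name `thompson_bernoulli`, the arm probabilities, and the seed are all hypothetical choices made for the example.

```python
import random

def thompson_bernoulli(true_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns (pull counts, total reward)."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # S_i: observed successes for arm i
    failures = [0] * k   # F_i: observed failures for arm i
    pulls = [0] * k
    total_reward = 0
    for _ in range(n_rounds):
        # Draw one sample from each arm's Beta(S_i + 1, F_i + 1) posterior
        # (assumes a uniform Beta(1, 1) prior).
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        # Play the arm whose sampled success probability is highest.
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated 0-1 reward from the (unknown to the learner) true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update: a success increments S, a failure increments F.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward

# Hypothetical three-armed bandit; the last arm is best.
pulls, total_reward = thompson_bernoulli([0.2, 0.5, 0.8], n_rounds=2000)
```

Because each arm is chosen when its posterior sample is the largest, an arm is played with exactly the posterior probability that it is best, so exploration fades on its own as the posteriors sharpen.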
5
Thompson sampling: General stochastic (random) rewards
Second idea: Generalize to an arbitrary reward distribution (normalized to the interval [0, 1]) by treating a trial as a “success” with probability equal to its reward
Agrawal and Goyal, 2012
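A hedged sketch of this reduction in Python: an observed reward in [0, 1] is converted to a 0-1 outcome by one extra Bernoulli draw, and that outcome feeds the usual beta update. The helper names `bernoulli_from_reward` and `update_posterior` are hypothetical.

```python
import random

def bernoulli_from_reward(reward, rng):
    """Return 1 with probability equal to `reward`, else 0 (reward must lie in [0, 1])."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must be normalized to [0, 1]")
    return 1 if rng.random() < reward else 0

def update_posterior(successes, failures, arm, reward, rng):
    """Apply the Bernoulli-bandit beta update using the randomized 0-1 outcome."""
    if bernoulli_from_reward(reward, rng):
        successes[arm] += 1
    else:
        failures[arm] += 1

rng = random.Random(1)
successes, failures = [0, 0], [0, 0]
# An observed reward of 0.3 on arm 0 counts as a success about 30% of the time.
for _ in range(20000):
    update_posterior(successes, failures, arm=0, reward=0.3, rng=rng)
success_fraction = successes[0] / (successes[0] + failures[0])
```

The expected value of the randomized outcome equals the arm's expected reward, so the beta posterior still tracks the quantity that matters for choosing between arms.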
6
Thompson sampling with complex online actions
Main idea: Embed simulation-optimization in Thompson sampling loop
= state space, S
Sample the states
Applications: Job scheduling (assigning jobs to machines); web advertising with reward depending on sets of ads shown
Y = observation
h = reward, X = random variable depending on
Updating posteriors can be done efficiently using a sampling-based approach (particle filtering)
Gopalan et al.,
7
Comparing methods
In simulation experiments, Thompson sampling works well with batch updating, even with slowly or occasionally changing rewards and other realistic complexities.
Beats UCB1 in many but not all comparisons
More practical than UCB1 for batch updating because it keeps experimenting (trying actions with some randomness) between updates.
8
MAB variations
Contextual bandits: See signal before acting
Constrained contextual bandits: Actions constrained
Adversarial bandits: Adaptive adversaries (Bubeck and Slivkins, 2012)
Restless bandits: Probabilities change
Gittins index maximizes expected discounted reward, not easy to compute
Correlated bandits
9
Wrap-up on MAB problems
Adaptive Bayesian learning works well in simple environments, including many of practical interest
The resulting rules are *much* simpler to implement than previous methods (e.g., Gittins index policies)
Sampling-based approaches (Thompson, particle filtering, etc.) promote computationally practical “online learning”
10
Wrap-up on adaptive learning
No need for a causal model
Learn act-consequence probabilities and optimal decision rules directly
Assumes a stationary (or slowly changing) decision environment, known choice set, immediate feedback (reward) following action
Works very well when these assumptions are met: low-regret learning is possible
11
Optimal stopping
12
Optimal stopping decision problems
Suppose that a decision-maker (d.m.) faces a random sequence of opportunities
How long to wait for the best one? When to stop and commit to a final choice?
Examples: Selling a house, hiring a new employee, accepting a job offer, replacing a component, shuttering an aging facility, taking a parking spot, etc.
Other optimal stopping problems: Least-cost policies for replacing aging components
13
Hazard functions: Conditional rate of failure given survival so far
Let T = length of life for a component (or person, or time until first occurrence of an event, etc.)
T is a random variable with cdf $F(t) = \Pr(T \le t)$ and survival function $S(t) = 1 - F(t) = \Pr(T > t)$
The pdf for T is then $f(t) = F'(t) = dF(t)/dt$
The hazard function for T is defined as: $h(t) = \lim_{dt \to 0} \Pr(t < T < t + dt \mid T > t)/dt$
$h(t) = f(t)/S(t) = f(t)/[1 - F(t)]$
Interpretation: “instantaneous failure rate”
$h(t)\,dt \approx \Pr(\text{occurs in next } dt \mid \text{survival until } t)$
In discrete time, dt = 1, no limit is taken
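To make the definition concrete, the sketch below evaluates h(t) = f(t)/S(t) for a Weibull lifetime and checks it against the conditional-probability form with a small dt. The Weibull family and the parameter values are assumptions chosen for illustration; the slide does not specify a distribution.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = Pr(T > t) for a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))

def weibull_pdf(t, shape, scale):
    """f(t) = -dS(t)/dt for a Weibull lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t)."""
    return weibull_pdf(t, shape, scale) / weibull_survival(t, shape, scale)

def hazard_numeric(t, shape, scale, dt=1e-6):
    """Pr(t < T < t + dt | T > t) / dt, the definition before taking the limit."""
    s_now = weibull_survival(t, shape, scale)
    s_next = weibull_survival(t + dt, shape, scale)
    return (s_now - s_next) / (dt * s_now)

# shape = 1 is the exponential case (constant hazard); shape = 2 gives increasing hazard.
constant_hazards = [hazard(t, shape=1.0, scale=2.0) for t in (0.5, 1.0, 3.0)]
increasing_hazards = [hazard(t, shape=2.0, scale=2.0) for t in (0.5, 1.0, 3.0)]
```

With shape 1 the hazard is flat at 1/scale, which is the memoryless case; a shape above 1 models wear-out, where the hazard rises with age.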
14
Using hazard functions to guide decisions
The shape of the hazard function can often guide decisions, e.g.:
If h(t) is increasing, then the optimal time to stop is when h(t) reaches a certain threshold
If h(t) is decreasing, then the best decision is either don’t start or else continue until failure occurs
Normal distribution hazard function calculator is at
SPRT and other calculators:
15
Example: optimal age replacement
The lifetime T of a component is a random variable with known distribution
Suppose it costs $10 to replace the component before it fails and $50 to replace it if it fails.
When should the component be voluntarily replaced (if not failed yet)?
Answer can be calculated by minimizing expected average cost per cycle (or equating marginal benefit to marginal cost for continuing), but calculations are detailed and soon get tedious
Alternative: Google “optimal replacement age calculator”
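For readers who want to see the calculation rather than use a calculator, here is a rough numerical sketch. It assumes a Weibull lifetime with shape 2 and scale 10 (hypothetical values; the slide gives no distribution) and uses the standard renewal-reward criterion, expected cost per cycle divided by expected cycle length, as the quantity to minimize. Only the $10 and $50 costs come from the example.

```python
import math

def survival(t, shape=2.0, scale=10.0):
    """S(t) for the assumed Weibull lifetime (hypothetical parameters)."""
    return math.exp(-((t / scale) ** shape))

def cost_rate_by_age(max_age, step, cost_planned=10.0, cost_failure=50.0):
    """Return (age, long-run cost per unit time) for each replacement age on a grid."""
    results = []
    expected_cycle_length = 0.0  # running trapezoid estimate of the integral of S from 0 to age
    n_steps = int(round(max_age / step))
    for i in range(1, n_steps + 1):
        prev_age, age = (i - 1) * step, i * step
        expected_cycle_length += 0.5 * (survival(prev_age) + survival(age)) * step
        s = survival(age)
        # Planned replacement if the component survives to `age`, failure cost otherwise.
        expected_cycle_cost = cost_planned * s + cost_failure * (1.0 - s)
        results.append((age, expected_cycle_cost / expected_cycle_length))
    return results

grid = cost_rate_by_age(max_age=40.0, step=0.01)
best_age, best_rate = min(grid, key=lambda pair: pair[1])
# Replacing only at a very late age approximates a run-to-failure policy.
run_to_failure_rate = grid[-1][1]
```

Because the assumed hazard is increasing, the cost rate has an interior minimum: replacing too early wastes component life, and replacing too late incurs the $50 failure cost too often.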
16
Optimal age replacement calculator
17
Optimal selling of an asset
If offers arrive sequentially from a known distribution and costs of waiting are known, then an optimal decision boundary (blue) can be constructed to maximize EMV
Sell when red line first hits blue decision boundary
W(t) = price series
S(t) = maximum price so far
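A simplified version of this problem has a closed-form boundary and is easy to check numerically. The sketch below assumes independent Uniform(0, 1) offers, a fixed cost paid per offer, no deadline, and no discounting; these are illustrative assumptions, not the setup behind the slide's figure. Under them, the classical result is a constant reservation price r that solves E[max(X - r, 0)] = cost.

```python
def expected_gain_uniform(r):
    """E[max(X - r, 0)] for X ~ Uniform(0, 1) and 0 <= r <= 1."""
    return 0.5 * (1.0 - r) ** 2

def reservation_price(expected_gain, cost, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for the r with expected_gain(r) = cost.

    Assumes expected_gain is decreasing on [lo, hi] and
    expected_gain(hi) <= cost <= expected_gain(lo).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_gain(mid) > cost:
            lo = mid  # waiting for one more offer is still worth its cost: raise the bar
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical waiting cost of 0.02 per offer; sell at the first offer >= r_star.
r_star = reservation_price(expected_gain_uniform, cost=0.02)
```

For the uniform case the equation can be solved by hand, r = 1 - sqrt(2 * cost), which gives a quick check on the bisection result.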
18
Optimal stopping: Variations
Offers arrive sequentially from an unknown distribution
Bayesian updating provides solutions
Time pressure: Must sell by deadline, or fixed number of offers
With or without being able to go back to previous offers
Sell when blue line first hits green decision boundary
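The fixed-number-of-offers variation without recall can be solved by backward induction. The sketch below assumes a known Uniform(0, 1) offer distribution and no waiting cost (illustrative assumptions; the unknown-distribution case would add a Bayesian update at each step). The function name `thresholds_uniform` is hypothetical.

```python
def thresholds_uniform(n_offers):
    """Acceptance thresholds for Uniform(0, 1) offers with a deadline and no recall.

    thresholds[j] is the minimum acceptable offer when j + 1 offers remain,
    counting the one currently on the table.
    """
    thresholds = [0.0]  # last offer: anything must be accepted
    value = 0.5         # expected payoff with one offer left, E[X]
    for _ in range(n_offers - 1):
        # Accept the current offer only if it beats the value of continuing.
        thresholds.append(value)
        # E[max(X, v)] for Uniform(0, 1) equals (1 + v**2) / 2.
        value = 0.5 * (1.0 + value ** 2)
    return thresholds

deadline_thresholds = thresholds_uniform(4)
```

The thresholds fall as the deadline approaches, which is why the decision boundary in a deadline problem slopes downward over time.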
19
Wrap-up on optimal stopping and statistical decision theory
Many valuable decision problems can be solved using the philosophy of simulation-optimization:
Try different decisions, evaluate their probable consequences
Choose the one with the best (EMV or EU-maximizing) probability distribution of consequences
Finding a best decision or decision rule can become very technical
Use appropriate software or on-line calculators
For business applications, understanding how to formulate decision problems and solve them with software can create high value in practice
20
Heuristics and biases