Shunan Zhang, Michael D. Lee, Miles Munro

Presentation transcript:

Using Heuristics to Understand Optimal and Human Strategies in Bandit Problems
Shunan Zhang, Michael D. Lee, Miles Munro
University of California, Irvine
Funded by AFOSR award FA9550-07-1-0082

Two-Armed Bandit Problems
We study decision-making on the explore vs. exploit trade-off in bandit problems. When chosen, each of the alternatives returns a reward with a fixed but unknown probability. The goal is to maximize the total number of rewards earned over a fixed number of trials.
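To make the setup concrete, here is a minimal Python sketch of one such game (not from the original slides); the policy(history, t, n_trials) interface is an illustrative choice that the heuristic sketches further down reuse.

```python
import random

def play_game(reward_probs, n_trials, policy):
    """Play one two-armed bandit game.

    reward_probs: the true (but unknown to the player) reward probability of each arm.
    n_trials:     the fixed number of trials in the game.
    policy:       called as policy(history, t, n_trials), where history is a list of
                  (arm, reward) pairs for the trials played so far; returns 0 or 1.
    Returns the total number of rewards earned.
    """
    history = []
    total_reward = 0
    for t in range(n_trials):
        arm = policy(history, t, n_trials)
        reward = int(random.random() < reward_probs[arm])
        history.append((arm, reward))
        total_reward += reward
    return total_reward

# Example: a player choosing completely at random on a short 8-trial game.
random_policy = lambda history, t, n_trials: random.randrange(2)
print(play_game([0.7, 0.3], n_trials=8, policy=random_policy))
```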

[Demonstration slides: an example game is played out trial by trial, counting down from 7 trials left to 0 trials left.]

The Explore-Exploit Trade-off
Exploration: getting information about less well understood options.
Exploitation: making choices known with some certainty to be reasonably good.

Environment and Trial Size
Environment: the distribution from which the individual reward rates are drawn.
Trial size: the length of the game, which tells the player over how many decisions to optimize.
Once the environment and trial size are known, an optimal solution can be determined.

Plentiful and Scarce Environments
Reward rates are drawn from a Beta distribution. A plentiful environment has prior successes α = 3 and prior failures β = 1; a scarce environment has prior successes α = 1 and prior failures β = 3.
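A small illustration (again not from the slides) of drawing reward rates for the two environments with the Beta parameters given above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Plentiful environment: reward rates drawn from Beta(alpha=3, beta=1), so high rates are likely.
plentiful_rates = rng.beta(3, 1, size=(1000, 2))

# Scarce environment: reward rates drawn from Beta(alpha=1, beta=3), so low rates are likely.
scarce_rates = rng.beta(1, 3, size=(1000, 2))

print(plentiful_rates.mean(), scarce_rates.mean())   # roughly 0.75 vs. 0.25
```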

Heuristic Models
We consider five heuristic models with different psychological properties.
Memory: remembers the results from the past.
Horizon: is sensitive to the number of trials remaining.

Model                 Memory   Horizon
Win-stay-lose-shift   no       no
ε-greedy              yes      no
ε-decreasing          yes      no
ε-first               yes      yes
Explore-exploit       yes      yes

Win-stay-lose-shift
γ is a parameter indicating the "accuracy of execution". After a success, stay with probability γ; after a failure, shift with probability γ.
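A minimal sketch of this rule under the policy interface from the game sketch above; writing the accuracy-of-execution parameter as gamma is a reconstruction, since the symbol itself did not survive in the transcript.

```python
import random

def win_stay_lose_shift(gamma):
    """Return a WSLS policy with accuracy-of-execution gamma."""
    def policy(history, t, n_trials):
        if not history:                       # first trial: no information, pick at random
            return random.randrange(2)
        last_arm, last_reward = history[-1]
        if last_reward == 1:
            # success: stay with probability gamma, otherwise shift
            return last_arm if random.random() < gamma else 1 - last_arm
        # failure: shift with probability gamma, otherwise stay
        return 1 - last_arm if random.random() < gamma else last_arm
    return policy

# Usage with the earlier game sketch:
# play_game([0.7, 0.3], n_trials=16, policy=win_stay_lose_shift(0.9))
```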

ε-greedy
The estimated value is calculated for each alternative at each step. The alternative with the higher estimated value is selected with probability 1 − ε; choose randomly with probability ε.
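A sketch of ε-greedy in the same style; the estimated_value helper uses a Beta(1, 1) posterior-mean estimate, which is an assumption since the slides do not specify the estimator.

```python
import random

def estimated_value(history, arm):
    """Posterior-mean estimate of an arm's reward rate under a Beta(1, 1) prior
    (an assumption; the slides only say 'estimated value')."""
    rewards = sum(r for a, r in history if a == arm)
    pulls = sum(1 for a, r in history if a == arm)
    return (rewards + 1) / (pulls + 2)

def epsilon_greedy(epsilon):
    """With probability 1 - epsilon choose the arm with the higher estimated value,
    otherwise choose at random."""
    def policy(history, t, n_trials):
        if random.random() < epsilon:
            return random.randrange(2)
        values = [estimated_value(history, a) for a in (0, 1)]
        return 0 if values[0] >= values[1] else 1
    return policy
```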

ε-decreasing
The probability of choosing the alternative with the lower estimated mean decreases over trials. At the i-th trial, the alternative with the higher estimated value is selected with probability 1 − εᵢ; choose randomly with probability εᵢ, where εᵢ decreases over trials (e.g., εᵢ = ε/i).
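A sketch assuming the ε/i schedule mentioned above (the exact schedule is an assumption); it reuses estimated_value from the ε-greedy sketch.

```python
import random

def epsilon_decreasing(epsilon0):
    """Like epsilon-greedy, but the exploration probability shrinks as the game
    progresses; epsilon_i = epsilon0 / i is the schedule assumed here."""
    def policy(history, t, n_trials):
        epsilon_i = epsilon0 / (t + 1)        # t is 0-based, so trial i = t + 1
        if random.random() < epsilon_i:
            return random.randrange(2)
        values = [estimated_value(history, a) for a in (0, 1)]   # helper from the e-greedy sketch
        return 0 if values[0] >= values[1] else 1
    return policy
```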

ε-first
This heuristic moves between two distinct stages.
Pure exploration stage: the first εN trials, choosing randomly.
Pure exploitation stage: the remaining (1 − ε)N trials, in which the alternative with the higher estimated value is selected with probability 1.
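A sketch of ε-first, again reusing estimated_value from the ε-greedy sketch:

```python
import random

def epsilon_first(epsilon):
    """Explore at random for the first epsilon * n_trials trials, then always take
    the arm with the higher estimated value for the rest of the game."""
    def policy(history, t, n_trials):
        if t < epsilon * n_trials:            # pure exploration stage
            return random.randrange(2)
        values = [estimated_value(history, a) for a in (0, 1)]   # helper from the e-greedy sketch
        return 0 if values[0] >= values[1] else 1
    return policy
```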

New model: Explore-Exploit
We propose a new model that switches from one stage to another after τ trials: first an "Exploration" stage, followed by an "Exploitation" stage. On each trial the choice depends on the state of the two alternatives: Same, Better/Worse, or Explore/Exploit.
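A sketch of one possible reading of this model; the count-based classification into "same", "better/worse", and "explore/exploit" states is an assumption, since the slides only name the states.

```python
import random

def explore_exploit(tau, gamma):
    """tau-switch sketch: explore before trial tau, exploit afterwards, with
    accuracy-of-execution gamma. The state classification below is assumed."""
    def policy(history, t, n_trials):
        succ, fail = [0, 0], [0, 0]
        for arm, reward in history:
            succ[arm] += reward
            fail[arm] += 1 - reward
        if succ[0] == succ[1] and fail[0] == fail[1]:
            return random.randrange(2)                 # 'same': no reason to prefer either arm
        for a, b in ((0, 1), (1, 0)):
            # 'better/worse': arm a dominates (at least as many successes, no more failures)
            if succ[a] >= succ[b] and fail[a] <= fail[b]:
                return a if random.random() < gamma else b
        # 'explore/exploit': one arm looks better, the other has been tried less
        better = 0 if succ[0] > succ[1] else 1
        lesser_known = 1 - better
        target = lesser_known if t < tau else better   # explore before trial tau, exploit after
        return target if random.random() < gamma else 1 - target
    return policy
```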

Implementation
We implemented all the heuristics as graphical models and performed Bayesian inference via MCMC.
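The graphical models themselves are not shown in the transcript. As a rough stand-in, here is a minimal Metropolis sampler (an illustrative assumption, not the authors' implementation) for the accuracy-of-execution parameter of win-stay lose-shift, treating each choice after the first as either consistent or inconsistent with the deterministic rule.

```python
import math
import random

def wsls_log_likelihood(gamma, outcomes):
    """outcomes: list of booleans, True if the choice on that trial followed the
    deterministic WSLS rule; each follows the rule with probability gamma."""
    if not 0 < gamma < 1:
        return float("-inf")
    return sum(math.log(gamma) if ok else math.log(1 - gamma) for ok in outcomes)

def metropolis_gamma(outcomes, n_samples=5000, step=0.05):
    """Uniform(0, 1) prior on gamma; Gaussian random-walk proposals."""
    gamma = 0.5
    current_ll = wsls_log_likelihood(gamma, outcomes)
    samples = []
    for _ in range(n_samples):
        proposal = gamma + random.gauss(0, step)
        proposal_ll = wsls_log_likelihood(proposal, outcomes)
        if math.log(random.random()) < proposal_ll - current_ll:
            gamma, current_ll = proposal, proposal_ll
        samples.append(gamma)
    return samples

# Example: 40 choices, 33 of which were WSLS-consistent.
fake_outcomes = [True] * 33 + [False] * 7
samples = metropolis_gamma(fake_outcomes)
print(sum(samples) / len(samples))   # posterior mean of gamma, roughly 33/40
```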

Experiments
We ran 8 subjects on 300 bandit problems: 3 environments (neutral, scarce, and plentiful) × 2 trial sizes (8 and 16) × 50 games. We also ran the optimal model on these versions of the bandit problems. Using these (human and optimal) decision-making data, we fit all five heuristic models to optimal behavior and to individual subject behavior.

Heuristics Fit to Optimal Behavior
[Figure: agreement with optimal behavior for Win-stay Lose-shift, ε-greedy, ε-decreasing, ε-first, and Explore-exploit.]

Heuristics Fit to Human Behavior
[Figure: agreement with human behavior for Win-stay Lose-shift, ε-greedy, ε-decreasing, ε-first, and Explore-exploit.]

Test of Generalization: Optimal Player
[Figure: probability of agreement with the optimal player for Win-stay Lose-shift, ε-greedy, ε-decreasing, ε-first, and Explore-exploit.]

Test of Generalization: Human Subjects
[Figure: probability of agreement with human subjects for Win-stay Lose-shift, ε-greedy, ε-decreasing, ε-first, and Explore-exploit.]

Understanding Decision-Making
We can compare parameters (like the explore-exploit switch point) for human and optimal decision-making.

Understanding Decision-Making
[Figure-only slides comparing parameter values for human and optimal decision-making.]

Conclusions
The worst-performing heuristic, in terms of fitting optimal and human data, was win-stay lose-shift; this suggests people use memory. The best-performing heuristic, in terms of fitting optimal and human data, was our new explore-exploit model; this suggests people are sensitive to the horizon. Most generally, we have shown how heuristic models can help us understand human and optimal decision-making: for example, we observed that many subjects switched to exploitation later than is optimal.

Acknowledgements
MADLABers: Mark Steyvers, Matt Zeigenfuse, Sheng Kung (Mike) Yi, Pernille Hemmer, James Pooley, Emily Grothe.
Our European collaborators: Joachim Vandekerckhove, Eric-Jan Wagenmakers, Ruud Wetzels.

EE trial-wise model
zᵢ is an indicator of whether the player exploits on trial i. zᵢ is estimated for each trial, so there is an explore vs. exploit state on every trial. γ is the "accuracy of execution" parameter.

EE trial-wise model
People switch from exploration to exploitation, but not necessarily at a fixed point in time, given the size of the game.

Explore/Exploit Model
When "same", each alternative is chosen with probability .5. When "better/worse", the better alternative is chosen with probability γ. When "explore/exploit", explore with probability γ if it is before trial τ, and exploit with probability γ if it is after trial τ.

[The remaining slides are figure-only; the surviving titles are: Win-Stay Lose-Shift, ε-Greedy, ε-Decreasing, ε-First, Explore-Exploit.]