
1 Using Heuristics to Understand Optimal and Human Strategies in Bandit Problems
Shunan Zhang, Michael D. Lee, Miles Munro
University of California, Irvine
Funded by AFOSR award FA

2 Two-armed Bandit Problems
We study decision-making on the explore-exploit tradeoff in bandit problems
When chosen, each of the alternatives returns a reward with a fixed but unknown probability
The goal is to maximize the total number of rewards after a fixed number of trials
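To make the task concrete, here is a minimal Python sketch of a two-armed bandit game; the class name and the example reward rates are illustrative, not taken from the talk:

```python
import random

class TwoArmedBandit:
    """A two-armed bandit game: each alternative pays off with a fixed
    but unknown probability, over a fixed number of trials."""

    def __init__(self, reward_rates, n_trials):
        self.reward_rates = reward_rates   # hidden from the player
        self.n_trials = n_trials

    def play(self, arm):
        """Return 1 (reward) with the chosen arm's rate, else 0."""
        return 1 if random.random() < self.reward_rates[arm] else 0

# A player choosing at random on an 8-trial game.
game = TwoArmedBandit(reward_rates=[0.3, 0.7], n_trials=8)
total = sum(game.play(random.randrange(2)) for _ in range(game.n_trials))
print("total rewards:", total)
```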

3 7 trials left

4 6 trials left

5 5 trials left

6 4 trials left

7 3 trials left

8 2 trials left

9 1 trial left

10 0 trials left

11 The Explore-Exploit Trade-off
Exploration: getting information about less well understood options
Exploitation: making choices known with some certainty to be reasonably good

12 Environment and Trial Size
Environment: the distribution from which the individual reward rates are drawn
Trial size: the length of the game, which tells the player over how many decisions to optimize
Once the environment and trial size are known, an optimal solution can be determined, as the backward-induction sketch below shows
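The optimal solution can be computed by backward induction over Beta-Bernoulli beliefs. A minimal sketch, assuming a known Beta(α, β) environment and reward-maximizing play:

```python
from functools import lru_cache

# Environment prior on reward rates, Beta(ALPHA, BETA); neutral here.
ALPHA, BETA = 1.0, 1.0

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, trials_left):
    """Expected total future reward under optimal play, given each
    alternative's observed successes and failures and the horizon."""
    if trials_left == 0:
        return 0.0
    # Posterior mean reward rate of each alternative, then the
    # expected value of choosing it and continuing optimally.
    p1 = (ALPHA + s1) / (ALPHA + BETA + s1 + f1)
    v1 = (p1 * (1 + value(s1 + 1, f1, s2, f2, trials_left - 1))
          + (1 - p1) * value(s1, f1 + 1, s2, f2, trials_left - 1))
    p2 = (ALPHA + s2) / (ALPHA + BETA + s2 + f2)
    v2 = (p2 * (1 + value(s1, f1, s2 + 1, f2, trials_left - 1))
          + (1 - p2) * value(s1, f1, s2, f2 + 1, trials_left - 1))
    return max(v1, v2)

# Expected rewards for an 8-trial game in a neutral environment.
print(value(0, 0, 0, 0, 8))
```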

13 Plentiful and Scarce Environments
(figure: Beta distributions over reward rates; plentiful environment with prior successes α=3 and prior failures β=1, scarce environment with α=1 and β=3)
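A quick way to see the difference is to sample reward rates from each environment; `random.betavariate` draws from a Beta distribution:

```python
import random

def sample_reward_rates(alpha, beta, n_alternatives=2):
    """Draw each alternative's reward rate from a Beta(alpha, beta)
    environment distribution."""
    return [random.betavariate(alpha, beta) for _ in range(n_alternatives)]

print("plentiful:", sample_reward_rates(3, 1))  # rates tend to be high
print("scarce:   ", sample_reward_rates(1, 3))  # rates tend to be low
```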

14 Heuristic Models
We consider five heuristic models, with different psychological properties
Memory: remembers the results from the past
Horizon: is sensitive to the number of trials remaining

Model                 Memory   Horizon
Win-stay-lose-shift   no       no
ε-greedy              yes      no
ε-decreasing          yes      no
ε-first               yes      yes
Explore-exploit       yes      yes

15 Win-stay-lose-shift
γ is a parameter indicating the ‘accuracy of execution’
If a success, stay with probability γ
If a failure, shift with probability γ
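A minimal sketch of the rule, with `gamma` as the accuracy-of-execution parameter (the two-arm encoding is illustrative):

```python
import random

def wsls_choice(prev_arm, prev_reward, gamma):
    """Win-stay-lose-shift with accuracy of execution gamma (two arms).
    After a success the rule says stay; after a failure it says shift;
    the rule is followed with probability gamma."""
    if prev_arm is None:               # first trial: no history, pick randomly
        return random.randrange(2)
    rule_choice = prev_arm if prev_reward == 1 else 1 - prev_arm
    if random.random() < gamma:
        return rule_choice             # execute the rule
    return 1 - rule_choice             # execution error
```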

16 ε-greedy
The estimated value is calculated for each alternative at each step
The alternative with the higher estimated value is selected with probability 1−ε
Choose randomly with probability ε
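A sketch of the choice rule; the value estimator (success proportion, with 0.5 for an untried alternative) is an assumption, since the slide does not spell it out:

```python
import random

def estimated_values(successes, failures):
    """Observed success proportion per alternative; 0.5 for an
    untried alternative (an assumption, not from the talk)."""
    return [s / (s + f) if s + f > 0 else 0.5
            for s, f in zip(successes, failures)]

def eps_greedy_choice(successes, failures, eps):
    """With probability eps choose randomly; otherwise choose the
    alternative with the higher estimated value."""
    if random.random() < eps:
        return random.randrange(2)
    est = estimated_values(successes, failures)
    return max(range(2), key=lambda a: est[a])
```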

17 ε-decreasing
The probability of choosing the alternative with the lower estimated mean decreases over trials
At the ith trial, the alternative with the higher estimated value is selected with probability 1−ε_i
Choose randomly with probability ε_i, where ε_i decreases over trials (e.g., ε_i = ε/i)
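The same idea with a shrinking exploration rate; the schedule ε_i = ε₀/i is one common choice and an assumption here, reusing `estimated_values` from the sketch above:

```python
import random

def eps_decreasing_choice(successes, failures, eps0, trial):
    """eps-decreasing for two alternatives; trial is 1-based.
    The schedule eps_i = eps0 / i is an assumed common form."""
    eps_i = eps0 / trial
    if random.random() < eps_i:
        return random.randrange(2)                 # explore: random choice
    est = estimated_values(successes, failures)    # from the sketch above
    return max(range(2), key=lambda a: est[a])     # exploit: higher estimate
```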

18 ε-first This heuristic moves between two distinct stages
Pure exploration stage: the first εT trials, choosing randomly
Pure exploitation stage: the remaining (1−ε)T trials, where the alternative with the higher estimated value is selected with probability 1
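A sketch of the two-stage rule, again reusing `estimated_values` from the ε-greedy sketch; `n_trials` is the trial size T:

```python
import random

def eps_first_choice(successes, failures, eps, trial, n_trials):
    """eps-first: random choices for the first eps*T trials, then
    greedy choices for the remaining (1-eps)*T trials."""
    if trial <= eps * n_trials:                    # pure exploration stage
        return random.randrange(2)
    est = estimated_values(successes, failures)    # pure exploitation stage
    return max(range(2), key=lambda a: est[a])
```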

19 New model: Explore-Exploit
We propose a new model, switching from one stage to the other after τ trials
First an ‘Exploration’ stage
Followed by an ‘Exploitation’ stage
(figure: latent choice states Explore/Exploit, Same, and Better/Worse)

20 Implementation We implemented all the heuristics as graphical models, and did Bayesian inference via MCMC
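The graphical-model scripts themselves are not shown in the talk, so as one illustration of the inference, here is a minimal hand-rolled Metropolis sampler for just the win-stay-lose-shift execution parameter γ, under a uniform prior:

```python
import math, random

def wsls_loglik(gamma, arms, rewards):
    """Log-likelihood of a choice sequence under win-stay-lose-shift
    with accuracy of execution gamma (first trial uninformative)."""
    ll = 0.0
    for t in range(1, len(arms)):
        rule_says_stay = rewards[t - 1] == 1
        stayed = arms[t] == arms[t - 1]
        ll += math.log(gamma if stayed == rule_says_stay else 1 - gamma)
    return ll

def posterior_gamma(arms, rewards, n_samples=5000, step=0.05):
    """Metropolis samples from the posterior over gamma, with a
    uniform prior on (0, 1)."""
    g = 0.5
    ll = wsls_loglik(g, arms, rewards)
    samples = []
    for _ in range(n_samples):
        g_new = g + random.gauss(0.0, step)
        if 0.0 < g_new < 1.0:                  # stay inside the prior support
            ll_new = wsls_loglik(g_new, arms, rewards)
            if math.log(random.random()) < ll_new - ll:
                g, ll = g_new, ll_new          # accept the proposal
        samples.append(g)
    return samples

# Example: a short game where the player always follows the rule.
arms    = [0, 0, 1, 1, 1, 0, 0, 0]
rewards = [1, 0, 1, 1, 0, 1, 1, 0]
draws = posterior_gamma(arms, rewards)
print("posterior mean gamma:", sum(draws) / len(draws))
```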

21 Experiments We ran 8 subjects on 300 bandit problems
3 environments: neutral, scarce, and plentiful
2 trial sizes: 8 and 16
50 games per condition
We also ran the optimal model on these versions of the bandit problems
Using these (human and optimal) decision-making data, we fit all 5 heuristic models
To optimal behavior
To individual subject behavior
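The fits on the next slides are reported as agreement; one natural reading, and an assumption here, is the proportion of trials on which the fitted heuristic makes the same choice as the decision-maker:

```python
def agreement(model_choices, observed_choices):
    """Proportion of trials on which the fitted heuristic's choice
    matches the observed (human or optimal) choice."""
    matches = sum(m == o for m, o in zip(model_choices, observed_choices))
    return matches / len(observed_choices)

print(agreement([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```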

22 Heuristics Fit to Optimal Behavior
(figure: agreement with optimal behavior for win-stay-lose-shift, ε-greedy, ε-decreasing, ε-first, and explore-exploit)

23 Heuristics Fit to Human Behavior
(figure: agreement with human behavior for win-stay-lose-shift, ε-greedy, ε-decreasing, ε-first, and explore-exploit)

24 Test of Generalization
(figure: probability of agreement with the optimal player for each heuristic)

25 Test of Generalization
(figure: probability of agreement with human subjects for each heuristic)

26 Understanding Decision-Making
We can compare parameters (like the explore-exploit switch point) for human and optimal decision-making

27 Understanding Decision-Making
(figure: inferred parameter values for human and optimal decision-making)

28 Conclusions
The worst-performing heuristic, in terms of fitting optimal and human data, was win-stay-lose-shift
Suggests people use memory
The best-performing heuristic, in terms of fitting optimal and human data, was our new explore-exploit model
Suggests people are sensitive to the horizon
Most generally, we have shown how heuristic models can help us understand human and optimal decision-making
e.g., we observed many subjects switched to exploitation later than is optimal

29 Acknowledgements
MADLABers: Mark Steyvers, Matt Zeigenfuse, Sheng Kung (Mike) Yi, Pernille Hemmer, James Pooley, Emily Grothe
Our European collaborators: Joachim Vandekerckhove, Eric-Jan Wagenmakers, Ruud Wetzels


36 EE trialwise model
z_i is an indicator of whether the player exploits on trial i
z_i is estimated for each trial, given that each trial is in either an explore or an exploit state
γ is an indicator of the ‘accuracy of execution’
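One possible reading of the trialwise likelihood (the exact form is not on the slide, so treat this as an assumption):

```python
def ee_trialwise_choice_prob(choice, best_arm, z_i, gamma):
    """Probability of the observed choice on trial i given the latent
    exploit indicator z_i and accuracy of execution gamma.
    When exploiting (z_i == 1), the better-estimated arm is chosen
    with probability gamma; when exploring, choices are random."""
    if z_i == 1:
        return gamma if choice == best_arm else 1.0 - gamma
    return 0.5
```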

37 EE trialwise model People switch from exploration to exploitation, but not necessarily at a fixed time within a game of a given size

41 Explore/exploit Model
When ‘same’, each alternative is chosen with probability .5
When ‘better/worse’, the better alternative is chosen with probability γ
When ‘explore/exploit’, explore with probability γ if it is before trial τ, and exploit with probability γ if it is after trial τ
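A sketch of the choice rule; how the three latent states are identified from the data is not spelled out on the slide, so the state tests below are assumptions:

```python
import random

def ee_choice(est, counts, trial, tau, gamma):
    """Latent-state explore/exploit choice rule for two alternatives.
    est holds estimated values; counts holds how often each
    alternative has been tried; the state tests are assumptions."""
    follow = random.random() < gamma
    if est[0] == est[1]:                      # 'same': choose either, p = .5
        return random.randrange(2)
    hi = 0 if est[0] > est[1] else 1          # higher estimated value
    if counts[hi] <= counts[1 - hi]:          # better arm is no better known:
        return hi if follow else 1 - hi       # 'better/worse', just pick it
    # 'explore/exploit': the better arm conflicts with the less-known arm;
    # explore (less-known arm) before trial tau, exploit after it.
    target = (1 - hi) if trial < tau else hi
    return target if follow else 1 - target
```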


47 Win-Stay-Lose-Shift
(figure slides)

56 ε-Greedy
(figure slides)

65 ε-Decreasing
(figure slides)

74 ε-First
(figure slides)

83 Explore-Exploit
(figure slides)

