Pondering Probabilistic Play Policies for Pig
Todd W. Neller, Gettysburg College

Sow What’s This All About?
– The Dice Game “Pig”
– Odds and Ends
– Playing to Win
  – “Piglet”
  – Value Iteration
– Machine Learning

Pig: The Game
Object: First to score 100 points.
On your turn, roll until:
– You roll 1, and score NOTHING.
– You hold, and KEEP the sum.
Simple game → simple strategy? Let’s play…

Playing to Score
Simple odds argument:
– Roll until you risk more than you stand to gain.
– “Hold at 20”: with a turn total of 20,
  – 1/6 of the time you roll a 1 and lose 20 → -20/6 expected
  – 5/6 of the time you gain 4 on average (the mean of 2, 3, 4, 5, 6) → +20/6 expected
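A minimal arithmetic sketch of the hold-at-20 break-even point (the function name and the printed check are illustrative, not from the talk):

```python
# At turn total t, one more roll loses t with probability 1/6 and gains 4 on
# average (the mean of 2, 3, 4, 5, 6) with probability 5/6.

def expected_change(t):
    """Expected change in the turn total from rolling once more."""
    return (5 / 6) * 4 - (1 / 6) * t

for t in (16, 20, 24):
    print(t, round(expected_change(t), 3))   # positive, zero, then negative
```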

Hold at 20?
Is there a situation in which you wouldn’t want to hold at 20?
– Your score: 99; you roll a 2. Hold now: you have already won.
– Another scenario: you: 79, opponent: 99, your turn total stands at 20. Holding leaves you at 99 while your opponent needs only one more point, so keep rolling for the win.

What’s Wrong With Playing to Score?
It’s mathematically optimal! But what are we optimizing?
Playing to score ≠ playing to win.
Optimizing score per turn ≠ optimizing probability of a win.

Piglet
Simpler version of Pig with a coin.
Object: First to score 10 points.
On your turn, flip until:
– You flip tails, and score NOTHING.
– You hold, and KEEP the # of heads.
Even simpler: play to 2 points.

Essential Information
What is the information I need to make a fully informed decision?
– My score
– The opponent’s score
– My “turn score”

A Little Notation
P(i,j,k) = probability of a win, where
– i = my score
– j = the opponent’s score
– k = my turn score
Hold: P(i,j,k) = 1 - P(j, i+k, 0)
Flip: P(i,j,k) = ½(1 - P(j,i,0)) + ½ P(i,j,k+1)

Assume Rationality
To make a smart player, assume a smart opponent. (To make a smarter player, know your opponent.)
P(i,j,k) = max(1 - P(j, i+k, 0), ½(1 - P(j,i,0) + P(i,j,k+1)))
Probability of a win based on best decisions in any state.

The Whole Story
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(1 - P(0,0,0) + P(0,0,2)))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(1 - P(1,0,0) + P(0,1,2)))
P(1,0,0) = max(1 - P(0,1,0), ½(1 - P(0,1,0) + P(1,0,1)))
P(1,1,0) = max(1 - P(1,1,0), ½(1 - P(1,1,0) + P(1,1,1)))

The Whole Story
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(1 - P(0,0,0) + P(0,0,2)))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(1 - P(1,0,0) + P(0,1,2)))
P(1,0,0) = max(1 - P(0,1,0), ½(1 - P(0,1,0) + P(1,0,1)))
P(1,1,0) = max(1 - P(1,1,0), ½(1 - P(1,1,0) + P(1,1,1)))
These are winning states! (P(0,0,2), P(0,1,2), P(1,0,1), and P(1,1,1) each equal 1, since the player’s score plus turn total has reached 2.)

The Whole Story
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(1 - P(0,0,0) + 1))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(1 - P(1,0,0) + 1))
P(1,0,0) = max(1 - P(0,1,0), ½(1 - P(0,1,0) + 1))
P(1,1,0) = max(1 - P(1,1,0), ½(1 - P(1,1,0) + 1))
Simplified…

The Whole Story
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(2 - P(0,0,0)))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(2 - P(1,0,0)))
P(1,0,0) = max(1 - P(0,1,0), ½(2 - P(0,1,0)))
P(1,1,0) = max(1 - P(1,1,0), ½(2 - P(1,1,0)))
And simplified more into a hamsome set of equations…

How to Solve It?
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(2 - P(0,0,0)))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(2 - P(1,0,0)))
P(1,0,0) = max(1 - P(0,1,0), ½(2 - P(0,1,0)))
P(1,1,0) = max(1 - P(1,1,0), ½(2 - P(1,1,0)))
P(0,1,0) depends on P(0,1,1), which depends on P(1,0,0), which depends on P(0,1,0), which depends on …

A System of Pigquations
[Figure: dependency graph between the non-winning states P(0,1,1), P(0,0,1), P(1,1,0), P(0,1,0), P(1,0,0), and P(0,0,0).]

How Bad Is It?
– The intersection of a set of bent hyperplanes in a hypercube.
– In the general case, no known method (read: PhD research).
– Is there a method that works (without being guaranteed to work in general)? Yes! Value Iteration!

Value Iteration
Start out with some values (0’s, 1’s, random #’s).
Do the following until the values converge (stop changing):
– Plug the values into the RHS’s
– Recompute the LHS values
That’s easy. Let’s do it!

Value Iteration
P(0,0,0) = max(1 - P(0,0,0), ½(1 - P(0,0,0) + P(0,0,1)))
P(0,0,1) = max(1 - P(0,1,0), ½(2 - P(0,0,0)))
P(0,1,0) = max(1 - P(1,0,0), ½(1 - P(1,0,0) + P(0,1,1)))
P(0,1,1) = max(1 - P(1,1,0), ½(2 - P(1,0,0)))
P(1,0,0) = max(1 - P(0,1,0), ½(2 - P(0,1,0)))
P(1,1,0) = max(1 - P(1,1,0), ½(2 - P(1,1,0)))
Assume P(i,j,k) is 0 unless it’s a win.
Repeat: compute the RHS’s, assign to the LHS’s.

But That’s GRUNT Work!
So have a computer do it, slacker!
– Not difficult: end of CS1 level.
– Fast! Don’t blink – you’ll miss it.
Optimal play:
– Compute the probabilities.
– Determine flip/hold from the RHS max’s.
– (For our equations, always FLIP.)
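For the curious, a minimal Python sketch of value iteration for Piglet played to 2 points, following the equations above; the variable names and convergence tolerance are illustrative choices, not from the talk:

```python
# Value iteration for Piglet played to a goal of 2 points, mirroring the
# equations above. P[(i, j, k)] estimates the probability that the player
# about to act wins with score i, opponent score j, and turn total k.

GOAL = 2

def win_prob(P, i, j, k):
    """Current estimate, treating any state at or past the goal as a win."""
    return 1.0 if i + k >= GOAL else P[(i, j, k)]

# The six non-winning states of the game to 2 points.
states = [(i, j, k)
          for i in range(GOAL) for j in range(GOAL) for k in range(GOAL - i)]

P = {s: 0.0 for s in states}            # assume 0 unless it's a win
delta = 1.0
while delta > 1e-9:                     # repeat until the values stop changing
    delta = 0.0
    for (i, j, k) in states:
        hold = 1.0 - win_prob(P, j, i + k, 0)
        flip = 0.5 * (1.0 - win_prob(P, j, i, 0)) + 0.5 * win_prob(P, i, j, k + 1)
        new = max(hold, flip)
        delta = max(delta, abs(new - P[(i, j, k)]))
        P[(i, j, k)] = new

print(round(P[(0, 0, 0)], 4))           # the first player's chance of winning
```

Whichever term achieves each max (hold or flip) is the optimal decision in that state; for this tiny game the flip term wins everywhere, matching the slide.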

Piglet Solved
Game to 10.
Play to Score: “Hold at 1”.
Play to Win: [Figure: the optimal policy plotted against your score (“You”) and the opponent’s score (“Opponent”).]

Pig Probabilities
Just like Piglet, but more possible outcomes:
P(i,j,k) = max(1 - P(j, i+k, 0), 1/6 (1 - P(j,i,0) + P(i,j,k+2) + P(i,j,k+3) + P(i,j,k+4) + P(i,j,k+5) + P(i,j,k+6)))
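Carrying the same idea over to Pig, a hedged sketch of the value-iteration update for a single state; the full solver just sweeps this over every non-winning state until nothing changes (names are illustrative):

```python
# One value-iteration update for a Pig state (i, j, k), mirroring the
# equation above. P maps (i, j, k) to the current win-probability estimate.

GOAL = 100

def win_prob(P, i, j, k):
    """Current estimate, treating any state at or past 100 as a win."""
    return 1.0 if i + k >= GOAL else P[(i, j, k)]

def updated_value(P, i, j, k):
    hold = 1.0 - win_prob(P, j, i + k, 0)
    roll = (1.0 / 6.0) * ((1.0 - win_prob(P, j, i, 0)) +
                          sum(win_prob(P, i, j, k + r) for r in range(2, 7)))
    return max(hold, roll)
```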

Solving Pig
– 505,000 such equations.
– Same simple solution method (value iteration).
– Speedup: solve groups of interdependent probabilities.
– Watch and see!

Pig Sow-lution

Reachable States
[Figure: the reachable states, shown for Player 2 Score (j) = 30.]

Reachable States

Sow-lution for Reachable States

Probability Contours

Summary
– Playing to score is not playing to win.
– A simple game is not always simple to play.
– The computer is an exciting power tool for the mind!

When Value Iteration Isn’t Enough
Value Iteration assumes a model of the problem:
– Probabilities of state transitions
– Expected rewards for transitions
Loaded die? Optimal play vs. a suboptimal player? Game rules unknown?

No Model? Then Learn!
Can’t write the equations → can’t solve them.
Must learn from experience!
Reinforcement Learning:
– Learn optimal sequences of actions
– From experience
– Given positive/negative feedback
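As a concrete illustration of learning from experience, here is a small tabular Q-learning sketch for Piglet against a fixed opponent. Q-learning is not named in the talk; it is one standard model-free reinforcement learning method, and the opponent policy, hyperparameters, and names below are all illustrative assumptions:

```python
import random

# Tabular Q-learning for Piglet (first to 10) against a fixed opponent that
# flips once per turn and holds. Reward is 1 for winning, 0 for losing.

GOAL, ALPHA, EPS, EPISODES = 10, 0.1, 0.1, 100_000
ACTIONS = ("flip", "hold")
Q = {}                                   # (my score, opp score, turn total, action) -> value

def q(s, a):
    return Q.get((*s, a), 0.0)

def greedy(s):
    return max(ACTIONS, key=lambda a: q(s, a))

def opponent_turn(score):
    """Fixed opponent: flip once and hold whatever results."""
    return score + (1 if random.random() < 0.5 else 0)

for _ in range(EPISODES):
    s = (0, 0, 0)
    while True:
        i, j, k = s
        a = random.choice(ACTIONS) if random.random() < EPS else greedy(s)
        if a == "hold":
            i, k = i + k, 0
        elif random.random() < 0.5:      # flip came up heads: turn total grows
            k += 1
        else:                            # flip came up tails: turn total is lost
            k = 0
        reward, next_s = 0.0, None
        if i >= GOAL:
            reward = 1.0                 # held with enough points: we win
        elif a == "hold" or k == 0:      # our turn ended: opponent takes a turn
            j = opponent_turn(j)
            if j < GOAL:
                next_s = (i, j, 0)       # otherwise we lose and reward stays 0
        else:
            next_s = (i, j, k)
        target = reward + (q(next_s, greedy(next_s)) if next_s is not None else 0.0)
        Q[(*s, a)] = q(s, a) + ALPHA * (target - q(s, a))
        if next_s is None:
            break
        s = next_s

print(greedy((0, 0, 0)))                 # learned action for the opening state
```

No transition probabilities or equations are written down anywhere: the table is driven entirely by simulated games and the win/loss feedback at the end of each one.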

Clever Mabel the Cat

Mabel claws the new La-Z-Boy → BAD!
Cats hate water → spray bottle = negative reinforcement.
Mabel claws La-Z-Boy → Todd gets up → Todd sprays Mabel → Mabel gets negative feedback.
Mabel learns…

Clever Mabel the Cat
Mabel learns to run when Todd gets up.
Mabel first learns local causality:
– Todd gets up → Todd sprays Mabel
Mabel eventually sees no correlation and learns the indirect cause.
Mabel happily claws the carpet. The End.

Backgammon
Tesauro’s TD-Gammon:
– Reinforcement Learning + Neural Network (memory for learning)
– Learned backgammon through self-play
– Got better than all but a handful of people in the world!
– Downside: took 1.5 million games to learn

Greased Pig
My continuous variant of Pig.
Object: First to score 100 points.
On your turn, generate a random number from 0.5 to 6.5 until:
– Your rounded number is 1, and you score NOTHING.
– You hold, and KEEP the sum.
How does this change things?
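A small simulation sketch of a single Greased Pig turn under a naive hold-at-threshold policy; the threshold, the reading of "keep the sum" as the sum of the raw (unrounded) draws, and the function name are all illustrative assumptions:

```python
import random

def greased_pig_turn(hold_at=20.0):
    """Simulate one Greased Pig turn: draw numbers uniformly from [0.5, 6.5];
    a draw that rounds to 1 ends the turn with nothing, otherwise the raw
    draws accumulate until the running total reaches the hold threshold."""
    total = 0.0
    while total < hold_at:
        x = random.uniform(0.5, 6.5)
        if round(x) == 1:
            return 0.0                   # rounded to 1: the whole turn total is lost
        total += x
    return total

# Rough Monte Carlo estimate of the points scored per turn by this policy.
trials = 100_000
print(sum(greased_pig_turn() for _ in range(trials)) / trials)
```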

Greased Pig Challenges
– Infinite possible game states
– Infinite possible games
– Limited experience
– Limited memory
– Learning and approximation challenge

Summary
– Solving equations can only take you so far (but much farther than we can fathom).
– Machine learning is an exciting area of research that can take us farther.
– The power of computing will increasingly aid in our learning in the future.