CS 188: Artificial Intelligence Spring 2006 Lecture 23: Games 4/18/2006 Dan Klein – UC Berkeley
Today
Reminder: P3 is due at midnight.
Finish reinforcement learning: function approximation.
Start game playing: minimax search.
Project 2 Contest Results: Naïve Bayes
Runners-up: Chris Crutchfield and Wei Tu (83%)
  Number of curves in the image; ratio of height to width
Runners-up: Danny Guan and Daniel Low (83%)
  Percentage of active pixels; maximum contiguous active pixels per row
Winners: Taylor Berg-Kirkpatrick and Fenna Krienen (84%)
  Color changes across rows and columns
Project 2 Contest Results: Perceptron
Runner-up: Victor Feldman (86% on 1K training)
  Center of mass of all active pixels
Runner-up: Jocelyn Cozzo (91%)
  Percentage of active pixels; randomized prediction on ties
Winners: Taylor Berg-Kirkpatrick and Fenna Krienen (92%)
  Color changes across rows and columns; 25 training iterations
Project 2 Contest Results: Other Approaches
Dan Gillick (94%): nearest-neighbor classifier
  Overlapping pixels; Euclidean distance function
  Only considers a pruned set of training instances that are sufficiently distant from each other
The GSIs (XX%): only 10 minutes of work. How did they do it?
Game Playing in Practice
Checkers: Chinook ended the 40-year reign of human world champion Marion Tinsley in 1994. It used an endgame database defining perfect play for all positions involving 8 or fewer pieces on the board, a total of 443,748,401,247 positions. An exact solution is imminent.
Chess: Deep Blue defeated human world champion Garry Kasparov in a six-game match in 1997. Deep Blue examined 200 million positions per second, used very sophisticated evaluation, and used undisclosed methods for extending some lines of search up to 40 ply.
Othello: human champions refuse to compete against computers, who are too good.
Go: human champions refuse to compete against computers, who are too bad. In Go, b > 300, so most programs use pattern knowledge bases to suggest plausible moves.
Game Playing
Axes: deterministic or not; number of players; perfect information or not.
We want algorithms for calculating a strategy (policy) which recommends a move in each state.
Deterministic Single Player?
Deterministic, single player, perfect information: we know the rules, we know what moves will do, and we have some utility function over outcomes. E.g. Freecell, the 8-Puzzle, Rubik's cube.
... it's (basically) just search!
Slight reinterpretation: calculate the best utility from each node; each node is a max over its children. Note that goal values live on the goal states themselves, not path sums as before. See the sketch below.
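A minimal Python sketch of this reinterpretation, assuming a hypothetical tree-node interface (is_terminal, utility, children) that is not from the slides:

def best_utility(node):
    # Terminal (goal) nodes carry their utility directly on the node.
    if node.is_terminal():
        return node.utility()
    # An internal node is worth the best utility among its children.
    return max(best_utility(child) for child in node.children())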
Stochastic Single Player
What if we don't know what the result of an action will be? E.g. solitaire, minesweeper, trying to drive home.
... it's just an MDP! We can also do expectimax search:
Chance nodes are like action nodes, except the environment controls which outcome is chosen.
Calculate a utility for each node: max nodes as in search, chance nodes take expectations of their children. See the sketch below.
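A matching expectimax sketch under the same hypothetical interface, where children_with_probs() is assumed to yield (child, probability) pairs at chance nodes:

def expectimax(node):
    if node.is_terminal():
        return node.utility()
    if node.is_max_node():
        # Agent's turn: pick the best child.
        return max(expectimax(child) for child in node.children())
    # Chance node: the environment picks, so take the expectation.
    return sum(p * expectimax(child) for child, p in node.children_with_probs())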
Deterministic Two Player (Turns)
E.g. tic-tac-toe. Minimax search:
Basically a state-space search tree; each layer, or ply, alternates players.
Choose the move to the position with the highest minimax value = the best achievable utility against best play.
Zero-sum games: one player maximizes the result, the other minimizes it.
Minimax Example
Minimax Search
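A minimal minimax sketch for a two-player zero-sum game, assuming a hypothetical state interface (is_terminal, utility, successors, to_move, moves, result) rather than any particular game:

def minimax_value(state):
    if state.is_terminal():
        return state.utility()              # utility from MAX's point of view
    values = [minimax_value(s) for s in state.successors()]
    return max(values) if state.to_move() == 'MAX' else min(values)

def minimax_decision(state):
    # MAX chooses the move whose resulting position has the highest minimax value.
    return max(state.moves(), key=lambda m: minimax_value(state.result(m)))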
Minimax Properties
Optimal against a perfect player. Otherwise?
Time complexity? O(b^m). Space complexity? O(bm).
For chess, b ≈ 35 and m ≈ 100, so an exact solution is completely infeasible.
But do we need to explore the whole tree?
Multi-Player Games
Similar to minimax, but utilities are now tuples.
Each player maximizes their own entry at each node.
Propagate (or back up) tuples from the children. See the sketch below.
(Example leaf utility tuples from the slide's tree: (1,2,6), (4,3,2), (6,1,2), (7,4,1), (5,1,1), (1,5,2), (7,7,1), (5,4,5).)
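A sketch of the tuple-valued backup, using the same hypothetical state interface; to_move() is assumed to return the index of the player choosing at this node:

def multiplayer_value(state):
    if state.is_terminal():
        return state.utility_tuple()        # e.g. (u0, u1, u2) for three players
    i = state.to_move()
    child_values = [multiplayer_value(s) for s in state.successors()]
    # The player to move picks the child whose tuple is best in their own entry.
    return max(child_values, key=lambda tup: tup[i])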
Games with Chance
E.g. backgammon. Expectiminimax search!
The environment is an extra player that moves after each agent.
Chance nodes take expectations; otherwise it is like minimax. See the sketch below.
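A sketch of expectiminimax under the same hypothetical interface, adding chance nodes (e.g. dice rolls) between the players' turns:

def expectiminimax(state):
    if state.is_terminal():
        return state.utility()
    if state.is_chance_node():
        # Expectation over the environment's outcomes (e.g. dice rolls).
        return sum(p * expectiminimax(s) for s, p in state.outcomes_with_probs())
    values = [expectiminimax(s) for s in state.successors()]
    return max(values) if state.to_move() == 'MAX' else min(values)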
Games with Chance
Dice rolls increase b: there are 21 possible rolls with 2 dice, and backgammon has about 20 legal moves, so depth 4 gives 20 × (21 × 20)^3 ≈ 1.2 × 10^9 positions.
As depth increases, the probability of reaching a given node shrinks, so the value of lookahead is diminished and limiting depth is less damaging. But pruning is less possible...
TD-Gammon uses depth-2 search + a very good evaluation function + reinforcement learning: world-champion level play.
Games with Hidden Information
Imperfect information: e.g. card games, where the opponent's initial cards are unknown.
Typically we can calculate a probability for each possible deal, so it seems just like having one big dice roll at the beginning of the game.
Idea: compute the minimax value of each action in each deal, then choose the action with the highest expected value over all deals.
Special case: if an action is optimal for all deals, it's optimal.
GIB, the current best bridge program, approximates this idea by 1) generating 100 deals consistent with the bidding information and 2) picking the action that wins the most tricks on average.
Drawback to this approach? It's broken! (Though useful in practice.)
Averaging over Deals is Broken
Road A leads to a small heap of gold pieces. Road B leads to a fork: take the left fork and you'll find a mound of jewels; take the right fork and you'll be run over by a bus.
Road A leads to a small heap of gold pieces. Road B leads to a fork: take the left fork and you'll be run over by a bus; take the right fork and you'll find a mound of jewels.
Road A leads to a small heap of gold pieces. Road B leads to a fork: guess correctly and you'll find a mound of jewels; guess incorrectly and you'll be run over by a bus.
Efficient Search
Several options:
Pruning: avoid regions of the search tree which will never enter into (optimal) play.
Limited depth: don't search very far into the future; approximate utility with a value function (familiar?).
Next Class
More game playing: pruning, limited-depth search, and the connection to reinforcement learning!
α-β Pruning Example
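A minimal alpha-beta sketch, previewing next lecture's pruning idea on the same hypothetical state interface: it returns the same value as minimax but skips branches that cannot affect the decision.

def alphabeta(state, alpha=float('-inf'), beta=float('inf')):
    if state.is_terminal():
        return state.utility()
    if state.to_move() == 'MAX':
        value = float('-inf')
        for s in state.successors():
            value = max(value, alphabeta(s, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:               # MIN above would never allow this line
                break
        return value
    value = float('inf')
    for s in state.successors():
        value = min(value, alphabeta(s, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                   # MAX above would never allow this line
            break
    return value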
Q-Learning
Model-free TD learning with Q-functions:
Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_{a'} Q(s',a')]
See the sketch below.
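A sketch of that update in Python, assuming Q is a dictionary keyed by (state, action) with entries already initialized, and alpha and gamma are the learning rate and discount:

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Sample estimate of the optimal future value from the observed transition.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Blend the old estimate toward the sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample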
Function Approximation
Problem: it is too slow to learn each state's utility one by one.
Solution: what we learn about one state should generalize to similar states. This is very much like supervised learning.
If states are treated entirely independently, we can only learn on very small state spaces.
Discretization
We can put states into buckets of various sizes, e.g. all angles between 0 and 5 degrees share the same Q estimate. See the sketch below.
Buckets too fine: takes a long time to learn.
Buckets too coarse: learns suboptimal, often jerky control.
Real systems that use discretization usually require clever bucketing schemes: adaptive sizes, tile coding. [DEMOS]
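A sketch of the uniform bucketing idea from the example above (5-degree buckets), so that nearby angles share one Q estimate:

def angle_bucket(angle_degrees, bucket_size=5):
    # Angles 0-4.99 map to bucket 0, 5-9.99 to bucket 1, and so on.
    return int(angle_degrees // bucket_size)

# All states whose angle lands in the same bucket index look up the same entry
# in the Q table, so experience in one state generalizes to its neighbors.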
Linear Value Functions
Another option: values are linear functions of features of states (or state-action pairs).
This is good if you can describe states well using a few features (e.g. board evaluations for game playing).
Now we only have to learn a few weights rather than a value for each state. See the sketch below.
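A sketch of a linear Q-function over features; the feature names and weights here are purely illustrative, not from the slides:

def linear_q(weights, features):
    # weights and features are dicts keyed by feature name.
    return sum(weights.get(f, 0.0) * value for f, value in features.items())

# Illustrative board-evaluation style features for some (state, action):
features = {'material_advantage': 2.0, 'mobility': 0.5, 'king_safety': -1.0}
weights  = {'material_advantage': 0.9, 'mobility': 0.3, 'king_safety': 0.4}
q_value = linear_q(weights, features)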
TD Updates for Linear Values
We can use TD learning with linear values (actually, it's just like the perceptron!).
Old Q-learning update:
Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)]
With a linear Q-function, we instead simply update the weights of the features in Q(s,a). See the sketch below.
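A sketch of the corresponding weight update: compute the same TD error as tabular Q-learning, then move each weight in proportion to its feature value, which is what makes it look like a perceptron-style update.

def linear_q_update(weights, features, r, q_sa, max_q_next, alpha=0.1, gamma=0.9):
    # TD error: difference between the observed sample and the current estimate.
    correction = (r + gamma * max_q_next) - q_sa
    for f, value in features.items():
        weights[f] += alpha * correction * value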
Example: TD for Linear Qs