John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center with a lot of slides from Tuomas Sandholm Copyright 2015 Poker
Recognized challenge problem in AI since 1992 [Billings, Schaeffer, …] – Hidden information (other players’ cards) – Uncertainty about future events – Deceptive strategies needed in a good player – Very large game trees NBC National Heads-Up Poker Championship 2013
Heads-Up Limit Texas Hold’em – Bots surpassed pros in 2008 [U. Alberta Poker Research Group] – “Essentially solved” in 2015 [Bowling et al.] (Photos: 2008, AAAI-07)
Heads-Up No-Limit Texas Hold’em Annual Computer Poker Competition Claudico Tartanian7
Heads-up no-limit Texas Hold’em Thanks Microsoft!
Texas Hold’em poker – 2-player Limit has ~10^18 nodes – 2-player No-Limit has ~ nodes – Losslessly abstracted game too big to solve => abstract more => lossy (Figure labels: Nature deals 2 cards to each player; Nature deals 3 shared cards; Nature deals 1 shared card; Round of betting)
Game Tree (figure) – Rounds: PreFlop, Flop, Turn, River – Actions: Bet, Call, Fold, Raise, with P2 to respond – 22,100 possible – Payoff at any leaf
Our approach [Gilpin & Sandholm EC-06, J. of the ACM 2007…] – Now used basically by all competitive Texas Hold’em programs – Pipeline (figure): Original game -> Automated abstraction -> Abstracted game -> Custom equilibrium-finding algorithm -> Nash equilibrium -> Reverse model -> Nash equilibrium – Foreshadowed by Shi & Littman 01, Billings et al. IJCAI
Compute Strategy Nash Equilibrium – “Defensive” strategy (doesn’t try to learn or to exploit opponent flaws) – No worse than a tie (on average over many hands) – Neither player can hope to improve their expected utility through a unilateral strategy change. Too hard to solve completely here, so we use an approximation that will converge to this… Counterfactual Regret Minimization (CFR) – Builds on regret matching, invented in 2000 (Hart and Mas-Colell)! – Predominant strategy since ~2006
Abstraction – Mostly about how to bin similar situations – A spade 4 flush is kind of like a heart 4 flush – Clustering into “buckets” (k-means in this case, though not at all the only choice) – 169 pre-flop buckets – 60 public flop buckets – 500 private buckets for turn and river – Down to 5.5^15 nodes
More pruning – Sampling – Monte Carlo: Sqr(N) – Imperfect recall – No longer conforms to original convergence criteria – Not obvious that this is a big win – Empirical results show that it is – Indexing scheme – Accounts for suit isomorphisms
CFR – Regret: how much better could we have done with some other action instead of this one? – Break down overall regret into regret at each step (actually each information set) – Information sets: sets of game states that the controlling player cannot distinguish, and so must choose actions for all such states with the same distribution – For example, the first player to act does not know which cards the other players were dealt, so all game states immediately following the deal where the first player holds the same cards are in the same information set – Weight this regret by the (iteratively recalculated) probability of the opponent reaching this set – Average overall regret is less than the sum of this Immediate Counterfactual Regret – So, if we minimize the immediate regret, we approach a Nash equilibrium
Serial? So far we have been describing a serial algorithm. Do we need to parallelize to – scale up the tree? – iterate to an accurate solution? History would suggest yes…
Scalability of (near-)equilibrium finding in 2-player 0-sum games (timeline figure) – AAAI poker competition announced – Koller & Pfeffer: using sequence form & LP (simplex) – Billings et al.: LP (CPLEX interior point method) – Gilpin & Sandholm: LP (CPLEX interior point method) – Gilpin, Hoda, Peña & Sandholm: Scalable EGT – Gilpin, Sandholm & Sørensen: Scalable EGT – Zinkevich et al.: Counterfactual regret
Blacklight Obvious starting point – Fairly easy threaded code (OpenMP) – World’s largest shared memory machine NUMA
Game Tree (figure), PreFlop level – Actions: Bet, Call, Fold, Raise, with P2 to respond – 22,100 possible – Annotation: OpenMP
Game Tree (figure), PreFlop level – Actions: Bet, Call, Fold, Raise, with P2 to respond – 22,100 possible – Annotation: subtrees each labeled MPI (OpenMP)
Hybrid Algorithm
1. Start on head node for pre-flop.
2. Send the current state to each child blade.
3. Each child blade then samples from its bucket and continues the iteration of MCCFR. Within each child blade we use multiple cores. Whenever a child cluster is reached, each core is given the same inputs but uses a different random number seed to select which sample to work on.
4. Once all the child blades complete their part of the iteration, their calculated values are returned to the head blade.
5. The head blade calculates a weighted average of these values, weighting them by the number of choices.
6. The head node then continues its iteration of MCCFR, repeating the process whenever the sample exits the top part, until the iteration is complete.
Hybrid Programming (Most “complex” version: MPI_THREAD_MULTIPLE)

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    // Last thread of PE 0 sends its number to PE 1
    int main(int argc, char* argv[]){
        int provided, myPE, thread, last_thread, data=0, tag=0;
        MPI_Status status;

        // Request full thread support; the level actually granted is returned in provided
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &myPE);

        // thread, last_thread and status are written by every thread, so each gets its own copy;
        // data and tag keep their initial values in every thread
        #pragma omp parallel private(thread, last_thread, status) firstprivate(data, tag)
        {
            thread = omp_get_thread_num();
            last_thread = omp_get_num_threads()-1;

            if ( thread==last_thread && myPE==0 )
                MPI_Send(&thread, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
            else if ( thread==last_thread && myPE==1 )
                MPI_Recv(&data, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);

            printf("PE %d, Thread %d, Data %d\n", myPE, thread, data);
        }

        MPI_Finalize();
        return 0;
    }

    % export OMP_NUM_THREADS=4
    % ibrun -np 3 a.out
    PE 0, Thread 0, Data 0
    PE 1, Thread 0, Data 0
    PE 2, Thread 0, Data 0
    PE 2, Thread 3, Data 0
    PE 0, Thread 3, Data 0
    PE 1, Thread 3, Data 3
    PE 0, Thread 2, Data 0
    PE 2, Thread 2, Data 0
    PE 1, Thread 2, Data 0
    PE 0, Thread 1, Data 0
    PE 1, Thread 1, Data 0
    PE 2, Thread 1, Data 0

Output for 4 threads run on 3 PEs
Hybrid Tradeoffs Easier – Scaling – More asynchronous – Flexible Redistribution Harder – Load balancing (a “Fold” action truncates earlier, for example) – MPI_THREAD_MULTIPLE has potential perf penalties – Debugging
Hybrid Performance Comm/Comp – About 1 ms comm time per iteration – About 15 ms per iteration Time – 960 cores for ~ 2M core hours for man-machine – ~1M to win last (2014) machine tournament* *Last July, a predecessor of Claudico, Tartanian7, won a Heads-up No-limit Texas Hold'em contest against other AI bots at the Association for the Advancement of Artificial Intelligence's 2014 Computer Poker Competition.
[Table: per-session results, Day 1 - A through Day 13, plus a Total row. Columns: Hands, Total Hands, Doug, Dong, Bjorn, Jason, Total Per Session, Cumulative. Cell values did not survive extraction.]
Short form… The competition ran for 80,000 hands of poker, with about $170 million wagered during play, and the humans won by $732,713. Bjorn Li led the humans, finishing up $529,033. Doug Polk finished up $213,671 and Dong Kim up $70,491, while Jason Les finished down $80,482 (the four results sum to the $732,713 total). However, no real money was wagered in the competition. The actual prize money was $100,000, sponsored by Rivers Casino and Microsoft.
How human? – Claudico is Latin for “I limp”! – Claudico makes donk (from “donkey”) bets. – Will commit heavily against a small pot. – Had pros convinced it was learning from them every day! "Limping is for Losers This is the most important fundamental in poker--for every game, for every tournament, every stake: If you are the first player to voluntarily commit chips to the pot, open for a raise. Limping is inevitably a losing play. If you see a person at the table limping, you can be fairly sure he is a bad player. Bottom line: If your hand is worth playing, it is worth raising."
(Near) Future Work k-means on GPU Much bigger IO Win next machine-machine comp (late 2015) so we can… Win next Brains vs. AI!