Classifier-Based Approximate Policy Iteration

Presentation transcript:

Classifier-Based Approximate Policy Iteration Alan Fern

Uniform Policy Rollout Algorithm

Rollout[π,h,w](s):
1. For each action ai, run SimQ(s, ai, π, h) w times.
2. Return the action with the best average of the SimQ results.

Each SimQ(s, ai, π, h) trajectory simulates taking action ai and then following π for h-1 steps. [Figure: from state s, w simulated trajectories per action a1 … ak, giving samples q11 … q1w, …, qk1 … qkw of SimQ(s, ai, π, h).]
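
Below is a minimal Python sketch of the rollout procedure above. The simulator interface (a `step(state, action)` method returning a next state and reward) and the base policy `pi` (a function from states to actions) are assumptions for illustration; only the Rollout[π,h,w] structure comes from the slide.

```python
def sim_q(simulator, s, a, pi, h):
    """One sample of SimQ(s, a, pi, h): take action a, then follow pi for h-1 steps,
    returning the total reward along the simulated trajectory."""
    state, r = simulator.step(s, a)          # take the first action a
    total = r
    for _ in range(h - 1):                   # then follow the base policy pi
        state, r = simulator.step(state, pi(state))
        total += r
    return total

def rollout_policy(simulator, s, pi, h, w, actions):
    """Rollout[pi, h, w](s): run SimQ(s, a, pi, h) w times for each action a
    and return the action with the best average result."""
    def avg_q(a):
        return sum(sim_q(simulator, s, a, pi, h) for _ in range(w)) / w
    return max(actions, key=avg_q)
```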

Multi-Stage Rollout

Two-stage rollout computes the rollout policy of the "rollout policy of π", i.e. the trajectories are of SimQ(s, ai, Rollout[π,h,w], h). Each decision of the one-stage rollout policy requires khw simulator calls, so two stages require (khw)² calls to the simulator; in general the cost is exponential in the number of stages. [Figure: rollout tree from state s over actions a1 … ak, with each leaf expanded by another rollout.]
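
A sketch of the nesting, reusing `rollout_policy` from above: each added stage simply uses the previous stage's rollout policy as its base policy, which is where the multiplicative khw cost per stage comes from.

```python
def nested_rollout_policy(simulator, pi, h, w, actions, stages):
    """Stack `stages` levels of rollout on top of the base policy pi.
    Each added stage multiplies the simulator calls per decision by roughly k*h*w."""
    policy = pi
    for _ in range(stages):
        base = policy                         # previous stage becomes the base policy
        policy = lambda s, base=base: rollout_policy(simulator, s, base, h, w, actions)
    return policy
```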

Example: Rollout for Solitaire [Yan et al., NIPS'04]

Player       | Success Rate | Time/Game
Human Expert | 36.6%        | 20 min (naïve)
Base Policy  | 13.05%       | 0.021 sec
1 rollout    | 31.20%       | 0.67 sec
2 rollout    | 47.6%        | 7.13 sec
3 rollout    | 56.83%       | 1.5 min
4 rollout    | 60.51%       | 18 min
5 rollout    | 70.20%       | 1 hour 45 min

Multiple levels of rollout can pay off, but are expensive. Can we somehow get the benefit of multiple levels without the complexity?

Approximate Policy Iteration: Main Idea

- Nested rollout is expensive because the "base policies" (i.e. the nested rollouts themselves) are expensive.
- Suppose that we could approximate a level-one rollout policy with a very fast function (e.g. O(1) time).
- Then we could approximate a level-two rollout policy while paying only the cost of level-one rollout.
- Repeatedly applying this idea leads to approximate policy iteration.

Return to Policy Iteration

Given the current policy π, compute V^π at all states, then choose the best action at each state to obtain an improved policy π'. Approximate policy iteration only computes values and improved actions at some states, and uses those to infer a fast, compact policy over all states. [Figure: cycle from the current policy π, to its value function V^π, to the improved policy π'.]
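
For contrast with the approximate version, here is a minimal tabular policy-iteration sketch. It assumes the MDP is given as explicit dictionaries (P for transitions, R for rewards), which is exactly what large state spaces rule out.

```python
def policy_iteration(states, actions, P, R, gamma=0.95, eval_sweeps=200):
    """Exact policy iteration on a small, explicit MDP.
    P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward."""
    pi = {s: actions[0] for s in states}                 # arbitrary initial policy
    while True:
        # Policy evaluation: compute V^pi at ALL states (iterative sweeps).
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s][pi[s]] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]])
                 for s in states}
        # Policy improvement: choose the best action at EVERY state.
        new_pi = {s: max(actions,
                         key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                  for s in states}
        if new_pi == pi:                                 # converged to a fixed point
            return pi, V
        pi = new_pi
```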

Approximate Policy Iteration

(Technically, rollout only approximates π'.)

1. Generate trajectories of the rollout policy π' (the starting state of each trajectory is drawn from the initial state distribution I).
2. "Learn a fast approximation" of the rollout policy.
3. Loop to step 1, using the learned policy as the new base policy.

What do we mean by "generate trajectories"? [Figure: loop between the current policy π and the improved policy π': sample π' trajectories using rollout, then learn a fast approximation of π'.]

Generating Rollout Trajectories

Get trajectories of the current rollout policy from an initial state: draw the start state s at random from I, then at each visited state run policy rollout over the actions a1 … ak to pick the next action. [Figure: a trajectory unrolled from s, with a policy rollout performed at each step.]

Generating Rollout Trajectories

Get trajectories of the current rollout policy from an initial state. Multiple trajectories differ, since the initial state and the transitions are stochastic. [Figure: several sampled trajectories branching from different start states.]

Generating Rollout Trajectories

Get trajectories of the current rollout policy from an initial state. This results in a set of state-action pairs {(s1,a1), (s2,a2), …, (sn,an)} giving the action selected by the "improved policy" in the states that it visits.

Approximate Policy Iteration

(Technically, rollout only approximates π'.)

1. Generate trajectories of the rollout policy π' (the starting state of each trajectory is drawn from the initial state distribution I).
2. "Learn a fast approximation" of the rollout policy.
3. Loop to step 1, using the learned policy as the new base policy.

What do we mean by "learn an approximation"?

Aside: Classifier Learning

A classifier is a function that labels inputs with class labels. "Learning" classifiers from training data is a well-studied problem (decision trees, support vector machines, neural networks, etc.).

Training data {(x1,c1), (x2,c2), …, (xn,cn)} → Learning Algorithm → Classifier H : X → C

Example problem: xi is an image of a face, ci ∈ {male, female}.
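
Any off-the-shelf classifier learner fits this picture. As an illustration only (the feature vectors and labels below are made up, and scikit-learn is assumed to be available):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: feature vectors x_i with class labels c_i.
X = [[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]]
c = ["female", "female", "male", "male"]

H = DecisionTreeClassifier().fit(X, c)     # learn H : X -> C from {(x_i, c_i)}
print(H.predict([[6.0, 3.0]]))             # label a previously unseen input
```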

Aside: Control Policies are Classifiers

A control policy maps states and goals to actions: π : states → actions.

Training data {(s1,a1), (s2,a2), …, (sn,an)} → Learning Algorithm → Classifier/Policy π : states → actions

Approximate Policy Iteration

1. Generate trajectories of the rollout policy. This results in a training set of state-action pairs along the trajectories, T = {(s1,a1), (s2,a2), …, (sn,an)}.
2. Learn a classifier based on T to approximate the rollout policy.
3. Loop to step 1, using the learned policy as the new base policy.

[Figure: loop between the current policy π and the improved policy π': sample π' trajectories using rollout to obtain the π' training data, then learn a classifier to approximate π'.]
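
Putting the pieces together, here is one way the loop could look in code, reusing the `rollout_policy` sketch from earlier. The simulator, the initial-state sampler, the state featurization, and the choice of a decision tree as the classifier are all assumptions for illustration, not details from the slides.

```python
from sklearn.tree import DecisionTreeClassifier

def approximate_policy_iteration(simulator, draw_initial_state, features, actions,
                                 pi, h, w, n_traj, traj_len, iterations):
    """Each iteration: collect (state, improved action) pairs by running the rollout
    policy from sampled initial states, then fit a fast classifier to imitate it and
    use that classifier as the next base policy."""
    for _ in range(iterations):
        X, y = [], []
        for _ in range(n_traj):
            s = draw_initial_state()                      # s0 drawn from I
            for _ in range(traj_len):
                a = rollout_policy(simulator, s, pi, h, w, actions)
                X.append(features(s))                     # record (s, pi'(s))
                y.append(a)
                s, _ = simulator.step(s, a)               # follow the rollout policy
        clf = DecisionTreeClassifier().fit(X, y)          # fast approximation of pi'
        pi = lambda state, clf=clf: clf.predict([features(state)])[0]
    return pi
```

For the pendulum and blocks-world experiments below, one would plug in the corresponding simulator, action set, and feature map.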

Approximate Policy Iteration

- The hope is that the learned classifier will capture the general structure of the improved policy from examples.
- We want the classifier to quickly select correct actions in states outside of the training data (the classifier should generalize).
- The approach allows us to leverage the large amount of work in machine learning on classification.

API for Inverted Pendulum

Consider the problem of balancing a pole by applying either a positive or a negative force to the cart. The state space is described by the velocity of the cart and the angle of the pendulum. There is noise in the force that is applied, so the problem is stochastic.

Experimental Results

[Figure: a data set from one API iteration; + marks states where the positive action was selected, x marks the negative action (the circles in the figure can be ignored).]

Experimental Results

A support vector machine is used as the classifier (take CS534 for details); it maps any state to + or –. [Figure: learned classifier/policy after 2 iterations, which is near optimal; blue = positive action, red = negative action.]

API for Stacking Blocks

Consider the problem of forming a goal configuration of blocks/crates/etc. from a starting configuration, using basic movements such as pickup, putdown, etc. We also want to handle situations where actions fail and blocks fall.

Experimental Results

The resulting policy is fast and near-optimal. These problems are very hard for more traditional planners.

API in Computer Vision

Summary of API

- Approximate policy iteration is a practical way to select policies in large state spaces.
- It relies on the ability to learn good, compact approximations of improved policies (which must be efficient to execute).
- It relies on the effectiveness of rollout for the problem.
- There are only a few positive theoretical results: convergence in the limit under strict assumptions, and PAC results for a single iteration. But it often works well in practice.