1
Artificial Intelligence Ch 21: Reinforcement Learning
Rodney Nielsen
2
Reinforcement Learning
Introduction
Passive Reinforcement Learning
Active Reinforcement Learning
Generalization in Reinforcement Learning
Policy Search
Applications of Reinforcement Learning
3
Introduction
Supervised Learning requires labels for every example (percept).
What if we only know whether we were successful after a series of (state, action, percept) triples?
No a priori model of the environment or reward function.
For example, we receive feedback/a reward (positive: “you win”, or more likely negative: “you lose”) at the end of a new game we are learning. Or maybe a reward when a point is scored.
4
Example: Chess
Supervised Learning: Create labeled examples for numerous representative board positions. Each labeled example is a feature vector representing the state of the board, with a label indicating what move to make.
Reinforcement Learning: Play a game, receive a “reward” at the end for winning or losing, and adjust all executed policy actions accordingly.
5
Example: Robot Grasping
Supervised Learning: Create labeled examples for numerous representative states. State: location, orientation, temperature, ability, operating characteristics, etc. of body, arm, hand, legs, head, etc., and of the object.
Reinforcement Learning: Try to grasp the object, receive a positive (negative) reward at the terminal state for success (failure), and adjust all executed policy actions accordingly. Or partial rewards for getting closer.
6
Example: Helicopter Maneuver
Extremely difficult to program, but…
Reinforcement Learning Feedback:
Crashing (very negative)
Shaking (moderate negative)
Unstable (moderate negative)
Inconsistent with goal (modest negative)
7
Example: Humanoid Robot Soccer
Goal Kicking
Two state features:
x-coordinate of the ball in the camera
Number of mm the foot is shifted out from the hip
Three actions:
Shift leg out
Shift leg in
Kick
Rewards:
-1 per shift action
-2 for missing
-20 for falling
+20 for scoring
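As a minimal sketch (an assumed encoding; the slide lists only the features, actions, and rewards, not the transition dynamics), the goal-kicking problem could be written down for an RL agent roughly like this:

# Hypothetical encoding of the goal-kicking problem described above.
# State: (ball_x_in_camera, foot_shift_mm); dynamics are not specified on the slide.
ACTIONS = ["shift_out", "shift_in", "kick"]

def reward(event):
    """Reward signal as listed: -1 per shift, -2 for missing, -20 for falling, +20 for scoring."""
    return {"shift": -1, "miss": -2, "fall": -20, "score": +20}[event]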
8
Boston Dynamics https://www.youtube.com/watch?v=W1czBcnX1Ww
9
Passive Reinforcement Learning
π, the agent’s policy, is fixed: in state s the agent always executes π(s).
Does not know:
Transition model P(s’|s,a)
Reward function R(s)
Percepts:
Current state s
Reward R(s)
E.g., (1,1)-.04 ~ (1,2)-.04 ~ … ~ (4,3)+1
10
Passive Reinforcement Learning
π(s) is static
No P(s’|s,a) or R(s)
Percepts: s, R(s)
Goal: learn the expected utility Uπ(s)
Bellman equations for a fixed policy:
Uπ(s) = R(s) + γ Σs’ P(s’|s,π(s)) Uπ(s’)
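A minimal sketch of what solving these fixed-policy equations looks like, assuming the model has already been estimated (as an ADP-style agent would do from its percepts); the function name, dictionary layout, and default parameters below are illustrative, not from the slides:

def evaluate_policy(states, P, R, pi, gamma=0.9, tol=1e-6):
    """Iterate Uπ(s) = R(s) + γ Σs' P(s'|s,π(s)) Uπ(s') until convergence.
    P[(s, a)] is a dict mapping each successor state s' to its probability."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            u_new = R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])].items())
            delta = max(delta, abs(u_new - U[s]))
            U[s] = u_new
        if delta < tol:
            return U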
11
Passive Reinforcement Learning
Temporal-Difference Learning
TD Equation: Uπ(s) ← Uπ(s) + α(R(s) + γ Uπ(s’) − Uπ(s))
α is the learning rate
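A minimal sketch of one TD update under these assumptions (utilities stored in a plain dictionary; the function name and default parameter values are illustrative):

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step: Uπ(s) ← Uπ(s) + α(R(s) + γ Uπ(s') − Uπ(s)).
    Only the observed transition s → s' is needed; no transition model."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U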
12
Active Reinforcement Learning
π, the agent’s policy, must be learned
Must learn a complete model (as in the Passive-ADP-Agent):
Transition model P(s’|s,a)
Learn the optimal action a
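A minimal sketch of the model-learning part under these assumptions (a maximum-likelihood estimate of P(s’|s,a) built from counts of observed transitions; the class name and data layout are illustrative):

from collections import Counter, defaultdict

class TransitionModel:
    """Maximum-likelihood estimate of P(s'|s,a) from observed (s, a, s') triples."""
    def __init__(self):
        self.counts = defaultdict(Counter)          # (s, a) -> Counter over s'
    def observe(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1            # record one observed transition
    def prob(self, s_next, s, a):
        n = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / n if n else 0.0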
13
Active Reinforcement Learning
Learn the optimal action a
Exploration vs. exploitation
Exploitation (greedy agent): Maximize reward under the current policy. Likely to stick roughly to the first actions that eventually led to success. E.g., (1,1)-.04 ~ (2,1)-.04 ~ (3,1)-.04 ~ (3,2)-.04 ~ (3,3)-.04 ~~ (4,3)+1
Exploration: Test policies assumed to be suboptimal.
Stay in the comfort zone vs. seek a better life
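One standard way to mix a little exploration into an otherwise greedy agent is ε-greedy action selection; this particular scheme is an illustration, not taken from the slide, and the names and default ε are assumed:

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Mostly exploit the current value estimates, but explore a random action with probability ε."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit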
14
Active Reinforcement Learning
Learn the optimal action a
f(u,n): the exploration function
Greed f(u) traded off against curiosity f(n)
R+: optimistic estimate of the best possible reward
Ne: constant parameter
The agent will try each action–state pair at least Ne times
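A minimal sketch of a simple exploration function of this kind (the specific form below, returning R+ until a pair has been tried Ne times and the utility estimate afterwards, is the usual textbook choice; the default values are illustrative):

def exploration_fn(u, n, R_plus=2.0, Ne=5):
    """f(u, n): be optimistic (return R+) until the (state, action) pair has been
    tried Ne times, then fall back to the current utility estimate u."""
    return R_plus if n < Ne else u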
15
Active Reinforcement Learning
Learning an action-utility function: Q-Learning
Q(s,a): value of action a in state s
TD agents that learn a Q-function do not need a model of P(s’|s,a), either for learning or for action selection
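A minimal sketch of one Q-learning update under these assumptions (Q stored in a dictionary keyed by (state, action); the function name and default parameters are illustrative):

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)).
    Uses only the observed transition; no model of P(s'|s,a) is required."""
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q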