Artificial Intelligence, Ch. 21: Reinforcement Learning
Rodney Nielsen
Reinforcement Learning
- Introduction
- Passive Reinforcement Learning
- Active Reinforcement Learning
- Generalization in Reinforcement Learning
- Policy Search
- Applications of Reinforcement Learning
Introduction
- Supervised learning requires a label for every example (percept)
- What if we only find out whether we were successful after a series of (state, action, percept) triples?
- No a priori model of the environment or of the reward function
- For example, we receive feedback/a reward (positive: “you win”, or more likely negative: “you lose”) only at the end of a new game we are learning, or perhaps a reward whenever a point is scored
Example: Chess
- Supervised Learning: create labeled examples for numerous representative board states; each labeled example is a feature vector representing the state of the board, with a label indicating what move to make
- Reinforcement Learning: play a game, receive a “reward” at the end for winning or losing, and adjust all executed policy actions accordingly
Example: Robot Grasping
- Supervised Learning: create labeled examples for numerous representative states; a state includes the location, orientation, temperature, ability, operating characteristics, etc. of the body, arm, hand, legs, head, etc., and of the object
- Reinforcement Learning: try to grasp the object, receive a positive (negative) reward at the terminal state for success (failure), and adjust all executed policy actions accordingly; or use partial rewards for getting closer
https://www.youtube.com/watch?v=SbL7ICP-Fx0&index=20&list=PL5nBAYUyJTrM48dViibyi68urttMlUv7e
Example: Helicopter Maneuver
- Extremely difficult to program, but…
- Reinforcement Learning feedback:
  - Crashing (very negative)
  - Shaking (moderate negative)
  - Unstable (moderate negative)
  - Inconsistent with goal (modest negative)
https://www.youtube.com/watch?v=VCdxqn0fcnE
Example: Humanoid Robot Soccer Goal Kicking
- Two state features:
  - x-coordinate of the ball in the camera image
  - Number of mm the foot is shifted out from the hip
- Three actions:
  - Shift leg out
  - Shift leg in
  - Kick
- Rewards (see the sketch after the video links):
  - -1 per shift action
  - -2 for missing
  - -20 for falling
  - +20 for scoring
https://www.youtube.com/watch?v=mRpX9DFCdwI&list=PL5nBAYUyJTrM48dViibyi68urttMlUv7e&index=12
https://www.youtube.com/watch?v=lwc-TYT0tbg
https://www.youtube.com/watch?v=eHFg3RVHWjM
https://www.youtube.com/watch?v=QdQL11uWWcI
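The reward structure above is small enough to write down directly. A rough sketch of how it might be encoded (the function and event names are illustrative, not from the published work):

```python
# Minimal sketch of the kicking task's reward signal, based only on the
# rewards listed on the slide (event names are assumptions).

def reward(event):
    return {
        "shift": -1,    # per shift action
        "miss":  -2,    # kick misses the goal
        "fall":  -20,   # robot falls over
        "score": +20,   # goal scored
    }.get(event, 0)
```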
Boston Dynamics
https://www.youtube.com/watch?v=W1czBcnX1Ww
https://www.youtube.com/watch?v=-h5qpXO3isM
https://www.youtube.com/watch?v=mXI4WWhPn-U
Passive Reinforcement Learning
- π, the agent’s policy, does not change: in state s the agent always executes the fixed action π(s)
- Does not know:
  - Transition model P(s’|s,a)
  - Reward function R(s)
- Percepts: current state s and reward R(s)
- E.g., a trial: (1,1)-.04 → (1,2)-.04 → … → (4,3)+1
Passive Reinforcement Learning
- π(s) is static
- No P(s’|s,a) or R(s)
- Percepts: s, R(s)
- Goal: learn the expected utility Uπ(s)
- Bellman equations for a fixed policy:
  Uπ(s) = R(s) + γ Σ_s’ P(s’|s,π(s)) Uπ(s’)
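An ADP-style passive agent estimates P(s’|s,a) and R(s) from its trials and then solves these fixed-policy Bellman equations for Uπ. A minimal sketch of that evaluation step, assuming the learned model is available as plain Python dicts (the names and signatures are illustrative, not from the text):

```python
# Iterative evaluation of a fixed policy pi, given (estimated) model P and R.
# Repeatedly applies U(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) * U(s')
# until the largest change falls below a tolerance.

def evaluate_policy(states, P, R, pi, gamma=0.9, tol=1e-6):
    """P[s][a] is a dict {s_next: probability}; R[s] is the reward in s; pi[s] is the fixed action."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_u = R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```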
Passive Reinforcement Learning: Temporal-Difference Learning
- TD equation: Uπ(s) ← Uπ(s) + α(R(s) + γ Uπ(s’) − Uπ(s))
- α is the learning rate
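A minimal sketch of this update, assuming utilities are kept in a dict and a transition from s to s’ with reward R(s) has just been observed (names are illustrative):

```python
# TD(0) update for passive RL: nudge U(s) toward the one-step target
# r + gamma * U(s_next) by a fraction alpha (the learning rate).

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```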
Active Reinforcement Learning
- π, the agent’s policy, must be learned
- Must learn a complete model with outcome probabilities for all actions, i.e., the transition model P(s’|s,a), learned just as in the Passive-ADP-Agent
- Learn the optimal action a?
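A minimal sketch of the model-learning part, estimating P(s’|s,a) by counting the observed outcomes of each state–action pair (names are illustrative, not from the text):

```python
# ADP-style model learning: keep counts of observed transitions and turn
# them into relative-frequency estimates of P(s'|s,a).

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]

def record_transition(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_P(s, a):
    total = sum(counts[(s, a)].values())
    if total == 0:
        return {}                                # no data for this pair yet
    return {s_next: n / total for s_next, n in counts[(s, a)].items()}
```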
Active Reinforcement Learning
- Learn the optimal action a
- Exploration vs. exploitation
- Exploitation (greedy agent): maximize reward under the current policy; likely to stick roughly to the first actions that eventually led to success
  - E.g., (1,1)-.04 → (2,1)-.04 → (3,1)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1
- Exploration: test policies currently assumed to be suboptimal
- Stay in your comfort zone vs. seek a better life?
Active Reinforcement Learning
- Learn the optimal action a
- f(u,n): the exploration function
- Greed (preference for high utility estimates u) traded off against curiosity (preference for actions tried only a few times, count n)
- A simple choice (sketched below): f(u,n) = R+ if n < Ne, otherwise u
- R+: optimistic estimate of the best possible reward
- Ne: constant parameter
- The agent will try each action–state pair at least Ne times
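A minimal sketch of that simple exploration function; the particular values of R+ and Ne below are assumptions for illustration only:

```python
# Optimistic exploration function: report an optimistic value R_PLUS until a
# state-action pair has been tried at least N_E times, then fall back to the
# current utility estimate u.

R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed value)
N_E = 5        # minimum number of tries per action-state pair (assumed value)

def exploration_f(u, n):
    return R_PLUS if n < N_E else u
```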
Active Reinforcement Learning: Learning an Action-Utility Function
- Q-Learning
- Q(s,a): the value of taking action a in state s
- TD agents that learn a Q-function do not need a model of P(s’|s,a), either for learning or for action selection
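A minimal sketch of the TD Q-learning update, Q(s,a) ← Q(s,a) + α(R(s) + γ max_a’ Q(s’,a’) − Q(s,a)), assuming a tabular Q stored in a dict (names are illustrative):

```python
# Tabular Q-learning update: after a transition (s, a) -> s_next with reward r,
# move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a').
# Note that no transition model P(s'|s,a) is needed.

from collections import defaultdict

Q = defaultdict(float)            # Q[(s, a)] -> estimated value of a in s

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```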