Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning
Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, Koh Hosoda
Presented by: Subarna Sadhukhan
Reinforcement learning
Vision-based reinforcement learning by which a robot learns to shoot a ball into a goal; the aim is a method that automatically acquires strategies for this task.
The robot and its environment are modeled as two synchronized finite state automata interacting in a discrete-time cyclical process.
Robot: senses the current state and selects an action.
Environment: makes the transition to a new state and generates a reward back to the robot.
Through this interaction the robot learns purposive behavior to achieve a given goal.
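A minimal sketch (not the authors' code) of this synchronized agent/environment loop; the class Environment and the helpers choose_action and run_episode are hypothetical names introduced only for illustration.

```python
import random

class Environment:
    """Toy stand-in for the robot's world: returns a next state and a reward."""
    def __init__(self, states, goal_state):
        self.states = states
        self.goal_state = goal_state

    def step(self, state, action):
        next_state = random.choice(self.states)       # real dynamics are unknown to the robot
        reward = 1.0 if next_state == self.goal_state else 0.0
        return next_state, reward

def choose_action(state, actions):
    return random.choice(actions)                     # placeholder policy

def run_episode(env, actions, start_state, max_steps=100):
    state = start_state
    for _ in range(max_steps):
        action = choose_action(state, actions)        # robot senses the state, selects an action
        state, reward = env.step(state, action)       # environment transitions and emits a reward
        if reward > 0:
            break                                     # goal achieved
    return state
```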
Environment: ball and goal. Robot: mobile, equipped with a camera. Nothing about the system dynamics is known in advance. Assume the robot can discriminate a set S of states and take a set A of actions on the world.
Q-learning
Let Q*(s,a) be the expected return for taking action a in situation s:
Q*(s,a) = r(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q*(s',a')
where T(s,a,s') is the probability of a transition from s to s' under action a, r(s,a) is the reward for the state-action pair (s,a), and γ is the discounting factor.
Since T and r are not known in advance, Q is updated from observed transitions:
Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s',a'))
where r is the actual reward received for taking a, s' is the next state, and α is the learning rate.
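A minimal tabular sketch of the update above, assuming a dictionary-backed Q-table; the names GAMMA, ALPHA, and q_update are illustrative, not from the paper.

```python
# Tabular Q-learning backup: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
from collections import defaultdict

GAMMA = 0.8   # discounting factor (illustrative value)
ALPHA = 0.25  # learning rate (illustrative value)

Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def q_update(state, action, reward, next_state, actions):
    """One Q-learning backup for the observed transition (s, a, r, s')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
```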
State set
Ball: 3 positions × 3 sizes = 9 sub-states; goal: 3 positions × 3 sizes × 3 orientations = 27 sub-states; plus the no-ball and no-goal (lost) cases.
The robot's state is the combination of the ball and goal sub-states.
Action set
Two motors; each motor can be driven forward, stopped, or backward, giving 3 × 3 = 9 actions in all.
State-action deviation problem: a small motion near the observer results in a large change in the image, while a large motion far from the observer results in only a small change, so the same action can cause very different state transitions.
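To make the counts on the two slides above concrete, a small enumeration of the sub-state and action sets; the label strings are illustrative, not the paper's exact encodings.

```python
from itertools import product

# Ball sub-states: position x size, plus a "lost" case; goal sub-states add orientation.
BALL_STATES = list(product(["left", "center", "right"], ["small", "medium", "large"])) + [("lost",)]
GOAL_STATES = list(product(["left", "center", "right"], ["small", "medium", "large"],
                           ["left-oriented", "front", "right-oriented"])) + [("lost",)]
STATES = list(product(BALL_STATES, GOAL_STATES))   # combined ball/goal state

# Actions: each of the two motors can go forward, stop, or back -> 3 x 3 = 9 actions.
MOTOR_COMMANDS = ["forward", "stop", "back"]
ACTIONS = list(product(MOTOR_COMMANDS, MOTOR_COMMANDS))
assert len(ACTIONS) == 9
```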
Learning from Easy Missions
Delayed reinforcement problem: there is no explicit teacher signal, since a reward is received only after the ball is kicked into the goal; r(s,a) = 1 only in the goal state.
Construct the learning schedule so that the robot can learn in easy situations at the early stages and only later in more difficult situations: Learning from Easy Missions (LEM).
Complexity analysis
k states between the start and the goal, m possible actions.
With ordinary Q-learning the reward must be backed up one state at a time by random exploration, so the time to learn the first state near the goal, then the second, and so on grows multiplicatively with m at each step, roughly exponential in k (a rough version of the comparison is sketched below).
With LEM the robot gets a reward within a few steps at every stage, so the total is on the order of m·k.
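A back-of-the-envelope version of this comparison, under the simplifying assumption (mine, not necessarily the paper's exact model) that actions are chosen uniformly at random during exploration:

```latex
% Reaching a goal that is k correct actions away by uniform random choice over m actions
% succeeds with probability m^{-k}, so backing the reward up state by state costs roughly
\begin{align*}
T_{\text{Q-learning}} &\approx \sum_{i=1}^{k} m^{i} = \frac{m^{k+1}-m}{m-1} = O(m^{k}),\\
T_{\text{LEM}}        &\approx \underbrace{m + m + \dots + m}_{k\ \text{stages}} = O(mk).
\end{align*}
```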
Implementing LEM
Rough ordering of easy situations by ball size in the image (small → medium → large), since the size of the ball roughly indicates how close the robot is to achieving the goal.
The state space is categorized into sub-states such as ball size, position, and so on.
n = size of the state space, m = number of ordered sets.
Applying LEM with m ordered sets takes far less learning time than ordinary Q-learning over the whole state space of size n.
When to shift
S1 is the set of states nearest to the goal, S2 the next nearest, and so on.
Shifting from S_{k-1} to S_k occurs when the Q-values over the current set stop changing, i.e. their change over a time interval Δt falls below a threshold, where Δt indicates the number of steps over which the change is measured.
We suppose that the current state set S_{k-1} can transit only to its neighboring sets.
From the previous Q-learning equation, if Q has converged then Q(s,a) = r + γ max_{a'} Q(s',a').
Thus, since r = 1 only at the goal, the converged Q-values of states in S_k are a factor of γ smaller than those in S_{k-1}: Q-values fall off geometrically with distance from the goal (spelled out below), which justifies the easy-to-difficult ordering.
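Spelled out, under the assumption that the episode terminates at the goal (no bootstrapping past it) and that the reward of 1 is received only on the transition into the goal:

```latex
% At convergence the update is a fixed point: Q(s,a) = r + \gamma \max_{a'} Q(s',a').
\begin{align*}
\max_a Q(s_1, a) &= 1 && \text{(one step from the goal)},\\
\max_a Q(s_j, a) &= \gamma \max_a Q(s_{j-1}, a) = \gamma^{\,j-1} && \text{($j$ steps from the goal)},
\end{align*}
% so converged Q-values decrease geometrically with distance from the goal,
% which is what orders the "easy" situations in LEM.
```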
LEM
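A sketch of the LEM schedule described above, tying the ordered state sets and the shift test together; the threshold DELTA, the interval DELTA_T, and the helpers start_in and train_steps are illustrative assumptions, not the authors' implementation.

```python
DELTA = 1e-3     # shift threshold on the change of the summed max-Q values (illustrative)
DELTA_T = 200    # number of steps over which the change is measured (illustrative)

def max_q_sum(Q, state_set, actions):
    """Sum of max_a Q(s,a) over the current 'easy' state set."""
    return sum(max(Q[(s, a)] for a in actions) for s in state_set)

def learn_from_easy_missions(ordered_sets, Q, actions, start_in, train_steps):
    """ordered_sets[0] is S1 (nearest the goal); shift outward once Q stops changing."""
    for state_set in ordered_sets:
        previous = max_q_sum(Q, state_set, actions)
        while True:
            train_steps(start_in(state_set), DELTA_T)    # run Q-learning from an easy start
            current = max_q_sum(Q, state_set, actions)
            if abs(current - previous) < DELTA:          # Q-values in this set have converged
                break                                    # shift to the next (more difficult) set
            previous = current
```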
Experiments