Learning optimal behavior


1 Learning optimal behavior
Twan van Laarhoven

2 AIBO robot walking
Paper: "Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion", Nate Kohl, Peter Stone (2004)
Goal: walking speed
Why would you want this?

3

4 Parameterization
12 parameters: front ellipse, rear ellipse, body height, etc.
Front locus (3 parameters: height, x-pos., y-pos.)
Rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
Height of the front of the body
Height of the rear of the body
Time each foot takes to move through its locus
Fraction of time each foot spends on the ground
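
A minimal sketch of how this parameterization could be laid out in code; the field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class GaitParameters:
    """The 12 parameters of the Aibo walk (field names are illustrative)."""
    front_locus_height: float
    front_locus_x: float
    front_locus_y: float
    rear_locus_height: float
    rear_locus_x: float
    rear_locus_y: float
    locus_length: float
    locus_skew: float        # skew multiplier in the x-y plane, for turning
    front_body_height: float
    rear_body_height: float
    foot_move_time: float    # time each foot takes to move through its locus
    ground_fraction: float   # fraction of time each foot spends on the ground
```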

5 Learning
No simulator: every evaluation is a test run on the actual AIBO, which is expensive
Not an MDP: Q-learning does not apply
So: Gradient Reinforcement Learning

6 Gradient Reinforcement Learning
Parameter vector: π = {θ1, …, θN}
Generate random policies: Ri = {θ1 + Δ1, …, θN + ΔN}, each Δn chosen from {−εn, 0, +εn}
Group the evaluations per parameter n: S−ε,n, S0,n, S+ε,n
Average each group: Avg−ε,n = avgscore(S−ε,n), likewise Avg0,n and Avg+ε,n
Adjust each parameter: An = 0 if the unperturbed group scores best, otherwise An = Avg+ε,n − Avg−ε,n
Repeat
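
A rough Python sketch of one iteration of this procedure, assuming a score() function that evaluates a parameter vector (e.g. by timing a walk on the robot); the normalization and the step size eta are details not shown on the slide:

```python
import random

def policy_gradient_step(pi, epsilon, eta, num_policies, score):
    """One iteration of the hill-climbing policy-gradient search (sketch).

    pi           -- current parameter vector pi = [theta_1, ..., theta_N]
    epsilon      -- per-parameter perturbation sizes [eps_1, ..., eps_N]
    eta          -- scalar step size (an assumption, not on the slide)
    num_policies -- how many perturbed policies R_i to evaluate
    score        -- callable scoring a parameter vector (e.g. walking speed)
    """
    n_params = len(pi)

    # Generate random policies R_i: each parameter is perturbed by
    # -eps_n, 0 or +eps_n, chosen independently and uniformly.
    deltas = [[random.choice((-1, 0, +1)) for _ in range(n_params)]
              for _ in range(num_policies)]
    scores = [score([p + d * e for p, d, e in zip(pi, delta, epsilon)])
              for delta in deltas]

    adjustment = []
    for n in range(n_params):
        # Group the evaluations into S_{-eps,n}, S_{0,n}, S_{+eps,n} by how
        # parameter n was perturbed, and average the score of each group.
        avg = {}
        for sign in (-1, 0, +1):
            group = [s for s, d in zip(scores, deltas) if d[n] == sign]
            avg[sign] = sum(group) / len(group) if group else 0.0
        # A_n = 0 if leaving the parameter alone scores best,
        # otherwise A_n = Avg_{+eps,n} - Avg_{-eps,n}.
        if avg[0] > avg[+1] and avg[0] > avg[-1]:
            adjustment.append(0.0)
        else:
            adjustment.append(avg[+1] - avg[-1])

    # Normalize the adjustment vector and take a step of size eta.
    norm = sum(a * a for a in adjustment) ** 0.5 or 1.0
    return [p + eta * a / norm for p, a in zip(pi, adjustment)]
```

Because the num_policies evaluations are independent of each other, they can be run on several robots at once, which is the parallelism mentioned in the conclusion.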

7

8 Conclusion
Gradient Reinforcement Learning is very simple and gives good results
Evaluations can be done in parallel

9 Learning from experts
Paper: "Apprenticeship Learning for Motion Planning with Application to Parking Lot Navigation", Pieter Abbeel, Dmitri Dolgov, Andrew Y. Ng, Sebastian Thrun (2008)

10 Parking lot navigation
Path planning
Many cost functions: length, driving backwards, smoothness, going off road, etc.

11 Cost functions
forward length: fwd = Σfwd ||xi − xi−1||
reverse length: rev = Σrev ||xi − xi−1||
off-road: road = Σ¬road(i) ||xi − xi−1||
curvature: curv = Σ (Δxi+1 − Δxi)²
in lane: lane = Σ D(xi, θi, G)
direction: dir = Σ sin²(2(θi − αi))
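
A sketch of the first few of these feature costs in Python, assuming a path is given as a list of (x, y) waypoints plus a per-segment reverse flag, and an off_road() predicate standing in for the map; the lane and direction terms are omitted because they need the lane geometry G and the desired headings αi:

```python
import math

def path_feature_costs(points, reverse_flags, off_road):
    """Per-path feature costs k(s) for the parking-lot planner (sketch).

    points        -- list of (x, y) waypoints
    reverse_flags -- reverse_flags[i] is True if the segment into point i
                     is driven in reverse
    off_road      -- predicate off_road(point) -> bool, stands in for the map
    """
    def dist(a, b):
        return math.hypot(b[0] - a[0], b[1] - a[1])

    fwd = rev = road = curv = 0.0
    for i in range(1, len(points)):
        step = dist(points[i - 1], points[i])
        # forward / reverse driving distance
        if reverse_flags[i]:
            rev += step
        else:
            fwd += step
        # distance driven while off the road surface
        if off_road(points[i]):
            road += step
    # curvature: squared change of the step vector, sum_i (dx_{i+1} - dx_i)^2
    for i in range(1, len(points) - 1):
        ddx = (points[i + 1][0] - points[i][0]) - (points[i][0] - points[i - 1][0])
        ddy = (points[i + 1][1] - points[i][1]) - (points[i][1] - points[i - 1][1])
        curv += ddx * ddx + ddy * ddy
    return {"fwd": fwd, "rev": rev, "road": road, "curv": curv}
```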

12 Path planning
Two-step approach: a coarse A* search, then refinement

13 Cost and paths
Total cost: Φ(s) = Σk wk k(s)
Best path: argmins∈S Φ(s)
Many cost functions: how to weigh them? Learn the weights from example paths
Goal: match the expert's costs: k(s) ≈ k(sE)
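
In code the total cost is just a weighted sum of the per-path feature costs, and the best path is the candidate with the lowest total; the candidate paths, feature values, and weights below are made-up placeholders:

```python
# Hypothetical precomputed feature costs k(s) for three candidate paths.
candidates = {
    "s1": {"fwd": 40.0, "rev": 5.0, "road": 0.0, "curv": 2.5},
    "s2": {"fwd": 35.0, "rev": 0.0, "road": 3.0, "curv": 1.0},
    "s3": {"fwd": 55.0, "rev": 0.0, "road": 0.0, "curv": 0.5},
}
weights = {"fwd": 1.0, "rev": 2.0, "road": 10.0, "curv": 4.0}

def total_cost(costs, w):
    """Phi(s) = sum_k w_k * k(s)."""
    return sum(w[k] * costs[k] for k in w)

# Best path: argmin over the candidate set S.
best = min(candidates, key=lambda s: total_cost(candidates[s], weights))
print(best, total_cost(candidates[best], weights))
```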

14 Apprenticeship learning
Random weights: w(0) = random
Find paths: si = argmins Φ(s)
Sum costs: μ(i)k = Σ k(si)
Find new weights such that w(j+1)k ≥ μk − μEk
Repeat until ||w(j+1)|| ≤ ε
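
A heavily simplified sketch of that loop, assuming a plan(weights) routine that returns the planner's best path under the current weights and a path_costs() routine like the feature-cost sketch above; the plain additive weight update here merely stands in for the max-margin / projection step of the actual algorithm:

```python
import random

def apprenticeship_learning(plan, path_costs, mu_expert, eps, max_iters=50):
    """Simplified sketch of the iterative reweighting loop.

    plan       -- plan(weights) -> path s_i minimizing Phi(s) under `weights`
    path_costs -- path_costs(path) -> dict of summed feature costs mu_k
    mu_expert  -- dict of the expert demonstration's feature costs mu^E_k
    eps        -- stop once the update is smaller than this
    """
    features = list(mu_expert)
    # w(0): random initial weights
    weights = {k: random.random() for k in features}

    for _ in range(max_iters):
        # Find the planner's best path under the current weights
        # and sum its feature costs.
        mu = path_costs(plan(weights))
        # Raise the weight of every feature on which the planner accrues
        # more cost than the expert (mu_k - mu^E_k > 0), lower it otherwise.
        update = {k: mu[k] - mu_expert[k] for k in features}
        if sum(u * u for u in update.values()) ** 0.5 <= eps:
            break  # feature costs (almost) match the expert's
        weights = {k: max(0.0, weights[k] + update[k]) for k in features}
    return weights
```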

15 Results: nice, sloppy, backwards

16 Results: nice, sloppy, backwards

17 Results: nice, sloppy, backwards

18 Conclusion
Always performs (nearly) as well as the expert: ||μ − μE|| ≤ ||w|| ≤ ε
The algorithm is difficult to understand
The paper uses confusing notation

19 EOF

20 More information
"Apprenticeship learning via inverse reinforcement learning", Pieter Abbeel, Andrew Y. Ng
Maximal margin method

21 More information
"Apprenticeship learning via inverse reinforcement learning", Pieter Abbeel, Andrew Y. Ng
Projection method

22 Apprenticeship learning
Random weights: w(0) = random
Find path(s): si = argmins Σk w(i)k k(s)
Sum costs: μ(i)k = Σ k(si)
Find weights: minw,x ||w|| s.t. μk = Σj xj μ(j)k and wk ≥ μk − μEk
Repeat with w(j+1) = w / ||w||
Combine

