
1 Short reading for Thursday. Job talk at 1:30pm in ETRL 101. Kuka robotics – http://www.kuka-timoboll.com/en/home/

2 Unified View

3 On-line, Tabular TD(λ)

4 Flappy Bird: state space? – http://sarvagyavaish.github.io/FlappyBirdRL/

5 Chapter 9: Generalization and Function Approximation How does experience in parts of the state space help us act over the entire state space? How does function approximation (supervised learning) merge with RL? Function approximator convergence

6 Chapter 9: Generalization and Function Approximation How does experience in parts of the state space help us act over the entire state space? How does function approximation (supervised learning) merge with RL? Function approximator convergence Student comment: "I read it and it mostly makes sense. There are many methods to do [function approximation], most of which made very little sense as explained."

7 Instead of a lookup table for the values of V at time t (V_t), consider some kind of weight vector w_t. E.g., w_t could be the weights in a neural network. Instead of one value (weight) per state, we now update this vector.

8 Insight: Steal from Existing Supervised Learning Methods! Training = {X,Y} Error = target output – actual output

9 TD Backups as Training Examples Recall the TD(0) backup: V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)] As a training example: – Input = features of s_t – Target output = r_{t+1} + γV(s_{t+1})
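To make the supervised-learning view concrete, here is a minimal Python sketch (the helpers `features` and `value_fn` are hypothetical stand-ins for whatever feature extractor and current value estimate are in use): each TD(0) transition yields one (input, target) training pair.

```python
def td0_training_example(s_t, r_next, s_next, value_fn, features, gamma=0.99):
    """Turn one transition (s_t, r_{t+1}, s_{t+1}) into a supervised pair."""
    x = features(s_t)                      # input: feature vector of s_t
    y = r_next + gamma * value_fn(s_next)  # target: r_{t+1} + gamma * V(s_{t+1})
    return x, y
```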

10 What FA methods can we use? In principle, anything! – Neural networks – Decision trees – Multivariate regression – Support Vector Machines – Gaussian Processes – Etc. But, we normally want to – Learn while interacting – Handle nonstationarity – Not take too long or use too much memory – Etc.

11 Perceptron Binary, linear classifier: Rosenblatt, 1957. The perceptron's eventual failure to handle everything (it can only represent linearly separable functions) shifted the field of AI toward symbolic representations. Sum = w_1 x_1 + w_2 x_2 + … + w_n x_n Output is +1 if sum > 0, −1 otherwise Update: w_j = w_j + (target − output) x_j Also, can use x_0 = 1, so that w_0 is a bias
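A minimal sketch of the rule above in Python (assuming ±1 targets; the class shape and learning rate are my own choices, not from the slides):

```python
import numpy as np

class Perceptron:
    """Rosenblatt-style binary linear classifier."""

    def __init__(self, n_inputs, lr=1.0):
        self.w = np.zeros(n_inputs + 1)  # w[0] is the bias (pairs with x_0 = 1)
        self.lr = lr

    def predict(self, x):
        s = self.w @ np.concatenate(([1.0], x))  # w_0*1 + w_1*x_1 + ... + w_n*x_n
        return 1 if s > 0 else -1

    def update(self, x, target):
        # w_j <- w_j + lr * (target - output) * x_j
        output = self.predict(x)
        self.w += self.lr * (target - output) * np.concatenate(([1.0], x))
```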

12 Perceptron Consider a perceptron with 3 weights: x, y, bias

13 Spatial-based Perceptron Weights

14 Neural Networks How do we get around only linear solutions?

15 Neural Networks A multi-layer network of linear perceptrons is still linear. Use non-linear (differentiable) units: the logistic or tanh function.
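For reference, the two squashing functions named above (a trivial sketch):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))  # output in (0, 1)

def tanh(z):
    return np.tanh(z)                # output in (-1, 1), zero-centered
```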


19 Intermission UAS in France – http://www.uasvision.com/2014/02/18/18-year-old-in-nancy-prosecuted-for-uas-video-on-youtube/

20 Gradient Descent w = (w_1, w_2, …, w_n)^T Assume V_t(s) is a sufficiently smooth differentiable function of w, for all states s in S Also, assume that training examples are of the form: (features of s_t, V^π(s_t)) Goal: minimize error on the observed samples

21 w_{t+1} = w_t + α[V^π(s_t) − V_t(s_t)] ∇_{w_t} V_t(s_t), where ∇_{w_t} V_t(s_t) is the vector of partial derivatives of V_t(s_t) with respect to each component of w_t (the gradient).

22 Let J(w) be any function of the weight space. The gradient at any point w_t in this space is: ∇_w J(w_t) = (∂J(w_t)/∂w_1, …, ∂J(w_t)/∂w_n)^T Then, to iteratively move down the gradient: w_{t+1} = w_t − α ∇_w J(w_t) Why do this iteratively? If you could just eliminate the error in one step, why could that be a bad idea?
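A toy numeric sketch of the iterative rule (the objective J is an invented quadratic, not anything from the slides; it also hints at why small steps can beat jumping straight to zero error on each noisy sample):

```python
import numpy as np

w_star = np.array([1.0, -2.0])       # minimum of the toy objective

def grad_J(w):
    # J(w) = ||w - w_star||^2, so the gradient is 2 (w - w_star)
    return 2.0 * (w - w_star)

w = np.zeros(2)
alpha = 0.1
for _ in range(100):
    w = w - alpha * grad_J(w)        # step down the gradient
# w is now close to w_star; with noisy sample gradients, small steps
# average over samples instead of chasing each one exactly.
```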

23 A common goal is to minimize the mean-squared error (MSE) over a distribution d: MSE(w_t) = Σ_{s∈S} d(s) [V^π(s) − V_t(s)]² Why does this make any sense? d is the distribution of states receiving backups (the on- or off-policy distribution)
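Spelled out as code, the objective looks like this (a sketch; `d`, `v_pi`, and `phi` are hypothetical lookups for the state distribution, true values, and features):

```python
import numpy as np

def mse(theta, states, d, v_pi, phi):
    """MSE(theta) = sum_s d(s) * (V_pi(s) - theta^T phi(s))^2."""
    return sum(d[s] * (v_pi[s] - theta @ phi(s)) ** 2 for s in states)
```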

24 (Dmitry) Motivation for choosing MSE: – MSE is the (squared) 2-norm of the error. – 1) The square of the norm is a sum of squares, and its derivative is a linear function, which is good. – 2) QR decomposition gives a clean solution to linear approximation problems: find x which minimizes the 2-norm of A*x − b. Other norms don't have such a simple solution.
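A quick illustration of point 2 (a sketch with made-up data; `qr` and `lstsq` are NumPy's real routines):

```python
import numpy as np

A = np.random.randn(10, 3)
b = np.random.randn(10)

# Solve min_x ||A x - b||_2 via QR: A = Q R, then solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)

# Matches NumPy's built-in least-squares solver.
x_check, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_check)
```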

25 Gradient Descent Each sample gradient is an unbiased estimate of the true gradient. This will converge to a local minimum of the MSE if α decreases appropriately over time.

26 Unfortunately, we don't actually have V^π(s). Instead, we just have an estimate of the target, V_t. If V_t is an unbiased estimate of V^π(s_t), then we'll converge to a local minimum (again with the α caveat).

27 θ_{t+1} = θ_t + α δ_t e_t, where: δ_t = r_{t+1} + γV_t(s_{t+1}) − V_t(s_t) is our normal TD error e_t = γλ e_{t−1} + ∇_θ V_t(s_t) is the vector of eligibility traces θ is the weight vector
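Putting the three pieces together for the linear case (a sketch; the episode format and feature function `phi` are my own conventions, and for linear V the gradient ∇_θ V(s) is just φ(s)):

```python
import numpy as np

def linear_td_lambda(episode, phi, n_features, alpha=0.1, gamma=1.0, lam=0.8):
    """On-line linear TD(lambda) value prediction over one episode.

    episode: list of (s, r, s_next, done) transitions
    phi:     feature function, phi(s) -> array of length n_features
    """
    theta = np.zeros(n_features)            # weight vector
    e = np.zeros(n_features)                # eligibility traces
    for s, r, s_next, done in episode:
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v      # TD error
        e = gamma * lam * e + phi(s)        # trace update (accumulating)
        theta = theta + alpha * delta * e   # weight update
    return theta
```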

28 Note that TD(λ) targets are biased But… we do it anyway

29 Linear Methods Why are these a particularly important type of function approximation? Parameter vector θ_t and a column vector of features φ_s for every state (with the same number of components): V_t(s) = θ_t^T φ_s = Σ_i θ_t(i) φ_s(i)

30 Linear Methods The gradient is simple: ∇_θ V_t(s) = φ_s The error surface for MSE is simple (a single minimum)

31 Coarse coding Generalization between states is based on the overlapping features they activate

32 Size Matters

33 Tile coding
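A rough sketch of tile coding for a 2-D state (the grid size, number of tilings, and diagonal offsets are arbitrary choices here; real implementations usually add hashing):

```python
import numpy as np

def tile_features(x, y, n_tilings=4, tiles_per_dim=8, lo=0.0, hi=1.0):
    """Binary features: one active tile per tiling, each tiling offset
    by a different fraction of a tile width."""
    width = (hi - lo) / tiles_per_dim
    feats = np.zeros(n_tilings * tiles_per_dim * tiles_per_dim)
    for t in range(n_tilings):
        offset = t * width / n_tilings                     # shift each tiling
        i = int((x - lo + offset) / width) % tiles_per_dim
        j = int((y - lo + offset) / width) % tiles_per_dim
        feats[(t * tiles_per_dim + i) * tiles_per_dim + j] = 1.0
    return feats
```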

34 Tile coding, view #2 Consider a game of soccer

35 But, how do you pick the coarseness? Adaptive tile coding IFSA

36 Irregular tilings

37 Radial Basis Functions Instead of binary features, have degrees of activation, e.g. φ_s(i) = exp(−‖s − c_i‖² / (2σ_i²)) Can combine with tile coding!
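A sketch of the graded version (Gaussian activations; the width `sigma` is a made-up default):

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """Activation falls off smoothly with distance from each center,
    instead of being 0/1 as in tile coding."""
    d2 = np.sum((centers - s) ** 2, axis=1)   # squared distances to centers
    return np.exp(-d2 / (2.0 * sigma ** 2))   # in (0, 1], peaks at the center
```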

38 Kanerva Coding: choose prototype states and activate features based on distance from those prototypes. Now, updates depend on the number of features (prototypes), not the number of dimensions. Instance-based methods.
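And the binary, prototype-based variant (a sketch; the distance threshold is arbitrary):

```python
import numpy as np

def kanerva_features(s, prototypes, threshold=0.25):
    """A prototype's feature is active when the state is close enough to it;
    the number of features equals the number of prototypes, independent of
    the dimensionality of s."""
    d = np.linalg.norm(prototypes - s, axis=1)
    return (d < threshold).astype(float)
```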

39–42 Fitted R-Max [Jong and Stone, 2007] Instance-based RL method [Ormoneit & Sen, 2002] Handles continuous state spaces Weights recorded transitions by their distances Plans over a discrete, abstract MDP Example: 2 state variables, 1 action [four animation slides plotting sampled transitions in the x–y state space]

43 Mountain-Car Task

44 3D Mountain Car X: position and velocity Y: position and velocity

45 Control with FA Bootstrapping

46 Efficiency in ML / AI 1. Data efficiency (rate of learning) 2. Computational efficiency (memory, computation, communication) 3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)

47 Todd's work with decision trees Course feedback

