Slide 1: Tópicos Especiais em Aprendizagem (Special Topics in Learning)
Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2012
Slide 2: Goal of this Lecture
- Reinforcement Learning:
  - Planning and Learning.
  - Relational Reinforcement Learning.
  - Using heuristics to accelerate RL.
  - Dimensions of RL - conclusions.
- Today's lecture:
  - Chapters 9 and 10 of Sutton & Barto.
  - The Relational RL article, MLJ, 2001.
  - Bianchi's PhD thesis.
- This is a more informative (overview) lecture.
Slide 3: Planning and Learning
Chapter 9 of Sutton & Barto
Slide 4: Objectives
- Use of environment models.
- Integration of planning and learning methods.
Slide 5: Models
- Model: anything the agent can use to predict how the environment will respond to its actions.
  - Distribution model: a description of all possibilities and their probabilities.
  - Sample model: produces sample experiences, e.g. a simulation model.
- Both types of models can be used to produce simulated experience.
- Often sample models are much easier to come by.
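To make the distinction concrete, the sketch below (not from the slides) exposes the same toy dynamics both as a distribution model and as a sample model; the two-state MDP, its probabilities, and all names are illustrative assumptions.

```python
import random

# Distribution model: for every (state, action), the full list of
# (probability, next_state, reward) outcomes -- what DP-style planning needs.
distribution_model = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

def sample_model(state, action):
    """Sample model: returns one (next_state, reward) drawn from the same
    dynamics -- all that simulation-based planning needs."""
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r

print(sample_model("s0", "go"))
```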
Slide 6: Planning
- Planning: any computational process that uses a model to create or improve a policy.
- Planning in AI:
  - state-space planning;
  - plan-space planning (e.g. partial-order planners).
Slide 7: Planning in RL
- We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly;
  - they all apply backups to simulated experience.
Slide 8: Planning (cont.)
- Classical DP methods are state-space planning methods.
- Heuristic search methods are state-space planning methods.
Slide 9: Q-Learning Planning
- A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning.
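The transcript keeps only the name of the algorithm box, so here is a hedged sketch of Random-Sample One-Step Tabular Q-Planning; the toy sample model, the state and action sets, and the step-size and discount constants are illustrative assumptions, not part of the slide.

```python
import random
from collections import defaultdict

STATES, ACTIONS = ["s0", "s1"], ["left", "right"]
ALPHA, GAMMA = 0.1, 0.95          # hypothetical step size and discount
Q = defaultdict(float)

def sample_model(s, a):
    """Toy sample model: random next state, reward 1 only for (s1, right)."""
    s_next = random.choice(STATES)
    r = 1.0 if (s, a) == ("s1", "right") else 0.0
    return s_next, r

for _ in range(5000):
    # 1. Select a state-action pair at random.
    s, a = random.choice(STATES), random.choice(ACTIONS)
    # 2. Query the sample model for a simulated next state and reward.
    s_next, r = sample_model(s, a)
    # 3. Apply a one-step tabular Q-learning backup to the simulated experience.
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```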
Slide 10: Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model;
  - direct RL: to directly improve the value function and policy.
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
Slide 11: Direct vs. Indirect RL
- Indirect methods:
  - make fuller use of experience: get a better policy with fewer environment interactions.
- Direct methods:
  - simpler;
  - not affected by bad models.
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.
Slide 12: The Dyna Architecture (Sutton, 1990)
Slide 13: The Dyna-Q Algorithm
[Algorithm box, with its steps labeled as direct RL, model learning, and planning.]
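Because the algorithm box did not survive extraction, the following is a minimal sketch of one tabular Dyna-Q step with the three roles from the slide marked in comments; env_step, the action set, and all constants are hypothetical stand-ins rather than the slide's own code.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]          # hypothetical action set
ALPHA, GAMMA, EPSILON, N_PLANNING = 0.1, 0.95, 0.1, 5
Q = defaultdict(float)
model = {}   # (s, a) -> (s_next, r), learned from real experience

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_step(s, env_step):
    """One interaction with a real environment given as env_step(s, a) -> (s_next, r)."""
    a = epsilon_greedy(s)
    s_next, r = env_step(s, a)      # act in the real environment
    q_update(s, a, r, s_next)       # direct RL
    model[(s, a)] = (s_next, r)     # model learning
    for _ in range(N_PLANNING):     # planning: N simulated one-step backups
        ps, pa = random.choice(list(model))
        ps_next, pr = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)
    return s_next
```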
Slide 14: Dyna-Q on a Simple Maze
- Rewards are 0 until the goal is reached, where the reward is 1.
Slide 15: Dyna-Q Snapshots: Midway in the 2nd Episode
Slide 16: Prioritized Sweeping
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change;
  - when a new backup occurs, insert its predecessors according to their priorities;
  - always perform backups from the first pair in the queue.
- Moore and Atkeson, 1993; Peng and Williams, 1993.
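A hedged sketch of the bookkeeping those bullets describe: a max-priority queue of state-action pairs ordered by how much their values would change if backed up. The deterministic learned model, the predecessors table, actions_in, and the threshold THETA are illustrative assumptions.

```python
import heapq
from collections import defaultdict

ALPHA, GAMMA, THETA, N_PLANNING = 0.1, 0.95, 1e-4, 10
Q = defaultdict(float)
model = {}                       # (s, a) -> (s_next, r), learned so far
predecessors = defaultdict(set)  # s -> set of (s_prev, a_prev) leading to s
pqueue = []                      # max-priority queue via negated priorities

def actions_in(s):
    return ["up", "down", "left", "right"]   # hypothetical action set

def priority(s, a):
    """How much Q(s, a) would change if backed up now."""
    s_next, r = model[(s, a)]
    best_next = max(Q[(s_next, b)] for b in actions_in(s_next))
    return abs(r + GAMMA * best_next - Q[(s, a)])

def planning_sweep():
    """Always back up the first (highest-priority) pair in the queue."""
    for _ in range(N_PLANNING):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)
        s_next, r = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions_in(s_next))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        # Insert the predecessors of s whose estimated change is large enough.
        for (sp, ap) in predecessors[s]:
            p = priority(sp, ap)
            if p > THETA:
                heapq.heappush(pqueue, (-p, (sp, ap)))

def after_real_step(s, a, s_next, r):
    """Update the model, queue the visited pair if worthwhile, then plan."""
    model[(s, a)] = (s_next, r)
    predecessors[s_next].add((s, a))
    p = priority(s, a)
    if p > THETA:
        heapq.heappush(pqueue, (-p, (s, a)))
    planning_sweep()
```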
Slide 17: Prioritized Sweeping
Slide 18: Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environment interaction.
Slide 19: Summary
- Emphasized the close relationship between planning and learning.
- Important distinction between distribution models and sample models.
- Looked at some ways to integrate planning and learning:
  - synergy among planning, acting, and model learning.
Slide 20: Summary
- Distribution of backups: where to focus the computation:
  - trajectory sampling: backup along trajectories;
  - prioritized sweeping;
  - heuristic search.
- Size of backups: full vs. sample; deep vs. shallow.
Slide 21: Relational Reinforcement Learning
Based on a lecture by Tayfun Gürel
Slide 22: Relational Representations
- In most applications the state space is too large.
- Generalization over states is essential.
- Many states are similar in some respects.
- Representations for RL have to be enriched to support generalization.
- RRL was proposed for this (Dzeroski, De Raedt, Blockeel, 1998).
Slide 23: Blocks World
- An action: move(a,b)
  - precondition: clear(a) ∧ clear(b)
- An example state:
  - s1 = {clear(b), clear(a), on(b,c), on(c,floor), on(a,floor)}
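As a concrete illustration (not from the slides), the state s1 and the move action can be written in Python as a set of ground facts, mirroring the relational notation; the move helper and its handling of the clear/on facts are assumptions about a deterministic blocks world.

```python
# Blocks-world state s1 from the slide, as a set of ground facts.
s1 = {("clear", "b"), ("clear", "a"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}

def move(state, x, y):
    """Apply move(x, y) if the precondition clear(x) and clear(y) holds."""
    if ("clear", x) not in state or ("clear", y) not in state:
        return None                          # precondition violated
    below = next(f[2] for f in state if f[0] == "on" and f[1] == x)
    new_state = set(state)
    new_state.remove(("on", x, below))
    new_state.add(("on", x, y))
    if below != "floor":
        new_state.add(("clear", below))      # the block under x becomes clear
    if y != "floor":
        new_state.discard(("clear", y))      # y is no longer clear
    return new_state

print(sorted(move(s1, "a", "b")))            # a is now on b
```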
Slide 24: Relational Representations
- With relational representations for states:
  - Abstraction from details, e.g. a learned rule such as
    q_value(0.72) :- goal_unstack, numberofblocks(A), action_move(B,C), height(D,E), E=2, on(C,D), !.
- Flexible to goal changes:
  - Retraining from the beginning is not necessary.
- Transfer of experience to more complex domains.
Slide 25: Relational Reinforcement Learning
- How does it work?
- It is an integration of RL with ILP:
- Do forever:
  - Use Q-learning to generate sample Q-values for sample state-action pairs.
  - Generalize them using ILP (in this case, TILDE).
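A minimal sketch of the loop just described, assuming hypothetical stand-ins run_q_learning_episode (which plays one episode of Q-learning and returns the sampled (state, action, Q) examples) and tilde_regression_tree (which generalizes them); neither function is defined on the slides.

```python
def relational_rl(n_episodes, run_q_learning_episode, tilde_regression_tree):
    """Alternate Q-learning sampling with ILP generalization, as on the slide."""
    examples = []      # accumulated (relational_state, action, q_value) triples
    q_tree = None      # current generalized Q-function (a logical tree)
    for _ in range(n_episodes):
        # 1. Q-learning on one episode, using the current tree (if any)
        #    to estimate Q-values for states not seen before.
        examples.extend(run_q_learning_episode(q_tree))
        # 2. Generalize all collected examples with ILP (TILDE) into a new tree.
        q_tree = tilde_regression_tree(examples)
    return q_tree
```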
Slide 26: TILDE (Top-Down Induction of Logical Decision Trees)
- A generalization of the Q-values is represented by a logical decision tree.
- Logical decision trees:
  - Nodes are first-order logic atoms (Prolog queries as tests), e.g. on(A, c): is there any block on c?
  - Training data is a relational database or a Prolog knowledge base.
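To illustrate "Prolog queries as tests", here is a small sketch (an assumption for illustration, not TILDE itself) of a logical decision tree evaluated over a set of ground facts, using the slide's example test on(A, c).

```python
def on_anything_c(state):
    """The atom on(A, c): is there any block on c?"""
    return any(f[0] == "on" and f[2] == "c" for f in state)

# A one-test tree: internal nodes are (test, yes_subtree, no_subtree); leaves are values.
tree = (on_anything_c, "c is covered", "c is clear")

def classify(tree, state):
    if not isinstance(tree, tuple):
        return tree                       # leaf: return its value
    test, yes_branch, no_branch = tree
    return classify(yes_branch if test(state) else no_branch, state)

s1 = {("clear", "b"), ("clear", "a"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}
print(classify(tree, s1))                 # -> "c is covered", since on(b, c) holds
```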
Slide 27: Logical Decision Tree vs. Decision Tree
[Figure: a decision tree and a logical decision tree, each deciding whether the blocks are stacked.]
Slide 28: TILDE Algorithm
- Declarative bias: e.g. on(+,-).
- Background knowledge: a Prolog program.
[The slide shows an example fragment of such a program.]
Slide 29: Q-RRL Algorithm
[Figure: examples generated by Q-RRL learning, and the Q-RRL algorithm.]
Slide 30: Logical Regression Tree
[Figure: a logical regression tree generated by TILDE-RT and the equivalent Prolog program.]
Slide 31: Experiments
- Tested for three different goals:
  1. one-stack
  2. on(a,b)
  3. unstack
- Also tested for the following:
  - fixed number of blocks;
  - number of blocks changed after learning;
  - number of blocks changed while learning.
- P-RRL vs. Q-RRL.
Slide 32: Results: Fixed Number of Blocks
- Accuracy: percentage of correctly classified (s,a) pairs (optimal vs. non-optimal).
[Figure: accuracy of the learned policies compared with the accuracy of random policies.]
Slide 33: Results: Fixed Number of Blocks
Slide 34: Results: Evaluating Learned Policies on a Varying Number of Blocks
Slide 35: Conclusion
- RRL has satisfactory initial results but needs more research.
- RRL is more successful when the number of blocks is increased (it generalizes better to more complex domains).
- Theoretical research proving why it works is still missing.
Slide 36: Usando Heurísticas para Aceleração do AR (Using Heuristics to Accelerate RL) - pdf