Presentation transcript:

1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2007

2 Objective of this Lecture
- Reinforcement learning: Monte Carlo methods, temporal-difference learning, and eligibility traces.
- Today's lecture covers chapters 5, 6, and 7 of Sutton & Barto.

3 Recalling the previous lecture.

4 What is Reinforcement Learning?
- Learning by interaction.
- Goal-oriented learning.
- Learning about, from, and while interacting with an external environment.
- Learning what to do: how to map situations to actions so as to maximize a numerical reward signal.

5 The Agent in RL
- Situated in time.
- Continual learning and planning.
- Its objective is to affect the environment.
(Diagram: the agent sends an action to the environment, which returns a state and a reward.)

6 Elements of RL
- Policy: what to do.
- Reward: what is good.
- Value: what is good because it predicts reward.
- Model of the environment: what causes what.

7 The Agent-Environment Interface: at each step t the agent observes state s_t, selects action a_t, and receives reward r_{t+1} and next state s_{t+1}, giving the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...

8 The Agent Learns a Policy
- Reinforcement learning methods specify how the agent changes its policy as a result of experience.
- Roughly, the agent's goal is to get as much reward as it can over the long run.

9 Goals and Rewards
- Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
- A goal should specify what we want to achieve, not how we want to achieve it.
- A goal must be outside the agent's direct control, and thus outside the agent.
- The agent must be able to measure success: explicitly, and frequently during its lifespan.

10 Returns
- Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
- The return is the sum of the rewards up to the final time step T, at which a terminal state is reached, ending the episode (formula below).
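The return formula on this slide was an image and is not preserved in the transcript; in standard Sutton & Barto notation it presumably reads

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T .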

11 Important!
- Reward (r_t) and return (R_t) are very different things: the reward is what the agent receives for a single action, while the return is the cumulative reward that follows.
- The quantity we want to maximize is the expected return E{R_t}.

12 Returns for Continuing Tasks
- Continuing tasks: interaction does not break into natural episodes, so we use the discounted return (formula below).
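The discounted-return formula is not preserved in the transcript; in the book's notation it is

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1,

where \gamma is the discount rate (\gamma < 1 for continuing tasks).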

13 The Markov Property
- Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov property (stated below).
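The Markov-property condition itself was an equation that the transcript does not preserve; in standard Sutton & Barto notation it states that the one-step dynamics depend only on the current state and action:

    \Pr\{ s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t, r_t, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t \}.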

14 Defining a Markov Decision Process
- To define a finite MDP, you need to give: the state set S and action sets A(s); the one-step "dynamics" defined by transition probabilities; and the reward expectations (both written out below).
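The two definitions that end in colons above were equations on the slide; in the book's notation they are

    P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\, a_t = a \}
    R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \}.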

15 Value Functions
- The value of a state is the expected return starting from that state; it depends on the agent's policy.
- State-value function for policy π (definition below).
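The state-value function definition is not preserved in the transcript; in the book's notation it is

    V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}.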

16 Value Functions
- The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
- Action-value function for policy π (definition below).
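Likewise, the action-value function definition (not preserved in the transcript) is presumably

    Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s,\, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a \Big\}.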

17 Bellman Equation for a Policy π
- The basic idea: the return can be written recursively, which yields a recursive relationship for V^π, first with and then without the expectation operator (equations below).
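The three equations the slide refers to are, in standard Sutton & Barto notation:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}                      (the basic idea)
    V^{\pi}(s) = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}                                   (so)
    V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]           (without the expectation operator)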

18 Policy Iteration: π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V*, where each first arrow is a policy evaluation step and each second arrow is a policy improvement ("greedification") step.

19 Policy Iteration

20 Value Iteration
- Recall the full policy evaluation backup; the full value iteration backup replaces the expectation over actions under π with a maximization over actions (both written out below).
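The two backups compared on this slide are, in the book's notation,

    policy evaluation:  V_{k+1}(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V_k(s') \big]
    value iteration:    V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V_k(s') \big]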

21 Value Iteration Cont.

22 End of the Review
- Important: the basic concepts should be well understood.
- Problem: DP requires the state-transition model P.
- How can we solve the problem when the model is not known?

23 Monte Carlo Methods: chapter 5 of Sutton & Barto.

24 Monte Carlo Methods
- Monte Carlo methods allow learning from complete sample returns; they are defined for episodic tasks.
- Monte Carlo methods allow learning directly from experience:
  - On-line: no model is needed to reach the optimal solution.
  - Simulated: no complete model is needed.

25 Wikipedia: Monte Carlo Definition
- Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and mathematical systems.
- They are distinguished from other simulation methods (such as molecular dynamics) by being stochastic, usually through the use of random numbers, as opposed to deterministic algorithms.
- Because of the repeated computation and the large number of calculations involved, Monte Carlo methods need substantial computing power.

26 Wikipedia: Monte Carlo Definition
- A Monte Carlo algorithm is a numerical method used to find solutions to mathematical problems (which may have many variables) that cannot easily be solved, for example, by integral calculus or other numerical methods.
- For many types of problems, its efficiency relative to other numerical methods increases as the dimension of the problem increases.

27 Monte Carlo principle
- Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
- Hard to compute analytically, because winning or losing depends on a complex procedure of reorganizing cards.
- Insight: why not just play a few hands and see empirically how many do in fact win?
- More generally, we can approximate a probability density function using only samples from that density.
(Illustration: simulated hands of solitaire (lose, win, lose, ...); the estimated chance of winning is 1 in 4.)

28 Monte Carlo principle
- Given a very large set X and a distribution p(x) over it,
- we draw a set of N samples,
- and we can then approximate the distribution using these samples.
(Figure: the distribution p(x) over the set X.)

29 Monte Carlo principle
- We can also use these samples to compute expectations,
- and even use them to find a maximum.
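The sample-based approximations referred to on slides 28 and 29 are not preserved; a standard way to write them (an assumption about what the slides showed) is

    p(x) \approx \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x), \qquad E_p[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big), \qquad \hat{x} \approx \arg\max_{x^{(i)}} p\big(x^{(i)}\big).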

30 Monte Carlo Example: Approximation of π (the number)
- If a circle of radius r = 1 is inscribed inside a square with side length L = 2, then the ratio of their areas is (π r^2) / L^2 = π/4.

31 MC Example: Approximation of π (the number)
- Inside the square, we place N points at random, with uniformly distributed (x, y) coordinates.
- Now we count how many of the points have fallen inside the circle.

32 MC Example: Approximation of π (the number)
- If N is large enough, the fraction of points that fall inside the circle approaches the ratio of the areas (see below).
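The ratio the slide points to is

    \frac{N_{\text{circle}}}{N} \approx \frac{\text{area of circle}}{\text{area of square}} = \frac{\pi r^2}{L^2} = \frac{\pi}{4}, \qquad \text{so} \quad \pi \approx 4\,\frac{N_{\text{circle}}}{N}.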

33 MC Example: Approximation of π (the number)
- For N = 1000: N_circle = 768, so π ≈ 4 × 768 / 1000 = 3.072, an error of about 0.07.

34 MC Example: Approximation of π (the number)
- For N = 10000: N_circle = 7802, so π ≈ 4 × 7802 / 10000 = 3.1208, an error of about 0.021.

35 MC Example: Approximation of π (the number)
- For a still larger N (the values of N, N_circle, and the estimate are missing from the transcript), the error drops to 0.008.
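A minimal Python sketch of this estimator, only to illustrate the procedure described on slides 30-35 (the sample sizes and seed are arbitrary choices, not from the slides):

    import random

    def estimate_pi(n_samples, seed=0):
        """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]
        and counting how many fall inside the inscribed circle of radius 1."""
        rng = random.Random(seed)
        inside = 0
        for _ in range(n_samples):
            x = rng.uniform(-1.0, 1.0)
            y = rng.uniform(-1.0, 1.0)
            if x * x + y * y <= 1.0:
                inside += 1
        return 4.0 * inside / n_samples

    print(estimate_pi(1_000))    # rough estimate, error typically around 10^-2 to 10^-1
    print(estimate_pi(100_000))  # usually within about 10^-2 of 3.14159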

36 Monte Carlo Policy Evaluation
- Goal: learn V^π(s).
- Given: some number of episodes under π which contain s.
- Idea: average the returns observed after visits to s.

37 Monte Carlo Policy Evaluation
- Every-visit MC: average the returns for every time s is visited in an episode.
- First-visit MC: average the returns only for the first time s is visited in an episode.
- Both converge asymptotically.

38 First-visit Monte Carlo policy evaluation
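The algorithm box for this slide is not reproduced in the transcript; the following is a minimal Python sketch of first-visit MC policy evaluation under an assumed episode format (a list of (state, reward) pairs), not the slide's own pseudocode:

    from collections import defaultdict

    def first_visit_mc_evaluation(episodes, gamma=1.0):
        """Estimate V(s) by averaging the return observed after the first
        visit to s in each episode. Each episode is assumed to be a list of
        (state, reward) pairs, the reward being received on leaving that state."""
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        for episode in episodes:
            # Compute the return G_t following each time step, working backwards.
            g = 0.0
            returns = []
            for state, reward in reversed(episode):
                g = reward + gamma * g
                returns.append((state, g))
            returns.reverse()
            # Keep only the return that follows the first visit to each state.
            seen = set()
            for state, g in returns:
                if state not in seen:
                    seen.add(state)
                    returns_sum[state] += g
                    returns_count[state] += 1
        return {s: returns_sum[s] / returns_count[s] for s in returns_sum}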

39 Blackjack example
- Objective: have your card sum be greater than the dealer's without exceeding 21.
- States (200 of them): current sum (12-21), dealer's showing card (ace-10), and whether I have a usable ace.
- Reward: +1 for winning, 0 for a draw, -1 for losing.
- Actions: stick (stop receiving cards), hit (receive another card).
- Policy: stick if my sum is 20 or 21, else hit.

40 Blackjack value functions

41 Backup diagram for Monte Carlo
- The entire episode is included.
- Only one choice at each state (unlike DP).
- MC does not bootstrap.
- The time required to estimate one state does not depend on the total number of states.

42 Monte Carlo Estimation of Action Values (Q)
- Monte Carlo is most useful when a model is not available: we want to learn Q*.
- Q^π(s, a): the average return starting from state s, taking action a, and thereafter following π.
- It also converges asymptotically if every state-action pair is visited.
- Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

43 Monte Carlo Control
- MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
- Policy improvement step: greedify with respect to the value (or action-value) function.

44 Convergence of MC Control
- The policy improvement theorem tells us that the greedified policy is at least as good as the old one (inequality below).
- This assumes exploring starts and an infinite number of episodes for MC policy evaluation.
- To work around the latter: update only to a given level of performance, or alternate between evaluation and improvement per episode.
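The inequality chain shown on the slide is presumably the standard one: for the greedy policy π_{k+1} with respect to Q^{π_k},

    Q^{\pi_k}\big(s, \pi_{k+1}(s)\big) = \max_{a} Q^{\pi_k}(s,a) \;\ge\; Q^{\pi_k}\big(s, \pi_k(s)\big) = V^{\pi_k}(s) \quad \text{for all } s,

so π_{k+1} is at least as good as π_k.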

45 Monte Carlo Exploring Starts
- The fixed point is the optimal policy π*.
- A formal proof of convergence is still an open question.

46 Blackjack example continued
- Exploring starts.
- Initial policy as described before.

47 On-policy Monte Carlo Control
- On-policy: learn about the policy currently being executed.
- How do we get rid of exploring starts? We need soft policies, with π(s,a) > 0 for all s and a, e.g. an ε-soft policy (probabilities below).
- Similar to GPI: move the policy towards the greedy policy (i.e., ε-greedy).
- Converges to the best ε-soft policy.
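The ε-greedy member of the ε-soft class (the probabilities the slide lists) is

    \pi(s,a) = 1 - \varepsilon + \frac{\varepsilon}{|A(s)|} \ \text{for the greedy action}, \qquad \pi(s,a) = \frac{\varepsilon}{|A(s)|} \ \text{for all other actions},

and an ε-soft policy is any policy with \pi(s,a) \ge \varepsilon / |A(s)| for all s and a.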

48 On-policy Monte Carlo Control

49

50 On-policy MC Control
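Slides 48-50 contain the on-policy MC control algorithm box, which the transcript does not preserve; below is a minimal Python sketch of ε-greedy (ε-soft) on-policy MC control under an assumed environment interface (env.reset(), env.step(a) returning (next_state, reward, done), and env.actions(s)), not the book's own pseudocode:

    import random
    from collections import defaultdict

    def on_policy_mc_control(env, n_episodes, gamma=1.0, epsilon=0.1, seed=0):
        """On-policy first-visit MC control with an epsilon-greedy policy.
        The environment interface is an assumption, not taken from the slides."""
        rng = random.Random(seed)
        q = defaultdict(float)     # Q(s, a) estimates
        counts = defaultdict(int)  # visit counts for incremental averaging

        def policy(state):
            actions = env.actions(state)
            if rng.random() < epsilon:
                return rng.choice(actions)                    # explore
            return max(actions, key=lambda a: q[(state, a)])  # exploit

        for _ in range(n_episodes):
            # Generate an episode following the current epsilon-soft policy.
            episode, state, done = [], env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                episode.append((state, action, reward))
                state = next_state
            # First-visit MC update of Q, working backwards through the episode.
            g = 0.0
            for t in range(len(episode) - 1, -1, -1):
                s, a, r = episode[t]
                g = r + gamma * g
                if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):
                    counts[(s, a)] += 1
                    q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
        return q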

51 Learning π while following π′

52 Learning π while following π′

53 Off-policy Monte Carlo control
- Recall that the distinguishing feature of on-policy methods is that they estimate the value of a policy while using it for control.
- In off-policy methods these two functions are separated: the behavior policy generates behavior in the environment, while the estimation policy is the policy being learned about.
- Returns observed under the behavior policy are averaged after being weighted by their relative probability under the estimation policy (importance sampling; see below).
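The weighting formula is not preserved; in the first edition's notation the weight attached to a return observed after time t in an episode generated by the behavior policy π′ is presumably the standard importance-sampling ratio

    w_t = \prod_{k=t}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},

and the value estimate is the weighted average of the observed returns with these weights.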

54 Off-policy MC control

55 Incremental Implementation
- MC can be implemented incrementally, which saves memory.
- Compute the weighted average of the returns incrementally instead of storing all of them (incremental and non-incremental forms below).
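The incremental and non-incremental forms compared on the slide are, reconstructed in the book's notation rather than copied from the slide,

    non-incremental:  V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}
    incremental:      V_{n+1} = V_n + \frac{w_{n+1}}{C_{n+1}} \big( R_{n+1} - V_n \big), \qquad C_{n+1} = C_n + w_{n+1}, \quad C_0 = 0.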

56 Monte Carlo Summary
- MC has several advantages over DP: it can learn directly from interaction with the environment; it needs no full models; it does not need to learn about ALL states; and it suffers less from violations of the Markov property (later in the book).
- MC methods provide an alternative policy evaluation process.
- One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies).
- No bootstrapping (as opposed to DP).

57 Temporal-Difference Methods (after the break).

58 Break