1
RL 2: It’s 2:00 AM. Do you know where your mouse is?
2
First up: Vote! Albuquerque Municipal Election today (Oct 4). Not all of you are eligible to vote, I know... but if you are, you should. Educate yourself first! On the ballot: Mayor, city councilors, bonds (what will ABQ spend its money on?), and propositions (election finance, min wage, voter ID). Polls close at 7:00 PM today...
3
Voting resources
City of Albuquerque web site: www.cabq.gov
League of Women Voters web site: http://www.lwvabc.org/elections/2005VG_English.html
4
News o’ the day: the Wall Street Journal reports “Microsoft Windows Officially Broken.” In 2004, MS Longhorn (the successor to XP) bogged down, and the whole code base had to be scrapped and started afresh ⇒ Vista. Point: not MS bashing (much), but the importance of software process. MS moved to a more agile process for Vista: test first, rigorous regression testing, better coding infrastructure.
5
Administrivia. Grading: P1 rollout grading is finished; I will send grade reports this afternoon and tomorrow morning. Prof Lane out of town Oct 11; Andree Jacobsen will cover. Stefano Markidis out of town Oct 19. Will announce new office hours presently.
6
Your place in History. Last time: Q2; introduction to Reinforcement Learning (RL)
7
Your place in History. This time: ✓ P2M1 due ✓ Voting ✓ News ✓ Administrivia ✓ Q&A. Still to come: more on RL; design exercise: WorldSimulator and Terrains
8
Recall: Mack & his maze. Mack lives a hard life as a psychology test subject: he has to run around mazes all day, finding food and avoiding electric shocks. He needs to know how to find cheese quickly while getting shocked as little as possible. Q: How can Mack learn to find his way around?
9
Reward over time [figure: branching diagram of states s1 through s11, showing the alternative trajectories available from s1]
10
Reward over time [same state diagram, one trajectory s1 → s4 → s11 → s10 picked out] V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...
11
Reward over time [same state diagram, a different trajectory s1 → s2 → s6 picked out] V(s1) = R(s1) + R(s2) + R(s6) + ...
12
Where can you go? Definition: the complete set of all states the agent could be in is called the state space, S. It could be discrete or continuous; for Project 2, states are discrete. Q: What is the state space for P2? Size of the state space: |S|. Q: How big is the state space for P2?
13
Where can you go? Definition: the complete set of actions an agent could take is called the action space, A. Again, discrete or continuous; again, for P2, A is discrete. Q: What is A for P2, and what is its size, |A|?
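For a grid world like P2’s, the action space is often just the four compass moves. A minimal Java sketch (the enum and member names are illustrative assumptions, not the actual P2 identifiers):

```java
// Hypothetical action space for a P2-style grid world: |A| = 4.
// These names are illustrative, not the actual P2 identifiers.
public enum Action {
    NORTH, SOUTH, EAST, WEST
}
```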
14
What is it worth to you? The idea of “good” and “bad” places to go is quantified as “rewards”. (This is where the term “reinforcement learning” comes from; it originated in psychology.) Formally: R : S → Reals. R(s) = the reward for getting to state s, i.e., how good or bad it is to reach state s. Larger (more positive) is better; the agent “wants” to accumulate more positive reward.
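In code, R : S → Reals is just a function from states to doubles. A minimal sketch, assuming a generic State type (the interface and method names are placeholders, not P2’s actual classes):

```java
// R : S -> Reals. Larger (more positive) values are better.
// "State" is a type parameter; the interface name is a placeholder.
public interface RewardFunction<State> {
    double reward(State s);  // reward for *reaching* state s
}
```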
15
How does it happen? The dynamics of the agent are defined by the transition function T : S × A × S → [0,1], where T(s, a, s') = Pr[next state is s' | current state is s, action a]. Examples from P2?
16
How does it happen? The dynamics of the agent are defined by the transition function T : S × A × S → [0,1], where T(s, a, s') = Pr[next state is s' | current state is s, action a]. Examples from P2? In practice: we don’t write T down explicitly; it is encoded by WorldSimulator and the Terrain/agent interactions.
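That “don’t write T down” point is worth seeing in code: a simulator acts as a generative model that samples s' ~ T(s, a, ·) on demand instead of storing an |S| × |A| × |S| probability table. A sketch of the idea (interface and method names are my assumptions, not the real WorldSimulator API):

```java
import java.util.Random;

// Implicit transition function: instead of tabulating
// T(s, a, s') = Pr[s' | s, a] for every triple, the simulator
// just *samples* a next state s' ~ T(s, a, .).
public interface GenerativeModel<State, Action> {
    State sampleNextState(State s, Action a, Random rng);
}
```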
17
The MDP. The entire RL environment is defined by a Markov decision process: M = 〈S, A, T, R〉. S: state space. A: action space. T: transition function. R: reward function. Q: What modules represent these in P2?
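Bundling the four pieces makes the definition concrete. A sketch that reuses the RewardFunction and GenerativeModel interfaces from above (class and field names are mine, not P2’s module names):

```java
import java.util.Set;

// M = <S, A, T, R>, bundled as one object.
public class MDP<State, Action> {
    public final Set<State> states;                        // S
    public final Set<Action> actions;                      // A
    public final GenerativeModel<State, Action> dynamics;  // T (sampled)
    public final RewardFunction<State> reward;             // R

    public MDP(Set<State> states, Set<Action> actions,
               GenerativeModel<State, Action> dynamics,
               RewardFunction<State> reward) {
        this.states = states;
        this.actions = actions;
        this.dynamics = dynamics;
        this.reward = reward;
    }
}
```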
18
Policies. Total accumulated reward (value, V) depends on: where the agent starts, and what the agent does at each step (duh).
19
Policies. Total accumulated reward (value, V) depends on: where the agent starts, and what the agent does at each step (duh). A plan of action is called a policy, π. A policy defines what action to take in every state of the system: π : S → A.
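Since π is literally a map from states to actions, a deterministic policy fits in one small interface (a sketch; the names are placeholders):

```java
// pi : S -> A. A deterministic policy: one action per state.
public interface Policy<State, Action> {
    Action actionFor(State s);
}
```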
20
Experience & histories. The fundamental unit of experience in RL: at time t, in some state si, take action aj, get reward rt, and end up in state sk. This is called an experience tuple or SARSA tuple: 〈si, aj, rt, sk〉. The set of all experience during a single episode up to time T is a history or trajectory: h = 〈s0, a0, r0, s1, a1, r1, ..., sT〉.
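An experience tuple and a history translate directly into code. A sketch (class and field names are illustrative):

```java
// One unit of experience: at time t, in state s, took action a,
// received reward r, ended up in state sNext.
public class Experience<State, Action> {
    public final State s;
    public final Action a;
    public final double r;
    public final State sNext;

    public Experience(State s, Action a, double r, State sNext) {
        this.s = s; this.a = a; this.r = r; this.sNext = sNext;
    }
}

// A history/trajectory is then just the ordered list of tuples
// from one episode:
//   List<Experience<State, Action>> history;
```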
21
How good is a policy? Value is a function of the start state and the policy: Vπ(s1) = E[ R(s1) + R(s2) + R(s3) + ... ], the expected total reward when starting at s1 and following π. Value measures: how good is policy π, averaged over all time, if the agent starts at state s1 and runs forever?
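You can estimate Vπ(s) exactly the way the slide describes: start at s, follow π, and add up the rewards. A Monte Carlo sketch using the MDP and Policy types from the earlier sketches; note that “runs forever” is truncated to a finite horizon here, which is an assumption of mine, not something the slide specifies:

```java
import java.util.Random;

// Monte Carlo estimate of V^pi(start): roll out nEpisodes
// trajectories of length `horizon` under policy pi and average
// the accumulated reward. Finite horizon is an assumption.
public final class ValueEstimator {
    public static <S, A> double estimate(MDP<S, A> mdp, Policy<S, A> pi,
                                         S start, int horizon,
                                         int nEpisodes, Random rng) {
        double total = 0.0;
        for (int ep = 0; ep < nEpisodes; ep++) {
            S s = start;
            double ret = mdp.reward.reward(s);        // R(s1)
            for (int t = 0; t < horizon; t++) {
                A a = pi.actionFor(s);                // a = pi(s)
                s = mdp.dynamics.sampleNextState(s, a, rng);
                ret += mdp.reward.reward(s);          // + R(s2) + ...
            }
            total += ret;
        }
        return total / nEpisodes;                     // average return
    }
}
```

Averaging over many rollouts approximates the expectation over the stochastic transitions in T.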
22
The goal of RL. The agent’s goal: find the best possible policy, π*, i.e., the policy that maximizes Vπ(s) for all s.
23
Design Exercise: WorldSimulator & Friends
24
Design exercise. Q1: Design the act() method in WorldSimulator. What objects does it need to access? How can it take different terrains/agents into account? Q2: GridWorld2d could be really large, and most of the terrain tiles are the same everywhere. How can you avoid millions of copies of the same tile? (One possible answer to Q2 is sketched below.)
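For Q2, one standard answer (my suggestion; the exercise may intend something else) is the Flyweight pattern: store each distinct terrain type once and have every grid cell hold a reference to the shared instance. A sketch with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Flyweight sketch: one shared Terrain object per terrain *type*,
// no matter how many grid cells use it. All names are illustrative.
class Terrain {
    final String type;
    Terrain(String type) { this.type = type; }
    // Shared, immutable per-type data (passability, reward effect, ...)
    // lives here; per-cell state must live outside the flyweight.
}

public final class TerrainFactory {
    private static final Map<String, Terrain> cache = new HashMap<>();

    // Each distinct terrain type is created at most once; a
    // 1000 x 1000 grid of "grass" holds a million references
    // to the same object, not a million copies.
    public static Terrain get(String type) {
        return cache.computeIfAbsent(type, Terrain::new);
    }
}
```

The design constraint that makes this work is immutability: anything that varies per cell (an agent standing on the tile, a consumable reward) has to live in the grid, not in the shared Terrain object.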