Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 4, 2013.

Presentation transcript:

Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 4, 2013

Today’s Class: Reinforcement Learning and Partially Observable Markov Decision Processes. Some slides inspired by slides by Gillian Hayes, U. Edinburgh. A full, free online course on reinforcement learning! Images used according to Fair Use.

A relaxing break from prediction modeling

Reinforcement Learning in Psychology

Reinforcement Learning in CS Totally different

Reinforcement Learning in CS Totally different Undergraduate receiving pellets rather than rat

Reinforcement Learning in CS (Really) Learning from feedback The computer program has actions it can take The computer program gets feedback as to the success of those actions

Reinforcement Learning Try to automatically learn a mapping from situations to actions – So that feedback is as positive as possible – Through trying actions and seeing the results
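
A minimal sketch of this loop, assuming hypothetical environment and policy objects (not from any particular library), just to show the shape of situation -> action -> feedback:

```python
# A minimal sketch of the situation -> action -> feedback loop described above.
# `environment` and `policy` are hypothetical placeholders, not any specific library.

def run_episode(environment, policy, n_steps=100):
    total_feedback = 0.0
    situation = environment.observe()               # infer the current situation
    for _ in range(n_steps):
        action = policy.choose(situation)           # try an action
        feedback = environment.act(action)          # see how positive the result is
        policy.update(situation, action, feedback)  # learn from the result
        situation = environment.observe()
        total_feedback += feedback
    return total_feedback
```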

Example I need 4 volunteers

Volunteer 1 You are a used glortz salesman – Your goal is to sell your greeble clients the most expensive possible used glortz You have the following options – Offer more expensive glortz – Offer less expensive glortz – Discuss merits of current glortz – Demand they buy current glortz

Glortzes (prices in Greldaks)

How much? How much did glortz dealer 1 make in sales?

Volunteer 2 You are also a glortz salesman, at the glortz dealer next door

How much? How did glortz dealer 2 do?

Volunteer 3 You are also a glortz salesman, at the glortz dealer down the street

How much? How did glortz dealer 3 do?

Volunteer 4 You are also a glortz salesman, at the annual glortz trade show

How much? How did glortz dealer 4 do?

This is Reinforcement Learning Learning a mapping from situations to actions What have we learned about our friends?

What are the meaningful features?

What do blue circles mean?

What do purple circles mean?

What do green squares mean?

Is every greeble with the same appearance the same?

How does it work in an automated fashion? Try out actions to learn which produces highest rewards – At first, trial and error – Later, biased search

How does it work in an automated fashion? Situations, Actions, Goals Infer the situation Choose the action To achieve the goal

Exploration or Exploitation? Exploitation – You can get the highest immediate rewards from trying the answer that is most likely to work Exploration – But at least at first, this is unlikely to always be the right answer – And settling too early may prevent you from finding the best answer – Local minima problem

Deterministic or Stochastic? Most real-world problems have some noise Another reason to keep exploring early on in learning how to respond to a problem And to try same solution to (seemingly) same case multiple times – Not all greebles that look the same are the same

Immediate reward or long-term reward? The goal of the method should be to infer which actions lead to the greatest long-term reward, rather than immediate reward. Not obvious for the glortz salesman problem. Consider learning, though.

Immediate reward or long-term reward? For an intelligent tutor, different policies will lead to – Better short-term performance (current problem) – Better long-term performance (next problem, next week)

Example Let’s say a tutor has the following actions every problem – Let student try to answer – Give hint – Give scaffolding – Give answer If the reward in RL is “performance on current problem” – Then the best strategy is always “Give answer” – Probably not best for long-term performance

Policy computation π(A_n, S) = Prob(A_n | S) S = current situation A_n = action n is best (and should be chosen) Through computing Q(A_n, S) = TotalReward(A_n | S)

Policy computation TotalReward(A_n | S) = Reward_now + Reward_now+1 + Reward_now+2 + Reward_now+3 + ... + Reward_end However, future rewards may be less certain than immediate rewards. With d = time discount factor between 0 and 1: TotalReward(A_n | S) = Reward_now + d*Reward_now+1 + d^2*Reward_now+2 + d^3*Reward_now+3 + ... + d^t*Reward_end
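
A short Python sketch of the discounted total reward just defined, assuming the rewards from now until the end are already known and listed in order:

```python
# Discounted total reward as defined above: later rewards count for less.
# Assumes `rewards` is ordered from the current step to the end,
# and 0 <= d <= 1 is the time discount factor.

def total_reward(rewards, d):
    return sum((d ** t) * r for t, r in enumerate(rewards))

# Example: with d = 0.9, a reward of 10 received three steps from now
# contributes 0.9**3 * 10, about 7.29, to the total.
print(total_reward([0, 0, 0, 10], d=0.9))  # ~7.29
```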

Simplest Version “Bandit Problem” Just one situation – Determine which action generally has highest reward, regardless of situation

I need 2 volunteers

Volunteer 1: You are in the fabled casino in Ulence Flats

You can choose between three levers Green lever Red lever Yellurple lever (Don’t start yet)

Volunteer 2 Write down the results each time One column for Red One column for Green One column for Yellurple (Volunteer 1: you can’t look)

Volunteer 1 Play 100 times Your goal is to make as much money as possible

Volunteer 2 What are the mean and SD for each option?

True values RED: M = +2, SD ≈ 6 GREEN: M = +40, SD ≈ 57 YELLURPLE: M = -20, SD ≈ 57 What should the values be for: π(A_n, S) = Prob(A_n | S) and Q(A_n, S) = TotalReward(A_n | S)?
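
As an illustration only (assuming each lever pays out roughly normally with the means and SDs above), Q for each lever can be estimated as the running average of the rewards observed when pulling it:

```python
import random

# Estimate Q(lever) as the sample average of observed rewards.
# The payout distributions below are an assumption for illustration,
# using the means and SDs listed on the slide above.
true_levers = {"red": (2, 6), "green": (40, 57), "yellurple": (-20, 57)}

def pull(lever):
    mean, sd = true_levers[lever]
    return random.gauss(mean, sd)

q_estimates = {lever: 0.0 for lever in true_levers}
counts = {lever: 0 for lever in true_levers}

for _ in range(100):
    lever = random.choice(list(true_levers))      # pure exploration for now
    reward = pull(lever)
    counts[lever] += 1
    # incremental sample-average update: Q <- Q + (reward - Q) / n
    q_estimates[lever] += (reward - q_estimates[lever]) / counts[lever]

print(q_estimates)  # noisy after only 100 pulls, but should land near +2, +40, -20
```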

True values RED: M = +2, SD ≈ 6 GREEN: M = +40, SD ≈ 57 YELLURPLE: M = -20, SD ≈ 57 If this involved real money, what would you do?

Exploitation versus Exploration Multiple strategies: Greedy – always pick the action with the highest Q ε-Greedy – select a random action ε percent of the time, else pick the action with the highest Q
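
A minimal sketch of the two strategies just listed, over a dictionary of Q estimates like the one in the bandit sketch above:

```python
import random

# Greedy vs. epsilon-greedy selection over a dict mapping action -> estimated Q.

def greedy(q_estimates):
    return max(q_estimates, key=q_estimates.get)      # always exploit

def epsilon_greedy(q_estimates, epsilon=0.1):
    if random.random() < epsilon:                     # explore epsilon of the time
        return random.choice(list(q_estimates))
    return max(q_estimates, key=q_estimates.get)      # otherwise exploit
```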

Exploitation versus Exploration Softmax Algorithm Tau influences aggressiveness – For Tau close to 0, model uses reward information strongly – For Tau close to infinity, model uses reward information very little
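
A sketch of softmax (Boltzmann) selection with temperature Tau, matching the behavior described above: near 0, choices follow the reward estimates closely; very large, choices become nearly uniform.

```python
import math
import random

# Softmax (Boltzmann) action selection with temperature tau (> 0).
# Small tau -> nearly greedy with respect to the Q estimates;
# large tau -> nearly uniform, largely ignoring the Q estimates.

def softmax_choice(q_estimates, tau=1.0):
    actions = list(q_estimates)
    max_q = max(q_estimates.values())  # subtract the max for numerical stability
    weights = [math.exp((q_estimates[a] - max_q) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```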

Becomes more complex… When situation is taken into account Model needs to figure out mapping from situation to reward – And which aspects of situation (e.g. features) matter

Becomes more complex… When considering possibility that agent actions may actually change state of the world E.g. reward function changes based on situation, and your actions change situation

Lots of algorithms When situation is taken into account Model needs to figure out mapping from situation to reward – And which aspects of situation (e.g. features) matter (e.g. lead to a significant difference in rewards)

POMDP Partially Observable Markov Decision Process Adds observations as well as rewards to RL E.g. your actions change the situation, but you don’t know exactly how – All you know is what the agent can observe
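
A sketch of the belief update at the heart of a POMDP, with hypothetical transition and observation tables (T[s][a][s_next] and O[s_next][a][o]): since the true state is hidden, the agent keeps a probability distribution over states and reweights it after each action and observation.

```python
# Belief update for a POMDP: the agent cannot see the true state, so it keeps
# a distribution over states and reweights it after every action/observation.
# T[s][a][s_next] and O[s_next][a][o] are hypothetical nested dicts of probabilities.

def update_belief(belief, action, observation, T, O):
    new_belief = {}
    for s_next in belief:
        # chance of landing in s_next, summed over where we might have been
        predicted = sum(belief[s] * T[s][action][s_next] for s in belief)
        # reweight by how likely the actual observation is from s_next
        new_belief[s_next] = O[s_next][action][observation] * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else belief
```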

Example of RL in EDM

Chi et al. (2011) Goal: choose strategy in physics tutoring by expert human tutors. Situation – range of features of past tutor behavior, past student behavior, and difficulty of current step. Action – Elicit or Tell, crossed with Justify or Skip-Justify. Reward – student’s final learning gain from pre-test to post-test. Separate model developed for each of 8 KCs.

Most commonly used selection features in final policies (final models not given in paper) StepSimplicityPS: percentage of answers across all students that were correct for a step in training data TuConceptsToWordsPS: ratio of physics concepts to words in the tutor's dialogue. tellsSinceElicitA: the number of tells the student has received since the last elicit. durationKCBetweenDecisionT: time since the last tutorial decision was made on the current KC.

Results Using the learned policy with a new group of students led to better student learning than a random policy Also, using pre-post learning gain as the reward function led to better student learning than using immediate performance

Questions? Comments?

Next Class Wednesday, March 6 Advanced Detector Evaluation and Validation Assignment Due: None

The End