Observational Methods Part Two January 20, 2010. Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next.

Observational Methods Part Two January 20, 2010

Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

Survey Results Much broader response this time – thanks! – Good to go with modern technology Generally positive comments – Some contradictions in best-part, worst-part – The best sign one is doing well is when everyone wants contradictory changes? I will give another survey in a few classes

Probing Question For today, you have read D'Mello, S., Taylor, R.S., Graesser, A. (2007) Monitoring Affective Trajectories during Complex Learning. Proceedings of the 29th Annual Meeting of the Cognitive Science Society, which used data from a lab study. If you wanted to study affective transitions in real classrooms, which of the methods we discussed today would be best? Why?

What’s the best way? First, let’s list out the methods For now, don’t critique, just describe your preferred method – One per person, please – If someone else has already presented your method, no need to repeat it – If you propose something similar, quickly list the difference (no need to say why right now)

For each method What are the advantages? What are the disadvantages?

Votes for each method

Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models (ref to later)

Agreement/ Accuracy The easiest measure of inter-rater reliability is agreement, also called accuracy: Agreement = (# of agreements) / (total number of codes)

Agreement/ Accuracy There is general agreement across fields that agreement/accuracy is not a good metric What are some drawbacks of agreement/accuracy?

Agreement/ Accuracy Let’s say that Tasha and Uniqua agreed on the classification of 9,200 time sequences, out of 10,000 actions – For a coding scheme with two codes. 92% accuracy. Good, right?

Non-even assignment to categories Percent Agreement does poorly when there is non-even assignment to categories – Which is almost always the case Imagine an extreme case – Uniqua (correctly) picks category A 92% of the time – Tasha always picks category A Agreement/accuracy of 92% But essentially no information

An alternate metric Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)

Kappa Expected agreement computed from a table of the form
                     Rater 2 Category 1   Rater 2 Category 2
Rater 1 Category 1   Count                Count
Rater 1 Category 2   Count                Count
Note that Kappa can be calculated for any number of categories (but only 2 raters)

Cohen’s (1960) Kappa The formula shown here, worked for 2 categories (it extends to more categories, but still only 2 raters). Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ raters – I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you

Expected agreement Look at the proportion of labels each coder gave to each category. To find the proportion of agreed category A that could be expected by chance, multiply pct(coder1/categoryA)*pct(coder2/categoryA). Do the same thing for categoryB. Add these two values together: this is your expected agreement. (If you work with expected counts instead of proportions, add the counts and divide by the total number of labels.)
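The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet mentioned earlier: the function name `cohens_kappa` is made up for this example, the label lists are hypothetical data built to match the extreme Uniqua/Tasha case (assuming 10,000 coded items), and returning 0 when expected agreement is exactly 1 is an assumption about how to handle that degenerate case.

```python
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two raters who coded the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    # Observed agreement: share of items given the same code by both raters.
    agreement = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Expected agreement: for each category, multiply the two raters'
    # proportions, then add up over categories.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(
        (counts_1[c] / n) * (counts_2[c] / n)
        for c in set(counts_1) | set(counts_2)
    )
    if expected == 1.0:
        return 0.0  # assumption: treat this degenerate case as chance level
    return (agreement - expected) / (1 - expected)

# Hypothetical data for the extreme case: Uniqua picks A 92% of the time,
# Tasha always picks A. Agreement is 92%, but kappa is at chance.
uniqua = ["A"] * 9200 + ["B"] * 800
tasha = ["A"] * 10000
print(cohens_kappa(uniqua, tasha))
```

Here percent agreement is 0.92 and expected agreement is also 0.92, so the numerator of Kappa is zero: high accuracy, essentially no information.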

Example
                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task   20               5
Tyrone On-Task    15               60

Example What is the percent agreement? (20 + 60) / 100 = 80%

Example What is Tyrone’s expected frequency for on-task? (15 + 60) / 100 = 75%

Example What is Pablo’s expected frequency for on-task? (5 + 60) / 100 = 65%

Example What is the expected on-task agreement? 0.65 * 0.75 = 0.4875 (an expected count of 48.75 out of the 100 labels)

Example What are Tyrone and Pablo’s expected frequencies for off-task behavior? 25% and 35%

Example What is the expected off-task agreement? 0.25 * 0.35 = 0.0875 (an expected count of 8.75 out of the 100 labels)

Example What is the total expected agreement? 0.4875 + 0.0875 = 0.575

Example What is kappa? (0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 = 0.529

So is that any good? Kappa = 0.529
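As a check on the arithmetic, the whole example can be run from the four counts in the table. This is only a sketch: the dictionary layout and the variable names are choices made for this illustration.

```python
# Counts from the example table: first key is Tyrone's code, second is Pablo's.
table = {
    ("off", "off"): 20, ("off", "on"): 5,
    ("on", "off"): 15, ("on", "on"): 60,
}
total = sum(table.values())

# Percent agreement: the two cells where both coders gave the same code.
agreement = (table[("off", "off")] + table[("on", "on")]) / total

# Each coder's overall proportion of on-task codes.
tyrone_on = (table[("on", "off")] + table[("on", "on")]) / total
pablo_on = (table[("off", "on")] + table[("on", "on")]) / total

# Expected agreement: on-task product plus off-task product.
expected = tyrone_on * pablo_on + (1 - tyrone_on) * (1 - pablo_on)
kappa = (agreement - expected) / (1 - expected)

print(f"agreement={agreement:.3f} expected={expected:.3f} kappa={kappa:.3f}")
# prints: agreement=0.800 expected=0.575 kappa=0.529
```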

Interpreting Kappa Kappa = 0 – Agreement is at chance Kappa = 1 – Agreement is perfect Kappa = -1 – Agreement is perfectly inverse Kappa > 1 – You messed up somewhere

Kappa<0 It does happen, but usually not in the case of inter-rater reliability Occasionally seen when Kappa is used for EDM or other types of machine learning – More on this in 2 months!

0<Kappa<1 What’s a good Kappa? There is no absolute standard For inter-rater reliability, – 0.8 is usually what ed. psych. reviewers want to see – You can usually make a case that values of Kappa around 0.6 are good enough to be usable for some applications Particularly if there’s a lot of data Or if you’re collecting observations to drive EDM – Remember that Baker, Corbett, & Wagner (2006) had Kappa = 0.58

Landis & Koch’s (1977) scale
κ              Interpretation
< 0            No agreement
0.00 to 0.20   Slight agreement
0.21 to 0.40   Fair agreement
0.41 to 0.60   Moderate agreement
0.61 to 0.80   Substantial agreement
0.81 to 1.00   Almost perfect agreement

Why is there no standard? Because Kappa is scaled by the proportion of each category. When one class is much more prevalent, expected agreement is higher than if classes are evenly balanced.
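A small sketch of this effect: hold percent agreement fixed at 90% and vary how prevalent category A is. The function name `kappa_from_rates` and the three prevalence values are invented for this illustration, and both raters are assumed to use category A at the same rate.

```python
def kappa_from_rates(agreement, p_rater1, p_rater2):
    """Two-category kappa from observed agreement and each rater's
    proportion of category-A labels."""
    expected = p_rater1 * p_rater2 + (1 - p_rater1) * (1 - p_rater2)
    return (agreement - expected) / (1 - expected)

# Same 90% agreement, increasingly lopsided categories.
for p in (0.5, 0.7, 0.9):
    print(p, round(kappa_from_rates(0.90, p, p), 3))
```

With balanced categories the expected agreement is 0.50 and Kappa is 0.8. When 90% of labels fall in one category, expected agreement rises to 0.82 and the same 90% agreement gives a Kappa of about 0.44.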

Because of this… Comparing Kappa values between two studies, in a principled fashion, is highly difficult A lot of work went into statistical methods for comparing Kappa values in the 1990s No real consensus Informally, you can compare two studies if the proportions of each category are “similar”

There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis – Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991) – Or in other words, nyardley nyardley nyoo
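One possible reading of that recipe, as a sketch only. Several things here are assumptions rather than the slide's stated method: the chi-squared is the plain Pearson statistic on each 2x2 rater table without continuity correction, Z is taken as the signed square root of the 1 df chi-squared, and the two Z values are compared with (Z1 – Z2) / sqrt(2), which assumes the two studies are independent. Check Rosenthal & Rosnow (1991) for the exact procedure. Study 1 reuses the Pablo/Tyrone counts; study 2 is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (1 df, no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def z_from_table(a, b, c, d):
    # With 1 df, Z is the square root of chi-squared; the sign follows the
    # direction of association (positive when the raters tend to agree).
    z = math.sqrt(chi2_2x2(a, b, c, d))
    return z if a * d >= b * c else -z

def compare_z(z1, z2):
    """Difference between two independent Z scores and its two-tailed p.
    Assumption: the intended comparison is (z1 - z2) / sqrt(2)."""
    z_diff = (z1 - z2) / math.sqrt(2)
    p_two_tailed = math.erfc(abs(z_diff) / math.sqrt(2))
    return z_diff, p_two_tailed

z1 = z_from_table(20, 5, 15, 60)   # counts from the Pablo/Tyrone example
z2 = z_from_table(30, 20, 20, 30)  # hypothetical second study
z_diff, p = compare_z(z1, z2)
print(f"z1={z1:.2f} z2={z2:.2f} z_diff={z_diff:.2f} p={p:.3f}")
```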

Comments? Questions?

Next step… Once you have analyzed inter-rater reliability, and you “trust” your observation codes You can conduct analyses with your findings

One simple question What is the prevalence of each category?

Why might this be interesting?

Some examples of studies What is the prevalence of teacher behavior X in Japan versus the USA? (Stigler & Hiebert, 1997) What is the prevalence of student off-task behavior in the USA versus the Philippines? (Baker et al, submitted) Does the prevalence of gaming the system drop when we try to reduce gaming with an animated agent? (Baker et al, 2006)

Approach Apply same coding scheme in situations A and B – Sometimes, find previous work where coding scheme was applied to situation A – And apply the same coding scheme to situation B Find prevalence of behavior for each student in each situation – Use an unpaired t-test to compare (in your favorite stats package)

Data might look like
WPI students   % off-task     Harvard stus.   % off-task
Student1       15%            Student1        25%
Student2       7%             Student2        17%
Student3       12%            Student3        22%
Student4       9%             Student4        23%
Student5       11%            Student5        19%
Student6       8%             Student6        18%
Student7       4%             Student7        14%
Student8       6%             Student8        64%
Student9       15%            Student9        8%
Student10      10%            Student10       30%
Student11      4%             Student11       24%
Student12      14%            Student12       101%
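If you want to see what the stats package is doing, here is a rough sketch of the unpaired t statistic on the made-up numbers above (entered exactly as shown on the slide). It assumes the pooled-variance Student's version of the test; `unpaired_t` is a name invented for this example. It stops at t and the degrees of freedom, since turning t into a p-value needs the t distribution from a stats package.

```python
import math
from statistics import mean, variance

# Percent off-task per student, copied from the slide.
wpi = [15, 7, 12, 9, 11, 8, 4, 6, 15, 10, 4, 14]
harvard = [25, 17, 22, 23, 19, 18, 14, 64, 8, 30, 24, 101]

def unpaired_t(x, y):
    """Student's t for two independent groups, using a pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    std_err = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / std_err, nx + ny - 2

t, df = unpaired_t(wpi, harvard)
print(f"mean WPI={mean(wpi):.2f} mean Harvard={mean(harvard):.2f} t={t:.2f} df={df}")
```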

Can also do Apply single coding scheme in situation A Find prevalence of behaviors B1 and B2 for each student – Use a paired t-test to compare (in your favorite stats package)

Data might look like
WPI students   % bored   % frustrated
Student1       15%       11%
Student2       7%        13%
Student3       12%       14%
Student4       9%        1%
Student5       11%       8%
Student6       8%        7%
Student7       4%        4%
Student8       6%        18%
Student9       15%       6%
Student10      10%       22%
Student11      4%        7%
Student12      14%       19%
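A paired t-test works on the per-student differences. A minimal sketch with the numbers above, where `paired_t` is a name invented for this example; as with the unpaired case, getting a p-value from t and df is left to a stats package.

```python
import math
from statistics import mean, stdev

# Percent of observations per student, copied from the slide.
bored = [15, 7, 12, 9, 11, 8, 4, 6, 15, 10, 4, 14]
frustrated = [11, 13, 14, 1, 8, 7, 4, 18, 6, 22, 7, 19]

def paired_t(x, y):
    """Paired t: mean of the per-student differences over its standard error."""
    diffs = [b - a for a, b in zip(x, y)]
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / std_err, len(diffs) - 1

t, df = paired_t(bored, frustrated)
print(f"t={t:.2f} df={df}")
```

Each student serves as their own control here, which is why the test uses 11 degrees of freedom (12 students minus 1) rather than 22.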

Comments? Questions?

Another question How do these behaviors we coded correlate to some other construct we’re interested in? “Correlation” = They vary together (e.g. as one goes up, the other goes up; as one goes down, the other goes down) (for more info on correlation, please attend optional session)

Why might this be interesting?

Some examples of studies What is the relationship between off-task behavior and learning? (Lahaderne, 1968; Karweit & Slavin, 1981; Baker et al, 2004; Gobel et al, 2008; Rowe et al, 2009) What is the relationship between gaming the system and learning? (Baker et al, 2004; Walonoski & Heffernan, 2006) What is the relationship between insults in collaborative learning, and learning? (Prata et al, 2008) What is the relationship between gaming the system and student attitudes, as measured by questionnaires? (Baker et al, 2005; Walonoski & Heffernan, 2006)

Potential Measures Knowledge Motivational/attitudinal surveys Learning Gain (we will talk about special statistical methods for correlating to this on Feb. 12) Robust Learning (discussed on Feb. 12)

Approach Apply coding scheme to find prevalence of behavior A for each student Collect additional measure for each student Compute correlation between prevalence of behavior and additional measure – Statistical significance can be computed in your favorite stats package, using linear regression, or using the formula on the inside cover of Rosenthal & Rosnow (1991) – Note that a different approach is needed for learning gains; will be discussed on Feb. 12

Data might look like
WPI students   % bored   grit scale (1-6)
Student1       15%       1
Student2       7%        3
Student3       12%       4
Student4       9%        1
Student5       11%       6
Student6       8%        6
Student7       4%        4
Student8       6%        5
Student9       15%       6
Student10      10%       2
Student11      4%        6
Student12      14%       6
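A sketch of the correlation computed by hand on the numbers above. `pearson_r` is a name invented for this example. The conversion of r to a t statistic with n – 2 degrees of freedom is the standard one for Pearson correlation and is not necessarily the formula on the inside cover of Rosenthal & Rosnow.

```python
import math

# Copied from the slide: percent bored and grit score per student.
bored = [15, 7, 12, 9, 11, 8, 4, 6, 15, 10, 4, 14]
grit = [1, 3, 4, 1, 6, 6, 4, 5, 6, 2, 6, 6]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(bored, grit)
n = len(bored)
# Significance: t with n - 2 degrees of freedom (undefined if r is exactly +/-1).
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(f"r={r:.3f} t={t:.2f} df={n - 2}")
```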

Comments? Questions?

Dynamics Models Several approaches to creating dynamics models – Markov Models – Sequential Pattern Mining – D’Mello’s L (D’Mello et al, 2007) Note: This has very little relationship to Systems Dynamics models

In fact You can construct a Markov Model using D’Mello’s L Let’s take a look – I will define Markov Models when we get there

Step 1 Lay out all your data, in terms of what the observation is at time N, and what the observation is at time N+1

Data might look like
WPI students   obs-time   category-now   category-next
Student1       1          CONFUSED       FLOW
Student1       2          FLOW           CONFUSED
Student1       3          CONFUSED       FLOW
Student1       4          FLOW           FLOW
Student1       5          FLOW           FLOW
Student1       6          FLOW           none
Student2       1          FRUSTRATED     BORED
Student2       2          BORED          BORED
Student2       3          BORED          BORED
Student2       4          BORED          BORED
Student2       5          BORED          BORED
Student2       6          BORED          none

Step 2 Break your data down by student

Step 3 For each student, compute the D’Mello’s L likelihood that category A will be followed by category B

D’Mello’s L L = (% time B follows A – expected % B) / (1 – expected % B)

D’Mello’s L That’s right, it’s Kappa

D’Mello’s L Expected % B is computed as the overall % of time that B is seen after any other category

Example % BORED after FRUSTRATED: 20% % BORED after ANYTHING: 10% % FRUSTRATED after ANYTHING: 30% What is D’Mello’s L for FRUSTRATED -> BORED? (20% - 10%) / (100% - 10%) = 10% / 90% ≈ 0.11
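Steps 1 to 3 can be sketched for a single student as follows. The function names `l_from_rates` and `dmello_l` are invented for this illustration, and returning `None` when L is undefined is an assumption about how to handle those cases. The sequence is Student1's category-now column from the earlier data slide.

```python
def l_from_rates(p_b_after_a, p_b):
    """D'Mello's L from the two percentages, given as proportions."""
    return (p_b_after_a - p_b) / (1 - p_b)

def dmello_l(sequence, a, b):
    """L for the transition a -> b in one student's ordered observations."""
    # Step 1: pair each observation with the one that follows it.
    transitions = list(zip(sequence, sequence[1:]))
    after_a = [nxt for now, nxt in transitions if now == a]
    if not after_a:
        return None  # a is never followed by another observation
    p_b_after_a = after_a.count(b) / len(after_a)
    # Expected % B: how often b is the next category after anything.
    p_b = sum(1 for _, nxt in transitions if nxt == b) / len(transitions)
    if p_b == 1.0:
        return None  # undefined when b is always the next category
    return l_from_rates(p_b_after_a, p_b)

# The worked example: 20% BORED after FRUSTRATED, 10% BORED after anything.
print(round(l_from_rates(0.20, 0.10), 3))

# Student1's six observations, in order.
student1 = ["CONFUSED", "FLOW", "CONFUSED", "FLOW", "FLOW", "FLOW"]
print(dmello_l(student1, "FLOW", "CONFUSED"))
```

For Student1, CONFUSED follows FLOW in 1 of 3 cases and follows anything in 1 of 5 cases, so L for FLOW -> CONFUSED is (1/3 – 1/5) / (1 – 1/5), or about 0.17.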

Step 4 For each transition, find the mean and standard error of L across all students

Data might look like
WPI students   category-now   category-next   D’Mello L
Student1       CONFUSED       FLOW            0.4
Student2       CONFUSED       FLOW            0.5
Student3       CONFUSED       FLOW            0.2
Student4       CONFUSED       FLOW            0.1
Student5       CONFUSED       FLOW            0.05
Student6       CONFUSED       FLOW            0.2
Student1       CONFUSED       BORED           0.05
Student2       CONFUSED       BORED           0.1
Student3       CONFUSED       BORED           0.05
Student4       CONFUSED       BORED           0.00
Student5       CONFUSED       BORED           -0.05
Student6       CONFUSED       BORED           0.1
CONFUSED -> FLOW: Mean = 0.24 Stdev = 0.17 Stderr = 0.07
CONFUSED -> BORED: Mean = 0.04 Stdev = 0.06 Stderr = 0.02

Step 5 You can now determine if a transition is significantly more likely than chance with a 1-sample t-test. Or you can determine if two transitions differ in likelihood with a paired t-test
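A sketch of both tests on the example L values for CONFUSED -> FLOW and CONFUSED -> BORED. The name `one_sample_t` is invented for this illustration, and chance is taken to be L = 0. A paired t-test is just a 1-sample t-test on the per-student differences, so the same function covers both.

```python
import math
from statistics import mean, stdev

# Per-student L values from the example data.
conf_to_flow = [0.4, 0.5, 0.2, 0.1, 0.05, 0.2]
conf_to_bored = [0.05, 0.1, 0.05, 0.00, -0.05, 0.1]

def one_sample_t(values, chance=0.0):
    """t statistic and degrees of freedom for mean(values) versus chance."""
    std_err = stdev(values) / math.sqrt(len(values))
    return (mean(values) - chance) / std_err, len(values) - 1

t_flow, df = one_sample_t(conf_to_flow)
t_bored, _ = one_sample_t(conf_to_bored)

# Paired comparison of the two transitions, student by student.
diffs = [f - b for f, b in zip(conf_to_flow, conf_to_bored)]
t_paired, _ = one_sample_t(diffs)

print(f"t_flow={t_flow:.2f} t_bored={t_bored:.2f} t_paired={t_paired:.2f} df={df}")
```

With 5 degrees of freedom the two-tailed .05 critical value of t is about 2.57, so compare each statistic against that (or let your stats package give you the p-value).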

Step 6 Take the transitions that are significantly different than chance and graph them [Figure: graph of transitions among the states BORED, FRUST, CONF, FLOW, DELIGHT]

This is… A Markov Model – Markov Model is a model of transitions and probabilities, which only considers single transitions

Similar to… Hidden Markov Models (HMMs) which you may have seen in AI classes – HMMs have latent (i.e. not directly observable) states, and observable outputs, which are emitted by each state with a certain probability – But in this case, our observations tell us what the student state is.

Differentiated From… Sequential Pattern Mining, where sequences of more than one transition are considered In a Markov Model – P(BORED->FRUSTRATED->BORED) = P(BORED->FRUSTRATED)*P(FRUSTRATED->BORED) In multi-step Sequential Pattern Mining approaches, this assumption does not hold
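The Markov assumption in code form, as a small sketch: the probability of a longer path is just the product of its single-step transition probabilities. The probabilities in `transition_p` are invented for illustration and do not come from any study; `path_probability` is likewise a made-up name.

```python
# Hypothetical single-step transition probabilities, for illustration only.
transition_p = {
    ("BORED", "FRUSTRATED"): 0.2,
    ("FRUSTRATED", "BORED"): 0.3,
}

def path_probability(states, transition_p):
    """Probability of a path, given its starting state, under a Markov
    model: multiply the single-step transition probabilities."""
    p = 1.0
    for now, nxt in zip(states, states[1:]):
        # Transitions not listed in the table are treated as probability 0.
        p *= transition_p.get((now, nxt), 0.0)
    return p

# P(BORED->FRUSTRATED->BORED) = P(BORED->FRUSTRATED) * P(FRUSTRATED->BORED)
print(path_probability(["BORED", "FRUSTRATED", "BORED"], transition_p))
```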

Comments? Questions?

Final use… Many times, observations are used to create EDM models that then are used instead of the original observations. We will talk about this on March 3rd – Why you might want to do this – Advantages and drawbacks – And *how* you do this

Probing Question Think of something awesome that Stigler & Hiebert could do with their coded data – What could be done? – How would one go about doing it, at a very high level? If you want, you can also pretend that Stigler & Hiebert handed out and coded any kind of paper survey or measure, as long as a student can fill it out in less than an hour

Please take a moment to read through assignment 1 Any questions about the assignment?

The End