Data Annotation for Classification

Prediction Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables) – Which students are off-task? – Which students will fail the class?

Classification Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data – Which students will fail the class? – Is the student currently gaming the system? – Which type of gaming the system is occurring?

We will… We will go into detail on classification methods tomorrow

In order to use prediction methods We need to know what we’re trying to predict And we need to have some labels of it in real data

For example… If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class… We need to first collect some data And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when

So we need to label some data We need to obtain outside knowledge to determine what the value is for the construct of interest

In some cases We can get a gold-standard label – For instance, if we want to know if a student passed a class, we just go ask their instructor

But for behavioral constructs… There’s no one to ask – We can’t ask the student (self-presentation) – There’s no gold-standard metric So we use data labeling methods or observation methods – (e.g. quantitative field observations, video coding) To collect bronze-standard labels – Not perfect, but good enough

One such labeling method Text replay coding

Text replays Pretty-prints of student interaction behavior from the logs

Examples

Sampling You can set up any sampling schema you want, if you have enough log data – 5-action sequences – 20-second sequences – Every behavior on a specific skill, but other skills omitted

Sampling – Equal number of observations per lesson – Equal number of observations per student – Observations that machine learning software needs help to categorize (“biased sampling”) – (a minimal code sketch of one such scheme follows below)
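For instance, the “equal number of observations per student” scheme might look like the sketch below. This is my own illustration, not part of the workshop materials: the pandas DataFrame, the student_id and timestamp column names, and the 5-action clip length are all assumptions.

```python
import pandas as pd

def sample_clips_per_student(log: pd.DataFrame, clip_len: int = 5,
                             clips_per_student: int = 20,
                             seed: int = 42) -> pd.DataFrame:
    """Draw an equal number of clip start points per student.

    Each clip is `clip_len` consecutive actions; assumes the log has
    'student_id' and 'timestamp' columns (hypothetical names).
    """
    log = log.sort_values(["student_id", "timestamp"]).reset_index(drop=True)
    starts = []
    for _, actions in log.groupby("student_id"):
        # Candidate start rows: every position that leaves room for a full clip
        candidates = actions.index[: max(len(actions) - clip_len + 1, 0)]
        n = min(clips_per_student, len(candidates))
        starts.append(pd.Series(candidates).sample(n=n, random_state=seed))
    start_idx = pd.concat(starts)
    # Return the sampled clips, labeled by the row position they start at
    clips = [log.iloc[i : i + clip_len].assign(clip_id=i) for i in start_idx]
    return pd.concat(clips, ignore_index=True)
```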

Major Advantages Both video and field observations hold some risk of observer effects Text replays are based on logs that were collected completely unobtrusively

Major Advantages Blazing fast to conduct – 8 to 40 seconds per observation

Notes – Decent inter-rater reliability is possible (Baker, Corbett, & Wagner, 2006; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010; Montalvo et al., 2010) – Text replay labels agree with other measures of the same constructs (Baker, Corbett, & Wagner, 2006) – They can be used to train machine-learned detectors (Baker & de Carvalho, 2008; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010)

Major Limitations Limited range of constructs you can code – Gaming the System: yes – Collaboration in online chat: yes (Prata et al., 2008) – Frustration, Boredom: sometimes – Off-Task Behavior outside of software: no – Collaborative Behavior outside of software: no

Major Limitations Lower precision (because lower bandwidth of observation)

Hands-on exercise

Find a partner Could be your project team-mate, but doesn’t have to be You will do this exercise with them

Get a copy of the text replay software – On your flash drive – Or at obspackage-LSRM.zip

Skim the instructions At Instructions-LSRM.docx

Log into text replay software Using exploratory login Try to figure out what the student’s behavior means, with your partner Do this for ~5 minutes

Now pick a category you want to code With your partner

Now code data According to your coding scheme – (is-category versus is-not-category) Separate from your partner For 20 minutes

Now put your data together Using the observations-NAME files you obtained Make a table (in Excel, for example) showing:

             Coder 1 Y   Coder 1 N
Coder 2 Y       15           2
Coder 2 N        3           8
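If the observations-NAME files can be read as tables, a sketch like the one below could build that 2x2 table automatically. The file names, the clip_id/code column names, and the CSV format are my assumptions about the export, not something the exercise specifies.

```python
import pandas as pd

# Hypothetical exports from the text replay tool, one per coder
coder1 = pd.read_csv("observations-CODER1.csv")   # columns assumed: clip_id, code
coder2 = pd.read_csv("observations-CODER2.csv")

# Line the two coders' labels up by clip, then cross-tabulate them
merged = coder1.merge(coder2, on="clip_id", suffixes=("_1", "_2"))
table = pd.crosstab(merged["code_1"], merged["code_2"])
print(table)
```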

Now… We can compute your inter-rater reliability… (also called agreement)

Agreement/Accuracy The easiest measure of inter-rater reliability is agreement, also called accuracy: Agreement = (# of agreements) / (total number of codes)
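As a minimal sketch (my own illustration, not part of the original slides), agreement is just the fraction of items on which the two coders gave the same label:

```python
def agreement(coder1, coder2):
    """Percent agreement: # of agreements / total number of codes."""
    assert len(coder1) == len(coder2)
    return sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)

# e.g. two coders labeling ten clips as gaming ("G") or not ("N")
print(agreement(list("GGNNGNNNGG"), list("GGNNNNNNGG")))  # 0.9
```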

Agreement/ Accuracy There is general agreement across fields that agreement/accuracy is not a good metric What are some drawbacks of agreement/accuracy?

Agreement/Accuracy Let’s say that Tasha and Uniqua agreed on the classification of 9200 time sequences, out of 10,000 actions – For a coding scheme with two codes That’s 92% accuracy Good, right?

Non-even assignment to categories Percent Agreement does poorly when there is non-even assignment to categories – Which is almost always the case Imagine an extreme case – Uniqua (correctly) picks category A 92% of the time – Tasha always picks category A Agreement/accuracy of 92% But essentially no information

An alternate metric Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)

Kappa Expected agreement is computed from a table of the form:

                      Rater 2 Category 1   Rater 2 Category 2
Rater 1 Category 1         Count                 Count
Rater 1 Category 2         Count                 Count

Kappa Expected agreement is computed from a table of the form shown above – Note that Kappa can be calculated for any number of categories (but only 2 raters)

Cohen’s (1960) Kappa The formula for 2 categories Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ categories – I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you

Expected agreement Look at the proportion of labels each coder gave to each category – To find the proportion of agreement on category A that could be expected by chance, multiply pct(coder 1 chose A) * pct(coder 2 chose A) – Do the same thing for category B – Add these two values together – This is your expected agreement
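Putting the last two slides together, here is a minimal sketch that computes expected agreement and Cohen’s Kappa from two coders’ label lists (my own code, not the workshop’s spreadsheet); it handles any number of categories, but only two raters:

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's (1960) Kappa: (agreement - expected) / (1 - expected)."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    agreement = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Proportion of labels each coder gave to each category
    p1, p2 = Counter(coder1), Counter(coder2)
    # Chance agreement on a category is the product of the two coders'
    # proportions for it; expected agreement is the sum over categories
    expected = sum((p1[c] / n) * (p2[c] / n) for c in set(p1) | set(p2))
    return (agreement - expected) / (1 - expected)

# The Tasha/Uniqua case from a few slides back: 92% agreement, but Kappa of 0
uniqua = ["A"] * 92 + ["B"] * 8
tasha = ["A"] * 100
print(cohens_kappa(uniqua, tasha))  # 0.0
```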

Example

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20               5
Tyrone On-Task          15              60

Example What is the percent agreement?

Example What is the percent agreement? (20 + 60) / 100 = 80%

Example What is Tyrone’s expected frequency for on-task?

Example What is Tyrone’s expected frequency for on-task? (15 + 60) / 100 = 75%

Example What is Pablo’s expected frequency for on-task?

Example What is Pablo’s expected frequency for on-task? (5 + 60) / 100 = 65%

Example What is the expected on-task agreement?

Example What is the expected on-task agreement? 0.65 * 0.75 = 0.4875

Example What is the expected on-task agreement? 0.65 * 0.75 = 0.4875 – an expected count of 48.75 agreements out of 100, shown in the table:

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20               5
Tyrone On-Task          15              60 (48.75)

Example What are Tyrone and Pablo’s expected frequencies for off-task behavior?

Example What are Tyrone and Pablo’s expected frequencies for off-task behavior? (20 + 5) / 100 = 25% and (20 + 15) / 100 = 35%

Example What is the expected off-task agreement?

Example What is the expected off-task agreement? 0.25 * 0.35 = 0.0875

Example What is the expected off-task agreement? 0.25 * 0.35 = 0.0875 – an expected count of 8.75 agreements out of 100, shown in the table:

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task      20 (8.75)           5
Tyrone On-Task          15              60 (48.75)

Example What is the total expected agreement?

Example What is the total expected agreement? 0.4875 + 0.0875 = 0.575

Example What is kappa?

Example What is kappa? (0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 = 0.529

So is that any good? What is kappa? (0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 = 0.529
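As a quick check on the arithmetic above, here is a small sketch of my own that computes Kappa directly from the 2x2 table of counts rather than from raw label lists:

```python
def kappa_from_2x2(table):
    """Cohen's Kappa from a 2x2 table of counts:
    rows = Tyrone's codes, columns = Pablo's codes."""
    (a, b), (c, d) = table          # a = both off-task, d = both on-task
    n = a + b + c + d
    observed = (a + d) / n
    # Expected agreement: product of marginal proportions, summed per category
    expected = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (observed - expected) / (1 - expected)

# Counts from the worked example: off/off=20, off/on=5, on/off=15, on/on=60
print(round(kappa_from_2x2([[20, 5], [15, 60]]), 3))  # 0.529
```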

What is your Kappa?

Interpreting Kappa Kappa = 0 – Agreement is at chance Kappa = 1 – Agreement is perfect Kappa = negative infinity – Agreement is perfectly inverse Kappa > 1 – You messed up somewhere

Kappa<0 It does happen, but usually not in the case of inter-rater reliability Occasionally seen when Kappa is used for EDM or other types of machine learning

0 < Kappa < 1 What’s a good Kappa? There is no absolute standard For inter-rater reliability, – 0.8 is usually what ed. psych. reviewers want to see – You can usually make a case that values of Kappa around 0.6 are good enough to be usable for some applications, particularly if there’s a lot of data, or if you’re collecting observations to drive EDM, and remembering that this is a “bronze standard”

Landis & Koch’s (1977) scale

κ              Interpretation
< 0            No agreement
0.00 – 0.20    Slight agreement
0.21 – 0.40    Fair agreement
0.41 – 0.60    Moderate agreement
0.61 – 0.80    Substantial agreement
0.81 – 1.00    Almost perfect agreement

Why is there no standard? Because Kappa is scaled by the proportion of each category – When one class is much more prevalent, expected agreement is higher than if classes are evenly balanced

Because of this… Comparing Kappa values between two studies, in a principled fashion, is highly difficult A lot of work went into statistical methods for comparing Kappa values in the 1990s No real consensus Informally, you can compare two studies if the proportions of each category are “similar”

There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis

There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis – Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991)

There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis – Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991) – Or in other words, nyardley nyardley nyoo
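A rough sketch of how that procedure might look in code follows. This is my reading of the slide, not code from the workshop; in particular, the (Z1 − Z2) / √2 comparison is my assumption about the Rosenthal & Rosnow (1991) method for comparing two independent Z values, so check the original before relying on it.

```python
from math import sqrt, erfc, copysign

def chi2_to_z(table):
    """1-df Pearson chi-squared for a 2x2 agreement table, converted to Z.

    The sign of Z follows the direction of association (positive when the
    agreement diagonal dominates).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return copysign(sqrt(chi2), a * d - b * c)

def compare_reliabilities(table1, table2):
    """Z for the difference between two reliabilities and a two-tailed
    p-value (assumed Rosenthal & Rosnow-style comparison of two Zs)."""
    z1, z2 = chi2_to_z(table1), chi2_to_z(table2)
    z_diff = (z1 - z2) / sqrt(2)
    p_two_tailed = erfc(abs(z_diff) / sqrt(2))
    return z_diff, p_two_tailed

# First table is the worked example above; the second is made up for illustration
print(compare_reliabilities([[20, 5], [15, 60]], [[30, 10], [12, 48]]))
```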

Additional thoughts/comments About inter-rater reliability

Additional thoughts/comments About text replays