Educational Data Mining Overview Ryan S.J.d. Baker PSLC Summer School 2012.

Presentation transcript:

Educational Data Mining Overview Ryan S.J.d. Baker PSLC Summer School 2012

Welcome to the EDM track! On behalf of the track lead, John Stamper, and all of our colleagues

Educational Data Mining “Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.” –

Classes of EDM Method (Baker & Yacef, 2009)
Prediction
Clustering
Relationship Mining
Discovery with Models
Distillation of Data for Human Judgment

Prediction
Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)
– Which students are off-task?
– Which students will fail the class?

Clustering
Find points that naturally group together, splitting the full data set into a set of clusters
Usually used when nothing is known about the structure of the data
– What behaviors are prominent in the domain?
– What are the main groups of students?
Conceptually related to Factor Analysis
– Geoff Gordon’s talk tomorrow

Relationship Mining
Discover relationships between variables in a data set with many variables
– Association rule mining
– Correlation mining
– Sequential pattern mining
– Causal data mining

Discovery with Models
Pre-existing model (developed with EDM prediction methods… or clustering… or knowledge engineering)
Applied to data and used as a component in another analysis

Distillation of Data for Human Judgment
Making complex data understandable by humans to leverage their judgment
Text replays are a simple example of this

Scheuer & McLaren (2011) also argue for a distinct class: Parameter Estimation
– Fitting parameters for a probabilistic model, and then using and interpreting these parameters

A related method

Knowledge Engineering
Creating a model by hand rather than automatically fitting a model
Several trade-offs, but broadly…
– Data-mined models are easier to validate, and often achieve better agreement with other measures
– Knowledge-engineered models are easier to create and explain

Comments? Questions?

EDM Tools

PSLC DataShop
Many large-scale datasets
Tools for
– exploratory data analysis
– learning curves
– domain model testing
Detail in talk by John Stamper tomorrow morning at 10am

Microsoft Excel Excellent tool for exploratory data analysis, and for setting up simple models

Pivot Tables

Who has used pivot tables before?

Pivot Tables What do they allow you to do?

Pivot Tables Facilitate aggregating data for comparison or use in further analyses
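For readers following along without Excel, the aggregation a pivot table performs can be sketched in a few lines of Python. This is only an illustration: the column names (`student`, `skill`, `correct`) and the toy rows are assumptions, not the layout of the workshop spreadsheet.

```python
from collections import defaultdict

# Hypothetical flat log: one row per student action (names are illustrative).
rows = [
    {"student": "s1", "skill": "A", "correct": 1},
    {"student": "s1", "skill": "A", "correct": 0},
    {"student": "s1", "skill": "B", "correct": 1},
    {"student": "s2", "skill": "A", "correct": 1},
    {"student": "s2", "skill": "B", "correct": 0},
    {"student": "s2", "skill": "B", "correct": 0},
]


def pivot_mean(rows, row_key, col_key, value_key):
    """Return {row_value: {col_value: mean of value_key}}, like a pivot table of averages."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for r in rows:
        sums[r[row_key]][r[col_key]] += r[value_key]
        counts[r[row_key]][r[col_key]] += 1
    return {
        rk: {ck: sums[rk][ck] / counts[rk][ck] for ck in sums[rk]}
        for rk in sums
    }


# Average correctness per student (rows) and skill (columns).
table = pivot_mean(rows, "student", "skill", "correct")
```

The resulting nested dictionary can be compared across students or fed into a further analysis, which is the role the slide gives pivot tables.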

Equation Solver Allows you to fit mathematical models in Excel Let’s go through a simple example together

Equation Solver: Example
Let’s fit a Bayesian Knowledge Tracing model
We’ll discuss this model later
– For now, it’s worth noting that classical BKT has four parameters per knowledge component
– BKT predicts student knowledge and performance (correctness)
– By fitting different values to the parameters, we get a better or worse fit to student performance
Using PSLC-SS-2012-Example-v1.xlsx
– This is a small subset of my dissertation data from the Scatterplot Tutor, available in full form in the DataShop
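The slide does not name the four parameters. In the standard formulation of BKT they are usually the initial knowledge probability, the learning (transition) probability, the guess probability, and the slip probability. The sketch below assumes that formulation; the function and argument names are illustrative and are not taken from the spreadsheet.

```python
def bkt_predict(correct_seq, p_l0, p_t, p_g, p_s):
    """Predict P(correct) before each attempt in one student's sequence on one skill.

    Assumed standard BKT parameters (all strictly between 0 and 1):
      p_l0: probability the skill is known before the first attempt
      p_t:  probability of learning the skill at each opportunity
      p_g:  probability of a correct guess when the skill is not known
      p_s:  probability of a slip (error) when the skill is known
    """
    predictions = []
    p_known = p_l0
    for correct in correct_seq:
        # Predicted performance given the current knowledge estimate.
        p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
        predictions.append(p_correct)
        # Bayesian update of the knowledge estimate from the observed response.
        if correct == 1:
            posterior = p_known * (1 - p_s) / p_correct
        else:
            posterior = p_known * p_s / (1 - p_correct)
        # Chance to learn the skill before the next opportunity.
        p_known = posterior + (1 - posterior) * p_t
    return predictions


preds = bkt_predict([1, 0, 1], p_l0=0.5, p_t=0.1, p_g=0.2, p_s=0.1)
```

Changing the four parameter values changes every prediction in the sequence, which is what lets a fitting procedure search for values that match observed correctness.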

Under SR, type =(J2-S2)^2
This finds the difference between the prediction (0 right now) and the correctness value (0 or 1)
– Squaring it makes every difference non-negative (as taking the absolute value would) and magnifies larger differences; very common in statistics

Go to sheet KC These are the parameters for each skill

To the right of SSR, type =sum(data!T2:T20974)
This is the sum of squared residuals, again a very common way of evaluating models
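The same two spreadsheet steps (a squared residual per row, then their sum) can be written as a short Python sketch. The toy lists below are invented for illustration and stand in for the correctness and prediction columns.

```python
def sum_squared_residuals(actual, predicted):
    """Sum of (actual - predicted)^2 over paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


correctness = [1, 0, 1, 1]            # observed 0/1 correctness (toy data)
predictions = [0.0, 0.0, 0.0, 0.0]    # all-zero predictions, as before any fitting
ssr = sum_squared_residuals(correctness, predictions)
```

A lower value means the predictions sit closer to the observed responses, so a fitting procedure tries to minimize it.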

To the right of r, type =CORREL(data!S2:S20974,data!J2:J20974)
This is the correlation between the model and the variable being predicted (correctness)
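For reference, the Pearson correlation that this formula reports can be computed by hand as in the sketch below (toy inputs, illustrative names). One detail worth noting: the correlation is undefined when either column is constant, which is the case for a prediction column that is still all zeros.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


r = pearson_r([0.2, 0.4, 0.9, 0.7], [0, 0, 1, 1])
```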

Now go into the Excel Equation Solver, set up this model, and press Solve

What changed?

What stayed the same?

Why is this useful?
You can specify a range of complex mathematical models
And much more quickly than you can implement them in software
Excel is usually where I test variants on Bayesian Knowledge Tracing before implementing them in Java

Note
Excel is a good starting point for this type of analysis… but not a good ending point
For example, the Equation Solver is not as good at finding optimal values for BKT as
– Expectation Maximization
– Brute Force/Grid-Search
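A brute-force grid search of the kind named above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard four BKT parameters (initial knowledge, learning, guess, slip), a deliberately coarse grid, and invented toy sequences. It is not the procedure used for the workshop data.

```python
from itertools import product


def bkt_ssr(sequences, p_l0, p_t, p_g, p_s):
    """Sum of squared residuals of BKT predictions over 0/1 sequences for one skill."""
    total = 0.0
    for seq in sequences:
        p_known = p_l0
        for correct in seq:
            p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
            total += (correct - p_correct) ** 2
            # Update the knowledge estimate from the observed response.
            if correct == 1:
                posterior = p_known * (1 - p_s) / p_correct
            else:
                posterior = p_known * p_s / (1 - p_correct)
            p_known = posterior + (1 - posterior) * p_t
    return total


def grid_search(sequences, grid):
    """Try every combination of grid values for the four parameters; keep the lowest SSR."""
    best_score, best_params = None, None
    for params in product(grid, repeat=4):
        score = bkt_ssr(sequences, *params)
        if best_score is None or score < best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Toy data: one list of 0/1 responses per student (invented for illustration).
sequences = [[0, 0, 1, 1, 1], [0, 1, 1, 1], [1, 1, 1]]
grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # coarse grid; values stay strictly inside (0, 1)
best_ssr, best_params = grid_search(sequences, grid)
```

Because every combination is evaluated, the search cannot get stuck in a local optimum the way a gradient-style solver can, at the cost of run time that grows with the fourth power of the grid size.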

Comments? Questions?

Suite of visualizations Scatterplots (with or without lines) Bar graphs

Weka and RapidMiner
Data mining packages
RapidMiner has become more popular in recent years among the EDM community
– I prefer it too

Weka vs. RapidMiner
Weka is easier to use than RapidMiner
RapidMiner is significantly more powerful and flexible (from the GUI; both are powerful and flexible if accessed via API)

In particular…
It is impossible to do key types of model validation for EDM within Weka’s GUI
– Such as multi-level cross-validation
RapidMiner can be kludged into doing so
No data mining tool is really tailored to the needs of EDM researchers at the current time…
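One level at which EDM models are commonly cross-validated is the student: all rows from a given student go into the same fold, so a model is always tested on students it never saw in training. The sketch below assumes that reading of the validation described above; the `student` key and the toy rows are illustrative.

```python
def student_level_folds(rows, n_folds, student_key="student"):
    """Yield (train, test) splits in which no student appears on both sides."""
    students = sorted({r[student_key] for r in rows})
    # Deterministic round-robin assignment; shuffle `students` first for random folds.
    fold_of = {s: i % n_folds for i, s in enumerate(students)}
    for k in range(n_folds):
        train = [r for r in rows if fold_of[r[student_key]] != k]
        test = [r for r in rows if fold_of[r[student_key]] == k]
        yield train, test


# Toy flat data: several rows per student (invented for illustration).
rows = [
    {"student": "s1", "correct": 1},
    {"student": "s1", "correct": 0},
    {"student": "s2", "correct": 1},
    {"student": "s3", "correct": 0},
    {"student": "s3", "correct": 1},
    {"student": "s4", "correct": 1},
]
folds = list(student_level_folds(rows, n_folds=2))
```

Splitting at the row level instead would let the same student contribute to both training and test data, which tends to overstate how well a model generalizes to new students.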

SPSS SPSS is a statistical package, and therefore can do a wide variety of statistical tests It can also do some forms of data mining, like factor analysis

SPSS
The difference between statistical packages (like SPSS) and data mining packages (like RapidMiner and Weka) is:
– Statistics packages are focused on finding models and relationships that are statistically significant (e.g. the data would be seen less than 5% of the time if the model were not true)
– Data mining packages set a lower bar: are the models accurate and generalizable?

R
R is an open-source competitor to SPSS
More powerful and flexible than SPSS
But substantially harder to use

Matlab A powerful tool for building complex mathematical models Beck and Chang’s Bayes Net Toolkit – Student Modeling is built in Matlab

Comments? Questions?

Pre-processing Tomorrow morning, John and Ken will talk about some of the great data available in DataShop

Wherever you get your data from, you’ll need to process it into a form that software can easily analyze, and which builds successful models

Common approach
Flat data file
– Even if you store your data in databases, most data mining techniques require a flat data file
– Like the one we looked at in Excel

Feature Distillation is Essential But time-consuming…

Educational Data Mining Workbench (Rodrigo et al., 2012)
Provides support for feature distillation and for rapid data labeling (aka text replays)
Supports data in DataShop format, as well as other formats
Available for free at

Feature distillation
Can automatically distill 26 features for DataShop data used in previous analyses
Can distill features at the transaction (individual student action) level
Can also distill aggregated features at the level of clips, defined by
– time intervals
– number of actions
– “begin” and “end” events
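To make the idea of clip-level aggregation concrete, here is a small Python sketch that cuts one student's time-ordered transactions into clips by number of actions and computes a few aggregates per clip. It is not the Workbench's code, and the feature names (`n_actions`, `pct_correct`, `total_time`) are hypothetical examples rather than any of its 26 features.

```python
def clips_by_action_count(transactions, clip_size):
    """Split one student's time-ordered transactions into consecutive clips."""
    if clip_size < 1:
        raise ValueError("clip_size must be at least 1")
    return [
        transactions[i:i + clip_size]
        for i in range(0, len(transactions), clip_size)
    ]


def clip_features(clip):
    """Aggregate transaction-level values into clip-level features (hypothetical names)."""
    n = len(clip)
    return {
        "n_actions": n,
        "pct_correct": sum(t["correct"] for t in clip) / n,
        "total_time": sum(t["duration"] for t in clip),
    }


# Toy transactions for a single student (invented for illustration).
transactions = [
    {"correct": 1, "duration": 2.0},
    {"correct": 0, "duration": 3.0},
    {"correct": 1, "duration": 1.0},
    {"correct": 1, "duration": 4.0},
    {"correct": 0, "duration": 5.0},
]
clips = clips_by_action_count(transactions, clip_size=2)
features = [clip_features(c) for c in clips]
```

Clips defined by time intervals or by "begin" and "end" events would only change how the list is cut; the aggregation step stays the same.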

Data Labeling
Supports “text replay” data labeling of clips
Clips can be sampled either randomly or in stratified fashion
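The difference between random and stratified sampling of clips can be illustrated with the sketch below. It assumes, purely for illustration, that clips are stratified by student so that each student contributes at most a fixed number of clips; the slide does not say which variable the Workbench stratifies on.

```python
import random
from collections import defaultdict


def stratified_sample(clips, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` clips at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_stratum = defaultdict(list)
    for clip in clips:
        by_stratum[clip[stratum_key]].append(clip)
    sample = []
    for key in sorted(by_stratum):
        group = by_stratum[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Toy clips: student "A" produced far more clips than student "B".
clips = [
    {"student": "A", "clip_id": 1},
    {"student": "A", "clip_id": 2},
    {"student": "A", "clip_id": 3},
    {"student": "B", "clip_id": 4},
]
to_label = stratified_sample(clips, stratum_key="student", per_stratum=2)
```

A purely random draw would tend to be dominated by the students with the most clips, whereas the stratified draw keeps less active students represented among the clips sent for labeling.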

Data Labeling

Comments? Questions?

Time to work on projects