John Stamper Human-Computer Interaction Institute Carnegie Mellon University Technical Director Pittsburgh Science of Learning Center DataShop.

Slides:



Advertisements
Similar presentations
The Robert Gordon University School of Engineering Dr. Mohamed Amish
Advertisements

New Technologies Supporting Technical Intelligence Anthony Trippe, 221 st ACS National Meeting.
G54DMT – Data Mining Techniques and Applications Dr. Jaume Bacardit
Supporting (aspects of) self- directed learning with Cognitive Tutors Ken Koedinger CMU Director of Pittsburgh Science of Learning Center Human-Computer.
Effective Skill Assessment Using Expectation Maximization in a Multi Network Temporal Bayesian Network By Zach Pardos, Advisors: Neil Heffernan, Carolina.
Chapter 9 Business Intelligence Systems
/ department of mathematics and computer science TU/e eindhoven university of technology CEDEFOP workshop: Policy, Practice, Partnership: Getting to Work.
Conclusion Our prediction model did a good job at predict 8 th grade math proficiency. It can be used to estimate 10 th grade score fairly well, too. But.
Data Mining – Intro.
Science and Engineering Practices
DataShop: An Educational Data Mining Platform for the Learning Science Community John Stamper Pittsburgh Science of Learning Center Human-Computer Interaction.
Educational Data Mining and DataShop John Stamper Carnegie Mellon University 1 9/12/2012 PSLC Corporate Partner Meeting 2012.
Data Mining By Andrie Suherman. Agenda Introduction Major Elements Steps/ Processes Tools used for data mining Advantages and Disadvantages.
April 11, 2008 Data Mining Competition 2008 The 4 th Annual Business Intelligence Symposium Hualin Wang Manager of Advanced.
Determining the Significance of Item Order In Randomized Problem Sets Zachary A. Pardos, Neil T. Heffernan Worcester Polytechnic Institute Department of.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Understanding Data Analytics and Data Mining Introduction.
PSLC DataShop Introduction Slides current to DataShop version John Stamper DataShop Technical Director.
Masquerade Detection Mark Stamp 1Masquerade Detection.
Unit 2: Engineering Design Process
ENGLISH LANGUAGE ARTS AND READING K-5 Curriculum Overview.
Capstone Experience at UNH Manchester Student Guided Mentoring for an Undergraduate Research Group in Speech Capstone Objectives Challenges Technology.
Anomaly detection with Bayesian networks Website: John Sandiford.
+ Facts + Figures: Your Introduction to Meetings Data Presented By: Name Title Organization.
PSLC DataShop Introduction Slides current to DataShop version John Stamper DataShop Technical Director.
CpSc 881: Machine Learning Introduction. 2 Copy Right Notice Most slides in this presentation are adopted from slides of text book and various sources.
1 Research Groups : KEEL: A Software Tool to Assess Evolutionary Algorithms for Data Mining Problems SCI 2 SMetrology and Models Intelligent.
بسم الله الرحمن الرحیم. Ehsan Khoddam Mohammadi M.J.Mahzoon Koosha K.Moogahi.
Copyright © 2015 KDDI R&D Labs. Inc. All Rights Reserved
What is Cyberinfrastructure? Russ Hobby, Internet2 Clemson University CI Days 20 May 2008.
Chengjie Sun,Lei Lin, Yuan Chen, Bingquan Liu Harbin Institute of Technology School of Computer Science and Technology 1 19/11/ :09 PM.
Measuring What Matters: Technology & the Assessment of all Students Jim Pellegrino.
Introduction to Science Informatics Lecture 1. What Is Science? a dependence on external verification; an expectation of reproducible results; a focus.
N A T I O N A L S C I E N C E T E A C H E R S A S S O C I A T I O N NSTA Learning Center What would be helpful for you and your teachers? Al Byers Assistant.
Ensemble Learning Spring 2009 Ben-Gurion University of the Negev.
Big data and learning: building and learning from LearnSphere (…because the next generation DataShop was just too incremental) John Stamper Pittsburgh.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
Data Mining and Decision Trees 1.Data Mining and Biological Information 2.Data Mining and Machine Learning Techniques 3.Decision trees and C5 4.Applications.
Kaggle Competition Prudential Life Insurance Assessment
Cyberinfrastructure Overview Russ Hobby, Internet2 ECSU CI Days 4 January 2008.
Cyberinfrastructure: Many Things to Many People Russ Hobby Program Manager Internet2.
Copyright © 2001, SAS Institute Inc. All rights reserved. Data Mining Methods: Applications, Problems and Opportunities in the Public Sector John Stultz,
Finding τ → μ−μ−μ+ Decays at LHCb with Data Mining Algorithms
Data mining with DataShop Ken Koedinger CMU Director of PSLC Professor of Human-Computer Interaction & Psychology Carnegie Mellon University.
Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made.
ROSS Courses: Business Technology Applications & Marketing Principles Lesson Plans for the Week of September 3, 2012.
Copyright © 2005 Education Development Center, Inc. Teacher Students Science Forming Relationships.
Challenge Based Learning i.Key Components ii.Process.
1 Seattle University Master’s of Science in Business Analytics Key skills, learning outcomes, and a sample of jobs to apply for, or aim to qualify for,
© 2004 McGraw-Hill Companies, Inc., McGraw-Hill RyersonSlide 8-2 TURNING MARKETING INFORMATION INTO ACTION C HAPTER.
Item 4 - Intrusion Detection and Prevention Yuh-Jye Lee Dept. of Computer Science and Information Engineering National Taiwan University of Science and.
FNA/Spring CENG 562 – Machine Learning. FNA/Spring Contact information Instructor: Dr. Ferda N. Alpaslan
Data-Driven Education
Data Mining – Intro.
How to interact with the system?
Julia Lane New York University
CIKM Competition 2014 Second Place Solution
CS 179 Project Intro.
CIKM Competition 2014 Second Place Solution
The Graduate College Travel Summary Presentation
Dr. Morgan C. Wang Department of Statistics
Introduction to PSLC DataShop
Addressing the Assessing Challenge with the ASSISTment System
Overview of Machine Learning
Ensembles.
How to interact with the system?
INNOvation in TRAINING BUSINESS ANALYSTS HAO HElEN Zhang UniVERSITY of ARIZONA
Big DATA.
Welcome! Knowledge Discovery and Data Mining
Presentation transcript:

John Stamper Human-Computer Interaction Institute Carnegie Mellon University Technical Director Pittsburgh Science of Learning Center DataShop

2

eScience 3 Jim Gray – the 4 th paradigm

4

Paradigms of Scientific Exploration 5  Empirical – started thousands of years ago  Theoretical – last few hundred years  Computational – last 30 – 40 years  Data Exploration (eScience)

The Book 6

Data Exploration 7  Driven by the availability (or overabundance) of data  Ties simulation with data analysis, highly statistical  Requires tools to collect, analyze, and visualize large data sets

Data Exploration 8 Focus Areas  Health (Medicine, DNA)  Environmental (Global Warming)  Astronomy (Galaxy Mapping)  Physics (CERN) Education is missing

Can EDM be part of eScience? 9 We need:  Data  Tools  Ideas and methods

EDM Data Size What is the right size for EDM discovery?

Data Granularity 11 Finest – Transaction Steps Problems Units Tests Class Grades Class Avgs Schools Coarsest -…. We are mostly here Policy is being made here

EDM Conference Data Average 520 Students Median 148 Students Largest 172,000 Transactions 2009 Average 1,168 Students Median 300 Students Largest 437,000 Transactions

How about 2011? 13  Hypothesis – Average will be larger due mainly to a few large datasets

Trend towards larger data sets… 14  … and they are coming!  Carnegie Learning / Assistments  Seeing a move from collecting data to secondary analysis  This is good, but it has risks!

Risks of Secondary Analysis  Misunderstanding the data  Stagnation on a few datasets  Privacy/Security 15

Minimizing the risks  Misunderstanding the data – Standard formats  Stagnation on a few datasets – turn on the flow  Privacy/Security – must have reasonable procedures to protect student identity Warning – Shameless Plug Ahead!!! 16

Standard Repositories  Repositories like DataShop are one way to mitigate these issues and provide:  Standardization  Privacy/Security  Lots of data 17

DataShop Stats… 18

DataShop - How to increase awareness?  Tutorials/Workshops  Press/media  Competitions 19

2010 KDD Cup Competition  KDD Cup is the premier data mining challenge  2010 KDD Cup called “Educational Data Mining Challenge”  Ran from April 2010 through June

2010 KDD Cup Competition  The challenge asked participants to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.

KDD Cup Competition Why do we care?  Advances in prediction  Advances modeling 22

Prediction  Prediction of student performance is the reason for assessment.  Tons of effort placed on Standardized Testing  What if we could predict from student data better? Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI). 19(3), pp

Modeling  Student Models drive many of the decisions for adaptive instruction  What level of granularity should these models be?  Better Student Models should lead to faster learning 24

The Data Data was provided by Carnegie Learning Inc DatasetStudentsStepsFile size Algebra I ,3109,426,9663 GB Bridge to Algebra ,04320,768, GB 25

Details on the Data RowStudentProblemStepIncorrectsHintsError Rate Knowledge component Opportunity Count 1S01WATERING_VEGGIES(WATERED-AREA Q1)000Circle-Area1 2S01WATERING_VEGGIES(TOTAL-GARDEN Q1)211 Rectangle- Area 1 3S01WATERING_VEGGIES(UNWATERED-AREA Q1)000 Compose- Areas 1 4S01WATERING_VEGGIESDONE000 Determine- Done 1 5S01MAKING-CANS(POG-RADIUS Q1)000Enter-Given1 6S01MAKING-CANS(SQUARE-BASE Q1)000Enter-Given2 7S01MAKING-CANS(SQUARE-AREA Q1)000Square-Area1 8S01MAKING-CANS(POG-AREA Q1)000Circle-Area2 9S01MAKING-CANS(SCRAP-METAL-AREA Q1)201 Compose- Areas 2 10S01MAKING-CANS(POG-RADIUS Q2)000Enter-Given3 26

Details on the Data Splitting Data for the Competition 27

2010 KDD Cup Competition  655 registered participants  130 participants who submitted predictions  3,400 submissions

Solutions 1 st National Taiwan University  Used a DM course around 2010 KDD CUP  Expanded features by various binarization and discretization techniques  Resulting sparse feature sets are trained by logistic regression (using LIBLINEAR)  Condensed features so that the number is less than 20.  Final submission used ensemble by linear regression. 29

Solutions 2 nd Zhang and Su  Used combination of techniques  Gradient Boosting Machines  Singular Value Decomposition  Combined results of multiple SVDs which is called Gradient Boosting. 30

Solutions 3 rd Big KDD  Used collaborative filtering techniques  Matrix Factorization  Factorize student/step/group relationships  Other Baseline Predictions  Neural network combines an ensemble of predictions  Originally developed for the Netflix competition 31

Solutions 4 th Zach Pardos  Used a novel Bayesian HMM  learns individualized student specific parameters (prior, learn rate, guess and slip)  uses these parameters to train skill specific models.  The bagged decision tree classifier was the primary classifier  Bayesian model was used in ensemble selection to generate extra features for decision tree classifier 32

What did we learn?  The top teams used very different techniques to achieve similar results  More work still needed to bring these techniques into the mainstream  How good does the prediction have to be? 33

2010 KDD Cup Benefits  Advances in prediction and student modeling  Excitement in the KDD Community  The datasets are now in the “wild” and showing up in non KDD conferences  Competition site is still up and functioning! (including facts and papers from winning teams!) 34

2010 KDD Cup Competition Next steps to continue momentum? 35

2012 EDM Cup Competition! Goals  Generate Excitement within the EDM Community  Use as a bridge to connect KDD, LAKS, EC-TEL, AERA, etc.  Make the competition annual  Have each year build on knowledge gained from previous year  Vary the questions and data 36

The Future of EDM  More and more data will come  It needs to be mined  EDM as a community or conference? 37

EDM Data Size  What is the right size for EDM Discovery? 38

PSLC DataShop a data analysis service for the learning science community Free Data is there, Use it! Make Discoveries! 39