
OAAI: Deploying an Open Ecosystem for Learning Analytics
2012 Jasig Sakai Conference | June 10-15, 2012 | Growing Community; Growing Possibilities
Josh Baron, Marist College
Sandeep Jayaprakash, Marist College
Nate Angell, rSmart

 Open Academic Analytics Initiative (OAAI)
 Building the Predictive Model
  ◦ Overview of the process
  ◦ Data sets used and data extraction process
  ◦ Overview of Pentaho and training process
 Deploying the Predictive Model
  ◦ Using Pentaho to score data
  ◦ Performance of the predictive model
  ◦ Producing Academic Alert Reports (AARs)
 Overview of Intervention Strategies
 Current Outcomes and Next Steps

“Creating an Open Ecosystem for Learning Analytics”
 OAAI is using two primary data sources:
  ◦ Student Information System (SIS, Banner): demographics, aptitude (SATs, GPA)
  ◦ Learning Management System (LMS): event logs, gradebook
 Goal: create an open-source “early alert” system
  ◦ Predict “at risk” students in the first 3 weeks of a course
  ◦ Deploy interventions to ensure students succeed

[Diagram: data flow into the predictive model. Static data (student aptitude data: SATs, current GPA, etc.; student demographic data: age, gender, etc.) and dynamic data (Sakai event log data; Sakai gradebook data) feed a predictive model developed with historical data. Scoring identifies students “at risk” to not complete the course, after which an intervention is deployed.]

 Purdue University’s Course Signals project
  ◦ Built on dissertation research by Dr. John Campbell
  ◦ Ellucian product that integrates with Blackboard
  ◦ Students in courses using Course Signals
    scored up to 26% more A or B grades
    up to 12% fewer C’s; up to 17% fewer D’s and F’s
  ◦ Positive effect on four-year retention rates
    No Course Signals courses: 69%
    Two or more Course Signals courses: 93%
 Interventions that utilize “support groups”
  ◦ Improved 1st and 2nd semester GPAs
  ◦ Increased semester persistence rates (79% vs. 39%)

 Building an “open ecosystem” for learning analytics
  ◦ Sakai Collaboration and Learning Environment
    Sakai API to automate secure data capture
    Will also facilitate use of Course Signals & IBM SPSS
  ◦ Pentaho Business Intelligence Suite
    Open-source data mining, integration, analysis and reporting tools
  ◦ OAAI predictive model released under an open-source license
    Expressed in the Predictive Model Markup Language (PMML)
 Researching critical analytics scaling factors
  ◦ How “portable” are predictive models?
  ◦ What intervention strategies are most effective?
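PMML is an XML standard for exchanging trained models across tools, which is what makes an openly licensed model portable. As a hedged illustration only (the OAAI model was built with Weka/SPSS, not this toolchain), here is a minimal sketch using the third-party sklearn2pmml package, which requires a Java runtime; the data and file name are made up:

```python
# A minimal sketch of exporting a trained classifier to PMML, assuming
# the third-party sklearn2pmml package (the OAAI model itself was built
# with Weka/SPSS; data and file name here are hypothetical).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = make_classification(n_samples=200, random_state=0)
pipeline = PMMLPipeline([("classifier", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)

# Writes an XML document that any PMML-aware consumer can score with.
sklearn2pmml(pipeline, "oaai_model.pmml")
```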

 Release Sakai Academic Alert System (beta)
  ◦ Will be included as part of a Sakai CLE release
 Conduct real-world pilots
  ◦ 36 courses at community colleges
  ◦ 36 courses at HBCUs
 Research findings related to…
  ◦ Strategies for effectively “porting” predictive models
  ◦ The use of online communities and OER to impact course completion, persistence and content mastery

 Wave I EDUCAUSE Next Generation Learning Challenges (NGLC) grant
 Funded by the Bill & Melinda Gates and William and Flora Hewlett Foundations
 $250,000 over a 15-month period
 Began May 1, 2011; ends January 2013 (extended)

Building the Predictive Model
 Overview of the process
 Data sets and data extraction
 Overview of Pentaho and training process

Development and initial deployment of an open-source predictive model of academic risk
 Methodological framework for model development
 Empirical analysis of predictive performance (preliminary results)

 Data integration using Pentaho Kettle
  ◦ Data extraction phase
  ◦ Transformation phase
  ◦ Load phase
 Predictive modeling using Pentaho WEKA
  ◦ Training phase
  ◦ Testing phase

[Architecture diagram. Source data (personal/bio data, course data, performance data, CMS (Sakai) event data, partial grades (gradebook) data) flows into an ETL layer (SQL Server 2008 R2, Pentaho Kettle) that performs data pre-processing (missing values, outliers, derived features) and produces data anonymized by institution, semester, program (grad/ugrad), course and student, plus a target feature. The predictive modeling layer (IBM SPSS Modeler, Pentaho Weka, Pentaho Kettle) partitions the data into training and test sets, balances the training data, trains and tests a library of predictive models (classifiers), stores them, and predicts (scores) new data to produce results. Hardware platform: IBM x-series Xeon, quad-core, 64-bit, 10 GB RAM; OS: Windows Server 2008 Standard Edition.]

 SQL queries to extract grade and user event data from Sakai CLE (see the Sakai wiki for details)
 Ensure access to historical data: data warehouse, backups, etc.
 Extract from backups to ensure no impact on production performance
 Encrypt user IDs for user anonymization (sketch below)
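A minimal sketch of the user-ID anonymization step, assuming event rows have already been pulled from Sakai; the actual SQL and key management are not shown in the deck, and the table/column names here are hypothetical. A keyed hash keeps the mapping deterministic, so records can still be joined across extracts without exposing identity:

```python
# Pseudonymize user IDs in extracted Sakai event rows with a keyed hash
# (HMAC-SHA256). Column names are hypothetical, not the Sakai schema.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-per-project"  # kept outside the released data set

def pseudonymize(user_id: str) -> str:
    """Deterministically map a real user ID to an opaque token."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

rows = [
    {"user_id": "jdoe1", "event": "content.read", "course": "BIO101"},
    {"user_id": "asmith2", "event": "gradebook.read", "course": "BIO101"},
]
anonymized = [{**r, "user_id": pseudonymize(r["user_id"])} for r in rows]
print(anonymized)
```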

 Data mining and predictive modeling are affected by input data of diverse quality
 A predictive model is usually only as good as its training data
  ◦ Good: lots of data
  ◦ Not so good: data quality issues
  ◦ Not so good: unbalanced classes (at Marist, 6% of students are at risk; good for the student body, bad for training predictive models)

 Variability in instructors’ assessment criteria
 Variability in workload criteria
 Variability in the period used for prediction (early detection)
 Variability in multiple-instance data (partial grades with variable contribution and heterogeneous composition)
 Solution: use ratios
  ◦ Percent of usage over average percent of usage per course
  ◦ Effective weighted score / average effective weighted score

 Variability in Sakai tool usage
 No uniform criterion in the use of CMS tools (faculty members are a wild bunch)
 Tools not used, data not entered, too much missing data
[Chart: tool usage variability across Forums, Content, Lessons, Assignments, Assessments]

[Architecture diagram repeated, highlighting the ETL layer and data pre-processing (missing values, outliers, derived features).]

 Fall 2010 data sample of undergraduate students
 Datasets were joined and the data was cleaned, recoded, and aggregated to produce an input data file of 3,877 records, each corresponding to a course taken by a student

 Use Weka 3.7 and IBM SPSS Modeler
 Generate 5 different random partitions (70% training, 30% testing)
 Balance each training dataset
 Train a predictive model (Logistic Regression, C4.5 Decision Tree, SVM) on each balanced training dataset
  ◦ 5 x 3 = 15 models
 Measure predictive performance of the classifiers
  ◦ recall, precision, specificity
 Produce summary measures (mean and standard error); a sketch of this protocol follows below
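A sketch of the evaluation protocol translated to Python/scikit-learn terms (the team used Weka 3.7 and IBM SPSS Modeler; synthetic data with roughly 6% positives stands in for the Marist records, and the balancing step is only marked):

```python
# Evaluation protocol sketch: 5 random 70/30 partitions, three classifier
# families, recall/precision/specificity per run, then mean and standard
# error. scikit-learn stands in for Weka/SPSS; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.94], random_state=0)

def specificity(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn / (tn + fp)

models = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "tree": lambda: DecisionTreeClassifier(random_state=0),  # stand-in for C4.5/J4.8
    "svm": lambda: SVC(),
}
scores = {name: [] for name in models}

for seed in range(5):  # 5 random partitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed)
    # The balancing step (e.g. SMOTE, see the appendix slides) would be
    # applied to X_tr, y_tr here; omitted to keep the sketch short.
    for name, make in models.items():
        pred = make().fit(X_tr, y_tr).predict(X_te)
        scores[name].append([recall_score(y_te, pred),
                             precision_score(y_te, pred, zero_division=0),
                             specificity(y_te, pred)])

for name, vals in scores.items():
    arr = np.array(vals)
    se = arr.std(axis=0, ddof=1) / np.sqrt(len(arr))
    print(name, "mean [recall precision specificity]:", arr.mean(axis=0), "std err:", se)
```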

 At model training time:
  ◦ For all students in all courses, compute their effective weighted score as sumproduct(partial scores, partial weights) / sum(partial weights)
  ◦ Compute the average effective weighted score for the course
  ◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
 At testing time, apply the same computation to all students in the course being tested
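A worked example of the RMN_SCORE ratio with made-up partial scores and weights; the same computation runs at training and at scoring time:

```python
# Worked example of RMN_SCORE. Scores and weights are illustrative.
def effective_weighted_score(partial_scores, partial_weights):
    """sumproduct(scores, weights) / sum(weights)."""
    num = sum(s * w for s, w in zip(partial_scores, partial_weights))
    return num / sum(partial_weights)

# Partial grades entered so far for three students in one course.
course = {
    "s1": ([0.90, 0.80], [0.20, 0.10]),
    "s2": ([0.55, 0.40], [0.20, 0.10]),
    "s3": ([0.75],       [0.20]),  # fewer graded items is fine: weights normalize
}
ews = {sid: effective_weighted_score(*sw) for sid, sw in course.items()}
course_avg = sum(ews.values()) / len(ews)
rmn_score = {sid: e / course_avg for sid, e in ews.items()}
print(rmn_score)  # a value below 1.0 means performing below the course average
```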


 Logistic regression and SVM did much better than C5.0 / J4.8
  ◦ They detect 82% to 87% of the student population at risk
  ◦ In comparison, the recall of C5.0 / J4.8 is 59% (why so low?)
 False positives:
  ◦ 10% false positives over OK students (C5.0 / J4.8 does better: 3%)
  ◦ 65% of predictions are false alarms (C5.0 / J4.8 does better: 44%)

 For logistic regression
  ◦ RMN_SCORE
  ◦ ACADEMIC_STANDING, CUM_GPA
  ◦ Then R_SESSIONS and SAT_VERBAL
 For the SVM classifier
  ◦ RMN_SCORE
  ◦ CUM_GPA, ACADEMIC_STANDING, R_SESSIONS and SAT_VERBAL
 For C5.0/J4.8
  ◦ Minimal difference among predictors

 Results are encouraging, although the number of false alarms raises some concern
 Differences among classifiers, in particular for decision trees (typically very robust classifiers), require further investigation
 Data quality (missing values) remains an open issue with only partial remediation
 The partial-grades-derived score (RMN_SCORE) remains the best predictor
 CMS-generated events appear to be second-tier predictors

Deploying the Predictive Model
 Using Pentaho to score data
 Performance of the predictive model
 Producing Academic Alert Reports

 ETL phase: remains similar to the ETL process used for training the model, except that records with missing data are also retained
 Scoring phase: uses the WEKA Scoring plugin to embed the WEKA predictive model into Pentaho Kettle
 Reporting phase: uses the Pentaho Report Designer tool to create a template for reporting
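A compact Python stand-in for this deployment flow (the deck embeds a stored WEKA model in a Kettle transformation and renders reports with Pentaho Report Designer; here a scikit-learn model and printed rows play those roles, with synthetic data and an illustrative 0.5 risk threshold):

```python
# Stand-in for the deployment flow: a previously trained ("stored") model
# scores new-semester rows coming out of the ETL step, producing rows for
# an Academic Alert Report. Data, threshold and labels are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=405, weights=[0.94], random_state=0)
X_hist, y_hist = X[:400], y[:400]   # historical training data
X_new = X[400:]                     # current-semester rows to score

model = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)
probs = model.predict_proba(X_new)[:, 1]  # probability of "at risk"

for i, p in enumerate(probs):
    flag = "AT RISK" if p >= 0.5 else "ok"
    print(f"student_{i}\t{p:.2f}\t{flag}")  # rows for the alert report
```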

[Architecture diagram repeated, highlighting the scoring path: stored classifiers predict (score) new data to produce results.]


Overview of Intervention Strategies
 Awareness Messaging
 Online Academic Support Environment (OASE)

 Researching the effectiveness of two strategies
  ◦ Awareness Messaging
  ◦ Online Academic Support Environment (OASE)
[Diagram: OASE components include OER content, self-assessments, learning skills (Flat World Knowledge), and learning support via facilitation & mentoring.]

Current Outcomes and Next Steps
 Initial research findings
 Future efforts

 Applied analytical techniques similar to those used by Campbell at Purdue, using Fall 2010 data
 Marist College and Purdue University
  ◦ Differences: institutional type and size
  ◦ Similarities: % students receiving federal Pell Grants, % ethnicity, ACT composite 25th/75th percentile
 We found similarities in correlation values
 As at Purdue, all these metrics are significantly correlated with course grade, but with rather low correlation values
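As an illustration of the kind of check described here, a small Python sketch with made-up values (the column names are hypothetical, not the Marist schema; the deck reports the same qualitative pattern at both institutions: statistically significant but modest correlations with course grade):

```python
# Correlation check sketch: each predictor vs. course grade, one row per
# student-course. Values and column names are illustrative only.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "course_grade": [3.7, 2.0, 3.3, 1.7, 4.0, 2.7, 3.0, 2.3],
    "cum_gpa":      [3.5, 2.4, 3.1, 2.0, 3.9, 2.9, 3.2, 2.5],
    "sessions":     [42, 11, 30, 9, 55, 20, 26, 14],
})
for col in ("cum_gpa", "sessions"):
    r, p = pearsonr(df[col], df["course_grade"])
    print(f"{col}: r={r:.2f}, p={p:.3f}")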

 Initial review of instructor “research logs” showed general agreement with predictions
 Faculty/student feedback has been positive:
"Not only did this project directly assist my students by guiding students to resources to help them succeed, but as an instructor, it changed my pedagogy; I became more vigilant about reaching out to individual students and providing them with outlets to master necessary skills. P.S. I have to say that this semester, I received the highest volume of unsolicited positive feedback from students, who reported that they felt I provided them exceptional individual attention!"

 Develop and release a “student effort data” API
 Develop a Sakai Academic Alert dashboard
 Create customized predictive models for different academic contexts
 Work to facilitate use of SNAPP with Sakai


 Sakai Confluence Wiki – Open Academic Analytics Initiative (OAAI)
 Contact
  ◦ Josh Baron – Senior Academic Technology Officer, Marist College
  ◦ Sandeep M. Jayaprakash – Technical Consultant OAAI, Marist College


[Diagram: data sets and data extraction. Student data (demographics & course enrollment) from Banner (ERP) and course event data from Sakai flow into data extraction (course event data aggregated, student data added, student identity removed) and then data pre-processing (missing values, outliers, incomplete records, derived features). Identifying student information is removed during the data extraction process.]


[Screenshot: Weka Knowledge Flow layout with partition, filter and balance steps feeding SVM/SMO, Logistic, J4.8 decision tree and Naive Bayes classifiers.]

[Screenshot: IBM SPSS Modeler stream with partition, filter and balance steps feeding Logistic, C5.0 decision tree and SVM classifiers.]

 Balance the training dataset
  ◦ Subsample: reduce the dataset
  ◦ Oversample: duplicate minority-class records
  ◦ SMOTE (sketch below): Nitesh V. Chawla et al. (2002), “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research
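A minimal SMOTE example using the imbalanced-learn package (the deck cites the original Chawla et al. algorithm; whether the team used this particular implementation is not stated). Synthetic data with roughly 6% minority mirrors the Marist class balance:

```python
# SMOTE balancing example: the minority class is grown with synthetic
# samples interpolated between minority neighbors, not plain duplicates.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.94], random_state=0)
print("before:", Counter(y))          # roughly 940 ok vs. 60 at-risk
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_bal))      # classes equalized with synthetic samples
```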

 In the case of unbalanced classes, accuracy is a poor measure
  ◦ Accuracy = (TP+TN) / (TP+TN+FP+FN)
  ◦ The large class overwhelms the metric
 Better metrics, computed from the confusion matrix (worked example below):
  ◦ Recall = TP / (TP+FN): ability to detect the class of interest
  ◦ Specificity = TN / (TN+FP): ability to rule out the unimportant class
  ◦ Precision = TP / (TP+FP): ability to rule out false alarms
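The arithmetic with illustrative counts on a 94/6 split shows why accuracy misleads (the counts below are made up, chosen to echo the recall and false-alarm levels reported earlier in the deck):

```python
# Metrics from a confusion matrix on an imbalanced (94/6) problem.
# Counts are illustrative, not the Marist results.
tp, fn = 50, 10        # at-risk students caught / missed
tn, fp = 850, 90       # ok students correctly passed / falsely flagged

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.90, dominated by the big class
recall      = tp / (tp + fn)                    # 0.83, at-risk students detected
specificity = tn / (tn + fp)                    # 0.90, ok students ruled out
precision   = tp / (tp + fp)                    # 0.36, most alerts are false alarms
print(accuracy, recall, specificity, precision)
```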

[Slide: classifiers compared: Logistic Regression, C4.5/C5.0 boosted decision tree, Support Vector Machines.]
