June 10-15, 2012 Growing Community; Growing Possibilities Josh Baron, Marist College Sandeep Jayaprakash, Marist College Nate Angell, rSmart
Open Academic Analytics Initiative (OAAI) Building the Predictive Model ◦ Overview of the process ◦ Data sets used and data extraction process ◦ Overview of Pentaho and training process Deploying the Predictive Model ◦ Using Pentaho to score data ◦ Performance of the predictive model ◦ Producing Academic Alert Reports (AARs) Overview of Intervention Strategies Current Outcomes and Next Steps
“Creating an Open Ecosystem for Learning Analytics” OAAI is using two primary data sources: ◦ Student Information System (SIS – Banner) Demographics, Aptitude (SATs, GPA) ◦ Learning Management System (LMS) Event logs, Gradebook Goal: create open-source “early alert” system ◦ Predict “at risk” students in first 3 weeks of a course ◦ Deploy intervention to ensure student succeeds
Static data: Student Aptitude Data (SATs, current GPA, etc.) and Student Demographic Data (age, gender, etc.). Dynamic data: Sakai Event Log Data and Sakai Gradebook Data. Both feed Predictive Model Scoring (model developed w/ historical data), which identifies students “at risk” of not completing the course; an intervention is then deployed.
Purdue University’s Course Signals Project ◦ Built on dissertation research by Dr. John Campbell ◦ Ellucian product that integrates with Blackboard ◦ Students in courses using Course Signals earned up to 26% more A or B grades, up to 12% fewer C's, and up to 17% fewer D's and F's ◦ Positive effect on four-year retention rates: no Course Signals courses, 69%; two or more Course Signals courses, 93% Interventions that utilize “support groups” ◦ Improved 1st and 2nd semester GPAs ◦ Increased semester persistence rates (79% vs. 39%)
Building an “open ecosystem” for learning analytics ◦ Sakai Collaboration and Learning Environment: Sakai API to automate secure data capture; will also facilitate use of Course Signals & IBM SPSS ◦ Pentaho Business Intelligence Suite: open-source data mining, integration, analysis and reporting tools ◦ OAAI Predictive Model released under an open-source license: Predictive Model Markup Language (PMML) Researching critical analytics scaling factors ◦ How “portable” are predictive models? ◦ What intervention strategies are most effective?
Release Sakai Academic Alert System (beta) ◦ Will be included as part of Sakai CLE release Conducted real-world pilots ◦ 36 courses at community colleges ◦ 36 courses at HBCUs Research findings related to: ◦ Strategies for effectively “porting” predictive models ◦ The use of online communities and OER to impact course completion, persistence and content mastery.
Wave I EDUCAUSE Next Generation Learning Challenges (NGLC) grant Funded by Bill and Melinda Gates and Hewlett Foundations $250,000 over a 15 month period Began May 1, 2011, ends January 2013 (extended)
Overview of the process Data sets and data extraction Overview of Pentaho and training process 2012 Jasig Sakai Conference
Development and initial deployment of an “open source” predictive model of academic risk Methodological framework for model development Empirical analysis of predictive performance (preliminary results)
Data integration using Pentaho Kettle ◦ Data extraction phase ◦ Transformation phase ◦ Load phase Predictive modeling using Pentaho WEKA ◦ Training phase ◦ Testing phase
Architecture of the predictive modeling pipeline (diagram):
◦ Source data: personal (bio) data, course data, performance data, CMS (Sakai) event data, partial grades (gradebook) data
◦ ETL layer: data anonymized by institution, semester, program (Grad/Ugrad), course, and student; data pre-processing for missing values, outliers, and derived features. Software platform: SQL Server 2008 R2, Pentaho Kettle
◦ Predictive modeling layer (training, testing and scoring): define the target feature; partition into training data and test data; balance the training data; train and test a library of predictive models (classifiers) using a set of algorithms; store the models; score (predict) new data to produce results. Software platform: IBM SPSS Modeler, Pentaho Weka, Pentaho Kettle
◦ Hardware platform: IBM x Xeon E GHz, Quad-Core, 64 bit, 10GB RAM. OS: Windows Server 2008 Standard Edition
SQL queries extract grade and user event data from Sakai CLE (see the Sakai wiki for details) ◦ Ensure access to historical data: data warehouse, backups, etc. ◦ Extract from a backup to ensure no impact on production performance ◦ Encrypt user IDs to anonymize users
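The anonymization step could be sketched as follows. This is a minimal illustration, not the project's actual method: the slide only says user IDs are encrypted, and the sketch substitutes a keyed hash (HMAC-SHA256), which is one-way rather than reversible. The key, the function name `pseudonymize`, and the sample IDs and event names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored apart from the extracted data.
SECRET_KEY = b"replace-with-a-private-key"

def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a user ID to a stable token using a keyed hash (not reversible encryption)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative rows of (user ID, event name); the values are made up.
rows = [("jdoe1", "event_a"), ("asmith2", "event_b"), ("jdoe1", "event_c")]

# The same ID always maps to the same token, so SIS and LMS records
# for one student can still be joined after the IDs are replaced.
anonymized = [(pseudonymize(uid), event) for uid, event in rows]
```

Because the mapping is deterministic for a given key, the joins needed later in the ETL layer still work, while the original IDs never appear in the modeling data.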
Data mining and predictive modeling are affected by input data of diverse quality. A predictive model is usually only as good as its training data ◦ Good: lots of data ◦ Not so good: data quality issues ◦ Not so good: unbalanced classes (at Marist, 6% of students are at risk; good for the student body, bad for training predictive models)
Variability in instructors’ assessment criteria ◦ Variability in workload criteria ◦ Variability in period used for prediction (early detection) ◦ Variability in multiple-instance data (partial grades with variable contribution and heterogeneous composition) Solution: use ratios ◦ Percent of usage / average percent of usage per course ◦ Effective weighted score / average effective weighted score
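The usage ratio can be illustrated with a small sketch. It assumes usage is available as one number per student per course; the function name `usage_ratios`, the record layout, and the sample values are hypothetical.

```python
from collections import defaultdict

def usage_ratios(records):
    """records: list of (course, student, usage) tuples.

    Returns {(course, student): usage / mean usage in that course}.
    A ratio is None when the course mean is zero.
    """
    by_course = defaultdict(list)
    for course, _student, usage in records:
        by_course[course].append(usage)
    course_mean = {c: sum(v) / len(v) for c, v in by_course.items()}

    ratios = {}
    for course, student, usage in records:
        mean = course_mean[course]
        ratios[(course, student)] = usage / mean if mean else None
    return ratios

# Made-up data: the second course uses the LMS far more heavily than the first.
records = [
    ("COURSE_A", "s1", 10), ("COURSE_A", "s2", 30),
    ("COURSE_B", "s3", 200), ("COURSE_B", "s4", 600),
]
ratios = usage_ratios(records)
```

In this made-up data, s1 and s3 both come out at 0.5 even though their raw counts differ by a factor of 20, which is the point of the ratio: it compares each student with classmates rather than with students in courses that use the tools differently.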
Variability in Sakai tools usage ◦ No uniform criterion in the use of CMS tools (faculty members are a wild bunch) ◦ Tools not used, data not entered, too much missing data (Chart: usage by tool: Forums, Content, Lessons, Assignments, Assessments)
Fall 2010 data sample of undergraduate students. Datasets were joined and data was cleaned, recoded, and aggregated to produce an input data file of 3877 records corresponding to courses taken by students.
Use Weka 3.7 and IBM SPSS Modeler Generate 5 different random partitions (70% training, 30% testing) Balance each training dataset Train a predictive model (Logistic Regression, C4.5 Decision Tree, SVM) for each balanced training dataset ◦ 5 x 3 = 15 models Measure predictive performance of classifiers ◦ recall, precision, specificity Produce summary measures (mean and standard error)
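The partitioning and summary steps can be sketched in plain Python. This is an illustration only; the project used Weka and IBM SPSS Modeler, not this code. The function names, seeds, and the recall values are hypothetical, and the standard error is computed from the sample standard deviation.

```python
import math
import random

def partition(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def mean_and_standard_error(values):
    """Mean and standard error (sample standard deviation / sqrt(n)); needs n >= 2."""
    n = len(values)
    mean = sum(values) / n
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(sample_variance / n)

# Five different random 70/30 partitions of a placeholder dataset.
records = list(range(100))
splits = [partition(records, seed=s) for s in range(5)]

# Hypothetical recall values, one per partition, for a single classifier.
recalls = [0.80, 0.85, 0.90, 0.85, 0.85]
mean_recall, se_recall = mean_and_standard_error(recalls)
```

Repeating the split with different seeds is what makes a mean and standard error meaningful: each classifier gets five performance figures instead of one.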
At model training time:
◦ For all students in all courses, compute the effective weighted score as sumproduct(partial scores, partial weights) * (1 / sum of partial weights)
◦ Compute the average effective weighted score for the course
◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
At testing time:
◦ For all students in the course tested, compute the effective weighted score as sumproduct(partial scores, partial weights) * (1 / sum of partial weights)
◦ Compute the average effective weighted score for the course
◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
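The RMN_SCORE calculation described above can be sketched as follows. The function names, the dictionary layout, and the sample grades are hypothetical; students with no graded items (zero total weight) are given None here, which is an assumption about how missing data might be handled.

```python
def effective_weighted_score(partial_scores, partial_weights):
    """sumproduct(scores, weights) / sum(weights); None if nothing has been graded."""
    total_weight = sum(partial_weights)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in zip(partial_scores, partial_weights)) / total_weight

def rmn_scores(course):
    """course: {student: (partial_scores, partial_weights)} -> {student: RMN_SCORE}."""
    effective = {
        student: effective_weighted_score(scores, weights)
        for student, (scores, weights) in course.items()
    }
    graded = [v for v in effective.values() if v is not None]
    course_average = sum(graded) / len(graded) if graded else 0
    if course_average == 0:
        return {student: None for student in effective}
    return {
        student: (v / course_average if v is not None else None)
        for student, v in effective.items()
    }

# Made-up gradebook: two graded items so far, worth 10% and 20% of the final grade.
course = {
    "s1": ([90, 75], [10, 20]),  # effective weighted score 80
    "s2": ([50, 35], [10, 20]),  # effective weighted score 40
}
scores = rmn_scores(course)
```

With a course average of 60, s1 gets a ratio of about 1.33 and s2 about 0.67. Dividing by the sum of the weights graded so far is what lets the score be computed early in the term, before all items exist.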
Logistic regression and SVM did much better than C5.0 / J4.8 ◦ They detect 82% to 87% of the student population at risk ◦ In comparison, recall of C5.0 / J4.8: 59% (why so low?) False positives: ◦ 10% of OK students are flagged as at risk (C5.0 / J4.8 does better: 3%) ◦ 65% of at-risk predictions are false alarms (C5.0 / J4.8 does better: 44%)
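A short calculation shows how a modest false positive rate can still produce mostly false alarms when the at-risk class is small. The inputs are assumptions for illustration: the 6% at-risk rate is the Marist figure quoted earlier, 0.85 is taken as a midpoint of the 82% to 87% recall range, and the class balance of the actual test sets is not stated in the slides.

```python
def false_alarm_share(at_risk_rate, recall, false_positive_rate):
    """Fraction of at-risk predictions that are wrong, i.e. 1 - precision."""
    true_positives = at_risk_rate * recall
    false_positives = (1 - at_risk_rate) * false_positive_rate
    return false_positives / (true_positives + false_positives)

# Assumed inputs: 6% of students at risk, 85% recall, 10% of OK students flagged.
share = false_alarm_share(at_risk_rate=0.06, recall=0.85, false_positive_rate=0.10)
# share is roughly 0.65
```

Under these assumptions roughly two of every three alerts are false alarms, which is consistent with the 65% figure reported for logistic regression and SVM: the OK students outnumber the at-risk students so heavily that flagging even 10% of them swamps the true positives.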
For logistic regression ◦ RMN_SCORE ◦ ACADEMIC_STANDING, CUM_GPA ◦ Then R_SESSIONS and SAT_VERBAL For the SVM classifier ◦ RMN_SCORE ◦ CUM_GPA, ACADEMIC_STANDING, R_SESSIONS and SAT_VERBAL C5.0/J4.8 ◦ Minimal difference among predictors
Results are encouraging, although the number of false alarms raises some concern ◦ Differences among classifiers, in particular DTs (typically very robust classifiers), require further investigation ◦ Data quality (missing values) remains an open issue with partial remediation ◦ The partial-grades-derived score (RMN_SCORE) remains the best predictor ◦ CMS-generated events appear to be second-tier predictors
Using Pentaho to score data Performance of the predictive model Producing Academic Alert Reports
ETL phase ◦ Remains similar to the ETL process used for training the model, except that records with missing data are also retained Scoring phase ◦ Uses the WEKA Scoring plugin to embed the WEKA predictive model into Pentaho Kettle Reporting phase ◦ Uses the Pentaho Report Designer tool to create a template for reporting
Awareness Messaging Online Academic Support Environment (OASE)
Researching effectiveness of two strategies ◦ Awareness Messaging ◦ Online Academic Support Environment (OASE) (Diagram: OASE components: OER Content, Self-Assessments, Learning Skills - Flat World Knowledge, Learning Support, Facilitation & Mentoring)
Initial research findings Future efforts
Applied similar analytical techniques to those used by Campbell at Purdue, using Fall 2010 data Marist College and Purdue University ◦ Differences (institutional type and size) ◦ Similarities: % students receiving federal Pell Grants, % ethnicity, ACT composite 25th/75th percentile We found similarities in correlation values. As in the case of Purdue, all these metrics are found to be significantly correlated with course grade, with rather low correlation values.
Initial review of instructor “research logs” showed general agreement with predictions Faculty/student feedback has been positive "Not only did this project directly assist my students by guiding students to resources to help them succeed, but as an instructor, it changed my pedagogy; I became more vigilant about reaching out to individual students and providing them with outlets to master necessary skills. P.S. I have to say that this semester, I received the highest volume of unsolicited positive feedback from students, who reported that they felt I provided them exceptional individual attention!"
Develop and release “student effort data” API Develop a Sakai Academic Alert dashboard Create customized predictive models for different academic contexts Work to facilitate use of SNAPP with Sakai
Sakai Confluence Wiki – Open Academic Analytics Initiative (OAAI) Contact Josh Baron – Senior Academic Technology Officer, Marist College Sandeep M. Jayaprakash – Technical Consultant OAAI, Marist College
Data sets and data extraction (diagram): Banner (ERP) supplies student data (demographics & course enrollment); Sakai supplies course event data. Data extraction: course event data aggregated, student data added, student identity removed. Data pre-processing: missing values, outliers, incomplete records, derived features. Identifying student information is removed during the data extraction process.
(Screenshot: Weka Knowledge Flow with Filter, Balance, and Partition steps and SVM/SMO, Logistic, DT J4.8, and NBayes classifiers)
(Screenshot: IBM SPSS Modeler stream with Filter, Partition, and Balance steps and Logistic, DT C5.0, and SVM classifiers)
Balance the training dataset ◦ Subsample (reduce the majority class) ◦ Oversample (duplicate records of the minority class) ◦ Oversample with SMOTE: Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research
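The first two balancing options can be sketched in a few lines. This is a simplified illustration with hypothetical function names and made-up records; it covers plain subsampling and duplication only, not SMOTE, which generates synthetic minority records instead of copying existing ones.

```python
import random

def split_by_class(records, label_of, minority_label):
    minority = [r for r in records if label_of(r) == minority_label]
    majority = [r for r in records if label_of(r) != minority_label]
    return minority, majority

def oversample(records, label_of, minority_label, seed=0):
    """Duplicate randomly chosen minority records until both classes are equal in size."""
    minority, majority = split_by_class(records, label_of, minority_label)
    if not minority:
        return list(records)
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def subsample(records, label_of, minority_label, seed=0):
    """Keep all minority records and an equally sized random sample of the majority."""
    minority, majority = split_by_class(records, label_of, minority_label)
    rng = random.Random(seed)
    keep = rng.sample(majority, min(len(minority), len(majority)))
    return minority + keep

# Made-up training set with the 6% at-risk rate mentioned for Marist.
training = [("student%d" % i, "ok") for i in range(94)]
training += [("student%d" % i, "at_risk") for i in range(94, 100)]

def label(record):
    return record[1]

balanced_up = oversample(training, label, "at_risk")    # 94 ok + 94 at risk
balanced_down = subsample(training, label, "at_risk")   # 6 ok + 6 at risk
```

The trade-off is visible in the sizes: subsampling discards most of the data, while oversampling keeps everything but repeats the same few minority records many times. Either way, only the training partition is balanced; the test data stays at its natural class proportions.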
In the case of unbalanced classes, accuracy is a poor measure
◦ Accuracy = (TP+TN) / (TP+TN+FP+FN)
◦ The large class overwhelms the metric
Better metrics, all derived from the confusion matrix:
◦ Recall = TP / (TP+FN): ability to detect the class of interest
◦ Specificity = TN / (TN+FP): ability to rule out the unimportant class
◦ Precision = TP / (TP+FP): ability to rule out false alarms
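The formulas above translate directly into code. The sketch below uses hypothetical function names and made-up confusion-matrix counts for 100 students, 6 of whom are at risk; a metric is returned as None when its denominator is zero.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, specificity and precision from confusion-matrix counts."""
    return {
        "accuracy": safe_divide(tp + tn, tp + tn + fp + fn),
        "recall": safe_divide(tp, tp + fn),
        "specificity": safe_divide(tn, tn + fp),
        "precision": safe_divide(tp, tp + fp),
    }

# A useless classifier that labels every one of the 100 students as OK.
always_ok = classification_metrics(tp=0, fp=0, tn=94, fn=6)

# A made-up classifier that catches 5 of the 6 at-risk students
# at the cost of 9 false alarms.
useful = classification_metrics(tp=5, fp=9, tn=85, fn=1)
```

The always-OK classifier scores 94% accuracy while detecting nobody (recall 0), whereas the second classifier has lower accuracy (90%) but a recall of 5/6. This is why accuracy is set aside in favor of recall, specificity, and precision.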
Logistic Regression C4.5/C5.0 Boosted Decision Tree Support Vector Machines
