1 June 10-15, 2012 Growing Community; Growing Possibilities Josh Baron, Marist College Sandeep Jayaprakash, Marist College Nate Angell, rSmart

2
- Open Academic Analytics Initiative (OAAI)
- Building the Predictive Model
  - Overview of the process
  - Data sets used and data extraction process
  - Overview of Pentaho and training process
- Deploying the Predictive Model
  - Using Pentaho to score data
  - Performance of the predictive model
  - Producing Academic Alert Reports (AARs)
- Overview of Intervention Strategies
- Current Outcomes and Next Steps

3 "Creating an Open Ecosystem for Learning Analytics"
- OAAI uses two primary data sources:
  - Student Information System (SIS - Banner): demographics, aptitude (SATs, GPA)
  - Learning Management System (LMS): event logs, gradebook
- Goal: create an open-source "early alert" system
  - Predict "at risk" students in the first 3 weeks of a course
  - Deploy an intervention to help ensure the student succeeds

4 (Model overview diagram)
- Static data: student aptitude data (SATs, current GPA, etc.); student demographic data (age, gender, etc.)
- Dynamic data: Sakai event log data; Sakai gradebook data
- The predictive model, developed with historical data, scores this data
- Scoring identifies students "at risk" of not completing the course
- An intervention is then deployed

5
- Purdue University's Course Signals Project
  - Built on dissertation research by Dr. John Campbell
  - Ellucian product that integrates with Blackboard
  - Students in courses using Course Signals scored up to 26% more A or B grades; up to 12% fewer C's; up to 17% fewer D's and F's
  - Positive effect on four-year retention rates
    - No Course Signals courses: 69%
    - Two or more Course Signals courses: 93%
- Interventions that utilize "support groups"
  - Improved 1st and 2nd semester GPAs
  - Increased semester persistence rates (79% vs. 39%)

6
- Building an "open ecosystem" for learning analytics
  - Sakai Collaboration and Learning Environment
    - Sakai API to automate secure data capture
    - Will also facilitate use of Course Signals & IBM SPSS
  - Pentaho Business Intelligence Suite
    - Open-source data mining, integration, analysis, and reporting tools
  - OAAI predictive model released under an open-source license
    - Predictive Model Markup Language (PMML)
- Researching critical analytics scaling factors
  - How "portable" are predictive models?
  - What intervention strategies are most effective?

7
- Release the Sakai Academic Alert System (beta)
  - Will be included as part of the Sakai CLE release
- Conduct real-world pilots
  - 36 courses at community colleges
  - 36 courses at HBCUs
- Research findings related to:
  - Strategies for effectively "porting" predictive models
  - The use of online communities and OER to impact course completion, persistence, and content mastery

8
- Wave I EDUCAUSE Next Generation Learning Challenges (NGLC) grant
- Funded by the Bill & Melinda Gates Foundation and the Hewlett Foundation
- $250,000 over a 15-month period
- Began May 1, 2011; ends January 2013 (extended)

9 Overview of the process; data sets and data extraction; overview of Pentaho and training process

10 Development and initial deployment of an open-source predictive model of academic risk
- Methodological framework for model development
- Empirical analysis of predictive performance (preliminary results)

11
- Data integration using Pentaho Kettle
  - Data extraction phase
  - Transformation phase
  - Load phase
- Predictive modeling using Pentaho WEKA
  - Training phase
  - Testing phase

12 (End-to-end architecture diagram)
- Source data: personal (bio) data, course data, performance data, CMS (Sakai) event data, partial grades (gradebook) data
- ETL layer (SQL Server 2008 R2, Pentaho Kettle): data pre-processing (missing values, outliers, derived features); output is anonymized data keyed by institution, semester, program (grad/ugrad), course, and student, plus the target feature
- Predictive modeling layer: training, testing, and scoring (IBM SPSS Modeler, Pentaho Weka, Pentaho Kettle): partition into training and test data; balance the training data; train classifiers from a library of predictive models (algorithms); test; store; predict (score) new data
- Hardware platform: IBM x3400 (Xeon E5410 2.33 GHz, quad-core, 64-bit, 10 GB RAM); OS: Windows Server 2008 Standard Edition

13
- SQL queries to extract grade and user event data from Sakai CLE (see the Sakai wiki for details)
- Ensure access to historical data: data warehouse, backups, etc.
- Extract from backups to avoid impacting production performance
- Encrypt user IDs to anonymize users (one possible approach is sketched below)
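
A minimal sketch of one way to implement the user-ID anonymization step. The keyed-HMAC approach, class name, and key handling are assumptions for illustration, not the OAAI implementation:

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class UserIdAnonymizer {
    private final Mac mac;

    // A secret key kept outside the exported data set makes the mapping hard
    // to reverse while remaining deterministic, so the same user ID always
    // maps to the same pseudonym across extracts.
    public UserIdAnonymizer(byte[] secretKey) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
    }

    // Mac is not thread-safe, hence synchronized; doFinal resets it for reuse.
    public synchronized String pseudonym(String userId) {
        byte[] digest = mac.doFinal(userId.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```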

14
- Data mining and predictive modeling are sensitive to the quality of their input data
- A predictive model is usually only as good as its training data
- Good: lots of data
- Not so good: data quality issues
- Not so good: unbalanced classes (at Marist, 6% of students are at risk; good for the student body, bad for training predictive models)

15
- Variability in instructors' assessment criteria
- Variability in workload criteria
- Variability in the period used for prediction (early detection)
- Variability in multiple-instance data (partial grades with variable contribution and heterogeneous composition)
- Solution: use ratios
  - Percent of usage over average percent of usage per course
  - Effective weighted score / average effective weighted score

16
- Variability in Sakai tool usage
- No uniform criterion in the use of CMS tools (faculty members are a wild bunch)
- Tools not used, data not entered, too much missing data
(Chart of tool usage: Forums, Content, Lessons, Assignments, Assessments)

17 (End-to-end architecture diagram repeated from slide 12.)

18
- Fall 2010 data sample of undergraduate students
- Datasets were joined, and the data was cleaned, recoded, and aggregated to produce an input file of 3,877 records, each corresponding to a course taken by a student

19
- Use Weka 3.7 and IBM SPSS Modeler 14.2
- Generate 5 different random partitions (70% training, 30% testing)
- Balance each training dataset
- Train a predictive model (logistic regression, C4.5 decision tree, SVM) for each balanced training dataset
  - 5 x 3 = 15 models
- Measure predictive performance of the classifiers: recall, precision, specificity
- Produce summary measures (mean and standard error); a sketch of this loop follows
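
A minimal sketch of the partition-balance-train-evaluate loop above using the Weka Java API. The file name, class-attribute position, "at risk" class index, and the choice to show only logistic regression are assumptions for illustration:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ModelTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the input file; class attribute assumed last.
        Instances data = DataSource.read("oaai_fall2010.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int atRiskIndex = 0; // assumed index of the "at risk" class value
        int runs = 5;
        double recallSum = 0;

        for (int seed = 0; seed < runs; seed++) {
            // 70/30 random partition.
            Instances copy = new Instances(data);
            copy.randomize(new Random(seed));
            int trainSize = (int) Math.round(copy.numInstances() * 0.70);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test = new Instances(copy, trainSize, copy.numInstances() - trainSize);

            // Balance the training set by resampling toward a uniform class distribution.
            Resample balance = new Resample();
            balance.setBiasToUniformClass(1.0);
            balance.setInputFormat(train);
            Instances balancedTrain = Filter.useFilter(train, balance);

            // Train one of the three classifier families (logistic regression shown).
            Logistic model = new Logistic();
            model.buildClassifier(balancedTrain);

            // Evaluate on the untouched 30% test split.
            Evaluation eval = new Evaluation(balancedTrain);
            eval.evaluateModel(model, test);
            recallSum += eval.recall(atRiskIndex);
            System.out.printf("seed %d: recall=%.3f precision=%.3f specificity=%.3f%n",
                    seed, eval.recall(atRiskIndex), eval.precision(atRiskIndex),
                    eval.trueNegativeRate(atRiskIndex));
        }
        System.out.printf("mean recall over %d runs: %.3f%n", runs, recallSum / runs);
    }
}
```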

20
- At model training time:
  - For all students in all courses, compute the effective weighted score as sumproduct(partial scores, partial weights) / sum(partial weights)
  - Compute the average effective weighted score for the course
  - Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
- At testing time: apply the same three steps to the students in the course being tested (see the sketch below)
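
A small sketch of the RMN_SCORE computation described above; the class, method, and variable names are illustrative:

```java
public class RmnScore {
    // Effective weighted score: sumproduct(partial scores, partial weights) / sum(partial weights).
    static double effectiveWeightedScore(double[] scores, double[] weights) {
        double sumProduct = 0, weightSum = 0;
        for (int i = 0; i < scores.length; i++) {
            sumProduct += scores[i] * weights[i];
            weightSum += weights[i];
        }
        return sumProduct / weightSum;
    }

    // RMN_SCORE: a student's effective weighted score relative to the course
    // average, which normalizes away differences in instructors' grading and
    // workload criteria (the "use ratios" solution from slide 15).
    static double rmnScore(double studentScore, double[] allStudentsScores) {
        double sum = 0;
        for (double s : allStudentsScores) sum += s;
        double courseAvg = sum / allStudentsScores.length;
        return studentScore / courseAvg;
    }
}
```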

22
- Logistic regression and SVM did much better than C5.0 / J4.8
  - They detect 82% to 87% of the at-risk student population
  - In comparison, the recall of C5.0 / J4.8 is 59% (why so low?)
- False positives:
  - 10% false positives among OK students (C5.0 / J4.8 does better: 3%)
  - 65% of predictions are false alarms (C5.0 / J4.8 does better: 44%)

23
- For logistic regression, the top predictors are:
  - RMN_SCORE
  - ACADEMIC_STANDING, CUM_GPA
  - Then R_SESSIONS and SAT_VERBAL
- For the SVM classifier:
  - RMN_SCORE
  - CUM_GPA, ACADEMIC_STANDING, R_SESSIONS, and SAT_VERBAL
- For C5.0 / J4.8:
  - Minimal difference among predictors

24
- Results are encouraging, although the number of false alarms raises some concern
- Differences among classifiers, in particular the decision trees (typically very robust classifiers), require further investigation
- Data quality (missing values) remains an open issue with only partial remediation
- The partial-grades-derived score (RMN_SCORE) remains the best predictor
- CMS-generated events appear to be second-tier predictors

25 Using Pentaho to score data; performance of the predictive model; producing Academic Alert Reports

26
- ETL phase: remains similar to the ETL process used for training the model, except that records with missing data are also retained
- Scoring phase: uses the Weka Scoring plugin to embed the WEKA predictive model into Pentaho Kettle (an equivalent standalone scoring step is sketched below)
- Reporting phase: uses the Pentaho Report Designer tool to create a template for reporting
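
Outside of Kettle, the same scoring step can be sketched directly against the Weka Java API. The file names, serialized-model format, threshold, and class index are assumptions; the OAAI pipeline itself runs this through the Weka Scoring plugin inside Kettle:

```java
import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ScoringSketch {
    public static void main(String[] args) throws Exception {
        // Load the model serialized at training time.
        Classifier model = (Classifier) SerializationHelper.read("oaai_logistic.model");

        // New, unlabeled course data prepared by the same ETL steps as training.
        Instances newData = DataSource.read("current_semester.arff");
        newData.setClassIndex(newData.numAttributes() - 1);

        int atRiskIndex = 0; // assumed index of the "at risk" class value
        for (int i = 0; i < newData.numInstances(); i++) {
            double[] dist = model.distributionForInstance(newData.instance(i));
            if (dist[atRiskIndex] > 0.5) {
                // These flagged rows would feed the Academic Alert Report.
                System.out.printf("record %d flagged at risk (p=%.2f)%n", i, dist[atRiskIndex]);
            }
        }
    }
}
```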

27 (End-to-end architecture diagram repeated from slide 12.)

29 Awareness Messaging; Online Academic Support Environment (OASE)

30
- Researching the effectiveness of two strategies:
  - Awareness Messaging
  - Online Academic Support Environment (OASE)
(OASE components shown: OER content, self-assessments, learning skills (Flat World Knowledge), learning support, facilitation & mentoring)

31 Initial research findings; future efforts

32
- Applied analytical techniques similar to those used by Campbell at Purdue, using Fall 2010 data
- Marist College and Purdue University
  - Differences: institutional type and size
  - Similarities: % of students receiving federal Pell Grants, % ethnicity, ACT composite 25th/75th percentiles
- We found similarities in correlation values
- As at Purdue, all of these metrics are significantly correlated with course grade, though with rather low correlation values (see the sketch below)
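
For reference, the correlation statistic behind comparisons like these is a plain correlation coefficient between each metric and the final course grade. A minimal sketch, assuming Pearson's r (the slide does not name the statistic) and illustrative variable names:

```java
public class GradeCorrelation {
    // Pearson correlation between a predictor (e.g., RMN_SCORE) and final course grade.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```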

33
- Initial review of instructor "research logs" showed general agreement with predictions
- Faculty/student feedback has been positive

"Not only did this project directly assist my students by guiding students to resources to help them succeed, but as an instructor, it changed my pedagogy; I became more vigilant about reaching out to individual students and providing them with outlets to master necessary skills. P.S. I have to say that this semester, I received the highest volume of unsolicited positive feedback from students, who reported that they felt I provided them exceptional individual attention!"

34
- Develop and release a "student effort data" API
- Develop a Sakai Academic Alert dashboard
- Create customized predictive models for different academic contexts
- Work to facilitate use of SNAPP with Sakai

36
- Sakai Confluence Wiki: Open Academic Analytics Initiative (OAAI)
  https://confluence.sakaiproject.org/pages/viewpage.action?pageId=75671025
- Contact:
  - Josh Baron, Senior Academic Technology Officer, Marist College: josh.baron@marist.edu
  - Sandeep M. Jayaprakash, Technical Consultant OAAI, Marist College: sandeep.jayaprakash1@marist.edu

41 (Data extraction diagram)
- Data sets: Banner (ERP) student data (demographics & course enrollment); Sakai course event data
- Data extraction: course event data aggregated, student data added, student identity removed
- Data pre-processing: missing values, outliers, incomplete records, derived features
- Identifying student information is removed during the data extraction process

43 (Weka Knowledge Flow screenshot: partition, filter, and balance steps feeding SVM/SMO, Logistic, J4.8 decision tree, and Naive Bayes classifiers)

44 (IBM SPSS Modeler stream screenshot: partition, filter, and balance steps feeding Logistic, C5.0 decision tree, and SVM models)

45
- Balance the training dataset (oversampling sketched below)
  - Subsample: reduce the dataset
  - Oversample: duplicate minority-class records
  - SMOTE: Nitesh V. Chawla et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
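
A minimal sketch of the oversampling option using Weka's supervised Resample filter (SMOTE is distributed as a separate Weka filter and is not shown; the file name here is an assumption):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class BalanceSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_partition.arff");
        train.setClassIndex(train.numAttributes() - 1);

        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0);   // push the class distribution toward uniform
        resample.setSampleSizePercent(100.0);  // keep the overall dataset the same size
        resample.setInputFormat(train);

        // With replacement sampling (the default), minority-class records are
        // duplicated to reach the uniform class distribution.
        Instances balanced = Filter.useFilter(train, resample);
        System.out.println("balanced training set: " + balanced.numInstances() + " instances");
    }
}
```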

46
- With unbalanced classes, accuracy is a poor measure
  - Accuracy = (TP+TN) / (TP+TN+FP+FN)
  - The large class overwhelms the metric
- Better metrics:
  - Recall = TP / (TP+FN): ability to detect the class of interest
  - Specificity = TN / (TN+FP): ability to rule out the unimportant class
  - Precision = TP / (TP+FP): ability to rule out false alarms
- All are read off the confusion matrix (see the sketch below)
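
A small sketch of these formulas applied to a 2x2 confusion matrix; the counts are hypothetical, chosen to echo the ~6%-at-risk imbalance, and show how accuracy stays high while precision collapses:

```java
public class ConfusionMetrics {
    // Counts from a 2x2 confusion matrix, with "at risk" as the positive class.
    static double accuracy(int tp, int tn, int fp, int fn) { return (double) (tp + tn) / (tp + tn + fp + fn); }
    static double recall(int tp, int fn)      { return (double) tp / (tp + fn); }
    static double specificity(int tn, int fp) { return (double) tn / (tn + fp); }
    static double precision(int tp, int fp)   { return (double) tp / (tp + fp); }

    public static void main(String[] args) {
        // Hypothetical: 1000 students, 60 truly at risk (6%).
        int tp = 50, fn = 10, fp = 90, tn = 850;
        // Prints accuracy=0.90 despite precision of only ~0.36 (many false alarms).
        System.out.printf("accuracy=%.2f recall=%.2f specificity=%.2f precision=%.2f%n",
                accuracy(tp, tn, fp, fn), recall(tp, fn), specificity(tn, fp), precision(tp, fp));
    }
}
```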

47 Classifier families: logistic regression, C4.5/C5.0 boosted decision tree, support vector machines

