Building a predictive model to enhance students' self-driven engagement Moletsane Moletsane T: +27(0)51 401 9111 | info@ufs.ac.za | www.ufs.ac.za
Overview
- Introduction: introduction and motivation for a sensitivity tool
- Data: criteria for inclusion of variables; variables used
- Modelling: random forest modelling process; evaluation of the model
- The “what-if” tool
What is student engagement?
Student engagement measures provide information about:
- What students do: the time and energy devoted to educationally purposeful activities
- What institutions do: using effective educational practices to induce students to do the right things
The aim is to channel student energy towards activities that matter.
What do we learn from SE surveys?
In the absence of reliable indicators of actual student learning, SE surveys are “process indicators or proxies for student learning outcomes” (Banta, Pike, Hansen, 2009; Kuh, 2009).
Having reflected on what student engagement is, it is important to explore what we learn from these measures in terms of the quality of teaching and learning.
Is SE data shared with students?
- Students make little use of student engagement data.
- The same holds for technology committees/groups in the institutions (NSSE, 2014): http://www.jsu.edu/oira/reports_pdf/National_Survey_of_Student_Engagement.pdf
“…this principle establishes the need to establish what information students will need in order to make more informed decisions regarding their learning journeys as basis for all collection, analysis and use of student data.”
“The collection, analysis and use of student data therefore needs to primarily reflect the interests, values and priorities of students.”
How can we best share SE data with students?
In a manner that:
- Guides students’ effective educational behaviours and encourages students to make more informed decisions regarding their learning
- Reflects the students’ interests
- Does not violate students’ privacy
- Is user-friendly
How can we best share SE data?
Possible methods include:
- Creating an annual report for students
- Releasing snippets of data at certain time intervals (social media, posters, email, SMSs)
- Publishing SE articles in varsity magazines
- Using SE data during the advising process
- Providing students with aggregated data
- Providing a web-based prediction tool that implements a model based on SE data
What is the prediction tool?
- A prediction model (we use a machine learning technique for the prediction modelling)
- Implemented in a web interface (built in the R environment)
- That makes reactive predictions in response to students’ inputs on the tool
The tool allows students to:
- Explore which educational behaviours lead to a higher chance of success, thus encouraging them to make more informed decisions regarding their learning
- Ask what-if questions, and then find answers
- Explain reactive predictions
What data do we have?
- Student engagement data: UFS data from 2013 to 2016
- Biographical data
- Institutional data
- Students’ outcomes, e.g. we use the proportion of modules passed
- Students’ credit and module load
Should we include all the data?
- Biographical data: since we intend to share the tool with students, we believe that biographical data (e.g. race, disability, or gender) may be interpreted in a prejudiced manner.
- Non-actionable data: for the purpose of the tool, some non-actionable data (e.g. faculty, residence status) was not included in the prediction model despite being a modest predictor.
SASSE data
- UFS data from 2013 to 2016 has 6213 respondents.
- Only 4602 of the observations (about 74%) are matched to the institutional data.
- 190 variables
How do we choose which variables to use?
Variable importance: the machine learning technique we use has a built-in variable selection method. Based on cross-validation principles, it ranks the variables by how much accuracy the model loses when it is fitted without that variable. From the top-ranking variables, we select the 8 most predictive variables for our method.
How do we choose which variables to use? (continued)
From the same variable importance ranking, we select the best 5 variables for our interface.
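The ranking idea described on these slides (measure how much accuracy is lost when the model is fitted without a variable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tool was built in R, whereas this sketch uses Python; a simple nearest-centroid classifier stands in for the random forest; and the function names and toy data are hypothetical.

```python
from statistics import mean

def fit_centroids(X, y):
    """Stand-in model (assumption): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    # sorted() makes tie-breaking deterministic (lowest label wins).
    return min(sorted(centroids), key=sq_dist)

def accuracy(X_train, y_train, X_test, y_test):
    """Fit on the training rows and score on the held-out rows."""
    model = fit_centroids(X_train, y_train)
    hits = sum(predict(model, x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test)

def without(X, j):
    """Copy of X with column j removed."""
    return [[v for k, v in enumerate(row) if k != j] for row in X]

def drop_column_importance(X_train, y_train, X_test, y_test):
    """Importance of variable j = baseline accuracy minus accuracy without j."""
    baseline = accuracy(X_train, y_train, X_test, y_test)
    return {
        j: baseline - accuracy(without(X_train, j), y_train,
                               without(X_test, j), y_test)
        for j in range(len(X_train[0]))
    }

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X_train = [[1, 5], [2, 3], [1, 4], [8, 4], [9, 5], [8, 3]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[2, 4], [9, 4], [1, 3], [8, 5]]
y_test = [0, 1, 0, 1]

importance = drop_column_importance(X_train, y_train, X_test, y_test)
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

Sorting by the accuracy lost puts the most predictive variables first; taking the first 8 or 5 entries of such a ranking corresponds to the selection step described above.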
Which variables are most important?
MeanDecreaseGini is a measure of variable importance based on the Gini impurity index used to calculate splits during training. A common misconception is that this importance metric refers to the Gini coefficient used to assess model performance, which is closely related to AUC (Gini = 2 × AUC − 1). The two are different quantities.
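To make the distinction concrete, the Gini impurity behind MeanDecreaseGini can be computed as below. This is a minimal sketch in Python (an assumption; the tool itself was built in R) with hypothetical function names. It shows the impurity of a node and the decrease achieved by one split. MeanDecreaseGini is built from such per-split decreases, accumulated for each variable across the trees of the forest.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical node with two passes (1) and two fails (0), split perfectly.
parent = [0, 0, 1, 1]
print(gini_impurity(parent))                  # 0.5
print(gini_decrease(parent, [0, 0], [1, 1]))  # 0.5
```

A perfectly mixed two-class node has impurity 0.5, and a split that separates the classes completely removes all of it.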
Algorithm
1. For each tree k = 1 to K:
   a. Draw a bootstrap sample of size n from the data.
   b. Grow a random forest tree on the bootstrapped data by repeating the following at each node:
      i. Select m variables at random from the p variables.
      ii. Pick the best variable and split point among the m variables.
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees.
3. Make the final prediction by majority vote of the ensemble.
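The steps above can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the authors' model: their tool was built in R, and this sketch uses Gini impurity as the split criterion, caps tree depth at 3, and runs on hypothetical toy data. All function names are illustrative.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, m, rng, depth=0, max_depth=3):
    """A leaf is a class label; an internal node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    features = rng.sample(range(len(rows[0])), m)  # m of the p variables
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left],
                      m, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [y for _, y in right],
                      m, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def random_forest(rows, labels, k, m, seed=0):
    """Grow k trees, each on a bootstrap sample of size n."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Final prediction by majority vote of the ensemble."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes, p = 2 variables.
rows = [[1, 1], [2, 2], [1, 2], [2, 1], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(rows, labels, k=25, m=1)
print(predict_forest(forest, [1, 1]), predict_forest(forest, [9, 9]))
```

Drawing a fresh subset of m variables at every node, rather than once per tree, is what distinguishes a random forest from plain bagging of trees.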
Overview of the random forest model
[Figure: samples 1 to k are drawn from the training data; a learning algorithm fits one classifier to each sample (classifier 1 to classifier k); the combined classifiers produce a prediction for new data.]
Model results
Prediction with all (177) the variables, sample (20.97%):
- False positive rate = 20.8%
- False negative rate = 21.08%
Prediction with the selected (8) variables, sample (23.64%):
- False positive rate = 24.3%
- False negative rate = 23.5%
[Figure: two confusion matrices of predicted versus actual outcomes, one per model.]
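The false positive and false negative rates reported here are derived from a confusion matrix of predicted against actual outcomes. A minimal sketch of that calculation follows, in Python (an assumption; the tool was built in R). The function name, the choice of 1 as the positive class, and the toy labels are hypothetical and are not the study's data.

```python
def error_rates(actual, predicted, positive=1):
    """Return overall error rate, false positive rate and false negative rate."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return {
        "error_rate": (fp + fn) / len(actual),
        "false_positive_rate": fp / (fp + tn),  # share of actual negatives missed
        "false_negative_rate": fn / (fn + tp),  # share of actual positives missed
    }

# Hypothetical labels for eight students (1 = positive class, 0 = negative).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
rates = error_rates(actual, predicted)
print(rates)
```

Reporting both rates alongside the overall error shows whether a model's mistakes are balanced across the two outcome classes, as they roughly are for both models on this slide.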
The tool (Part 1 of 2)
The tool (Part 2 of 2)
Thank you