Making Sense of Student Loan Data

Presentation on theme: "Making Sense of Student Loan Data"— Presentation transcript:

1 Making Sense of Student Loan Data
Helping students and postsecondary institutions make educated financial decisions
May 5, 2016
BIT 5534 Final Project
Group 5 - Jason Dominiczak, Aras Memisyazici, Jessica Oaks

Hello. My name is Jessica Oaks, and I am presenting on behalf of my group, which includes Jason Dominiczak and Aras “Russ” Memisyazici. In our project, Making Sense of Student Loan Data, we explored some of the less visible aspects of student loan data and identified factors that may provide insight to prospective students as well as help schools improve their performance.

2 Problem Statement
As of 2011, almost 20 million students enrolled in higher education
Over $68 billion in student debt
7 million are in default on their student loans
There is a need to explore student loan data to help students and institutions make better financial decisions.

In 2011, nearly 20 million students enrolled in a higher education program. During the course of their studies, this cohort amassed over $68 billion in student debt. The total amount of student debt in the United States is $1.2 trillion. More worrying is that over 7 million former students are in default on their loans. When choosing an institution and academic program, students and their parents can be overwhelmed with choices. Questions such as “Which program provides the best return on investment?” and “Which college is worth the money?” are not easy to answer. This project aims to explore organizational and educational factors that affect a student’s ability to service their debt. This analysis is important from both a potential student’s point of view and an educational institution’s point of view.

3 Dataset Description
GE Dataset
Over 13,000 records
Many missing/redacted features
4,615 relevant program of study/educational institution records
Educational institution identified by OPEID code
Program of study identified by CIP code
College Scorecard
Contains a record for each accredited educational institution
Educational institution identified by OPEID code
Amalgamation of multiple sources of data
Hundreds of features

As part of the 2012 US Department of Education rules, preliminary data regarding gainful employment was released to the public. This dataset (GE dataset) covers programs that prepare students for gainful employment. The GE dataset is aggregate in nature and only reports data for programs with over 30 students in the relevant cohort. The data includes the educational institution and geographic information about its headquarters. Over 13,000 programs of study are included in the GE dataset. Programs with fewer than 31 students in the relevant cohort were removed due to unavailable data. The remaining 4,615 programs of study are the focus of this analysis.

The individual educational institutions are identified by the unique Office of Postsecondary Education Identification code, often called the OPEID code. The standardized Classification of Instructional Programs (CIP) code provides a window into academic program classifications across institutions: it is a taxonomic scheme that supports accurate tracking, assessment, and reporting of fields of study and program completions.

To supplement the GE dataset, the College Scorecard dataset (CS dataset), supplied by the U.S. Department of Education, was added. The College Scorecard dataset contains hundreds of variables related to an educational institution’s structure, funding, type, size, and student outcomes.
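The filtering and joining steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`opeid`, `cip`, `cohort_size`, `state`, `control`) and the sample records are made up, and the group's actual work was done in JMP rather than in code like this.

```python
# Hypothetical sample records; field names and values are illustrative only.
ge_programs = [
    {"opeid": "001234", "cip": "51.3801", "cohort_size": 120, "repayment_rate": 41.5},
    {"opeid": "001234", "cip": "11.0101", "cohort_size": 18, "repayment_rate": None},
    {"opeid": "005678", "cip": "52.0201", "cohort_size": 31, "repayment_rate": 55.0},
]

# Hypothetical College Scorecard rows, keyed by OPEID.
scorecard = {
    "001234": {"state": "VA", "control": "Public"},
    "005678": {"state": "OH", "control": "Private"},
}


def build_analysis_set(programs, institutions, min_cohort=31):
    """Keep programs with a reportable cohort and attach institution features."""
    merged = []
    for program in programs:
        # Programs with fewer than 31 students have no reported data.
        if program["cohort_size"] < min_cohort:
            continue
        institution = institutions.get(program["opeid"])
        if institution is None:
            continue  # no matching College Scorecard record for this OPEID
        merged.append({**program, **institution})
    return merged


analysis_set = build_analysis_set(ge_programs, scorecard)
```

Each row of `analysis_set` is one program of study (identified by OPEID plus CIP code) carrying the features of its parent institution.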

4 Data Exploration and Dataset Features
Institutional Features
Institution Name, State, Public or Private, Full Time Faculty Share of Expenditures, Average Cost of Attendance, Part Time Faculty Percentage, Admission Rate
Educational Program Information
Program Code, Credential Level
Student Debt Information
Repayment Rate, Median Debt for Graduates, Two and Three Year Cohort Default Rate, Share of students earning over $25k in 6 years, Median Private Loans, Median Institutional Loans, Debt to Earnings and Debt to Discretionary Income Percentages

The selected dataset was used to determine which attributes best explain markers of student loan debt. The analysis used the repayment rate as the target variable, as it is a strong marker of whether a student is able to service their debt. Using this dataset, we were able to examine the relationships between this marker of student debt and the attributes of educational programs and educational institutions. It is important to understand the relationship, if any, between the attributes of an educational program, the attributes of an institution, and the graduate’s ability to service their student debt.

5 Student Loan Average Repayment Rate by State
This visual shows the variability among state averages for student repayment rate and the geographical variability present in this dataset. The varying state performance offers insight into possible sources of differences at the state level. These differences may be due to state policies on funding strategies in terms of salary levels and as a share of state loans, grants, and other financial support. Institutions in states with a lower loan repayment rate may look to sister states for insight and advice on how to achieve better rates.
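A per-state average like the one behind this visual could be computed as in the sketch below. It assumes an unweighted mean over program-level repayment rates, which the presentation does not confirm, and the field names and sample values are hypothetical.

```python
from collections import defaultdict


def mean_repayment_by_state(records):
    """Return the unweighted mean repayment rate for each state."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [running sum, count]
    for record in records:
        entry = totals[record["state"]]
        entry[0] += record["repayment_rate"]
        entry[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Made-up program-level rows for illustration.
sample = [
    {"state": "VA", "repayment_rate": 40.0},
    {"state": "VA", "repayment_rate": 50.0},
    {"state": "OH", "repayment_rate": 30.0},
]
state_means = mean_repayment_by_state(sample)
```

Weighting each program by its cohort size would be a reasonable alternative if the goal is a student-level rather than a program-level average.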

6 Linear Regression
Repayment Rate as the target variable in the final model
Scatterplot shows a significant slope, providing evidence this is a relevant model
Correlations were determined using linear regression fit modeling within the statistical software JMP
RSquared value of [value missing from transcript]: the closer the value is to 1, the more accurate
Root Mean Square Error (RMSE) of 8.01: a lower value indicates a better model

The first model examined was a linear regression model. A linear regression model is a useful predictor when analyzing statistics because users can see patterns visually with scatterplots and trend lines. The trends are easy to understand graphically in relation to the target variable and the independent variables. For the student loan data, an initial model was created to examine how all of the variables related to the target variable of repayment rate. The top predictors of repayment rate were narrowed down by examining the model summary results. The final model had an RSquared value of [value missing from transcript]. The closer the RSquared value is to 1, the more of the variation in repayment rate the model explains. The Root Mean Square Error (RMSE) was 8.01; a lower RMSE indicates a better fit and makes the model more accurate for predicting performance.

7 Linear Regression Significant Predictors of Repayment Rate
The scatterplots shown above reference Institution Name, CIP Name, Credential Level, and Debt to Earnings. These variables all had a p-value less than .005, which means these four indicators play the most significant role in explaining loan repayment rate. It is interesting to note that some of the higher repayment rates and debt-to-earnings rates are in programs beyond the bachelor’s level, such as Master’s, Doctoral, and Professional degrees. This is likely because, although the amount of debt is large for these advanced degrees, the future earnings potential is also large. Repayment rate does not take into account the amount being repaid each month, only that a payment is being made. The lower graph tries to account for some of this discrepancy by showing the credential level vs. the mean repayment rate.

8 Building a Better Prediction Model
Three Additional Models
Decision Tree
Bootstrap Forest
Neural Network
Increase in Accuracy
Validation

The linear regression model provides a useful tool for evaluating the impact of the selected features on the overall repayment rate. While this is quite useful, there is an opportunity to use other methods to predict the repayment rate more accurately. To accomplish this, three additional predictive models were built: a bootstrap forest model, a decision tree model, and a neural network model. These three models use more complex structures to represent the data. Each was run with a validation subset of the overall dataset to ensure the model is generalizable and to reduce the risk of overfitting.
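Holding out a validation subset can be sketched as follows. The 30% validation fraction and the seed are assumptions chosen for illustration; the presentation does not state the split that was used in JMP.

```python
import random


def split_train_validation(records, validation_fraction=0.3, seed=5534):
    """Shuffle reproducibly, then hold out a fraction of rows for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Row indices stand in for the 4,615 program records.
train, validation = split_train_validation(range(4615))
```

A model is fit on `train` only and then scored on `validation`; a large gap between the two scores is the usual sign of overfitting.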

9 Predicting Performance
While the bootstrap forest model provided the best accuracy for this dataset, the model fails to capture a significant amount of the observed variation. The results of the linear regression and the bootstrap forest confirm the predictive power of the educational program and the institution itself. These two features show that the educational program and the institution a student chooses have a strong impact on their future ability to service their debt. While the model has captured the significant role an institution plays in its graduates’ ability to repay their loans, the chosen institutional features have not been able to provide more detail or more predictive power.

10 Principal Component and Cluster Analysis
Cluster analysis identified a few interesting clusters. Of note were:
Cluster #1
Cluster #4
These clusters all pointed out some very unique relationships and data points.

The PCA and cluster analysis helped us identify key variables in this dataset and how the data responded within these reduced variable sets. Of particular note, Clusters #1 and #4 showed interesting deviations from the other clusters. Specifically, Cluster #4 contained the students with the highest income after graduation and the largest amount borrowed. It appears that students who take on significant debt during their educational program, and who have a higher income after graduation than the rest of the clusters, are able to repay their loans at a much higher rate. Cluster #1, by contrast, contained the students with the lowest income after graduation, who also incurred little education debt. This cluster had the second-smallest default rate. This reiterates the earlier point that students should be aware of the financial return they can expect from their educational program and ensure that the level of debt they are taking on is sustainable.
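The presentation does not say which clustering method was used in JMP. As one possibility, the sketch below runs a plain k-means on two invented features (median debt and median earnings, both in thousands of dollars) to show how a low-debt/low-income group and a high-debt/high-income group separate. The data, the choice of k-means, and the starting centroids are all assumptions.

```python
def kmeans(points, centroids, iterations=20):
    """Basic k-means: assign each point to its nearest centroid, then re-average."""

    def nearest(point, centers):
        distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
        return distances.index(min(distances))

    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Keep the old centroid if a group happens to be empty.
        centroids = [
            tuple(sum(values) / len(group) for values in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


# Invented (median debt, median earnings) pairs, in thousands of dollars.
programs = [(5, 20), (6, 22), (7, 21), (60, 80), (65, 90), (70, 85)]
centroids, labels = kmeans(programs, centroids=[programs[0], programs[3]])
```

In practice the features would be standardized first (or replaced by principal component scores) so that no single variable dominates the distance calculation.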

11 Conclusion
Variability in the student’s ability to service their debt is tied to institutional factors.
The two main factors are the educational institution and the student’s chosen plan of study.
In choosing an educational institution and plan of study, a quantitative approach can be beneficial.
As standardized data collection improves, we can expect that a more complete model can be built.

Through the course of this investigation, we have been able to use the dataset to generate models that do a fair job of accounting for variability in the student’s ability to service their debt. These models show that a student’s choice of educational institution and their plan of study play a large part in their future financial stability. While these choices are often made by students and their parents using qualitative measures, we have shown that taking a quantitative approach to choosing where to go to college and what to study is useful for their long-term financial stability. As standardized data collection improves at the federal level, we can expect that a more complete model can be built.

