IDENTIFYING BERNIE SANDERS’ VOTER BASE THROUGH PREDICTIVE ANALYTICS


1 IDENTIFYING BERNIE SANDERS’ VOTER BASE THROUGH PREDICTIVE ANALYTICS
BIT5534 – SPRING 2016
SASIKUMAR NATESAN, RAJI RAMANATHAN, NIMA SOLEIMANI

This study focuses on the identification of a primary candidate’s core voter base through demographic analysis. In this presentation, we will describe the problem that our study addresses, identify the sources of the data we used, outline our data preparation and exploration steps, examine the models that were produced, and draw conclusions based on our data analysis.

2 PROBLEM DESCRIPTION
America’s demographics are quickly transforming, which affects the electorate
Candidates must be able to identify and mobilize their supporters
Which demographic indicators, such as age, race, gender, income, population density, and education, impact primary outcomes?
What is Bernie Sanders’ core demographic group?

As the makeup of the American population transforms, traditional voting demographics are also changing. For Bernie Sanders to win the remaining Democratic primaries, his campaign must be able to identify his core supporters, target them in the remaining contests, and mobilize them to the polls. To assist the campaign in this endeavour, demographic data and primary voting records for each state were analyzed to identify the key demographic groups that propelled Bernie Sanders to victory in completed contests.

3 DATA SOURCE
2014 population, demographic, and housing unit data provided by the U.S. Census Bureau for all fifty states
Total of 239 attributes consolidated
State-wise 2016 primary voting results for all candidates

The data source for this study consists of publicly available 2014 estimates of population, demographic, and housing unit data from census.gov. This data, along with the primary voting records for completed contests from kaggle.com, was used to predict the outcomes of contests in the remaining primary states in the 2016 presidential primary election cycle.

4 DATA PREPARATION
Data Cleansing:
  Removed attributes with no data in them
  Removed % symbols from data
Formatting Data:
  Rows and columns were transposed for further analysis
  Cumulative totals and category headers were removed
Constructing New Data:
  New dependent Boolean variables denoting a candidate’s win/loss
Integrating Data:
  Census data from 3 sheets and primary results data consolidated by state

To prepare the data for analysis, we identified typos, missing attribute values, and invalid entries. Attributes that did not have any data were removed from the analysis. The numeric data in the files downloaded from the Census Bureau were erroneously represented as text, and percentage fields were denoted with a % symbol. The data in these fields were reformatted and the ‘%’ symbol was removed. To bring our data sources into a format that can be analyzed in JMP, pivot tables were used and the rows and columns were transposed for the data downloaded from the U.S. Census Bureau. The data from the three files were then integrated, based on the state column, into a single workbook. A new Boolean attribute was added to represent whether a candidate won or lost each state’s contest. The data was then partitioned into training and validation data sets.
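
The preparation steps above were carried out in spreadsheets and JMP. As a rough illustration only, the same sequence (strip the % symbol, convert text to numbers, transpose, join on state, add a win/loss flag, drop empty attributes) might look like the sketch below. The sheet layout, the attribute names, the state names, and every value are made up for the example, and the helper names `clean_value`, `transpose`, and `build_dataset` are hypothetical.

```python
def clean_value(text):
    """Turn a text cell such as '31.5%' or '1,250,000' into a float (None if empty)."""
    stripped = text.strip().rstrip("%").replace(",", "")
    return float(stripped) if stripped else None


def transpose(table):
    """Swap the rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


def build_dataset(census_sheet, vote_shares):
    """Return {state: record} with cleaned attributes and a win/loss flag."""
    # After transposing, the first row holds attribute names and each
    # following row holds one state.
    header, *state_rows = transpose(census_sheet)
    attributes = header[1:]
    dataset = {}
    for state, *cells in state_rows:
        record = dict(zip(attributes, (clean_value(c) for c in cells)))
        if state in vote_shares:  # only completed contests get a label
            shares = vote_shares[state]
            record["sanders_won"] = shares["Sanders"] == max(shares.values())
        dataset[state] = record
    # Drop attributes that carry no data for any state.
    for name in attributes:
        if all(rec[name] is None for rec in dataset.values()):
            for rec in dataset.values():
                del rec[name]
    return dataset


# Hypothetical input: attributes in rows, states in columns (made-up values).
census_sheet = [
    ["Attribute", "State A", "State B"],
    ["Bachelor's degree, age 25+ (%)", "31.5%", "24.0%"],
    ["Total population", "1,250,000", "4,800,000"],
    ["Unused attribute", "", ""],
]
vote_shares = {
    "State A": {"Sanders": 58.0, "Clinton": 42.0},
    "State B": {"Sanders": 40.0, "Clinton": 60.0},
}
dataset = build_dataset(census_sheet, vote_shares)
```

States whose contests have not yet been held simply receive no `sanders_won` key, which keeps them available for prediction later.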

5 DATA EXPLORATION
Scatter plots used to study the relationship between independent and dependent variables
Variance Inflation Factor used to determine the severity of correlation among variables
Key finding: high multicollinearity between several predictor variables

In the next phase, we performed exploratory data analysis to deepen our understanding of the collected data. Our goal was to identify associations between the predictors and the response variable. We developed scatter plots to study relationships between the predictors and the response variable. We also studied the relationships among the predictor variables themselves. Through this analysis, we discovered multicollinearity among the predictor variables. Based on these results, we removed some of the highly correlated variables from the model because they supplied redundant information and did not affect how well the data points fit the prediction line.
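
For readers unfamiliar with the Variance Inflation Factor, here is a minimal sketch of one standard way to compute it: the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. This is not the JMP procedure used in the study. The column names and values are invented, and `pearson`, `invert`, and `vif` are hypothetical helpers written for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {predictor name: VIF} for a dict of equal-length numeric columns."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictors (made-up values); the first two move together.
predictors = {
    "bachelors_pct": [22.0, 25.0, 31.0, 36.0, 40.0],
    "median_income": [41.0, 47.0, 52.0, 63.0, 66.0],
    "public_admin_pct": [5.0, 3.5, 6.0, 4.0, 5.5],
}
factors = vif(predictors)
```

A VIF near 1 indicates little overlap with the other predictors. A common rule of thumb, not stated in the presentation, treats values above roughly 5 to 10 as a sign of problematic multicollinearity.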

6 MODEL DEVELOPMENT
Logistic Regression: highly significant predictors
  Public administration employees
  People aged 25 and over with a bachelor’s degree
  Cash and public assistance recipients
Principal Component Analysis:
  Two principal components accounted for about 90% of the variance
Decision Tree: predictors identified
  People aged 25 and over with a bachelor’s degree
  Non-Hispanic and non-Latino Cuban

Our first goal was to extract the best subset of potential independent variables for our forecasting model, which would point us to the most impactful demographics. Since our dependent variable is nominal, we opted for a stepwise logistic regression model. Logistic regression was chosen because it yields insight into which attributes are likely to predict an event outcome, and it shows the extent to which changes in attribute values affect the predicted probability of that outcome. Furthermore, logistic regression parameters are easy to interpret and explain. The stepwise regression model identified those employed in public administration, those aged 25 years and over with a bachelor's degree, and households with cash public assistance income as significant predictors of Bernie Sanders’ performance in a given primary or caucus.

Our next attempt was to generate a model using a linear combination of the predictor variables. During the data exploration phase of the study, we found substantial correlation among the predictor variables. Because of this correlation, we ran a dimension reduction technique, Principal Component Analysis, on the predictors to reduce them to a set of linearly uncorrelated variables. Two principal components account for around 90% of the total variance in the dataset. We then ran a regression using these principal components as predictors of Bernie Sanders’ win or loss.

We also ran a Decision Tree model on the data set to reveal any non-linear relationships between demographic groups. Because we required an interpretable model that could be easily explained to campaign managers, we chose to employ a Decision Tree rather than the more complex Neural Network. The Decision Tree model identified those aged 25 years and over with a bachelor’s degree, and non-Hispanic and non-Latino Cubans, as significant indicators for predicting a Bernie Sanders win in a given primary or caucus.
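
To make concrete how a logistic regression links a change in an attribute to a change in predicted win probability, the sketch below fits a one-predictor model by plain gradient descent. It is a simplified stand-in, not the stepwise JMP procedure used in the study. The predictor (a standardized share of residents with a bachelor's degree), the outcomes, and the function names are all assumptions made for illustration.

```python
from math import exp


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(win) = sigmoid(b0 + b1 * x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every state under the current coefficients.
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Hypothetical data: standardized predictor and win (1) / loss (0) per state.
xs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * -1.0)   # predicted win probability at x = -1
p_high = sigmoid(b0 + b1 * 1.0)   # predicted win probability at x = +1
```

The sign and size of `b1` carry the interpretation: each one-unit increase in the predictor multiplies the odds of a win by `exp(b1)`, which is why such models are easy to explain to a non-technical audience.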

7 MODEL PERFORMANCE
[ROC curves: Logistic Regression, Decision Tree]

The ROC curves for the logistic regression and Decision Tree models show solid area-under-the-curve (AUC) values of 0.91 and 0.92, respectively. A value of 0.5 corresponds to random guessing, so values between 0.5 and 1 are acceptable, and values closer to 1 indicate more accurate performance. These measurements provide confidence in the performance of our Logistic Regression and Decision Tree models.
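
Assuming the figures reported for the ROC curves are areas under the curve, the quantity can be sketched as the probability that a randomly chosen win receives a higher predicted score than a randomly chosen loss, with ties counted as half. The function name `roc_auc` and the labels and scores below are made up for illustration and are not the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    # Count a full point when the positive outranks the negative, half for a tie.
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical validation set: 1 = win, 0 = loss, with predicted win probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 win/loss pairs are ranked correctly
```

A model that ranks every win above every loss scores 1.0, while one that assigns the same score to every state scores 0.5.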

8 MODEL EVALUATION
Both the Logistic Regression and the Decision Tree models performed well
The models had nearly identical RSquare measurements

The model comparison platform in JMP was used to compare the performance of our models. It rated the logistic regression highest, followed by the decision tree model. A closer look at the RSquare values reveals that both models performed about equally well, with similar error rates. Based on the results of this comparison, we will rely on the logistic regression and decision tree models to identify Bernie Sanders’ key demographic groups.

9 CONCLUSION
Two successful models produced
Demographics identified:
  Population 25 years and over with a bachelor's degree
  Public administration employees
  Households with cash public assistance income
  Non-Hispanic or non-Latino Cuban voters
Alternate approach: analyze demographics and outcomes at the county level instead of the state level

The logistic regression and decision tree models developed in this study were reasonably successful in predicting the outcomes of the 2016 primaries and caucuses, and are thus able to identify the core demographic groups that are key contributors to Bernie Sanders’ wins. These two models identified those employed in public administration, those aged 25 years and over with a bachelor's degree, households with cash public assistance income, and non-Hispanic or non-Latino Cuban voters as the significant predictors of a Bernie Sanders win. Based on these results, we recommend that the Bernie Sanders campaign focus its resources on identifying these target demographic groups in upcoming primaries. The campaign should market its candidate to these groups and mobilize them to vote. If the identified demographic groups turn out at the polls in large numbers, Bernie Sanders will have a greater chance of winning upcoming states and securing the Democratic nomination.

