As I mentioned last time, I'll use the predictive models I built to project the next batch of primaries coming up on March 15th. I'll also talk about some limitations of those models, and I'll keep a running list of predictions to check against as more results roll in.
Last time I actually created two models: a classification model (winner/loser) and a regression model (predicted percentage of votes by candidate). I'm going to focus on the regression model, because in many states the percentage of votes determines the delegate allocation while others are winner-take-all. Additionally, there are several very close races, and the closer a race is to 50/50, the harder time a classification model will have, so a regression model sidesteps a lot of these issues.
The previous regression model used a train/test split for Hillary and got an r-squared of 79.4%. This time I'm going to use all of the data to build the model and handle cross validation and testing a little differently. The test-set method is just one way to make sure your model handles out-of-sample prediction well, but the drawback is that you leave a lot of data on the table and don't incorporate its information into the model. Instead I'm going to use a method called 10-fold cross validation, which only "wastes" 10% of the data at a time (instead of 30% or 50%); the only real drawback is that it's computationally more intensive (not a big deal on this dataset, which runs in maybe 30 seconds on my computer). This should make the model more robust, and we can test it against the actual primary results as they come out.
I'll also use a different package in R called caret, which makes k-fold cross validation easy to implement. Cross validation is explained very well here; in a nutshell, instead of building just one model, it iteratively partitions the data, leaving out a different slice each time, and selects the best-performing model across the iterations. In practice a second hold-out set is sometimes used for testing after validation, but I'm going to skip that for now and just "test" on future primary results.
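To make the mechanics concrete, here's roughly what k-fold cross validation does under the hood. This is a hand-rolled sketch, not what caret actually runs internally; it uses a simple linear model on two of the predictors just to illustrate the fold loop.

```r
library(caret)

set.seed(42)
# createFolds returns 10 lists of held-out row indices
folds <- createFolds(Hillary$fraction_votes, k = 10)

# For each fold: train on the other 9 folds, evaluate on the held-out 10%
cv_rmse <- sapply(folds, function(idx) {
  fit  <- lm(fraction_votes ~ median_income + college, data = Hillary[-idx, ])
  pred <- predict(fit, newdata = Hillary[idx, ])
  sqrt(mean((Hillary[idx, ]$fraction_votes - pred)^2))
})

mean(cv_rmse)  # average out-of-fold error across the 10 folds
```

caret's `train()` does all of this for you, and additionally repeats it for each candidate value of the tuning parameters.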
```r
rf_model_hillary <- train(
  fraction_votes ~ median_income + under18 + over65 + black + asian +
    latino + white + foreign + college + homeowners + pop_density,
  data = Hillary,    # the full data set, not a split one
  method = "rf",     # indicates we are creating a random forest model
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross validation
  prox = TRUE,
  allowParallel = TRUE
)
```
If we type `rf_model_hillary` into the R console, we get the following output:
```
Random Forest

171 samples
 60 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 152, 156, 155, 154, 153, 155, ...
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   RMSE SD   Rsquared SD
   2    7.094176  0.8148689  2.194778  0.09526081
   6    6.873771  0.8146887  1.955505  0.08293056
  11    7.019837  0.8075776  1.875310  0.07988646

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 6.
```
The r-squared of the final model is 81.5%, slightly better than our previous model. A quick note on mtry: it's the number of predictors randomly sampled as split candidates at each node, and it's tuneable in the randomForest package. By default it's set to one third of the predictors for regression (the square root for classification), which is considered very reasonable; caret goes a step further and searches for the best mtry value, which is what you see in the output.
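If you'd rather control that search yourself, caret accepts an explicit tuning grid via `tuneGrid`. A quick sketch (the mtry values here are arbitrary choices, not a recommendation):

```r
# Search specific mtry values instead of letting caret pick its own grid
tune_grid <- expand.grid(mtry = c(2, 4, 6, 8, 11))

rf_tuned <- train(
  fraction_votes ~ median_income + under18 + over65 + black + asian +
    latino + white + foreign + college + homeowners + pop_density,
  data = Hillary,
  method = "rf",
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 10)
)

rf_tuned$bestTune  # the winning mtry value
```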
```r
# newdata can be a matrix or data frame and DOES NOT have to be the same
# data you built the model on (which is the case here). It does, however,
# have to contain all of the X variables used in the model or you'll get
# an error. Also, because the previous code rescaled median income by
# dividing by 1,000 (e.g. 55 instead of 55,000), the same rescaling is
# applied to newdata as well.
predicted_hillary <- predict(rf_model_hillary, newdata = county)
```
The county table has demographic data for all counties in the United States, and additionally has rollups by state. My original thought was to predict county by county and then do my own rollup to get statewide results; however, I realized that we are missing one critical piece of data to do this. We know the population of every county, but not how many registered democrats and republicans there are, and thus what percentage of each county is democrat or republican.
To see why this is an issue, imagine a county with 98 republicans and 2 democrats. You could win 50% of the democrat vote there, but that's only 1 vote out of the 100 people in the county. This is a data limitation: we know population by county, and the percentage of the vote each party's candidates received, but not the partisan breakout of each county's population.
So the alternate solution is to pretend the state rollups are giant counties, and use the county-level model to predict the state level. Just to see how close we are at the county level, I'll also make a prediction for Cook County, IL (where I used to live) in addition to predicting Illinois overall.
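In code, that "state as a giant county" trick is just a matter of which rows you hand to `predict()`. A sketch, where the column names `area_name` and `state_abbreviation` are assumptions about how the county table is laid out:

```r
# Treat the Illinois state rollup row as one giant county
illinois <- subset(county, area_name == "Illinois")
cook     <- subset(county, area_name == "Cook County" &
                           state_abbreviation == "IL")

predict(rf_model_hillary, newdata = illinois)  # statewide estimate
predict(rf_model_hillary, newdata = cook)      # county-level sanity check
```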
Because the democratic race is functionally binary, we can create one model for Hillary and compute Sanders as 1 − Hillary. The republican side is a little more complex: we have to create 4 separate models, one per candidate. The code is largely the same, so I won't repeat it.
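Rather than copy-pasting the `train()` call four times, the republican models can be built in a loop. A sketch, assuming four data frames named `Trump`, `Cruz`, `Rubio`, and `Kasich` with the same columns as the Hillary table:

```r
# One model per republican candidate; formula and settings mirror the
# Hillary model above.
rep_data <- list(Trump = Trump, Cruz = Cruz, Rubio = Rubio, Kasich = Kasich)

rep_models <- lapply(rep_data, function(d) {
  train(
    fraction_votes ~ median_income + under18 + over65 + black + asian +
      latino + white + foreign + college + homeowners + pop_density,
    data = d,
    method = "rf",
    trControl = trainControl(method = "cv", number = 10),
    prox = TRUE,
    allowParallel = TRUE
  )
})

# Predict all four candidates at once; one column per candidate
rep_predictions <- sapply(rep_models, predict, newdata = county)
```

Note that unlike the democratic race, these four predictions aren't constrained to sum to 100%, so they're best read as independent estimates.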
The moment you've been waiting for:
These are my numbers, with Nate Silver's predictions for reference. Some observations:
- It's a slightly boring result that Hillary and Trump are likely going to win pretty much everything, but that's just the reality.
- The republican race is a little harder to predict than the democratic one, because there are still 4 candidates, and as we've seen previously, there is more nuance to why people choose one over another. By contrast, the democratic race is a binary choice and the candidates are pretty polarizing.
- It's important to point out that the model we've created only has two inputs: 1) past primary performance and 2) demographics, which is why it can't predict a likely Kasich victory in Ohio. Nate Silver uses polling data, which captures things like sentiment and recent momentum, as well as things that just aren't in demographic data, such as the fact that John Kasich is the governor of Ohio. But demographics clearly play a large enough role that we can still come up with a baseline to layer expertise and judgement on top of.
I'm pretty pleased with the results overall. Nate Silver is highly respected for his analysis, and it's great that some relatively basic machine learning can get us into the same ballpark in many cases. After this batch of primaries, the next upcoming races are on March 22nd (Arizona, Idaho, Utah). I'll examine the accuracy of these predictions against the actual results, make any refinements to the model, and then come up with predictions for the next batch of primaries.