US Democratic and Republican Primaries - Part IV Results and Adjustments

Introduction

Last Tuesday was tough on the Sanders fans out there, but we roll on to what is likely a Hillary vs. Trump showdown. This is a shorter post, just to keep up with the live primary schedule. At this point more candidates have dropped out, and our dataset from just the initial four primaries might be getting a little stale. I just want to recap how our predictions did, make some minor adjustments, and then predict some more upcoming races.

Random Forest Prediction Accuracy - March 15 Primaries

  • The major themes are, I think, consistent with what we expected. Trump got a higher percentage than we predicted because several Republicans dropped out of the race while Trump simultaneously started surging (see below for how I adjust for this going forward). And as expected, Kasich won Ohio, which was not predictable via a model built on demographics alone.
  • More unexpected was that several Democratic races were tighter than both our model and the polling data predicted. While we were on the money for Florida and Ohio, Illinois and Missouri were very tight races.
  • I think one immediate conclusion to draw is that the model could be improved by incorporating both demographic data and recent polling data, which can capture more intangibles like momentum and endorsements, as well as idiosyncrasies like Kasich in Ohio.

Adjustments to the Data

With Marco Rubio dropping out, we should adjust our data at this point. It's complicated to try to dissect how his votes will be redistributed, so I think the simplest adjustment we can make is to recalculate the percentage-of-votes variable so that it only takes the remaining candidates into account. For example, let's say that in our previous data for a given race Trump got 40 votes, Cruz got 40 votes, and Rubio got 20 votes. I would adjust Trump's and Cruz's percentages of the vote to 50/50 and train our model on these new figures. This isn't perfect, but it's probably more straightforward to use the data as if Rubio weren't an option than to make a lot of adjustments for precisely how Rubio's vote would be distributed. When we make predictions with this retrained model, the percentages for the remaining candidates will be pushed up rather than biased lower, as they were in the previous predictions. This issue is more prevalent on the Republican side, since they started with a wider, more fragmented field, while Democrats have mostly voted for Hillary and Sanders anyway, but I'll adjust the Democratic percentages of the vote as well.
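The Trump/Cruz/Rubio example above can be sketched in a few lines of Python. This is only an illustration of the renormalization idea: the function name `renormalize_vote_shares` and the dictionary layout are hypothetical placeholders, not the actual code or data format behind the model.

```python
def renormalize_vote_shares(votes, dropped):
    """Recompute vote shares using only candidates still in the race.

    votes:   dict mapping candidate name -> raw vote count for one race
             (hypothetical layout, chosen for illustration)
    dropped: set of candidate names who have left the race
    """
    # Keep only the candidates who are still running.
    remaining = {name: count for name, count in votes.items()
                 if name not in dropped}

    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no votes left for the remaining candidates")

    # Each share is now a fraction of the votes cast for remaining candidates.
    return {name: count / total for name, count in remaining.items()}


# The example from the text: Trump 40, Cruz 40, Rubio 20, with Rubio out.
race = {"Trump": 40, "Cruz": 40, "Rubio": 20}
shares = renormalize_vote_shares(race, dropped={"Rubio"})
print(shares)  # {'Trump': 0.5, 'Cruz': 0.5}
```

Note that this treats Rubio's votes as if they had never been cast, which is the same simplifying assumption described above: it says nothing about where those voters would actually go.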

Upcoming Races

Arizona and Utah vote on March 22nd, and there are several other upcoming Democratic caucuses and primaries. As of right now, FiveThirtyEight has no prediction for some of these races because there isn't sufficient recent polling data, which is an opportunity for us to add some value by generating our own predictions!

*averaged polling data

  • Predictions are pretty consistent with FiveThirtyEight's polls and predictions. Given how close the Democratic races were in Illinois and Missouri, Idaho is virtually a coin flip and could very well go to Sanders, in addition to Utah.
  • They really don't like Trump in Mitt Romney's home state, so it seems pretty safe to predict Cruz there. Meanwhile, Trump probably takes Arizona with his stance on immigration.
  • There's not much data out there on Alaska, Hawaii, and Washington state (which vote on March 26th), but the demographics are actually pretty favorable to Sanders.

Conclusion

Next, I'll make a few more predictions for upcoming East Coast primaries and then see if there are any other tools or methods we might look at to generate additional insights (clustering, for example). Finally, I'll tie everything together with a synthesis and think about what this can tell us about the general election.

PART 1, PART 2, PART 3