Now that the primary process is nearly over, I'm eager to draw some conclusions about the general election. As I've mentioned before, our previous models are not particularly relevant because they are based on how people choose among primary candidates, whereas in the general election voters choose between a Democrat, a Republican, and sometimes a third-party candidate. My idea for solving this is to go back to the 2012 general election between Barack Obama and Mitt Romney and use the county-level data. This isn't perfect, so let's lay out the limitations and assumptions:
- We can abstract away the idiosyncrasies of individual candidates and model parties instead. This doesn't seem like too much of a stretch: despite many Republicans initially declaring "Never Trump", they're now behind him, presumably because party affiliation trumps their dislike of the candidate, or because their threats to withhold support during the primary were simply a negotiating tactic
- County-level demographics have not materially changed. This is mostly because the convenient county demographics data set we have (from the Census Bureau) is from 2014. Strictly speaking we should train the model on a pre-2012 data set and then predict on the 2014 data, but since the census is only conducted periodically, for simplicity let's assume demographics are stable over such a short time frame
- We are probably underestimating the Libertarian Party effect. This year Gary Johnson is polling around 10%, and analysis suggests that he pulls more support from Hillary than from Trump. With the choice between two unappealing major-party candidates, Gary Johnson could have outsize leverage on the outcome of the election without winning it
- On average our predictions will be very similar to the 2012 election results. This is partly a consequence of states voting fairly consistently over time (think Texas going Republican for the past few decades). Where we can add some value is in predicting specific counties, then rolling them up to the state level to look for surprising results, and in estimating the percentage margin of victory for the winning party.
I downloaded the dataset from The Guardian.
Some key steps that were needed to clean this data and make it compatible with our county level demographics dataset:
```r
gen_election <- read.csv("./2012_genelection.csv", header = TRUE)

# rename fields to be consistent with the county data
gen_election <- rename(gen_election, c("FIPS.Code" = "fips",
                                       "State.Postal" = "state_abbreviation"))

# roll up to FIPS level: the county data's lowest level is FIPS, while the
# 2012 data has some cities that need to be rolled up to their FIPS code
gen_election <- group_by(gen_election, fips)
data.summary <- summarise(gen_election,
                          obama_new  = sum(Obama),
                          romney_new = sum(Romney),
                          new_total  = sum(TOTAL.VOTES.CAST))

# recalculate the percentage of the vote won by each candidate at the FIPS level
data.summary$obama_per  <- data.summary$obama_new / data.summary$new_total
data.summary$romney_per <- data.summary$romney_new / data.summary$new_total

# join with the county data table
gen_data <- merge(x = county, y = data.summary, by = "fips", all.x = TRUE)
gen_data <- na.omit(gen_data)
```
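The roll-up-and-recompute step above can be sanity-checked on a toy example in base R (the `fips` codes and vote counts below are made up for illustration):

```r
# Toy version of the FIPS rollup: two rows share fips 1001 (a city reported
# separately from its county) and must be summed before recomputing shares.
toy <- data.frame(
  fips   = c(1001, 1001, 1003),
  Obama  = c(100, 50, 200),
  Romney = c(80, 70, 300)
)
toy$TOTAL.VOTES.CAST <- toy$Obama + toy$Romney  # ignoring third parties here

rolled <- aggregate(cbind(Obama, Romney, TOTAL.VOTES.CAST) ~ fips,
                    data = toy, FUN = sum)
rolled$obama_per <- rolled$Obama / rolled$TOTAL.VOTES.CAST

# fips 1001 collapses to one row with share 150 / 300 = 0.5
```

The key point is that the percentages must be recomputed *after* summing; averaging the per-row percentages directly would weight a small city the same as a large county.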
```r
# create a training set using 2/3 of the data, with the remainder for testing
set.seed(123)
sub      <- sample(nrow(gen_data), floor(nrow(gen_data) * 0.67))
training <- gen_data[sub, ]
test     <- gen_data[-sub, ]

# run random forest model (fit on the training set only, so the test-set
# estimate below isn't contaminated by data the model has already seen)
rf_model_dem <- train(obama_per ~ median_income + under18 + over65 + black +
                        asian + latino + white + foreign + college +
                        homeowners + pop_density + women,
                      data = training,
                      method = "rf",
                      trControl = trainControl(method = "cv", number = 10),
                      prox = TRUE, allowParallel = TRUE)

# Note: the way this model is set up, we're predicting the percentage of the
# vote that the Democrats will capture, based on Obama's performance against
# Romney. This should be sufficient to tell us whether the Democrats will win
# a majority in a state, and thus its electoral vote pool.

# evaluate the model's R^2 on the held-out test set
predicted <- predict(rf_model_dem, newdata = test)
actual    <- test$obama_per
rsq <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
print(rsq)
```
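The R² formula used above can be checked on toy numbers (made up for illustration): perfect predictions give 1, and predicting the overall mean for every row gives 0.

```r
# Hypothetical actual and predicted county vote shares
actual    <- c(0.40, 0.55, 0.60, 0.45)
predicted <- c(0.42, 0.53, 0.58, 0.47)

# 1 minus (residual sum of squares / total sum of squares)
rsq <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
# rsq is 0.936 here: each prediction is off by 0.02, a small fraction
# of the spread in the actuals
```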
0.9216383, i.e. an R² of about 92% on the test set.
```r
# create predictions for all counties and write the output
rf_pred_dem <- predict(rf_model_dem, newdata = county)
output <- data.frame(fips      = county$fips,
                     area_name = county$area_name,
                     state     = county$state_abbreviation,
                     rf_dem    = rf_pred_dem)
```
Model Output and Conclusions
- Differences vs. 2012: Colorado, Iowa, and Ohio went Democratic in 2012, but I'm modeling them as Republican in 2016. Based on this aggregation of polls, Colorado does indeed appear to be favoring Trump, while Iowa is very close
- Despite modeling those few states flipping to red, I still project the Democrats winning 299 electoral votes overall, and thus the election. For reference, Obama won 332 electoral votes to Romney's 206, and 270 are needed to win
- As I mentioned before, Johnson could be a complete wild card. Because the model is based on the 2012 election results, in which Gary Johnson won about 1% of the vote, we could be vastly underestimating his impact given his current polling around 10%. This is especially critical in a close election, where it matters a great deal how many votes he siphons from each major party
- Sidenote: tomorrow is the critical California primary, which I predicted Hillary will win with 59% of the vote. Recent polling has it much closer, so I'm eager to see what happens. Numerically Hillary has essentially clinched the nomination already, but winning California fairly decisively would put the nail in the coffin, while losing even by a small margin means having to deal with Sanders for that much longer