US Democratic and Republican Primaries - Part II: Random Forests


What do you think the correlation is between having participated in the WWE and winning primaries? Because it's looking pretty good with our sample size of 1, Donald Trump.

Because of the historic upset in Michigan, I want to focus a bit more on the Democratic race. Continuing from the previous post, I'm going to create some predictive models using random forests. Random forests are one of the main concepts I wanted to learn more about. I recently did some work creating regressions to predict future consumption for forecasting purposes, and I ran into a lot of the limitations of linear regression (especially around overfitting and poor out-of-sample prediction). In contrast, random forest is a powerful machine learning algorithm that 1) does a very good job at out-of-sample prediction, 2) tends to do quite well in Kaggle competitions, and 3) is not too hard to execute in R. The catch is that it's difficult to understand intuitively, which makes it something of a powerful black box.

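The core concept is that it grows many decision trees (the kind I used in my previous post), each on a random sample of the data and using a random subset of the variables at each split, and then averages across them (or takes a majority vote, for classification) to get a better result than any single tree.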
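To make the averaging idea concrete, here is a rough sketch of bagging, the simpler idea that random forests build on. It assumes the rpart package and R's built-in mtcars data (neither comes from this post's data set), and it leaves out the second ingredient of a true random forest, which is considering only a random subset of variables at each split.

```r
library(rpart)  # assumption: rpart is used here purely for illustration

set.seed(42)
n_trees <- 50
n <- nrow(mtcars)

# Grow each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(n, n, replace = TRUE)
  rpart(mpg ~ ., data = mtcars[boot_rows, ])
})

# One column of predictions per tree, one row per car
pred_matrix <- sapply(trees, function(tree) predict(tree, newdata = mtcars))

# The ensemble prediction is the average across all trees
bagged_pred <- rowMeans(pred_matrix)
head(bagged_pred)
```

Any single tree in the list is a fairly coarse predictor; the averaged prediction is smoother because the quirks of individual bootstrap samples tend to cancel out.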

Candidate Visualization

Before diving into RF, let's look at a few scatterplots of the percentage of the vote each candidate received by county against some key demographic factors, to confirm some intuitions from our previous trees.

Hillary Clinton

#load ggplot2, which provides qplot() and geom_smooth()
library(ggplot2)

#isolate Hillary Data
Hillary <- subset(relevant, candidate == "Hillary Clinton")

#plot Hillary Data
qplot(x = median_income, y = fraction_votes, data = Hillary, ylab = "fraction of votes") + geom_smooth(method='lm',formula=y~x)

qplot(x = college, y = fraction_votes, data = Hillary, ylab = "fraction of votes") +geom_smooth(method='lm',formula=y~x)

qplot(x = black, y = fraction_votes, data = Hillary, ylab = "fraction of votes") +geom_smooth(method='lm',formula=y~x)

qplot(x = white, y = fraction_votes, data = Hillary, ylab = "fraction of votes") +geom_smooth(method='lm',formula=y~x)

  • As income increases, Hillary tends to receive a lower percentage of the vote relative to Sanders (at least up to the ~60K range). Above 70K there appear to be some outliers, which suggests the trend is not necessarily linear past a certain income band
  • There's a weaker, but still negative, correlation between the percentage of college degree holders and votes for Hillary
  • The third plot seems to confirm what we saw in the tree analysis: predominantly African American counties favor Hillary

Donald Trump

  • In the Republican race, counties with a larger Black population favor Trump, as do counties with a larger Latino population, but seemingly to a lesser extent
  • There's a slight relationship where a higher concentration of college graduates means less support for Trump, which has been noted before

I think the interesting question is how the African American vote and the less educated/lower income vote would shake out if Hillary and Trump were to meet in a general election.

Random Forests

We can see from some of the scatterplots above that the relationships we are modeling are not always linear. Random forests can handle non-linearity quite well and also have the advantage of being able to handle both classification and regression. I'll build one random forest model to predict which candidate will win a county based on demographics (classification model), and then another random forest to predict the percentage of votes from the same demographic factors, as a continuous variable (regression model).

Split data into test and training sets

The first step is to split the data so we can train the model on one set and then test it on another to measure how successful it is at prediction. This helps guard against overfitting. In an extreme case, imagine a model flexible enough to draw a curve through every point of our training data. We've technically fit the training data perfectly, but this model probably won't predict well outside that sample because we've fit all of the noise as well. While RF is already much more robust to this issue, what we want is a model that performs well on the training set and also does a good job on new data it has never seen before, which is what the held-out test set is reserved for (and what we can refine the model against).

Note that the random forest algorithm already handles bootstrapping: each tree is grown on a random sample of the rows drawn with replacement, which covers roughly 2/3 of the distinct observations, and the remaining 1/3 or so (referred to as the out-of-bag data) is used to estimate prediction error and variable importance. So the algorithm actually produces a statistic called the out-of-bag error estimate, which approximates the out-of-sample predictive power. There are some references to Random Forest not needing a separate test and training set at all for this reason, but from reading Stack Overflow, Quora, and Kaggle, and from speaking with practitioners, it still seems standard to use a test/training set for random forests, so I will follow that convention.
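The "roughly 2/3 in, 1/3 out" split is easy to check by simulation. The sketch below uses only base R and made-up row numbers (no election data): it draws bootstrap samples and measures what share of the distinct rows ends up in each one.

```r
set.seed(1)
n <- 10000  # hypothetical number of rows

# For each simulated tree, draw n rows with replacement and
# record the share of distinct rows that made it into the sample
in_bag_fraction <- replicate(200, {
  boot_rows <- sample(n, n, replace = TRUE)
  length(unique(boot_rows)) / n
})

mean_in_bag <- mean(in_bag_fraction)   # share of rows seen by a typical tree
mean_oob <- 1 - mean_in_bag            # share left out-of-bag

# Theoretical value: 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632)
theory_in_bag <- 1 - (1 - 1 / n)^n

c(simulated = mean_in_bag, theory = theory_in_bag, out_of_bag = mean_oob)
```

So the out-of-bag share is closer to 37% than to exactly one third, but "about a third" is a fair rule of thumb.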

To do this we essentially want to randomly assign data points to training vs test, and this can be accomplished through a simple function:

splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  #randomly pick half of the row numbers for the training set
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  #return both halves as a named list
  list(trainset=trainset, testset=testset)
}
#note the curly brackets, R allows you to write functions and then call them similarly to Javascript or another scripting language

We then call the function we've created on our dataset

splits <- splitdf(Hillary, seed=808)
training <- splits$trainset
test <- splits$testset
#quick note if you haven't seen seed before: set.seed() fixes the starting point of R's random number generator, and you can set the seed to any number. What's useful is that if you use the same seed number again it will always generate the same random number sequence, so your results are still randomized but reproducible
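As a quick sanity check of both the split and the seed behavior, here is a self-contained sketch on a made-up 11-row data frame (the `toy` data and its columns are hypothetical, and the function is restated so the block runs on its own):

```r
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  trainindex <- sample(index, trunc(length(index) / 2))
  list(trainset = dataframe[trainindex, , drop = FALSE],
       testset  = dataframe[-trainindex, , drop = FALSE])
}

toy <- data.frame(id = 1:11, x = (1:11) * 10)  # hypothetical data

first  <- splitdf(toy, seed = 808)
second <- splitdf(toy, seed = 808)

# Same seed, same split
same_split <- identical(first$trainset, second$trainset)

# With an odd row count, the extra row goes to the test set
n_train <- nrow(first$trainset)  # trunc(11 / 2) = 5
n_test  <- nrow(first$testset)   # 11 - 5 = 6

# No row appears in both halves
overlap <- intersect(first$trainset$id, first$testset$id)
```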

Initial Regression Random Forest

Create the random forest model and look at the relative importance of variables:

#load the randomForest package
library(randomForest)

rfmodel <- randomForest(fraction_votes ~ median_income + under18 + over65 + black + asian + latino + white + foreign + college + homeowners + pop_density, data = training, importance=TRUE)

note: you have to add 'importance=TRUE' for randomForest to compute the permutation-based importance measure that varImpPlot(type=1) displays

varImpPlot(rfmodel, type=1)

  • We again see that race and income are critical factors. This particular chart doesn't tell us the direction of the relationship but if we think about the trees and scatterplots we've produced previously, this is another lens on how important the African American vote is to Hillary
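One way to get at direction is a partial dependence curve, which the randomForest package offers through partialPlot(). The sketch below is only an illustration under stated assumptions: it uses R's built-in mtcars data as a stand-in for the county data (mpg plays the role of fraction_votes, wt the role of a demographic variable), not the model from this post.

```r
library(randomForest)

set.seed(808)
# mtcars is a stand-in data set; it is not the election data
rf_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 200, importance = TRUE)

# Numeric version of what varImpPlot(type = 1) draws (permutation importance)
imp <- importance(rf_demo, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# Partial dependence of the prediction on one variable.
# plot = FALSE returns the curve instead of drawing it.
pd <- partialPlot(rf_demo, pred.data = mtcars, x.var = "wt", plot = FALSE)

# pd$x holds the variable's values, pd$y the average prediction at each one;
# a falling pd$y means the model predicts lower values as the variable rises
head(data.frame(wt = pd$x, avg_prediction = pd$y))
```

Applied to the model above, the same call with x.var set to a column such as "black" would show whether predicted vote share rises or falls with that variable.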

So far we've looked at what's happened in the past and tried to understand why. Now let's shift to using that information to predict new results. I'm going to use the predict function, which takes the model we have created and generates predictions from new inputs.

predicted <- predict(rfmodel, newdata=test)

note: we are asking R to use our existing model we trained but use new data inputs from our test set in order to predict the Y variable, which is why the inputs to predict() are rfmodel (the model), and test (the new data set). 

We can then compare it to the actual fraction_votes from the test data that we held out and see how well our model performed

actual <- test$fraction_votes

rsq <- 1-sum((actual-predicted)^2)/sum((actual-mean(actual))^2)

note: Because this is regression on a continuous variable, we're using R-squared as the measure of fit: one minus the sum of squared errors (SSE) between actual and predicted, divided by the total sum of squares of actual around its mean

I got an rsq value of 79.4%, suggesting the model fit on the training set explains about 80% of the variation in the test set. Not bad for the first run.
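To see how the pieces of that formula fit together, here is a small worked example on made-up numbers (the vectors are hypothetical, not results from the model). It wraps the same R-squared calculation in a helper and adds RMSE, which is in the same units as the vote fraction and can be easier to interpret.

```r
regression_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)        # error left after prediction
  sst <- sum((actual - mean(actual))^2)     # total variation around the mean
  c(rsq  = 1 - sse / sst,
    rmse = sqrt(mean((actual - predicted)^2)))
}

# Hypothetical vote fractions for four counties
actual_demo    <- c(0.2, 0.4, 0.6, 0.8)
predicted_demo <- c(0.3, 0.4, 0.5, 0.8)

metrics <- regression_metrics(actual_demo, predicted_demo)
metrics
# Here SSE = 0.02 and SST = 0.20, so rsq = 1 - 0.02 / 0.20 = 0.9
```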

Initial Classification Random Forest

In this case, I want to use a slightly different data table because, instead of just looking at Hillary and predicting her % of the vote by county, I want to classify each county as going to Hillary or Bernie. I'll use our previously created subset of Democratic counties that contains just the winner (Bernie or Hillary), and we will use classification to predict whether a county will go to Bernie or Hillary.

splits <- splitdf(democrat_winner, seed=808)
training <- splits$trainset
test <- splits$testset
#split into test vs training as before using our splitdf() function

Create the classification rf model:

rfmodel_class <- randomForest(candidate ~ median_income + under18 + over65 + black + asian + latino + white + foreign + college + homeowners + pop_density, data = training, importance=TRUE)

varImpPlot(rfmodel_class, type=1)

note: The code is very similar to the previous code we used for the regression rf; we're just feeding it different data and using candidate as the Y variable. randomForest runs classification because candidate is a factor variable (Bernie or Hillary) as opposed to a continuous numeric variable. If candidate is stored as plain text instead, convert it with factor() first
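The following sketch shows that switch on R's built-in iris data, used purely as a stand-in (none of this is the election data). The same randomForest() call produces a classification forest for a factor response and a regression forest for a numeric one, and the fitted object records which it chose along with its out-of-bag error.

```r
library(randomForest)

set.seed(808)

# Factor response -> classification forest
rf_class <- randomForest(Species ~ ., data = iris, ntree = 100)

# Numeric response -> regression forest
rf_reg <- randomForest(Sepal.Length ~ ., data = iris, ntree = 100)

rf_class$type  # "classification"
rf_reg$type    # "regression"

# Out-of-bag error rate after the final tree (classification only)
oob_error <- rf_class$err.rate[rf_class$ntree, "OOB"]

# When subsetting to two classes, drop the unused factor level first,
# since a factor with an empty class can make the fit fail
two_species <- droplevels(subset(iris, Species != "setosa"))
rf_two <- randomForest(Species ~ ., data = two_species, ntree = 100)
levels(two_species$Species)
```

The droplevels() step is worth keeping in mind for a winner-only subset like democrat_winner, in case the candidate factor still carries levels for candidates who no longer appear in it.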

  • We again see that race and income are critical factors for classification
  • It's hard for me to explain why under18 came up as the first variable. We did see it in our previous trees but as the third node. I will have to tune my model a bit and do a bit of background reading to think about why this could be the case

Let's test our classification model against the test set. I'm approaching this test a little differently since the result will be a predicted winning candidate's name for each county. I think it's easiest and most intuitive to simply export the results to a CSV file, open it in Excel, and examine the resulting table to see how many counties we correctly predicted.

#Predict the winner by county and extract to a table

test$predicted <- predict(rfmodel_class, newdata=test)

check <- test[,c("predicted", "candidate", "fips")]
write.table(check, file = "demcheck.csv", col.names = TRUE, row.names = FALSE, sep = ",")

What the extract looks like:

I got an accuracy of 70.5%; that is, across all the counties, my predicted candidate matched the actual winner in 70.5% of cases. I realize that not all counties are equal in size/importance. This is purely a measure of whether the model classified each county correctly or not. We beat a 50/50 coin flip!
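The same check can be done directly in R rather than in a spreadsheet. The sketch below uses a made-up five-county table with the same column names as the extract (the rows, and the "Bernie Sanders" label, are assumptions for illustration). It also computes a stricter baseline than a coin flip: the accuracy you would get by always guessing whichever candidate won the most counties.

```r
# Hypothetical stand-in for the `check` table written to demcheck.csv
check_demo <- data.frame(
  predicted = c("Hillary Clinton", "Hillary Clinton", "Bernie Sanders",
                "Bernie Sanders", "Hillary Clinton"),
  candidate = c("Hillary Clinton", "Bernie Sanders", "Bernie Sanders",
                "Bernie Sanders", "Hillary Clinton"),
  stringsAsFactors = FALSE
)

# Confusion matrix: rows are predictions, columns are actual winners
confusion <- table(predicted = check_demo$predicted,
                   actual = check_demo$candidate)

# Share of counties where the prediction matched the actual winner
accuracy <- mean(check_demo$predicted == check_demo$candidate)

# Baseline: always predict the most frequent actual winner
baseline <- max(table(check_demo$candidate)) / nrow(check_demo)

confusion
c(accuracy = accuracy, baseline = baseline)
```

On this toy table the accuracy is 4 out of 5 and the majority-class baseline is 3 out of 5. If one candidate won well over half of the real test counties, that baseline is the number to beat rather than 50%.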


Now that I've built some initial predictive models it can get really exciting. This is the schedule of completed and upcoming primaries. I'm going to further tune my model using some of the states that weren't in the initial data set but now have results, as well as generate predictions for the remaining states.

I'm not sure exactly how I'll handle this yet. I built the model on counties, but I may not be able to easily access future county-level results (without doing a lot of my own scraping/cleaning). I could either predict by county, roll the predictions up, and check my results by state, or just pretend that a state is a giant county and predict it directly.
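For the first option, the roll-up itself is straightforward once each county has a predicted vote fraction and some weight. The sketch below is only one possible approach on made-up numbers: the `county_pred` table, its column names, and the use of an expected vote count as the weight are all assumptions (for a primary that hasn't happened yet, the weight would itself have to be estimated, for example from population or past turnout).

```r
# Hypothetical county-level predictions for two states
county_pred <- data.frame(
  state = c("A", "A", "B", "B"),
  predicted_fraction = c(0.6, 0.4, 0.3, 0.5),  # predicted share for one candidate
  total_votes = c(100, 300, 200, 200)          # assumed weight per county
)

# Convert each county's predicted share into predicted votes
county_pred$predicted_votes <-
  county_pred$predicted_fraction * county_pred$total_votes

# Sum votes and weights within each state
state_totals <- aggregate(cbind(predicted_votes, total_votes) ~ state,
                          data = county_pred, FUN = sum)

# Vote-weighted state-level share
state_totals$predicted_fraction <-
  state_totals$predicted_votes / state_totals$total_votes

state_totals
```

Weighting matters here: state A's two counties average 0.5 unweighted, but the larger county leans the other way, so the weighted share comes out at 0.45.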


Other References

Good academic background on Random Forests and their advantages

Reference code I used for splitting into test/training