With new data for additional states, it's a good time to refactor my data-cleaning scripts. Previously I took a few shortcuts in Excel just to get going faster; now I want to handle essentially all of the cleaning natively in R. I will also retrain the existing random forest models and build new tree-based GBM models with the larger data set we now have available.
Here's a link to the full script; below I'll just highlight some especially nifty functions:
```r
# load dplyr for group_by/summarise
library(dplyr)

# read in data
results <- read.csv("C:/~/primary_results.csv", header = TRUE)
county <- read.csv("C:/~/county_full.csv", header = TRUE)

# do a left join to merge the tables, much like the SQL operation
data <- merge(x = results, y = county, by = "fips", all.x = TRUE)

# create a new percentage-of-vote variable, re-based to just the remaining candidates
# note: this is basically how you do a SUMIFS in R
data <- group_by(data, party, fips)
data.summary <- summarise(data, votes_new_base = sum(votes))
data <- merge(x = data, y = data.summary, by = c("fips", "party"), all.x = TRUE)

# for each entry, calculate a new percentage of votes with a denominator
# based only on the remaining candidates
data$fraction_votes_new <- data$votes / data$votes_new_base
```
Being able to work with databases, whether through SQL or just by conceptually joining different data tables, is pretty important in consulting, while SUMIFS is one of the most useful functions for summarizing data or taking different cuts of it. I wanted to highlight the analogous functions in R, because a lot of day-to-day consulting work in Excel boils down to INDEX-MATCH (which is basically a join) and SUMIFS. This also seems relevant because, as data sets grow, I've increasingly worked with data that pushed the limits of Excel, and it's good to know alternatives that run much faster and can handle much larger scale.
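As a toy illustration of the join + SUMIFS pattern, here is the same idea in base R only (`merge` plus `aggregate`), using small made-up vote tables rather than the real primary data:

```r
# toy vote data: two counties, several candidates (hypothetical numbers)
results <- data.frame(fips = c(1, 1, 1, 2, 2),
                      candidate = c("A", "B", "C", "A", "B"),
                      votes = c(50, 30, 20, 70, 30))

# lookup table to join on, playing the role of INDEX-MATCH in Excel
county <- data.frame(fips = c(1, 2),
                     area_name = c("Adams", "Burke"))

# left join on fips (all.x = TRUE keeps every row of results)
data <- merge(x = results, y = county, by = "fips", all.x = TRUE)

# SUMIFS analogue: total votes by county, then merge the totals back on
totals <- aggregate(votes ~ fips, data = data, FUN = sum)
names(totals)[2] <- "votes_base"
data <- merge(x = data, y = totals, by = "fips", all.x = TRUE)

# per-row share of the county total
data$fraction_votes <- data$votes / data$votes_base
```

The dplyr version in the script does the same thing; the base R version just avoids any package dependency.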
Gradient Boosting Machines
GBM is closely related to random forests: both improve on single tree-based models by combining many trees. Random forests average many independently grown trees, while GBM grows trees sequentially, with each new tree correcting the residual errors of the ensemble so far. They differ in methodology and tradeoffs, summarized below from various sources:
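To make the sequential flavor of boosting concrete, here is a minimal base R sketch of gradient boosting for squared error using hand-rolled regression stumps. This is a toy illustration of the idea, not how the gbm package is actually implemented; all the function names and data here are made up:

```r
# fit the best single-split stump (depth-1 tree) to (x, y) by squared error
fit_stump <- function(x, y) {
  best <- list(sse = Inf)
  for (s in unique(x)) {
    left <- x <= s
    if (!any(left) || all(left)) next
    pl <- mean(y[left]); pr <- mean(y[!left])
    sse <- sum((y[left] - pl)^2) + sum((y[!left] - pr)^2)
    if (sse < best$sse) best <- list(split = s, left = pl, right = pr, sse = sse)
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# gradient boosting for squared error: each stump is fit to the current
# residuals, then added to the ensemble scaled by a shrinkage factor
boost <- function(x, y, n_trees = 50, shrinkage = 0.1) {
  pred <- rep(mean(y), length(x))
  stumps <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    stumps[[i]] <- fit_stump(x, y - pred)  # residuals = negative gradient
    pred <- pred + shrinkage * predict_stump(stumps[[i]], x)
  }
  list(init = mean(y), stumps = stumps, shrinkage = shrinkage, fitted = pred)
}

# toy data: boosting should drive training error well below a constant mean
set.seed(1)
x <- seq(0, 3, length.out = 60)
y <- sin(2 * x) + rnorm(60, sd = 0.1)
fit <- boost(x, y)
```

Contrast this with a random forest, where each tree is grown independently on a bootstrap sample and the predictions are simply averaged at the end.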
I don't think there's a universal answer to "which is better"; rather, we should fit both and see how well each performs via training/validation/test.
Bias Variance Tradeoff
The bias/variance tradeoff can be a confusing concept, but it's important, especially as you move from simple linear regression to more complex machine learning models. Here is a high-level summary, along with a link to a more robust discussion:
- Model error is a combination of noise + bias + variance; noise is by definition largely irreducible, but bias and variance can be traded off and optimized
- Bias is the difference between the expected and true values, i.e. how close to the truth we are on average. It's sometimes pictured as how close the darts land to the center of the target
- Variance is how repeatable the model is, sometimes pictured as how tightly grouped the darts are
- High bias generally implies low variance, and vice versa; an optimal model has both low bias and low variance. Ideally you want to throw all of the darts into the center of the target
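A quick simulation (base R, made-up data) makes the dart-board picture concrete: over many resampled training sets, a straight line is stable but systematically off (high bias, low variance), while a high-degree polynomial chases each sample's noise (low bias, high variance):

```r
set.seed(42)
f <- function(x) sin(2 * x)          # the true signal
x_test <- seq(0.5, 2.5, length.out = 20)
n_sims <- 200

pred_simple <- pred_complex <- matrix(NA, n_sims, length(x_test))
for (i in 1:n_sims) {
  x <- runif(30, 0, 3)
  y <- f(x) + rnorm(30, sd = 0.3)    # noise: the irreducible part of the error
  pred_simple[i, ] <- predict(lm(y ~ x), newdata = data.frame(x = x_test))
  pred_complex[i, ] <- predict(lm(y ~ poly(x, 9)), newdata = data.frame(x = x_test))
}

# bias^2: squared gap between the average prediction and the truth
bias2 <- function(p) mean((colMeans(p) - f(x_test))^2)
# variance: how much predictions scatter around their own average
variance <- function(p) mean(apply(p, 2, var))
```

Running `bias2` and `variance` on the two prediction matrices shows the linear fit dominated by bias and the degree-9 fit dominated by variance.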
Build GBM Model by Candidate
```r
# load caret (which calls the gbm package for method = "gbm")
library(caret)

# set some tuning parameters, mostly equal to the defaults, but force n.trees
# to test up to 5,000 trees in multiples of 50
gbmGrid <- expand.grid(n.trees = (1:100) * 50,
                       interaction.depth = 2:5,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

# mostly the same caret syntax as for random forest; note method = "gbm"
gbm_model_hillary <- train(fraction_votes_new ~ median_income + under18 + over65 +
                             black + asian + latino + white + foreign + college +
                             homeowners + pop_density,
                           data = hillary,
                           method = "gbm",
                           trControl = trainControl(method = "cv", number = 10),
                           verbose = FALSE,
                           tuneGrid = gbmGrid)
```
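Note that `expand.grid` is plain base R: it builds a data frame with every combination of the tuning values, and caret then cross-validates each row. The grid above therefore covers 100 × 4 × 1 × 1 = 400 candidate models, each evaluated with 10-fold CV:

```r
gbmGrid <- expand.grid(n.trees = (1:100) * 50,
                       interaction.depth = 2:5,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
nrow(gbmGrid)  # 400 parameter combinations
```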
Printing gbm_model_hillary returns the R-squared for each combination of parameters tested, as well as the final selected parameters. In this case the optimal model has an R-squared of about 80.0%; recall that random forest returned an R-squared of 80.8%, so they perform similarly, with random forest holding a slight edge. If we had to select just one model (in the usual training/validation/test framework) we would probably go with random forest, but I'm going to record predictions from both models just to have two "opinions" on the yet-unseen future primaries, which I'm implicitly using as the test set.
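For reference, the R-squared caret reports for regression is essentially the squared correlation between held-out predictions and actuals, so the two models' scores are directly comparable. A quick hand computation shows the metric itself; the vectors here are made-up numbers, not the real model output:

```r
# hypothetical actual vote shares and model predictions
actual <- c(0.55, 0.40, 0.62, 0.48, 0.71, 0.35)
pred   <- c(0.53, 0.44, 0.60, 0.50, 0.66, 0.38)

# caret's default regression R-squared: squared correlation of pred vs actual
r2 <- cor(pred, actual)^2
```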
Write out final outputs for new models trained on the new data set:
```r
output <- data.frame(area_name = county$area_name,
                     state = county$state_abbreviation,
                     rf_hillary = rf_pred_hillary,
                     gbm_hillary = gbm_pred_hillary,
                     rf_trump = rf_pred_trump,
                     gbm_trump = gbm_pred_trump,
                     rf_cruz = rf_pred_cruz,
                     gbm_cruz = gbm_pred_cruz)
```
I used Tableau (which has a great free version called Tableau Public) to summarize my results in a better format than a giant table.
For simplicity I am assuming 1 - Hillary% = Bernie% and 1 - (Trump% + Cruz%) = Kasich%.
Democrats - % of Vote by State
Republicans - % of Vote by State
In general it looks like Hillary is going to take all of the Northeast plus California. The Republican race continues to be harder to predict; for example, recent Republican polling shows Kasich doing much better than he previously was, so our models will likely underestimate Kasich at the expense of overestimating Trump. Perhaps Kasich consolidated some Rubio voters looking for a third option besides Trump or Cruz.