This US election is probably the most interesting that many of us have seen in a while. Kaggle recently released a data set on the Democratic and Republican primaries, and I thought it would be a good chance to practice some data analytics on it. Personal disclaimer: I'm a libertarian, but I would vote Hillary in this election.
Data Structure and Cleaning:
The link to Kaggle has the full description of the data, but at a high level there are two CSV files: one contains primary results by county for Iowa, New Hampshire, Nevada, and South Carolina, showing votes and % of votes by candidate/party, and the other contains demographic data for those counties.
I did part of the data cleaning in Excel instead of using R code, because I'm a consultant and it's just faster for me to join those two tables in Excel, using the county ID as the unique key. Note: I've gone back and re-implemented the cleaning in R natively here. I also did a quick ranking to show which candidate had the highest percentage of votes for their party in each county, and created a binary variable to tag them as winners. The processed Excel files and full R scripts are available on GitHub.
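For readers who want to see what the join and the winner flag look like in base R, here is a minimal sketch. It is not the script from the repository: the two small data frames are made-up stand-ins for the Kaggle files, and the column names (`fips`, `fraction_votes`) are assumptions chosen for illustration.

```r
# Hypothetical stand-ins for the two Kaggle files (values are made up)
results <- data.frame(
  fips           = c(1, 1, 1, 1, 2, 2, 2, 2),
  party          = rep(c("Democrat", "Democrat", "Republican", "Republican"), 2),
  candidate      = rep(c("Hillary Clinton", "Bernie Sanders",
                         "Donald Trump", "Ted Cruz"), 2),
  fraction_votes = c(0.60, 0.40, 0.45, 0.30, 0.35, 0.65, 0.25, 0.50),
  stringsAsFactors = FALSE
)
demographics <- data.frame(
  fips          = c(1, 2),
  median_income = c(42000, 58000),
  college       = c(18.5, 31.2)
)

# Join the two tables using the county ID as the key
data <- merge(results, demographics, by = "fips")

# Rank candidates within each county and party (rank 1 = highest vote share)
data$rank <- ave(-data$fraction_votes, data$fips, data$party,
                 FUN = function(x) rank(x, ties.method = "min"))

# Binary flag: 1 for the candidate who won their party in that county
data$winner <- as.integer(data$rank == 1)
```

Each county ends up with one row per candidate, carrying that county's demographics plus a `winner` column that later steps can filter on.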
After reading the data into R, I first wanted to focus on the candidates that are most relevant today, so I used the subset function to filter for Hillary, Sanders, Trump, and Cruz:
#filter out less relevant candidates
relevant <- subset(data, candidate == "Hillary Clinton" |
                         candidate == "Bernie Sanders" |
                         candidate == "Donald Trump" |
                         candidate == "Ted Cruz")
In the next block I am doing two steps: 1. separating Democrats and Republicans into two sets to analyze separately, and 2. condensing the data to just the candidate that won each county (i.e. right now there are multiple entries for each county, one line per candidate, and I am condensing it to just the winner based on my rank variable).
#separate by party
democrat_winner <- subset(relevant, winner == 1 & party == "Democrat")
republican_winner <- subset(relevant, winner == 1 & party == "Republican")
#this is an example of using the & (and) operator where the previous example used | (or)
The first visualization I want to do is on education and income, since I believe these variables do a good job of capturing a lot of other demographic information that is collinear with income/education. Income and education have also driven much of the rhetoric in the election this year, so I think it's interesting to see how these demographics drive voting.
library(ggplot2)
#plot counties won by candidates by education/income
ggplot(democrat_winner, aes(x=college, y=median_income, size=votes)) +
  geom_point(aes(color = candidate), alpha = 0.5)
ggplot(republican_winner, aes(x=college, y=median_income, size=votes)) +
  geom_point(aes(color = candidate), alpha = 0.5)
Democrats - Counties won by Education/Income
- Quick note on how to read this chart: each dot is a county, where size is the number of votes and the color coding is for the WINNING candidate
- I think it's interesting that the lower left quadrant (least educated, lowest income) seems to support Hillary instead of Sanders, whose socialist platform you would expect to resonate here
- Interestingly, the larger, higher-median-income counties seem to go more heavily towards Sanders. We're looking at medians, though, so there is probably a point above the max of this dataset where people have high enough income to gravitate away from Sanders, and we're just seeing the middle class "sweet spot" for him here
Republicans - Counties won by Education/Income (same chart)
- Trump is, not so surprisingly, destroying Ted Cruz, taking what appears to be most of the larger counties
- Trump has surprisingly broad appeal across a wide range of income levels as well as education levels. I find this interesting because just from news clips you would think that his base is... not so broad, but I was kind of surprised to learn that Donald Trump really appeals to minorities
Trees: Looking Deeper into Demographic Drivers
Trees are a useful concept for quickly identifying key drivers. I'll leave the theoretical explanation to others, but in a nutshell trees help us quickly see which variables have the largest impact on either classification or regression. In this case we'll use a classification tree to see which demographic variables drove counties toward each candidate.
Democrats - classification tree
library(tree)
#Full tree on demographic factors
tree <- tree(candidate ~ median_income + under18 + over65 + black + asian + latino +
               white + foreign + college + homeowners + pop_density,
             data=democrat_winner, mincut=1, mindev=0.005)
plot(tree); text(tree)
#note I picked what I thought were the most interesting variables out of a larger pool
This is just an example of the full tree, which is not really useful before we prune it. It is based on the CART algorithm, which recursively splits the data in two on some predictor x so that each half is more uniform in the outcome y, then takes each partition and tries to split it again.
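To make the idea of a single split concrete, here is a minimal sketch of how one split point can be chosen for one predictor. It uses Gini impurity as the criterion, which is one common choice; the tree package defaults to a deviance-based criterion, so this illustrates the idea rather than reproducing its exact numbers. The toy vectors and helper function names are made up for illustration.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two groups produced by splitting x at a threshold
split_impurity <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Toy data (made up): % black population and the winning candidate per county
pct_black <- c(2, 4, 6, 8, 15, 25, 40, 55)
won_by    <- c("Sanders", "Sanders", "Sanders", "Hillary",
               "Hillary", "Hillary", "Hillary", "Hillary")

# Candidate thresholds: midpoints between consecutive sorted values
vals       <- sort(unique(pct_black))
thresholds <- (head(vals, -1) + tail(vals, -1)) / 2

# Evaluate every candidate threshold and keep the one with the lowest impurity
impurities     <- sapply(thresholds, split_impurity, x = pct_black, y = won_by)
best_threshold <- thresholds[which.min(impurities)]
```

A full tree repeats this search over every predictor, keeps the best split overall, and then applies the same procedure to each resulting subset.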
#prune tree nodes
cut <- prune.tree(tree, best=4)
plot(cut)
text(cut)
#note the best practice is to use cross validation to find the best size to prune to,
#I just did 4 because my cross validation gave me an overly shallow tree.
#My hypothesis is that because we have to separate the data into Democrats vs Republicans
#we are already reducing the amount of variation that can be picked up by a tree
#(e.g. a person who votes for Trump is already demographically fairly close to someone
#who would vote for Cruz, versus comparing a democrat and a republican outright)
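The cross-validation step mentioned in the note can be sketched with `cv.tree` from the same tree package. This is an illustration on R's built-in `iris` data rather than the primary data, so the numbers say nothing about the election; with the post's data, the fitted tree from the earlier step would take the place of `fit`.

```r
library(tree)

set.seed(1)  # cross-validation folds are assigned at random

# Stand-in classification tree on a built-in data set
fit <- tree(Species ~ ., data = iris)

# Cross-validate the pruning sequence, scoring by misclassification count
cv <- cv.tree(fit, FUN = prune.misclass)

# Pick the smallest tree size that achieves the lowest cross-validated error
best_size <- min(cv$size[cv$dev == min(cv$dev)])

# Prune the original tree down to that number of leaves
pruned <- prune.misclass(fit, best = best_size)
```

If `best_size` comes back as 1 or 2, that matches the "overly shallow tree" situation described in the note, and choosing a size by hand is a judgment call rather than something the cross-validation supports.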
- The way to read this: at each fork, the left branch means the condition shown is true, while the right branch means it is false. At the top fork, if "black < 10.3%" is false (that is, more than 10.3% of the county is black), then the county is likely to vote for Hillary. In other words, race is statistically the number one driver of whether a county will vote for Hillary or Sanders, and a higher concentration of African Americans favors Hillary
- The next node is income. As we saw in the earlier plots, less than ~48K in median income favors Hillary as well
- The final node is age, measured as the % of the population under 18. (Unfortunately there wasn't a good variable to capture millennials, who seem like an interesting demographic group to explore with regard to Sanders.) If less than 21.7% of the population is younger than 18, the county is likely to favor Sanders. This is a little hard to interpret, but I will hypothesize that a higher % of the population under 18 means more families, and families perhaps support Hillary
- The other important concept to keep in mind about trees is that each time the data gets split, the resulting nodes hold a subset of the data. For example, once we have split the data into counties that are <10% or >10% African American, the median_income node further splits only the subset that is <10% African American, such that predominantly non-black counties with low income favor Hillary
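Following one path down the tree is the same as applying nested filters, which can be sketched in base R. The thresholds (10.3% and roughly 48K) come from the tree described above; the county rows themselves are made up for illustration.

```r
# Toy county-level data (made up for illustration only)
counties <- data.frame(
  black         = c(3, 5, 8, 2, 20, 35, 6, 4),
  median_income = c(40000, 45000, 52000, 60000, 38000, 41000, 47000, 55000),
  candidate     = c("Hillary Clinton", "Hillary Clinton", "Bernie Sanders",
                    "Bernie Sanders", "Hillary Clinton", "Hillary Clinton",
                    "Hillary Clinton", "Bernie Sanders"),
  stringsAsFactors = FALSE
)

# Top fork: split every county on the share of black population
low_black  <- subset(counties, black <  10.3)
high_black <- subset(counties, black >= 10.3)

# Second fork: income is evaluated only within the low_black subset
low_black_low_income <- subset(low_black, median_income < 48000)

# Winners among predominantly non-black, lower-income counties
table(low_black_low_income$candidate)
```

The income split never touches the counties in `high_black`, which is why the income finding applies only to predominantly non-black counties.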
Republicans - Classification Tree
- Again, race seems to be a strong driver, with counties that have a larger African American population supporting Trump more heavily
- Where this tree differs from the Democratic side is that income does not come up while population density does, suggesting that the Republican party is divided more strongly between rural and urban counties, as well as along education
- Surprisingly, when the Latino population is higher, Donald Trump is favored, despite his comments on immigration. This may not be representative of a national election, but Trump did win the Hispanic vote in Nevada, which is in this data set
Looking at the data gives us a bit more nuance than the broad strokes we get from the news. At first glance, the breadth of demographic support seems to favor a Trump vs. Hillary race. I'm not sure that we can use this data set to compare Hillary and Trump directly, though, since the purpose of the primaries is to pick the candidate for each party: voters are presented a choice of X candidates from one party, and choosing between Hillary and Trump is a fundamentally different question. Even Mitt Romney is telling Republicans to vote strategically.
In my next post I will extend the tree concept by implementing a classification-based random forest model. Additionally, I will do a deeper dive into Hillary and Trump specifically, and use training and test samples to validate a model that predicts whether a specific candidate will win a county given demographic inputs.
Jump to PART 2