US Democratic and Republican Primaries - Part V K-clustering

Introduction

I had last predicted that Hillary would win Arizona and Sanders would take Utah and Alaska, while Idaho and Hawaii were too hard to call with my model. It turns out that Sanders took everything except Arizona, by much wider margins than I had predicted.

Fortunately, I just became aware that Kaggle has updated the dataset with all of the recent primaries. I will accordingly retrain all of my models to include this new data, which I believe will help address some of the bias towards the past that we've seen (e.g. not capturing recent Trump and Sanders momentum). In addition to retraining the models, I'm going to create another tree-based machine learning model using Gradient Boosting Machines (GBM for short; I will explain it in more detail soon). While I work on that, I want to quickly show how I implemented clustering as a useful dimension reduction tool, with a visual example.

K-clustering

I want to step back and generalize a bit to ask "What states are similar to each other and how?" One quick way to do this is K-means clustering, an algorithm that partitions a dataset into a specified number of clusters. It does this (in a nutshell) by calculating the Euclidean distance between points along as many dimensions as you want. The easiest way to understand clustering is to think about the simplest two-dimensional case: points scattered on a plane, each assigned to whichever cluster center it sits closest to.

The more robust cases just involve more clusters and/or more dimensions.
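If you want to play with that simple case directly, here is a minimal sketch on made-up two-dimensional data (the points are purely illustrative, not anything from the primaries dataset):

#minimal two-dimensional example with made-up data (illustrative only)
set.seed(42)
toy <- data.frame(x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
                  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)))

toykm <- kmeans(toy, centers = 2, nstart = 25)

plot(toy$x, toy$y, col = toykm$cluster, pch = 19) #points colored by assigned cluster
points(toykm$centers, pch = 4, cex = 2)           #cluster centers marked with an X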

The code is relatively simple to implement, but an important first step is to scale the data so that no single metric has outsize influence simply because it's on a very different scale (e.g. if you have population as a raw count but income expressed in thousands).
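To see why, here's a quick toy illustration (the numbers are made up): the Euclidean distances are dominated by the larger-scale variable until we standardize.

#toy example (numbers made up): raw population swamps income-in-thousands
toyscale <- data.frame(population = c(500000, 510000, 900000), income_k = c(45, 75, 46))

dist(toyscale)        #distances driven almost entirely by population
dist(scale(toyscale)) #after scaling, both variables contribute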

#filter the county data down to just the state-level rollups with some data manipulation

clusterstates <- county[which(county$state_abbreviation == ''), ]
clusterstates <- clusterstates[which(clusterstates$area_name != "United States" & clusterstates$area_name != "District Of Columbia"), ]

#filter to just our selected features

keep <- c("median_income", "under18", "over65", "black", "asian", "latino","white","foreign","college","homeowners","pop_density")

clusterdata <- clusterstates[keep]

Now just use the scale function: clusterdata <- scale(clusterdata)
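If you want a quick sanity check, every column should now have a mean of roughly 0 and a standard deviation of 1:

round(colMeans(clusterdata), 3) #should all be ~0
apply(clusterdata, 2, sd)       #should all be 1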

Our data is now centered at 0, with each column expressed as z-scores. We can now run clustering on the scaled data:

#partition data into 4 clusters by state

kcluster <- kmeans(clusterdata, 4, nstart=25)

#note: nstart controls how many random starting assignments are tried (the best one is kept); a common rule of thumb is 20-50

#summarize results

clustersummary <- data.frame(area_name = clusterstates$area_name, cluster = kcluster$cluster)

The function returns a value from 1 to k for each data point, telling us which cluster it belongs to. There is no true "optimal" choice of k (the number of clusters to divide into) in the traditional sense; it's common to try a few values and see what makes the most sense. You want just enough clusters to show meaningful differences, but not so many that the result is overwhelming.
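If you want something a little more systematic than eyeballing, a common heuristic is the "elbow" plot: run kmeans for a range of k values, plot the total within-cluster sum of squares, and look for the point where adding clusters stops buying much. A quick sketch on the scaled data from above (the seed is just there because kmeans uses random starts):

#elbow plot: total within-cluster sum of squares for k = 1 to 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(clusterdata, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")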

Because clustering just mechanically groups the data, it's up to us to look at the results and come up with a story or interpretation. One way to do this is to create a visualization in two dimensions and look at how the data points literally cluster visually. Since income and race have continually been key driving demographic forces, they're a logical place to start.

#create a table with income, race, and cluster values

clusterplot <- merge(x = clustersummary, y = data.frame(area_name = county$area_name, white = county$white, median_income = county$median_income), by = "area_name", all.x = TRUE)


#Plot by cluster
ggplot(clusterplot, aes(x = white, y = median_income, color = factor(cluster))) +
  geom_text(aes(label = area_name)) +                             #label data points as state names
  scale_color_manual(values = c('red','blue','black','gray50')) + #color each cluster
  theme(legend.position = "none") +                               #remove legend
  labs(x = "Percentage White", y = "Median Income $K") +          #label axes
  geom_hline(aes(yintercept = 51.9), linetype = "dashed") +       #add national average income
  geom_vline(aes(xintercept = 63), linetype = "dashed")           #add national average race

Race and Income by State Clusters

  • The national median income is approximately $51,900, while about 63% of the population is white (which is actually an all-time low)
  • I purposely colored the "red" cluster that way because it's predominantly lower-income states with high white populations that tend to vote Republican (there are some exceptions, like Maine)
  • At the other end of the spectrum are the "blue" states, which are relatively wealthy and more diverse, as well as being known for their "liberal-ness"
  • The gray cluster is a little hard to interpret because it's kind of "middle of the road" demographically. I think it's interesting that Texas, Arizona, and Georgia are heavily Republican and look very much like the red cluster except for the race factor
  • I would classify the black cluster as high-median-income states, though Florida is an exception when it comes to income, so it must be heavily influenced by another axis that is not shown here (perhaps age; a quick way to check is sketched below)
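One quick way to pressure-test interpretations like these is to look at the cluster centers, which come back in z-score units for every feature we fed in, not just the two plotted here:

#cluster centers in z-score units: positive means above the scaled national average for that feature
round(kcluster$centers, 2)

#or attach the assignments to the unscaled data and compare cluster means (e.g. for the age columns)
aggregate(clusterstates[keep], by = list(cluster = kcluster$cluster), FUN = mean)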

Conclusion

I think it's pretty interesting that just visually plotting these two metrics essentially explains a good chunk of how states typically vote in national elections. Demography is, to some extent, destiny. I should have included this analysis earlier, because it's a good early step for quickly making sense of the chaos. It's also a great analysis for people who are first getting into data science, because at its core it's just three simple steps: 1. scale, 2. cluster, 3. interpret, and the code can be implemented in a few lines.