Using networkD3 in R to create simple and clear Sankey diagrams

I find Sankey diagrams super useful for illustrating flows of people or preferences. The networkD3 package in R offers a straightforward way to generate these diagrams without needing to know the ins and outs of the actual D3 code.

To illustrate, I generated a Sankey diagram showing how the twelve regions of the UK contributed to the overall result of the 2016 Brexit referendum, in which voters chose to leave the European Union by 17,410,742 votes to 16,141,241.

If you want to see the fully interactive Sankey diagram for this, you can view the code via an RMarkdown document on RPubs here. Unfortunately, only static images can be displayed on Medium.

Getting the data in shape

Very detailed data on the Brexit referendum can be obtained from the UK’s Electoral Commission website. The first step is to get our libraries loaded and to get the data into R. Since the data is very detailed down to the most localized voting centers, we need to aggregate all the Leave and Remain votes to get a total for each region.

# load libraries
library(dplyr)
library(networkD3)
library(tidyr)

# read in EU referendum results dataset
refresults <- read.csv("EU-referendum-result-data.csv")

# aggregate by region
results <- refresults %>%
  dplyr::group_by(Region) %>%
  dplyr::summarise(Remain = sum(Remain), Leave = sum(Leave))
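
To make the aggregation step concrete, here is a minimal sketch on a made-up data frame. The region names and vote counts are hypothetical and are not taken from the Electoral Commission file; only the group_by() and summarise() pattern mirrors the code above.

```r
library(dplyr)

# Hypothetical toy data: two regions with two counting areas each (made-up numbers)
toy <- data.frame(
  Region = c("North", "North", "South", "South"),
  Remain = c(100, 150, 200, 50),
  Leave  = c(120, 130, 90, 60)
)

# Sum the Remain and Leave votes within each region
toy_totals <- toy %>%
  dplyr::group_by(Region) %>%
  dplyr::summarise(Remain = sum(Remain), Leave = sum(Leave))

print(toy_totals)
```

Each region collapses to a single row holding its Remain and Leave totals, which is the shape the next step expects.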

We then need to create two dataframes for use by networkD3 in its sankeyNetwork() function:

  1. A nodes dataframe which numbers the source nodes (i.e. the 12 UK regions) and the destination nodes (i.e. Leave and Remain), starting at zero.
  2. A links dataframe which itemizes each flow using a source, target and value column. For example, the West Midlands region cast 1,755,687 votes for Leave, so in this case the source would be the node for West Midlands, the target would be the node for Leave and the value would be 1,755,687.

Here is some simple code to build the data in this way:

# format in prep for sankey diagram
results <- tidyr::gather(results, result, vote, -Region)
# create nodes dataframe: 12 regions plus Leave and Remain, numbered from zero
regions <- unique(as.character(results$Region))
nodes <- data.frame(node = c(0:13),
                    name = c(regions, "Leave", "Remain"))

# create links dataframe
# first merge attaches the source node number (node.x) for each region,
# second merge attaches the target node number (node.y) for each result
results <- merge(results, nodes, by.x = "Region", by.y = "name")
results <- merge(results, nodes, by.x = "result", by.y = "name")
links <- results[ , c("node.x", "node.y", "vote")]
colnames(links) <- c("source", "target", "value")
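
The nodes and links structure described above can also be sketched on toy data. This version uses base R's match() to look up the zero-based node numbers, which is an alternative to the two merge() calls rather than the method used in this post. All names and numbers here are hypothetical.

```r
# Hypothetical long-format results (made-up numbers)
toy_long <- data.frame(
  Region = c("North", "South", "North", "South"),
  result = c("Remain", "Remain", "Leave", "Leave"),
  vote   = c(250, 250, 250, 150),
  stringsAsFactors = FALSE
)

# Nodes: regions first, then the two outcomes, numbered from zero
node_names <- c(unique(toy_long$Region), "Leave", "Remain")
toy_nodes <- data.frame(node = seq_along(node_names) - 1,
                        name = node_names)

# Links: match() returns 1-based positions, so subtract 1 for zero-based indices
toy_links <- data.frame(
  source = match(toy_long$Region, toy_nodes$name) - 1,
  target = match(toy_long$result, toy_nodes$name) - 1,
  value  = toy_long$vote
)

# Sanity check: every index must point at an existing node
stopifnot(all(toy_links$source >= 0),
          all(toy_links$target <= nrow(toy_nodes) - 1))

print(toy_links)
```

The zero-based numbering matters because sankeyNetwork() refers to nodes by their row position in the nodes dataframe, counting from zero.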

Now that we have our data constructed the right way, we can simply use the networkD3::sankeyNetwork() function to create the diagram. This produces a simple, effective diagram, with rollover interactivity displaying the details of each voting flow. The static version is presented here.

# draw sankey network
networkD3::sankeyNetwork(Links = links, Nodes = nodes,
                         Source = 'source',
                         Target = 'target',
                         Value = 'value',
                         NodeID = 'name',
                         units = 'votes')
Brexit Referendum 2016 vote flows by region
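
If you want to keep the interactive version outside of an RMarkdown document, one option is to write the widget to an HTML file with networkD3::saveNetwork(). The sketch below uses a hypothetical toy dataset and a hypothetical output file name, and it sets selfcontained = FALSE on the assumption that pandoc may not be available on your machine.

```r
library(networkD3)

# Hypothetical toy nodes and links (made-up numbers, zero-based indices)
toy_nodes <- data.frame(name = c("North", "South", "Leave", "Remain"))
toy_links <- data.frame(
  source = c(0, 1, 0, 1),
  target = c(3, 3, 2, 2),
  value  = c(250, 250, 250, 150)
)

# Build the widget and keep it in a variable instead of printing it
p <- networkD3::sankeyNetwork(Links = toy_links, Nodes = toy_nodes,
                              Source = "source",
                              Target = "target",
                              Value = "value",
                              NodeID = "name",
                              units = "votes")

# Write the interactive diagram to an HTML file (file name is an assumption)
# selfcontained = FALSE stores supporting files in a side folder and avoids needing pandoc
networkD3::saveNetwork(p, file = "toy_sankey.html", selfcontained = FALSE)
```

Opening the resulting file in a browser should give the same rollover behaviour as the widget in the RStudio viewer.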
