What would the London Tube Map look like if Data Scientists designed it?

The London Tube Map is a seminal piece of design. Anyone who has looked at it remembers what it looks like. Over the years it has developed a colour scheme and typography that have contributed to its unique look and feel, but what makes it so special is the simple stroke of genius that led to its design.

In the early 1900s, the many ambitious underground and overground rail projects that had taken place in London over the previous half century or more had led to a complex array of interconnected lines, with many stations connecting 2, 3 or even 4 different railways. Attempts to draw the entire network on a map had proved unintuitive and complex for users.

Then along came Harry Beck, who realized that when people take the underground railway, they have no concern for where they are geographically. All they care about is how many stops there are and where they need to change. Realizing this, Beck created the first versions of today’s map, drawing the various lines as straight as possible and showing clearly where the interconnections were. Importantly, however, Beck also realized that the rough geographic direction of the lines had to be kept, because people cannot completely ignore the difference between north and south or east and west: they need a basic orientation when they look at a map. So in many ways Beck’s map is a compromise between design and accuracy. Here is a modern version of Beck’s map.

Today’s version of Beck’s tube map

What if Beck was a data scientist?

I was stuck inside on a rainy day in Melbourne, Australia and was thinking about networks (as I am currently writing a series about them). I realized that if a data scientist had been asked to develop a map of the London Tube network, there would be no room for compromise. They would have likely considered one of the two most extreme designs:

  1. Completely geo-free: Use a force-directed network to determine the positioning of the stations, irrespective of their actual location
  2. Completely geo-accurate: Akin to the original pre-Beck Tube maps, superimpose the network onto a map of London using spatial co-ordinates

First, I needed to find a data source that could tell me the connections in the London Tube network, as well as the stations and lines. Don’t you just love it when someone has already uploaded data sets that are perfect for your aims? I found a data set that I could easily adjust here. The data even included the hexadecimal colour codes for the Tube lines. Transport for London publish a design style guide here, by the way.

Completely geo-free (force-directed) Tube map

If you don’t want to continue reading the ‘how’, feel free to go right to my interactive force-directed Tube map here. Or you can read on and see it at the end of this section.

I needed to find an algorithm for generating a force-directed network that could also be visualized easily. I’ve been using the networkD3 package in R a lot recently and was happy to see that it has a forceNetwork() function.

Given the available data and the user-friendliness of networkD3, not a lot of complex coding is needed here. First we load up the libraries and load in three files which I had tweaked from the originals.

# load libraries
library(dplyr)      # provides the %>% pipe used below
library(networkD3)  # forceNetwork() and JS()
# load data
stations <- read.csv("stations.csv")
connections <- read.csv("connections.csv")
lines <- read.csv("lines.csv")

The stations dataframe is simply a list of station names with an ID number for each station, and their spatial co-ordinates (not needed for the geo-free version). There are 302 stations in the Tube map. The lines dataframe is a list of the 13 lines on the network, with a numbered ID, line name and official colour of the line. The connections data frame is a list of all connections between any two stations, along with the line number that connects them. There are 406 total connections in this dataframe.
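
To make the shape of these three dataframes concrete, here is a toy sketch with three stations on two lines. The column names name, latitude, longitude, station1, station2, line, line_name and colour follow the code in this post. The id column name, the rows, the approximate co-ordinates and the second hex code are illustrative assumptions, not the real data.

```r
# toy versions of the three dataframes (illustrative values, not the real files)
toy_stations <- data.frame(
  id        = 1:3,                              # assumed name for the station ID column
  name      = c("Oxford Circus", "Bond Street", "Piccadilly Circus"),
  latitude  = c(51.515, 51.514, 51.510),        # approximate co-ordinates
  longitude = c(-0.142, -0.149, -0.134),
  stringsAsFactors = FALSE
)

toy_lines <- data.frame(
  line      = c(1, 2),                          # arbitrary line IDs
  line_name = c("Bakerloo", "Central"),
  colour    = c("#AE6017", "#FF0000"),          # second value is a placeholder
  stringsAsFactors = FALSE
)

# one row per pair of directly connected stations, plus the line joining them
toy_connections <- data.frame(
  station1 = c("Oxford Circus", "Oxford Circus"),
  station2 = c("Piccadilly Circus", "Bond Street"),
  line     = c(1, 2),
  stringsAsFactors = FALSE
)

str(toy_connections)
```

Every station named in the connections should appear in the stations dataframe, and every line ID should appear in the lines dataframe, otherwise the merges later on will drop or blank out rows.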

I decided to colour the edges in my network to match the line colours in the official Tube map. I also decided to colour the nodes (i.e. stations) according to the line they are on. When a station is on multiple lines, I just picked the line with the lowest ID number, which is a fairly arbitrary choice. This meant I needed to add some columns to my stations and connections dataframes to capture the colour of the station and the colour of the connection.

# bring in line colour into connections dataframe for edge colours
connections <- merge(connections, lines)
connections <- connections[ ,c("station1", "station2", "line", "colour")]
# define a colour for each station using min of line ID 
connections_unique_lines1 <- connections %>%
  dplyr::group_by(station1) %>%
  dplyr::summarise(line = min(line))
colnames(connections_unique_lines1) <- c("station", "line")
connections_unique_lines2 <- connections %>%
  dplyr::group_by(station2) %>%
  dplyr::summarise(line = min(line))
colnames(connections_unique_lines2) <- c("station", "line")
connections_unique_lines3 <- rbind(connections_unique_lines1, connections_unique_lines2)
connections_unique_lines <- connections_unique_lines3 %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(line = min(line))
# merge line IDs into stations dataframe
stations <- dplyr::left_join(stations, connections_unique_lines, by = c("name" = "station"))
# merge with lines dataframe to capture line_name
stations <- dplyr::left_join(stations, lines, by = "line")
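
The station-colour step boils down to two moves: stack both endpoint columns into one long table, then take the smallest line ID per station. Here is a compact base R sketch of the same idea on made-up rows. The toy_ names and the data are hypothetical, and this is an alternative way of writing it rather than the code used for the map.

```r
# toy connections: Oxford Circus is on lines 1 and 2, Bond Street only on line 2
toy_connections <- data.frame(
  station1 = c("Oxford Circus", "Oxford Circus", "Bond Street"),
  station2 = c("Piccadilly Circus", "Bond Street", "Marble Arch"),
  line     = c(1, 2, 2),
  stringsAsFactors = FALSE
)

# stack both endpoints so each station appears once per connection it belongs to
endpoints <- data.frame(
  station = c(toy_connections$station1, toy_connections$station2),
  line    = rep(toy_connections$line, times = 2),
  stringsAsFactors = FALSE
)

# keep the smallest line ID for each station
station_line <- aggregate(line ~ station, data = endpoints, FUN = min)
print(station_line)
```

Oxford Circus ends up with line 1 even though it also sits on line 2, which is exactly the arbitrary tie-break described above.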

Now we have done most of the work. We just need to number the stations in a zero-indexed format, which is what D3.js requires:

# create indices for each name to fit forceNetwork data format
connections$source.index <- match(connections$station1, stations$name) - 1
connections$target.index <- match(connections$station2, stations$name) - 1
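
A quick sketch of what the indexing does, using hypothetical toy vectors: match() returns 1-based positions in the station list, and subtracting 1 gives the 0-based indices that D3.js expects. It also shows that a name missing from the station list becomes NA, which is worth checking for before drawing the network.

```r
toy_names    <- c("Oxford Circus", "Bond Street", "Piccadilly Circus")
toy_station1 <- c("Oxford Circus", "Bond Street")

# 1-based row positions in the station list, shifted to 0-based for D3.js
source_index <- match(toy_station1, toy_names) - 1
print(source_index)

# a station name that is not in the list produces NA rather than an error
missing_index <- match("Not A Station", toy_names) - 1
print(is.na(missing_index))
```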

So now we have everything we need to draw the network. We will use the forceNetwork() function in the networkD3 package. The connections dataframe contains the links we need, and the stations dataframe contains the details of the nodes. We use the line_name column in the stations dataframe to group the stations for the purposes of colour coding the nodes, and we use the colour column in the connections dataframe to colour code the edges (according to the official colours of the lines they are on).

We also need to define the colours of the nodes to match the lines, and then let’s also try to use a similar typeface to the London Tube Map. I’ve used Gill Sans, which is not the official typeface, but is pretty close (Eric Gill was a student of Edward Johnston, who designed the original typeface for the Tube).

So here is the code to generate the network.

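# build the D3 colour scale from the lines dataframe so that each line name maps to its own colour
# (assumes lines$colour holds CSS-ready hex codes such as "#AE6017")
line_names <- paste0('"', lines$line_name, '"', collapse = ", ")
line_colours <- paste0('"', lines$colour, '"', collapse = ", ")
colour_scale <- networkD3::JS(sprintf("d3.scaleOrdinal().domain([%s]).range([%s])",
                                      line_names, line_colours))

networkD3::forceNetwork(Links = connections, Nodes = stations,
                        Source = "source.index",
                        Target = "target.index",
                        NodeID = "name",
                        Group = "line_name",
                        colourScale = colour_scale,
                        linkColour = as.character(connections$colour),
                        charge = -30,
                        linkDistance = 25,
                        opacity = 1,
                        zoom = TRUE,
                        fontSize = 12,
                        fontFamily = "Gill Sans Nova",
                        legend = TRUE)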

So how did it turn out? Here’s a static image. If you want to see and play with the full interactivity offered by networkD3, I have uploaded the map here.

Data Science can’t make up for good design

So if we took Beck’s principle that people don’t need to make geographical sense of the Tube map to the extreme, this is what data science tells us is the most aesthetically pleasing way to visualize the Tube map. Some things make sense — for example, Heathrow and Richmond are approximately where they should be — but other things look completely bizarre. Epping is now in South London instead of Essex, for example. I’m sure there are many other interesting observations to be made here. If you are a Londoner, please explore and feel free to highlight any observations you make.
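
To see why stations can land so far from their real positions, it helps to look at what a force-directed layout actually optimizes. The sketch below is a deliberately simplified spring-and-repulsion loop in base R, not the simulation that D3.js runs. All names and constants in it are assumptions chosen for illustration. Nodes start at random positions, every pair of nodes pushes apart, and every edge pulls its two ends towards a target length. Geography never enters the calculation.

```r
set.seed(42)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4))     # a toy line with four stations
n_nodes <- 4
pos <- matrix(runif(n_nodes * 2), ncol = 2)   # random start: no co-ordinates used

link_distance <- 1      # preferred edge length
repulsion     <- 0.01   # strength of node-to-node repulsion
stiffness     <- 0.1    # strength of the edge springs
max_step      <- 0.1    # cap on how far a node may move per iteration

for (iter in seq_len(500)) {
  move <- matrix(0, n_nodes, 2)
  # every pair of nodes repels, with an inverse-square falloff
  for (i in seq_len(n_nodes)) {
    for (j in seq_len(n_nodes)) {
      if (i == j) next
      delta <- pos[i, ] - pos[j, ]
      dist <- sqrt(sum(delta^2)) + 1e-9
      move[i, ] <- move[i, ] + repulsion * delta / dist^3
    }
  }
  # every edge acts as a spring pulling its ends towards link_distance
  for (e in seq_len(nrow(edges))) {
    a <- edges[e, 1]
    b <- edges[e, 2]
    delta <- pos[b, ] - pos[a, ]
    dist <- sqrt(sum(delta^2)) + 1e-9
    pull <- stiffness * (dist - link_distance) * delta / dist
    move[a, ] <- move[a, ] + pull
    move[b, ] <- move[b, ] - pull
  }
  # limit each node's displacement so close starting points do not explode
  step_len <- sqrt(rowSums(move^2))
  move <- move * pmin(1, max_step / pmax(step_len, 1e-9))
  pos <- pos + move
}

edge_lengths <- sqrt(rowSums((pos[edges[, 1], ] - pos[edges[, 2], ])^2))
print(round(edge_lengths, 2))
```

The edges settle close to the preferred length, but the orientation of the whole chain depends only on the random starting positions. That is why a station like Epping can end up on the wrong side of the map.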

Completely geo-accurate Tube map

Again, if you don’t care about the ‘how’, you can see the result here.

To achieve this, I needed to get a map of London and superimpose the stations and links using their spatial co-ordinates.

We will mostly use ggplot2 for this, along with a few more packages, but we will use the same data files as before.

# load libraries
library(dplyr)    # provides the %>% pipe used below
library(ggplot2)  # plotting
# load data
stations <- read.csv("stations.csv")
connections <- read.csv("connections.csv")
lines <- read.csv("lines.csv")

I got a GIS data file for the London map with the various borough boundaries from here, and unzipped it into a folder called london-map-data. Then I needed to convert this data into a format for ggplot2 to use.

# import London borough GIS data
london <- rgdal::readOGR(file.path("london-map-data"))
sp::proj4string(london) <- sp::CRS("+init=epsg:27700")
london.map <- sp::spTransform(london, sp::CRS("+init=epsg:4326"))

Now with the London map data in the right format, we can plot it using ggplot2.

# plot London boundaries
map1 <- ggplot(london.map) +
geom_polygon(aes(x = long, y = lat, group = group), fill = "white", colour = "black")
map1 <- map1 + labs(x = "Longitude", y = "Latitude", title = "London Tube Routes")

Here’s the simple map that we will draw the Tube lines and stations onto:

It will be easy to plot the stations on this map since the stations dataframe has the spatial co-ordinates for every station. To plot the lines, we need to pull in the spatial co-ordinates of each station pair in the connections dataframe.

# get spatial co-ordinates for each station pair in network
connections <- connections %>%
dplyr::inner_join(stations, by = c('station1' = 'name')) %>%
dplyr::rename(x = longitude, y = latitude) %>%
dplyr::inner_join(stations, by = c('station2' = 'name')) %>%
dplyr::rename(xend = longitude, yend = latitude)
connections <- merge(connections, lines)
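
The effect of the two joins is easier to see on a tiny example. This base R sketch uses hypothetical toy data with approximate co-ordinates and match() instead of dplyr joins, but it produces the same four columns: x and y for the first station of each pair, xend and yend for the second.

```r
toy_stations <- data.frame(
  name      = c("Oxford Circus", "Bond Street", "Piccadilly Circus"),
  latitude  = c(51.515, 51.514, 51.510),   # approximate co-ordinates
  longitude = c(-0.142, -0.149, -0.134),
  stringsAsFactors = FALSE
)
toy_connections <- data.frame(
  station1 = c("Oxford Circus", "Oxford Circus"),
  station2 = c("Piccadilly Circus", "Bond Street"),
  line     = c(1, 2),
  stringsAsFactors = FALSE
)

# row positions of each endpoint in the stations dataframe
from <- match(toy_connections$station1, toy_stations$name)
to   <- match(toy_connections$station2, toy_stations$name)

# one segment per connection, ready for geom_curve() or geom_segment()
toy_segments <- data.frame(
  toy_connections,
  x    = toy_stations$longitude[from],
  y    = toy_stations$latitude[from],
  xend = toy_stations$longitude[to],
  yend = toy_stations$latitude[to]
)
print(toy_segments)
```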

We also need to manually define the colours of the lines. I have used named R colours that are as close as possible to the official ones, although ggplot2 also accepts hex codes directly, as the last entry shows.

# define line colours
linecolours <- c("brown", "yellow", "pink", "grey", "lightblue", "red", "darkgreen", "orange", "maroon", "black", "darkblue", "lightgreen", "#00A77E")
names(linecolours) <- lines$line_name
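
The names on this vector matter: scale_color_manual() matches a named values vector to the levels of the colour aesthetic by name, so the order of the entries stops being important. As a possible alternative, the named vector could be built straight from the hex codes in the lines dataframe. This is a sketch on hypothetical toy data, where the second hex code is a placeholder and "#AE6017" is assumed to be the Bakerloo colour.

```r
toy_lines <- data.frame(
  line      = c(1, 2),
  line_name = c("Bakerloo", "Central"),
  colour    = c("#AE6017", "#FF0000"),   # second value is a placeholder
  stringsAsFactors = FALSE
)

# named vector: names are line names, values are hex colour codes
toy_linecolours <- setNames(as.character(toy_lines$colour), toy_lines$line_name)
print(toy_linecolours["Bakerloo"])
```

A vector built this way can be passed to scale_color_manual(values = ...) in the same way as the hand-written one.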

Now we are ready to plot the stations and lines onto the London map, again using the Gill Sans typeface for authenticity.

# plot network on London map
map1 +
geom_point(data = stations, aes(x = longitude, y = latitude)) +
geom_curve(aes(x = x, y = y, xend = xend, yend = yend,
color = line_name),
data = connections, curvature = 0.33, size = 1) +
scale_color_manual(values = linecolours, name = "Line") +
theme(text = element_text(family="Gill Sans Nova"))

And here is the result. You can see a better resolution here:

This was a super fun way to explore force-directed networks and geospatial mapping. My main conclusion is that there’s no substitute for clever human design. Yet, anyway!
