This is the second in a series of articles explaining the principles of graph theory for those who may use it in a data science context. The first article, which focuses on the origins of graph theory and the basic properties of graphs, can be found here.
One July morning in the mid-19th century, the people of London awoke to a disgusting, gut-twisting stench. They couldn’t leave their homes without being sick. The more well-off laced their handkerchiefs with perfume and walked around permanently covering their faces. Many of the poor left town to find work in the countryside because they just couldn’t bear it. It was without a doubt the smelliest incident in British history.
It was the beginning of what became known as the Great Stink of 1858. The River Thames, full to the brim with centuries of human waste that had been dumped directly from the medieval wooden sewer system, was finally getting its revenge. Washed up onto the banks of the river, the cholera-ridden sludge basked in the unusually hot summer temperatures and formed a miasmic stench that was inescapable for miles.
The City of London Corporation, not known at the time for being particularly proactive on public health, realized enough was enough, and invited submissions for the design of a new sewerage plan for the city. The man whose plans were accepted, Joseph Bazalgette, is now considered one of the major civic heroes of London’s past. A talented civil engineer, he oversaw a monumental project of public works which transformed hygiene levels and quality of life in London. Bazalgette’s sewer network is widely regarded as the first step in the creation of the modern city of today, as well as the beginning of the end of cholera in London.
Bazalgette’s sewer network is still going strong, carrying the waste of millions of people eastwards to processing facilities towards the mouth of the Thames. As an engineering project, it was an astounding example of human endeavor: 22,000 kilometers of sewers, 318 million bricks, 2.7 million cubic meters of excavated earth.
Bazalgette was known as a tireless worker, and he left no stone unturned in making this massive effort future-proof. He obsessed over details: the gradients needed to keep the sewage flowing under gravity, the diameters of the tunnels. But two questions were crucial to answer at the very outset to make the project manageable and sustainable: first, how do we minimize the route of sewage between any two points in the network, and second, which are the most important connecting points?
Bazalgette’s feat of engineering is an example of some of the first uses of the emerging field of graph theory, and it illustrates the importance of two concepts which we look at all the time today in relation to networks: the distance between vertices and the importance of vertices.
Measuring distance in a graph
Distance is a fairly simple concept in graph theory but extremely useful in practice. Recall from the previous article in this series that a graph consists of a set of vertices and a set of edges that link pairs of vertices. Given any two vertices, the distance between them is defined as the number of edges in the shortest path between them. This is also sometimes called the ‘geodesic distance’ and is by convention described as ‘infinite’ if no path exists between the vertices. For example, in the above simple graph, the distance between vertex 2 and vertex 6 is 3 (there are two paths of this length that can get you there).
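The definition above translates directly into a breadth-first search: explore outwards from one vertex, one edge-layer at a time, until you reach the other. Here is a minimal Python sketch on an invented toy graph (the vertex labels below are made up for illustration and do not correspond to the figure):

```python
from collections import deque

def distance(adj, source, target):
    """Geodesic distance between two vertices via breadth-first search.

    adj maps each vertex to the set of its neighbours.
    Returns float('inf') if no path exists, matching the convention above.
    """
    seen = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return seen[v]
        for w in adj[v]:
            if w not in seen:
                seen[w] = seen[v] + 1
                queue.append(w)
    return float('inf')

# A small invented graph, stored as an adjacency dictionary
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
print(distance(graph, 2, 6))  # 3, via 2 -> 3 -> 4 -> 6
```

Because breadth-first search visits vertices in order of increasing distance, the first time it reaches the target is guaranteed to be along a shortest path.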
Distance is an extremely useful concept because we often want to optimize for it. Minimizing distance is a very common requirement in complex networks for engineering purposes. In the study of people, minimum distance is also a common question of interest. The idea of six degrees of separation, which argues that any two people in the world can be connected to each other through at most five intermediate vertices, or six edges, is a question of minimum distance in a graph. Recent studies on Facebook show the average minimum distance between individuals on that network to be 4.57.
But maximum distance can also be of interest, because it implies unfamiliarity and difference. For example, it may be possible to use certain company data to build a graph representing past collaboration between employees. Then, at company events where you are organizing people into discussion groups, if you want to maximize the formation of new connections and a diversity of points of view, you can ask questions like: how do we break these 100 people into ten groups of ten, so that the groups have the maximum average internal distance and are therefore least likely to have worked with each other before? Used in this way, graph theory can have a meaningful impact on people’s experience within an organization.
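One way to make this concrete is to score a candidate group by the average geodesic distance between its members, then prefer groupings that score higher. The sketch below does this for a tiny invented "collaboration" graph (the names and structure are assumptions for illustration only):

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """All geodesic distances from source; unreachable vertices are absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def group_score(adj, group):
    """Average pairwise distance within a group: higher means members
    sit further apart in the collaboration graph."""
    dists = {v: bfs_distances(adj, v) for v in group}
    pairs = list(combinations(group, 2))
    return sum(dists[a].get(b, float('inf')) for a, b in pairs) / len(pairs)

# Toy collaboration graph: two tight clusters joined by the edge c-d
collab = {
    'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'},
    'd': {'c', 'e', 'f'}, 'e': {'d', 'f'}, 'f': {'d', 'e'},
}
print(group_score(collab, ['a', 'b', 'c']))  # 1.0 (all within one cluster)
print(group_score(collab, ['a', 'b', 'e']))  # higher: 'e' is in the other cluster
```

Searching over all possible partitions of 100 people is combinatorially expensive, so in practice you would use a heuristic (greedy swaps, for instance) with a score like this as the objective.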
Measuring the importance of vertices in a graph
In any graph, some vertices are more important than others. In Bazalgette’s sewer network, for example, some junctions need greater monitoring because any failure or leak there will have a greater impact on the entire network. Similarly, in a network of people, certain individuals have greater influence because of their positioning and connectivity relative to others in the network.
One simple measure of importance is the valence (also called the degree) of a vertex: the number of edges that connect to it. On Facebook, for example, your valence is the number of connections you have. But that doesn’t fully capture the concept of influence or importance, does it? Not everyone with a large number of connections plays a really important role in the network.
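With a graph stored as an adjacency dictionary, valence is just the size of a vertex's neighbour set. A quick sketch, using an invented "friendship" graph for illustration:

```python
def degree(adj, v):
    """Valence (degree) of a vertex: the number of its neighbours."""
    return len(adj[v])

# Hypothetical friendship graph: 'hub' is connected to everyone else
friends = {
    'hub': {'a', 'b', 'c', 'd'},
    'a': {'hub'}, 'b': {'hub'}, 'c': {'hub'}, 'd': {'hub'},
}
print(degree(friends, 'hub'))  # 4
print(degree(friends, 'a'))    # 1
```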
In my experience the best measure of importance in a network is betweenness centrality. Simply put, the betweenness centrality of a given vertex measures how often that vertex lies on the shortest paths between other pairs of vertices in the network. Vertices with high betweenness centrality influence the spread of information to a greater degree, and their loss from the network tends to have a much more significant impact on its overall connectivity. In the graph above, red vertices have the lowest betweenness centrality while blue vertices have the highest.
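For a small graph, betweenness can be computed by brute force: for every pair of vertices, enumerate all shortest paths and credit each intermediate vertex with the fraction of those paths that pass through it. The sketch below uses an invented toy graph in which one "bridge" vertex separates two clusters (the labels are assumptions for illustration, not the article's figure):

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate every shortest path from s to t (BFS with predecessor lists)."""
    dist = {s: 0}
    preds = {s: []}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                preds[w] = [v]
                queue.append(w)
            elif dist[w] == dist[v] + 1:
                preds[w].append(v)   # another shortest path reaches w via v
    if t not in dist:
        return []
    def unwind(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in preds[v] for p in unwind(u)]
    return unwind(t)

def betweenness(adj):
    """Brute-force betweenness: each inner vertex of a shortest s-t path
    scores the fraction of all s-t shortest paths passing through it.
    Fine for toy graphs; real libraries use Brandes' algorithm instead."""
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for path in paths:
            for v in path[1:-1]:
                score[v] += 1 / len(paths)
    return score

# 'bridge' sits on every path between {a, b} and {c, d}
g = {
    'a': {'b', 'bridge'}, 'b': {'a', 'bridge'},
    'bridge': {'a', 'b', 'c', 'd'},
    'c': {'bridge', 'd'}, 'd': {'bridge', 'c'},
}
scores = betweenness(g)
print(max(scores, key=scores.get))  # 'bridge'
```

Note that 'bridge' scores highest even though its valence (4) is only slightly larger than the others', which is exactly the distinction between degree and betweenness drawn above.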
Understanding betweenness centrality can be very important in people networks. It can help identify which individuals to invest in to ensure that a certain message gets out as widely as possible. It can help with getting new members of a network more connected through introductions to the right people. It can help determine how much concern you should have regarding the loss of an individual from the network and its potential impact on others.
Betweenness centrality is complicated to measure because you need to calculate the shortest paths between all pairs of vertices in a network. For large networks this can be highly computationally intensive. However, there are excellent data science packages for computing network characteristics, including betweenness centrality. In the R ecosystem, which I work in, the igraph package is particularly handy.
Next time we will look at how information flows through networks, which has really interesting applications to the study of rumours and trends. Read it here.