This is the first in a series of articles explaining the principles of graph theory for those who may use it in a data science context.
In the early 1700s, the good people of Königsberg in Prussia had a problem on their minds. There were seven bridges that crossed the river in that town, and the big question was: could a route be found that allowed someone to cross all seven bridges without crossing any one bridge more than once?
Why exactly this was a concern is lost to history. Perhaps there were tolls on the bridges and people were trying to find the cheapest way about town? Or perhaps they were scared or bored of bridges? In any case, the point is that Facebook and Twitter owe a lot to the man that solved this problem.
That man was the Swiss-born mathematician Leonhard Euler. Euler seized the problem of the Königsberg bridges with the relish that only a true mathematician could have for a problem like this. And like a true mathematician always does, he reduced it to its simplest form. This problem was not about bridges — it was about something much simpler. What it reduced to was this: can you draw this graph with a pencil without crossing the same line twice and without taking your pencil off the paper:
The answer is no — give it a try! And with that simple insight was born a branch of mathematics that is central to every like, every follow, every friend request that we send on our smartphones or computers today: graph theory!
What are graphs?
Most people understand the term ‘graph’ very broadly, to represent most mathematical pictorial depictions. However, as you’d expect, in mathematics there is a very clear definition of what a graph is which allows us to set up helpful rules and calculations around them to make them useful to the problems we are trying to solve.
In its most abstract form, a graph is two sets. The first set is called the vertex set. You can think of this as a set of different objects: for example it could be people, or it could even be sewer junctions (more on this in a later article). Sewers can smell a bit, so lets stick with people for now. So let’s say our vertex set is John, Alex, Somesh, Lily. The second set is called the edge set. Each edge in the edge set is defined by a pair of vertices, which denotes that there is an edge that connects those two vertices. If the pair {John, Lily}is in the edge set, then John and Lily are connected.
As with many mathematical concepts, it’s often helpful to draw them, and in the case of graphs it is a very natural, intuitive thing to represent them pictorially. Here’s a graph that has elements of social networking in it:
Ignoring the direction of the arrows, and the types of connections that the arrows represent (all of this we will dig into in a later post), can you define the vertex set and the edge set for this graph? (The answer is at the end of this article).
Types of graphs
The way I have defined graphs in the previous section is its most general definition. By adding additional rules or criteria to the general definition of a graph, you can generate specialized classes of graphs. Here are some of the more common ones, along with real life examples of their uses:
- Digraphs or Directed Graphs are graphs where the edge set also has the property of having direction. This means that, for example, the edge (Justin, Movies) is different from the edge (Movies, Justin). The first is the edge depicted using the arrow in the previous image, the second would mean the relationship is in the other direction (Movies likes Justin — probably impossible unless movies develop emotions). Following people on Twitter can be considered as a digraph — there are people who follow you, but you don’t follow them.
- Multigraphs are graphs where there can be multiple edges between two vertices, usually depicting different relationships. Think of a map of airline routes where each edge is a flight number. London and New York would have a heck of a lot of edges between them.
- Pseudographs are graphs which allow edges that connect vertices to themselves. Needless to say, this is not usually needed in graphs depicting people relationships. But if for example you needed to use a graph to depict coffee orders in the office and who was buying what for whom, a pseudograph would be useful or you risk missing out on your morning caffeine hit!
- A complete graph is exactly what it says on the tin, In a complete graph, there are no more edges that can be added to the edge set. All the vertices are connected to each other. It can be a useful mathematical tool to prove that graphs are complete.
- Trees are pretty obvious also, at least to look at, but they are mathematically defined as a graph which is connected and without cycles. This means that any pair of different vertices can be connected to each other through a set of edges but it is not possible to connect a vertex to itself through a set of edges. A family tree is usually an example of this, but if you are a member of a Royal Family you may be concerned about that statement. Check out King Charles II of Spain — ugh, awkward — definitely not a tree in the graph theory sense!
All of these different types of graphs have been defined for a reason — there are real life problems that they help solve. For the most part, we can stick to graphs, digraphs and multigraphs for most of the work that we need to do in data science.
The adjacency matrix
Before I wrap up this piece, I’ll touch on the adjacency matrix which is a super useful way of starting to measure network related phenomena.
Imagine we are dealing with the simplest form of graph. If you have hundreds or thousands of vertices, it can become very hard to graphically depict this and you need a more systematic method to notate edges. One way to do this is to create a square array with the vertex names across both the horizontal and vertical axes. You can then use 1’s and 0’s to denote whether an edge exists between these vertices or not. Here is an example:
Depending on the type of graph, the adjacency matrix can have different properties. In basic graphs, the adjacency matrix is symmetrical along its diagonal, but digraphs may not have this property. Graphs and digraphs have zeros on the diagonal, but pseudographs may not.
Adjacency matrices are very powerful tools. For example, if you multiply the above adjacency matrix by itself, you generate a new matrix which tells you the number of paths of length 2 between any two vertices. If you raise it to the power n, it generates a matrix telling you the number of paths of length n between any two vertices. See if you can work out why that is the case. This kind of mathematics can easily be used to measure first, second and higher order connections.
There is a great mathematical richness to graph theory, much of which has direct and exciting application to the work of people analytics professionals and increasingly in other disciplines like medicine, epidemiology, politics and sociology. There are numerous interesting open problems in this space. Can social networking aid emergency response? How can social networks influence political outcomes? Can they be used to track the spread of disease?
I will cover this substantial and exciting topic over a number of articles — if you intend to follow these it would be worth making sure you are comfortable with what was covered here.
Next time around we will look at how you can apply graph theory to measure network connectivity, network growth and how important someone is in a network. Read this here.
(The answer to my earlier question was: Vertex set: Anna, Justin, Alana, Books, Movies. Edge set: {Anna, Books}, {Justin, Anna}, {Justin, Alana}, {Justin, Movies}, {Alana, Movies})
3 thoughts on “Facebook and Twitter were born in 18th century Europe”