This is the third in a series of articles explaining the principles of networks for those who may use them in a data science context. The first article, which focuses on the origins of graph theory and the basic properties of graphs, can be found here.
In the summer of 1906, the Warrens — a large, well-to-do New York family — were having dinner in their vacation house in Oyster Bay, Long Island. The most anticipated part was dessert. On a hot summer evening, Peach Melba served with ice cream was a thoroughly delicious, cool and refreshing treat which the children in particular savored.
A few days later, one of the children began to fall ill with the symptoms of typhoid fever. Within a few more days, five more people including maids and a gardener started to display symptoms of the disease. It was perplexing, because typhoid was thought to mostly propagate in dirty and unhygienic environments, usually associated with the poor. Typhoid fever just did not break out in Oyster Bay.
The landlord who rented the holiday home to the Warrens, George Thompson, was very concerned. This prestigious rental property was expensive to run and he could not afford for it to sit unoccupied because of an infectious disease. He called in the local authorities, who checked all the usual suspects, human and mechanical. Toilets and running water were inspected, as were the local people from whom food had been purchased in the town. Every test was inconclusive as to the source of the outbreak.
So Thompson went to New York City to find his typhoid hunter. He found him in Dr George Soper — not a medical doctor but a sanitary engineer. Conducting thorough research into the background circumstances of the Warren family, Soper realized that there was one recent major change in the family’s environment that could result in the transmission of new bacteria. At the beginning of August, they had hired a new cook. When he interviewed the Warrens and discovered that the cook’s specialty was a cold dish, Peach Melba, he became doubly suspicious, knowing that cold dishes were massively more likely to carry live bacteria.
The cook, an Irish immigrant named Mary Mallon, was nowhere to be seen: she had fled not long after the family started getting ill. Soper embarked on several months of investigation into Mallon’s history and discovered seven previous families for which she had served as a cook. Typhoid had broken out in every one of them, infecting a total of twenty-two people. If Mallon was indeed a healthy carrier of typhoid, she needed to be tracked down as soon as possible.
Mary Mallon, or Typhoid Mary as she became widely known in the press and public health campaigns over the next several years, was finally tracked down by Soper in March 1907 and was confirmed as the original healthy carrier of the typhoid bacillus. She spent most of the next 30 years of her life in some sort of isolation, at one point being released and returning to cooking for others before being arrested again and returned to quarantine. She died in 1938, not of typhoid but of pneumonia. Never once in her life did she display any symptoms of typhoid fever, but an autopsy revealed substantial amounts of the bacillus in her gallbladder.
Transmission through networks
Soper’s work marked a first in the understanding of the spread of disease. Numerous successful investigations of disease outbreaks over the prior 50 years had focused on the study of locations and places, including the famous tracing of the Broad Street cholera outbreak in London in the 1850s to a contaminated water pump.
But this was the first time that disease had been successfully tracked through people, not places. It was the beginning of an understanding of how things flow through networks of people. To a certain extent, information flows through networks of people in similar ways to disease, which is why we use the word ‘viral’ to describe rapid and extensive information transmission today. This early work to understand the flow of disease through networks was the origin of how we look at the spread of information today, whether it be trends, rumors or advertising.
Thinking about information flow in its simplest form, how far a piece of information spreads from one person depends on two factors: the number of people that person is connected to, let’s call that k, and the probability that the information will be passed on to each of them, let’s call that p. In Mary Mallon’s case, the authorities realized that p was very difficult to influence: you will struggle to change the lifelong hygiene habits of an uneducated cook. So reducing k by isolating Mallon was the only real option to control the spread of the disease.
In social networks, understanding how to manipulate p and k can be critical in managing the spread of marketing, advertising, news, or fake news. p can be increased by ensuring the topic has strong commonality with the known interests of the individual, or by making an explicit agreement with the individual that the information will be passed on. k can be increased by working through individuals with high valence, that is, many connections in the network (see the previous piece in this series for more information on valence).
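To make the roles of p and k concrete, here is a minimal sketch. It assumes an idealized tree in which every informed person has k fresh contacts and passes the information to each one independently with probability p. The function names `expected_reach` and `simulate_reach` are hypothetical, chosen only for this illustration.

```python
import random

def expected_reach(k: int, p: float, steps: int) -> float:
    """Expected number of people informed after `steps` rounds.

    Each round, every newly informed person informs k * p others on
    average, so round n adds (k * p) ** n people. The source counts as 1.
    """
    return sum((k * p) ** n for n in range(steps + 1))

def simulate_reach(k: int, p: float, steps: int, rng: random.Random) -> int:
    """One random run of the same tree model."""
    informed_total = 1
    frontier = 1  # people who received the information in the last round
    for _ in range(steps):
        # Every frontier member tries each of their k contacts once.
        newly_informed = sum(
            1 for _ in range(frontier * k) if rng.random() < p
        )
        informed_total += newly_informed
        frontier = newly_informed
    return informed_total

if __name__ == "__main__":
    rng = random.Random(0)
    for k, p in [(10, 0.05), (10, 0.2), (2, 0.2)]:
        runs = [simulate_reach(k, p, 5, rng) for _ in range(2000)]
        average = sum(runs) / len(runs)
        print(f"k={k} p={p} expected={expected_reach(k, p, 5):.2f} "
              f"simulated={average:.2f}")
```

In this simplified model the product k × p is what matters: when it is below 1, each round informs fewer people than the last and the spread tends to fizzle out. Cutting k, as isolation does, or cutting p both push the product down.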
Rumors and information propagation
In reality, our p and k above represent a simplistic model for how trends and rumors propagate in a network. For one thing, whether or not information passes between two individuals depends on both the sender’s propensity to pass it on and the recipient’s interest in receiving it. The model also assumes that the network graphs we are dealing with are trees, with information passing downwards only, which is usually not the case in complex social networks.
In social networks, numerous types of interactions may happen. If A is the set of people who have received the information, and B is the set of people who have not received it, several cases are possible. Someone in A can pass it to someone else in A, or to someone in B. In both these cases, the recipient can choose to accept or reject, and even if they have previously chosen, they can change their minds.
Another big difference between the propagation of trends and rumors and the propagation of disease is that we are more interested in where the information goes than in where it originated. Research on how rumors propagate on social networks has thrown up some very interesting observations.
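The interactions described above can be sketched as a small simulation on a general graph rather than a tree. This is an illustrative model only: the function `spread`, the parameters `p_pass` and `p_accept`, and the simplification that people never discard information once they have accepted it are all assumptions made for this example.

```python
import random

def spread(graph, seeds, p_pass, p_accept, rounds, rng):
    """Simulate information passing on an undirected graph.

    graph:    dict mapping each person to a list of neighbours
    seeds:    people who hold the information at the start (set A)
    p_pass:   probability that a holder offers it to a given neighbour
    p_accept: probability that the neighbour accepts an offer
    Returns the set of holders after the given number of rounds.
    """
    holders = set(seeds)
    for _ in range(rounds):
        accepted_now = set()
        for person in holders:
            for neighbour in graph[person]:
                # A holder may offer to someone in A or in B; only
                # offers to B (non-holders) can change the outcome.
                if rng.random() < p_pass and neighbour not in holders:
                    # Rejecting an earlier offer does not prevent
                    # accepting a later one (a change of mind).
                    if rng.random() < p_accept:
                        accepted_now.add(neighbour)
        holders |= accepted_now
    return holders

if __name__ == "__main__":
    # A small hypothetical network with a cycle, so it is not a tree.
    network = {
        0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4],
        3: [1, 4], 4: [2, 3, 5], 5: [4],
    }
    result = spread(network, [0], 0.6, 0.5, 4, random.Random(7))
    print(sorted(result))
```

Separating `p_pass` from `p_accept` reflects the point that transmission depends on both the sender and the recipient, and because the graph may contain cycles, a person can be offered the same information by several neighbours.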
Large networks like Facebook and Twitter may demonstrate some of the properties of scale-free networks. There are a large number of people with a small valence, or number of connections, and a small number of people with a large number of connections. If you plot the number of people against the number of connections, you get something that looks like the diagram above: a curve that falls away asymptotically on ordinary axes, or a roughly straight line on logarithmic axes. Recent research here suggests that, while we cannot precisely fit social networks to the formal mathematical definition of scale free, there is no other mathematical model that fits them better than the scale-free model.
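The shape of that curve can be reproduced with a toy network. The sketch below grows a graph by preferential attachment, one common way to generate networks with scale-free-like degree distributions; it is not a model of Facebook or Twitter. All function names are hypothetical, and the straight-line fit is a naive least-squares fit that is only meant to show the downward slope on logarithmic axes.

```python
import math
import random
from collections import Counter

def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes. Each new node links to m distinct
    existing nodes chosen with probability proportional to degree."""
    # Start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `stubs` once per edge end, so a uniform
    # draw from this list is a degree-proportional draw.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new, target))
            stubs.extend((new, target))
    return edges

def degree_counts(edges):
    """Map each degree to the number of nodes that have it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def loglog_slope(counts):
    """Least-squares slope of log(node count) against log(degree)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

if __name__ == "__main__":
    counts = degree_counts(preferential_attachment(5000, 2, random.Random(1)))
    for k in sorted(counts)[:5]:
        print(f"degree {k}: {counts[k]} nodes")
    print("largest degree:", max(counts))
    print("log-log slope:", round(loglog_slope(counts), 2))
```

Most nodes end up with the minimum number of connections while a handful become hubs, which is the pattern described above. A negative slope on log-log axes is consistent with a power law, but as the research mentioned above cautions, it does not by itself prove that a network is formally scale free.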
Scale-free networks exist in a number of fields, including biochemistry, technology and finance. In scale-free networks, the speed and extent of information propagation are closely tied to the size of the network. Twitter and Facebook are networks that have massive size, but also have mathematical properties that put them precisely in the range for information to propagate extremely rapidly. On Twitter, for example, a piece of information can reach almost 50 million users in just 8 steps of information exchange. The power of these networks is scary!
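Counting how many people sit within a given number of steps of a source is a breadth-first search. The sketch below is a hypothetical helper, `reach_by_step`, that reports an upper bound on reach: it assumes every exchange succeeds, which real propagation does not.

```python
from collections import deque

def reach_by_step(graph, source, max_steps):
    """Return a list whose entry s is the number of people within
    s steps of `source` (the source itself is counted at step 0).

    graph: dict mapping each person to a list of neighbours
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        if distance[person] == max_steps:
            continue  # do not explore beyond the step limit
        for neighbour in graph[person]:
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    reach = [0] * (max_steps + 1)
    for d in distance.values():
        # Someone first reached at step d is within every later step too.
        for s in range(d, max_steps + 1):
            reach[s] += 1
    return reach

if __name__ == "__main__":
    # Hypothetical star-shaped network: person 0 is a hub with 5 contacts.
    star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
    print(reach_by_step(star, 1, 2))
```

Starting from a peripheral person in the star, everyone is within two steps because the route runs through the hub. This illustrates why well-connected individuals shorten the path to the rest of a network.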
When you look at your social media feed, you will often be told what is ‘trending’. Social media engines use the study of information propagation to design algorithms to automatically determine when a topic seems to be outpacing other topics in how it is moving through their network.
Simple trend models follow a ‘topic’ or ‘tag’ within a certain statistical window. The window, indicated in the diagram above, shows the expected range of transmission activity over time for most topics on the network. Live tracking of these topics can reveal those whose activity curves unexpectedly jump outside this statistical window, and a topic is regarded as ‘trending’ if the jump is substantial. For example, the red dot above may represent the point in time and activity at which the network decides to classify the topic as a trend.
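One way to picture such a window is as a band around the average activity of past topics. The sketch below is an assumption-laden illustration, not any network's actual algorithm: it defines the window as the mean plus or minus z standard deviations per time bin, and treats the first point above the upper bound as the moment a topic starts trending. The names `expectation_window` and `first_breakout`, and the choice of z, are hypothetical.

```python
from statistics import mean, stdev

def expectation_window(history, z=3.0):
    """Build a per-time-bin window from past activity curves.

    history: list of equal-length activity curves (counts per time bin)
    Returns (lower, upper), each a list with one bound per time bin,
    set at the mean minus or plus z standard deviations.
    """
    lower, upper = [], []
    for bin_values in zip(*history):
        centre, spread = mean(bin_values), stdev(bin_values)
        lower.append(centre - z * spread)
        upper.append(centre + z * spread)
    return lower, upper

def first_breakout(curve, upper):
    """Index of the first time bin where the curve exceeds the upper
    bound of the window, or None if it never does."""
    for t, (value, bound) in enumerate(zip(curve, upper)):
        if value > bound:
            return t
    return None

if __name__ == "__main__":
    # Hypothetical activity counts for three past topics.
    past = [[10, 12, 11], [9, 11, 13], [11, 13, 12]]
    _, upper = expectation_window(past)
    print(first_breakout([10, 14, 40], upper))
    print(first_breakout([10, 12, 12], upper))
```

The threshold z plays the role of deciding whether a jump is ‘substantial’, and picking it is exactly the kind of judgment call that makes this approach hard to scale.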
However, this method of tracking a single trend against an expectation window can result in errors and has a high dependence on human judgment which is not feasible across extremely large networks. Trends can occur in unpredictable ways, or can quickly ‘untrend’ soon after they have been classified as trends, which can be embarrassing for the predictor.
In 2012, a study at MIT found a fully automated way of predicting trends on Twitter, on average about an hour and a half before Twitter itself flagged the trend, with a 95% accuracy rate. The researchers used a database of prior topics that did and did not trend, and compared the tweeting activity of any current topic to that database, calculating the distance between the activity curves. Topics whose curves were closest to the curves of prior topics that trended were themselves predicted to trend. This study gave rise to the most common methods used by social networks today to predict trends.
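The idea of comparing a live activity curve against a database of past curves can be sketched as a nearest-neighbour vote. This is a simplified stand-in, not the study’s actual method: the Euclidean distance, the majority vote among the k closest curves, and the names `curve_distance` and `predict_trend` are all assumptions made for illustration.

```python
import math

def curve_distance(a, b):
    """Euclidean distance between two equal-length activity curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_trend(curve, trended, not_trended, k=3):
    """Predict whether `curve` will trend by majority vote among the
    k reference curves closest to it.

    trended / not_trended: lists of past activity curves with known
    outcomes, each the same length as `curve`.
    """
    labelled = [(curve_distance(curve, ref), True) for ref in trended]
    labelled += [(curve_distance(curve, ref), False) for ref in not_trended]
    labelled.sort(key=lambda pair: pair[0])
    nearest = labelled[:k]
    votes_for = sum(1 for _, did_trend in nearest if did_trend)
    return votes_for * 2 > len(nearest)

if __name__ == "__main__":
    # Hypothetical reference data: early activity of past topics.
    trended = [[1, 2, 4, 8], [1, 3, 6, 12], [2, 4, 8, 16]]
    not_trended = [[1, 1, 1, 1], [2, 2, 1, 1], [1, 2, 2, 2]]
    print(predict_trend([1, 2, 5, 10], trended, not_trended))
    print(predict_trend([1, 1, 2, 1], trended, not_trended))
```

Because the comparison relies only on labelled examples, no one has to define in advance what a trending curve should look like, which is what removes the dependence on human judgment.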
In the next piece in this series, I will look at the data systems that best capture data in ways that facilitate efficient network analytics, and how this has enabled many fascinating discoveries. Read it here.