Someone somewhere right now is building a model. Many, many people in fact. Whether for a business, an academic study or even personal interest, people have been using mathematics more and more to model real-world phenomena in order to generate insight or to make decisions on how to control or respond to those phenomena.
More recently — enabled by greater computing power — modelling has become more complex. Instead of a few cells in an Excel spreadsheet, models are being built on various platforms and in various programming languages. Some are based on small data, and some are based on huge data. The efforts to create them can range from a few hours to an iterative project lasting months or even years.
But often the creators of these models don’t ask enough questions before they start. They can just jump into it without thinking. Grab some data, set up some formulas and you are off. I’ve learned over many years working in mathematics and statistics that the success of your model depends a great deal on the up-front thinking that goes into it before you even open a data file.
In particular, there is one question which I always ask at the very beginning, and one which I believe analysts, data scientists and other modelers should always ask: Is my model supposed to be explanatory or predictive?
It’s probably obvious from the words, but an explanatory model is created to help understand why something is happening. It can help answer questions like: why does this disease seem to occur in these types of people? What might have caused temperature surges? A predictive model is created to make predictions as accurately as possible regarding what will happen — it will answer questions like: how many people can we expect to visit this shopping mall tomorrow? How many votes will each political party get at the next election?
One way to illustrate this quite simply is to use the analogy of a lemonade stand owner. The lemonade stand owner would use an explanatory model to understand the reasons why her customers like her product, or why she has more customers in the middle of the day versus in the evening — she’s basically interested in the lemonade and why it sells. However, if her main aim is to make sure she has enough lemons for the rest of the week, she would use a predictive model to help her with that.
A single model can rarely achieve both goals optimally. I don’t think I have ever built a model that is both great at explaining a phenomenon and equally great at predicting that phenomenon. And there are good reasons for this. In this article I will lay out how this choice affects every part of how you build a model, starting with the initial data inputs all the way to how you measure its effectiveness.
1. Choices of input data (one off or repeated use)
If the model is to be explanatory, then the modelling process is to happen only once, or on occasion in the future. The priority is to get the deepest possible understanding of the question. Therefore no data source is out of scope. Data that is poorly formatted and needs substantial cleaning can go on this list. Even old data that does not exist electronically and is still in filing cabinets could be considered for digitization in an effort to be as exhaustive as possible. Equally, certain data might be removed from the model for the purpose of unearthing deeper explanatory variables. In a medical model, age might be removed because it is a known factor in disease susceptibility and it might dominate the model and disguise other important factors.
A predictive model is designed to be run again and again so that the relationship identified in the training set can be utilized to make predictions based on new data that is fed into the model. Therefore the data is selected primarily based on how available it will be to run through models in the future. In many modern day contexts this often means predictive models are restricted to only use data that are in connected sources, readily available and pre-formatted to work with the model. In addition, usually the primary goal is accurate prediction, and so any data that helps improve the accuracy of the prediction is in play (although there should usually be a healthy discussion on the trade off between accuracy and inductive bias in predictive models).
2. Modelling techniques used (interpretable or ‘black box’)
For an explanatory model, modelling techniques that lend themselves well to interpretation are critical. Being able to draw insight out of the model is of supreme importance. In logistic regression, odds ratios can help us understand the degree to which an input variable influences the dependent variable. Simpler decision tree models can serve a useful explanatory purpose, because they can help identify and quantify the impact of certain decision points on the result.
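To make the odds ratio idea concrete, here is a minimal sketch for the simplest possible case: one binary input and one binary outcome. In that special case the odds ratio a logistic regression would report for the input is the same as the ratio of the odds in the two groups, so it can be computed directly from counts. The function name and the counts below are my own, invented purely for illustration.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome in the exposed group divided by the odds in the unexposed group."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical counts (not real data): 30 of 100 exposed people have the
# disease, versus 10 of 100 unexposed people.
ratio = odds_ratio(30, 70, 10, 90)
print(f"Odds ratio: {ratio:.2f}")
```

An odds ratio above 1 suggests the input is associated with higher odds of the outcome, and below 1 with lower odds. With several inputs you would fit the regression itself and read off one odds ratio per input.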
Predictive modelling has little regard for interpretability. You may have heard the term ‘black box model’ to describe a model which maximizes predictive power but is far too complex in nature to tease out the influence of the individual input factors. Neural networks are quite common black box models. They are highly complex under the hood and make decisions based on many hundreds or thousands of simulated and interconnected neurons, each one acting on a behavior learned from the training set.
3. Measuring the performance of the model (fit versus accuracy)
Explanatory models are judged primarily by the insights they produce and their overall goodness of fit. The goodness of fit is a measure of the closeness between the expected values of the dependent variable and the actual observed values. It is possible for an explanatory model to generate valuable insights even if the overall fit is poor. This is quite common, for example, in the social sciences, the field I primarily work in. Typical measures used in the results of explanatory modelling include odds ratios, R-squared values (including pseudo R-squared values), chi-squared tests and G-tests.
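As a rough illustration of goodness of fit, the sketch below computes the ordinary R-squared: one minus the ratio of the unexplained variation to the total variation in the observed values. The function name and the numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def r_squared(observed, fitted):
    """Share of the variation in the observed values that the fitted values account for."""
    mean_observed = sum(observed) / len(observed)
    # Variation the model leaves unexplained.
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Total variation of the observed values around their mean.
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Hypothetical observed values and the values a model fitted to them.
observed = [3.0, 5.0, 7.0, 9.0]
fitted = [4.0, 4.0, 8.0, 8.0]
fit = r_squared(observed, fitted)
print(f"R-squared: {fit:.2f}")
```

A value close to 1 means the fitted values track the observed ones closely. A low value does not by itself make an explanatory model useless, since individual effects can still be informative.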
Predictive models live or die based on their accuracy. Accuracy measurement usually involves a calculation of the error in a regression model, or the tradeoff between true positives and false positives in a classification model. Measures such as mean absolute error and root mean squared error will typically be used to describe how well a regression model makes predictions. For a classification model, precision, recall, the area under an ROC curve or the F1-score (often preferred when the classes are imbalanced) are more typical measures of predictive accuracy.
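The accuracy measures named above are simple to compute by hand. The sketch below is a minimal version under my own assumptions: the function names are mine, the classification labels are coded as 1 for positive and 0 for negative, and the lemonade numbers are invented for illustration.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def precision_recall_f1(actual, predicted):
    """Precision, recall and F1-score for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical regression example: cups of lemonade sold per day versus predictions.
cups_sold = [100, 120, 90, 110]
cups_predicted = [110, 115, 80, 110]
mae = mean_absolute_error(cups_sold, cups_predicted)
rmse = root_mean_squared_error(cups_sold, cups_predicted)

# Hypothetical classification example: did the stand sell out (1) or not (0)?
sold_out = [1, 0, 1, 1, 0, 0]
sold_out_predicted = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(sold_out, sold_out_predicted)

print(f"MAE: {mae}, RMSE: {rmse}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

In practice these numbers should come from data the model did not see during training, otherwise they say little about how it will predict in the future.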
I have learned the habit over the years of putting myself in the shoes of the lemonade stand owner. Am I interested in the lemonade or the lemons? It’s a really good habit which I hope you can pick up.