Over the time I’ve spent involved in analytics professionally, I’ve seen some common howlers. I can’t necessarily blame the people who committed them, because they were never taught that they were bad things to do.
Statistics classes at college are still overly theoretical. It’s not as bad as it used to be, now that data science has encouraged more learning based on real data and case studies. But there are still too many formulas and not enough practical advice about the things that are likely to happen once you enter the real world. After all, isn’t that what statistics is supposed to be about?
If I were teaching a statistics class in college, I would probably call it ‘Statistics for the Real World 101’. Here are a few things I’d fail people for.
1. Averaging averages
This is one I see so often. Someone has calculated an average metric for a whole bunch of subgroups, and then wants to give the average metric for the entire population. So they just average the averages. This is almost always the wrong thing to do.
Unless the data for every subgroup is commensurable, of approximately the same cardinality and of similar representativeness (which is basically never), averaging the averages will artificially inflate or deflate the genuine metric across the entire population, because small subgroups get the same weight as large ones. Try it on World Bank data on women’s representation in the workforce, for example, and it makes the figure look a lot higher than it really is.
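To get a feel for the size of the distortion, here is a minimal sketch with made-up numbers. The country names and figures are hypothetical, not the World Bank data. It compares the naive average of subgroup averages with the figure you get when each subgroup is weighted by its size.

```python
# Hypothetical subgroups: (name, workforce size, share of the workforce that is women).
# These numbers are invented for illustration only.
subgroups = [
    ("Smallland", 1_000_000, 0.50),
    ("Midland", 10_000_000, 0.40),
    ("Bigland", 100_000_000, 0.20),
]

# Naive approach: average the subgroup averages, ignoring subgroup size.
naive = sum(share for _, _, share in subgroups) / len(subgroups)

# Weighted approach: weight each subgroup average by its size, which is the
# same as dividing the total number of women workers by the total workforce.
total_workers = sum(size for _, size, _ in subgroups)
weighted = sum(size * share for _, size, share in subgroups) / total_workers

print(f"Average of averages: {naive:.3f}")
print(f"Weighted average:    {weighted:.3f}")
```

In this toy case the small subgroup with the high share drags the naive figure well above the weighted one, which is the kind of inflation described above.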
2. Ignoring range restriction
This one pops up a lot if you work in any environment where you have to analyze a process in which data points drop out over time, for example a selection process. A common situation is where people want to know if the information from earlier in the process can predict something later in the process. Maybe, for example, you want to correlate interview ratings with subsequent job performance.
I often see people ignore the fact that the data points later in the process are a subset of the data points earlier in the process. Data points have dropped out because of the selection in between. Often they conclude that the correlation is low or zero and use that as a basis to denigrate the earlier stages of the process as being not predictive of later stages.
This can be a major problem, especially if the process is highly selective, for example if only a small fraction of those at the early stages made it to the later stages. The early-stage scores of the data points that made it through are often high and compressed, because data points with lower scores didn’t make the cut. That compression alone pulls the observed correlation down.
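A small simulation makes the effect visible. This is only a sketch under stated assumptions: the data is simulated, the true correlation of about 0.5 and the 10% selection rate are values picked for illustration, and `pearson` is a helper written here rather than a library function.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated candidates: performance = 0.5 * rating + noise, so the true
# correlation across all candidates is about 0.5 (an assumed value).
ratings = [rng.gauss(0, 1) for _ in range(20_000)]
performance = [0.5 * r + rng.gauss(0, 0.75 ** 0.5) for r in ratings]

# Selection: only the top 10% of ratings are hired, so in real life
# performance would be observed for them alone.
cutoff = sorted(ratings)[int(0.9 * len(ratings))]
hired = [(r, p) for r, p in zip(ratings, performance) if r >= cutoff]

r_all = pearson(ratings, performance)
r_hired = pearson([r for r, _ in hired], [p for _, p in hired])
print(f"All candidates: r = {r_all:.2f}")
print(f"Hired only:     r = {r_hired:.2f}")
```

The rating is just as predictive for the hired group as for everyone else in this setup, yet the correlation measured on the hired group alone comes out much lower.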
There are ways of correcting a correlation for range restriction. Here is a commonly used formula:
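One widely used correction for direct selection on the predictor is Thorndike's Case II: the corrected correlation is r·U / sqrt(1 + r²(U² − 1)), where r is the correlation in the restricted group and U is the ratio of the unrestricted to the restricted standard deviation of the predictor. Treating this as the formula meant here is an assumption, and the function name and input numbers below are hypothetical.

```python
import math

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction.

    r_restricted    -- correlation observed in the selected group
    sd_unrestricted -- SD of the predictor among all applicants
    sd_restricted   -- SD of the predictor among those selected
    """
    u = sd_unrestricted / sd_restricted
    return r_restricted * u / math.sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))

# Hypothetical inputs: r = 0.23 among hires, whose ratings have an SD of 0.4,
# against an SD of 1.0 across all applicants.
corrected = correct_range_restriction(0.23, 1.0, 0.4)
print(f"Corrected correlation: {corrected:.2f}")
```

When the two standard deviations are equal there is no restriction, U is 1, and the function returns the observed correlation unchanged.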
But I want to be clear — all formulas are unreliable if the restriction is substantial. In these situations, if correlation analysis does not reveal anything notable, I simply declare that we cannot conclude anything because of range restriction issues.
3. Using linear regression on a binary outcome
I think when people do this, they either forget their statistics lessons completely or they just slept through them.
Linear regression is a pretty simple process which basically facilitates the prediction of a variable that has a continuous numerical scale. Like the price of a car. Intercepts and coefficients are determined and applied directly to the new inputs to determine a predicted value. Model fit is easy to determine using sum of squares (an extension of Pythagoras’ Theorem to calculate distance).
Trying to use this method on binary outcomes is a really bad idea. Most of the underlying assumptions in linear regression about variance and residual error are violated, and the output is not designed to predict a simple binary outcome. It’s madness, and an indicator that the person who is doing it is not particularly well-trained in statistics, or scared of logistic regression, or something!
Some people try a Linear Probability Model as a way of making linear regression methods work with binary outcome data. I am not at all convinced of this, and it doesn’t solve the fact that probabilities outside the [0, 1] range can still occur.
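The out-of-range problem is easy to reproduce. The sketch below fits ordinary least squares by hand to a tiny invented dataset with a binary outcome and prints the fitted values; the data and variable names are purely illustrative.

```python
# Toy data (invented): x is a score from 0 to 9, y is a binary outcome
# that is 0 for low scores and 1 for high scores.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Ordinary least squares for a single predictor.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Fitted values are meant to be read as probabilities, but nothing in the
# method keeps them between 0 and 1.
predictions = [intercept + slope * x for x in xs]
for x, p in zip(xs, predictions):
    flag = "" if 0 <= p <= 1 else "  <- not a valid probability"
    print(f"x = {x}: predicted 'probability' = {p:.2f}{flag}")
```

The fitted line dips below 0 at the low end and climbs above 1 at the high end. Logistic regression avoids this by passing the linear score through a sigmoid, which always lands strictly between 0 and 1.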
4. Putting all your eggs in the p-basket (or having no p-basket at all)
p-hacking has become a topic of growing awareness in the statistics and data science community in recent years. People are becoming less and less comfortable with the idea that a cold, hard significance line is the sole determinant of whether something is deemed worthy of communication as an analytical insight.
I often see two extremes of this problem. At one extreme, p-values are ignored completely, so some pattern that has a p-value of 0.5 is raised as an insight. At the other, there is too much dependence on the p < 0.05 boundary.
This is where common sense goes out of the window. Instinctively, whether a pattern in data is notable or not depends on the effect it is having and whether that effect could be considered ‘unusual’ or not. This speaks to some element of judgment from the statistician:
- If the data is really big, even a minuscule effect can pass the p < 0.05 condition. The important thing is that the effect is minuscule.
- If the data is not so big, but the effect seems to be, then p < 0.05 should not be the sole consideration in whether or not this insight is notable.
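Both cases can be put into numbers with a simple two-sample z-test. This is a rough sketch: it assumes two equal-sized groups with a known common standard deviation, and the effect sizes and sample sizes are hypothetical.

```python
import math

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (a z-test)."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(z / math.sqrt(2))

# Hypothetical scenario A: huge sample, tiny effect (0.01 standard deviations).
p_big_data = two_sample_z_p_value(mean_diff=0.01, sd=1.0, n_per_group=1_000_000)

# Hypothetical scenario B: small sample, large effect (0.5 standard deviations).
p_small_data = two_sample_z_p_value(mean_diff=0.5, sd=1.0, n_per_group=20)

print(f"Tiny effect, a million per group: p = {p_big_data:.2g}")
print(f"Large effect, twenty per group:   p = {p_small_data:.2g}")
```

The tiny effect clears p < 0.05 with room to spare, while the effect fifty times larger does not. The p-value on its own says nothing about which of the two is worth anyone's attention.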
5. Using bad language
I don’t mean swearing (although I’ve done plenty of that with some of the datasets I’ve had to deal with). I mean not writing your insights and conclusions accurately.
Language is so important in helping others understand what they can conclude from the analysis. I often see poorly crafted language which can lead people to the wrong conclusions, for example suggesting that a causative relationship exists when there is no such evidence, or not appropriately qualifying conclusions.
For example, look at this correlation matrix of the James Bond movies:
It’s so tempting to say that Bond’s drinking and killing are what increase the movie budgets, but that assumes a causality which we haven’t proved. More than likely, longer movies involve more drinking and killing and also cost more to make. But we can’t say this conclusively either. The language I would use here is simple: ‘Movie budgets correlate primarily with the number of Martinis drunk and the number of people Bond kills.’ That’s still a pretty entertaining conclusion!
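To see how a third variable can manufacture a correlation like this, here is a small simulation. It is not the Bond data: the numbers are simulated, and the setup (film length driving both Martinis and budget, with no direct link between the two) is an assumption built in on purpose.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(7)  # fixed seed so the run is repeatable

# Simulated films (standardized units): length drives both variables,
# and neither Martinis nor budget influences the other.
length = [rng.gauss(0, 1) for _ in range(5_000)]
martinis = [m + rng.gauss(0, 1) for m in length]
budget = [m + rng.gauss(0, 1) for m in length]

r_raw = pearson(martinis, budget)
# Correlation after taking film length out of both variables.
r_partial = pearson(residuals(martinis, length), residuals(budget, length))

print(f"Martinis vs budget:                  r = {r_raw:.2f}")
print(f"Martinis vs budget, length removed:  r = {r_partial:.2f}")
```

The raw correlation is substantial even though nothing in the simulation lets Martinis affect budgets, and it all but vanishes once length is accounted for. That is why 'correlates with' is the safer phrasing.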