There was a time when the only way I knew how to do analysis was to use a spreadsheet. I spent over 15 years building highly complex models and analytics on those screens with the rectangular boxes that so many have come to rely on.
Then, in early 2016, it all changed for me. The fuss had started about R and Python, and I asked a few people who were ‘in the know’ which software they believed to be the best for conducting the widest range of analytics. The unanimous answer I received pointed to R (probably due to the people I spoke to). So I decided I am going to learn this thing, and find out if all the hype is worth it.
Six months later, after a lot of late nights and weekends, and through finding whatever chances I could to solve work related problems using R, I was in a situation where I could not bear the idea of working in a spreadsheet again. I’m serious. Today it is a chore to open them, and the only reason I do so is because I communicate with some people who are in the same position as I was 3 years ago.
I look at functions like
VLOOKUP and I compare them to
dplyr::left_join(). It’s like I am standing on a street littered with trash, and on one side is an unfortunate man who has a small litter picker and is picking the pieces up one by one, and on the other side is someone carrying the biggest, most powerful litter vacuum you’ve ever seen.
So I believe spreadsheets are not good for us. They trap us in this narrow set of views and options, use up all our computer’s memory on their pretty look and feel, decide to sulk and storm off to the other room when there’s too much data, and all because (like our kids with Snapchat) we have become addicted to instant visual-analytic gratification.
There’s another reason spreadsheets have caused harm over the years — they have created a corporate reproducibility crisis.
What is a reproducibility crisis?
An important part of the scientific method is the requirement to reproduce results in order to validate them. Most academics agree that, at least in the social and life sciences, we are in the middle of a reproducibility crisis. Psychology is a field that has been heavily tarred with this brush in recent years. The speed at which ‘research’ is published in recent years has led to a lot of questionable research practices. In a survey of 2,000 psychologists, the majority admitted to using at least one questionable research practice, with false positives and confirmation bias being rife among respondents. The extreme end of this is, of course, outright fraudulent research, which psychology and other fields have by no means been free of.
The rise of ‘pop psychology’ publications and the massively increasing quotation of psychology theories and research in the general media has led to situations where unvalidated ideas are gaining ground as ‘truth’ with no attempt to critique or validate the results that gave rise to the theories. In attempts to study the extent of the reproducibility crisis in psychology, studies have found that less than half of research that was conducted to validate previous research did actually validate it.
For a field like psychology, this is potentially very damaging. We have seen a positive trend in recent years towards increasing trust in the value of psychologists and psychometricians in business and enterprise as part of the increasing recognition of the strategic importance of talent. But continued undisciplined publishing will no doubt drive increasing confusion and lack of clarity in the field — there will likely come a time when you can find something that backs up whatever point you want to make. This dilution will harm the field in the long term.
What have spreadsheets got to do with this?
Putting academia to one side for a moment, I propose that in enterprise or corporate contexts analytics needs to be reproducible to be successful. Often it’s not necessarily to validate previous results (although I think that’s always a good thing to do), but commonly similar analysis needs to be repeated for operational reasons. We are seeing increasing complexity in the nature and type of analysis that is being conducted in enterprise settings, and it is very dangerous to conduct such analysis once and leave no straightforward way of reproducing it in the future. The analysis might need to be run again when the context changes, or another group within the business may need to try to repeat it for a different purpose,
Spreadsheets make it much, much harder to reproduce analytics, particularly if it is complex in nature. Here are a few reasons why:
- The structure of spreadsheets are chaotic and non-linear. To trace a particular value you have to find the dependent values located wherever the original analyst chose to locate them. You end up finding them in some hidden sheet called “Unimportant” in cell EE:1254. What?
- Spreadsheets make commentary difficult. It’s challenging for the author to explain what they are doing when it spans multiple sheets and cells.
- Spreadsheets confine users to a common, narrow operating environment. It makes it very challenging for non-spreadsheet users to easily and intuitively dig into and understand what has been done. And trust me, non-spreadsheet users are on the increase.
Open source data science languages overcome all of these challenges. Everyone has access to the software and code base they need. Code is written in a linear, logical way and commenting is easy so that fellow users can follow your logic and understand exactly what you are doing at each step. The ability to integrate code into vertical documents through Markdown and Jupyter notebooks allows us to write full research descriptions containing methodology, code and results.
What should we do about this?
If you are an analytics leader in an enterprise where the majority of the work is still being done using spreadsheets, then you are well and truly trapped in the corporate reproducibility crisis. It’s time to bite the bullet and make a move towards coding languages. It’s not easy, and the beginnings can be rough, but with the right commitment and resources it’s very doable.
If you are already on that journey, here are a few things I insist on as best practices to aid reproducibility of work:
- Put encouragement and incentives in place to ensure that all code is commented liberally
- Ensure that all new research is written in a vertical document to include embedded code (like an R Markdown or Jupyter Notebook).
- If a piece of work needs to be reproduced from a prior spreadsheet approach, don’t bother trying to pick the spreadsheet apart, just start from the very beginning, define the requirements again and build it from scratch in a new environment. Picking a spreadsheet apart is like asking that man with the trash picker to place all his pieces back in their original position before we vacuum them all up again with that powerful vacuum. It’s pointless, wasted effort.
A future where we can reproduce analytics with ease is an exciting one, and increasingly important to achieve . But spreadsheets won’t help us get there.