It struck me recently, while collaborating with a number of other tidyverse users, that many people are not aware of everything this collection of packages offers to help with their day-to-day data wrangling. In particular, two critical packages have had major updates in the past year and have introduced new features which I regard as transformative, allowing users to step up a gear in the control of their data and in the efficiency of their code.
In late 2019, tidyr 1.0.0 was released. Of many updates, the key one was the introduction of the functions pivot_longer() and pivot_wider() to better manage and control transformations of dataframes between wide and long form, one of the most common data-wrangling tasks. Replacing gather() and spread(), these new functions give users more control over the specifics of the transformation and cut the time spent tailoring outputs.
In early 2020, dplyr 1.0.0 was released. There was a vast scope to the new functionality that came into play with this release, but in particular the introduction of across() and c_across() as adverbs to be used with summarise() and mutate() reduced the number of scoped variants that users needed to work with and, like the tidyr changes, allowed greater control of what the output looked like.
Both of these updates took advantage of major innovations in the R ecosystem, including updates in glue among others.
So if you haven’t checked out these updates, it’s a good time to check back in with the tidyverse packages. In this article I want to show you how they can make your life significantly easier and how you can use them to wrangle data with less code. I’ll do this by showing five simple examples of things you can do which you may not know about.
1. Combine column names however you want in pivot_wider()
The whole idea of pivot_wider() is that you want to take data that is in long form and transform it to wide form. For example, let’s say your data looks like this:
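As a sketch of what such long-form data might look like, the snippet below assumes the storms dataset that ships with dplyr, trimmed to the columns discussed next. The choice of dataset and the name storms_long are assumptions for illustration only.

```r
library(dplyr)

# Assumption: the long-form data is dplyr's built-in `storms` dataset,
# reduced to the storm name, year, status and pressure columns.
storms_long <- dplyr::storms %>%
  select(name, year, status, pressure)

# One row per storm observation
head(storms_long)
```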
Now let’s say that you are interested in seeing the mean and median pressure by year for each storm status. You can use pivot_wider(), which cleverly knows what you are trying to do and pastes column names together by default:
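A minimal sketch of that default behaviour, again assuming dplyr's storms dataset. The summary column names mean and median and the object names are hypothetical choices.

```r
library(dplyr)
library(tidyr)

# Assumption: summarise dplyr's `storms` data to one row per year and status
storm_summary <- dplyr::storms %>%
  group_by(year, status) %>%
  summarise(
    mean = mean(pressure),
    median = median(pressure),
    .groups = "drop"
  )

# With several value columns, pivot_wider() pastes the value column name
# and the status together, separated by "_" (for example mean_hurricane)
storm_wide <- storm_summary %>%
  pivot_wider(
    names_from = status,
    values_from = c(mean, median)
  )

names(storm_wide)
```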
You also have the benefit of the names_glue argument, which allows you to structure the combined column names as you wish using glue syntax, which is simple and intuitive:
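For instance, a sketch that puts the status first and the statistic second. The storms data and the particular glue template are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Assumption: same yearly summary of dplyr's `storms` data as a starting point
storm_wide_glue <- dplyr::storms %>%
  group_by(year, status) %>%
  summarise(
    mean = mean(pressure),
    median = median(pressure),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = status,
    values_from = c(mean, median),
    # {status} comes from names_from, {.value} is the value column name
    names_glue = "{status}_{.value}"
  )

names(storm_wide_glue)
```

Any column listed in names_from can appear inside the braces, along with the special .value placeholder.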
2. Break up column dimensions however you want using pivot_longer()
If you have wide data representing multiple dimensions, you can transform it to long data and break out an arbitrary number of dimensions from the column names using regex in the names_pattern argument of pivot_longer(). So, for example, to move our previous table back to long form we would do this:
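A sketch of the round trip, assuming the wide table was built from dplyr's storms data with default names such as mean_hurricane. The regex and object names are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Assumption: rebuild a wide table with columns like mean_hurricane
storm_wide <- dplyr::storms %>%
  group_by(year, status) %>%
  summarise(
    mean = mean(pressure),
    median = median(pressure),
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = status, values_from = c(mean, median))

# The first regex group becomes the output column name (via ".value"),
# the second group is stored in a new `status` column
storm_long <- storm_wide %>%
  pivot_longer(
    cols = -year,
    names_to = c(".value", "status"),
    names_pattern = "^(mean|median)_(.*)$"
  )

head(storm_long)
```

Year and status combinations that were absent from the original data come back as rows of NA, and setting values_drop_na = TRUE is one way to remove them.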
3. Summarise or mutate across as many columns as you want using across()
across() is a new adverb in dplyr that allows you to flexibly work with an arbitrary number of columns and better control your output. Previously, if you wanted to do numerous operations on numerous columns, you would need to use scoped variants of summarise() or mutate() to achieve this, and the default output was not easy to control. For example:
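A sketch of the older scoped style, using mtcars and summarise_at(). The choice of columns and functions is an assumption for illustration.

```r
library(dplyr)

# Scoped variant: the output names are fixed as <column>_<function>,
# with no simple way to reorder the two parts
scoped_result <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), list(mean = mean, median = median))

names(scoped_result)
```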
Now the adverb across() can be used for all cases to achieve the same goal, so you no longer need to use the _if, _at and _all variants. It acts as a selecting function to be used inside summarise() or mutate(), and provides a .names argument to control the column names of the output using glue syntax.
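A minimal sketch of the same summary written with across(), assuming the {.fn} and {.col} placeholders documented in current dplyr releases. The columns and the naming template are illustrative choices.

```r
library(dplyr)

# across() selects the columns, applies each function, and names the
# results from the glue template: function name first, then column name
across_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(
      c(mpg, hp),
      list(mean = mean, median = median),
      .names = "{.fn}_{.col}"
    ),
    .groups = "drop"
  )

names(across_result)
```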
4. Run models on nested data using nest_by()
It’s very common to need to perform operations on subsets of a dataframe, and this requires you to nest the data according to certain variables. Previously this meant combining several functions, such as dplyr::rowwise(), but the new dplyr::nest_by() function now takes care of all that and cuts down the amount of code you need to type.
Let’s say you want to run a linear model over mtcars, but you want to do it separately for cars with different numbers of cylinders. This is a perfect job for nest_by(). When you use nest_by(), a ‘list column’ of nested dataframes is generated in a column called data, like this:
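A minimal sketch with mtcars nested by cyl (the object name nested is a hypothetical choice):

```r
library(dplyr)

# One row per value of cyl; the remaining columns are packed into
# a list column called `data`, and the result is a rowwise dataframe
nested <- mtcars %>%
  nest_by(cyl)

nested
```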
You can then refer to this column in further commands, allowing you to do quite sophisticated things like run models, or execute your own functions:
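For example, a sketch that fits one linear model per cylinder group. The formula mpg ~ wt is an assumption chosen for illustration.

```r
library(dplyr)

# Because the nested dataframe is rowwise, `data` refers to the
# dataframe for the current row; list() keeps each fitted model
# as a single element of a new list column
models <- mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data)))

models
```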
Now, to make our output prettier and more user friendly, we can use some of the tricks we learned earlier:
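One possible sketch: pull a few statistics out of each model and round them with across(). The model formula, the statistics chosen and the column names are all assumptions.

```r
library(dplyr)

model_summary <- mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%
  # In a rowwise dataframe, `model` is the single lm object for each row
  summarise(
    intercept = coef(model)[["(Intercept)"]],
    slope = coef(model)[["wt"]],
    r_squared = summary(model)$r.squared,
    .groups = "drop"
  ) %>%
  # Round every statistic in one step
  mutate(across(c(intercept, slope, r_squared), ~ round(.x, 3)))

model_summary
```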
5. Generate objects – not just values – using mutate()
The beauty of list columns is that you can store many different things in them, not just values. This now allows you to use our old friend mutate() to generate columns containing not just values, but dataframes, models and even plots.
In this example I create some simple functions to generate scatter and box plots using ggplot2. I then nest my data and mutate new columns that generate these plots.
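A sketch of the idea, using mtcars nested by cyl. The helper functions scatter_fn() and box_fn() and the plotted variables are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: scatter plot of mpg against wt
scatter_fn <- function(df) {
  ggplot(df, aes(x = wt, y = mpg)) +
    geom_point()
}

# Hypothetical helper: box plot of mpg
box_fn <- function(df) {
  ggplot(df, aes(y = mpg)) +
    geom_boxplot()
}

# Each plot is wrapped in list() so it is stored as one
# element of a list column
plots <- mtcars %>%
  nest_by(cyl) %>%
  mutate(
    scatter = list(scatter_fn(data)),
    box = list(box_fn(data))
  )

plots
```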
As you can see, all the plots are now stored inside the two list columns we have created. We can now easily pull them out and use them at our convenience. For example, using