Five Tidyverse Tricks You May Not Know About

It struck me recently, while collaborating with a number of other tidyverse users, that many people are not aware of everything this collection of packages offers to help with their day-to-day data wrangling. In particular, two critical packages have had major updates in the past year and have introduced new features which I regard as transformative: they allow users to step up a gear in the control of their data and in the efficiency of their code.

In late 2019, tidyr 1.0.0 was released. Of many updates, the key ones were the introduction of the functions pivot_longer() and pivot_wider() to better manage and control transformations of dataframes between wide and long form, one of the most common data-wrangling tasks. Replacing gather() and spread(), these new functions give more control over the specifics of the transformation, saving users time when tailoring their outputs.

In mid-2020, dplyr 1.0.0 was released. The release brought a vast range of new functionality, but in particular the introduction of across() and c_across() as adverbs to be used with summarise() and mutate() reduced the number of scoped variants that users needed to work with and, like the tidyr changes, allowed greater control over what the output looks like.

Both of these updates took advantage of major new innovations in the R ecosystem, including updates in rlang, vctrs and glue among others.

So if you haven’t checked out these updates, it’s a good time to check back in with the tidyverse packages. In this article I want to show you how they can make your life significantly easier and how you can use them to wrangle data with less code. I’ll do this by showing five simple examples of things you can do which you may not know about.

1. Combine column names however you want in tidyr::pivot_wider()

The whole idea of pivot_wider() is that you want to take data that is in long form and transform it to wide form. For example, let’s say your data looks like this:
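As a minimal sketch, assume a small made-up table of storm pressures, loosely shaped like a per-year, per-status summary of the storms dataset that ships with dplyr. The name storm_summary, its columns and all of its values are hypothetical, and the statuses are simplified to single words:

```r
library(tibble)

# Hypothetical long-form data: one row per year/status combination.
# All values are invented for illustration.
storm_summary <- tribble(
  ~year, ~status,      ~pressure_mean, ~pressure_median,
  1975,  "hurricane",  980,            982,
  1975,  "depression", 1008,           1009,
  1976,  "hurricane",  975,            977,
  1976,  "depression", 1007,           1008
)

storm_summary
```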

Now let’s say that you are interested in seeing the mean and median pressure by year for each storm status. You can use pivot_wider() which cleverly knows what you are trying to do and pastes column names together by default:
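Here is a rough sketch of that default behaviour, using a small hypothetical table (storm_summary, with invented values and simplified status labels) rather than real storm data:

```r
library(tibble)
library(tidyr)

# Hypothetical long-form data; values are invented for illustration.
storm_summary <- tribble(
  ~year, ~status,      ~pressure_mean, ~pressure_median,
  1975,  "hurricane",  980,            982,
  1975,  "depression", 1008,           1009,
  1976,  "hurricane",  975,            977,
  1976,  "depression", 1007,           1008
)

# One row per year; one column per value-column/status combination.
wide <- pivot_wider(
  storm_summary,
  names_from  = status,
  values_from = c(pressure_mean, pressure_median)
)

names(wide)
```

With the default settings, each new column name is the value column followed by the status, for example pressure_mean_hurricane.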

You also have the benefit of the names_glue argument, which allows you to structure the combined column names as you wish using glue syntax which is simple and intuitive:
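As a sketch of how names_glue might be used, again on a small hypothetical table with invented values, the template below puts the status first and the value column second:

```r
library(tibble)
library(tidyr)

# Hypothetical long-form data; values are invented for illustration.
storm_summary <- tribble(
  ~year, ~status,      ~pressure_mean, ~pressure_median,
  1975,  "hurricane",  980,            982,
  1975,  "depression", 1008,           1009,
  1976,  "hurricane",  975,            977,
  1976,  "depression", 1007,           1008
)

# {status} is filled from the names_from column,
# {.value} from the name of each values_from column.
wide_glued <- pivot_wider(
  storm_summary,
  names_from  = status,
  values_from = c(pressure_mean, pressure_median),
  names_glue  = "{status}_{.value}"
)

names(wide_glued)
```

This gives names such as hurricane_pressure_mean instead of the default pressure_mean_hurricane.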

2. Break up column dimensions however you want using tidyr::pivot_longer()

If you have wide data representing multiple dimensions, you can transform it to long data and break out an arbitrary number of dimensions from the column names using regex in the names_pattern argument of pivot_longer(). So, for example, to move our previous table back to long form we would do this:
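A minimal sketch of the idea, assuming a hypothetical wide table whose column names follow a measure_statistic_status pattern (the table, its values and the output column names measure, stat, status and value are all made up for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical wide data; values are invented for illustration.
wide <- tribble(
  ~year, ~pressure_mean_hurricane, ~pressure_mean_depression,
         ~pressure_median_hurricane, ~pressure_median_depression,
  1975,  980, 1008, 982, 1009,
  1976,  975, 1007, 977, 1008
)

# Each capture group in names_pattern feeds one entry of names_to.
long <- pivot_longer(
  wide,
  cols          = -year,
  names_to      = c("measure", "stat", "status"),
  names_pattern = "(.*)_(.*)_(.*)",
  values_to     = "value"
)

long
```

Each of the three regex groups becomes its own column, so a name like pressure_mean_hurricane is split into measure, stat and status.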

3. Summarise or mutate across as many columns as you want using dplyr::across()

across() is a new adverb in dplyr that allows you to flexibly work with an arbitrary number of columns and better control your output. Previously if you wanted to do numerous operations on numerous columns, you would need to use scoped variants of summarise() and mutate() to achieve this, and the default output was not easy to control. For example:
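As an illustration of the older style, here is a sketch using the built-in mtcars dataset (my choice of data and columns, not necessarily the original example). The scoped variant summarise_at() is superseded but still available in dplyr:

```r
library(dplyr)

# Scoped variant: columns are chosen with vars(), functions passed as a named list.
# Output names are built automatically as <column>_<function>.
old_style <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), list(mean = mean, sd = sd))

names(old_style)
```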

Now the adverb across() can be used for all cases to achieve the same goal, so you no longer need the _if, _at and _all variants. It acts as a selecting function to be used inside summarise() or mutate(), and its .names argument gives you an easy way to control the column names of the output using glue syntax:
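A short sketch of across() with .names, assuming the built-in mtcars dataset and an arbitrary naming template of my own choosing:

```r
library(dplyr)

# across() selects the columns and applies each function in the named list;
# .names controls the output names via {.fn} and {.col}.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}_of_{.col}")
  )

names(by_cyl)
```

The same across() call works unchanged inside mutate(), and the column selection can use any tidyselect helper, such as where(is.numeric).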

4. Run models on nested data using dplyr::nest_by()

It’s very common to need to perform operations on subsets of a dataframe, and this requires you to nest the data according to certain variables. Previously this was only possible by combining functions like tidyr::nest(), dplyr::group_by() and dplyr::rowwise(), but the new dplyr::nest_by() function now takes care of all that and cuts down the amount of code you need to type.

Let’s say you want to run a linear model over mtcars, but you want to do it separately for different cylinder models. This is a perfect job for nest_by(). When you use nest_by(), a ‘list column’ of nested dataframes is generated in a column called data, like this:
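A minimal sketch of that nesting step on mtcars:

```r
library(dplyr)

# One row per value of cyl; the remaining columns are packed
# into a list column called data.
nested <- mtcars %>%
  nest_by(cyl)

nested
```

The result is a row-wise dataframe, which is what lets later steps treat each nested dataframe one at a time.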

You can then refer to this column in further commands, allowing you to do quite sophisticated things like run models, or execute your own functions:
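For instance, a sketch that fits one linear model per cylinder group (the formula mpg ~ wt is my own illustrative choice):

```r
library(dplyr)

# Because nest_by() returns a row-wise dataframe, data refers to the
# nested dataframe for the current row. Wrapping the result in list()
# stores each fitted model in a list column.
models <- mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data)))

models
```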

Now, to make our output prettier and more user-friendly, we can use some of the tricks we learned earlier:
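One possible way to do this, sketched under my own assumptions (an mpg ~ wt model, coefficients extracted with coef() and tibble::enframe(), and a names_glue template of my choosing), is to turn the model coefficients into one tidy row per cylinder group:

```r
library(dplyr)
library(tidyr)
library(tibble)

coef_table <- mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%
  # Store each model's coefficients as a small term/estimate dataframe
  mutate(coefs = list(enframe(coef(model), name = "term", value = "estimate"))) %>%
  ungroup() %>%
  select(cyl, coefs) %>%
  unnest(coefs) %>%
  # Spread the terms into columns, naming them with glue syntax
  pivot_wider(
    names_from  = term,
    values_from = estimate,
    names_glue  = "{term}_estimate"
  )

coef_table
```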

5. Generate objects – not just values – using dplyr::summarise() and dplyr::mutate()

The beauty of list columns is that you can store many different things in them, not just values. This now allows you to use our old friends summarise() and mutate() to generate columns containing not just values, but dataframes, models and even plots.

In this example I create some simple functions to generate scatter and box plots using ggplot2. I then nest my data and mutate new columns that generate these plots.
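A sketch of what this could look like on mtcars; the helper names scatter_fn and box_fn and the choice of variables are hypothetical:

```r
library(dplyr)
library(ggplot2)

# Hypothetical helpers: each takes a dataframe and returns a ggplot object.
scatter_fn <- function(df) {
  ggplot(df, aes(x = wt, y = mpg)) + geom_point()
}

box_fn <- function(df) {
  ggplot(df, aes(y = mpg)) + geom_boxplot()
}

# Nest by cyl, then store one plot of each kind per row in list columns.
plots <- mtcars %>%
  nest_by(cyl) %>%
  mutate(
    scatter = list(scatter_fn(data)),
    box     = list(box_fn(data))
  )

plots
```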

As you can see, all the plots are now stored inside the two list columns we have created. We can now easily pull them out and use them at our convenience. For example, using patchwork:
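A minimal sketch of that last step, assuming plots built from mtcars with a layout of my own choosing (scatter plots on the top row, box plots underneath):

```r
library(dplyr)
library(ggplot2)
library(patchwork)

# Build one scatter plot and one box plot per cylinder group.
plots <- mtcars %>%
  nest_by(cyl) %>%
  mutate(
    scatter = list(ggplot(data, aes(x = wt, y = mpg)) + geom_point()),
    box     = list(ggplot(data, aes(y = mpg)) + geom_boxplot())
  )

# wrap_plots() accepts a list of plots; `/` stacks one row above the other.
combined <- wrap_plots(plots$scatter, nrow = 1) /
  wrap_plots(plots$box, nrow = 1)

combined
```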
