What you need to know about dplyr 1.0.0 – Part 1: The across() adverb

Anyone who codes in R knows about dplyr. It is arguably the defining package of R, designed to make operations on data frames more intuitive for those who buy into the principles of ‘tidy data’ (which would be most data scientists, I suspect). In fact, many people code more fluently in dplyr than in base R. That’s how central dplyr has become in the R ecosystem, along with the other packages that make up the tidyverse.

So the fact that a new version is being released is exciting for most R users. The fact that it’s version 1.0.0 means it’s a real event. Hadley Wickham and the extensive team of open-source developers behind dplyr would not give it this version number lightly. A huge amount of effort has gone into making dplyr more powerful, unifying a number of previously distinct functions under a more abstract umbrella, and above all offering day-to-day users solutions to their most common data frame wrangling problems.

I’ll be writing a series of articles highlighting some of the key developments in dplyr 1.0.0 in preparation for its release next month. If you can’t wait for the release, you can install the development version now with devtools::install_github('tidyverse/dplyr').

In this article I want to highlight one of the key developments of this release – the across() function.

What is across()?

Possibly the most common use of dplyr is pairing group_by() with summarise(). Many beginners to the language learn this on their very first day, and it keeps on giving even to more advanced programmers. It’s a rare day when I don’t group and summarise something.

Grouping and summarising across multiple columns has previously been possible using a limited set of scoped variants of summarise(), such as summarise_if() and summarise_at(). However, there was clearly room to make this more powerful by creating a unifying function which could:

  1. Summarise across an arbitrary set of columns, defined manually or through a condition
  2. Simultaneously apply an arbitrary set of functions to those columns.

So, in short, the new function across() operates across multiple columns and multiple functions within existing dplyr verbs such as summarise() or mutate(). This makes it extremely powerful and time-saving. There is no longer any need for the scoped variants.
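
To make the comparison concrete, here is a minimal sketch on the built-in mtcars data that puts a scoped variant next to its across() equivalent. The object names old_way and new_way are just labels for this illustration; both calls should give the same grouped means.

```r
library(dplyr)

# Scoped variant (the pre-1.0.0 style)
old_way <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), mean)

# The same summary expressed with across()
new_way <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean))

# Compare the two results
all.equal(as.data.frame(old_way), as.data.frame(new_way))
```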

Examples of across() in use

First, you can replicate summarise_at() by manually defining a set of columns to summarise using a character vector of column names, or by using column numbers:

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(c("mpg", "hp"), mean))

# A tibble: 3 x 3
    cyl   mpg    hp
* <dbl> <dbl> <dbl>
1     4  26.7  82.6
2     6  19.7 122. 
3     8  15.1 209. 
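
The example above selects by name. As a hedged sketch of the column-number route, the call below works on ungrouped mtcars, where mpg is column 1 and hp is column 4. I have deliberately left out the grouping here, since I would not want to rely on how positions are counted once grouping columns are involved.

```r
library(dplyr)

# Positions 1 and 4 of mtcars are mpg and hp
by_position <- mtcars %>%
  summarise(across(c(1, 4), mean))

# The same selection by name, for comparison
by_name <- mtcars %>%
  summarise(across(c("mpg", "hp"), mean))

names(by_position)
```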

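You can replicate mutate_if() by using a predicate function to select your columns. Here we turn the name and status columns in the dplyr::storms dataset from character to factor. Recent versions of tidyselect expect the predicate to be wrapped in where(); passing it bare triggers a deprecation warning.

storms %>% 
  dplyr::mutate(across(where(is.character), as.factor)) %>% 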
  dplyr::select(name, status)

# A tibble: 10,010 x 2
   name  status             
   <fct> <fct>              
 1 Amy   tropical depression
 2 Amy   tropical depression
 3 Amy   tropical depression
 4 Amy   tropical depression
 5 Amy   tropical depression
 6 Amy   tropical depression
 7 Amy   tropical depression
 8 Amy   tropical depression
 9 Amy   tropical storm     
10 Amy   tropical storm     
# … with 10,000 more rows
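
The same pattern works for any predicate. As a small sketch on the built-in iris data (a dataset of my choosing for illustration), the following rescales every numeric column by its maximum and leaves the Species factor untouched. It assumes a tidyselect version that provides where().

```r
library(dplyr)

# Divide each numeric column by its own maximum
scaled <- iris %>%
  mutate(across(where(is.numeric), ~ .x / max(.x)))

# Each numeric column now has a maximum of 1; Species is unchanged
sapply(scaled[1:4], max)
```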

You can also apply multiple named functions to multiple columns by passing a named list. By default, across() glues each column name and function name together with an underscore:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(c("mpg", "hp"), list(mean = mean, median = median, sd = sd))) 

# A tibble: 3 x 7
    cyl mpg_mean mpg_median mpg_sd hp_mean hp_median hp_sd
* <dbl>    <dbl>      <dbl>  <dbl>   <dbl>     <dbl> <dbl>
1     4     26.7       26     4.51    82.6       91   20.9
2     6     19.7       19.7   1.45   122.       110   24.3
3     8     15.1       15.2   2.56   209.       192.  51.0

And if you want a different naming pattern, you can supply one through the .names argument using glue syntax:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(c("mpg", "hp"), 
                   list(mean = mean, median = median, sd = sd), 
                   .names = "{col}_{fn}_summ")) 

# A tibble: 3 x 7
    cyl mpg_mean_summ mpg_median_summ mpg_sd_summ hp_mean_summ hp_median_summ hp_sd_summ
* <dbl>         <dbl>           <dbl>       <dbl>        <dbl>          <dbl>      <dbl>
1     4          26.7            26          4.51         82.6            91        20.9
2     6          19.7            19.7        1.45        122.            110        24.3
3     8          15.1            15.2        2.56        209.            192.       51.0

If you need to pass optional arguments to your functions, you can use purrr-style formulas, where .x stands for the current column:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(c("mpg", "hp"), 
                   list(mean = ~mean(.x, na.rm = TRUE), 
                        median = ~median(.x, na.rm = TRUE), 
                        sd = ~sd(.x, na.rm = TRUE)), 
                   .names = "{col}_{fn}_summ")) 

# A tibble: 3 x 7
    cyl mpg_mean_summ mpg_median_summ mpg_sd_summ hp_mean_summ hp_median_summ hp_sd_summ
* <dbl>         <dbl>           <dbl>       <dbl>        <dbl>          <dbl>      <dbl>
1     4          26.7            26          4.51         82.6            91        20.9
2     6          19.7            19.7        1.45        122.            110        24.3
3     8          15.1            15.2        2.56        209.            192.       51.0
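
Because mtcars has no missing values, the na.rm arguments above make no visible difference. A quick sketch on the built-in airquality data, which does contain missing Ozone readings, shows what they buy you. The dataset and the object names naive and robust are my own choices for illustration.

```r
library(dplyr)

# Without na.rm, any group containing an NA returns NA
naive <- airquality %>%
  group_by(Month) %>%
  summarise(across(c(Ozone, Solar.R), mean))

# With a formula, na.rm = TRUE is passed through to mean()
robust <- airquality %>%
  group_by(Month) %>%
  summarise(across(c(Ozone, Solar.R), ~ mean(.x, na.rm = TRUE)))

anyNA(naive$Ozone)   # missing readings propagate into the summary
anyNA(robust$Ozone)  # missing readings are dropped before averaging
```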

And similarly you can use formulas to combine functions to avoid unnecessary extra mutating:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(mpg, 
                   list(minus_sd = ~(mean(.x) - sd(.x)), 
                        mean = mean, 
                        plus_sd = ~(mean(.x) + sd(.x)))))

# A tibble: 3 x 4
    cyl mpg_minus_sd mpg_mean mpg_plus_sd
* <dbl>        <dbl>    <dbl>       <dbl>
1     4         22.2     26.7        31.2
2     6         18.3     19.7        21.2
3     8         12.5     15.1        17.7

Extra tip: c_across() for working row-wise

Similarly, c_across() combines the values from an arbitrary set of columns into a single vector for each row, which makes it the natural companion to rowwise(). Here are a couple of examples:

WorldPhones %>% 
  as.data.frame() %>% 
  rowwise() %>% 
  mutate(mean = mean(c_across(N.Amer:Mid.Amer), na.rm = TRUE))

# A tibble: 7 x 8
# Rowwise: 
  N.Amer Europe  Asia S.Amer Oceania Africa Mid.Amer   mean
   <dbl>  <dbl> <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
1  45939  21574  2876   1815    1646     89      555 10642 
2  60423  29990  4708   2568    2366   1411      733 14600.
3  64721  32510  5230   2695    2526   1546      773 15714.
4  68484  35218  6662   2845    2691   1663      836 16914.
5  71799  37598  6856   3000    2868   1769      911 17829.
6  76036  40341  8220   3145    3054   1905     1008 19101.
7  79831  43173  9053   3338    3224   2005     1076 20243.

starwars %>% 
  rowwise() %>% 
  mutate(
    stuff_they_own = length(c_across(c("vehicles", "starships")))
  ) %>% 
  select(name, stuff_they_own) 

# A tibble: 87 x 2
# Rowwise: 
   name               stuff_they_own
   <chr>                       <int>
 1 Luke Skywalker                  4
 2 C-3PO                           0
 3 R2-D2                           0
 4 Darth Vader                     1
 5 Leia Organa                     1
 6 Owen Lars                       0
 7 Beru Whitesun lars              0
 8 R5-D4                           0
 9 Biggs Darklighter               1
10 Obi-Wan Kenobi                  6
# … with 77 more rows
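
If the list-column example feels opaque, a toy tibble may help. The scores data below is hypothetical, invented purely for this sketch; the point is only that c_across() hands sum() one vector per row.

```r
library(dplyr)

# Hypothetical scores, made up for illustration
scores <- tibble(
  id = c("a", "b", "c"),
  q1 = c(1, 2, 3),
  q2 = c(10, 20, 30)
)

# For each row, collect q1 and q2 into one vector and sum it
totals <- scores %>%
  rowwise() %>%
  mutate(total = sum(c_across(q1:q2))) %>%
  ungroup()

totals$total
# 11 22 33
```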

These are just a few examples of what’s possible with across(). If you discover any other uses that you’d like others to know about, feel free to post a comment.
