Anyone who codes in R knows about dplyr. It’s arguably the defining package of R, designed to make operations on dataframes more intuitive for those who buy into the principles of ‘tidy data’ (which would be most data scientists, I suspect). In fact, many people can code in dplyr better than they can in base R. That’s how central dplyr has become in the R ecosystem, along with the other packages that currently make up the tidyverse.
So the fact that a new version is being released is exciting for most R users. The fact that it’s version 1.0.0 means it’s a real event. Hadley Wickham and the extensive team of open-source developers behind dplyr would not give it this version number lightly. A huge amount of effort has gone into extending dplyr’s functionality, unifying a number of previously distinct functions under a more abstracted umbrella, and above all offering day-to-day users solutions to their most common dataframe-wrangling problems.
I’ll be writing a series of articles highlighting some of the key developments in dplyr 1.0.0 in preparation for its release next month. If you can’t wait for the release, you can start using the development version right away with devtools::install_github('tidyverse/dplyr').
In this article I want to highlight one of the key developments of this release – the across() function.
What is across()?
Possibly the most common use of dplyr is the combination of group_by() and summarise(). Many beginners to the language learn this on their very first day, and it keeps on giving even to more advanced programmers. It’s a rare day when I don’t group and summarise something.
Grouping and summarising across multiple variables/columns has previously been possible using a limited set of scoped variants of summarise(), such as summarise_if() and summarise_at(). However, there was clearly room to make this more powerful by creating a unifying function which could:
- Summarise across an arbitrary set of columns, defined manually or through a condition
- Simultaneously apply an arbitrary set of functions to those columns.
So, in short, the new function across() operates across multiple columns and multiple functions within existing dplyr verbs such as summarise() or mutate(). This makes it extremely powerful and time-saving. There is no longer any need for the scoped variants.
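As a rough illustration of that last point, the sketch below runs the same grouped summary twice on the built-in mtcars data, once with the scoped summarise_at() and once with across(). The object names old_style and new_style are made up for this comparison.

```r
library(dplyr)

# Scoped variant: columns are wrapped in vars()
old_style <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), mean)

# across() equivalent inside a plain summarise()
new_style <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean))

# Compare the two results as plain data frames
all.equal(as.data.frame(old_style), as.data.frame(new_style))
```

The only real difference is where the column selection lives: inside the verb's name and vars() before, inside across() now.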
Examples of across() in use
First, you can replicate summarise_at() by manually defining a set of columns to summarise, using either a character vector of column names or column numbers:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"), mean))
# A tibble: 3 x 3
    cyl   mpg    hp
* <dbl> <dbl> <dbl>
1     4  26.7  82.6
2     6  19.7 122.
3     8  15.1 209.
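The column-number form is mentioned but not shown, so here is a minimal sketch of it. It drops the grouping on purpose: in the ungrouped mtcars data, mpg and hp are columns 1 and 4, whereas with grouped data I would not want to guess how positions interact with the grouping column. The object names by_position and by_name are illustrative only.

```r
library(dplyr)

# Select by position: columns 1 and 4 of mtcars are mpg and hp
by_position <- mtcars %>%
  summarise(across(c(1, 4), mean))

# Select the same columns by bare name
by_name <- mtcars %>%
  summarise(across(c(mpg, hp), mean))

names(by_position)
all.equal(by_position, by_name)
```

Positions break silently if the column order changes, so names are usually the safer choice outside of quick interactive work.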
You can replicate mutate_if() by using a predicate function, wrapped in where(), to select your columns. Here we turn the name and status columns in the dplyr::storms dataset from character to factor.
storms %>%
  dplyr::mutate(across(where(is.character), as.factor)) %>%
  dplyr::select(name, status)
# A tibble: 10,010 x 2
   name  status
   <fct> <fct>
 1 Amy   tropical depression
 2 Amy   tropical depression
 3 Amy   tropical depression
 4 Amy   tropical depression
 5 Amy   tropical depression
 6 Amy   tropical depression
 7 Amy   tropical depression
 8 Amy   tropical depression
 9 Amy   tropical storm
10 Amy   tropical storm
# … with 10,000 more rows
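Since the storms data may differ between dplyr versions, here is the same idea on a small made-up tibble (toy and its columns are hypothetical). It assumes a released dplyr, where, to the best of my knowledge, predicate functions are expected to be wrapped in where() and a bare predicate produces a warning.

```r
library(dplyr)

# Hypothetical data: two character columns and one numeric column
toy <- tibble(
  id = c("a", "b", "c"),
  site = c("north", "south", "north"),
  score = c(1.5, 2.5, 3.5)
)

# Convert only the character columns to factors
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

sapply(toy_fct, class)
```

The numeric score column is left untouched because is.character() returns FALSE for it.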
You can also apply multiple named functions to your multiple columns by using a list. The across() function will by default glue your column and function names together with an underscore:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"), list(mean = mean, median = median, sd = sd)))
# A tibble: 3 x 7
    cyl mpg_mean mpg_median mpg_sd hp_mean hp_median hp_sd
* <dbl>    <dbl>      <dbl>  <dbl>   <dbl>     <dbl> <dbl>
1     4     26.7       26     4.51    82.6       91   20.9
2     6     19.7       19.7   1.45   122.       110   24.3
3     8     15.1       15.2   2.56   209.       192.  51.0
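The names in the list matter. As a small sketch of what happens without them (assuming the behaviour of released dplyr versions, where an unnamed function is labelled by its position in the list), compare the column names produced below. The object name unnamed is illustrative.

```r
library(dplyr)

# Same idea as above, but the list elements have no names
unnamed <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg, list(mean, median)))

# Inspect the generated column names
names(unnamed)
```

Naming the list elements is therefore worth the few extra keystrokes, since positional suffixes say nothing about which function produced which column.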
And if you want to use a different naming pattern, you can supply one in glue syntax through the .names argument:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"),
                   list(mean = mean, median = median, sd = sd),
                   .names = "{.col}_{.fn}_summ"))
# A tibble: 3 x 7
    cyl mpg_mean_summ mpg_median_summ mpg_sd_summ hp_mean_summ hp_median_summ hp_sd_summ
* <dbl>         <dbl>           <dbl>       <dbl>        <dbl>          <dbl>      <dbl>
1     4          26.7            26          4.51         82.6            91        20.9
2     6          19.7            19.7        1.45        122.            110        24.3
3     8          15.1            15.2        2.56        209.            192.       51.0
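The pattern does not have to put the column name first. A minimal sketch, assuming the {.col} and {.fn} placeholder spelling documented for released dplyr versions; the avg label and the renamed object are made up for the example.

```r
library(dplyr)

# Put the function label before the column name
renamed <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), list(avg = mean), .names = "{.fn}_of_{.col}"))

names(renamed)
# "cyl" "avg_of_mpg" "avg_of_hp"
```

Anything outside the braces is copied literally, so prefixes, suffixes and separators are all up to you.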
If you need to pass optional arguments to your functions, you can use formulas:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        median = ~median(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}_summ"))
# A tibble: 3 x 7
    cyl mpg_mean_summ mpg_median_summ mpg_sd_summ hp_mean_summ hp_median_summ hp_sd_summ
* <dbl>         <dbl>           <dbl>       <dbl>        <dbl>          <dbl>      <dbl>
1     4          26.7            26          4.51         82.6            91        20.9
2     6          19.7            19.7        1.45        122.            110        24.3
3     8          15.1            15.2        2.56        209.            192.       51.0
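Because mtcars has no missing values, the na.rm argument makes no visible difference there. The sketch below uses a tiny made-up tibble (toy, with hypothetical columns g and x) that does contain an NA, so the effect of the formula version shows up next to the plain function.

```r
library(dplyr)

# Hypothetical data with one missing value in group "a"
toy <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 5)
)

na_demo <- toy %>%
  group_by(g) %>%
  summarise(across(x, list(plain = mean,
                           safe = ~ mean(.x, na.rm = TRUE))))

na_demo
# x_plain is NA for group "a", while x_safe is 1; both are 4 for group "b"
```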
And similarly you can use formulas to combine functions, avoiding an unnecessary extra mutate step:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg,
                   list(minus_sd = ~(mean(.x) - sd(.x)),
                        mean = mean,
                        plus_sd = ~(mean(.x) + sd(.x)))))
# A tibble: 3 x 4
    cyl mpg_minus_sd mpg_mean mpg_plus_sd
* <dbl>        <dbl>    <dbl>       <dbl>
1     4         22.2     26.7        31.2
2     6         18.3     19.7        21.2
3     8         12.5     15.1        17.7
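The same formula approach carries over to mutate(). As a hedged sketch (the _z suffix and the scaled object are my own naming, and the {.col} placeholder assumes a released dplyr), this adds standardised copies of two columns alongside the originals:

```r
library(dplyr)

# Add z-scored versions of mpg and hp as new columns
scaled <- mtcars %>%
  mutate(across(c(mpg, hp),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "{.col}_z"))

scaled %>%
  select(mpg, mpg_z, hp, hp_z) %>%
  head(3)
```

Without .names, mutate() would overwrite mpg and hp in place; supplying a pattern keeps the originals and appends the new columns.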
Extra tip: c_across() for working row-wise
Similarly, c_across() lets you compute a value for each row from an arbitrary set of columns when used after rowwise(). Here are a couple of examples:
WorldPhones %>%
  as.data.frame() %>%
  rowwise() %>%
  mutate(mean = mean(c_across(N.Amer:Mid.Amer), na.rm = TRUE))
# A tibble: 7 x 8
# Rowwise:
  N.Amer Europe  Asia S.Amer Oceania Africa Mid.Amer   mean
   <dbl>  <dbl> <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
1  45939  21574  2876   1815    1646     89      555 10642
2  60423  29990  4708   2568    2366   1411      733 14600.
3  64721  32510  5230   2695    2526   1546      773 15714.
4  68484  35218  6662   2845    2691   1663      836 16914.
5  71799  37598  6856   3000    2868   1769      911 17829.
6  76036  40341  8220   3145    3054   1905     1008 19101.
7  79831  43173  9053   3338    3224   2005     1076 20243.
starwars %>%
  rowwise() %>%
  mutate(
    stuff_they_own = length(c_across(c("vehicles", "starships")))
  ) %>%
  select(name, stuff_they_own)
# A tibble: 87 x 2
# Rowwise:
   name               stuff_they_own
   <chr>                       <int>
 1 Luke Skywalker                  4
 2 C-3PO                           0
 3 R2-D2                           0
 4 Darth Vader                     1
 5 Leia Organa                     1
 6 Owen Lars                       0
 7 Beru Whitesun lars              0
 8 R5-D4                           0
 9 Biggs Darklighter               1
10 Obi-Wan Kenobi                  6
# … with 77 more rows
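To make the row-wise mechanics easy to check by hand, here is a minimal sketch on a made-up three-column tibble (toy and its columns x, y, z are hypothetical). It also calls ungroup() at the end, which is my own addition so that later verbs are no longer applied one row at a time.

```r
library(dplyr)

# Hypothetical data: two rows, three numeric columns
toy <- tibble(x = c(1, 2), y = c(3, 4), z = c(5, 6))

row_totals <- toy %>%
  rowwise() %>%
  mutate(total = sum(c_across(x:z))) %>%  # combine x, y, z within each row
  ungroup()                               # drop the row-wise grouping

row_totals$total
# 9 12
```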
These are just a few examples of what’s possible with across(). If you discover any other uses that you’d like others to know about, feel free to post a comment.