What you need to know about dplyr 1.0.0 – Part 2: more flexible summarise()

summarise(), the original workhorse of dplyr, has been made even more flexible in the new release. First, it can now return vectors, which form multiple rows in the output. Second, it can return dataframes, which form multiple rows and columns in the output. This might be a little mind-bending for some, so I’ll spend a little time here illustrating how it works.

Vector output

If you want to summarise using a function that returns a vector, this is now easy. For example, you can summarise a range, which returns the minimum and maximum as two rows per group:

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(range = range(mpg))

# A tibble: 6 x 2
    cyl range
  <dbl> <dbl>
1     4  21.4
2     4  33.9
3     6  17.8
4     6  21.4
5     8  10.4
6     8  19.2

You could then label the rows and combine this with tidyr::pivot_wider() to spread the two values into their own columns:

library(tidyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(range = range(mpg)) %>% 
  mutate(name = rep(c("min", "max"), length(unique(cyl)))) %>% 
  pivot_wider(names_from = name, values_from = range)

# A tibble: 3 x 3
    cyl   min   max
  <dbl> <dbl> <dbl>
1     4  21.4  33.9
2     6  17.8  21.4
3     8  10.4  19.2

This would provide the equivalent of:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(min = min(mpg), max = max(mpg))

# A tibble: 3 x 3
    cyl   min   max
* <dbl> <dbl> <dbl>
1     4  21.4  33.9
2     6  17.8  21.4
3     8  10.4  19.2

In this case the second option is much simpler, but returning a vector becomes useful when the output is longer. Here’s one simple way you could compute deciles:

decile <- seq(0, 1, 0.1)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(deciles = quantile(mpg, decile)) %>% 
  mutate(name = rep(paste0("dec_", decile), length(unique(cyl)))) %>% 
  pivot_wider(names_from = name, values_from = deciles)

# A tibble: 3 x 12
    cyl dec_0 dec_0.1 dec_0.2 dec_0.3 dec_0.4 dec_0.5 dec_0.6 dec_0.7 dec_0.8 dec_0.9 dec_1
  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
1     4  21.4    21.5    22.8    22.8    24.4    26      27.3    30.4    30.4    32.4  33.9
2     6  17.8    18.0    18.3    19.0    19.4    19.7    20.5    21      21      21.2  21.4
3     8  10.4    11.3    13.9    14.7    15.0    15.2    15.4    15.9    16.8    18.3  19.2
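As a variation on the deciles example, the labels can be returned in the same summarise() call as the values, which avoids the separate mutate() and rep() step. This is only a sketch: the object name deciles_wide is made up for illustration, and .groups = "drop" is used so that the result is ungrouped. Later dplyr releases (1.1.0 onwards) warn about multi-row results from summarise() and point to reframe() instead.

```r
library(dplyr)
library(tidyr)

decile <- seq(0, 1, 0.1)

deciles_wide <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    name = paste0("dec_", decile),     # 11 labels per group
    deciles = quantile(mpg, decile),   # 11 values per group, in the same order
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = name, values_from = deciles)

dim(deciles_wide)   # 3 rows (one per cyl), 12 columns (cyl plus 11 deciles)
```

Because both expressions return vectors of the same length within each group, summarise() lines them up row by row before the pivot.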

Dataframe output

Your summarise() output can now also be a dataframe. Let’s look at a simple example. Recently I wrote a function that identifies all unique unordered pairs of elements in a vector. Now I want to apply it to map a network of connections between characters of Friends, based on their appearing in the same scene.

Here’s a simple version of a dataframe I might be working from:

friends_episode <- data.frame(
  scene = c(1, 1, 1, 2, 2, 2),
  character = c("Joey", "Phoebe", "Chandler", "Joey", "Chandler", "Janice")
)

friends_episode

  scene character
1     1      Joey
2     1    Phoebe
3     1  Chandler
4     2      Joey
5     2  Chandler
6     2    Janice

Now I’m going to write my function, which accepts a vector and returns a two-column dataframe, and apply it by scene:

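# Return every unique unordered pair of elements in a vector
# as a two-column dataframe (from, to)
unique_pairs <- function(char_vector = NULL) {

  vector <- as.character(unique(char_vector))

  # start with an empty result so that fewer than two elements yields zero rows
  df <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)

  if (length(vector) > 1) {
    for (i in seq_len(length(vector) - 1)) {
      # pair element i with every element that comes after it
      from <- rep(vector[i], length(vector) - i)
      to <- vector[(i + 1):length(vector)]

      df <- df %>%
        dplyr::bind_rows(
          data.frame(from = from, to = to, stringsAsFactors = FALSE)
        )
    }
  }

  df

}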


friends_episode %>% 
  group_by(scene) %>% 
  summarise(unique_pairs(character))


# A tibble: 6 x 3
  scene from     to      
  <dbl> <chr>    <chr>   
1     1 Joey     Phoebe  
2     1 Joey     Chandler
3     1 Phoebe   Chandler
4     2 Joey     Chandler
5     2 Joey     Janice  
6     2 Chandler Janice  

As you can see, the returned dataframe has been unpacked: the original character column is gone, and two new columns, from and to, appear in its place. What happens if we name our summarised dataframe?

friends_pairs <- friends_episode %>% 
  group_by(scene) %>% 
  summarise(pairs = unique_pairs(character))

friends_pairs

# A tibble: 6 x 2
  scene pairs$from $to     
  <dbl> <chr>      <chr>   
1     1 Joey       Phoebe  
2     1 Joey       Chandler
3     1 Phoebe     Chandler
4     2 Joey       Chandler
5     2 Joey       Janice  
6     2 Chandler   Janice  

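This is an important point to watch out for. When the output is named, the whole dataframe is stored in a single dataframe column (here pairs, with from and to nested inside it), which is why the result has two columns rather than three. If you want your output dataframe unpacked into separate columns of the summary, don’t name your summary function output.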
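If a dataframe has already been stored under a name, it can still be spread out afterwards. The sketch below assumes tidyr::unpack() for this. It also uses a hypothetical helper, pairs_combn(), as a compact stand-in for unique_pairs() built on base R’s combn(); it is not the function defined above. Later dplyr releases (1.1.0 onwards) warn about multi-row results from summarise() and point to reframe() instead.

```r
library(dplyr)
library(tidyr)

friends_episode <- data.frame(
  scene = c(1, 1, 1, 2, 2, 2),
  character = c("Joey", "Phoebe", "Chandler", "Joey", "Chandler", "Janice")
)

# Hypothetical stand-in for unique_pairs(), for illustration only
pairs_combn <- function(x) {
  x <- as.character(unique(x))
  if (length(x) < 2) {
    return(data.frame(from = character(), to = character(), stringsAsFactors = FALSE))
  }
  m <- combn(x, 2)   # matrix with one pair per column
  data.frame(from = m[1, ], to = m[2, ], stringsAsFactors = FALSE)
}

# Naming the output keeps the dataframe packed in one column
packed <- friends_episode %>%
  group_by(scene) %>%
  summarise(pairs = pairs_combn(character), .groups = "drop")

is.data.frame(packed$pairs)   # TRUE: pairs is a dataframe column
names(packed)                 # "scene" "pairs"

# unpack() spreads the packed column back into ordinary columns
unpacked <- packed %>% unpack(cols = pairs)
names(unpacked)               # "scene" "from" "to"
```

Keeping the dataframe packed can be handy when several summaries return columns with the same names; unpacking is then a deliberate later step.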

One of the things you might immediately see from all this is the ability to run iterations of models using dplyr. This is quite exciting and I will write more on this soon, but here’s a simple example:

# Fit a linear model and return its coefficients as a two-column dataframe
model_coefs <- function(data, formula) {
  coefs <- lm(formula, data)$coefficients
  data.frame(coef = names(coefs), value = coefs)
}


mtcars %>% 
  dplyr::nest_by(cyl) %>% 
  dplyr::summarise(model_coefs(data = data, formula = mpg ~ disp + hp + drat + wt)) %>% 
  tidyr::pivot_wider(names_from = coef, values_from = value)

# A tibble: 3 x 6
    cyl `(Intercept)`       disp       hp    drat    wt
  <dbl>         <dbl>      <dbl>    <dbl>   <dbl> <dbl>
1     4          52.5 -0.0629    -0.0760  -1.44   -3.10
2     6          15.1  0.0436     0.00252  2.43   -3.98
3     8          26.8  0.0000659 -0.0135  -0.0453 -2.19
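To make the mechanics of the model example a little more concrete, here is a small sketch of what nest_by() hands to summarise(), together with a by-hand fit for a single group. The object names nested and by_hand are made up for illustration.

```r
library(dplyr)

# nest_by() returns one row per cyl, with the remaining columns
# stored as a dataframe in a list-column called `data`
nested <- mtcars %>% nest_by(cyl)
nested
class(nested)   # includes "rowwise_df", so summarise() works row by row

# Fitting the same model directly on the cyl == 4 rows should
# correspond to the first row of the coefficient table
by_hand <- coef(lm(mpg ~ disp + hp + drat + wt, data = subset(mtcars, cyl == 4)))
by_hand
```

Because the nested dataframe is rowwise, data inside summarise() refers to the dataframe for the current row rather than to the whole list-column.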

What I’ve written here is just the beginning of what’s possible. Please feel free to leave a comment with other ideas for readers.
