Summarise, the original workhorse of dplyr, has been made even more flexible in the new release. First, it can now return vectors to form multiple rows in the output. Second, it can return dataframes to form multiple rows and columns in the output. This might be a little mind-bending for some, so I’ll spend a little time here illustrating how it works.
Vector output
If you want to summarise with a function that returns a vector of more than one value, this is now easy. For example, you can summarise a range:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(range = range(mpg))
# A tibble: 6 x 2
cyl range
<dbl> <dbl>
1 4 21.4
2 4 33.9
3 6 17.8
4 6 21.4
5 8 10.4
6 8 19.2
You could then combine this with tidyr::pivot_wider() if you wish:
library(tidyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(range = range(mpg)) %>%
  # label the two rows per cyl; range() returns the minimum first, then the maximum
  mutate(name = rep(c("min", "max"), length(unique(cyl)))) %>%
  pivot_wider(names_from = name, values_from = range)
# A tibble: 3 x 3
cyl min max
<dbl> <dbl> <dbl>
1 4 21.4 33.9
2 6 17.8 21.4
3 8 10.4 19.2
This would provide the equivalent of:
mtcars %>%
  group_by(cyl) %>%
  summarise(min = min(mpg), max = max(mpg))
# A tibble: 3 x 3
cyl min max
* <dbl> <dbl> <dbl>
1 4 21.4 33.9
2 6 17.8 21.4
3 8 10.4 19.2
The second option is much easier in this case, but vector output becomes useful when the summary function returns longer vectors. Here’s one simple way you could compute deciles:
decile <- seq(0, 1, 0.1)
mtcars %>%
  group_by(cyl) %>%
  summarise(deciles = quantile(mpg, decile)) %>%
  mutate(name = rep(paste0("dec_", decile), length(unique(cyl)))) %>%
  pivot_wider(names_from = name, values_from = deciles)
# A tibble: 3 x 12
cyl dec_0 dec_0.1 dec_0.2 dec_0.3 dec_0.4 dec_0.5 dec_0.6 dec_0.7 dec_0.8 dec_0.9 dec_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 21.4 21.5 22.8 22.8 24.4 26 27.3 30.4 30.4 32.4 33.9
2 6 17.8 18.0 18.3 19.0 19.4 19.7 20.5 21 21 21.2 21.4
3 8 10.4 11.3 13.9 14.7 15.0 15.2 15.4 15.9 16.8 18.3 19.2
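The relabel-and-pivot step relies on every group returning its values in the same order. One alternative is to have the summary function return a one-row dataframe whose columns are already named, which uses the dataframe output described in the next section. The sketch below is only an illustration: quantile_df is a hypothetical helper, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): returns a one-row data frame
# with one named column per requested quantile.
quantile_df <- function(x, probs = seq(0, 1, 0.1)) {
  q <- quantile(x, probs, names = FALSE)
  names(q) <- paste0("dec_", probs)
  as.data.frame(as.list(q))
}

# One row per cyl, one column per decile, with no pivot step needed
deciles_wide <- mtcars %>%
  group_by(cyl) %>%
  summarise(quantile_df(mpg))
```

Because each group contributes exactly one row, the result should have the same wide shape as the pivoted version above.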
Dataframe output
Now your summarise output can be a dataframe. Let’s look at a simple example. Recently I wrote a function that identified all unique unordered pairs of elements in a vector. Now I want to apply that to map a network of connections between characters of Friends based on appearing in the same scene.
Here’s a simple version of a dataframe I might be working from:
friends_episode <- data.frame(
  scene = c(1, 1, 1, 2, 2, 2),
  character = c("Joey", "Phoebe", "Chandler", "Joey", "Chandler", "Janice")
)
friends_episode
scene character
1 1 Joey
2 1 Phoebe
3 1 Chandler
4 2 Joey
5 2 Chandler
6 2 Janice
Now I’m going to write my function which accepts a vector and which produces a two column dataframe, and apply it by scene:
unique_pairs <- function(char_vector = NULL) {
  # work with the distinct values only, as character strings
  vector <- as.character(unique(char_vector))
  # start from an empty two-column dataframe, so a single value yields no pairs
  df <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
  if (length(vector) > 1) {
    for (i in seq_len(length(vector) - 1)) {
      # pair element i with every element that comes after it
      from <- rep(vector[i], length(vector) - i)
      to <- vector[(i + 1):length(vector)]
      df <- df %>%
        dplyr::bind_rows(
          data.frame(from = from, to = to, stringsAsFactors = FALSE)
        )
    }
  }
  df
}
friends_episode %>%
  group_by(scene) %>%
  summarise(unique_pairs(character))
# A tibble: 6 x 3
scene from to
<dbl> <chr> <chr>
1 1 Joey Phoebe
2 1 Joey Chandler
3 1 Phoebe Chandler
4 2 Joey Chandler
5 2 Joey Janice
6 2 Chandler Janice
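For longer character lists, growing a dataframe inside a loop can get slow. As a rough alternative sketch, base R's combn() enumerates the same unordered pairs in one call. The name unique_pairs_combn is hypothetical and chosen here for illustration; it is intended to return the same two-column dataframe as unique_pairs().

```r
# Hypothetical alternative to unique_pairs(), using utils::combn() from base R.
unique_pairs_combn <- function(char_vector = NULL) {
  values <- as.character(unique(char_vector))
  # Fewer than two distinct values means there are no pairs to report
  if (length(values) < 2) {
    return(data.frame(from = character(), to = character(), stringsAsFactors = FALSE))
  }
  # combn() returns a 2-row matrix with one column per unordered pair
  pairs <- utils::combn(values, 2)
  data.frame(from = pairs[1, ], to = pairs[2, ], stringsAsFactors = FALSE)
}

unique_pairs_combn(c("Joey", "Phoebe", "Chandler"))
```

It can be dropped into the same group_by(scene) %>% summarise(...) call in place of unique_pairs().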
In the scene-level output above, the returned dataframe has been unpacked: the original character column is gone and two new columns, from and to, take its place. What happens if we name our summarised dataframe?
friends_pairs <- friends_episode %>%
  group_by(scene) %>%
  summarise(pairs = unique_pairs(character))
friends_pairs
# A tibble: 6 x 2
scene pairs$from $to
<dbl> <chr> <chr>
1 1 Joey Phoebe
2 1 Joey Chandler
3 1 Phoebe Chandler
4 2 Joey Chandler
5 2 Joey Janice
6 2 Chandler Janice
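This is an important thing to watch out for: naming the output keeps the dataframe packed inside a single column (here pairs, with nested columns from and to). If you want your output dataframe unpacked into the summary dataframe, don’t name your summary function output.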
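If you have already named the output and want the columns flattened afterwards, tidyr::unpack() is one way to do it. The sketch below uses a hypothetical one-row summary on mtcars rather than the Friends data, purely to keep it short; the column name stats is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)

# Naming the dataframe output keeps it packed in a single column called stats
packed <- mtcars %>%
  group_by(cyl) %>%
  summarise(stats = data.frame(min = min(mpg), max = max(mpg)))

names(packed)        # "cyl" "stats"
packed$stats$min     # nested columns are reached with a second $

# tidyr::unpack() promotes the nested columns to top-level columns
unpacked <- packed %>% unpack(stats)
names(unpacked)      # "cyl" "min" "max"
```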
One of the things you might immediately see from all this is the ability to run iterations of models using dplyr. This is quite exciting and I will write more on this soon, but here’s a simple example:
model_coefs <- function(data, formula) {
  # fit the model and return one row per coefficient
  coefs <- lm(formula, data)$coefficients
  data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
  dplyr::nest_by(cyl) %>%
  dplyr::summarise(model_coefs(data = data, formula = mpg ~ disp + hp + drat + wt)) %>%
  tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 6
cyl `(Intercept)` disp hp drat wt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 52.5 -0.0629 -0.0760 -1.44 -3.10
2 6 15.1 0.0436 0.00252 2.43 -3.98
3 8 26.8 0.0000659 -0.0135 -0.0453 -2.19
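If you only need the wide layout, a variation is to return the coefficients as a one-row dataframe, which skips the pivot. This is a sketch under assumptions: coef_row is a hypothetical helper, and the simpler formula mpg ~ wt is used only to keep the example small.

```r
library(dplyr)

# Hypothetical helper: fit a linear model and return its coefficients
# as a one-row data frame, one column per term.
coef_row <- function(data, formula) {
  coefs <- coef(lm(formula, data = data))
  # t() turns the named vector into a 1 x p matrix, keeping the term names
  as.data.frame(t(coefs))
}

# One model per cyl; each row of the result holds that model's coefficients
coefs_wide <- mtcars %>%
  nest_by(cyl) %>%
  summarise(coef_row(data, mpg ~ wt))
```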
What I’ve written here is just the beginning of what’s possible. Please feel free to leave a comment with other ideas for readers.