dplyr
previously had limited friendliness to working across rows. It previously behaved somewhat counter-intuitively when you wanted to sum or average across values in the same row. Here’s an example, which some of you might recognize as being a source of a previous headache:
WorldPhones_df <- WorldPhones %>%
as.data.frame()
# mutate an average column
WorldPhones_df %>%
dplyr::mutate(avg = mean(N.Amer:Mid.Amer))
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer avg
1951 45939 21574 2876 1815 1646 89 555 23247
1956 60423 29990 4708 2568 2366 1411 733 23247
1957 64721 32510 5230 2695 2526 1546 773 23247
1958 68484 35218 6662 2845 2691 1663 836 23247
1959 71799 37598 6856 3000 2868 1769 911 23247
1960 76036 40341 8220 3145 3054 1905 1008 23247
1961 79831 43173 9053 3338 3224 2005 1076 23247
This has returned the average of everything in every column in your dataframe, which is of course not what was intended.
Previously the only solution to this was to use manual calculations and to avoid using functions in this way, so you wuld write (N.Amer + Europe + Asia + S.Amer + Oceania + Africa + Mid.Amer)/7
which was pretty darn tedious.
rowwise()
creates a different structure called a rowwise_df
which prepares your data to perform operations across the rows – it basically groups your data by row. It can be used in combination with the new c_across()
adverb to allow you to work in a similar way to how you would work colwise. Now you can write:
WorldPhones_df %>%
rowwise() %>%
dplyr::mutate(avg = mean(c_across(N.Amer:Mid.Amer)))
# A tibble: 7 x 8
# Rowwise:
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45939 21574 2876 1815 1646 89 555 10642
2 60423 29990 4708 2568 2366 1411 733 14600.
3 64721 32510 5230 2695 2526 1546 773 15714.
4 68484 35218 6662 2845 2691 1663 836 16914.
5 71799 37598 6856 3000 2868 1769 911 17829.
6 76036 40341 8220 3145 3054 1905 1008 19101.
7 79831 43173 9053 3338 3224 2005 1076 20243.
Working row-wise with list columns
The new rowwise_df
object is designed to work with list-columns, which allow the storage of any type of data you want inside a column in a dataframe. Where I find this particularly valuable is where you want to run different models on subsets of your data according to the value of certain variables. Here’s an example of how you can store different subsets of mtcars
in a rowwise dataframe and then run a model on them.
model_coefs <- function(formula, data) {
coefs <- lm(formula, data)$coefficients
data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
dplyr::group_by(cyl) %>%
tidyr::nest() %>%
dplyr::rowwise() %>%
dplyr::summarise(model_coefs(mpg ~ wt + disp + hp, data = data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 5
cyl `(Intercept)` wt disp hp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 6 30.3 -3.90 0.0161 -0.0110
2 4 44.9 -2.59 -0.0663 -0.0645
3 8 26.7 -2.18 0.00000110 -0.0137
The nest_by
function
Of course, the developers behind dplyr 1.0.0
noticed the power of this row-wise modelling capability and so created the nest_by()
function as a shortcut for the code above. nest_by(x)
is equivalent of:
dplyr::group_by(x) %>%
tidy::nest() %>%
dplyr::rowwise()
So now you can do the modeling above using:
mtcars %>%
nest_by(cyl) %>%
dplyr::summarise(model_coefs(mpg ~ wt + disp + hp, data = data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 5
cyl `(Intercept)` wt disp hp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 44.9 -2.59 -0.0663 -0.0645
2 6 30.3 -3.90 0.0161 -0.0110
3 8 26.7 -2.18 0.00000110 -0.0137
I’ve found the group_map and group_modify verbs more succinct – they avoid any obvious nesting. e.g.
mtcars %>%
group_by(cyl) %>%
group_modify(~ broom::tidy(lm(mpg ~ wt + disp + hp, data = .x)))