The force is strong in you if you abstract your R code

Any decent mathematician or computer programmer will tell you that if a task is being repeated again and again, it should be made into a function.

This has always been true, and if you are still coding repeated tasks again and again while just changing a variable or two — if you are just copy/pasting code for example — then you need to stop right now and learn how to write functions.

But recent developments mean that there are more and more incentives to always consider what parts of your code can be abstracted. Developments in R packages to get around non-standard evaluation challenges, and to enhance abstraction power through quosures and related expressions, mean that amazing powers are within our reach.

Functions in R

Let’s start simply. A function is useful to create if you are doing identical analysis but just changing variable values. Let’s work with the starwars dataset in dplyr. If we wanted a list of all human characters we might use this:

library(dplyr)  # provides the starwars dataset and the %>% pipe

starwars_humans <- starwars %>%
  dplyr::filter(species == "Human") %>%
  dplyr::select(name)

This will return the names of 35 characters. Now if we want the same list but for several other species, we could just copy and paste and change the value for species. Or we could write this function for future use:

species_search <- function(x) {
  starwars %>%
    dplyr::filter(species == x) %>%
    dplyr::select(name)
}

Now if we run species_search("Droid") we get a list of four characters and are reassured to see our buddy R2-D2 in there.

We can of course extend this to make it a function with more than one variable to help us search based on various conditions.
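As a minimal sketch of that extension, here is a two-argument version that filters on both species and homeworld. The function name species_world_search and its arguments sp and world are hypothetical, chosen for illustration only:

```r
library(dplyr)

# Hypothetical two-argument variant of species_search().
# The arguments are deliberately NOT named `species` or `homeworld`:
# inside dplyr::filter() those names would resolve to the columns
# themselves, so `species == species` would compare the column to itself.
species_world_search <- function(sp, world) {
  starwars %>%
    dplyr::filter(species == sp, homeworld == world) %>%
    dplyr::select(name)
}

humans_tatooine <- species_world_search("Human", "Tatooine")
```

Each extra condition passed to dplyr::filter() is combined with a logical AND, so only characters that match both values are returned.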

Abstracting the search further using features of rlang

The problem above is that this function has limited flexibility. It is defined in a way that you have no control over which variable you want to filter on.

What if we wanted to redefine this function so that it returns a list based on any arbitrary condition that we set? Here we can set two arguments to the function, one to represent the column on which to filter, and another for the value to filter against. We can use the enquo function in rlang to capture the column name for use in dplyr::filter_at(). Like this:

starwars_search <- function(filter, value) {

  # capture the unquoted column name as a quosure
  filter_val <- rlang::enquo(filter)

  starwars %>%
    dplyr::filter_at(dplyr::vars(!!filter_val), dplyr::all_vars(. == value)) %>%
    dplyr::select(name)
}

Now if we evaluate starwars_search(skin_color, "gold") we are reassured to see our anxious but loveable friend C-3PO returned.
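For comparison, more recent versions of rlang (0.4.0 and later) offer the embrace operator {{ }}, which combines the enquo and !! steps into one. The sketch below assumes such a version is installed; the name starwars_search_embrace is hypothetical and is not the function defined above:

```r
library(dplyr)

# Hypothetical variant using the embrace operator, assuming rlang >= 0.4.0.
# {{ filter_col }} captures the column name the caller typed and
# injects it directly into the filter expression.
starwars_search_embrace <- function(filter_col, value) {
  starwars %>%
    dplyr::filter({{ filter_col }} == value) %>%
    dplyr::select(name)
}

gold <- starwars_search_embrace(skin_color, "gold")
```

It is called the same way as the enquo version, with an unquoted column name followed by the value to match.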

Going even further to allow arbitrary filter conditions using purrr

So even with our step above we have made our search functionality more abstract and powerful, but it’s still somewhat limited. For example, it only deals with one filter and will only find characters that match that single value.

Let's imagine that we have a set of filters in the form of a list. We can use the map2 function in purrr to take that list and break it into a series of quosure expressions that can be passed as individual statements into dplyr::filter, using a new function that acts on a dataframe:

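my_filter <- function(df, filt_list) {
  # column names to filter on, and the allowed values for each
  cols <- as.list(names(filt_list))
  conds <- filt_list
  # build one quosure of the form `column %in% values` per list element
  fp <- purrr::map2(cols, conds,
                    function(x, y) rlang::quo((!!(as.name(x))) %in% !!y))
  # splice all the conditions into a single filter call
  dplyr::filter(df, !!!fp)
}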
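To see what the map2 and !!! steps are doing, here is a small self-contained sketch on a made-up data frame. The toy data and the names toy, filt, fp, spliced and by_hand are all hypothetical. It builds the same kind of quosures and checks that splicing them gives the same result as writing the conditions out by hand:

```r
library(dplyr)

# Made-up data, purely for illustration.
toy <- dplyr::tibble(
  name   = c("a", "b", "c", "d"),
  colour = c("red", "blue", "red", "green"),
  size   = c("S", "M", "L", "S")
)

filt <- list(colour = c("red", "green"), size = "S")

# One quosure per list element, each of the form `column %in% values`.
fp <- purrr::map2(as.list(names(filt)), filt,
                  function(x, y) rlang::quo((!!(as.name(x))) %in% !!y))

# Splicing the list with !!! passes each quosure as a separate condition...
spliced <- dplyr::filter(toy, !!!fp)

# ...which should match the conditions typed out explicitly.
by_hand <- dplyr::filter(toy, colour %in% c("red", "green"), size %in% "S")
```

Here both spliced and by_hand keep only rows "a" and "d", the two rows whose colour is red or green and whose size is S.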

This allows us to further abstract our starwars_search function to receive an arbitrary set of filter conditions in a list, and those conditions can be set to match either a single value or a set of values expressed in a vector:

starwars_search <- function(filter_list) {
  starwars %>%
    my_filter(filter_list) %>%
    dplyr::select(name)
}

Now we can, for example, look for all characters who have blue or brown eyes, are human and hail from Tatooine or Alderaan, using starwars_search(list(eye_color = c("blue", "brown"), species = "Human", homeworld = c("Tatooine", "Alderaan"))), which will return the following:

# A tibble: 10 x 1
   name
   <chr>
 1 Luke Skywalker
 2 Leia Organa
 3 Owen Lars
 4 Beru Whitesun lars
 5 Biggs Darklighter
 6 Anakin Skywalker
 7 Shmi Skywalker
 8 Cliegg Lars
 9 Bail Prestor Organa
10 Raymus Antilles

Now you are ready to unleash the full power of the force, by developing functions that abstract multiple elements of your dplyr code. For example, here’s a function that allows you to find any grouped averages you wish of certain Star Wars characters:

starwars_average <- function(mean_col, grp, filter_list) {
  # capture the unquoted column names as quosures
  calc_var <- rlang::enquo(mean_col)
  grp_var <- rlang::enquo(grp)

  starwars %>%
    my_filter(filter_list) %>%
    dplyr::group_by(!!grp_var) %>%
    dplyr::summarise(mean = mean(!!calc_var, na.rm = TRUE))
}

So if you wanted to find the average height of all humans according to their home worlds, this can be accomplished using starwars_average(height, homeworld, list(species = "Human")), which will return this table:

# A tibble: 16 x 2
   homeworld     mean
   <chr>         <dbl>
 1 Alderaan      176.
 2 Bespin        175
 3 Bestine IV    180
 4 Chandrila     150
 5 Concord Dawn  183
 6 Corellia      175
 7 Coruscant     168.
 8 Eriadu        180
 9 Haruun Kal    188
10 Kamino        183
11 Naboo         168.
12 Serenno       193
13 Socorro       177
14 Stewjon       182
15 Tatooine      179.
16 <NA>          193

Although this has been a somewhat trivial example, I hope this helps you better grasp the potential that is available in R functions nowadays. As you look at your day-to-day work, you may find that there are opportunities to abstract out some of your most common manipulations into functions, which could save you a lot of time and effort. Really, what I have demonstrated here is only the tip of the iceberg in terms of what is possible.
