Simple iterative programming and error handling in R

As you develop as a programmer, there are common situations you will find yourself in. One of those situations is where you need to run your code over a number of iterations of one or more loops, and where you know that your code may fail for at least one iteration. You don’t want your code to stop completely, but you do want to know that it failed and log where it happened. I am going to show a simple example of how to do this here.

This was the second part of my journey in building my interactive character network visualization of the TV show Friends. The tutorial of the first step – scraping the scripts of individual episodes from the web – can be found here. The end product of the entire project can be seen here and all code is here.

Where we left off, we had written a script that could scrape each online episode of Friends, find and count the different scenes and list the characters in each scene. Our scraping script outputted a table like this:

episode  scene  character
1        1      Monica
1        1      Joey
1        1      Chandler
1        1      Phoebe
1        1      Ross
1        1      Rachel
1        1      Waitress
1        2      Monica
1        2      Chandler
1        2      Ross

What we want to do now is create a network edgelist so that we can analyze and visualize the character network of the show. A network edgelist is a simple pairing of characters with a ‘from’ and ‘to’ column, where characters are paired if they have appeared together in at least one scene. We will also add a ‘weight’ column which will count the number of scenes the pair have appeared together in – a measure of connection ‘strength’. We want to do this for every season of the show, and we don’t care about direction in our network – so for us the pair {“Chandler”, “Monica”} is the same as the pair {“Monica”, “Chandler”}. So to summarise our objectives here – we want to:

  1. Run our scraping function through every season and every episode
  2. Transform the output into character ‘pairs’ for each scene
  3. Count character pairs across each season of the show to form a ‘weight’.

Running our scraping function through all seasons and episodes

Friends ran over ten seasons, each one with a varying number of episodes, but never more than 25 – there were also double episodes, which means that some episode numbers were skipped in the script names. I don’t want to write conditional code that precisely defines all combinations of seasons and episodes over the years. What I would much prefer to do is to allow my scraping script to throw an error, return an empty data frame, log the error as a message in my console, and then continue on to the next iteration. Then I can go and look at the logged messages and check that the errors were expected because the episodes didn’t exist.

So I am going to do some error handling here, and I'm going to use the tryCatch() function to do this. tryCatch() evaluates an expression and, if that expression throws an error, calls a handler function that you supply. Whatever the handler returns becomes the value of the tryCatch() call. In my case I am going to use tryCatch() like this:

library(dplyr)
source('edgelist_creation/scrape_friends.R')   # function developed in previous tutorial

season <- 1 # enter season no
episode <- 1 # enter episode number

scrape <- tryCatch(
      scrape_friends(season, episode),
      error = function(e) {
        message(paste("Scrape failed for Episode", episode, "Season", season))
        data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
        }
)    

With this code, if the specific season and episode combination does not exist, the code will not stop, but instead it will display the specific message in the console and return an empty dataframe. This now allows us to iterate over every season and episode without fearing that our code will stop because of an error.
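To see the pattern in isolation, here is a minimal, self-contained sketch. The function risky_scrape() is a hypothetical stand-in for scrape_friends() that fails for any episode number above 3, and safe_scrape() is just an illustrative wrapper name; neither is part of the original project. The handler also appends conditionMessage(e), the text of the underlying error, which can help when checking later whether a failure was expected.

```r
# Hypothetical stand-in for scrape_friends(): fails when the episode does not exist
risky_scrape <- function(season, episode) {
  if (episode > 3) {
    stop("no script found for this episode")
  }
  data.frame(episode = episode, scene = 1L, character = "Monica",
             stringsAsFactors = FALSE)
}

# Illustrative wrapper: never stops, always returns a data frame
safe_scrape <- function(season, episode) {
  tryCatch(
    risky_scrape(season, episode),
    error = function(e) {
      # log where it failed and why, then return an empty data frame
      message("Scrape failed for Season ", season, " Episode ", episode,
              ": ", conditionMessage(e))
      data.frame(episode = integer(), scene = integer(), character = character(),
                 stringsAsFactors = FALSE)
    }
  )
}

ok <- safe_scrape(1, 2)      # succeeds and returns one row
failed <- safe_scrape(1, 4)  # logs a message and returns zero rows
nrow(ok)
nrow(failed)
```

Because both branches return a data frame, the calling code only ever needs to check the number of rows.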

Transforming output into character pairs

So because of our wonderful tryCatch() solution, we can now create the beginnings of a for loop that will iterate over 10 seasons, 25 episodes per season, as follows. Note my comments here and pay particular attention to the CAPS section which we will need to work on next:

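# start with an empty dataframe for the overall results
raw_results <- data.frame(from = character(), to = character(), season = integer(), stringsAsFactors = FALSE)

for (season in 1:10) {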
  
  # start with empty dataframe
  season_results <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
  
  # no season has more than 25 episodes, loop through them all
  for (episode in 1:25) { 
    
    # keep track of where we are
    message(paste("Scraping Season", season, "Episode", episode))
    
    # run scraping function, pass empty df if scrape fails
    scrape <- tryCatch(
      scrape_friends(season, episode),
      error = function(e) {
        message(paste("Scrape failed for Episode", episode, "Season", season))
        data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
        }
    )
    
    result <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
    
    if (nrow(scrape) > 0) {
      # DO SOMETHING TO CREATE CHARACTER PAIRS HERE
    } 
   
   # add episode output to season results 
   season_results <- season_results %>% 
     dplyr::bind_rows(result)
     
  }
  
  # add season results to overall results
  raw_results <- season_results %>% 
    dplyr::mutate(season = season) %>% 
    dplyr::bind_rows(raw_results)
  
}

This now reduces us to the question of how to transform the output of an episode scrape into a set of unique character pairs per scene. Recall that the output from our scraping script contains a set of scene numbers and list of characters for each scene.

Within a scene, we need to take that character list and turn it into a set of unique unordered pairs. Let’s write a simple function to transform a character vector into a set of unique unordered pairs of its elements. To do this, we need to go through each element up to the second from last element, and pair with each of the elements that follow it – so for example to do this for the vector ("A", "B", "C", "D"), we would pair "A" with "B", "C" and "D", we would pair "B" with "C" and "D", and finally we would pair "C" with "D".

unique_pairs <- function(char_vector = NULL) {
  
  vector <- as.character(unique(char_vector))
  
  df <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
  
  if (length(vector) > 1) {
    for (i in 1:(length(vector) - 1)) {
      from <- rep(vector[i], length(vector) - i) # each element up to second last
      to <- vector[(i + 1): length(vector)] # each element that follows it
      
      df <- df %>% 
        dplyr::bind_rows(
          data.frame(from = from, to = to, stringsAsFactors = FALSE) 
        )
    }
  }

  df
  
}

Let’s test our function to see if it works:

> test <- c("A", "B", "C", "D")
> unique_pairs(test)
  from to
1    A  B
2    A  C
3    A  D
4    B  C
5    B  D
6    C  D
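As a cross-check, the same pairs can be generated with combn() from the utils package that ships with R. This is only a sketch of an alternative, not the function used in the rest of this post, and the name pairs_combn() is made up for illustration.

```r
# Alternative sketch: all unordered pairs of the unique elements via utils::combn()
pairs_combn <- function(char_vector) {
  v <- as.character(unique(char_vector))
  if (length(v) < 2) {
    # fewer than two characters means there are no pairs
    return(data.frame(from = character(), to = character(), stringsAsFactors = FALSE))
  }
  m <- t(utils::combn(v, 2))  # one row per pair, two columns
  data.frame(from = m[, 1], to = m[, 2], stringsAsFactors = FALSE)
}

res <- pairs_combn(c("A", "B", "C", "D"))
res
```

For the test vector this should give the same six rows, in the same order, as unique_pairs().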

Looks good! Now we just need to apply this to every scene in the episode, so this is the final code that we can drop into our loop in place of the CAPS comment above. It goes through each scene and applies our new unique_pairs() function to the character list, and then appends the results to a data frame which captures all the pairs for the episode.

      for (i in 1:max(scrape$scene)) {
        result_new <- scrape %>% 
          dplyr::filter(scene == i) %>% 
          dplyr::pull(character) %>% 
          unique_pairs()
        
        result <- result %>% 
          dplyr::bind_rows(result_new) 
      } 
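The same per-scene step can also be written without an explicit loop, by splitting the characters by scene and binding the pieces once at the end. The sketch below runs on made-up toy data in the shape of the scraper output, and pairs_of() is a hypothetical compact helper standing in for unique_pairs() so that the example is self-contained.

```r
# Toy data in the same shape as the scraper output (values are made up)
scrape <- data.frame(
  episode = 1,
  scene = c(1, 1, 1, 2, 2),
  character = c("Monica", "Joey", "Ross", "Monica", "Ross"),
  stringsAsFactors = FALSE
)

# Hypothetical compact helper with the same intent as unique_pairs()
pairs_of <- function(x) {
  x <- unique(as.character(x))
  if (length(x) < 2) {
    return(data.frame(from = character(), to = character(), stringsAsFactors = FALSE))
  }
  m <- t(utils::combn(x, 2))
  data.frame(from = m[, 1], to = m[, 2], stringsAsFactors = FALSE)
}

# Split characters by scene, build the pairs for each scene, then bind once
per_scene <- lapply(split(scrape$character, scrape$scene), pairs_of)
result <- do.call(rbind, per_scene)
rownames(result) <- NULL
result
```

One side effect of splitting on the scene column is that it does not matter whether scene numbers are consecutive.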

Now we are in a position to run the entire loop over all seasons and episodes. You can find the full code of this loop here. If we run it we will see the errors being caught – for example we will see:

Scraping Season 10 Episode 17
Scraping Season 10 Episode 18
Scrape failed for Episode 18 Season 10
Scraping Season 10 Episode 19
Scrape failed for Episode 19 Season 10
Scraping Season 10 Episode 20
Scrape failed for Episode 20 Season 10
Scraping Season 10 Episode 21
Scrape failed for Episode 21 Season 10
Scraping Season 10 Episode 22
Scrape failed for Episode 22 Season 10
Scraping Season 10 Episode 23
Scrape failed for Episode 23 Season 10
Scraping Season 10 Episode 24
Scrape failed for Episode 24 Season 10
Scraping Season 10 Episode 25
Scrape failed for Episode 25 Season 10

This makes sense because the final episode of Season 10 was Episode 17 (a double episode finale).
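If reading through console messages becomes tedious, one option is to record each failure in a small table as well. The sketch below is an assumption about how that could look rather than part of the original loop: fake_scrape() is a hypothetical stand-in for scrape_friends() in which only episodes 1 to 3 exist, and the loop sizes are shrunk to keep the example short.

```r
# Hypothetical stand-in for scrape_friends(): only episodes 1 to 3 exist
fake_scrape <- function(season, episode) {
  if (episode > 3) stop("episode not found")
  data.frame(episode = episode, scene = 1, character = "Ross",
             stringsAsFactors = FALSE)
}

failures <- list()  # one entry per failed season/episode combination

for (season in 1:2) {
  for (episode in 1:5) {
    scrape <- tryCatch(
      fake_scrape(season, episode),
      error = function(e) {
        # record the failure in the enclosing environment, then return NULL
        failures[[length(failures) + 1]] <<- data.frame(
          season = season, episode = episode,
          reason = conditionMessage(e), stringsAsFactors = FALSE
        )
        NULL
      }
    )
  }
}

# one row per failure, ready to compare against the episodes we expect to be missing
failure_log <- do.call(rbind, failures)
failure_log
```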

Counting the number of scenes for each pair

Now we are almost home and dry. We need to count the number of times in each season a pair of characters appeared in a scene together to create our ‘weight’ column. The only issue we need to overcome is that the order of the characters might not be the same in our raw_results dataframe. Our iteration might have caught from = "Monica", to = "Chandler" in one scene, but the other way around in another scene.

A simple solution to this is to order the pair in each row alphabetically, with the following command:

# order pairs alphabetically to deal with different orderings
for (i in seq_len(nrow(raw_results))) {
  
  # unlist() turns the one-row selection into a plain character vector before sorting
  raw_results[i, c("from", "to")] <- sort(unlist(raw_results[i, c("from", "to")]))
  
}
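Going row by row can be slow on a table with many thousands of rows. A vectorized sketch of the same idea uses pmin() and pmax(), which compare two character vectors element by element. The toy data below is made up for illustration.

```r
# Toy data with the same pair stored in both orders (values are made up)
raw_results <- data.frame(
  from = c("Monica", "Chandler", "Ross"),
  to = c("Chandler", "Monica", "Rachel"),
  season = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Element-wise minimum and maximum give the alphabetically first and second name
first <- pmin(raw_results$from, raw_results$to)
second <- pmax(raw_results$from, raw_results$to)
raw_results$from <- first
raw_results$to <- second
raw_results
```

Both results are computed before either column is overwritten, so the second comparison still sees the original values.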

Now we are ready to generate our ‘weight’ column by season, which is pretty simple now:

# add up scenes to form season weights
edges <- raw_results %>% 
  dplyr::count(season, from, to, name = "weight")
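To make the counting step concrete, here is what it does on a tiny made-up table. The object names toy_results and toy_edges are hypothetical and chosen so that they do not clash with the real data.

```r
library(dplyr)

# Made-up pairs, already ordered alphabetically within each row
toy_results <- data.frame(
  season = c(1, 1, 1, 2),
  from = c("Chandler", "Chandler", "Rachel", "Chandler"),
  to = c("Monica", "Monica", "Ross", "Monica"),
  stringsAsFactors = FALSE
)

# one row per season/pair combination, with the number of occurrences as 'weight'
toy_edges <- toy_results %>%
  dplyr::count(season, from, to, name = "weight")

toy_edges
```

Chandler and Monica appear twice in season 1 of the toy data, so that row gets a weight of 2, while the other two rows get a weight of 1.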

And we can take a quick look at a sample of our edges dataframe. It has many thousands of rows, as you might expect, but we would anticipate pretty high weights between the six major characters:

friends <- c("Phoebe", "Monica", "Rachel", "Joey", "Ross", "Chandler")

edges %>% 
  dplyr::filter(season == 1,
                from %in% friends,
                to %in% friends) 

This returns:

season  from      to      weight
1       Chandler  Joey    133
1       Chandler  Monica  104
1       Chandler  Phoebe  100
1       Chandler  Rachel  100
1       Chandler  Ross    115
1       Joey      Monica  109
1       Joey      Phoebe  93
1       Joey      Rachel  101
1       Joey      Ross    112
1       Monica    Phoebe  113
1       Monica    Rachel  129
1       Monica    Ross    99
1       Phoebe    Rachel  113
1       Phoebe    Ross    101
1       Rachel    Ross    110

OK, so we have our edgelist and we are now ready to move on to the network analysis section of this project, where we will look at the communities of the six major characters and visualize how they change from season to season. Look out for this in an upcoming post.
