Scraping Structured Data From Semi-Structured Documents

One of the most powerful capabilities that data science tools bring to the table is the capacity to take unstructured data and turn it into something structured that can be analyzed. Any data scientist worth their salt should be able to ‘scrape’ data from documents, whether on the web, stored locally, or in any other text-based form.

In this article I am going to show how to scrape the online scripts of the TV Show Friends with the aim of creating a table of numbered scenes for each episode and the names of the characters in those scenes. This was the first step in my build of the interactive Friends network website here. It is also an excellent test case for scraping techniques, since the scripts all have a certain basic structure to them, but the structure is not entirely predictable, with the writers not always following the same formatting rules from episode to episode.

You can find all the final code for this work on the Github repo for my Friends project.

The scripts that I want to scrape can be found here. If you click on each one you can see that the file name has a consistent format of the form https://fangj.github.io/friends/season/[ss][ee].html, where [ss] is the two-digit season number and [ee] is the two-digit episode number.
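Since the URL pattern is so regular, it can be built programmatically. Here is a minimal sketch (build_url is just an illustrative helper name, not part of the final code) using sprintf() to zero-pad the season and episode numbers:

```r
# Build the script URL for a given season and episode, zero-padding
# each number to two digits with sprintf()
build_url <- function(season, episode) {
  sprintf("https://fangj.github.io/friends/season/%02d%02d.html",
          season, episode)
}

build_url(1, 1)  # "https://fangj.github.io/friends/season/0101.html"
```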

To scrape effectively, you’ll need knowledge of two things:

  1. The relevant scraping packages in your chosen language. In R, which I will use here, those packages are rvest and xml2. In Python, most people use Beautiful Soup. If you want an in depth tutorial on how to use rvest and xml2 to scrape web pages, see my previous article here, which I suggest you read before moving forward with this article.
  2. An ability to use regex (short for regular expressions), which is the language of text search. We will focus more on regex in this article. The syntax can differ a little across platforms, but it’s mostly very similar. A great cheatsheet for regex in R is here. Regex is not an easy topic to master (there are very few true regex experts in this world), but most data scientists know enough to perform the most common tasks, and search online for the rest.
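As a quick taste of what regex looks like in R, here are the two functions we will lean on throughout this article (the example strings are my own, but the pattern is the one used later):

```r
library(stringr)

# grepl() tests whether a pattern occurs anywhere in a string
grepl("Scene:", "[Scene: Central Perk]")  # TRUE

# str_extract() pulls out the matching text itself: here, the shortest
# run of characters that is followed by a colon
stringr::str_extract("Monica: Hi everyone!", ".+?(?=:)")  # "Monica"
```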

Looking at the Friends scripts to work out what we need to do

Let’s remind ourselves of our task here. We are interested in two things:

  • We want to break an episode into numbered scenes
  • We want to list all the characters which appear in each scene

Let’s take a look at the web code for Season 1 Episode 1. You can do this by opening the script in Google Chrome and then pressing CMD+Option+C (or Ctrl+Shift+C in Windows) to open the Elements Console where you can view the HTML of the page side by side with the page itself.

One of the things we can see immediately is that most of the words preceding a colon are of interest to us: most of them are the names of characters who say something in a scene. We also see that lines that contain the string "Scene:" are pretty reliable indicators of scene boundaries.

The first thing we probably want to do is get this HTML code into a list or vector of nodes which represent the different pieces of formatting and text in the document. Since this will contain the separate lines spoken by each character, it will be a really helpful structure to work from. So we will use our nice scraping packages to download the HTML code and break it into nodes, giving us a nice tidy vector of script content.

library(rvest) # (also automatically loads xml2)

url_string <- "https://fangj.github.io/friends/season/0101.html"

nodes <- xml2::read_html(url_string) %>% 
      xml2::as_list() %>% 
      unlist()

Now if you take a look at your nodes object, you’ll see that it is a character vector containing a lot of different split-out parts, but most importantly it contains lines from the script, for example:

> nodes[16]
                                                      html.body.font.p 
"[Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]" 

Using regex to pull out what we need

To be useful for our task, we need to create a vector that contains the word ‘New Scene’ if the line represents the beginning of a scene, and the name of the character if the line represents something spoken by a character. This will be the best format for what we want to do.

The first thing we will need to do is swap any text string containing "Scene:" to the string "New Scene". We can do this quite simply using an ifelse() on the nodes vector, where we use grepl() to identify which entries in nodes contain the string "Scene:".

nodes <- ifelse(grepl("Scene:", nodes), "New Scene", nodes)

We can quickly check that there are entries that are now "New Scene":

> nodes[nodes == "New Scene"]
 [1] "New Scene" "New Scene" "New Scene" "New Scene" "New Scene" "New Scene" "New Scene" "New Scene"
 [9] "New Scene" "New Scene" "New Scene" "New Scene" "New Scene" "New Scene" "New Scene"

That worked nicely. You might also have noticed that character names precede a colon, so that might be a nice way to extract them (although it will also give us a few other strings that precede colons in the script, which we can deal with later).

So what we will do is use regex to tell R that we are looking for anything preceding a colon. We will use a lookahead pattern as follows: ".+?(?=:)".

Let’s look at that string and make sure we know what it means. The .+? component means ‘match one or more characters, as few as possible’ (the ? makes the match lazy). The part in brackets is known as a lookahead: it requires that the next character after the match is a colon, without including the colon in the result. Together, this instructs R to find the shortest string of text that precedes a colon and return it. If we use the package stringr and its function str_extract() with this regex string, it will go through every entry of the nodes vector and return just the first string of text found in front of a colon, or an NA value if no colon is found. This is great for us, because speaking characters’ names always appear at the start of a node, so we certainly won’t miss any if we just take the first match in each line. We should also, for safety, not mess with the scene breaks we have put into our vector:

library(stringr)

nodes <- ifelse(nodes != "New Scene", 
                stringr::str_extract(nodes, ".+?(?=:)"), 
                nodes)

Let’s take a look at what’s in our nodes vector now:

> nodes[sample(100)]
  [1] "Joey"                       NA                           "Chandler"                  
  [4] NA                           "Phoebe"                     "All"                       
  [7] "Chandler"                   NA                           NA                          
 [10] NA                           NA                           NA                          
 [13] NA                           "Joey"                       "Chandler"                  
 [16] NA                           NA                           NA                          
 [19] NA                           "Transcribed by"             NA                          
 [22] "Joey"                       NA                           NA                          
 [25] NA                           NA                           NA                          
 [28] NA                           NA                           NA                          
 [31] "Phoebe"                     NA                           NA                          
 [34] "Monica"                     NA                           NA                          
 [37] NA                           NA                           NA                          
 [40] NA                           "Additional transcribing by" NA                          
 [43] "(Note"                      NA                           NA                          
 [46] NA                           NA                           NA                          
 [49] NA                           NA                           NA                          
 [52] NA                           "Monica"                     NA                          
 [55] NA                           NA                           "New Scene"                 
 [58] "Monica"                     NA                           "Ross"                      
 [61] NA                           NA                           NA                          
 [64] NA                           NA                           "Chandler"                  
 [67] "Joey"                       NA                           NA                          
 [70] "Chandler"                   "Chandler"                   NA                          
 [73] NA                           NA                           NA                          
 [76] NA                           "Written by"                 "Monica"                    
 [79] NA                           NA                           NA                          
 [82] "Ross"                       "Joey"                       "Monica"                    
 [85] NA                           NA                           NA                          
 [88] NA                           NA                           "Chandler"                  
 [91] "Phoebe"                     NA                           NA                          
 [94] "Chandler"                   NA                           NA                          
 [97] NA                           NA                           NA                          
[100] NA 

So this is working, but we have more cleaning to do. For example, we will want to get rid of the NA values. We also notice that there are some preamble lines, which usually contain the word “by”, and that strings in brackets like “(Note” seem to have been extracted. We can create a bunch of special cleaning commands to get rid of these if we don’t want them. For example:

nodes <- nodes[!is.na(nodes)] # remove NAs

# remove entries containing "by", "(" or "all", regardless of case
nodes <- nodes[!grepl("by|\\(|all", tolower(nodes))] 

Let’s look at a sample:

> nodes[sample(10)]
 [1] "Phoebe"    "Monica"    "New Scene" "Chandler"  "Chandler"  "Joey"      "Chandler"  "Monica"   
 [9] "Chandler"  "Phoebe" 

This is looking good. Of course the cleaning steps I did above are not complete, and the pattern in !grepl("by|\\(|all", tolower(nodes)) will be expanded as we do more and more scraping, to account for common parsing failures. But you get the idea. Note that when you use characters in regex strings which are also ‘special characters’, you’ll need to escape them by putting \\ in front of them. What I mean is: R needs to know whether a ( literally means you are looking for an opening bracket, or whether you are starting a lookahead command or something like that. So if you want a special character to be taken literally (yes, I really want a literal opening bracket), you should write it in regex as \\(. Consult the cheatsheet linked above or look online for lists of special characters in regex.
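To make the escaping rule concrete, here is a small illustration (the example string is my own) of \\( matching a literal opening bracket:

```r
library(stringr)

# "\\(" matches a literal "(" rather than opening a regex group
stringr::str_detect("(Note: off camera)", "\\(")  # TRUE

# "\\(\\w+" matches a literal "(" followed by one or more word characters
stringr::str_extract("(Note: off camera)", "\\(\\w+")  # "(Note"
```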

Organizing our output

Let’s assume our cleaning is done and we have a nice vector that contains either the names of characters that are speaking lines in the episode or “New Scene” to indicate that we are crossing a scene boundary. We now just need to convert this vector into a simple data frame with three columns: episode, scene and character.

The episode number is obvious, and we already have our character lists, so we really just need to iterate through our nodes vector and for each entry, count the number of previous occurrences of “New Scene” and add one. We can do this with:

  # number each scene
  scene_count <- c()
  
  for (i in 1:length(nodes)) {
    scene_count[i] <- sum(grepl("New Scene", nodes[1:i])) + 1
  }
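As an aside, the same count can be computed without a loop. Here is a vectorized sketch on a toy vector of my own; it uses == rather than grepl(), which is safe here because the boundary entries are exactly "New Scene":

```r
# Toy stand-in for the cleaned nodes vector
nodes <- c("Monica", "Ross", "New Scene", "Rachel", "New Scene", "Joey")

# cumsum() over the logical vector gives a running count of scene
# boundaries seen so far (including the current entry), matching the loop
scene_count <- cumsum(nodes == "New Scene") + 1
scene_count  # 1 1 2 2 3 3
```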

Then we can finalize our dataframe by putting our three vectors together and removing any repeated characters in the same scene. We can also correct for situations where the script starts with a New Scene and we can consistently format our character names to title case, to account for different case typing:

library(tidyverse)

results <- data.frame(episode = 1, scene = scene_count, character = nodes) %>% 
    dplyr::filter(character != "New Scene") %>% 
    dplyr::distinct(episode, scene, character) %>% 
    dplyr::mutate(scene = scene - min(scene) + 1, # set first scene number to 1
                  character = character %>% tolower() %>% tools::toTitleCase()) # title case

Let’s take a look at some example output from our results dataframe:

episode  scene  character
1        1      Monica
1        1      Joey
1        1      Chandler
1        1      Phoebe
1        1      Ross
1        1      Rachel
1        1      Waitress
1        2      Monica
1        2      Chandler
1        2      Ross
1        2      Rachel
1        2      Phoebe
1        2      Joey
1        2      Paul
1        3      Phoebe
1        4      Ross
1        4      Joey
1        4      Chandler

Generalizing into a function for every season and episode

Of course, our aim here is not to scrape just the first episode, but to scrape every episode of every season. A good data scientist will now look to generalize the work they have done and put it into a function that will accept a season number and an episode number and then scrape that episode and deliver back the data frame we just constructed.

If we look at the format of Season 1, Episode 1, we will see that it is repeated through most – though not all – episodes. There are some exceptions:

  • Some episodes in Season 2 (from Episode 3) have different formatting
  • All of Season 10 appears to use different HTML, where the formatting happens before, not after, the colon.

So it will be necessary to expand our code above with if else statements to allow for situations where we know the format will be a little different. We will also need to expand our cleanup as much as we can to account for things happening with our parsing of other episodes which we don’t anticipate. It’s fairly normal for there to be a little bit of human effort involved in scraping text and cleaning up for unanticipated results. It’s also quite common for the end result not to be 100% error free. The data scientist has to determine when the result is close enough to what they need, and when cleaning up the remaining 1-2% of errors is not worth the effort involved in doing so.

Here is my final episode scraping function which I’m happy does the job I need. You’ll see the code anchor around the work we did above, but expand it to work on all seasons and episodes.

library(rvest)
library(stringr)

scrape_friends <- function(season = 1, episode = 1) {
  
  # some episodes are double episodes
  
  if (season == 2 & episode == 12) {
    url_string <- "https://fangj.github.io/friends/season/0212-0213.html"
  } else if (season == 6 & episode == 15) {
    url_string <- "https://fangj.github.io/friends/season/0615-0616.html"
  } else if (season == 9 & episode == 23) {
    url_string <- "https://fangj.github.io/friends/season/0923-0924.html"
  } else if (season == 10 & episode == 17) {
    url_string <- "https://fangj.github.io/friends/season/1017-1018.html"
  } else {
    url_string <- paste0("https://fangj.github.io/friends/season/", 
                         sprintf("%02d", season), 
                         sprintf("%02d", episode), 
                         ".html")
  }
  
  # general html for seasons 1:9 is different from season 10
  
  if (season %in% 1:9) {
    nodes <- xml2::read_html(url_string) %>% 
      xml2::as_list() %>% 
      unlist()
    
    nodes <- ifelse(grepl("Scene:", nodes), "New Scene", nodes) # mark scene boundaries
    
    # season 2 has some weirdly formatted episodes
    if (season == 2 & episode %in% 3:25) {
      nodes <- ifelse(nodes != "New Scene", 
                      # look for caps preceding a colon 
                      stringr::str_extract(nodes, "[[[:upper:]][[:punct:]]\\s]+(?=:)") %>% 
                        tolower() %>% 
                        tools::toTitleCase(),
                      nodes)
    } else {
      nodes <- ifelse(nodes != "New Scene", 
                      stringr::str_extract(nodes, ".+?(?=:)"), # anything preceding a colon
                      nodes)
    }
    
  } else {
    # season 10
    nodes <- xml2::read_html(url_string) %>% 
      rvest::html_nodes("p") %>% 
      rvest::html_text() # anything in paragraph tags
    
    nodes <- ifelse(grepl("Scene:", nodes), "New Scene", nodes)
    
    nodes <- ifelse(nodes != "New Scene", 
                    stringr::str_extract(nodes, ".+?(?=:)"), # anything preceding a colon
                    nodes)
    
  }
  

  # manual leaveouts and replacements - gets us 98% of the way I reckon 
  nodes <- nodes[!is.na(nodes)]
  nodes <- trimws(nodes)
  nodes <- nodes[!grepl("/| and |all|everybody|&|by|position|aired|both|,|from|at 8|end|time|commercial|\\(|\\[|letters| to |it's|it was|kudrow|perry|cox|aniston|schwimmer|leblanc|look|could|walks|everyone|teleplay|story|together", tolower(nodes))]
  nodes <- nodes[nchar(nodes) < 20]
  nodes <- nodes[!grepl("^[a-z]|^[0-9]|^[[:punct:]]", nodes)]
  nodes <- gsub("<b>|\n", "", nodes)
  nodes <- gsub("\u0092", "'", nodes)
  nodes <- gsub("Mr.Heckles", "Mr. Heckles", nodes)
  nodes <- gsub("father", "dad", nodes)
  nodes <- gsub("mother", "mom", nodes)
  nodes <- gsub("Mr. ", "Mr ", nodes)
  nodes <- gsub("Mrs. ", "Mrs ", nodes)
  nodes <- gsub("Ms. ", "Ms ", nodes)
  nodes <- gsub("Dr. ", "Dr ", nodes)
  nodes <- ifelse(nodes == "r Zelner", "Mr Zelner", nodes)
  nodes <- ifelse(nodes == "Mnca", "Monica", nodes)
  nodes <- ifelse(nodes == "Phoe", "Phoebe", nodes)
  nodes <- ifelse(nodes == "Rach", "Rachel", nodes)
  nodes <- ifelse(nodes == "Chan", "Chandler", nodes)
  nodes <- ifelse(nodes == "Billy", "Billy Crystal", nodes)
  nodes <- ifelse(nodes == "Robin", "Robin Williams", nodes)
  nodes <- ifelse(tolower(nodes) == "amger", "amber", nodes)
  nodes <- ifelse(tolower(nodes) == "gunter", "gunther", nodes)
  
  # number each scene
  scene_count <- c()
  
  for (i in 1:length(nodes)) {
    scene_count[i] <- sum(grepl("New Scene", nodes[1:i])) + 1
  }
  
  data.frame(episode = episode, scene = scene_count, character = nodes) %>% 
    dplyr::filter(character != "New Scene") %>% 
    dplyr::distinct(episode, scene, character) %>% 
    dplyr::mutate(scene = scene - min(scene) + 1, # set first scene number to 1
                  character = character %>% tolower() %>% tools::toTitleCase()) # title case
  
  
}

Now we can scrape for example Season 9 Episode 2 with this simple command:

scrape_friends(9, 2)

This function will become super useful to us as we move to try to create edgelists for the network of characters based on them appearing in the same scene together. Look out for this in an upcoming post.
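To preview that idea, here is a minimal sketch (my own illustration, not the final implementation) of how a scene-level character table could become an edgelist: join the table to itself on episode and scene, then keep each unordered pair of characters once:

```r
library(dplyr)

# Toy stand-in for the scraped results data frame
results <- data.frame(
  episode   = 1,
  scene     = c(1, 1, 1, 2, 2),
  character = c("Monica", "Ross", "Rachel", "Joey", "Chandler")
)

# Self-join on episode and scene pairs up characters in the same scene;
# character.x < character.y keeps each unordered pair exactly once
edgelist <- results %>% 
  dplyr::inner_join(results, by = c("episode", "scene"),
                    relationship = "many-to-many") %>% 
  dplyr::filter(character.x < character.y) %>% 
  dplyr::distinct(character.x, character.y)
```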
