Get any US music chart listing from history in your R console

We are lucky enough to live in an age where we can get pretty much any factoid we want. If we want to find out the top Billboard 200 albums from 1980, we just need to go to the official Billboard 200 website, enter the date, and the list comes up in a polished display with album art and all the other trimmings.

But often we don’t care about the presentation, and we don’t want to visit a website and click through several pages to get the information we need. Wouldn’t it be great if we could just get it in our console with a simple function or command?

Well, if the website is well structured and drawing its data from a structured dataset, then in all likelihood you can scrape it, meaning that you can extract the precise information you want into a vector or table and then conduct whatever analysis you like.

In this article we are going to review elementary web scraping in R using the packages rvest and xml2. These packages are remarkably easy to use. By the end of the article we will have created a function called get_chart() which takes a date, a chart version and a vector of ranked positions as its arguments and instantly returns the chart entries in those positions on that date. I hope this will encourage you to try it on countless other sources of web data.

For this tutorial you will need to have installed the dplyr, xml2 and rvest packages. You will also need to be using the Google Chrome browser.

Getting started — Scraping this week’s Billboard Hot 100

We will start by working out how to scrape this week’s Billboard Hot 100 to get a ranked list of artists and titles. If you take a look at the Hot 100 page at https://www.billboard.com/charts/hot-100 you can observe its overall structure. It has various banners and advertising and there’s quite a lot going on, but you can immediately see that there is a Hot 100 list on the page, so we know that the information we want is on this page and we will need to navigate the underlying code to find it.

The packages rvest and xml2 in R are designed to make it easy to extract and analyse deeply nested HTML and XML code that sits behind most websites today. HTML and XML are different — I won’t go into the details of that here — but you’ll usually need rvest to dig down and find the specific HTML nodes that you need and xml2 to pull out the XML attributes that contain the specific data you want.

After we load our packages, the first thing we want to do is read the HTML from the web page, so we have a starting point for digging down to find the nodes and attributes we want.

# required libraries
library(rvest)
library(xml2)
library(dplyr)
# get url from input
input <- "https://www.billboard.com/charts/hot-100"
# read html code from url
chart_page <- xml2::read_html(input)

Now we have a list object chart_page which contains two elements, one for the head of the webpage and the other for the body of the webpage.
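If you want to confirm that structure, a quick optional peek at the document's top-level children in the console will show the two nodes (this check isn't part of the main scrape, just a sanity check):

# optional: inspect the top-level children of the parsed page
xml2::xml_children(chart_page)
# typically returns a nodeset of two elements: <head> and <body>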

We now need to inspect the page using Chrome. Right-click on the page and choose ‘Inspect’. This brings up a panel showing all of the nested HTML and XML code. As you move your mouse over this code, the part of the page it refers to is highlighted. For example, you can see that the section we are interested in lights up when you mouse over the <div class="container chart-container ..."> element, which makes sense.

If you continue to expand this section and follow this method, you will eventually come to the specific HTML and XML code that populates the chart list. If you look closely enough you will see that the chart items all have the format <div class='chart-list-item' .... We can use rvest‘s html_nodes() function to dive into the body of the page and then use xml2’s xml_find_all() function to grab all <div> nodes that have chart-list-item as their class.

# browse nodes in body of article
chart <- chart_page %>% 
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//div[contains(@class, 'chart-list-item')]")
View(chart)

This gives us a nested list of nodes which we can click on and browse through in the viewer.

Looking more closely, we notice that the XML class that actually contains the data we are interested in has a space after chart-list-item, so if we rewrite our previous command to include that trailing space, it should parse out exactly the nodes holding the data we want. Then we can use xml2’s xml_attr() function to pull out the rank, artist and title into vectors.

# scrape data
chart <- chart_page %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//div[contains(@class, 'chart-list-item ')]")
# get rank, artist and title as vectors
rank <- chart %>% 
  xml2::xml_attr('data-rank')

artist <- chart %>%
  xml2::xml_attr('data-artist')

title <- chart %>%
  xml2::xml_attr('data-title')
# create dataframe, remove NAs and return result
chart_df <- data.frame(rank, artist, title)
chart_df <- chart_df %>%
  dplyr::filter(!is.na(rank))
View(chart_df)

And there we have it, a tidy table of exactly what we want, and it’s good to confirm that there are 100 rows as expected.
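If you prefer a console check to the data viewer, a couple of quick commands (a minimal sanity check, not essential to the scrape) will confirm the shape of the result:

# quick sanity check on the scraped chart
nrow(chart_df)    # expect 100 rows for the Hot 100
head(chart_df)    # first few rank / artist / title entries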

Generalizing to pull any chart from any date

So that was a fair amount of investigative work, and digging into HTML and XML can be fiddly. There are Chrome plugins like SelectorGadget that can help with this, but I find them unpredictable and prefer to investigate the underlying code myself, as above.

Now that we know where the data sits, we can make this a lot more powerful. If you play with the billboard.com website, you’ll notice that you can get to a specific chart on any historic date simply by editing the URL. For example, if you want to see the Billboard 200 as of 22nd March 1983, you just go to https://www.billboard.com/charts/billboard-200/1983-03-22.
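As a minimal sketch of that URL pattern (the chart slug and YYYY-MM-DD date format are taken from the example above), you can paste together the address for any chart and date and read it exactly as before:

# build and read the chart page for an arbitrary date and chart type
date <- "1983-03-22"
type <- "billboard-200"
input <- paste0("https://www.billboard.com/charts/", type, "/", date)
chart_page <- xml2::read_html(input)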

This allows us to take the code above and easily generalize it by creating a function that accepts the date, chart type and positions we are interested in. Let’s write that function with sensible defaults: today’s date, the Hot 100 chart, and the top 10 positions.

get_chart <- function(date = Sys.Date(), positions = 1:10, type = "hot-100") {
  # get url from input and read html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date)
  chart_page <- xml2::read_html(input)
  # scrape data
  chart <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//div[contains(@class, 'chart-list-item ')]")
  rank <- chart %>% 
    xml2::xml_attr('data-rank')

  artist <- chart %>%
    xml2::xml_attr('data-artist')

  title <- chart %>%
    xml2::xml_attr('data-title')
  # create dataframe, remove NAs and return result
  chart_df <- data.frame(rank, artist, title)
  chart_df <- chart_df %>%
    dplyr::filter(!is.na(rank), rank %in% positions)
  chart_df
}

OK, let’s test our function. What were the Top 20 singles on 22nd March 1983?
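Assuming the chart pages for historic dates use the same markup we scraped above, the call looks like this:

# Top 20 of the Hot 100 for 22nd March 1983
get_chart(date = "1983-03-22", positions = 1:20, type = "hot-100")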

What were the Top 10 albums on 1st April 1970?
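For the albums chart, we just switch the type argument to the Billboard 200 slug:

# Top 10 of the Billboard 200 for 1st April 1970
get_chart(date = "1970-04-01", positions = 1:10, type = "billboard-200")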

What I love about rvest and xml2 is how simple and powerful they are. Look how lean the content of the function is; it didn’t take much to create something quite useful. Give it a try with some other sources of web data, and feel free to add to the GitHub repo here if you create any other cool scraping functions.

You can find out more about rvest here and xml2 here.
