I wanted to let people know that my new book Handbook of Regression Modeling in People Analytics is now available.
With recent major updates to two of its core packages, the tidyverse now offers substantially more flexible options for data wrangling. Here are five examples of what I mean.
In this article I will use the community detection capabilities in the igraph package in R to show how to detect communities in a network. By the end of the article we will be able to see how the Louvain community detection algorithm breaks up the Friends characters into distinct communities (ignoring the obvious community of the six main characters), and if you are a fan of the show you can decide whether this analysis makes sense to you.
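To show the mechanics before applying them to the Friends data, here is a minimal sketch using a toy graph rather than the show's network: two small cliques joined by a single edge, which the Louvain algorithm should separate into two communities.

```r
library(igraph)

# Toy example (not the Friends data): two five-node cliques
# joined by one edge, which Louvain should split into two communities
g <- make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6))

communities <- cluster_louvain(g)
membership(communities)  # community assignment for each vertex
```

With the real data, the vertices would be characters and the edges would be weighted by how often they appear in scenes together.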
dplyr is now much more friendly to row-by-row operations
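The key addition here is rowwise(), which treats each row as its own group so that ordinary aggregate functions operate across columns within a row. A minimal sketch:

```r
library(dplyr)

# rowwise() makes each row its own group, so mean() here
# averages across the columns of a single row
df <- tibble(x = 1:3, y = 4:6, z = 7:9)

df %>%
  rowwise() %>%
  mutate(row_mean = mean(c(x, y, z))) %>%
  ungroup()
```

Previously this kind of operation typically required apply() or an awkward mutate() with pmap(), so this is a genuine quality-of-life improvement.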
summarise() – the original workhorse of dplyr – has been made even more flexible in the new release.
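One example of that flexibility: in dplyr 1.0, summarise() can return more than one value per group, rather than being restricted to a single summary statistic. A short sketch:

```r
library(dplyr)

# summarise() in dplyr 1.0 can return multiple rows per group,
# e.g. the lower and upper quartiles of mpg for each cylinder count
mtcars %>%
  group_by(cyl) %>%
  summarise(q = quantile(mpg, c(0.25, 0.75)), .groups = "drop")
```

The new .groups argument also lets you state explicitly how the result should be grouped, silencing the familiar regrouping message.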
In this article I want to highlight one of the key developments of this release – the across() function.
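across() lets you apply the same function to several columns inside mutate() or summarise(), replacing the older scoped variants like summarise_if() and mutate_at(). A minimal sketch:

```r
library(dplyr)

# across() applies one function to several columns at once,
# here taking the mean of three columns within each group
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp, wt), mean), .groups = "drop")
```

You can also select columns by predicate, e.g. across(where(is.numeric), mean), which covers most of what the old _if variants did.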
As you develop as a programmer, there are common situations you will find yourself in. One of those situations is where you need to run your code over a number of iterations of one or more loops, and where you know that your code may fail for at least one iteration. You don’t want your code to stop completely, but you do want to know that it failed and log where it happened. I am going to show a simple example of how to do this here.
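In R, the standard tool for this is tryCatch(). The sketch below uses a hypothetical risky_step() function standing in for whatever might fail; the pattern is to catch the error, log which iteration it happened on, and carry on with the loop.

```r
# Hypothetical function standing in for code that may fail
risky_step <- function(i) {
  if (i %% 4 == 0) stop("bad input: ", i)
  i^2
}

results <- vector("list", 10)
failed <- integer(0)

for (i in 1:10) {
  results[[i]] <- tryCatch(
    risky_step(i),
    error = function(e) {
      # Log the failure and keep going
      message("Iteration ", i, " failed: ", conditionMessage(e))
      failed <<- c(failed, i)
      NULL
    }
  )
}

failed  # indices of the iterations that failed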
One of the most powerful capabilities that data science tools bring to the table is the capacity to take unstructured data and turn it into something structured and analyzable. Any data scientist worth their salt should be able to 'scrape' data from documents, whether from the web, from local files, or from any other text-based asset.
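In R, the rvest package is the usual starting point for web scraping. A minimal sketch, where the URL and the CSS selector are placeholders for whatever page and elements you actually want to extract:

```r
library(rvest)

# Placeholder URL and selector: read a page, pull out all h1
# elements, and extract their text content
page <- read_html("https://example.com")

headings <- page %>%
  html_elements("h1") %>%
  html_text2()

headings
```

The same select-then-extract pattern scales up: swap in html_table() for tabular data or html_attr() to pull out links.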
Everyone is now calling themselves a Data Scientist. No matter what position I am hiring for, that term appears on over 80% of the resumes I look at. It has actually made me start to ignore the term, because it is no longer a differentiator of talent.