What Does It Take To Be a Qualified Data Scientist?

I am excited by how many students and professionals are learning programmatic Data Science tools nowadays. More and more, schools and colleges are adding practical components to erstwhile theoretical courses, where people need to get to grips with Python or R and try to work with real-life datasets.

But this causes another problem: everyone is now calling themselves a Data Scientist. No matter what position I am hiring for, that term is on over 80% of the resumes I look at. It has actually made me start to ignore the term because it is not a differentiator of talent any more.

So this raises the question: when does someone actually become a Qualified Data Scientist?

The definition of a Qualified Data Scientist

Based on my personal experience, one strong indicator of a Qualified Data Scientist (QDS) is when they can accurately estimate how long it will take them to complete a non-trivial piece of work that contains elements unfamiliar to them. To be able to do this you need to:

  1. Have the knowledge to work out which steps will be needed to do the work, which of these steps you can already do and which you need to research further
  2. Have the ability to do the steps you think you can do without unexpected issues
  3. Have the confidence to learn the parts you don’t know in the time you have estimated

If I were to design some sort of certification exam for a QDS award, it would be open-book, probably somewhere between 4 hours and a full day, and it would involve a choice of novel problems to be solved and chunky datasets to work with. The problems would be set up in steps. Some of the early steps would need a strong, fluent knowledge and rapid recall of the core of the chosen language. Some of the later steps would require packages, modules and methods that are not regarded as part of the core of the language, so the examinee would need to conduct online research in order to complete them. Extra credit would be given for efficient and elegant approaches.

The steps to becoming a Qualified Data Scientist

What would be the steps involved in studying to become a QDS? I believe at an absolute minimum it would take a year of full-time study, over two intense semesters, and would have several progressive learning modules with accompanying practical components:

Semester 1:

  1. Elementary mathematics and statistics: This would ensure that people understand some of the things they are coding. It doesn’t have to be super complex, but it should cover data types and structures, statistical aggregations and descriptors, measures of error and accuracy in data, discrete mathematics and a few other concepts.
  2. Language basics: This would teach the basic components of how to manipulate data in the chosen language and relate strongly to the concepts learned in 1. Students should be given exercises where they have to perform manipulations and operations in their chosen language and then explain the mathematical meaning of what they just did.
  3. Operating system management: Students should be given experience of both Unix and Windows environments, using the command line and understanding environment variables, permissions and numerous other concepts which play a role in how their language interacts with its platform. Setting up core software, IDEs and integrations should be an important part of this module.
  4. Project work: Modules 1 through 3 should be followed by substantial project work, with the aim of ensuring frequent coding practice, experience at discovering and resolving errors, using version control and producing high-quality output and results. Students should be encouraged to access community resources like Stack Overflow and learn how to appropriately interact with those communities.
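
As an illustration of the kind of exercise module 2 has in mind, here is a minimal sketch, assuming Python is the chosen language. The data values and the function names `describe` and `rmse` are invented for this example. The code computes a few descriptors and one measure of error, and the comments state the mathematical meaning of each operation.

```python
import math
import statistics

# Hypothetical observations and predictions, invented for illustration only.
observed = [12.0, 15.0, 11.0, 14.0, 18.0]
predicted = [11.0, 16.0, 12.0, 13.0, 17.0]


def describe(values):
    """Return basic descriptors of a sample."""
    return {
        # Arithmetic mean: the sum of the values divided by their count.
        "mean": statistics.mean(values),
        # Median: the middle value once the sample is sorted.
        "median": statistics.median(values),
        # Sample standard deviation: square root of the sum of squared
        # deviations from the mean, divided by (n - 1).
        "stdev": statistics.stdev(values),
    }


def rmse(actual, forecast):
    """Root mean squared error: square root of the mean squared difference."""
    if len(actual) != len(forecast):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(statistics.mean(squared_errors))


summary = describe(observed)
error = rmse(observed, predicted)
print(summary)
print(error)
```

A student would then be asked to explain, for instance, why the standard deviation divides by n - 1 rather than n, or what a given RMSE means in the units of the data.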

Semester 2:

  1. Explanatory and Predictive Modelling: This should teach the theoretical difference between modelling to explain a phenomenon versus modelling to predict a phenomenon, the different design choices that can be made for each and the most typical methods/algorithms used in practice.
  2. Common algorithms, methods and tools: Students should be exposed to the most common cross-platform algorithms, what they are appropriate for (relating back to 1), and how to execute them in their chosen language.
  3. Code abstraction: Students should learn the importance of being DRY (Don’t Repeat Yourself), become comfortable writing functions and understand the value of abstraction.
  4. Development: Students should be taught the basic steps and principles of software development and shown the key resources for development in their chosen language. They should be encouraged to participate in language development communities.
  5. Debugging: Students will need to build the confidence to handle errors. Systematic debugging processes should be taught, with the exploration of typical error messages and tracebacks. Students should be exposed to errors generated both from within the language and from how the language interacts with the operating system.
  6. More project work: Semester 2 project work should focus on the QDS exam. It should continue to encourage rapid, confident coding in core language features, while introducing problems that require research into unfamiliar methods, where some code abstraction will make the work more efficient and where some debugging is likely to be needed.
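
To make modules 3 and 5 a little more concrete, here is a small sketch, again assuming Python as the chosen language. The `sales` data and the functions `percent_change` and `summarise` are hypothetical. The calculation is written once as a function instead of being repeated per region, and errors are caught and recorded so that their messages can be read and reasoned about.

```python
import traceback

# Hypothetical monthly figures, invented for illustration only.
sales = {"north": [120, 135, 150], "west": [0, 40, 60], "east": []}


def percent_change(series):
    """Percent change from the first to the last value of a series."""
    if len(series) < 2:
        raise ValueError("need at least two values")
    first, last = series[0], series[-1]
    return (last - first) / first * 100


def summarise(regions):
    """Apply percent_change to every region, recording failures instead of crashing."""
    results, failures = {}, {}
    for name, series in regions.items():
        try:
            results[name] = percent_change(series)
        except (ValueError, ZeroDivisionError) as exc:
            # Keep the final traceback line, which names the exception
            # type and its message.
            lines = traceback.format_exception_only(type(exc), exc)
            failures[name] = lines[-1].strip()
    return results, failures


results, failures = summarise(sales)
print(results)
print(failures)
```

Writing `percent_change` once means a fix, such as deciding how to treat a zero starting value, only has to be made in one place. The recorded messages show which regions failed and why.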

My main reason for writing this is to get some thinking out of my system and down on paper while it’s fresh, but I’d love to get some reactions from readers to this. Would a course/qualification of this kind of structure provide a better basis for hiring Data Scientists? Am I overshooting? What am I missing?

One thought on “What Does It Take To Be a Qualified Data Scientist?”

  1. Dr. McNulty, I have recently discovered your blogs and social media accounts and have been geeking out on them ever since! Some thoughts I had on this were:

    I liked the structure of the two semesters and felt like the flow worked well. What I have found in my own personal experience (I am finishing a Master’s in Data Science right now) is that my journey, and I would venture a lot of people’s journeys are similar, felt very disjointed, like a patchwork of skills.

    When I first started, I was all over the place with what skills I tried to learn; it was scrappy and informal. It seems this approach works, in that it gets the job done or answers the questions, but it is by no means a best practices approach. Then when I got into my master’s program it became more formal, but it still lacked the structure that you provided above. Because we got to choose our classes, our language and our operating systems, the professors and the new program had a hard time teaching us skills such as debugging for the language versus debugging for the operating system, software development, or even code abstraction. The program (and now myself) were focused more on the creation of a final product/project and less on how to do it effectively and efficiently. That is where I feel like I am on my data science journey: not necessarily in skill development (I mean we can always improve and learn new skills/packages/applications etc…) but in learning how to produce quality projects/products that are efficient with resources and implement “data science best practices”.

    The best practices are really what stand out when I look at your outline: code abstraction, and learning about software development and operating systems to better understand the debugging process and the applications of our work. This idea would probably generate a lot of interest; I know I am interested in it. And I think it would start to help with the distinction between people who can do data science-y things and data scientists: people who can solve ambiguous, complex problems in an efficient and complete way.
