How important is advanced SQL for data science?

Is advanced level SQL required to be competitive as a data scientist? Is it more important for a data analyst to be good in SQL? Is it enough to be able to extract data using simple SQL queries?

I know it is faster to manipulate data in SQL than to copy data into R or Python, but are there any other advantages or disadvantages?

Topic data career sql

Category Data Science


It is a very good point, and IMHO it is often overlooked and underestimated by Data Scientists. I have come to believe it strongly depends on the following variables that are largely intertwined (solely based on personal experience):

  • Domain (e-Commerce, Manufacturing, etc.):

I have witnessed that majority of e-Commerce, realtors, online businesses are cloud-based like AWS, Azure, and in particular Google Cloud Platform due to website tracking and BigQuery integration for fast query of large data. Often data stored there has a very nested (JSON style) that is meant to conveniently stores the content of the data and interactions of the users in a business website. Imagine if you wan to develop a new model e.g. working on a new concept that requires the raw data, you need to extract a data (if not flattened in a normal SQL database). Such SQL queries are not a simple 'SELECT X, Y from Table..'. You need to be at least have intermediate knowledge of SQL to be able to UNNEST, Aggregate, Merge, and things can get complex and confusing if you don't know what you are doing. In such cases, usually data is big enough that you cannot load it in Pandas. In contrast, in Manufacturing often one deals with IoT and machine data, and such complexity is not there!

  • Database infrastructure:

Of a general concern is how the database or datalake is. The problem, at least from DS's point of view, is that data is not stored for the purpose of building models or to be consumed by algorithms. Even for analytics, Analyst end up writing their custom SQL queries to import the data for graphs, A/B test, and dashboards. It would sound rather peculiar that all of a sudden someone else has to be a there for DSs to extract whatever data they fancy using. In a case whom is building the datalake, may foresee the potential DS use cases, your life may become easier, otherwise it will bounce back to yourself to import your data as you wish.

  • Team diversity and expectation:

While all of the above is true, it is also the case how the role is realized in the firm, which was pointed out in the earlier answer. With diverse set of mind, employers or team leads start building up the team. One place you have help to import such queries if needed, other place you are seen to do Data Analyst job, and even sometimes to some extent Data Engineer tasks. if over time you want to be in the group of candidates that is able to act independently if needed for the sake of Getting the Job Done, you are better off again with at least intermediate SQL skills.

  • ML Application:

Last but not the least, and in view obvious, but stating here for the completeness, is what project you are assigned to work on no matter which domain. For example, in the same retailer firm, you work with images, visual search, classification whatever, even if you are superman with SQL, it won't matter much. It is noted though, you are often assigned for lifetime working on a project, every now and then, you end up doing something completely different, i.e. 6-month visual search and similarities, then price optimization, then recommendation and so forth. It is best to be equipped with minimum technical competencies to perform well!


Takeaways: Depending on how these variables changes in where you work or what you do, you may need a beginner, intermediate or advanced knowledge of SQL as a Data Scientist. In most scenarios very basics of SQL that everyone can learn in less than a week won't be ideal and enough. You are better off with at least intermediate knowledge, which comes largely by practice rather internet browsing or online tutorials. If you look for an investment for future, definitively knowing at least intermediate knowledge skills of SQL queries will boost your career chances, more than you expect.

Update [06.05.2020]: I ran into an article by a data scientist/analyst at Google titled The Best Medium-Hard Data Analyst SQL Interview Questions, where he explains the motivation of knowing beyond intermediate SQL skills. In fact there are examples to refresh advanced SQL skills by going through some typical interview questions.

Update [13.05.2020]: I ran into a book N . that has lots of recipes that is meant for Professionals. I thought I will add this to the answer for those DS who are willing to take their SQL to the next level.


What do you mean exactly with competitive as a data scientist? Unfortunately, many employers will have different expectations of someone they hire to be a Data Scientist, so there isn't a single answer!

In any case, I think it is a good idea to know three components to be effective with databases:

  1. Managing a connection: how to create and connect to a database using a simply library in Python or R. This sort of workflow should look familiar to you:

    import mysql
    connection = mysql.connector.connect(
                     host="127.0.0.1",
                     user="mj_whales",
                     passwd="somethingsecret",
                 )
    
    query = "CREATE DATABASE my_table"
    cursor = connection.cursor()
    cursor.execute(query)
    
  2. Getting data: understand basic syntax of a standard SQL-style query, for example:

    FROM my_table SELECT my_feature WHERE my_other_feature > 0
    
  3. Writing data: understand how to insert data efficiently. This will depend on the database type e.g. document database, SQL-style, InfluxDB for time-series data.

Here is a really good introduction to databases with Python that might help you understand your current strengths and weaknesses on the subject.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.