What are the strategies to counter the 80/20 dilemma in Data Science projects?

Most of the time in Data Science projects is spent not on the actual analytics but on other tasks, such as organizing data sources, collecting samples, preparing datasets, and compiling and validating business rules in the data. This is often described as the 80/20 dilemma in Data Science projects.

To tackle this dilemma, I would like to ask: what strategies are used to reduce the 80% of time spent on the other stages (organizing data sources, collecting samples, preparing datasets, and compiling and validating business rules in the data)?



Some of these pain points are unavoidable. We are data scientists; we need lots of clean, relevant data, and outside sources don't generally just hand that over. Someone on your team will have to do the collection, organizing, cleaning, and so on. The question then is: how do we make it as painless as possible?

Know your people. You presumably have a team, and some people on it enjoy the analysis side of things while others prefer data engineering. Identify who is the best fit from both an enthusiasm and a skill perspective and delegate accordingly. If you miss on this step, the work may still get done, but it will be slow compared to an environment where people are passionate about the problem they are solving.

Know your tools. How much data are you working with? If it's small enough to fit into the working memory of your machine, just run some Python/R scripts. No need to be fancy when a simple solution will do. Make sure, though, to project into the future and verify your solution won't need to scale up significantly. If you're working with very large datasets or streaming data that won't play nicely with your local machine, look into tech like Scala/Java running on a Spark cluster.
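As a minimal sketch of that decision, the snippet below checks the on-disk size of a CSV file against a memory budget and either loads it with a single pandas call or streams it in chunks. The file path, memory budget, and safety factor are all hypothetical assumptions, not part of the original answer.

```python
import os
import pandas as pd

DATA_PATH = "data/transactions.csv"  # hypothetical path
MEMORY_BUDGET_BYTES = 4 * 1024**3    # assumed ~4 GB available for the frame

# A pandas frame typically occupies several times the size of the CSV
# it was parsed from, so apply a safety factor before loading it whole.
SAFETY_FACTOR = 5

file_size = os.path.getsize(DATA_PATH)

if file_size * SAFETY_FACTOR <= MEMORY_BUDGET_BYTES:
    # Small enough: a plain in-memory load keeps the tooling simple.
    df = pd.read_csv(DATA_PATH)
    print(df.describe())
else:
    # Too big for one frame: stream it in chunks and aggregate as you go.
    totals = None
    for chunk in pd.read_csv(DATA_PATH, chunksize=100_000):
        agg = chunk.select_dtypes("number").sum()
        totals = agg if totals is None else totals.add(agg, fill_value=0)
    print(totals)
```

If the chunked branch becomes the norm rather than the exception, that is usually the signal that it is time to move to a distributed engine like Spark.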

Know your process. Identify redundant tasks and cut them out as much as possible. Data science is an iterative process, and the less time you spend repeating yourself, the more you can focus on the actual analysis. That means no re-cleaning of data that has already been cleaned, updating existing models with new data rather than re-training from scratch, and so on.
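A hedged sketch of both ideas, assuming a scikit-learn setup with hypothetical file and column names: cache the cleaned dataset so the cleaning rules run only once, and use partial_fit to fold new data into an existing model instead of retraining from scratch.

```python
import os
import pandas as pd
from sklearn.linear_model import SGDClassifier

RAW_PATH = "data/raw_events.csv"         # hypothetical paths
CLEAN_PATH = "data/clean_events.parquet"

def load_clean() -> pd.DataFrame:
    """Clean once, cache the result, and reuse it on every later run."""
    if os.path.exists(CLEAN_PATH):
        return pd.read_parquet(CLEAN_PATH)
    df = pd.read_csv(RAW_PATH)
    df = df.dropna().drop_duplicates()   # stand-in for the real cleaning rules
    df.to_parquet(CLEAN_PATH)
    return df

df = load_clean()
# Assumes numeric feature columns plus a "label" column (hypothetical name).
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

# partial_fit updates the model incrementally; the classes argument is
# required on the first call so the model knows the full label set.
model = SGDClassifier()
model.partial_fit(X, y, classes=sorted(set(y)))
```

On later iterations, only the new rows need to pass through partial_fit, which is exactly the "update rather than re-train" point above.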
