How do you define the steps to explore the data?

I'm falling in love with data science and I'm spending a lot of time studying it. It seems that a common data science workflow is:

  1. Frame the problem
  2. Collect the data
  3. Clean the data
  4. Work on the data
  5. Report the results

I'm struggling to connect the dots when it comes to working on the data. I'm aware that step 4 is where the fun happens, but I don't know where to begin. What steps do you take when you work on the data? For example: do I need to find the central tendency or the standard deviation? Is machine learning needed?

PS: I know these are broad questions, so please answer them within your own domain expertise.

Tags: data-wrangling, methods, visualization, statistics, machine-learning

Category: Data Science


Since working with data depends on one's education, expertise, goals, and favorite tools, I will answer within my own narrow scope, while trying to follow your outline.

  • Framing the problem is an important starting point that a lot of people neglect. Even though it is only the beginning, it should already give you your first strategies for exploring the data.

    1. Translate "What I want to do" to "What are the implicit information I need to have to achieve it"
    2. Given the information you need, find your way to get it (by decomposing it into tasks and sub-tasks) and the corresponding data to extract it (specific task implies specific signal(s) : structured data, pictures, movies, sounds, texts ...)
    3. Along with 1. and 2., you should have a clearer idea of the data you'll deal with and thus the tools you might use (NLP, image processing, time-series, ...)
  • Collecting the data is now easier, as it is implied by the previous step. Still, mentally classify your data along the following axes to know what to start with, according to your personal trade-off (the original answer illustrated this with a chart):

    1. Direct data are those that can be obtained easily. Indirect data are those that require some pre-processing (scraping websites, cropping images, counting numbers of clicks, ...)
    2. The simplicity / complexity of use depends on the data: generally speaking, structured data in arrays are easier to deal with than images.
    3. The size of each dot is the reward you get, with respect to your whole project, if you manage to work on that data.
  • Exploring and cleaning the data: there are levels of complexity here. I usually start with standard processes to clean the data (mean/median imputation for missing values, normalization and centering when needed, ...). Meanwhile, I start looking deeper into the data by plotting histograms of values, the evolution of the mean for time series, word frequencies for texts, ... This is task-specific, but exploration is there to give you hints about your data. Once you have inspected it, you can refine your cleaning process (a minimal sketch of this pass appears after this list).

  • Working on the data: as you said, here comes the fun part. You can choose your favorite tools, or start improving your skills by looking for new concepts (as a future good data scientist) to process your data. One reason you don't know where to start may be that you went too fast over the previous points, so what you have to do is still unclear. Go back to them and write the process down on paper until you clearly identify the inputs and outputs you need. Generally speaking, it involves the following (a pipeline sketch also appears after this list):

    1. Dimensionality reduction (especially for images) and feature design (one-hot encoding, floats or ints, ordinal or nominal categories, ...)
    2. Choice of your estimator / model and tuning of its hyper-parameters
    3. Training with validation methods (cross-validation, leave-one-out, ...)
    4. Testing and improving your results
  • Reporting the results: not as easy as it sounds, as mentioned here. If it is for yourself, having a whole project built from scratch is a good reward in itself. Moreover, you may want to record your scores when testing the model and how you improved them (which hyper-parameters, which model, ...). If it is a well-discussed subject, you can compare yourself to top teams worldwide on well-known datasets. Finally, if it is for an employer, I would recommend starting this discussion before getting into the subject; it would save time and trouble.
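To make the exploring/cleaning step above concrete, here is a minimal sketch assuming a tabular dataset loaded with pandas; the file name data.csv and the columns are placeholders I am assuming, not part of the original answer:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Hypothetical tabular dataset; replace "data.csv" with your own file.
df = pd.read_csv("data.csv")

# Standard cleaning: fill missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Normalization and centering (zero mean, unit variance) when needed.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# First exploration pass: histograms of the numeric columns.
df[numeric_cols].hist(bins=30, figsize=(10, 6))
plt.show()
```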

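In the same spirit, a sketch of the "working on the data" steps (feature design, estimator choice, hyper-parameter tuning with cross-validation, testing) using scikit-learn; the column names color, size, and label and the logistic-regression model are hypothetical choices, not the only way to do it:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical columns: "color" is categorical, "size" numeric, "label" the target.
df = pd.read_csv("data.csv")
X, y = df[["color", "size"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature design: one-hot encode categorical columns, scale numeric ones.
features = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("numeric", StandardScaler(), ["size"]),
])

# Estimator choice + hyper-parameter tuning with cross-validation.
pipe = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Testing the tuned model on held-out data.
print(grid.best_params_, grid.score(X_test, y_test))
```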

That is a very good framework for tackling the question you have. In my view, it has multiple possible answers; I'll give you the one I relate to.

After cleaning the data, or rather while cleaning it, we have to be clear about the task ahead and about the results we expect. Working on the data mostly follows these steps:

  1. Feature Detection
  2. Training using the above features (there are many machine learning / deep learning models to do this), e.g. classification (this depends on the task)
  3. Check the trained model on a validation set if needed, and later on the test set (a minimal sketch follows below).

The features depend on the dataset. Standard deviation or central tendency isn't always a criterion. Machine learning is needed to train on the dataset in most cases.
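Here is a minimal sketch of those three steps; the synthetic features, the random-forest model, and the 60/20/20 split are illustrative assumptions of mine, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for your own feature detection step: synthetic features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split into training, validation, and test sets (60% / 20% / 20%).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Train a model (a random forest here; the choice depends on the task).
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Check on the validation set, then on the test set.
print("validation accuracy:", model.score(X_val, y_val))
print("test accuracy:", model.score(X_test, y_test))
```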
