AutoML vs. manual ML for a project

I was recently introduced to an AutoML library based on genetic programming called TPOT (thanks to @Noah Weber). I have a few questions:

1) When we have AutoML, why do people still spend time on feature selection, preprocessing, etc.? I understand those steps at least reduce the search space/feature space.

2) At the very least, AutoML tools reduce our work to some extent: we can start from the output of an AutoML solution and tune further if required. We don't really have to run GridSearchCV and manually key in the ranges of values we might need. Right?

3) Is there any disadvantage to it? I understand it might be a black box, but for data analysis, doesn't it make things easier? Computer scientists may not prefer it. Of course, we still need some knowledge to fine-tune the model, interpret the results, etc.

4) What's the advantage of doing manual ML compared to AutoML?

5) Will it be possible for us to improve the results further once we get the output from AutoML?

Can you help me understand this?

Topic automl deep-learning predictive-modeling data-mining machine-learning

Category Data Science


1) Feature selection should be handled by AutoML; preprocessing, on the other hand, is normally done by the user in order to make sense of the data. A sketch of that division of labor follows below.
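As a minimal sketch of the user-side preprocessing that typically happens before the data ever reaches AutoML (the file name and column names here are hypothetical placeholders):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset and columns; AutoML tools generally expect
    # clean, numeric input, so this user-side preprocessing happens first.
    df = pd.read_csv("data.csv")
    df = df.dropna(subset=["target"])                 # drop rows with a missing label
    df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
    df = pd.get_dummies(df, columns=["city"])         # one-hot encode a categorical one

    X = df.drop(columns=["target"])
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)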

2) AutoML takes care of the hyperparameter tuning, so you don't have to hand-write search grids yourself.
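For example, a minimal TPOT sketch (reusing the hypothetical split above; the settings shown are illustrative, not recommended values):

    from tpot import TPOTClassifier

    # TPOT evolves whole pipelines (preprocessors, models and their
    # hyperparameters) with genetic programming, so there is no
    # hand-written GridSearchCV parameter grid to key in.
    tpot = TPOTClassifier(generations=5, population_size=20,
                          random_state=42, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))

    # Export the winning pipeline as plain sklearn code you can inspect
    # and keep tuning by hand.
    tpot.export("best_pipeline.py")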

3) The main disadvantage I find is that it is extremely computationally expensive. Also, from what I have seen on Kaggle, most of the winning solutions use manual ML, not AutoML.

4) For me, one of the advantages is that it sometimes finds a good algorithm that I have not tried (or thought of), and it saves me some coding time. It also tends to produce good ensembles of different models.

5) You can do manual ML on your side and then ensemble your own model with the AutoML one, as sketched below. This doesn't guarantee an improvement, but it can boost your performance.
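A minimal sketch of that idea, assuming the fitted TPOT run from above; soft voting is just one of several ways to combine the models:

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier

    # A hand-built "manual ML" model.
    manual_model = RandomForestClassifier(n_estimators=300, random_state=42)

    # tpot.fitted_pipeline_ is the best sklearn pipeline from the TPOT run above.
    # voting="soft" assumes both models expose predict_proba; use "hard" otherwise.
    ensemble = VotingClassifier(
        estimators=[("manual", manual_model),
                    ("automl", tpot.fitted_pipeline_)],
        voting="soft",
    )
    ensemble.fit(X_train, y_train)
    print(ensemble.score(X_test, y_test))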

You can also have a look at H2O AutoML. I quote its documentation below, which I believe is helpful for building an intuition about it:

Although H2O has made it easy for non-experts to experiment with machine learning, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular are notoriously difficult for a non-expert to tune properly. In order for machine learning software to truly be accessible to non-experts, we have designed an easy-to-use interface which automates the process of training a large selection of candidate models. H2O’s AutoML can also be a helpful tool for the advanced user, by providing a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code, and by freeing up their time to focus on other aspects of the data science pipeline tasks such as data-preprocessing, feature engineering and model deployment.
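To make that concrete, here is a minimal sketch of driving H2O AutoML (the file name and target column are hypothetical):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file("train.csv")  # hypothetical dataset

    # "target" is the label column; all other columns are used as predictors.
    aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=1)
    aml.train(y="target", training_frame=train)

    print(aml.leaderboard)  # every candidate model, ranked by performance
    best = aml.leader       # the top model, ready for predictions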

You can also have a look at this blog post by Bojan Tunguz, where he defines the levels of AutoML:

  • Level 0: No automation. You code your own ML algorithms. From scratch. In C++.

  • Level 1: Use of high-level algorithm APIs. Sklearn, Keras, Pandas, H2O, XGBoost, etc.

  • Level 2: Automatic hyperparameter tuning and ensembling. Basic model selection.

  • Level 3: Automatic (technical) feature engineering and feature selection, technical data augmentation, GUI.

  • Level 4: Automatic domain and problem specific feature engineering, data augmentation, and data integration.

  • Level 5: Full ML Automation. Ability to come up with super-human strategies for solving hard ML problems without any input or guidance. Fully conversational interaction with the human user.


There is no one-size-fits-all solution.

AutoML is cool, but you won't get tailored, best-possible solutions using it.

The reason is that data science has an "art" component to it. Sure, theoretically you could put everything into one huge optimization framework and find the optimal parameters, but realistically it would take forever. Maybe with quantum computers this will change, but for the time being we have to zero in on the optimal configuration using heuristics, theory, and previous experience.

So you can use it to aid your thinking, or even to question it, but if you are building a tailored solution you won't achieve the best results using AutoML. And depending on the field, this can mean a lot: for example, an F1-score difference of ±0.2 could mean millions of dollars in fraud expenses. You really want to minimise that.
