Machine Learning in Practice

I worked on a machine learning project where we dealt with relatively small data sets. I noticed that the way we tried to increase performance was basically to try out a bunch of different models with different hyperparameters, try out a bunch of different sets of features, etc. It seemed like we approached the problem fairly randomly, with no real theoretical basis for anything we tried. This disillusioned me a fair amount and made me wonder whether this is what machine learning engineers do in practice.

Have people found that this is fairly common? How can you work on a machine learning problem in a non-random-try-everything kind of way?

Tags: methodology, machine-learning

Category: Data Science


Here are a few comments from my perspective:

  • First, methodologically this kind of approach must be done carefully, but people who work this way are rarely careful. Training/evaluating many different types of models and/or parameters is equivalent to doing parameter tuning, i.e. it is essentially part of training the model. As a consequence, performance should be compared on a validation set; after that, the best model is selected and only this model is evaluated on the (unseen) test set. A common mistake is to evaluate all the models on the same test set and assume that this gives the real performance. This is wrong: the best performance could have been obtained by chance, and therefore the true performance is likely lower (see the first sketch after this list).
  • While it is normal to try to maximize performance, and to some extent this can involve a fair amount of experimental evaluation, trying many different models/parameters blindly is in my opinion very poor practice, and it rarely leads to the best results (though it can lead to the illusion of a good result, see the point above). Most of the time there is much more to gain from a fine-grained analysis of the data, including studying expert knowledge about the context, than from blindly relying on black-box algorithms. A simple example is feature engineering: sometimes a simple redesign of the features can drastically improve performance, and there is no ML technique that can replace this (DL methods may come close to it, but usually at a much higher computational cost); see the second sketch below the list.
  • Final point: the quest for maximum performance is often misguided. Sometimes performance is used as a lazy excuse for not doing the job properly: what really matters in production? Is it really 0.3% more F1-score, or an interpretable result? Was the evaluation measure chosen because it really reflects the quality of the prediction, or just "because it's standard"? And the training/test sets: do we know whether the production data will look exactly the same? If not, is the ML method robust enough to handle these variations? When optimizing performance, one often relies on the model catching the most subtle patterns, which means a higher risk of overfitting and mediocre performance in production. More generally, there are growing questions (at least in research) about explainability, the role of human actors in an ML process, and of course the computational complexity and the environmental cost of the ML method. None of these points are reflected by a performance measure, but they matter.
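To make the first point concrete, here is a minimal sketch of the methodology it describes: compare candidate models on a validation set only, then touch the test set exactly once with the selected model. The dataset, candidate models, and split ratios are my own illustrative assumptions, not anything from the original discussion.

```python
# Sketch: select among candidates on a validation set, report only the winner on the test set.
# Dataset, models and split ratios are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Split once into train / validation / test (60 / 20 / 20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logreg_C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "logreg_C=1.0": LogisticRegression(C=1.0, max_iter=1000),
    "rf_100_trees": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare all candidates on the validation set only.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_scores, key=val_scores.get)
print("validation scores:", val_scores)

# The test set is used exactly once, with the already-selected model.
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(f"selected {best_name}, test accuracy = {test_score:.3f}")
```

Reporting the validation scores of all candidates as "the performance" is exactly the mistake described above; only the single test-set number of the selected model is an honest estimate.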

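And here is the second sketch, illustrating the feature-engineering point. It is a toy example under my own assumptions: the label depends on a ratio of two raw measurements (a BMI-like quantity), which a linear model cannot construct on its own, so adding that one hand-crafted feature helps far more than swapping algorithms would.

```python
# Sketch of the feature-engineering point: one hand-crafted ratio feature
# can help a simple model more than trying many different algorithms.
# The synthetic data and the ratio feature are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
weight = rng.uniform(40, 120, size=5000)   # e.g. kilograms
height = rng.uniform(1.4, 2.0, size=5000)  # e.g. metres
# The label depends on weight / height**2, which a linear model
# cannot build from the raw columns by itself.
y = (weight / height**2 > 27).astype(int)

X_raw = np.column_stack([weight, height])
X_engineered = np.column_stack([weight, height, weight / height**2])

model = LogisticRegression(max_iter=1000)
print("raw features:      ", cross_val_score(model, X_raw, y, cv=5).mean())
print("with ratio feature:", cross_val_score(model, X_engineered, y, cv=5).mean())
```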
Overall I would consider this kind of approach mediocre: it's inelegant, it's prone to mistakes (which often won't be detected), and it's short-sighted because it ignores other dimensions of the quality of an ML system.

By the way, this approach can be automated. Personally, I would not be very proud (and would probably even be worried) if my job could be done entirely by a program.

Have people found that this is fairly common?

In research, my guess is that this would be common enough in low-level publications, but it's unlikely to pass review at any reputable journal/conference (because expert reviewers can usually detect it).

How can you work on a machine learning problem in a non-random-try-everything kind of way?

That's the thing, there's no recipe... But that's the beauty of it, isn't it? :)

In my opinion it's a bit like medicine: a good doctor has a good intuition because they have a solid theoretical background and a lot of practice. They don't just follow some rigid manual, they see the patient as a unique individual and they do their best to understand every aspect of the problem.


In practice, every attempt to squeeze more performance out of the model needs to be interpreted and documented, which means theory is mandatory; in academia this is not always the case.

Every model I put into production needs to be simple, easy to scale, and as little of a "black box" as possible; otherwise the MRM (model risk management) team won't validate and approve it for production (high risk costs money).

For example: if a logit or tree model gives close to the same results as an ANN, then I will go with the logit or tree model, because I can easily interpret it and the underlying theory is enough to pass MRM validation and acceptance, as in the sketch below.
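A minimal sketch of that rule of thumb, under my own assumptions (scikit-learn models, a synthetic dataset, and an arbitrary "close enough" margin of one percentage point): if the logistic regression is within the margin of the neural network, keep the logistic regression.

```python
# Sketch of "prefer the interpretable model if it is close enough".
# The dataset, the candidate models and the 0.01 margin are my own assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)

logit = LogisticRegression(max_iter=1000)
ann = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)

logit_score = cross_val_score(logit, X, y, cv=5).mean()
ann_score = cross_val_score(ann, X, y, cv=5).mean()

MARGIN = 0.01  # "close to the same results", taken here as 1 percentage point
chosen = "logit" if ann_score - logit_score <= MARGIN else "ANN"
print(f"logit={logit_score:.3f}  ann={ann_score:.3f}  -> deploy {chosen}")
```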

To get a feel for what is done outside academia, participate in competitions like Kaggle (they use a lot of real-world data), or find any open dataset on the internet and ask and answer your own questions about it.
