Why don't deep learning / neural networks achieve state-of-the-art results on tabular data problems?

Apparently, deep learning methods don't achieve state-of-the-art results on tabular data problems [1, 2]. This claim also seems to be common knowledge among Kagglers. The state-of-the-art method appears to be gradient-boosted decision trees.

Is there any intuition on why this happens? Any relevant literature on the topic?

Do neural networks have a stronger IID assumption that inhibits learning in tabular data?

Literature:

  1. Deep Neural Networks and Tabular Data: A Survey https://arxiv.org/abs/2110.01889
  2. Do We Really Need Deep Learning Models for Time Series Forecasting? https://arxiv.org/abs/2101.02118

Topic deep-learning neural-network machine-learning

Category Data Science


In my opinion, deep learning methods are best suited (though not exclusively) for representation learning on very generic and homogeneous data formats: sound, images, text, video, etc. For most of these formats there are pre-trained models that achieve state-of-the-art results.

On the contrary, tabular datasets typically have a more heterogeneous and messy structure, often tied to domain knowledge, which is outside the scope of automatic representation learning. Thus, manual feature engineering and methods like gradient boosting perform better.
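To make this concrete, here is a minimal sketch comparing gradient-boosted trees with a small MLP on an off-the-shelf tabular dataset using scikit-learn (the dataset choice and hyperparameters are my own illustrative assumptions, not a benchmark):

    # Sketch: gradient-boosted trees vs. a small MLP on a tabular dataset.
    # Dataset and hyperparameters are illustrative assumptions.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Gradient-boosted trees: no scaling needed, strong out-of-the-box behaviour.
    gbdt = HistGradientBoostingClassifier(random_state=0)

    # A small MLP: needs feature scaling and is more sensitive to tuning.
    mlp = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(64, 64),
                                      max_iter=1000, random_state=0))

    for name, model in [("GBDT", gbdt), ("MLP", mlp)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

In my experience the boosted trees win or tie out of the box, while the MLP needs preprocessing and tuning just to be competitive.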

By the way, much of the power of deep learning comes from fine-tuning models that have already been pre-trained on huge datasets, e.g. BERT by Google for textual data. Given how difficult or impossible it is to use pre-trained deep learning models on messy/heterogeneous tabular datasets, deep learning loses much of its attractiveness in this scenario.
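For contrast, this is roughly what that transfer-learning workflow looks like for text with the Hugging Face transformers library; the model name and two-label head are just illustrative. Tabular data mostly has no analogue of this head start:

    # Sketch of the transfer-learning workflow tabular data mostly lacks:
    # reuse a model pre-trained on huge generic corpora, then fine-tune.
    # Requires the `transformers` library; model name is illustrative.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # fresh head; body keeps pre-trained weights

    inputs = tokenizer("Tabular data rarely gets this head start.",
                       return_tensors="pt")
    logits = model(**inputs).logits  # ready to fine-tune on a small labelled set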

Another reason is that learning algorithms have what we call inductive biases. If the domain knowledge that is essential for solving a tabular business problem has, by its nature, a tree-like/taxonomic structure, it is logical that tree-based models will have an edge (because even the domain expert or label annotator will follow a tree-based process).
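A toy demonstration of this bias (the labelling rule, thresholds, and sample size below are made up): when labels follow a simple if/else rule, a shallow tree recovers the rule almost exactly, while a small MLP has to approximate the sharp axis-aligned boundaries from limited data:

    # Toy inductive-bias demo: rule-based labels, as an annotator might
    # produce them, match the axis-aligned splits a tree makes.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(200, 2))
    y = ((X[:, 0] > 0.3) & (X[:, 1] > 0.6)).astype(int)  # if/else labelling rule

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print("tree:", tree.score(X_te, y_te), "mlp:", mlp.score(X_te, y_te))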

On the other hand, if a set of images and their labels depend on spatial features that can be captured with a filter, deep learning with CNNs makes better assumptions.
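For example, a convolution filter already encodes the assumption that neighbouring pixels should be combined. In this toy sketch (image and filter values are illustrative), a hand-made 3x3 filter responds exactly where a vertical edge sits:

    # A convolutional filter hard-codes a spatial inductive bias.
    import numpy as np

    image = np.zeros((8, 8))
    image[:, 4:] = 1.0                      # left half dark, right half bright

    kernel = np.array([[-1, 0, 1],
                       [-1, 0, 1],
                       [-1, 0, 1]])         # vertical-edge detector

    response = np.zeros((6, 6))
    for i in range(6):
        for j in range(6):
            response[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

    print(response[0])  # peaks at columns 2-3, where the edge sits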

Finally, because deep learning models have a large number of parameters to learn, they require huge datasets to avoid overfitting. Thus, they are not a good option for small tabular datasets where it is difficult/expensive to acquire more data.
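A quick back-of-the-envelope count makes the point (the architecture here is an arbitrary assumption):

    # Rough parameter count for a modest MLP on a 50-feature table:
    # it can easily exceed the number of rows in a small tabular dataset.
    layers = [50, 128, 128, 1]                 # input -> two hidden -> output
    params = sum(n_in * n_out + n_out          # weights + biases per layer
                 for n_in, n_out in zip(layers, layers[1:]))
    print(params)  # 23169 parameters vs., say, a 1,000-row dataset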
