In my opinion, deep learning methods are best suited for (but not only for) representation learning on very generic and homogeneous data formats: sound, images, text, video, etc. For most of these formats there are pre-trained models that achieve state-of-the-art results.
In contrast, tabular datasets typically have a more heterogeneous and messy structure, often tied to domain knowledge that lies outside the scope of automatic representation learning. As a result, manual feature engineering combined with methods like gradient boosting tends to perform better; a rough sketch of that workflow is given below.
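Here is a minimal sketch of what I mean, using scikit-learn's GradientBoostingClassifier on a hypothetical customer table; the file name, the column names, and the engineered ratio feature are all made up for the example.

```python
# Minimal sketch: gradient boosting on a small tabular dataset with one
# hand-engineered feature. The CSV path and columns are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")  # hypothetical tabular data

# Manual feature engineering driven by domain knowledge,
# e.g. turning two raw columns into a ratio the business cares about.
df["spend_per_visit"] = df["total_spend"] / df["n_visits"].clip(lower=1)

X = df[["age", "n_visits", "total_spend", "spend_per_visit"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```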
By the way, much of the power of deep learning comes from fine-tuning models that have already been pre-trained on huge datasets, e.g. BERT by Google for textual data. Given how difficult, or even impossible, it is to reuse pre-trained deep learning models on messy/heterogeneous tabular datasets, deep learning loses much of its attractiveness in this scenario.
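To make the fine-tuning idea concrete, here is a minimal sketch using the Hugging Face transformers library with PyTorch; the two toy sentences, their labels, and the hyperparameters are placeholders, not a real training setup.

```python
# Minimal sketch: fine-tune pre-trained BERT for binary text classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pre-trained weights + a new task head

texts = ["great product", "terrible support"]  # toy labelled examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                       # a few fine-tuning steps
    out = model(**batch, labels=labels)  # loss is computed inside the model
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```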
Another reason is that learning algorithms also have what we call inductive biases. If the domain knowledge that is essential for solving a tabular business problem has, by its nature, a tree-based/taxonomy structure, it is logical that tree-based models will have an edge (because even the domain expert or label annotator will follow a tree-based process).
On the other hand, if a set of images and their labels depend on spatial features that can be captured with a filter, deep learning with a CNN makes better assumptions; see the sketch below.
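As a rough illustration of that inductive bias, the following sketch builds a tiny convolutional network in PyTorch; the layer sizes, the 32x32 input, and the 10 output classes are arbitrary choices for the example.

```python
# Minimal sketch of the convolutional inductive bias: the same small 3x3
# filter is slid over the whole image, so one local pattern detector is
# reused at every spatial position.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local, translation-equivariant filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # spatial downsampling to 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # classifier head for 10 hypothetical classes
)

x = torch.randn(1, 3, 32, 32)                    # one 32x32 RGB image
logits = model(x)
print(logits.shape)                              # torch.Size([1, 10])
```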
Finally, because deep learning models have a large number of parameters to learn, they require huge datasets to avoid overfitting. They are therefore not a good option for small tabular datasets where it is difficult or expensive to acquire more data.