Which machine learning model is best for a combination of numerical and categorical data?

I want to develop an ML model that will allow my company to highlight employees who are at risk of leaving the business, based on a variety of parameters such as performance, absence rates, location, age, team manager, etc. We have a fairly diverse database of individuals who have already left the business, with values for each of the inputs, which can be used to train the model. The output is a simple 1 or 0: based on all of the inputs, an individual is either 'at risk' or 'not at risk' of leaving, with no immediate requirement for any indication of the degree of risk.

I am somewhat of an ML newbie, but having researched the various types of ML models, I cannot find any specific information on training models with datasets that combine numerical and categorical data. I have looked at examples of similar problems that used k-NN and SVM models, but I cannot find a definitive explanation of how to approach this task. It is worth noting that I code in Python and MATLAB.

Any input would be greatly appreciated.

Topic machine-learning-model keras matlab python machine-learning

Category Data Science


Virtually any supervised classification algorithm can be used with categorical features once you apply some encoding technique, but my first thought is CatBoost, an algorithm designed specifically to handle categorical features without requiring an explicit preprocessing/encoding step. In short, the algorithm uses an adaptation of target encoding; you can check the details here:

https://arxiv.org/abs/1706.09516
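To make the idea of target encoding more concrete, here is a minimal pure-Python sketch of an ordered target statistic in the spirit of the linked paper. It is not CatBoost's actual implementation; the function name, the `prior_weight` parameter, and the toy data are assumptions made up for illustration.

```python
import random


def ordered_target_encode(categories, targets, prior_weight=1.0, seed=0):
    """Encode one categorical column using only 'earlier' rows in a random order."""
    n = len(categories)
    prior = sum(targets) / n  # global positive rate, used as the prior
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of row indices

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Smoothed mean of the targets of previously seen rows in this category.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now add this row's own target, so it never leaks into its own encoding.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


# Hypothetical toy data: office location and whether the employee left (1) or stayed (0).
locations = ["London", "Leeds", "London", "Leeds", "London", "York"]
left = [1, 0, 1, 0, 0, 1]
print(ordered_target_encode(locations, left))
```

Because each row is encoded only from rows that precede it in the random order, its own label never feeds into its own feature value. With the CatBoost Python package you would not write this yourself; you would tell the model which columns are categorical (the `cat_features` argument) and let it handle the encoding internally.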

From Yandex docs:

CatBoost is an algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers, and is used for search, recommendation systems, personal assistant, self-driving cars, weather prediction and many other tasks at Yandex and in other companies, including CERN, Cloudflare, Careem taxi. It is in open-source and can be used by anyone.


Some ML models can work with both categorical and numerical data:

  • Decision trees (with bagging)
  • Random forest (with bagging and random subspace)
  • Naive Bayes (numerical features modelled by a Gaussian distribution or kernel density estimation)
  • k-NN-based approaches
  • Ensemble techniques
  • Logistic regression (the classification counterpart of linear regression, with categorical inputs encoded)
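As one illustration of the k-NN item in the list above, here is a rough sketch of a nearest-neighbour classifier that mixes the two data types through a Gower-style distance. The choice of Gower distance is my assumption, not something the list prescribes, and the function names, feature names, and toy records are all hypothetical.

```python
from collections import Counter


def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two rows (dicts).

    Numerical features: absolute difference divided by the feature's range.
    Categorical features: 0 if equal, 1 otherwise.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            span = numeric_ranges[name]
            total += abs(a[name] - b[name]) / span if span else 0.0
        else:
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


def knn_predict(train_rows, train_labels, query, numeric_features, k=3):
    # Range of each numerical feature over the training set, used for scaling.
    ranges = {
        f: max(r[f] for r in train_rows) - min(r[f] for r in train_rows)
        for f in numeric_features
    }
    # Sort training rows by distance to the query and keep the k closest.
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: gower_distance(pair[0], query, ranges),
    )[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy records; feature names and values are made up.
train = [
    {"age": 25, "absence_days": 12, "location": "London"},
    {"age": 31, "absence_days": 9, "location": "London"},
    {"age": 45, "absence_days": 2, "location": "Leeds"},
    {"age": 52, "absence_days": 1, "location": "Leeds"},
    {"age": 38, "absence_days": 3, "location": "York"},
]
labels = [1, 1, 0, 0, 0]  # 1 = left the business, 0 = stayed

query = {"age": 28, "absence_days": 10, "location": "London"}
print(knn_predict(train, labels, query, ["age", "absence_days"], k=3))
```

On this toy data the query is classified as 1 ('at risk'), since two of its three nearest neighbours left the business. Scaling each numerical feature by its range keeps age and absence days from dominating the 0/1 mismatch used for location.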

Note: you can always use an encoding technique to transform categorical data into numerical form (or, in the other direction, discretize numerical data into categories), depending on the ML model you choose.
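A minimal sketch of one such encoding, one-hot encoding, written in plain Python so that the mechanics are visible. The helper name `one_hot_encode` and the toy records are hypothetical.

```python
def one_hot_encode(rows, categorical_features):
    """Replace each categorical column with one 0/1 column per observed category."""
    # Collect the sorted set of categories seen for each categorical feature.
    levels = {
        f: sorted({row[f] for row in rows}) for f in categorical_features
    }
    encoded = []
    for row in rows:
        out = {}
        for name, value in row.items():
            if name in levels:
                # One indicator column per category of this feature.
                for level in levels[name]:
                    out[f"{name}={level}"] = 1 if value == level else 0
            else:
                out[name] = value  # numerical features pass through unchanged
        encoded.append(out)
    return encoded


# Hypothetical toy records; values are made up for illustration.
employees = [
    {"age": 25, "location": "London"},
    {"age": 45, "location": "Leeds"},
    {"age": 38, "location": "York"},
]
encoded_rows = one_hot_encode(employees, ["location"])
for row in encoded_rows:
    print(row)
```

After this step every feature is numerical, so models such as SVM or logistic regression can consume the rows directly. In a real project you would more likely reach for a library encoder, for example scikit-learn's `OneHotEncoder` combined with `ColumnTransformer`, rather than hand-rolled code.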
