An unsupervised learning method suitable for large categorical data sets

I want to detect anomalies in a bank data set using unsupervised learning. However, every column except time and amount is categorical, and about half of the columns have more than 90 percent missing values.

I'm currently approaching this with an autoencoder, but I am not sure whether it will work. Also, because the goal is to decide whether each record is abnormal as it arrives in real time, clustering techniques such as DBSCAN are of limited use.

For data like this, mostly categorical with many missing values, I am curious how to prepare the data and whether there is an appropriate unsupervised learning method.

The dataset was used in a project, so I don't currently have it and cannot share a sample.

Topic unsupervised-learning anomaly-detection categorical-data machine-learning

Category Data Science


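One option is the Isolation Forest algorithm. Since it is tree-based, it can handle categorical features, although many implementations (scikit-learn's IsolationForest, for example) expect them to be encoded as numbers first. A fitted forest can also score each new record individually, which suits the real-time requirement.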
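As a rough illustration of this suggestion, here is a minimal sketch using scikit-learn. The column names (amount, channel, merchant_type) and the toy data are hypothetical stand-ins for the bank data, which is not available. Treating missing values as their own "missing" category and using ordinal codes are assumptions made for the sketch, not something the original data dictates; one-hot encoding would be another reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the bank records.
rng = np.random.default_rng(0)
n = 500
train = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=0.5, size=n),
    "channel": rng.choice(["atm", "web", "branch"], size=n),
    # Roughly 90 percent missing, as in the question.
    "merchant_type": rng.choice(["grocery", "fuel", None], size=n, p=[0.05, 0.05, 0.9]),
})
cat_cols = ["channel", "merchant_type"]

def prepare(df):
    """Assumption: treat a missing category as its own level called 'missing'."""
    out = df.copy()
    out[cat_cols] = out[cat_cols].astype("object").fillna("missing")
    return out

# Ordinal codes for the categorical columns; unseen categories map to -1.
encoder = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols)],
    remainder="passthrough",
)
model = make_pipeline(encoder, IsolationForest(n_estimators=200, random_state=0))
model.fit(prepare(train))

# Score records as they arrive: an extreme amount and a typical one.
new = pd.DataFrame({
    "amount": [5000.0, 20.0],
    "channel": ["web", "web"],
    "merchant_type": [None, None],
})
labels = model.predict(prepare(new))            # -1 = anomaly, 1 = normal
scores = model.decision_function(prepare(new))  # lower = more anomalous
print(labels, scores)
```

Because the fitted pipeline scores one row at a time, it can be applied to each incoming record without refitting. The ordinal codes impose an arbitrary order on the categories, which the trees can work around with repeated splits but which may matter for high-cardinality columns.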

The rate of missing data will pose a major problem for the interpretation of any model. It might be difficult to tell whether the data is missing at random or whether the missingness is related to the anomalies.
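One simple first check, sketched below with pandas, is to compare how often each column is missing among flagged and unflagged records. The data, the column names, and the status column (imagined here as the output of some detector) are all hypothetical. A large gap between the two groups hints that missingness may carry signal, but it does not establish why the values are missing.

```python
import pandas as pd

# Hypothetical records; "status" stands in for the output of some detector.
df = pd.DataFrame({
    "merchant_type": ["grocery", None, None, None, "fuel", None],
    "device_id": [None, "a1", None, "b2", None, None],
    "status": ["normal", "normal", "flagged", "normal", "normal", "flagged"],
})
feature_cols = ["merchant_type", "device_id"]

# True where a value is missing, one indicator column per feature.
missing = df[feature_cols].isna()

# Fraction of missing values per column within each group.
rates = missing.groupby(df["status"]).mean()
print(rates)
```

The same indicator columns could also be fed to the model as extra features, so that "value absent" is something the detector can split on directly.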
