An unsupervised learning method suitable for large categorical data sets

Question

HoonP

May 27, 2022, 22:05

I want to detect anomalies in a bank data set using unsupervised learning. However, every column except time and amount is categorical, and about half of the columns have more than 90 percent missing values.

I'm currently approaching this with an autoencoder, but I'm not sure whether it will work. Also, because the goal is to decide in real time whether each incoming record is abnormal, clustering techniques such as DBSCAN are of limited use.

For data like this, mostly categorical and with many missing values, how should I prepare the data, and is there an unsupervised learning method that suits it?

Because the dataset belongs to a project, I don't currently have it available to share.

Topic unsupervised-learning anomaly-detection categorical-data machine-learning

Category Data Science

Brian Spiering · Accepted Answer · April 27, 2022, 17:28

One option is the Isolation Forest algorithm. Since it is tree-based, it can handle categorical features.
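A minimal sketch of how this could look, assuming scikit-learn and pandas. Note that scikit-learn's `IsolationForest` expects numeric input, so here the categorical columns are ordinal-encoded first, with missing values filled in as their own `"missing"` category. The column names (`channel`, `merchant_type`, `amount`) and the toy data are hypothetical stand-ins for the bank data set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical columns (one ~90% missing) and an amount.
rng = np.random.default_rng(0)
n = 500
merchant = rng.choice(["grocery", "travel"], size=n).astype(object)
merchant[rng.random(n) < 0.9] = np.nan
train = pd.DataFrame({
    "channel": rng.choice(["atm", "web", "branch"], size=n),
    "merchant_type": merchant,
    "amount": rng.gamma(shape=2.0, scale=50.0, size=n),
})

categorical = ["channel", "merchant_type"]
numeric = ["amount"]

# Treat "missing" as a category of its own, then map categories to integers.
# Categories never seen during fit are encoded as -1 at prediction time.
encode_categoricals = Pipeline([
    ("fill", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([
    ("cat", encode_categoricals, categorical),
    ("num", "passthrough", numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", IsolationForest(n_estimators=100, random_state=0)),
])
model.fit(train)

# Scoring new records as they arrive: no refit is needed.
new_records = pd.DataFrame({
    "channel": ["web", "atm"],
    "merchant_type": [np.nan, "travel"],
    "amount": [80.0, 25000.0],
})
labels = model.predict(new_records)            # 1 = inlier, -1 = outlier
scores = model.decision_function(new_records)  # lower = more anomalous
```

Because the fitted model scores each record independently, it fits the real-time requirement better than a clustering method that has to be rerun on the full data. Ordinal encoding imposes an arbitrary order on the categories; one-hot encoding is an alternative when the number of distinct values per column is small.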

The rate of missing data will pose a major problem for the interpretation of any model. It might be difficult to tell whether the data is missing at random or whether the missingness is related to the anomalies.
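One rough way to probe this is to compare, column by column, how often values are missing among flagged and non-flagged records. The sketch below assumes pandas; the helper name `missing_rate_by_flag` and the toy data are hypothetical, and the flags would come from whatever detector you use. A large difference suggests that missingness and the anomaly flags are related, but it does not by itself establish why the data is missing.

```python
import numpy as np
import pandas as pd

def missing_rate_by_flag(df, is_anomaly):
    """Per-column missing rates for non-flagged ("normal") and flagged rows."""
    flags = pd.Series(np.asarray(is_anomaly, dtype=bool), index=df.index)
    # df.isna() is a boolean frame; its mean per group is the missing rate.
    rates = df.isna().groupby(flags).mean().T
    rates = rates.reindex(columns=[False, True])
    rates.columns = ["normal", "flagged"]
    rates["difference"] = rates["flagged"] - rates["normal"]
    return rates

# Hypothetical toy example: column "a" is missing more often in flagged rows.
df = pd.DataFrame({
    "a": ["x", None, "y", None, None, "x"],
    "b": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})
flags = [False, True, False, True, False, False]
rates = missing_rate_by_flag(df, flags)
```

If the detector itself was trained on features that encode missingness, the comparison is partly circular, so it is best read as a diagnostic rather than a test.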
