Assign a risk score to records in a dataset

Question


nameguest

January 4, 2022, 15:56

Suppose I have a dataset with categorical and numerical features, plus a label of 1 or 0 that indicates whether a row is anomalous or normal, respectively.

Is it possible to build a model that takes these numerical and categorical features as input and assigns each record a score for how risky it is?

Edit

My thought was to train a supervised anomaly detection model that classifies the records as 0 or 1. But instead of using these hard labels, maybe I could use the probability that the model outputs as a risk score.

Topic anomaly anomaly-detection regression outlier

Category Data Science

Multivac · Accepted Answer · January 4, 2022, 15:56

If you have a labeled dataset, i.e. examples from which you want to learn $f(X) = Y$, then you have a supervised learning problem. You can treat it as a "usual" binary classification problem and evaluate your model with metrics like $F_1$ or AUC under cross-validation. By "usual" I mean that you do not need any technique specific to anomaly detection; anomaly detection is simply the context of the problem you are solving.
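As a rough sketch of this approach, the snippet below trains a classifier on mixed numerical and categorical features, evaluates it with stratified cross-validation, and uses the out-of-fold predicted probability of class 1 as the risk score, as suggested in the question. The data is synthetic, the column names (`amount`, `age`, `channel`) are hypothetical, and the random forest is just one reasonable choice of model, not a recommendation from the discussion above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "amount": rng.exponential(100.0, n),
    "age": rng.integers(18, 80, n),
    "channel": rng.choice(["web", "store", "phone"], n),
})
# Assumed ground truth: large amounts and the "phone" channel are riskier.
logit = -4.0 + 0.01 * df["amount"] + 1.5 * (df["channel"] == "phone")
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# One-hot encode the categorical column, pass numerical columns through.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"])],
        remainder="passthrough",
    )),
    ("clf", RandomForestClassifier(
        n_estimators=200, min_samples_leaf=5,
        class_weight="balanced", random_state=0,
    )),
])

# Out-of-fold probabilities: each record is scored by a model that never saw it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
risk = cross_val_predict(model, df, y, cv=cv, method="predict_proba")[:, 1]

print("AUC:", roc_auc_score(y, risk))
print("F1 at threshold 0.5:", f1_score(y, risk >= 0.5))

# Rank records from most to least risky.
ranked = df.assign(risk=risk).sort_values("risk", ascending=False)
print(ranked.head())
```

Because the score is a probability, it can be thresholded at different levels or simply used to rank records for review.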

I would recommend placing special emphasis on descriptive analysis and model explainability, since most of the value of your classifier will come from them: finding which characteristics define an anomalous observation, and how strongly each feature pushes the output toward the anomalous class.

For the latter purpose, you can use SHAP values to explain your model.
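For context, the property that makes SHAP values suited to this is additivity: for a single record $x$ with $M$ features, the model output decomposes into a baseline plus one contribution per feature,

$$f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x), \qquad \phi_0 = \mathbb{E}[f(X)]$$

so each $\phi_j(x)$ tells you how much feature $j$ moved this record's score above or below the average. Depending on the model and explainer settings, $f$ may be the predicted probability or the log-odds, so check which one is being explained before reading the contributions as changes in risk.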

One last thing I would recommend, if you have enough time and resources, is to try an unsupervised anomaly detection algorithm like Isolation Forest, a non-parametric algorithm that also lets you assign an anomaly score based on the average path length needed to isolate each observation across the trees. It might be interesting to check whether there are observations that you labeled as normal but that the unsupervised model flags as anomalies, and what characteristics they have. You could also use the unsupervised model's output as a feature for the supervised one.
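A minimal sketch of both ideas with scikit-learn's `IsolationForest`, on synthetic numerical data (categorical features would need to be encoded first). The data, the number of trees, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense "normal" cloud plus a few scattered points.
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
X_scattered = rng.uniform(-6.0, 6.0, size=(20, 2))
X = np.vstack([X_normal, X_scattered])
y = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]  # existing labels

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples returns lower values for more anomalous points,
# so negate it to get a score where higher means riskier.
anomaly_score = -iso.score_samples(X)

# predict returns -1 for points the forest considers anomalous.
flagged = iso.predict(X) == -1

# Records labeled normal that the unsupervised model flags anyway.
disagreements = np.flatnonzero(flagged & (y == 0))
print("labeled normal but flagged:", len(disagreements))

# Append the anomaly score as an extra feature for a supervised model.
X_augmented = np.column_stack([X, anomaly_score])
print(X_augmented.shape)
```

If the score is used as a feature, fitting the Isolation Forest inside each cross-validation fold (for example as a pipeline step) keeps the evaluation from being influenced by held-out rows.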

Hope it helps!
