What is the best practice to normalize/standardize imbalanced data for an outlier detection or binary classification task?

I'm researching anomaly/outlier/fraud detection, and I'm looking for the best practice for pre-processing synthetic, imbalanced data. I have reviewed the normalization/standardization methods that are not sensitive to the presence of outliers and could fit this case study. The scikit-learn 0.24.2 example "Compare the effect of different scalers on data with outliers" states:

If some outliers are present in the set, robust scalers or transformers are more appropriate.
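As a minimal sketch of what that recommendation looks like in practice, the snippet below applies scikit-learn's RobustScaler and QuantileTransformer to synthetic heavy-tailed features (the data here is made up purely for illustration, not drawn from CTU-13):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, QuantileTransformer

# Hypothetical heavy-tailed numeric features (e.g. byte counts, durations);
# replace with your own feature matrix.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.lognormal(mean=3.0, sigma=2.0, size=1000),  # heavy-tailed "bytes"
    rng.exponential(scale=5.0, size=1000),          # skewed "duration"
])

# RobustScaler centers by the median and scales by the IQR,
# so a few extreme outliers do not dominate the scale.
X_robust = RobustScaler().fit_transform(X)

# QuantileTransformer maps each feature onto a target distribution via its
# empirical quantiles; extreme outliers are squashed into the tails.
X_quantile = QuantileTransformer(
    output_distribution="normal", n_quantiles=500, random_state=0
).fit_transform(X)

print(X_robust.mean(axis=0), X_quantile.std(axis=0))
```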

I'm using the CTU-13 dataset; you can see an overview of its feature distributions here.

Given the synthetic nature of the dataset, I need to apply categorical encoding to some features/columns to convert them into numerical values for my representation-based learning model (e.g., using an image-like form of the data as input to learning algorithms such as a CNN; see Figure 6 in this paper).
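A hedged sketch of how the categorical and numeric columns could be handled together with scikit-learn's ColumnTransformer; the column names below are illustrative NetFlow-style fields, not necessarily the exact CTU-13 schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frame with a few NetFlow-like columns (names are assumptions).
df = pd.DataFrame({
    "Proto":    ["tcp", "udp", "tcp", "icmp"],
    "State":    ["S_RA", "CON", "S_RA", "ECO"],
    "Dur":      [0.12, 3.5, 1200.0, 0.01],
    "TotBytes": [276, 1065, 98_000_000, 64],
})

categorical = ["Proto", "State"]
numeric = ["Dur", "TotBytes"]

preprocess = ColumnTransformer([
    # One-hot encode categorical flow attributes; unseen categories at
    # transform time are ignored instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Outlier-resistant scaling for heavy-tailed numeric flow statistics.
    ("num", RobustScaler(), numeric),
])

X = preprocess.fit_transform(df)
print(X.shape)
```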

My question: Which normalization method best fits my research case, anomaly/outlier/fraud detection on imbalanced data, so that the preprocessing stage ultimately supports a robust outlier detection model or binary classifier?

Any help or pointers to the state of the art on this topic would be appreciated!

Topic binary-classification categorical-encoding imbalanced-data normalization anomaly-detection

Category Data Science


A bit far-fetched, perhaps, but if you are looking for fraud it may be worthwhile to check for values with strange digit patterns, like 999999000000.00, following ideas from Benford's law. Maybe this could be a score to add to your factors.
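For illustration only, here is one rough way such a digit-pattern check could be turned into a numeric score: compare the observed leading-digit frequencies of a set of values against the distribution predicted by Benford's law. The function name and the chi-squared-style distance are my own choices, not a calibrated fraud metric.

```python
import numpy as np

def benford_score(values):
    """Distance between the observed leading-digit distribution of `values`
    and the distribution predicted by Benford's law. Larger values suggest
    more 'unnatural' digit patterns. Illustrative sketch only."""
    values = np.asarray(values, dtype=float)
    values = np.abs(values[values != 0])
    # First significant digit of each value (1..9).
    leading = (values / 10 ** np.floor(np.log10(values))).astype(int)
    observed = np.bincount(leading, minlength=10)[1:10] / len(leading)
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.sum((observed - expected) ** 2 / expected))

# Natural-looking amounts vs. amounts with repeated digit patterns.
natural = np.random.default_rng(1).lognormal(5, 2, 10_000)
suspicious = np.array([999999000000.00, 999000.0, 900000.0, 99999.0] * 100)
print(benford_score(natural), benford_score(suspicious))
```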
