How to determine the abnormality of a specific variable by taking into account all the other variables in the data?

I have a machine learning / anomaly detection problem. I have a variable Y and several other variables X. The goal is to quantify how abnormal each observation is with respect to Y, while taking into account the values of the other variables (that is, the relationship between Y and X).

Normally, an anomaly detection algorithm finds anomalies in the whole data (Y + X), but in my case I want to zoom in on Y because it is a very important variable. If I quantified the abnormality over all my variables (Y + X), Y would be lost among the other variables.
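One way to make this requirement concrete, as a minimal sketch and not a definitive method: model Y from X, then score each observation by how far its Y value falls from an out-of-fold prediction. The synthetic data, the choice of `LinearRegression`, the 5-fold split, and the MAD-based scaling are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data (assumption): Y depends linearly on three X variables
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
y[7] += 5.0  # make one observation unusual in Y given its X values

# Out-of-fold predictions: each point is predicted by a model
# that was fitted without it
y_hat = cross_val_predict(LinearRegression(), X, y, cv=5)
resid = y - y_hat

# Robust z-score of the residuals; 1.4826 * MAD estimates the standard
# deviation when the residuals are roughly normal
center = np.median(resid)
mad = np.median(np.abs(resid - center))
score = np.abs(resid - center) / (1.4826 * mad)

top = int(np.argmax(score))
print(top, round(float(score[top]), 1))
```

Any regressor could replace the linear model here. The out-of-fold predictions are used so that an unusual point does not help fit the model that then judges it.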

This is not a strange requirement: when you fit a linear regression Y ~ X, you can compute Cook's distance, which is a kind of abnormality score that takes the relationship between Y and X into account.
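For reference, Cook's distance for an ordinary least squares fit can be computed directly with NumPy. This is a sketch on synthetic data (the data, the coefficients, and the injected outlier at index 10 are assumptions for illustration), using the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, p the number of fitted coefficients, and s^2 the residual variance estimate.

```python
import numpy as np

# Synthetic data (assumption): intercept plus two X variables
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[10] += 15.0  # one observation with an unusual Y given its X

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X])
p = A.shape[1]

# Ordinary least squares fit and residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Leverage h_i: diagonal of the hat matrix A (A'A)^-1 A'
leverage = np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)

# Residual variance estimate and Cook's distance per observation
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

print(int(np.argmax(cooks_d)))
```

If statsmodels is available, the `cooks_distance` attribute of the object returned by `get_influence()` on a fitted OLS result should give the same quantity without the manual algebra.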

I hope it is clear!



If you want to focus on the outliers with respect to the class, you can proceed as follows:

Using Isolation Forest

import pandas as pd
import numpy as np

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

import plotly.express as px

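# Load the iris data and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the outlier detector on the features only (the class y is not used here)
out_model = Pipeline([("model", IsolationForest(random_state=42))]).fit(X_train)

# Project the scaled features onto 3 principal components for plotting
visualizer = Pipeline([
    ("scaler", StandardScaler()),
    ("decomposer", PCA(n_components=3)),
    ("framer", FunctionTransformer(lambda x: pd.DataFrame(x, columns=["p1", "p2", "p3"]))),
]).fit(X_train)

# predict() returns 1 for inliers and -1 for outliers
outliers = out_model.predict(X_train)
X3D = visualizer.transform(X_train)

# 3D scatter: color encodes the class, marker symbol encodes the outlier flag
px.scatter_3d(data_frame=X3D, x="p1", y="p2", z="p3", color=y_train, symbol=outliers)

# Share of inliers and outliers within each class (normalized per row)
pd.crosstab(index=y_train, columns=outliers, margins=True, normalize=0)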

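Output:

[Figure: 3D scatter plot of the principal components p1, p2, p3, colored by class, with the marker symbol showing the outlier flag]

[Table: crosstab of class against outlier flag, normalized per class]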

We can see that about 35% of the observations belonging to class 2 are marked as outliers (predicted as -1).

This is a good starting point to analyze the reasons behind this.
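As a possible next step, the yes/no flag can be complemented with a continuous score per observation, which is closer to a "degree of abnormality". The sketch below assumes the same iris setup as above; `score_samples` is scikit-learn's raw Isolation Forest score (lower means more abnormal), and the per-class summary table is an illustrative choice, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Same data and split as in the answer's code
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = IsolationForest(random_state=42).fit(X_train)

# Continuous score: always negative, lower means more abnormal
scores = model.score_samples(X_train)
# Binary flag: -1 for outliers, 1 for inliers
flags = model.predict(X_train)

# Per-class summary: mean score and share of flagged observations
summary = (
    pd.DataFrame({"cls": y_train, "score": scores, "outlier": flags == -1})
    .groupby("cls")
    .agg(mean_score=("score", "mean"), outlier_rate=("outlier", "mean"))
)
print(summary)
```

Sorting the observations of a single class by `score` then gives a ranked list of the most unusual members of that class to inspect first.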

Hope it helps
