Classification Produces Too Many False Positives or False Negatives

I'm trying to use this data set (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) to classify whether a patient is at risk of having a stroke. As the title says, whatever model I run on the patients, the final results keep having either too many false positives or too many false negatives.

The data itself is severely imbalanced (95% of patients have stroke = 0, 5% have stroke = 1), and despite trying various things to balance it or compensate for it, I keep getting the same outcome.
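For reference, this is how I checked the class split (stroke_red is the data frame loaded from the Kaggle CSV):

stroke_red['stroke'].value_counts(normalize=True)  # roughly 0.95 vs 0.05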

For the record, yes, I have tried SMOTEing the training set with no success. I've also read a few articles warning against SMOTEing the test set because of data leakage (e.g. https://machinelearningmastery.com/data-leakage-machine-learning/ and https://imbalanced-learn.org/stable/common_pitfalls.html#data-leakage).
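To spell out the leak-free pattern I've been trying to follow: the resampler goes inside an imblearn Pipeline, so it is only ever fit on the training folds. A minimal, self-contained sketch with stand-in data (not my actual features):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in imbalanced data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.95, 0.05],
                                     random_state=1111)

# Because SMOTE sits inside the pipeline, cross_val_score fits it on the
# training folds only; the held-out fold is never resampled.
leak_free = ImbPipeline(steps=[
    ('smote', SMOTE(random_state=1111)),
    ('clf', LogisticRegression(max_iter=1000))
])
print(cross_val_score(leak_free, X_demo, y_demo, scoring='f1', cv=5).mean())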

Here is the code I've been using (Python 3.10):

X = stroke_red.drop('stroke', axis=1)  # Removes the stroke column.
Y = stroke_red.stroke  # We're storing the dependent variable here.
####### Pipelining #######
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

cat_pipe = Pipeline(
    steps=[
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('oh-encode', OneHotEncoder(handle_unknown='ignore', sparse=False))
    ]
)
num_pipe = Pipeline(
    steps=[
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler())
    ]
)

cont_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(exclude='number').columns

process = ColumnTransformer(
    transformers=[
        ('numeric', num_pipe, cont_cols),
        ('categorical', cat_pipe, cat_cols)
    ]
)

####### Splitting the data into train/test #######
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, StratifiedKFold

#preprocessing.

X_process = process.fit_transform(X)
Y_process = SimpleImputer(strategy='most_frequent').fit_transform(
    Y.values.reshape(-1,1)
)

X_train, X_test, Y_train, Y_test = train_test_split(X_process, Y_process, test_size=0.3,
                                                    random_state=1111)  # Splits data into train/test sections. Random_state = seed.
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

sm = SMOTENC(categorical_features=[0, 2, 3], random_state=1111)
X_train, Y_train = sm.fit_resample(X_train, Y_train)  # Resample the training split only.

Finally, the Extreme Gradient Boosting algorithm:

import xgboost as xgb
boostah = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100000, max_depth=5,
                            learning_rate=0.000001, n_jobs=-1, scale_pos_weight=20
                            ) # scale_pos_weight is a weight. #0s / #1s .
boostah.fit(X_train,Y_train)

predict = boostah.predict(X_test)

from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

print('Accuracy = ', accuracy_score(Y_test, predict))
print('F1 Score = ', f1_score(Y_test, predict))
print(classification_report(Y_test, predict))
print(confusion_matrix(Y_test, predict))
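For what it's worth, the scale_pos_weight=20 above is just the rounded class ratio. Here's how I compute it from the training labels (a small check, assuming Y_train is a 0/1 array):

import numpy as np

# scale_pos_weight = (#negatives / #positives) in the training labels;
# with a roughly 95/5 split this lands near 19-20.
y_arr = np.asarray(Y_train).ravel()
print('scale_pos_weight =', (y_arr == 0).sum() / (y_arr == 1).sum())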

Here are the results. Bear in mind, I had the SMOTE section commented out when running this:

Accuracy =  0.6966731898238747
F1 Score =  0.2078364565587734
              precision    recall  f1-score   support
           0       0.99      0.69      0.81      1459
           1       0.12      0.82      0.21        74
    accuracy                           0.70      1533
   macro avg       0.55      0.76      0.51      1533
weighted avg       0.95      0.70      0.78      1533
[[1007  452]
 [  13   61]]

Here are the results with SMOTE on:

Accuracy =  0.39008480104370513
F1 Score =  0.13506012950971324
              precision    recall  f1-score   support
           0       1.00      0.36      0.53      1459
           1       0.07      0.99      0.14        74
    accuracy                           0.39      1533
   macro avg       0.54      0.67      0.33      1533
weighted avg       0.95      0.39      0.51      1533
[[525 934]
 [  1  73]]

Any tips on fixing this? If you need my complete code, let me know, and I'll get it to you.

Behind the scenes, most models produce a confidence score for each prediction. You can retrieve these scores with model_name.predict_proba instead of model_name.predict. By default, predict applies a 0.5 threshold to that score: anything above 0.5 is assigned to the positive class. All you have to do is alter that threshold, and you can trade off performance between the two classes (fewer false negatives at the cost of more false positives, or vice versa).
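For example, with your XGBoost model it would look like the sketch below (boostah, X_test, and Y_test are the names from your code; the threshold values are just illustrative, in practice you'd sweep a finer grid and pick the trade-off you can accept):

from sklearn.metrics import confusion_matrix

# predict_proba returns one column per class; column 1 is P(stroke = 1).
proba = boostah.predict_proba(X_test)[:, 1]

# Raising the threshold above the default 0.5 trades false positives for
# false negatives; lowering it does the opposite.
for threshold in [0.3, 0.5, 0.7]:
    predict = (proba >= threshold).astype(int)
    print('threshold =', threshold)
    print(confusion_matrix(Y_test, predict))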
