Classification Produces Too Many False Positives or False Negatives
I'm trying to classify this data set to predict whether a patient is at risk of having a stroke. As the title says, whatever classifier I run, the final results contain either too many false positives or too many false negatives.
The data itself is severely imbalanced (95% 0s versus 5% 1s, where 1 means the patient had a stroke), and in spite of trying various things to balance it or compensate for it, I keep ending up with the same results.
For the record, yes, I have tried applying SMOTE to the training data set, with no success. I've also read a few articles advising against applying SMOTE to the test data set because of data leakage.
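To make the leakage point concrete, here is a minimal sketch of the order of operations as I understand it: hold out the test set first, then oversample only the training part. It uses plain random oversampling from the standard library rather than SMOTE, on made-up toy data. `split_then_oversample` is a hypothetical helper for illustration and is not part of the code further down.

```python
import random
from collections import Counter


def split_then_oversample(rows, labels, test_fraction=0.3, seed=1111):
    """Hold out a test set first, then oversample the minority class in the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]

    # Duplicate randomly chosen minority rows until both classes are the same size.
    counts = Counter(label for _, label in train)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_rows = [pair for pair in train if pair[1] == minority]
    extra = [rng.choice(minority_rows) for _ in range(counts[majority] - counts[minority])]

    return train + extra, test


# Toy data (an assumption, not the stroke data): 95% zeros and 5% ones.
rows = [[i] for i in range(1000)]
labels = [1 if i % 20 == 0 else 0 for i in range(1000)]

train, test = split_then_oversample(rows, labels)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print("train:", dict(train_counts), "test:", dict(test_counts))
```

The test rows are never resampled, so the test set keeps the original class ratio and no synthetic or duplicated row can appear on both sides of the split.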
Here is the code I've been using, with Python 3.10:
X = stroke_red.drop('stroke', axis=1) # Removes the stroke column.
Y = stroke_red.stroke # We're storing the dependent variable here.
####### Pipelining #######
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
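# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
cat_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2.
    ('oh-encode', OneHotEncoder(handle_unknown='ignore', sparse=False)),
])
# Numeric columns: fill gaps with the column mean.
num_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
])
cont_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(exclude='number').columns
process = ColumnTransformer(transformers=[
    ('numeric', num_pipe, cont_cols),
    ('categorical', cat_pipe, cat_cols),
])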
####### Splitting the data into train/test #######
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, StratifiedKFold
X_process = process.fit_transform(X)
# Assumed argument: the target reshaped to one column, since SimpleImputer needs 2-D input.
Y_process = SimpleImputer(strategy='most_frequent').fit_transform(Y.values.reshape(-1, 1))
X_train, X_test, Y_train, Y_test = train_test_split(X_process, Y_process, test_size=0.3,
                                                    random_state=1111)  # Splits data into train/test sections. random_state = seed.
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
sm = SMOTENC(categorical_features=[0,2,3], random_state=1111)
X_train, Y_train = sm.fit_resample(X_train, Y_train)
Finally, the Extreme Gradient Boosting algorithm:
import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
boostah = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100000, max_depth=5,
                            learning_rate=0.000001, n_jobs=-1, scale_pos_weight=20
                            )  # scale_pos_weight is a weight: #0s / #1s.
boostah.fit(X_train, Y_train)
predict = boostah.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, predict))
print('F1 Score = ', f1_score(Y_test, predict))
print(classification_report(Y_test, predict))
print(confusion_matrix(Y_test, predict))
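As a side note on the `scale_pos_weight` comment in the code above, this is a small sketch of how the "#0s / #1s" ratio can be computed from 0/1 labels. `suggested_scale_pos_weight` is a hypothetical helper name. The example labels only mirror the class sizes of the test split reported in the results (1459 zeros, 74 ones), so the number is illustrative; in practice the ratio would come from the training labels.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return (# negative samples) / (# positive samples) for 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples in labels")
    return counts[0] / counts[1]


# Illustrative labels only: 1459 zeros and 74 ones.
example_labels = [0] * 1459 + [1] * 74
ratio = suggested_scale_pos_weight(example_labels)
print(round(ratio, 2))
```

For those class sizes the ratio lands close to the hard-coded value of 20 used in the classifier.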
Here are the confusion matrix results. Bear in mind, I had the SMOTE section commented out when running this:
Accuracy = 0.6966731898238747
F1 Score = 0.2078364565587734
              precision    recall  f1-score   support

           0       0.99      0.69      0.81      1459
           1       0.12      0.82      0.21        74

    accuracy                           0.70      1533
   macro avg       0.55      0.76      0.51      1533
weighted avg       0.95      0.70      0.78      1533

[[1007  452]
 [  13   61]]
Here are the results with SMOTE on:
Accuracy = 0.39008480104370513
F1 Score = 0.13506012950971324
              precision    recall  f1-score   support

           0       1.00      0.36      0.53      1459
           1       0.07      0.99      0.14        74

    accuracy                           0.39      1533
   macro avg       0.54      0.67      0.33      1533
weighted avg       0.95      0.39      0.51      1533

[[525 934]
 [  1  73]]
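For anyone who wants to check the numbers, the reported scores can be recomputed by hand from the two confusion matrices. This is a minimal sketch that assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]` for labels 0 and 1; `metrics_from_confusion` is a hypothetical helper name.

```python
def metrics_from_confusion(matrix):
    """Derive accuracy, precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# The two confusion matrices reported above.
without_smote = metrics_from_confusion([[1007, 452], [13, 61]])
with_smote = metrics_from_confusion([[525, 934], [1, 73]])

for name, m in (("without SMOTE", without_smote), ("with SMOTE", with_smote)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Read this way, turning SMOTE on moves class-1 recall from 0.82 to 0.99 while class-1 precision falls from 0.12 to 0.07, which is the trade-off between false negatives and false positives I keep running into.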
Any tips on fixing this? If you need my complete code, let me know, and I'll get it to you.
