How to improve the result? Should I remove the columns?

Question

Agat0

May 30, 2022 08:36

I am using this dataset. The target is the last column, 'DEATH_EVENT', which I have separated from the features. I am using KMeans and counting the number of hits and misses. The result is quite bad. I think I should delete some columns, or write a loop that drops them one at a time. What would you do?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


X = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')

X = np.delete(X, 0, 0)
train, test = train_test_split(X, test_size=0.33, shuffle=True, random_state=100)

X_train = np.delete(train, -1, axis =1)
y_train = train[:, -1]

X_test = test[:, :-1]
y_test = test[:, -1]

from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

K = 2
kmeans = KMeans(n_clusters=K)
kmeans.fit(X_train)

pred = kmeans.predict(X_test)


# Count the predictions that match the true label (aciertos = hits)
n_items = len(pred)
aciertos = 0
for i in range(0, n_items):
    aciertos += 1 if (pred[i] == y_test[i]) else 0

print("Hits: %6.5f, misses %6.5f" % (aciertos/n_items, (n_items-aciertos)/n_items))


cm = confusion_matrix(y_test, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

output

Hits: 0.59596, misses 0.40404

???

Topic deep-learning python k-means

Category Data Science

lpounng · Accepted Answer · May 30, 2022 08:36

The reason is that an unsupervised algorithm is being used for a supervised problem.

Remember that there is no strict right or wrong in unsupervised learning. For example, say I have a dataset with 3 attributes: "Age", "Blood Pressure" and "Income". Both "Blood Pressure" and "Income" consist of 2 categories, "high" and "low".

There are 2 things I can do:

1. I can set "Income" as the target and train a supervised model to predict it from "Age" and "Blood Pressure". (Of course, Blood Pressure may have no predictive power at all.)
2. I can also feed the 2 attributes "Age" and "Blood Pressure" into an unsupervised algorithm, e.g. KMeans, and ask it to return 2 groups.

With option 2, there is a chance that the algorithm gives back 2 groups that turn out to be the two Blood Pressure clusters. Is this grouping useful for predicting "Income"? Probably not. But it is not wrong either: it correctly identified 2 Blood Pressure groups. They are just not related to income.
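To make the Age / Blood Pressure / Income example concrete, here is a small sketch on made-up data (all numbers and variable names below are hypothetical, chosen only for illustration). KMeans is fitted on Age and Blood Pressure, and its clusters are then compared against both the Blood Pressure groups and an unrelated Income label. Note that KMeans cluster ids are arbitrary (cluster 0 does not mean "low"), so the comparison takes the better of the two possible label mappings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two well-separated blood pressure groups,
# and an income label drawn independently of everything else.
bp_high = rng.integers(0, 2, size=n)
income_high = rng.integers(0, 2, size=n)
age = rng.normal(50, 5, size=n)
blood_pressure = np.where(bp_high == 1, 160.0, 110.0) + rng.normal(0, 5, size=n)

X = np.column_stack([age, blood_pressure])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def best_agreement(cluster_ids, truth):
    """Agreement with a binary label under the better of the two id mappings."""
    acc = np.mean(cluster_ids == truth)
    return max(acc, 1.0 - acc)

agreement_bp = best_agreement(clusters, bp_high)
agreement_income = best_agreement(clusters, income_high)

print("Agreement with Blood Pressure groups: %.3f" % agreement_bp)
print("Agreement with Income label:          %.3f" % agreement_income)
```

The clusters line up with the Blood Pressure groups, while agreement with Income stays near chance. The same arbitrary-id issue applies to counting hits with pred[i] == y_test[i]: a cluster labelled 0 is not guaranteed to correspond to DEATH_EVENT = 0.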

So in your case, the algorithm has detected 2 clusters, but they are not necessarily related to DEATH_EVENT. Use a supervised algorithm if you want to make predictions.
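As a rough sketch of the supervised route, assuming scikit-learn is available: the snippet below keeps the same train/test split as in the question but fits a classifier on the labels instead of clustering. Synthetic data from make_classification stands in for the CSV so that the snippet runs on its own (an assumption, and so are the feature counts); with the real file you would load X and y as in the question. LogisticRegression is just one possible choice, not necessarily the best model for this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-failure table (assumption):
# replace with the features and DEATH_EVENT column from the CSV.
X, y = make_classification(
    n_samples=299, n_features=12, n_informative=4, random_state=100
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=True, random_state=100
)

# Scale the features, then fit a classifier that actually sees the labels.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)

print("Hits: %6.5f, misses %6.5f" % (accuracy, 1 - accuracy))
print(cm)
```

Because the model is trained on y_train, its predictions use the same 0/1 coding as the target, so comparing pred with y_test is meaningful here.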
