What is the best/correct algorithm or procedure to cluster a dataset with a lot of 0's?

I'm new to statistics, so sorry for any major gaps in my knowledge of the topic; I'm just doing a project for graduation.

I'm trying to cluster a health dataset containing diseases (3456) and symptoms (25), grouping the diseases by the number of events that occurred for each symptom.

My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now):

So, I was wondering what the best way to cluster this dataset is. I looked around and found hierarchical clustering and k-means, but I don't know if I can properly apply them to this scenario. First of all, I switched the absolute values of occurrences to % of total; does that make it possible to deal with the 0's? I thought about it, but at the same time 1% is close to 0%, and I don't know if the algorithm can also treat the values as 'flags', since 1% means that the symptom actually occurs (even at a low rate) while in another disease it doesn't occur at all.

I heard about PCA to reduce the number of variables, and I was also curious whether: 1) PCA is applicable to this scenario (a sparse dataset with a lot of 0's), and 2) PCA could solve my problem, because I think that even when I reduce to 2 or 3 variables, some rows could still be 0 for a particular column (symptom).

Any help/guidance would be extremely helpful, and I thank you all in advance. Sorry for any English errors as well!

Have a great week!



Do not expect any clustering algorithm to just work as a black box.

Zeros themselves are probably not much of an issue, but scaling is. Your different diseases have different frequencies, and the same holds for certain symptoms (e.g., fever).
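To make the scaling point concrete, here is a minimal sketch, assuming a made-up count matrix with diseases as rows and symptoms as columns. The values and variable names are hypothetical, and row normalization followed by column standardization is only one possible choice of scaling, not the only reasonable one.

```python
import numpy as np

# Hypothetical counts: rows are diseases, columns are symptoms (made-up values).
counts = np.array([
    [120, 0, 30, 0],
    [12, 0, 3, 0],
    [0, 50, 0, 50],
], dtype=float)

# Row normalization: each disease becomes a symptom profile that sums to 1,
# so a frequent disease and a rare one with the same profile look identical.
row_totals = counts.sum(axis=1, keepdims=True)
profiles = counts / np.where(row_totals == 0, 1.0, row_totals)

# Column standardization: keeps a very common symptom (e.g. fever)
# from dominating Euclidean distances between diseases.
col_std = profiles.std(axis=0)
scaled = (profiles - profiles.mean(axis=0)) / np.where(col_std == 0, 1.0, col_std)

print(profiles)
print(scaled)
```

After row normalization, the first two diseases have the same profile even though one is ten times more frequent than the other, which is exactly the kind of effect raw counts would hide.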

So rather than picking a clustering algorithm because someone on the internet claims that it works better with "lots" of zeros (note that it is common to have 99.9% zeros in BOW models), you need to first narrow down your objective. Define what a good clustering is for your problem. Then pick an algorithm that best optimizes this quality. Don't pick the hammer and then assume your problem must be a nail...

In your case, I would suggest approaching it from a probabilistic point of view. Define some P() that expresses when two diseases are "likely to be related"; then you can use a wide variety of clustering algorithms afterwards (probably try something interpretable first, such as HAC).
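One way this could look in practice, as a rough sketch only: treat each disease's symptom counts as a probability distribution, compare diseases with the Jensen-Shannon distance, and run average-linkage HAC on the result. The Jensen-Shannon distance is my own stand-in for the P() above, not something prescribed here, and the counts are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, jensenshannon
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical counts: rows are diseases, columns are symptoms (made-up values).
# Assumes no disease has an all-zero row, since such a row has no distribution.
counts = np.array([
    [120, 0, 30, 0],
    [12, 0, 3, 0],
    [0, 50, 0, 50],
    [0, 9, 1, 10],
], dtype=float)

# jensenshannon normalizes each row to sum to 1 and handles zero entries,
# so the counts can be passed directly. pdist returns a condensed distance vector.
dist = pdist(counts, metric=jensenshannon)

# Average-linkage hierarchical agglomerative clustering on those distances.
tree = linkage(dist, method="average")

# Cut the dendrogram into two flat clusters (the number 2 is arbitrary here).
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The dendrogram in `tree` can also be plotted with `scipy.cluster.hierarchy.dendrogram`, which is what makes HAC comparatively easy to interpret.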


If I understand your problem statement correctly, you want to cluster data that has a lot of 0's (which means your data is not balanced). You can use any clustering technique you like and plot the data to visualize the result.

If you want to train your model for future prediction, I would suggest balancing the data before you start training (fitting) your chosen model.

You can use resample (from sklearn.utils import resample) to scale up or scale down your data; once done, you can concatenate the pieces and create the final dataset for your model training.
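A minimal sketch of that resampling step, assuming the data has a class label to balance on. The label column is hypothetical (the dataset in the question is not described as having one), and the toy arrays are made up.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical labelled data: 8 rows of class 0 and 2 rows of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_major, y_major = X[y == 0], y[y == 0]
X_minor, y_minor = X[y == 1], y[y == 1]

# Scale up the minority class by sampling with replacement
# until it has as many rows as the majority class.
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=len(y_major), random_state=0
)

# Concatenate into the final, balanced dataset.
X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([y_major, y_minor_up])
print(np.bincount(y_balanced))
```

Scaling down works the same way: call `resample` on the majority class with `replace=False` and `n_samples=len(y_minor)`.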

Thanks!


Maybe I am not understanding your question properly, but k-means clustering is not sensitive to the zeros. More accurately, zeros are valid values for features to cluster on.
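As a small illustration, here is k-means run directly on a toy matrix that is half zeros. The values are made up and already expressed as proportions per disease; the choice of two clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical symptom proportions per disease (made-up values, many zeros).
X = np.array([
    [0.8, 0.0, 0.2, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.6, 0.0, 0.4],
])

# Zeros are treated like any other coordinate value in the Euclidean distance.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels)
print(km.cluster_centers_)
```

The cluster centers stay readable here: each one is an average symptom profile, with zeros wherever no disease in that cluster shows the symptom.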

While PCA may enable you to reduce the number of features to train on, it may reduce the clarity of the clusters from k-means. PCA may be beneficial in its own right, but I would resist the urge to feed PCA data into k-means simply to reduce the number of zeros in your data.
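To show why the clusters can become harder to read, here is a small sketch on made-up proportions. PCA centers the data and builds each component as a weighted mix of all symptoms, so the projected values no longer map to a single symptom and the original zeros do not carry over as zeros.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical symptom proportions per disease (made-up values, many zeros).
X = np.array([
    [0.8, 0.0, 0.2, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.6, 0.0, 0.4],
])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Each row of components_ holds one weight per original symptom,
# so every new column of Z blends all four symptoms together.
print(pca.components_)
print(Z)
```

Clustering on `Z` instead of `X` would give centers expressed in these blended coordinates, which then have to be mapped back through `pca.inverse_transform` before they can be read as symptom profiles.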

HTH
