What is the best/correct algorithm/procedure to cluster a dataset with a lot of 0's?
I'm new to statistics, so sorry for any major lack of knowledge on the topic; I'm just doing a project for graduation.
I'm trying to cluster a health dataset containing diseases (3456) and symptoms (25), grouping them according to the number of events that occurred.
My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now):
So, I was wondering what the best way to cluster this dataset would be. I looked around and found hierarchical clustering and k-means, but I don't know if I can properly apply them to this scenario. First of all, I switched the absolute numbers of occurrences to % of total; does that make it possible to deal with the 0's? I thought about it, but at the same time 1% is close to 0%, and I don't know if the algorithm can also treat the values as 'flags', since 1% means the symptom actually occurs (even at a low rate), while in another disease it doesn't occur at all.
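To make the concern concrete, here is a minimal sketch in plain Python. The disease names and counts are made up for illustration, and the choice of Euclidean distance on percentages versus Jaccard distance on presence flags is only my assumption about two ways an algorithm could "see" the data, not a recommended method.

```python
import math

# Hypothetical toy counts: rows are diseases, columns are symptoms.
counts = {
    "disease_a": [50, 0, 50, 0],
    "disease_b": [49, 1, 50, 0],
    "disease_c": [0, 60, 0, 40],
}

def to_percent(row):
    """Convert raw event counts to percentages of the row total."""
    total = sum(row)
    return [100.0 * v / total if total else 0.0 for v in row]

def to_flags(row):
    """1 if the symptom occurred at least once, else 0."""
    return [1 if v > 0 else 0 for v in row]

def euclidean(u, v):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the symptoms that are present."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0

percent = {name: to_percent(row) for name, row in counts.items()}
flags = {name: to_flags(row) for name, row in counts.items()}

# disease_a vs disease_b differ only by a 0% vs 1% symptom.
d_pct_ab = euclidean(percent["disease_a"], percent["disease_b"])
d_flag_ab = jaccard_distance(flags["disease_a"], flags["disease_b"])

# disease_a vs disease_c share no symptoms at all.
d_pct_ac = euclidean(percent["disease_a"], percent["disease_c"])
d_flag_ac = jaccard_distance(flags["disease_a"], flags["disease_c"])

print("a vs b:", d_pct_ab, d_flag_ab)
print("a vs c:", d_pct_ac, d_flag_ac)
```

On percentages, the 0% vs 1% difference between `disease_a` and `disease_b` is tiny compared with the distance to `disease_c`, while on flags the same difference shows up as a clear mismatch in one of three present symptoms. That is exactly the trade-off I am unsure how to handle.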
I heard about PCA as a way to reduce the number of variables, and I was also curious whether: 1) PCA is applicable to this scenario (a dataset with a lot of sparse 0's), and 2) PCA could solve my problem (because I think that even when I reduce to 2 or 3 variables, some rows could still be 0 for a particular column (symptom)).
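As a way of checking my own understanding of the second point, here is a small sketch of PCA done by hand in plain Python (centering, covariance, then power iteration for the first component only). The matrix `X` is made-up data, and power iteration is just a simple stand-in I chose for whatever a library would do internally.

```python
import math

# Hypothetical percentages (rows: diseases, columns: symptoms), many zeros.
X = [
    [50.0, 0.0, 50.0, 0.0],
    [49.0, 1.0, 50.0, 0.0],
    [0.0, 60.0, 0.0, 40.0],
    [0.0, 55.0, 5.0, 40.0],
]
n, p = len(X), len(X[0])

# Center each column on its mean, as PCA requires.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]

# Sample covariance matrix (p x p).
cov = [
    [sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1) for b in range(p)]
    for a in range(p)
]

# Power iteration for the leading eigenvector (first principal axis).
# Start from a unit vector on the first symptom; an all-ones start would
# fail here because every row sums to 100, so the centered rows sum to 0.
v = [1.0] + [0.0] * (p - 1)
for _ in range(200):
    w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Score of each disease on the first component: a weighted sum of ALL
# centered symptom columns, so a 0 in one original column does not
# produce a 0 score.
scores = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]

for i, s in enumerate(scores):
    print("row", i, "score on first component:", round(s, 3))
```

If I read this correctly, the new variables are mixtures of all the symptoms, so the zeros do not carry over as zeros. The first two rows (0% vs 1% on the second symptom) still end up with almost the same score, though, so I am not sure PCA addresses the 'flag' concern.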
Any help/guidance would be extremely helpful, and I thank you all in advance. Sorry for any English errors as well!
Have a great week!
Topic pca missing-data data k-means clustering
Category Data Science