How to cluster/group these data points (using K-Means or hierarchical clustering)

Question

asmgx

December 2, 2021 05:26

I have genes from different species:

Gene A, Gene B, Gene C, ... Gene Z

Some genes are similar to each other:

A and G are 96% similar
C and H are 92% similar
G and B are 89% similar
G and T are 85% similar
...
K and F are 52% similar

I want to classify these genes into groups of species

A, B, T, G are the same species
C, H, N, R, L, P are the same species
...
K does not seem to be similar to any species (it is unknown or a species by itself)
F does not seem to be similar to any species (it is unknown or a species by itself)

I know that I can use K-Means to cluster these genes, but I am not sure how to build the feature set to feed into K-Means.

All the examples online are for 2-dimensional datasets, something like this.

Can someone help me with how to build the features of this dataset so they can be used with K-Means?

Topic hierarchical-data-format feature-extraction k-means

Category Data Science

spectre · Accepted Answer · December 2, 2021 05:26

One thing you can do is take all the features you consider important from a clustering point of view, using your domain knowledge, and then use PCA to project them onto the few components that capture most of the variance. Note that PCA does not pick out original features; it builds new ones (principal components) as combinations of them. Those components are what you would use in the clustering algorithm.

Here is a link that does that.

Another article, although it does not use PCA, is an excellent overview of the different types of clustering. You can add a PCA step to what that article does if you want.
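
To make the PCA-then-cluster idea above concrete, here is a minimal sketch in Python with scikit-learn. It rests on assumptions that are not part of the answer itself: each gene's row of similarities to every other gene is used as its feature vector, pairs with no reported similarity are filled with 0.0, and the helper names (build_similarity_matrix, cluster_genes) and the choice of 2 clusters are purely illustrative. Only the four similarity values quoted in the question are used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Genes and pairwise similarities taken from the question (as fractions).
genes = ["A", "B", "C", "G", "H", "T"]
pairs = {("A", "G"): 0.96, ("C", "H"): 0.92, ("G", "B"): 0.89, ("G", "T"): 0.85}


def build_similarity_matrix(genes, pairs, default=0.0):
    """Hypothetical helper: symmetric gene-by-gene similarity matrix.

    Assumption: pairs that are not listed get `default` similarity.
    """
    index = {g: i for i, g in enumerate(genes)}
    sim = np.full((len(genes), len(genes)), default, dtype=float)
    np.fill_diagonal(sim, 1.0)  # every gene is identical to itself
    for (a, b), s in pairs.items():
        sim[index[a], index[b]] = s
        sim[index[b], index[a]] = s
    return sim


def cluster_genes(sim, n_clusters, n_components=2, seed=0):
    """Hypothetical helper: reduce each gene's similarity row with PCA, then run K-Means."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(sim)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(reduced)


sim = build_similarity_matrix(genes, pairs)
labels = cluster_genes(sim, n_clusters=2)
for gene, label in zip(genes, labels):
    print(gene, label)
```

K-Means needs the number of clusters up front, so n_clusters here is a guess you would have to tune for real data.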

The reason the articles you mention convert to 2 dimensions is that they want to visualize the clusters on a 2-D graph. We can instead reduce the dataset to 3 dimensions (i.e. only 3 components) using PCA and plot them on a 3-D graph, but that is the limit: a 4th dimension cannot be drawn as another axis. If you don't need to visualize the clusters, you can keep as many of the highest-variance components as you need and use them in your clustering algorithm.
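
As a rough illustration of the two options just described, the sketch below uses scikit-learn's PCA once with exactly 3 components (for a 3-D plot) and once with a variance target instead of a fixed count (for clustering without plotting). The feature matrix X is synthetic and made up for this example, and the 95% threshold is an arbitrary assumption, not a recommendation from the answer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples with 10 features, constructed so that most
# of the variance lies along 3 underlying directions plus a little noise.
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(50, 10))

# Option 1: keep exactly 3 components so the result can be drawn on a 3-D graph.
X_3d = PCA(n_components=3).fit_transform(X)

# Option 2: no plotting needed, so keep however many components are required
# to explain at least 95% of the variance (a float n_components means a
# variance fraction in scikit-learn).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("3-D shape:", X_3d.shape)
print("components kept for 95% variance:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Either X_3d or X_reduced can then be passed to the clustering algorithm in place of the original features.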

Cheers!
