Clustering analysis for observations with lists as data

Question

quarksome

August 14, 2019, 15:17

So I have several samples that were analyzed for their chemical composition. After data analysis, for each sample I have a list of the compounds found and their corresponding relative abundances. Some compounds are unique to one sample, but most are found in most samples.

I want to do clustering analysis based on these lists of compounds. How do I go about this? Specifically, how do I vectorize my dataset, given that each observation is actually an array with both numerical (abundance) and categorical (compound label) variables?

Topic bioinformatics clustering

Category Data Science

stefanLopez · Accepted Answer · August 14, 2019, 15:17

K-means would be a fine clustering method to start with, though you have to tell it how many clusters to return (not sure if you know that number or can figure it out). If you can't, check out DBSCAN, which does not need the number of clusters up front.
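To make the two options concrete, here is a minimal sketch using scikit-learn (my assumption; any library with K-means and DBSCAN would do). The matrix `X` is made-up toy data with one row per sample and one column per compound, and `n_clusters=2`, `eps=0.2` and `min_samples=2` are illustrative values, not recommendations for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical toy data: 4 samples x 4 compounds, relative abundances.
X = np.array([
    [0.50, 0.30, 0.20, 0.00],
    [0.45, 0.35, 0.20, 0.00],
    [0.05, 0.10, 0.05, 0.80],
    [0.00, 0.15, 0.05, 0.80],
])

# K-means: the number of clusters has to be chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN: no cluster count, but a neighborhood radius (eps) and a
# minimum neighborhood size (min_samples) instead.
# A label of -1 would mark a sample as noise.
dbscan = DBSCAN(eps=0.2, min_samples=2)
dbscan_labels = dbscan.fit_predict(X)

print("K-means labels:", kmeans_labels)
print("DBSCAN labels: ", dbscan_labels)
```

On real data you would probably try several values of `n_clusters` or `eps` and compare the results rather than trust a single run.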

As for the mix of numerical and categorical data types, all you need to do is one-hot encode your categorical variables. One-hot encoding takes all of the known values of a category and creates a new feature for each of them: a sample gets a 1 if it belongs to that category and a 0 if it does not. This way you can use numerical and categorical features at the same time. Just make sure to normalize your numerical data!
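Here is one possible reading of that advice as a small sketch in plain Python. The sample names, compound names and abundances are invented for illustration, and min-max scaling is just one way to normalize; the answer does not say which to use. Each sample becomes a fixed-length vector: one 0/1 indicator per compound, followed by one scaled abundance per compound.

```python
# Hypothetical toy data: sample -> list of (compound, relative abundance).
samples = {
    "s1": [("caffeine", 0.6), ("glucose", 0.4)],
    "s2": [("caffeine", 0.5), ("glucose", 0.3), ("vanillin", 0.2)],
    "s3": [("glucose", 0.1), ("vanillin", 0.9)],
}

# Fixed column order: every compound seen in any sample.
compounds = sorted({name for pairs in samples.values() for name, _ in pairs})


def vectorize(pairs):
    """Return (presence, abundance) vectors aligned with `compounds`."""
    found = dict(pairs)
    presence = [1 if c in found else 0 for c in compounds]
    abundance = [found.get(c, 0.0) for c in compounds]  # absent -> 0.0
    return presence, abundance


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns become all 0.0."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


presence_rows, abundance_rows = zip(*(vectorize(p) for p in samples.values()))
scaled_abundance = min_max_scale(abundance_rows)

# One feature vector per sample: indicators first, then scaled abundances.
features = [list(p) + a for p, a in zip(presence_rows, scaled_abundance)]

for name, row in zip(samples, features):
    print(name, row)
```

Note that an absent compound already shows up as an abundance of 0, so the indicator columns may turn out to be redundant for your data; it could be worth clustering on the abundance columns alone and comparing.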
