Clustering analysis for observations with lists as data

I have several samples that were analyzed for their chemical composition. After data analysis, I have, for each sample, a list of the compounds found and their corresponding relative abundances. Some compounds are unique to one sample, but most are found in most samples.

I want to do a clustering analysis based on these lists of compounds. How do I go about this? Specifically, how do I vectorize my dataset, given that each observation is actually an array with both numerical (abundance) and categorical (compound label) variables?



K-means would be a fine clustering method to start with, though you have to tell it up front how many clusters to return (not sure if you know that number or can figure it out). If you can't, check out DBSCAN, which does not need a cluster count.
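To make the "you have to provide the number of clusters" point concrete, here is a minimal sketch of the basic k-means loop using only the standard library. The toy points, the function name `kmeans`, and the choice to seed the centroids with the first `k` points are all assumptions made for illustration; in practice you would more likely reach for a library implementation such as scikit-learn's `KMeans`.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length vectors."""
    # Assumption for this sketch: seed centroids with the first k points
    # so the result is deterministic. Real implementations seed more carefully.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two obvious groups in 2-D.
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
labels, centroids = kmeans(points, k=2)
```

Note that `k` is an input here, not something the algorithm discovers, which is exactly the limitation mentioned above.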

As for the mix of numerical and categorical data types, all you need to do is one-hot encode your categorical variables. One-hot encoding takes every known value of a category and creates a new feature for each one. A sample gets a 1 in that feature if it belongs to that category and a 0 if it does not. In your case that means one column per compound. This way you can use numerical and categorical data at the same time. Just make sure to normalize your numerical data!
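Here is a rough sketch of what that vectorization could look like for compound lists, again in plain Python. The sample and compound names are made up, and the helper names `vectorize` and `min_max_scale` are hypothetical. It assumes each sample is stored as a dict mapping compound to relative abundance, treats a compound that is missing from a sample as abundance 0, and uses min-max scaling as one possible way to normalize.

```python
def vectorize(samples):
    """Turn {sample: {compound: abundance}} into aligned feature rows."""
    # One column per compound seen in any sample, in a fixed order.
    compounds = sorted({c for found in samples.values() for c in found})
    names = sorted(samples)
    # One-hot style presence/absence features.
    presence = [[1 if c in samples[s] else 0 for c in compounds] for s in names]
    # Abundance features; assumption: an absent compound counts as 0.
    abundance = [[samples[s].get(c, 0.0) for c in compounds] for s in names]
    return names, compounds, presence, abundance

def min_max_scale(rows):
    """Scale each column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

# Hypothetical data, for illustration only.
samples = {
    "sample_a": {"limonene": 40.0, "linalool": 10.0},
    "sample_b": {"limonene": 20.0, "pinene": 30.0},
    "sample_c": {"linalool": 5.0, "pinene": 15.0},
}
names, compounds, presence, abundance = vectorize(samples)
scaled = min_max_scale(abundance)
# Final feature vector per sample: presence flags followed by scaled abundances.
features = [p + a for p, a in zip(presence, scaled)]
```

Each row of `features` is a fixed-length numeric vector, so it can be handed to whatever clustering method you pick. Whether you keep the presence flags, the abundances, or both is a judgment call for your data.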
