Clustering mixed data types - numeric, categorical, arrays, and text
I have a dataset with 4 types of data columns:
id  numeric  categorical  tags            text
1   51585    27           [A, B, C, ...]  "Some text bla bla bla"
2   53596    27           [B, D, E]       "Other text..."
3   1176345  27           [D, A, F, ...]  "..."
4   168      24           NaN             "..."
5   88564    22           NaN             "..."
- numeric: continuous numeric values.
- categorical: discrete categories, either numbers or strings (the type doesn't really matter because I can convert it to whatever works).
- tags: an array containing discrete values. Each row can have a different array length.
- text: a string of text.
I am new to data science so maybe this is a "beginner" question.
How can I use all of these different data types in a clustering algorithm?
Here is what I learned so far:
- K-means is good for numerical data. I successfully applied it to a subset of my data with only numerical columns. I also used some evaluation metrics (such as the silhouette coefficient) to help me choose the number of clusters. So this works in principle, but since it's not using most of my data, the results are not good.
- Then I read about clustering categorical data. I found the Gower distance, a distance measure that works with categorical data. So far I've used it with K-means (I passed the distance matrix generated by Gower into K-means). From here it should be easy to join the Gower distance matrix with the numeric columns from my original dataset and pass all of them to K-means.
I am aware there are other clustering algorithms besides K-means, and I plan to check some others as well. But before I do, I want to find some way to utilize all of my data in a single algorithm.
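A minimal sketch of the numeric-only step from the first point, assuming scikit-learn. The synthetic blobs stand in for the numeric columns, and the helper name `best_k_by_silhouette` is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_k_by_silhouette(X, k_values, random_state=0):
    """Fit K-means for each k and return (best_k, {k: silhouette score})."""
    # Scale first so that no single numeric column dominates the distances.
    X_scaled = StandardScaler().fit_transform(X)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    # The k with the highest mean silhouette coefficient wins.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k, scores)
```

On real data the silhouette curve is usually much flatter than on blobs like these, so the chosen number of clusters is a hint and not a definitive answer.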
- The tags and text columns stump me. I can't find a way to use them for clustering. I found some articles about clustering words from a text document, but this is not what I want to do. I want to use a text column as one (or more) "feature" among others for clustering.
- I am aware of the "bag of words" method for converting text into a vector of numbers. I can also easily imagine how to use this same method for converting the tags into a vector. However, that seems like a bit of overkill because it will increase the dimensionality of my data by a lot. Are there other ways to tackle this?
Bottom line - I am looking for a way to use all these data types together for clustering. I summarized what I know so far, but I am open to any solution, even if it's completely different from what I've listed above.
Thanks!