Clustering mixed data types - numeric, categorical, arrays, and text

I have a dataset with 4 types of data columns:

              numeric  categorical            tags                     text
id
1               51585           27  [A, B, C, ...]  "Some text bla bla bla"
2               53596           27  [B, D, E]               "Other text..."
3             1176345           27  [D, A, F, ...]                    "..."
4                 168           24             NaN                    "..."
5               88564           22             NaN                    "..."
  • numeric - continuous numeric values.
  • categorical - discrete categories, either numbers or strings (the type doesn't really matter because I can convert it to whatever works)
  • tags - array containing discrete values. Each row can have a different array length.
  • text - a string of text.

I am new to data science so maybe this is a "beginner" question.

How can I use all of these different data types in a clustering algorithm?

Here is what I learned so far:

  • K-means is good for numerical data. I successfully applied it to a subset of my data with only numerical columns. I also used some evaluation metrics (such as the silhouette coefficient) to help me choose the number of clusters. So this works in principle, but since it's not using most of my data the results are not good.
  • Then I read about clustering categorical data. I found the Gower distance, which measures dissimilarity between rows that contain categorical values. So far I've used it with K-means (I passed the distance matrix generated by Gower into K-means). From here it should be easy to join the Gower distance matrix with the numeric columns from my original dataset and pass all of them to K-means.
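
To make the Gower idea concrete, here is a minimal sketch, not a definitive implementation. It computes a Gower-style distance by hand over the numeric and categorical values from the sample table: range-scaled absolute difference for numeric columns, simple mismatch for categorical ones. The function name gower_distance is made up for illustration. K-means implementations generally expect feature vectors rather than a distance matrix, so the sketch swaps in average-linkage hierarchical clustering from SciPy, which does accept precomputed distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def gower_distance(numeric, categorical):
    """Pairwise Gower-style distance (hypothetical helper for illustration).

    numeric:     array of shape (n_rows, n_numeric_columns)
    categorical: array of shape (n_rows, n_categorical_columns)
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)

    # Numeric part: absolute difference scaled by each column's range -> [0, 1].
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical part: 0 when two rows match, 1 when they differ.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    # The distance is the mean of the per-column dissimilarities.
    n_columns = numeric.shape[1] + categorical.shape[1]
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_columns


# Values taken from the "numeric" and "categorical" columns of the sample table.
numeric = np.array([[51585], [53596], [1176345], [168], [88564]])
categorical = np.array([[27], [27], [27], [24], [22]])

distances = gower_distance(numeric, categorical)

# Hierarchical clustering works on the condensed form of the distance matrix.
condensed = squareform(distances, checks=False)
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

The number of clusters (2) and the linkage method are arbitrary choices for the toy data. Any algorithm that accepts a precomputed distance matrix could take the place of the hierarchical step.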

I am aware there are other clustering algorithms besides K-means, and I plan to check some others as well. But before I do, I want to find some way to utilize all of my data in a single algorithm.

  • The tags and text columns stump me. I can't find a way to use them for clustering. I found some articles about clustering words from a text document - this is not what I want to do. I want to use a text column as one (or more) "feature" among others for clustering.
  • I am aware of the "bag of words" method for converting text into a vector of numbers. I can also easily imagine how to use this same method for converting the tags into a vector. However, that seems like overkill because it will increase the dimensionality of my data by a lot. Are there other ways to tackle this?
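
One common way to keep the bag-of-words approach without the dimensionality blow-up is to compress the sparse columns afterwards. The sketch below assumes scikit-learn is available and uses invented toy rows, not the real dataset. It encodes tags as one binary column per tag, encodes text with TF-IDF, and reduces both with truncated SVD before clustering. The component count and cluster count are arbitrary.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows invented for illustration; missing tag lists (NaN) become empty lists.
tags = [["A", "B", "C"], ["B", "D", "E"], ["D", "A", "F"], [], []]
texts = [
    "some text about cats and dogs",
    "other text about dogs",
    "text about stock prices",
    "stock prices and markets",
    "cats and markets",
]

# One binary column per distinct tag (a "bag of tags").
tag_matrix = MultiLabelBinarizer(sparse_output=True).fit_transform(tags)

# One TF-IDF weighted column per distinct word.
text_matrix = TfidfVectorizer().fit_transform(texts)

# Stack both sparse blocks side by side, then compress them to a few dense columns.
sparse_features = hstack([tag_matrix, text_matrix]).tocsr()
svd = TruncatedSVD(n_components=3, random_state=0)
dense_features = svd.fit_transform(sparse_features)

# The dense block is an ordinary numeric matrix, so K-means can consume it.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(dense_features)
```

The dense columns could then be placed next to scaled numeric columns and encoded categorical columns, so that one algorithm sees every data type. How to weight the blocks against each other is a separate judgment call.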

Bottom line - I am looking for a way to use all these data types together for clustering. I summarized what I know so far, but I am open to any solution, even if it's completely different from what I've listed above.

Thanks!



One option is to learn an embedding for all the data in the same space, then apply any clustering technique. One way to do that is with the StarSpace package.


  1. For the tags: Do you know how they are generated? How many unique tags do you have? If they are self-generated (i.e. there are lots of tags, some of which are subsets of other tags), you might need to do tag consolidation, which will also help reduce the dimensionality of the tag vectors. If you can provide a bit more information about what your data looks like and where it comes from, I can perhaps provide a more in-depth answer.

  2. For the text: You might want to try using word embeddings. You can use a pre-trained word2vec model.

  3. I am not sure if it makes sense to use two different distance metrics. Your categorical data looks like it's made of integers. Is it ordinal, or are those indexes?
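
As a rough illustration of the word-embedding suggestion in point 2, here is a minimal sketch that turns each text into one fixed-length vector by averaging its word vectors. The three-dimensional toy_vectors dictionary and the embed_text helper are hypothetical stand-ins. A real pre-trained word2vec model would supply the lookup, typically with a few hundred dimensions per word.

```python
import numpy as np

# Hypothetical toy word vectors standing in for a pre-trained model.
toy_vectors = {
    "cats": np.array([0.9, 0.1, 0.0]),
    "dogs": np.array([0.8, 0.2, 0.0]),
    "stock": np.array([0.0, 0.1, 0.9]),
    "prices": np.array([0.1, 0.0, 0.8]),
}


def embed_text(text, vectors, dim=3):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [vectors[word] for word in text.lower().split() if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


texts = ["Cats and dogs", "stock prices", "nothing known here"]

# One row per text, one column per embedding dimension.
text_features = np.vstack([embed_text(t, toy_vectors) for t in texts])
```

Averaging keeps the text column at a fixed, modest width no matter how large the vocabulary is, and the resulting rows can sit alongside the other feature columns. Whitespace tokenization and the zero vector for unknown text are simplifications made for this sketch.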
