Topic models for non-textual data?

Question

Jamie

February 28, 2022 11:01

I am looking to apply unsupervised clustering to a dataset where each observation has a mix of textual and non-textual features.

For each observation, I combine the features into a single vector of ~1000 dimensions. To cluster I have two potential ideas:

1. Use an autoencoder (or an embedding?) to reduce the dimensionality of the data, then cluster with k-means.
2. Use a topic model. Is that possible? If so, isn't it superior to the above in most circumstances?

Why are topic models (in my experience) not commonly used for non-textual data? Is this just a relic of their name/original application, or is there something more fundamental?

Thanks!

Topic unsupervised-learning topic-model k-means clustering

Category Data Science

Brian Spiering · Accepted Answer · May 15, 2020 13:59

StarSpace is a model that can learn to embed a mix of textual and non-textual features. Once all the features are converted to numerical representations, any topic model algorithm can work (e.g., LSA, PLSA, LDA, or variations).
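As a rough illustration of "convert everything to numbers, then run a topic model", here is a minimal sketch of LSA (a truncated SVD of the observation-by-feature matrix) using only NumPy. The toy texts, the `speed` feature, and the bin threshold are all made up for the example, and binning the numeric feature into one-hot "tokens" is just one assumed way of turning non-textual data into counts; it is not what StarSpace does.

```python
import numpy as np

# Hypothetical toy data: each observation has a short text and one numeric feature.
texts = ["red fast car", "fast red bike", "slow green boat", "green slow ship"]
speed = np.array([120.0, 90.0, 15.0, 20.0])

# Textual part: plain bag-of-words counts over a sorted vocabulary.
vocab = sorted({word for text in texts for word in text.split()})
text_counts = np.array(
    [[text.split().count(word) for word in vocab] for text in texts], dtype=float
)

# Non-textual part: bin the numeric feature and one-hot encode the bin,
# so each bin behaves like an extra "word" (the threshold is an assumption).
bin_edges = np.array([50.0])
bins = np.digitize(speed, bin_edges)          # 0 = slow bin, 1 = fast bin
numeric_counts = np.eye(len(bin_edges) + 1)[bins]

# Combined non-negative observation-by-feature matrix.
X = np.hstack([text_counts, numeric_counts])

# LSA: keep the top-k singular directions as latent "topics".
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * S[:k]                 # observations in topic space
topic_features = Vt[:k]                       # feature loadings per topic

print(doc_topics.round(2))
```

Observations that share words and bins end up close together in `doc_topics`, which you could then cluster or inspect directly. For LDA or PLSA the same count matrix `X` would be the input, since those models expect non-negative counts rather than arbitrary real-valued embeddings.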

teoML · Accepted Answer · December 9, 2019 14:19

I think that you can use a topic model such as Latent Dirichlet Allocation (LDA). For example, in this paper https://pdfs.semanticscholar.org/9e6f/33bdd04df0536f6ad6783d33cccfbc54b1b1.pdf it is used for music and images. I suggest you take a look at it :)

In general, in topic modeling you end up with a list of topics, where each topic contains a set of associated keywords. In clustering, depending on the algorithm, you might get a hierarchy of clusters, or you can use algorithms which assign each sample to exactly one class. In addition, when clustering you usually have to pre-define a distance metric (e.g. Euclidean distance).

Topic models, especially LDA, are based on the assumption that each item in your data is a distribution over topics, and each topic is a distribution over keywords (one keyword can appear in many topics). In other words, you already assume how the texts/documents have been generated.
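To make that generative assumption concrete, here is a small sketch of how LDA supposes a single document (or any bag of discrete features) is produced. The number of topics, vocabulary size, document length, and the Dirichlet parameters `alpha` and `beta` are arbitrary values chosen for illustration, and `generate_document` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_length = 3, 8, 50
alpha = np.full(n_topics, 0.5)    # assumed prior over topics within a document
beta = np.full(vocab_size, 0.5)   # assumed prior over keywords within a topic

# Each topic is a probability distribution over the keyword vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)        # shape (n_topics, vocab_size)

def generate_document():
    """Sample one document according to the LDA generative story."""
    theta = rng.dirichlet(alpha)                        # this document's topic mixture
    z = rng.choice(n_topics, size=doc_length, p=theta)  # one topic per token
    words = np.array([rng.choice(vocab_size, p=topic_word[t]) for t in z])
    counts = np.bincount(words, minlength=vocab_size)   # bag-of-keywords counts
    return theta, counts

theta, counts = generate_document()
print("topic mixture:", theta.round(2))
print("keyword counts:", counts)
```

Fitting LDA means running this story in reverse: given only the counts, infer plausible `theta` and `topic_word`. Note that `theta` is a mixture, so one observation can belong partly to several topics, whereas k-means would assign it to a single cluster based on the chosen distance.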
