Clustering using both text and numerical features

I have a dataset that contains two types of features: a vector generated by doc2vec and a single numerical feature. I would like to perform clustering analysis on them. However, because the doc2vec vector has many more dimensions, simply concatenating everything into one array would make the clustering algorithm put most of the weight on the doc2vec features. How do I overcome this problem?

For example, for a given label, say I have doc2vec features that look like [1,2,3,4,5] and a numerical feature [2]. I don't want to simply combine them into [1,2,3,4,5,2] and perform clustering analysis on that. Ideally, I would like my clustering algorithm to give the numerical feature the same importance as the whole doc2vec vector.

Topic doc2vec feature-engineering unsupervised-learning clustering machine-learning

Category Data Science


One way to achieve this is to use a clustering method based on a custom similarity/distance measure. For example, you could define the similarity measure between two instances as:

$$\operatorname{sim}(\langle v_1, n_1\rangle,\langle v_2, n_2\rangle)=\frac{1}{2} \operatorname{cosine}(v_1,v_2)\ +\ \frac{1}{2} \left(1-\frac{|n_1-n_2|}{\max(n_1,n_2)}\right)$$

This measure gives the same weight to the similarity between the vectors ($v_1$ and $v_2$) and the similarity between the numerical values ($n_1$ and $n_2$). Note that this similarity lies in $[0,1]$ as long as the cosine is non-negative and the numerical values are positive, in which case you can convert it to a normalized distance measure: $d = 1 - sim$. Of course, you should define the exact measure based on what the values represent; this is just an example.
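As a minimal sketch of the formula above, assuming Python with NumPy (neither is named in the question) and non-negative numerical values, the measure could be written as follows. The function names `combined_similarity` and `combined_distance` are made up for illustration, and treating two zero values as identical is an added assumption to avoid dividing by zero.

```python
import numpy as np


def combined_similarity(v1, n1, v2, n2):
    """Average of the cosine similarity of the vectors and the
    relative similarity of the two numerical values."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # Assumption: n1 and n2 are non-negative. If both are 0 the formula
    # would divide by zero, so that case is treated as "identical".
    largest = max(n1, n2)
    numeric = 1.0 if largest == 0 else 1.0 - abs(n1 - n2) / largest

    return 0.5 * cosine + 0.5 * numeric


def combined_distance(v1, n1, v2, n2):
    """Distance d = 1 - sim, as described in the text."""
    return 1.0 - combined_similarity(v1, n1, v2, n2)
```

With this definition, the single numerical value contributes half of the score regardless of how many dimensions the doc2vec vector has.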

You could use this measure with a hierarchical clustering method or a graph clustering method (with edges based on similarity value).
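For the hierarchical option, one possible sketch, assuming Python with NumPy and SciPy, is to compute all pairwise distances and pass them to SciPy's agglomerative clustering. The helper names, the toy data, and the choice of average linkage are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def combined_distance(v1, n1, v2, n2):
    """1 - (0.5 * cosine similarity + 0.5 * relative numeric similarity)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    largest = max(n1, n2)
    # Assumption: non-negative numerical values; two zeros count as identical.
    numeric = 1.0 if largest == 0 else 1.0 - abs(n1 - n2) / largest
    return 1.0 - (0.5 * cosine + 0.5 * numeric)


def cluster_instances(vectors, numbers, n_clusters):
    """Agglomerative clustering on the custom pairwise distances."""
    n = len(vectors)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = combined_distance(vectors[i], numbers[i], vectors[j], numbers[j])
            dist[i, j] = dist[j, i] = d

    # linkage expects the condensed (upper-triangular) form of the matrix.
    condensed = squareform(dist)
    tree = linkage(condensed, method="average")
    # Cut the tree so that at most n_clusters flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data (made up): two groups that differ in both kinds of feature.
vectors = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [5, 4, 3, 2, 1], [6, 4, 3, 2, 1]]
numbers = [2.0, 2.5, 40.0, 38.0]
labels = cluster_instances(vectors, numbers, n_clusters=2)
print(labels)
```

Average linkage is used here because it works with an arbitrary distance matrix, whereas Ward linkage assumes Euclidean distances. The pairwise loop is quadratic in the number of instances, so this sketch only suits small to medium datasets.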
