Should one-hot encoded categorical features be scaled when used alongside text features to derive semantic similarity?

My aim is to derive semantic similarity using multiple features. Some of the features are textual, for which I am using the Universal Sentence Encoder (TF Hub 2.0). The other features are categorical and are encoded with a one-hot encoder.

For example, for a single record in my dataset, the feature vector looks like this:

  1. text feature's embedding: a 512-dimensional vector (1 x 512)
  2. categorical (non-ordinal) feature vector: 1 x 500 (since the feature has 500 unique values)
  3. my final feature vector: 1 x 1012 (the concatenation of the two)

After this, I build a similarity matrix using cosine similarity to decide whether two such records are semantically the same or not.

The problem is that the ranges differ: the text-feature values are real numbers, while the one-hot encoded values are 0 or 1. So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?
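The setup described above can be sketched in NumPy. The random vectors below are hypothetical stand-ins for real Universal Sentence Encoder outputs, and the category index 42 is arbitrary; the point is just the shapes and the concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(index, size):
    """Binary vector with a single 1 at the given category index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Stand-ins for 512-d USE text embeddings (random here, real in practice).
emb_a = rng.normal(size=512)
emb_b = rng.normal(size=512)

# Two records that happen to share the same category (index 42 of 500).
rec_a = np.concatenate([emb_a, one_hot(42, 500)])  # shape (1012,)
rec_b = np.concatenate([emb_b, one_hot(42, 500)])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(rec_a, rec_b))
```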

Topic categorical-encoding semantic-similarity feature-scaling

Category Data Science


No, do not scale the one-hot encoded vector with min-max scaling; that would discard the meaning of the encoding. One-hot encoding means a data point is either entirely on a dimension or not at all. A value that is only fractionally on a dimension has no meaning.
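The all-or-nothing semantics can be seen directly: cosine similarity between one-hot vectors is already 1 for the same category and 0 for different categories, so the 0/1 encoding carries its meaning without any rescaling. A minimal check:

```python
import numpy as np

def one_hot(index, size):
    """Binary vector with a single 1 at the given category index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = one_hot(3, 500)
b = one_hot(3, 500)  # same category as a
c = one_hot(7, 500)  # different category

print(cosine(a, b))  # 1.0: identical categories
print(cosine(a, c))  # 0.0: disjoint categories
```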

A better option for deriving similarity from multiple features is to embed all of them, including the categorical features, in the same embedding space. StarSpace is one such embedding method.
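StarSpace itself is a standalone tool, but the "same embedding space" idea can be sketched with a dense lookup table: each category gets its own 512-dimensional vector, matching the text embedding's dimensionality. The table here is randomly initialized purely for illustration (in StarSpace these vectors are learned so that similar records land close together), and `embed_record` and the averaging rule are hypothetical names and choices, not part of any library:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512  # match the text-embedding dimensionality

# Hypothetical dense embedding table: one 512-d vector per category.
# Random here; a method like StarSpace would learn these vectors.
category_table = rng.normal(size=(500, DIM))

def embed_record(text_embedding, category_index):
    # Both features now live in the same 512-d space,
    # so they can be combined (here, by simple averaging).
    return (text_embedding + category_table[category_index]) / 2.0

text_emb = rng.normal(size=DIM)  # stand-in for a USE embedding
record_vec = embed_record(text_emb, 42)
print(record_vec.shape)  # (512,)
```

Because every feature maps into one shared space, cosine similarity between records no longer mixes incompatible value ranges.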
