Should one-hot encoded categorical features be scaled when used alongside text features to derive semantic similarity?

My aim is to derive semantic similarity using multiple features. Some of the features are textual, for which I am using the Universal Sentence Encoder (TF Hub 2.0). The other features are categorical and are encoded with a one-hot encoder.

For example, for a single record in my dataset, the feature vector looks like this:

  1. text feature's embedding: a 512-dimensional vector (1 x 512)
  2. categorical (non-ordinal) feature vector: 1 x 500 (since the feature has 500 unique values)
  3. my final feature vector: 1 x 1012 (the concatenation of the two)

After this, I build a similarity matrix using cosine similarity to decide whether two such records are semantically the same or not.

The problem is that the ranges differ: the text-feature values are real numbers, while the one-hot encoded values are 0 or 1. So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?
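The setup described above can be sketched in NumPy. The random vectors below are hypothetical stand-ins for real Universal Sentence Encoder outputs, and the category index 42 is arbitrary; the point is just the shapes and the concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(index, size):
    """Binary vector with a single 1 at the given category index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Stand-ins for 512-d USE text embeddings (random here, real in practice).
emb_a = rng.normal(size=512)
emb_b = rng.normal(size=512)

# Two records that happen to share the same category (index 42 of 500).
rec_a = np.concatenate([emb_a, one_hot(42, 500)])  # shape (1012,)
rec_b = np.concatenate([emb_b, one_hot(42, 500)])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(rec_a, rec_b))
```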

Topic categorical-encoding semantic-similarity feature-scaling

Category Data Science


No, do not scale the one-hot encoded vector with min-max scaling; that would discard the meaning of the encoding. One-hot encoding means a data point is either entirely on a dimension or not at all. A value that is only fractionally on a dimension has no meaning.
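The all-or-nothing semantics can be seen directly: cosine similarity between one-hot vectors is already 1 for the same category and 0 for different categories, so the 0/1 encoding carries its meaning without any rescaling. A minimal check:

```python
import numpy as np

def one_hot(index, size):
    """Binary vector with a single 1 at the given category index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = one_hot(3, 500)
b = one_hot(3, 500)  # same category as a
c = one_hot(7, 500)  # different category

print(cosine(a, b))  # 1.0: identical categories
print(cosine(a, c))  # 0.0: disjoint categories
```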

A better option for deriving similarity from multiple features is to embed all of them, including the categorical features, in the same embedding space. StarSpace is one such embedding method.
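StarSpace itself is a standalone tool, but the "same embedding space" idea can be sketched with a dense lookup table: each category gets its own 512-dimensional vector, matching the text embedding's dimensionality. The table here is randomly initialized purely for illustration (in StarSpace these vectors are learned so that similar records land close together), and `embed_record` and the averaging rule are hypothetical names and choices, not part of any library:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512  # match the text-embedding dimensionality

# Hypothetical dense embedding table: one 512-d vector per category.
# Random here; a method like StarSpace would learn these vectors.
category_table = rng.normal(size=(500, DIM))

def embed_record(text_embedding, category_index):
    # Both features now live in the same 512-d space,
    # so they can be combined (here, by simple averaging).
    return (text_embedding + category_table[category_index]) / 2.0

text_emb = rng.normal(size=DIM)  # stand-in for a USE embedding
record_vec = embed_record(text_emb, 42)
print(record_vec.shape)  # (512,)
```

Because every feature maps into one shared space, cosine similarity between records no longer mixes incompatible value ranges.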
