Should one-hot encoded categorical features be scaled when combined with text features to derive semantic similarity?
My aim is to derive similarity between records using multiple features. Some of the features are textual, and for these I am using the Universal Sentence Encoder (Tfhub 2.0). The other features are categorical and are encoded with a one-hot encoder.
For example, for a single record in my dataset, the feature vector looks like this:
- text feature embedding: a 512-dimensional vector (1 x 512)
- categorical (unordered) feature vector: 1 x 500 (since the feature has 500 unique values)
- final feature vector, the concatenation of the two: 1 x 1012
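To make the layout concrete, here is a minimal sketch of how such a row could be assembled. It uses only the Python standard library; `fake_text_embedding`, `one_hot`, and `build_feature_vector` are hypothetical helper names, and the random values merely stand in for a real Universal Sentence Encoder output.

```python
import random

TEXT_DIM = 512        # size of the sentence embedding, as in the question
NUM_CATEGORIES = 500  # number of unique categorical values, as in the question


def fake_text_embedding(seed):
    """Stand-in for an encoder output: random values, NOT a real embedding."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(TEXT_DIM)]


def one_hot(category_index, num_categories=NUM_CATEGORIES):
    """Return a vector of zeros with a single 1.0 at category_index."""
    if not 0 <= category_index < num_categories:
        raise ValueError("category_index out of range")
    vec = [0.0] * num_categories
    vec[category_index] = 1.0
    return vec


def build_feature_vector(text_embedding, category_index):
    """Concatenate the text embedding and the one-hot vector into one row."""
    return list(text_embedding) + one_hot(category_index)


# Hypothetical record: some text plus category number 42.
record = build_feature_vector(fake_text_embedding(seed=0), category_index=42)
print(len(record))  # 512 + 500 = 1012
```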
After this, I compute a pairwise cosine-similarity matrix over these vectors to decide whether two records are semantically the same.
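The similarity step could look roughly like the sketch below. It is an illustration under assumptions, not my actual pipeline: the rows are toy vectors with 3 "text" dimensions and 2 one-hot dimensions, and `THRESHOLD` is a hypothetical cut-off for calling two records the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def similarity_matrix(rows):
    """Pairwise cosine similarity between all rows."""
    return [[cosine_similarity(a, b) for b in rows] for a in rows]


# Toy rows (made-up values): 3 "text" dims followed by 2 one-hot dims.
rows = [
    [0.1, 0.2, -0.1, 1.0, 0.0],   # text A, category 0
    [0.1, 0.2, -0.1, 0.0, 1.0],   # text A, category 1
    [0.3, -0.2, 0.0, 1.0, 0.0],   # text B, category 0
]

THRESHOLD = 0.8  # hypothetical decision threshold
sim = similarity_matrix(rows)
is_same = [[score >= THRESHOLD for score in row] for row in sim]

for row in sim:
    print([round(score, 3) for score in row])
```

In this toy data the first two rows share identical text values but differ in category, while the first and third rows share only the category, and the second pair gets the higher score.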
The problem is that the value ranges differ: the text embedding contains real numbers, while the one-hot encoded feature contains only 0 or 1. So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?
Topic categorical-encoding semantic-similarity feature-scaling
Category Data Science