SVR - RMSE is much worse after normalizing the data

I'm building a model using a custom-kernel SVR that looks at a few of my dataframe's features and computes the proximity/distance between each pair of datapoints. The features are weighted, and the weights were calculated using cross-validation.

Initially my dataframe was not normalized, and the model's results were not very good (RMSE higher than 25% of the target range). Because I had read that SVR is sensitive to feature scale, I decided to try normalizing the data, which resulted in much worse predictions.

The original results in terms of Root Mean Squared Error were as follows:

RMSE (test_set):  6.59
RMSE (training_set):  6.56
RMSE (validation_set):  5.90

The new results, with the normalized data, are the following:

RMSE (test_set):  2404.68
RMSE (training_set):  148.06
RMSE (validation_set):  2546.44

These values are very much outside the normal range of my outcome variable. This makes me suspect I'm doing something wrong.

Notes:

  1. I recalculated the kernel's weights after normalizing.
  2. I normalized the data after splitting it into train/test/validation.
  3. I didn't normalize the y (outcome variable) vector.

To normalize the data, I used the following code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

def normalize(df):
    # Fit the scaler to this dataframe and return a scaled copy
    x = df.values
    df_scaled = sc.fit_transform(x)
    return pd.DataFrame(df_scaled, columns=df.columns)

X_train = normalize(X_train)
X_test = normalize(X_test)
X_validation = normalize(X_validation)

Then I use these 3 variables to train and test the model.

from sklearn.svm import SVR

regressor = SVR(kernel=my_kernel)
regressor.fit(X_train, y_train)

Any tips on what I'm doing wrong? Thanks.



You normalized after splitting into train/test/validation, but each split is being scaled independently. You need to fit the scaler on the training set only (X_train_normalized = scaler.fit_transform(X_train)) and then reuse those same statistics to normalize the validation and test sets: X_valid_normalized = scaler.transform(X_valid) and X_test_normalized = scaler.transform(X_test).

If you don't do it this way, each split is normalized against different statistics (its own mean and variance). With z-score normalization, which replaces each feature value with its number of standard deviations from the mean, a value of 1.0 that the model sees in the training data then corresponds to a different original value than a 1.0 in the validation or test set.
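
As a quick illustration with made-up numbers, fitting a separate scaler on each split maps the same raw value to very different z-scores:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two splits with different means
train = np.array([[10.0], [20.0], [30.0]])  # mean 20
test = np.array([[30.0], [40.0], [50.0]])   # mean 40

# The same raw value, 30.0, gets opposite z-scores
print(StandardScaler().fit(train).transform([[30.0]]))  # ~ +1.22
print(StandardScaler().fit(test).transform([[30.0]]))   # ~ -1.22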

So the right way is to calculate the statistics for the training set (the fit part of fit_transform) and assume the validation and test sets are from the same distribution (you're only supposed to know about the training set at training time). Then use those same statistics on the validation and test sets to make sure you're comparing apples to apples.
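
A minimal sketch of that workflow, assuming (as in your normalize function) that the splits are pandas DataFrames:

import pandas as pd
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Learn mean/std from the training set only...
X_train = pd.DataFrame(sc.fit_transform(X_train),
                       columns=X_train.columns, index=X_train.index)

# ...then apply those same statistics to the other splits
X_test = pd.DataFrame(sc.transform(X_test),
                      columns=X_test.columns, index=X_test.index)
X_validation = pd.DataFrame(sc.transform(X_validation),
                            columns=X_validation.columns, index=X_validation.index)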
