SVR - RMSE is much worse after normalizing the data
I'm building a model using a custom-kernel SVR that looks at a few of my dataframe's features and computes the proximity/distance between each pair of datapoints. The features are weighted, and the weights were calculated using cross-validation.
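Roughly sketched, the kernel looks like this (simplified; the weights below are placeholders for the cross-validated ones):

import numpy as np

# one weight per feature; placeholder values standing in for the cross-validated ones
weights = np.array([0.5, 1.0, 2.0])

def my_kernel(X, Y):
    # weighted squared distance between every pair of rows of X and Y
    diff = X[:, None, :] - Y[None, :, :]
    d2 = (weights * diff ** 2).sum(axis=-1)
    # turn distances into similarities (RBF-style)
    return np.exp(-d2)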
Initially my dataframe was not normalized, and the model's results were not very good (RMSE higher than 25% of the target range). Because I had read that SVR is sensitive to feature scale, I decided to normalize the data, which instead resulted in much worse predictions.
The original results in terms of Root Mean Squared Error were as follows:
RMSE (test_set): 6.59
RMSE (training_set): 6.56
RMSE (validation_set): 5.90
The new results, with the normalized data, are the following:
RMSE (test_set): 2404.68
RMSE (training_set): 148.06
RMSE (validation_set): 2546.44
These values are far outside the range of my outcome variable, which makes me suspect I'm doing something wrong.
Notes:
- I recalculated the kernel's weights after normalizing.
- I normalized the data after splitting it into train/test/validation (split sketched below).
- I didn't normalize the y (outcome variable) vector.
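For reference, the split was done along these lines (the proportions and random_state are placeholders, not my exact values):

from sklearn.model_selection import train_test_split

# hold out the test set first, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42)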
To normalize the data, I used the following code:
import pandas as pd
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

def normalize(df):
    # fit the scaler to this dataframe and transform it in one step
    x = df.values
    df_scaled = sc.fit_transform(x)
    return pd.DataFrame(df_scaled, columns=df.columns)

# each split is passed through the same (re-fitted) scaler instance
X_train = normalize(X_train)
X_test = normalize(X_test)
X_validation = normalize(X_validation)
Then I use these 3 variables to train and test the model.
regressor = SVR(kernel=my_kernel)
regressor.fit(X_train, y_train)
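For completeness, the RMSE values reported above are computed roughly like this (sketch):

import numpy as np
from sklearn.metrics import mean_squared_error

# predict on each split and take the square root of the mean squared error
for name, X_, y_ in [("training_set", X_train, y_train),
                     ("validation_set", X_validation, y_validation),
                     ("test_set", X_test, y_test)]:
    rmse = np.sqrt(mean_squared_error(y_, regressor.predict(X_)))
    print(f"RMSE ({name}): {rmse:.2f}")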
Any tips on what I'm doing wrong? Thanks.
Topic rmse svr feature-scaling scikit-learn
Category Data Science