Does it make sense to scale input data for a random forest regressor that takes two different arrays as input?
I am exploring random forest regressors in sklearn by trying to predict the returns of a stock based on the past hour of data.
I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes.
Here is an example of input data:
Return Volume
0 0.000420 119.447233
1 -0.001093 86.455629
2 0.000277 117.940777
3 0.000256 38.084008
4 0.001275 74.376315
...
45 0.001764 90.880667
46 -0.003638 77.364971
47 0.001449 53.892422
48 -0.000990 20.278449
49 -0.000159 44.389470
I reshaped my data into a 2-D array, flattening each training sample, so that sklearn can train the model.
x_data = np.stack(x_data, axis=0)
nsamples, nx, ny = x_data.shape
x_data = x_data.reshape((nsamples, nx * ny))
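For reference, a minimal sketch of what this reshape does, using made-up toy numbers (3 samples of 2 timesteps, each timestep a [return, volume] pair):

```python
import numpy as np

# Hypothetical toy data: 3 samples, each a (2, 2) array of [return, volume] rows
x_data = [np.array([[0.1, 100.0], [0.2, 200.0]]) for _ in range(3)]

x_data = np.stack(x_data, axis=0)             # shape (3, 2, 2)
nsamples, nx, ny = x_data.shape
x_data = x_data.reshape((nsamples, nx * ny))  # shape (3, 4)

print(x_data.shape)  # (3, 4)
print(x_data[0])     # returns and volumes interleaved: [0.1 100. 0.2 200.]
```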
Now each sample alternates between returns and volume: [return_t0, volume_t0, return_t1, volume_t1, ..., return_t49, volume_t49]
[ 4.20084086e-04 1.19447233e+02 -1.09329647e-03 8.64556285e+01
2.76843107e-04 1.17940777e+02 2.55559803e-04 3.80840075e+01
1.27459967e-03 7.43763155e+01]
But the returns are very small numbers while the volume varies between roughly 20 and 1000. So does it make sense to scale the data while returns and volumes are interleaved in the same array? If not, how do I do that?
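For what it's worth, StandardScaler standardises each column independently, so in the flattened layout every return column and every volume column gets its own mean and standard deviation regardless of the interleaving. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two interleaved columns with very different magnitudes.
# Column 0 plays the role of "return", column 1 the role of "volume".
X = np.array([[ 0.001, 120.0],
              [-0.002,  80.0],
              [ 0.003, 100.0]])

scaled = StandardScaler().fit_transform(X)

# Each column is standardised independently to mean 0 and std 1,
# so the magnitude gap between the two features disappears.
print(scaled.mean(axis=0))  # ~[0, 0]
print(scaled.std(axis=0))   # ~[1, 1]
```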
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.25, random_state=0)
print('Scaling data')
scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)
scale = StandardScaler()
y_train = scale.fit_transform(y_train)
y_test = scale.transform(y_test)
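As an aside, when the target is scaled this way the metrics below are reported in scaled units; a sketch (with made-up numbers) of mapping scaled predictions back to price units via `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 2-D target (StandardScaler expects 2-D input)
y_train = np.array([[10.0], [12.0], [14.0]])

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)

# A model trained on scaled targets predicts in scaled units;
# inverse_transform maps such predictions back to price units.
fake_scaled_predictions = y_train_scaled  # stand-in for model output
recovered = y_scaler.inverse_transform(fake_scaled_predictions)
print(recovered)  # recovers the original y_train values
```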
I tried training the model with the scaled data and I get a negative score, so something must be wrong here.
I included the rest of the code and the scores on test data:
print('Training model')
model = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=42, bootstrap=True, verbose=True, max_features='sqrt', n_jobs=N_PROCESSORS)
model.fit(x_train, y_train)
# Predict from trained model
print('Predicting test data')
predict = model.predict(x_test)
print(predict)
print(predict.shape)
# Evaluate accuracy
print('Mean Absolute Error:', round(metrics.mean_absolute_error(y_test, predict), 4))
print('Mean Squared Error:', round(metrics.mean_squared_error(y_test, predict), 4))
print('Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, predict)), 4))
print('(R^2) Score:', round(metrics.r2_score(y_test, predict), 4))
print(f'Train Score : {model.score(x_train, y_train) * 100:.2f}% and Test Score : {model.score(x_test, y_test) * 100:.2f}% using Random Forest Regressor.')
Mean Absolute Error: 1.5153
Mean Squared Error: 5.7477
Root Mean Squared Error: 2.3974
(R^2) Score: -3672400.631
Topic feature-scaling random-forest scikit-learn
Category Data Science