Does it make sense to scale input data for a random forest regressor that takes two different arrays as input?
I am exploring random forest regressors in sklearn by trying to predict the returns of a stock based on the past hour of data.
I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes.
Here is an example of input data:
Return Volume
0 0.000420 119.447233
1 -0.001093 86.455629
2 0.000277 117.940777
3 0.000256 38.084008
4 0.001275 74.376315
...
45 0.001764 90.880667
46 -0.003638 77.364971
47 0.001449 53.892422
48 -0.000990 20.278449
49 -0.000159 44.389470
I reshaped my data into a 2-D array, flattening each training sample, so that sklearn can train the model.
x_data = np.stack(x_data, axis=0)
nsamples, nx, ny = x_data.shape
x_data = x_data.reshape((nsamples, nx * ny))
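For reference, a minimal sketch of what this reshape does, using made-up toy numbers (3 samples of 2 timesteps, each timestep a [return, volume] pair):

```python
import numpy as np

# Hypothetical toy data: 3 samples, each a (2, 2) array of [return, volume] rows
x_data = [np.array([[0.1, 100.0], [0.2, 200.0]]) for _ in range(3)]

x_data = np.stack(x_data, axis=0)             # shape (3, 2, 2)
nsamples, nx, ny = x_data.shape
x_data = x_data.reshape((nsamples, nx * ny))  # shape (3, 4)

print(x_data.shape)  # (3, 4)
print(x_data[0])     # returns and volumes interleaved: [0.1 100. 0.2 200.]
```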
Now each sample alternates between returns and volume: [return_t0, volume_t0, return_t1, volume_t1, ..., return_t49, volume_t49]
[ 4.20084086e-04 1.19447233e+02 -1.09329647e-03 8.64556285e+01
2.76843107e-04 1.17940777e+02 2.55559803e-04 3.80840075e+01
1.27459967e-03 7.43763155e+01]
But the returns are very small numbers while the volume varies between roughly 20 and 1000. So does it make sense to scale the data while returns and volumes are interleaved in the same array? If not, how do I do that?
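For what it's worth, StandardScaler standardises each column independently, so in the flattened layout every return column and every volume column gets its own mean and standard deviation regardless of the interleaving. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two interleaved columns with very different magnitudes.
# Column 0 plays the role of "return", column 1 the role of "volume".
X = np.array([[ 0.001, 120.0],
              [-0.002,  80.0],
              [ 0.003, 100.0]])

scaled = StandardScaler().fit_transform(X)

# Each column is standardised independently to mean 0 and std 1,
# so the magnitude gap between the two features disappears.
print(scaled.mean(axis=0))  # ~[0, 0]
print(scaled.std(axis=0))   # ~[1, 1]
```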
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.25, random_state=0)
print('Scaling data')
scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)
scale = StandardScaler()
y_train = scale.fit_transform(y_train)
y_test = scale.transform(y_test)
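As an aside, when the target is scaled this way the metrics below are reported in scaled units; a sketch (with made-up numbers) of mapping scaled predictions back to price units via `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 2-D target (StandardScaler expects 2-D input)
y_train = np.array([[10.0], [12.0], [14.0]])

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)

# A model trained on scaled targets predicts in scaled units;
# inverse_transform maps such predictions back to price units.
fake_scaled_predictions = y_train_scaled  # stand-in for model output
recovered = y_scaler.inverse_transform(fake_scaled_predictions)
print(recovered)  # recovers the original y_train values
```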
I tried training the model with the scaled data and I get a negative score, so something must be wrong here.
I included the rest of the code and the scores on test data:
print('Training model')
model = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=42, bootstrap=True, verbose=True, max_features='sqrt', n_jobs=N_PROCESSORS)
model.fit(x_train, y_train)
# Predict from trained model
print('Predicting test data')
predict = model.predict(x_test)
print(predict)
print(predict.shape)
# Evaluate accuracy
print('Mean Absolute Error:', round(metrics.mean_absolute_error(y_test, predict), 4))
print('Mean Squared Error:', round(metrics.mean_squared_error(y_test, predict), 4))
print('Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, predict)), 4))
print('(R^2) Score:', round(metrics.r2_score(y_test, predict), 4))
print(f'Train Score : {model.score(x_train, y_train) * 100:.2f}% and Test Score : {model.score(x_test, y_test) * 100:.2f}% using Random Forest Regressor.')
Mean Absolute Error: 1.5153
Mean Squared Error: 5.7477
Root Mean Squared Error: 2.3974
(R^2) Score: -3672400.631
Topic feature-scaling random-forest scikit-learn
Category Data Science