Data Transformation for Machine Learning Regression Task

I am performing an ML regression task using an XGBoost regressor. I am working with financial time series data, namely the close price of the EUR/USD exchange rate, which I will transform into geometric (log) returns to use as my predictor variable. I am also using a technical analysis library that takes the open, high, low and close prices to create additional features, e.g. Bollinger Bands, ATR, moving averages, etc.
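
For context, here is a minimal sketch of that setup (assuming TA-Lib's Python wrapper talib and a DataFrame df with Open/High/Low/Close columns; the particular indicators are just illustrative):

import numpy as np
import talib as ta

# Geometric (log) returns of the close, i.e. log(C_t) - log(C_{t-1})
log_ret = np.log(df['Close']).diff()

# Example TA-Lib features built from the OHLC prices
df['ATR_14'] = ta.ATR(df['High'], df['Low'], df['Close'], timeperiod=14)
df['SMA_20'] = ta.SMA(df['Close'], timeperiod=20)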

When I look at the distribution of, say, the Bollinger Bands, it appears non-normal. My question is: what is the best feature transformation to use in my ML model? I am aware of options such as .pct_change(), .diff(), mean-centering with a rolling average, and applying np.log or .apply(np.log1p); what is the best procedure? Is it okay to pass the Bollinger Band series into the model without transforming its distribution, or should I apply a transformation first? The same question applies to all of the other features in my dataframe. Here is my code:

import numpy as np
import talib as ta

# Bollinger Bands (upper, middle, lower); the middle band is discarded
df['UP_BB'], _, df['LOW_BB'] = ta.BBANDS(df.Close, timeperiod=10, nbdevup=3, nbdevdn=3, matype=0)

x1 = df['UP_BB']

x2 = x1.pct_change()                    # percentage change
x3 = x2.apply(np.log1p)                 # percentage change with log(1 + x)

x4 = x1 - x1.rolling(window=20).mean()  # mean-centering with a rolling average
x5 = x4.apply(np.log1p)                 # mean-centering with log(1 + x); NaN where x4 <= -1

x6 = x1.diff()                          # first difference
x7 = np.log(x6)                         # log of the differenced series; NaN where x6 <= 0

Here are the visualizations of the distributions:
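
As a quick numerical companion to the plots, something like the sketch below (assuming scipy is available; the candidate dictionary simply mirrors the series above) summarizes the skew and excess kurtosis of each transformed series:

from scipy.stats import skew, kurtosis

# Drop the NaNs each transformation produces before measuring the shape
candidates = {'raw': x1, 'pct_change': x2, 'pct_change_log1p': x3,
              'mean_centered': x4, 'mean_centered_log1p': x5,
              'diff': x6, 'log_diff': x7}
for name, series in candidates.items():
    s = series.dropna()
    print(f'{name:<20} skew={skew(s):+.2f}  excess_kurtosis={kurtosis(s):+.2f}')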

Tags: distribution, feature-engineering, finance, time-series, machine-learning
