Improve a regression model and feature selection
I am working in Azure ML Studio and trying to build a regression model to predict a numerical value. Below I describe my features and what I have done so far.
My data has about 3 million rows.
Features:
- 8 integer features from 1 to 25
- 2 boolean features with 0 and 1
- 3 integer features from 1 to 10
- 2 integer features, one from 0 to 500.000 and one from 0 to 1.000.000, with about 4.500 unique values
- 1 integer feature from 20 to 50
- 1 integer feature from 1 to 15
- 1 integer feature from 0 to 100
Label:
- Integer from 10.000 to 100.000.000 with about 5.000 unique values
What I have done:
- Split the dataset into 80% (train) and 20% (test), then split the training set again into 60% (actual train) and 40% (validation).
- Normalized the features with many unique values (the 4th bullet in the list above).
- Trained a Boosted Decision Tree Regression model.
- Used the Sweep Parameters module to find the best parameter combination.
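For reference, the nested split above works out to 48% train, 32% validation and 20% test of the full dataset. A minimal plain-Python sketch of that split and of a min-max normalization fitted on the training rows only; `nested_split`, `fit_min_max`, the seed and the choice of min-max scaling are illustrative assumptions, not Azure ML Studio modules or the settings actually used:

```python
import random

def nested_split(n_rows, test_frac=0.20, val_frac=0.40, seed=0):
    """Return (train, validation, test) row-index lists.

    First holds out test_frac of all rows, then holds out val_frac
    of the remaining rows as validation (hypothetical helper).
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    test = indices[:n_test]
    remaining = indices[n_test:]
    n_val = round(len(remaining) * val_frac)
    validation = remaining[:n_val]
    train = remaining[n_val:]
    return train, validation, test

def fit_min_max(train_values):
    """Build a scaler from training values only (assumed min-max scaling)."""
    lo, hi = min(train_values), max(train_values)
    span = (hi - lo) or 1  # avoid division by zero for a constant column
    return lambda x: (x - lo) / span

train, validation, test = nested_split(1000)
print(len(train), len(validation), len(test))  # 480 320 200

scale = fit_min_max([0, 250_000, 500_000])
print(scale(250_000))  # 0.5
```

Fitting the scaler on the training rows and reusing it for validation and test keeps information from the held-out rows out of the preprocessing step.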
I also tried Neural Networks and Bayesian Linear Regression, but Boosted Decision Tree Regression (BDTR) gave the best score.
I also tried excluding columns: I started with only a few (the ones I think affect the label most) and then added more columns one by one.
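Adding columns one by one is close to greedy forward selection. A minimal sketch of that idea, assuming a hypothetical `score_fn(columns)` that trains a model on the given columns and returns the validation MSE (lower is better); the function names and the toy scoring function are made up for illustration and are not part of Azure ML Studio:

```python
def forward_select(candidates, score_fn, start=()):
    """Greedily add the column that lowers the score most; stop when none helps."""
    selected = list(start)
    best = score_fn(selected) if selected else float("inf")
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        # Score every candidate column when added to the current selection.
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        score, column = min(scored)
        if score >= best:
            break  # no remaining column improves the validation score
        selected.append(column)
        remaining.remove(column)
        best = score
    return selected, best

# Toy stand-in for "train and evaluate": columns a and b help, c does not,
# and every extra column costs a small penalty.
GAIN = {"a": 50, "b": 30, "c": 0}

def toy_mse(columns):
    return 100 - sum(GAIN[c] for c in columns) + len(columns)

print(forward_select(["a", "b", "c"], toy_mse))  # (['a', 'b'], 22)
```

With about 3 million rows each call to `score_fn` means a full training run, so in practice the number of candidate columns tried per round may need to be limited.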
However, the lowest MSE I could achieve was 1.500.000, and many of the scored (predicted) values were negative.
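MSE is expressed in squared label units, so its square root (RMSE) is easier to compare against the label range. The arithmetic below assumes the reported figure is a plain mean squared error on the unscaled label, which the post does not state; if the label was transformed or scaled first, this reading does not apply.

```python
import math

mse = 1_500_000          # reported lowest MSE
label_min = 10_000       # smallest label value from the description
label_max = 100_000_000  # largest label value from the description

rmse = math.sqrt(mse)
print(round(rmse, 2))                  # 1224.74
print(round(rmse / label_min, 4))      # 0.1225, about 12% of the smallest label
print(rmse / label_max)                # tiny relative to the largest label
```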
So I am wondering what other techniques I could use to improve the model.