Feature set choice in Google's Vertex AI/AutoML
This is a subjective question about using Vertex AI/AutoML in practice. I posted it on Stack Overflow and it was closed; I hope it is within scope here.
I'm using Google's Vertex AI/AutoML Tabular dataset models to learn a regression problem on structured data with human-engineered features. It's a score/ranking problem, and the training target values are either 0 or 1.
Our constructed features are often correlated; sometimes they are the same data point normalized along different dimensions, e.g. a person's number of car accidents divided by the average number of accidents across all people, or the same number divided by the average for people with the same car. I'm weighing whether it's better to supply AutoML with all of these features and let it figure things out, or whether I'd have a better chance picking one or the other. Or perhaps, in this specific case, it would make sense not to normalize at all and instead supply the overall and same-car averages as separate features.
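For concreteness, here is a minimal sketch of the two normalized variants and a redundancy check, assuming a pandas DataFrame with invented column names (`accidents`, `car_model`) standing in for our real data. A high pairwise correlation would suggest the two variants carry mostly the same signal, which is the crux of the question:

```python
import pandas as pd

# Toy stand-in for our real table (column names are hypothetical).
df = pd.DataFrame({
    "accidents": [0, 2, 1, 3, 0, 1],
    "car_model": ["A", "A", "B", "B", "C", "C"],
})

# Variant 1: normalize by the population-wide average.
df["accidents_vs_overall"] = df["accidents"] / df["accidents"].mean()

# Variant 2: normalize by the average for people with the same car.
df["accidents_vs_same_car"] = (
    df["accidents"] / df.groupby("car_model")["accidents"].transform("mean")
)

# If these two columns are highly correlated, they are largely redundant,
# and keeping one of them (or the raw count plus the group averages as
# separate features) may be the simpler feature set to hand to AutoML.
print(df[["accidents_vs_overall", "accidents_vs_same_car"]].corr())
```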
I'm asking this question from a practical point of view, and I'm not looking for a definitive, objective answer. AFAIK, in theory, the more features you add, the better results you can get if your network is complex enough. But you'd also need more training samples, which we don't have many of.
Topic google-cloud-platform automl feature-selection
Category Data Science