How to approach the competition with anonymous scaled numerical predictors?

The competition has been around for a while now, and there seem to be only a few posts or other discussions about it on the web.
The system has changed from time to time and the set-up today is the following:
- Train (N=96K) and test (N=33K) data with 21 continuous features, each with values in [0,1], and a binary target.
- The data is clean (no missing values) and updated every 2 weeks. You can upload your predictions (on the test set) and see the log-loss. Part of the test data is even live data and you get paid for good predictions.
What I would like to discuss:
As the features are totally anonymous I think that there is not much feature engineering we can do. So my approach is very mechanical:
1. Inspired by this, I use a classification algorithm to select the training samples that most resemble my test data.
2. Figure out some nice preprocessing.
3. Train nice classification algorithms.
4. Build ensembles of them (stacking, ...).
The concrete question:
Concerning step 1: Do you have experience with such an approach? Let's say I rank the training samples by their predicted probability of belonging to the test set (usually below 0.5) and then keep the K samples with the largest probabilities. How would you choose K? I tried K = 15K, but mainly to get a small training set that speeds up training in step 3.
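One way to compare candidate values of K empirically, sketched here under my own assumptions and not as an established recipe: hold out the `n_valid` most test-like training rows as a validation set, then fit on the next K most test-like rows for each K in a grid and compare the validation log-loss. The name `score_k_grid` and the use of logistic regression as the stand-in model are hypothetical; `p_test` is the per-row probability of belonging to test described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def score_k_grid(X, y, p_test, k_grid, n_valid):
    """Return {K: validation log-loss} for each K in k_grid.

    The n_valid rows with the highest p_test serve as validation data;
    the model is fit on the next K rows in the same ranking.
    """
    order = np.argsort(p_test)[::-1]           # most test-like rows first
    valid_idx, pool_idx = order[:n_valid], order[n_valid:]
    scores = {}
    for k in k_grid:
        fit_idx = pool_idx[:k]
        model = LogisticRegression(max_iter=1000)  # assumption: placeholder for the step 3 model
        model.fit(X[fit_idx], y[fit_idx])
        pred = model.predict_proba(X[valid_idx])[:, 1]
        scores[k] = log_loss(y[valid_idx], pred, labels=[0, 1])
    return scores
```

If the validation log-loss stops improving (or gets worse) beyond some K, that would be a data-driven argument for that K rather than one chosen purely for training speed.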
Concerning step 2: The data is already on a [0,1] scale. If I apply any linear transformation (PCA-like), I would break this scale. What would you try in preprocessing if you had such numerical data and no idea what it actually represents?
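To make the scale issue concrete, here is a small sketch on synthetic uniform data (not the competition data). It shows that PCA output leaves [0,1] and that one possible workaround, chosen here only for illustration, is to chain a `MinMaxScaler` after the PCA so the transformed features are mapped back into [0,1].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for 21 anonymous features in [0, 1].
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(1000, 21))
X_test = rng.uniform(size=(300, 21))

# PCA centers the data first, so the components take negative values.
Z = PCA(n_components=21).fit_transform(X_train)

# Rescale the components back to [0, 1]; clip=True keeps unseen rows in range too.
rescaled_pca = make_pipeline(PCA(n_components=21), MinMaxScaler(clip=True))
Z_train = rescaled_pca.fit_transform(X_train)
Z_test = rescaled_pca.transform(X_test)
```

Whether restoring the range matters depends on the downstream model: tree-based models are insensitive to monotone rescaling of each feature, while distance-based or regularized linear models are not.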
PS: I am aware that, because the competition pays people, discussing this could help me make some money. But as this is public, it would help anybody out there...
PPS: Today's leaderboard has an interesting pattern: the top two have a log-loss of 0.64xx, number 3 has 0.66xx, and most of the other predictors reach 0.6888x.
Thus there seems to be a very small top field and a lot of moderately successful guys (including me).
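For reference, these scores can be compared with the log-loss of an uninformative prediction. Assuming the leaderboard uses the usual natural-log binary log-loss, predicting 0.5 for every row gives ln(2), whatever the class balance. The helper `binary_log_loss` below is a hypothetical plain-Python version written only for this check.

```python
import math


def binary_log_loss(y_true, p_pred):
    """Mean binary log-loss with natural logarithm (assumed to match the leaderboard metric)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Any labels give the same result when every prediction is 0.5.
y = [0, 1, 1, 0, 1]
baseline = binary_log_loss(y, [0.5] * len(y))
# baseline equals ln(2), about 0.6931
```

Under that assumption, 0.6888x is only slightly better than the constant guess.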