Can distribution values of a target variable be used as features in cross-validation?

I came across an SVM predictive model where the author used the probabilistic distribution value of the target variable as a feature in the feature set. For example:

The author built a model for each gesture of each player to guess which gesture would be played next. Calculated over 1,000 games played, the distribution might look like (20%, 10%, 70%). These numbers were then used as feature variables to predict the target variable during k-fold cross-validation.

Is that legitimate? It seems like cheating. I would think you would have to exclude the target variable's values for the test set when calculating such features in order not to "cheat".



There is nothing necessarily wrong with this. If you have no better information, then using past performance (i.e., prior probabilities) can work pretty well, particularly when your classes are very unevenly distributed.

Example methods using class priors are Gaussian Maximum Likelihood classification and Naïve Bayes.
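As a concrete illustration (a minimal sketch with toy data, not the author's setup), scikit-learn's Gaussian Naive Bayes estimates exactly these class priors from the training labels and combines them with per-class likelihoods:

```python
# Minimal sketch: Gaussian Naive Bayes combines per-class likelihoods
# with class priors estimated from the training labels (toy data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train = rng.random((100, 3))             # toy feature matrix
y_train = rng.choice([0, 1, 2], size=100)  # toy gesture labels

clf = GaussianNB()       # priors are estimated from y_train by default
clf.fit(X_train, y_train)
print(clf.class_prior_)  # learned priors, analogous to (20%, 10%, 70%)
```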

[UPDATE]

Since you've added additional details to the question...

Suppose you are doing 10-fold cross-validation (holding out 10% of the data for validating each of the 10 subsets). If you use the entire data set to establish the priors (including the 10% of validation data), then yes, it is "cheating" since each of the 10 subset models uses information from the corresponding validation set (i.e., it is not truly a blind test). However, if the priors are recomputed for each fold using only the 90% of data used for that fold, then it is a "fair" validation.
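A minimal sketch of the "fair" approach (assuming scikit-learn's KFold and hypothetical gesture labels): the priors are recomputed inside each fold from the training portion only, so the held-out 10% never influences that fold's features.

```python
# Minimal sketch: recompute class priors inside each fold, using only
# that fold's training data, so the validation set stays blind.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
y = rng.choice(["rock", "paper", "scissors"], size=1000)  # toy labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(y):
    _, counts = np.unique(y[train_idx], return_counts=True)
    priors = counts / counts.sum()  # e.g. ~(0.2, 0.1, 0.7), training data only
    # ...use these priors as features for this fold's model; computing them
    # from all of y instead would leak validation information.
```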

An example of the effect of this "cheating" is if you have a single, extreme outlier in your data. Normally, with k-fold cross-validation, there would be one fold where the outlier is in the validation data and not the training data. When applying the corresponding classifier to the outlier during validation, it would likely perform poorly. However, if the training data for that fold included global statistics (from the entire data set), then the outlier would influence the statistics (priors) for that fold, potentially resulting in artificially favorable performance.


After speaking with some experienced statisticians, this is what I got.

As for technical issues regarding the paper, I'd be worried about data leakage, i.e., using future information in the current model. This can also occur in cross-validation: you should make sure each model trains only on past data and predicts on future data. I wasn't sure exactly how they conducted the cross-validation, but it definitely matters, and it is non-trivial to prevent all sources of leakage. They do claim to test on unseen examples, but it isn't explicit exactly what they did. I'm not saying they are definitely leaking information, only that it could happen.
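If the data are time-ordered (as a sequence of games is), one leakage-safe splitting scheme is scikit-learn's TimeSeriesSplit; a minimal sketch with toy data, where each fold trains strictly on the past:

```python
# Minimal sketch: time-ordered splits, so each model trains only on past
# games and is validated on future ones (no future information leaks in).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.random((1000, 3))             # toy features, ordered by game number
y = rng.choice([0, 1, 2], size=1000)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < test_idx.min()  # training always precedes testing
    # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx]
```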


As bogatron and Paul already said, there is nothing wrong with using the prediction from one classifier as a feature in another classifier. Actually, so-called "Cascading classifiers" work that way. From Wikipedia:

Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output from a given classifier as additional information for the next classifier in the cascade.

This can be helpful not only to inform later classifiers with new features but also as an optimization measure. In the Viola-Jones object detection framework, a set of weak classifiers is applied sequentially to reduce the amount of computation in the object recognition task: if one of the weak classifiers fails to recognize an object of interest, the remaining classifiers don't need to be computed.
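A minimal sketch of the cascading idea (toy data and arbitrary model choices, not the Viola-Jones code itself): the first classifier's predicted probabilities are appended as extra features for the second.

```python
# Minimal sketch: feed one classifier's output probabilities to the next,
# the basic idea behind cascading/stacked classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = rng.choice([0, 1], size=500)

first = LogisticRegression().fit(X, y)
proba = first.predict_proba(X)   # stage-1 output
X_aug = np.hstack([X, proba])    # appended as features for stage 2

second = SVC().fit(X_aug, y)
# In practice, generate `proba` out-of-fold (e.g. via cross_val_predict) so
# the second stage doesn't see leaked training information -- the same
# concern raised in the question.
```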


I agree that there is nothing wrong with using this type of feature. I have used them for inter-arrival times, for example, in modeling work. I have noticed, however, that many features of this kind have "interesting" covariance relationships with each other, so you have to be really careful about using multiple distribution features in one model.
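One reason for those "interesting" relationships (a sketch with synthetic Dirichlet draws, purely illustrative): distribution values that sum to one are linearly dependent, which forces correlations among them by construction.

```python
# Minimal sketch: distribution features that sum to 1 are linearly
# dependent, so some pairwise correlations are negative by construction.
import numpy as np

rng = np.random.default_rng(0)
dist = rng.dirichlet(alpha=[2, 1, 7], size=1000)  # each row sums to 1
print(np.corrcoef(dist, rowvar=False))            # inspect correlations
```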
