How can we assess the generality of classification methods?

Suppose we have a classification task $A$ and several methods $M_1, M_2, M_3$ for it. Performance on $A$ is measured with a consistent metric; for instance, if $A$ is binary classification, the F-score or the ROC curve can be used.

I surveyed one research area and found that:

  • $M_1$ is evaluated on dataset $D_1$ (open) using pre-processing $P_1$ only (it appears to be the seminal work).
  • $M_2$ is evaluated on datasets $D_1$ (open) and $D_2$ (private) and compared against $M_1$, claiming more accurate results, but using a different pre-processing $P_2$.
  • $M_3$ proposes a new approach evaluated on dataset $D_3$ (private) and provides no comparison against $M_1$ or $M_2$.

I'm trying to work in this area, but there are many inconsistencies. None of the methods is validated on a separate validation set; they only use train and test splits. I suspect some parameters were tuned on the test set, even though the authors do not say so. Since this field is not data-science-oriented and only a few datasets are available, this can easily happen.

Which method can we consider state-of-the-art?

How can we assess the generality of each method?



You're experiencing an unfortunately common issue with the current state of system/model evaluation. In addition to evaluating on different datasets, authors often leave out important details, such as the procedure for hyperparameter tuning, detailed evaluation metrics (e.g. true positive and false negative counts in addition to the F-score), and ablation analyses. In cases like these, we cannot conclude that one method is necessarily better than the others or state-of-the-art.
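As a minimal sketch of what "detailed evaluation metrics" means in practice, the snippet below reports the full confusion matrix alongside the F-score and ROC AUC. The labels and scores are toy placeholders, not data from any of the papers above:

```python
# Report confusion-matrix counts alongside summary scores (toy data only).
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]                      # ground-truth labels (placeholder)
y_score = [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.35, 0.7]    # model scores (placeholder)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]       # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"F1      = {f1_score(y_true, y_pred):.3f}")
print(f"ROC AUC = {roc_auc_score(y_true, y_score):.3f}")
```

Reporting the raw counts lets later authors recompute whatever metric they prefer, which is exactly what the surveyed papers make impossible.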

The best way to estimate the generality of each method when the literature has so far failed to do so is to implement each yourself and do a fair comparative evaluation. You would evaluate all methods on the same dataset with the same pre-processing steps and hyperparameter tuning procedure and, if possible, introduce additional evaluation datasets. It can also be very enlightening to perform an ablation analysis in which you iteratively remove certain components of the methods and re-evaluate to see how much of a performance hit you take.
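To make that concrete, here is a minimal sketch of such a harness, assuming a shared dataset `X, y` and using off-the-shelf scikit-learn classifiers as stand-ins for $M_1$–$M_3$ (they are not the authors' actual methods). Every candidate gets the same pre-processing and the same nested-cross-validation tuning procedure, so hyperparameters are never tuned on the outer test folds:

```python
# Fair comparative evaluation sketch: identical pre-processing and tuning
# procedure for every candidate method; classifiers are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # placeholder data

candidates = {
    "M1-like": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "M2-like": (SVC(), {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}),
    "M3-like": (RandomForestClassifier(random_state=0), {"clf__n_estimators": [100, 300]}),
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, (model, grid) in candidates.items():
    # Same pre-processing for every method; hyperparameters are tuned only on
    # the inner folds, and the outer folds serve purely as held-out test data.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", model)])
    tuned = GridSearchCV(pipe, grid, cv=3, scoring="f1")
    scores = cross_val_score(tuned, X, y, cv=outer_cv, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

An ablation fits the same loop: swap out one component of a pipeline (e.g. drop the scaler or a pre-processing step) and re-run the outer cross-validation to see how much the score drops.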

Doing the above and communicating it (via a publication, blog post, or whatever) will not only help you, but everyone else working in the area.
