Can classification model B trained on data labeled by classification model A exceed the performance of model A?

Let's say that I have a small or medium-sized dataset of images, say 50,000. I use transfer learning to train a deep learning classification model. Call this model A. Model A is deemed to have good enough performance to be deployed. I deploy model A to a production environment, where many users consume the service by sending an image to an endpoint and receiving back the predicted class.
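For concreteness, here is a minimal transfer-learning sketch for model A, assuming PyTorch/torchvision; the ResNet-50 backbone, class count, and variable names are illustrative placeholders, not details from the question:

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder for the actual label set

# Start from an ImageNet-pretrained backbone and replace the head.
model_a = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model_a.parameters():
    p.requires_grad = False  # freeze the pretrained feature extractor
model_a.fc = nn.Linear(model_a.fc.in_features, num_classes)
# Only the new head (model_a.fc) is then fine-tuned on the ~50,000 labeled images.
```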

Now let's say the service becomes very popular, and after a few months model A has labeled a few million images. Assume that these images, along with the labels assigned to them by A, are maintained in a new dataset. Say I train a new classification model. Call this model B. Model B is trained on the new dataset consisting of the millions of images labeled by model A.

Model B has trained on much, much more data than model A did; however, model B's training data was labeled exclusively by model A, so it will contain label noise that mirrors model A's errors.
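To make the setup concrete, here is a sketch of how model A's predictions become model B's training labels, assuming PyTorch; `model_a` and `unlabeled_loader` are hypothetical names, not a real API:

```python
import torch

@torch.no_grad()
def pseudo_label(model_a, unlabeled_loader, device="cuda"):
    """Label production images with model A's predictions."""
    model_a.eval().to(device)
    images, labels = [], []
    for batch in unlabeled_loader:            # batches of incoming images
        batch = batch.to(device)
        preds = model_a(batch).argmax(dim=1)  # hard labels, as in the scenario
        images.append(batch.cpu())
        labels.append(preds.cpu())
    # In practice the pairs would be written to disk; they are kept in
    # memory here only to keep the sketch short.
    return torch.cat(images), torch.cat(labels)

# Model B then trains on these pairs as if they were ground truth,
# inheriting any systematic errors model A makes.
```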

Is there any reason to believe that model B should show an improvement in performance over A, or will B merely learn the errors of A and fail to show any improvement?

If B won't/shouldn't improve over A in the given scenario, are there any slight tweaks to the scenario that might allow for B to show a performance gain over A?

================================================================

I have to believe that this problem is extremely common in industry, but I have not found a canonical or even common approach to solving it. For example, a similar question exists here for numerical data, but it didn't get much traction.

I'm interested in any attempted solution or a method that is commonly used in industry for image classifiers. Feel free to rephrase the question in your proposed solution. The only necessary assumptions you must make when drafting a solution are that 1) you've trained some sort of initial model(s) on an initial image dataset, 2) you now have a huge number of new images that can be considered either unlabeled or labeled by the initial model(s), and 3) you can only consider automated solutions; there can be no manual checking of the data.

Topic: classification, predictive-modeling

Category: Data Science


For strictly supervised methods, I would not expect model B to perform significantly better than model A. The only way model B can do better than model A is by correctly classifying samples for which model A produces an incorrect classification. But since model B is trained on model A's predictions, this is exactly what model B is trained to avoid: it will attempt to produce the same incorrect classifications as model A, which are "correct" as far as model B is concerned. To improve on model A, model B would have to "intend" to be wrong.
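One way to see this formally: with hard pseudo-labels and a cross-entropy loss, model B's training objective is

$$\mathcal{L}(B) = -\frac{1}{N}\sum_{i=1}^{N}\log p_B(\hat{y}_i \mid x_i), \qquad \hat{y}_i = \arg\max_c\, p_A(c \mid x_i),$$

which (ignoring capacity limits and regularization) is minimized by putting all of B's probability mass on A's prediction for every $x_i$. The global optimum of B's training problem is a copy of A on the training set, errors included.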

For model B to perform better, it must selectively "misclassify" samples that model A got wrong. But neither model knows which samples those are; if they did, they would recognize the error and not get those samples wrong in the first place.

Model B could potentially do better with semi-supervised methods that use both a set of provided labels and the structure of the data in feature space. Such methods allow some flexibility in the training labels, letting the optimization consider criteria other than how many training labels are predicted correctly. With supervised methods, model B can do no better than exactly recapitulating model A's output, but with semi-supervised methods model B may be optimized in a way that produces labels distinct from model A's.
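One common automated variant of this idea is confidence-thresholded self-training: keep only the pseudo-labels that model A assigns with high confidence and mix them with the trusted labels, so that model B sees the original ground truth plus the "easy" part of model A's output rather than all of its errors. A minimal sketch, assuming PyTorch; `labeled_loader` (the original 50,000 human-labeled images) and `unlabeled_loader` (the production images) are hypothetical names:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confident_pseudo_labels(model_a, unlabeled_loader,
                            threshold=0.95, device="cuda"):
    """Keep only the predictions model A makes with high confidence."""
    model_a.eval().to(device)
    kept_images, kept_labels = [], []
    for batch in unlabeled_loader:
        batch = batch.to(device)
        probs = F.softmax(model_a(batch), dim=1)
        conf, preds = probs.max(dim=1)
        mask = conf >= threshold           # discard uncertain predictions
        kept_images.append(batch[mask].cpu())
        kept_labels.append(preds[mask].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)

# Model B is then trained on the union of the original labeled set and
# the filtered pseudo-labels. The threshold trades dataset size against
# label noise and would need tuning on a held-out validation set.
```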
