Can classification model B trained on data labeled by classification model A exceed the performance of model A?
Let's say that I have a small- or medium-sized dataset of images, say 50,000. I use transfer learning to train a deep learning classification model. Call this model A. Model A is deemed to perform well enough to be deployed. I deploy model A to a production environment where many users can consume the service by sending an image to an endpoint and receiving back the predicted class.
Now let's say the service becomes very popular, and after a few months, model A has labeled a few million images. Assume that these images and the labels assigned by A are stored in a new dataset. Say I train a new classification model. Call this model B. Model B is trained on the new dataset consisting of the millions of images labeled by model A.
Model B has trained on much, much more data than did model A; however, the training data for model B was labeled exclusively by model A, so it will have label noise in accordance with model A's errors.
Is there any reason to believe that model B should show an improvement in performance over A, or will B merely learn the errors of A and fail to show any improvement?
If B won't/shouldn't improve over A in the given scenario, are there any slight tweaks to the scenario that might allow for B to show a performance gain over A?
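To make the scenario above concrete, here is a minimal toy sketch of the experiment I have in mind. Everything in it is an assumption made for illustration: the synthetic 2-D Gaussian points stand in for images, the nearest-centroid classifier stands in for the deep models, and the names `make_data`, `fit_centroids`, `predict`, and `accuracy` are hypothetical helpers, not part of any library or of my actual pipeline.

```python
import random


def make_data(n, rng):
    """Draw n labeled points from two overlapping 2-D Gaussian classes."""
    points, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        center_x = 1.0 if y == 1 else -1.0  # class centers at (-1, 0) and (+1, 0)
        points.append((rng.gauss(center_x, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(y)
    return points, labels


def fit_centroids(points, labels):
    """Nearest-centroid 'training': compute the mean point of each label."""
    sums, counts = {}, {}
    for (x1, x2), y in zip(points, labels):
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += x1
        s[1] += x2
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda y: (point[0] - centroids[y][0]) ** 2
        + (point[1] - centroids[y][1]) ** 2,
    )


def accuracy(centroids, points, labels):
    """Fraction of points whose predicted label matches the given label."""
    correct = sum(predict(centroids, p) == y for p, y in zip(points, labels))
    return correct / len(points)


rng = random.Random(0)
small_x, small_y = make_data(50, rng)     # small, truly labeled set for model A
pool_x, _ = make_data(20000, rng)         # large pool; true labels are discarded
test_x, test_y = make_data(5000, rng)     # held-out set with true labels

model_a = fit_centroids(small_x, small_y)             # model A: trained on true labels
pseudo_y = [predict(model_a, p) for p in pool_x]      # pool labeled exclusively by A
model_b = fit_centroids(pool_x, pseudo_y)             # model B: trained on A's labels

acc_a = accuracy(model_a, test_x, test_y)
acc_b = accuracy(model_b, test_x, test_y)
print(f"model A accuracy: {acc_a:.3f}")
print(f"model B accuracy: {acc_b:.3f}")
```

The comparison of `acc_a` and `acc_b` on a held-out set with true labels (for example, a slice of the original labeled data) is the measurement my question is about. A toy like this may not behave the same way as a deep image classifier, so I am not treating its outcome as an answer.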
================================================================
I have to believe that this problem is extremely common in industry, but I have not found a canonical or even common approach to solving it. For example, a similar question exists here for numerical data, but it didn't get much traction.
I'm interested in any attempted solution or a method that is commonly used in industry for image classifiers. Feel free to rephrase the question in your proposed solution. The only assumptions you must make when drafting a solution are that 1) you've trained some sort of initial model(s) on an initial image dataset, 2) you now have a huge number of new images that can be considered either unlabeled or labeled by the initial model(s), and 3) you can only consider automated solutions; there can be no manual checking of the data.
Topic classification predictive-modeling
Category Data Science