Label A records B times or label A*B records
This question concerns pre-training data sourcing.
Suppose you have a human workforce of B individuals and a potentially unlimited source of data.
The task is labeling images with classes. These classes are somewhat subjective (emotions). This means one individual might label the same image with a different class than another individual.
If these labeled records are then used as training data for a neural network that predicts image classes, is it better to
1) have a number of records (A) labeled redundantly by all B individuals.
2) have every individual label A different records each, yielding A x B labeled records.
The intuition behind 1) is that the mean of several subjective labels would be somewhat objective, so the training data would be mostly objective. In addition, the label proportions (e.g. 50% happy, 50% surprised) could be used as probabilistic training targets.
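To make the "probabilities" idea in 1) concrete, here is a minimal sketch of how the B votes for one record could be turned into a probability vector. The function name `soft_label`, the class list, and the example votes are all made up for illustration; this is only one simple way to aggregate (plain vote proportions), not an established recipe.

```python
from collections import Counter

def soft_label(votes, classes):
    """Turn one record's annotator votes into vote proportions per class."""
    counts = Counter(votes)   # missing classes count as 0
    total = len(votes)        # B, the number of annotators for this record
    return [counts[c] / total for c in classes]

# Hypothetical class set and votes from B = 4 annotators on a single image.
classes = ["happy", "surprised", "sad"]
votes = ["happy", "surprised", "happy", "surprised"]

print(soft_label(votes, classes))  # [0.5, 0.5, 0.0]
```

With option 2) each record has exactly one vote, so the same function would just return a one-hot vector per record.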
The intuition behind 2) is that subjectiveness in individuals' labeling is natural, and the NN is trained on it, becoming somewhat "general"/"objective" in its predictions. Also, more data is always better.
Please excuse the use of subjective and objective in combination with Machine Learning. I know this might not be correct at all.
Topic labelling training image-classification neural-network
Category Data Science