Why does increasing the training set size not improve the results?
I have trained a model on a fairly small training set (around 120 true positives overall, plus of course many negative examples). I am trying to improve the results by increasing the amount of data, and I tried two approaches:
1. I added data from a different domain and simply concatenated it with the existing data. This increased the F-score from 0.13 to 0.14.
2. I added the same extra data instances, but this time with a domain adaptation technique (feature augmentation). This time I got an improvement of around 6 percent, which was significant. A rough sketch of both setups is below.
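In case it helps make the setup concrete, here is a minimal toy sketch of what I mean by the two approaches. The data here is random stand-in data (the array names and sizes are just placeholders for my real feature matrices), and by feature augmentation I mean the Daumé-style scheme of copying each feature vector into a shared block plus a domain-specific block:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Toy stand-ins for my data: X_main/y_main is the small in-domain set
# (~120 positives plus negatives), X_extra/y_extra is the ~10x larger
# out-of-domain set, and X_test/y_test is the in-domain test set.
rng = np.random.default_rng(0)
n_features = 50
X_main, y_main = rng.integers(0, 5, (150, n_features)), rng.integers(0, 2, 150)
X_extra, y_extra = rng.integers(0, 5, (1500, n_features)), rng.integers(0, 2, 1500)
X_test, y_test = rng.integers(0, 5, (200, n_features)), rng.integers(0, 2, 200)

# Approach 1: plain concatenation of the two domains.
X_concat = np.vstack([X_main, X_extra])
y_concat = np.concatenate([y_main, y_extra])
nb_plain = MultinomialNB().fit(X_concat, y_concat)
print("concatenation F1:", f1_score(y_test, nb_plain.predict(X_test)))

# Approach 2: feature augmentation. Each feature vector is copied into
# a shared block plus a domain-specific block: in-domain (target) rows
# become [x, x, 0], out-of-domain (source) rows become [x, 0, x], and
# the in-domain test set is mapped the same way as the target rows.
X_aug_main = np.hstack([X_main, X_main, np.zeros_like(X_main)])
X_aug_extra = np.hstack([X_extra, np.zeros_like(X_extra), X_extra])
X_aug_test = np.hstack([X_test, X_test, np.zeros_like(X_test)])

nb_aug = MultinomialNB().fit(np.vstack([X_aug_main, X_aug_extra]), y_concat)
print("feature-augmentation F1:", f1_score(y_test, nb_aug.predict(X_test)))
```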
My question is: why does the first approach help only a little bit?
The additional data set was around 10 times bigger than the main data set. That is, adding around 1000 instances from a different domain increased the result by about 1 percent. If it hadn't helped at all, I think that would be easier to explain. But as it is, I don't understand it. If the data is biased, and the test and training sets are so close to each other that the additional data set cannot help, why do I still get a 1 percent improvement?
On the other hand, how can I get almost the same result when the main data set makes up only around 1/10 of the combined data? I don't understand the model's behaviour. I am using a Naive Bayes classifier.
Topic domain-adaptation naive-bayes-classifier classification
Category Data Science