Why does increasing the training set size not improve the results?

I have trained a model on a fairly small training set (around 120 true positives, plus many negative examples). To improve the results, I tried to increase the data size with two approaches:

  1. I added data from a different domain and simply concatenated it with the existing data. This increased the F-score from 0.13 to 0.14.

  2. I added the same extra data instances, but this time with a domain adaptation technique (feature augmentation). This time I got an improvement of around 6 percentage points, which is significant.
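For readers unfamiliar with feature augmentation: one common variant is the "frustratingly easy" domain adaptation of Daumé III (2007), where each feature vector is copied into a general block plus a domain-specific block, so the classifier can learn both shared and domain-specific weights. A minimal sketch (the function name and toy data are my own, for illustration only):

```python
import numpy as np

def augment(X, domain):
    """Feature augmentation for domain adaptation:
    map each instance to [general, source-only, target-only] feature blocks."""
    n, d = X.shape
    out = np.zeros((n, 3 * d))
    out[:, :d] = X                # general copy, shared across domains
    if domain == "source":
        out[:, d:2 * d] = X       # source-specific copy
    else:
        out[:, 2 * d:] = X        # target-specific copy
    return out

# Toy example: one source instance, one target instance
Xs = np.array([[1.0, 2.0]])
Xt = np.array([[3.0, 4.0]])
print(augment(Xs, "source"))  # [[1. 2. 1. 2. 0. 0.]]
print(augment(Xt, "target"))  # [[3. 4. 0. 0. 3. 4.]]
```

The augmented source and target matrices can then be stacked and fed to the classifier as usual.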

My question is: why does the first approach help only a little?

The additional data set was around 10 times bigger than the main one. That is, adding around 1000 instances from a different domain improved the results by 1 percentage point. If it didn't help at all, that would be easier to explain. But as it stands, I don't understand: if the data is biased, and the test and training sets are so close to each other that the additional data cannot help, why do I get a 1 percent improvement?

On the other hand, how do I get almost the same result when the main data set is only around 1/10 of the total data? I don't understand the model's behaviour. I am using a Naive Bayes classifier.

Topic domain-adaptation naive-bayes-classifier classification

Category Data Science


Your model is most probably under-fitting the data. Imagine fitting a straight line to non-linear data: even if you increase the sample size, the model does not get any better, because it lacks the capacity to capture the underlying pattern.

Study learning curves and how they can be used to diagnose under-fitting!
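A quick way to see this in practice is scikit-learn's `learning_curve`. A sketch on synthetic data (the data set here is made up; substitute your own features and a Naive Bayes variant suited to them, e.g. `MultinomialNB` for counts):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data set
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)

# Score the model on growing fractions of the training data
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

If both the training and validation scores plateau at a low value as `n` grows, the model is under-fitting, and more data of the same kind will not help; you need a richer model or better features.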
