Training data from different sources

Question

Bashar Haddad

July 16, 2017, 22:23

I am working on a binary classification problem. My data contains 100K samples from two different sources. When I train and test on data from the first source, I can achieve classification accuracy of up to 98%, and when I train and test on data from the second source, I can achieve up to 99%. The problem is that when I mix the two, the classification accuracy drops to 89%. Any ideas on how to perform the training to achieve high accuracy? Note that one of my features is related to the source.

Topic domain-adaptation classification bigdata data-mining machine-learning

Category Data Science

Christos Karatsalos · Accepted Answer · July 16, 2017, 22:23

This could happen for several reasons.

There is a discrepancy between the distributions of the features in the two samples.

There is a discrepancy between the distributions of the labels in the two samples.

Another issue is sample size. If one sample is much larger than the other and the distributions also differ, the classifier will mostly fit the larger sample, which can hurt the final performance.
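A quick way to check the three points above is to summarize each source separately and compare. The following is a minimal sketch using only the Python standard library; the helper names (`source_summary`, `standardized_mean_gap`) and the toy data are made up for illustration, and a real check would cover every feature and use a proper statistical test.

```python
from statistics import mean, stdev

def source_summary(rows):
    """Summarize one source. rows is a list of (feature_value, label) tuples."""
    values = [x for x, _ in rows]
    labels = [y for _, y in rows]
    return {
        "n": len(rows),                 # sample size
        "feature_mean": mean(values),   # location of the feature distribution
        "feature_std": stdev(values),   # spread of the feature distribution
        "positive_rate": mean(labels),  # share of label 1 (label distribution)
    }

def standardized_mean_gap(rows_a, rows_b):
    """Absolute difference in feature means, in units of the pooled std."""
    a, b = source_summary(rows_a), source_summary(rows_b)
    pooled_std = ((a["feature_std"] ** 2 + b["feature_std"] ** 2) / 2) ** 0.5
    return abs(a["feature_mean"] - b["feature_mean"]) / pooled_std

# Toy data, invented for illustration only.
source_1 = [(0.9, 0), (1.1, 0), (1.0, 0), (2.0, 1), (2.1, 1), (1.9, 1)]
source_2 = [(4.9, 0), (5.1, 0), (5.0, 0), (5.2, 0), (6.0, 1), (6.1, 1)]

print(source_summary(source_1))
print(source_summary(source_2))
print("standardized mean gap:", round(standardized_mean_gap(source_1, source_2), 2))
```

A large gap in the feature means or in the positive rates suggests that the two sources should not be treated as one homogeneous sample.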

Finally, according to Simpson's paradox, a trend can appear in several groups of data but disappear or reverse when these groups are combined. This could be a reason for the worse performance you observe when you combine the data.
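To see how a pattern that is clean within each source can blur once the data is pooled, here is a toy sketch. It is not Simpson's paradox in the strict sense, only a related effect. The data is synthetic and the one-feature threshold classifier is an assumption chosen for simplicity: each source is perfectly separable by its own threshold, but no single threshold works for both.

```python
def best_threshold_accuracy(rows):
    """Best accuracy of the rule 'predict 1 if x > t' over candidate thresholds t."""
    candidates = sorted({x for x, _ in rows})
    # Also try a threshold below every point (predict 1 everywhere).
    thresholds = [candidates[0] - 1.0] + candidates
    best = 0.0
    for t in thresholds:
        correct = sum(1 for x, y in rows if (1 if x > t else 0) == y)
        best = max(best, correct / len(rows))
    return best

# Synthetic data: (feature_value, label). The class boundary sits at a
# different place in each source.
source_1 = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
source_2 = [(3.0, 0), (4.0, 0), (5.0, 1), (6.0, 1)]

acc_1 = best_threshold_accuracy(source_1)
acc_2 = best_threshold_accuracy(source_2)
acc_pooled = best_threshold_accuracy(source_1 + source_2)

print("source 1 only:", acc_1)
print("source 2 only:", acc_2)
print("pooled:", acc_pooled)
```

In this toy case, the pooled accuracy is lower than either per-source accuracy because the classifier has no way to tell which boundary applies to a given sample.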

DaL · Accepted Answer · August 15, 2016, 06:56

It seems that you have a domain adaptation problem. The samples from the two sources behave differently.

I suggest reading "Frustratingly Easy Domain Adaptation". As the name hints, the solution is easy and popular (about 800 citations so far), and the paper is also a good survey of other directions.
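The core idea of that paper is a feature augmentation: every feature vector gets one shared copy plus one copy per domain, and only the copy for the sample's own domain is filled in. A minimal sketch follows; the function name `augment` and its parameters are my own, not taken from the paper. Any standard classifier can then be trained on the augmented vectors.

```python
def augment(features, domain_index, num_domains):
    """Map x to [x | domain-0 block | ... | domain-(K-1) block].

    The first block is shared by all domains. Each later block is a copy of x
    for the sample's own domain and all zeros for every other domain.
    """
    zeros = [0.0] * len(features)
    out = list(features)  # shared copy
    for d in range(num_domains):
        out.extend(features if d == domain_index else zeros)
    return out

x = [0.5, 2.0]
print(augment(x, 0, 2))  # [0.5, 2.0, 0.5, 2.0, 0.0, 0.0]
print(augment(x, 1, 2))  # [0.5, 2.0, 0.0, 0.0, 0.5, 2.0]
```

The shared block lets the classifier learn what the sources have in common, while the per-domain blocks let it learn source-specific corrections. This requires knowing the source of each sample at prediction time.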

I understand that the classifier you ran on the entire dataset was also trained on it. How well do the classifiers trained on a single source perform on the other source? How many of the samples belong to the first source? Will you know the source of a sample in production? The answers to these questions might open up more directions.
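The first question can be answered with a small train/test grid over the sources. Here is a rough sketch with toy data and a one-feature nearest-centroid classifier standing in for the real model (both are assumptions for illustration). For brevity, the diagonal entries are evaluated on the training data; a real experiment should use a held-out split.

```python
from statistics import mean

def fit_centroids(rows):
    """Fit a 1-D nearest-centroid classifier: one feature mean per class label."""
    labels = {y for _, y in rows}
    return {label: mean(x for x, y in rows if y == label) for label in labels}

def accuracy(centroids, rows):
    """Share of rows whose nearest class centroid matches the true label."""
    correct = 0
    for x, y in rows:
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        correct += predicted == y
    return correct / len(rows)

# Toy data, invented for illustration: (feature_value, label).
sources = {
    "source_1": [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)],
    "source_2": [(4.0, 0), (5.0, 0), (7.0, 1), (8.0, 1)],
}

# Train on each source, then evaluate on every source.
results = {}
for train_name, train_rows in sources.items():
    model = fit_centroids(train_rows)
    for test_name, test_rows in sources.items():
        results[(train_name, test_name)] = accuracy(model, test_rows)
        print(f"train={train_name} test={test_name} "
              f"accuracy={results[(train_name, test_name)]:.2f}")
```

If the off-diagonal accuracies are much lower than the diagonal ones, the sources really do behave like different domains.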
