Every machine learning model I build (Random Forest, XGBoost, AdaBoost) misclassifies almost the same samples

First of all, I'd like to apologize for any spelling or grammar mistakes.

I'm having a problem using R for a classification task. My dataset contains ~300,000 genomic samples, and the features are DNA-related (dinucleotide counts, trinucleotide counts, the CG content, and some more). In total, I have a dataset of 300,000 rows and 84 columns (columns = features). The 84th column is the classification variable (there are two classes: class 1 and class 2).

I split the dataset into a training set and a test set, fit models on the training set using different methods (Random Forest, XGBoost, AdaBoost, etc.), and check the accuracy on the test set.
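Here is roughly what that step looks like (a minimal sketch of my workflow; the data frame name df and the column name class are placeholders, and I'm only showing two of the methods):

```r
library(randomForest)
library(xgboost)

set.seed(42)

# Assumed layout: df has 83 numeric feature columns plus a factor column
# "class" with levels 1 and 2 as the 84th column
train_idx <- sample(nrow(df), floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Random forest on the training set
rf_fit  <- randomForest(class ~ ., data = train, ntree = 500)
rf_pred <- predict(rf_fit, newdata = test)
mean(rf_pred == test$class)  # test-set accuracy

# XGBoost needs a numeric matrix and 0/1 labels
x_train <- as.matrix(train[, -84])
x_test  <- as.matrix(test[, -84])
y_train <- as.numeric(train$class) - 1  # factor levels 1/2 -> 0/1

xgb_fit  <- xgboost(data = x_train, label = y_train, nrounds = 100,
                    objective = "binary:logistic", verbose = 0)
xgb_pred <- ifelse(predict(xgb_fit, x_test) > 0.5, 2, 1)
mean(xgb_pred == as.numeric(test$class))  # test-set accuracy
```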

With every model, the accuracy on the test set is always around 86-87%. What I noticed is that, in the test set, for EVERY MODEL, almost the same rows are predicted correctly (the model predicts their true class), and almost the same rows are predicted incorrectly (the model predicts the wrong class; for example, the true class is 1 but the model predicts 2).
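This is how I check the overlap between the models' errors (reusing the objects from the sketch above):

```r
# Logical vectors: TRUE where each model got the test row right
rf_correct  <- rf_pred == test$class
xgb_correct <- xgb_pred == as.numeric(test$class)

# How often do the two models agree on which rows are right/wrong?
table(rf_correct, xgb_correct)

# Fraction of rows that every model misclassifies
always_wrong <- !rf_correct & !xgb_correct
mean(always_wrong)
```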

If I take the set of rows that are predicted correctly by every model I've created (250,000 of the 300,000) and build a new model from them (splitting them again into a new training and test set), I get an accuracy of 98%.
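That retraining experiment looks roughly like this (a sketch; always_correct is a hypothetical logical vector over all rows of df, TRUE where every model predicted the row's true class):

```r
# Keep only the rows every model gets right
easy <- df[always_correct, ]

set.seed(42)
idx <- sample(nrow(easy), floor(0.7 * nrow(easy)))
rf_easy   <- randomForest(class ~ ., data = easy[idx, ], ntree = 500)
pred_easy <- predict(rf_easy, newdata = easy[-idx, ])
mean(pred_easy == easy$class[-idx])  # ~0.98 in my experiment
```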

That means the other 50,000 rows (of the 300,000) introduce error into every model I try to create.

But I can't just ignore them. I thought there might be a way to create two different models, but I don't know how to go about it (a sketch of what I have in mind is below).
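One way the two-model idea could be set up is a two-stage scheme: first a "router" that predicts whether a row belongs to the easy or the hard subset, then a separate classifier per subset. This is only a hypothetical sketch (the subset labels come from the always_correct vector above, and whether it helps depends entirely on whether the hard rows are separable from the easy ones in feature space):

```r
# Stage 1: router predicting easy vs. hard membership
df$subset <- factor(ifelse(always_correct, "easy", "hard"))
router <- randomForest(subset ~ . - class, data = df, ntree = 500)

# Stage 2: one class-1/class-2 classifier per subset
fit_easy <- randomForest(class ~ . - subset, data = df[df$subset == "easy", ], ntree = 500)
fit_hard <- randomForest(class ~ . - subset, data = df[df$subset == "hard", ], ntree = 500)

# Prediction: route each new row, then apply the matching model
predict_two_stage <- function(newdata) {
  route <- predict(router, newdata)
  out <- factor(rep(NA, nrow(newdata)), levels = levels(df$class))
  out[route == "easy"] <- predict(fit_easy, newdata[route == "easy", , drop = FALSE])
  out[route == "hard"] <- predict(fit_hard, newdata[route == "hard", , drop = FALSE])
  out
}
```

If the router cannot tell the two subsets apart, this collapses back to the single-model accuracy, which is part of what I'm unsure about.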

Then I thought there might be a way to find which variables differ strongly between the two subsets. But even if I figure that out, I don't know how, for example, to rescale the original dataset (all 300,000 rows) to prevent those differences from capping the accuracy at 87% in every model.
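For finding those variables, I imagine comparing each feature's distribution between the two subsets, e.g. with a Kolmogorov-Smirnov statistic (a sketch, reusing the always_correct vector; with count features the ties make the p-values approximate, but the D statistic still gives a ranking):

```r
# Distribution shift of each feature between easy and hard subsets
features <- names(df)[1:83]
ks_stats <- sapply(features, function(f) {
  as.numeric(ks.test(df[[f]][always_correct], df[[f]][!always_correct])$statistic)
})

# Features whose distributions differ most between the two subsets
head(sort(ks_stats, decreasing = TRUE), 10)
```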

In conclusion, I have a dataset that behaves like "two datasets". It consists of two subsets: the first is always predicted correctly by any model (machine learning algorithm) I try, and the second is always predicted incorrectly. How can I build a strong model that takes these differences into account?

I also took a look at box plots of the features. The most important feature in every model is called "CG" (the count of cytosine-guanine dinucleotides in the DNA). In the following image you can see the boxplot of the CG feature with the original dataset separated into two subsets: T for the rows that are predicted correctly by every model, and F for the rows that are predicted incorrectly. 1 stands for class 1 and 2 stands for class 2, so T1 means correctly predicted rows of class 1, T2 means correctly predicted rows of class 2, T means correctly predicted rows of both classes, and so on.
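The grouped boxplot was produced roughly like this (a sketch; always_correct again marks the rows every model gets right, and CG is the feature column):

```r
# Group labels T1, T2, F1, F2: prediction outcome crossed with class
grp <- interaction(ifelse(always_correct, "T", "F"), df$class, sep = "")
boxplot(df$CG ~ grp,
        xlab = "subset x class", ylab = "CG dinucleotide count",
        main = "CG feature by prediction outcome and class")
```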

As I'm a beginner in both R and machine learning, I hope my problem is understandable. Thanks sincerely.

Topic adaboost xgboost classification r machine-learning

Category Data Science
