Python sklearn model.predict() gives me different results depending on the amount of data

I train my XGBoostClassifier(). If my testing set has:

0: 100 
1: 884

It predicts 210 1's: around 147 are wrong (false positives) and 63 are correct (true positives).

Then I increase my testing sample:

0: 15,000
1: 884

It predicts 56 1's: around 40 are wrong (false positives) and 16 are correct (true positives).

Am I missing something? Some theory? Some guidance on how to use model.predict(X_test)?

Does it say somewhere that if you try to predict 10 items it's going to try harder than if you try to predict 10,000 items? In what situation would model.predict(X_test) give a different result for Joe Smith if his prediction is accompanied by 8,000 more rows?

The code I use is the following:

from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

xgb = XGBClassifier(subsample=0.75, scale_pos_weight=30, min_child_weight=1,
                    max_depth=3, gamma=5, colsample_bytree=0.75)
model = xgb.fit(X_train, y_train)
y_pred_output = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred_output)

y_pred_output2 = model.predict(X_test2)  # contains the same 884 1's plus 14500 more rows with 0's as the target value
cm2 = confusion_matrix(y_test2, y_pred_output2)

it produces two different matrices:

#Confusion matrix for y_test with 15000 0's and 884 1's
[[14864   136]
 [  837    47]]

#Confusion matrix for y_test with 500 0's and 884 1's
[[459  41]
 [681 203]]

Notice that the same 884 positive-class items are used in both attempts. Why would the true positives drop to 47 just because there are more negatives in X_test?

Topic: predict, xgboost, python, machine-learning

Category: Data Science


If XGBClassifier is fed the same input data over and over again, it will yield the same results. There is no inherent randomness at prediction time that would produce different results for the same input. Likewise, there should be no difference in the result of an individual prediction if it's requested in a smaller batch versus a larger batch; the result will be identical.
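This batch-size invariance is easy to verify directly. The sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in (any fitted tree ensemble, including XGBClassifier, behaves the same way): predicting a row on its own and predicting it inside a larger batch give the identical result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Fit a small tree ensemble on synthetic data (stand-in for XGBClassifier).
X, y = make_classification(n_samples=500, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

single = model.predict(X[:1])   # one row on its own ("Joe Smith")
batch = model.predict(X)        # the same row plus 499 others

assert single[0] == batch[0]    # identical label either way
assert np.allclose(model.predict_proba(X[:1])[0],
                   model.predict_proba(X)[0])  # identical probabilities too
```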

On the other hand, if you train XGBoost on different data, its outputs will definitely be different. If you add new data to the underlying dataset and retrain with it, new and different patterns will emerge that XGBoost will try to take advantage of, and the entire tree ensemble will be fit very differently.
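By contrast, here is a minimal sketch (again with GradientBoostingClassifier as a stand-in) showing that retraining on an enlarged dataset produces a genuinely different model: the very same test rows receive different predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_test, y_test = X[1500:], y[1500:]

# Model A: trained on 500 rows. Model B: those same 500 plus 1000 more.
model_a = GradientBoostingClassifier(random_state=0).fit(X[:500], y[:500])
model_b = GradientBoostingClassifier(random_state=0).fit(X[:1500], y[:1500])

proba_a = model_a.predict_proba(X_test)[:, 1]
proba_b = model_b.predict_proba(X_test)[:, 1]

# Different training data, different model, different scores on the same rows.
assert not np.allclose(proba_a, proba_b)
```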

I suspect what you are observing is a bug in how you structure the input data that you then feed to the .predict() method. If you share a sample of your code, maybe we can drill down on the issue.
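As a sanity check of what should happen when the data is structured correctly, here is a synthetic reproduction of your setup: two test sets that share the exact same positive rows, one padded with many extra negatives. Because the fitted model scores the shared rows identically in both sets, the true-positive cell of the confusion matrix cannot change (GradientBoostingClassifier is again used as a stand-in for XGBClassifier).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Train on the first 2000 rows, hold out the last 1000 for testing.
X, y = make_classification(n_samples=3000, weights=[0.7], random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X[:2000], y[:2000])

X_hold, y_hold = X[2000:], y[2000:]
pos = y_hold == 1
neg = ~pos

# Small test set: all positives plus a handful of negatives.
X_small = np.vstack([X_hold[pos], X_hold[neg][:50]])
y_small = np.concatenate([y_hold[pos], y_hold[neg][:50]])

# Large test set: the same positives plus every negative.
X_large = np.vstack([X_hold[pos], X_hold[neg]])
y_large = np.concatenate([y_hold[pos], y_hold[neg]])

cm_small = confusion_matrix(y_small, model.predict(X_small))
cm_large = confusion_matrix(y_large, model.predict(X_large))

# True positives (bottom-right cell) are identical in both matrices.
assert cm_small[1, 1] == cm_large[1, 1]
```

If your two confusion matrices show different true-positive counts for the "same" 884 rows, those rows are not actually identical between X_test and X_test2, which points back to how the inputs were assembled.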
