Why does classifier (XGBoost) “after PCA” runtime increase compared to “before PCA”

The short version:
I am comparing different classifiers on a Kaggle dataset, and I am also comparing each classifier's accuracy and runtime before and after applying PCA (from sklearn). For some reason, the runtime of the classifiers (taking XGBoost and AdaBoost as two examples) after applying PCA is roughly three times the runtime before applying PCA. My question is: why? Am I doing something wrong, or is this possible?

The long version:
My understanding of how to use PCA:

  • Have normalized and cleaned datasets split into training and testing sets (using train_test_split).
  • Fit PCA on X_train, transform it, and save the result to a new DataFrame.
  • Using the fitted PCA, transform (without refitting) X_test.
  • Run the classifier with the transformed X_train and X_test.

    PS: I have checked that the number of dimensions decreases (from 21 to the number selected: 17 in the case of 90% of the variance). The dataset has around 130,000 entries, taken from Kaggle. The code written to achieve this:
from sklearn.decomposition import PCA

# Keep enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
X_train_Reduced = pca.fit_transform(X_train)  # fit on the training set only
X_test_Reduced = pca.transform(X_test)        # apply the same projection to the test set
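
A quick sanity check (a minimal sketch; pca is the fitted object from above) confirms the reduction and the retained variance:

# Number of components actually kept and the total variance they explain
print("Components kept:", pca.n_components_)
print("Variance retained: %.4f" % pca.explained_variance_ratio_.sum())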

Classifier (XGBoost) before the use of PCA:

import time
import warnings

from sklearn import metrics
from xgboost import XGBClassifier

start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0,
                             use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost):", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % timeXGBoost)

Output of code:

Accuracy (XGBoost): 0.9655066214967662
Time taken to achieve result: 3.33561372756958 seconds


Classifier (XGBoost) after PCA:

start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0,
                             use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train_Reduced, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test_Reduced)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost):", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % timeXGBoost)

Output of code:

Accuracy (XGBoost): 0.93032029565753
Time taken to achieve result: 10.376214981079102 seconds

Another example (AdaBoost)
Classifier (AdaBoost) before PCA:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost):", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % timeAdaBoost)

Output of code:

Accuracy (AdaBoost): 0.9575762242069603
Time taken to achieve result: 103.38761949539185 seconds

Classifier (AdaBoost) after PCA:

start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train_Reduced, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test_Reduced)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost):", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % timeAdaBoost)

Output of code:

Accuracy (AdaBoost): 0.9141515244841392
Time taken to achieve result: 295.6763050556183 seconds


I would very much appreciate any help in understanding what I have done wrong (or right).
Thank you all in advance.

Topic: adaboost, boosting, pca, scikit-learn, machine-learning

Category: Data Science


This is likely to happen when several of your original features are discrete.

Each tree, when splitting a node, considers a candidate split at each unique value of each feature (among the rows in the current node). For discrete features this count is often significantly smaller than the number of rows, while for continuous features it is often very nearly equal to the number of rows. When some of the features are discrete, then, the number of candidate splits is generally much smaller than when all of them are continuous.

Since PCA amounts to essentially a rotation of the feature space, the resulting features will generally be purely continuous, which makes building the trees more expensive even though a few of the last principal components have been dropped.
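
You can see this effect directly (a minimal sketch, assuming X_train is a pandas DataFrame and X_train_Reduced is the PCA output from your code) by counting the unique values per feature before and after the transform:

import numpy as np

# Discrete original columns show small unique-value counts...
print(X_train.nunique())

# ...while each principal component typically has nearly one unique value per row
print([len(np.unique(X_train_Reduced[:, j])) for j in range(X_train_Reduced.shape[1])])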

You could try histogram-based splitting in XGBoost (tree_method='hist'), which discretizes the features into bins before splitting to reduce computation time. I don't think an equivalent is currently available in sklearn's implementation of AdaBoost.
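
For example (a minimal sketch reusing the reduced data from above; max_bin is optional and 256 is its default):

from xgboost import XGBClassifier
from sklearn import metrics

# Histogram-based tree construction: each feature is bucketed into at most
# max_bin bins, so each split scans bins instead of raw unique values.
modelHist = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0,
                          n_jobs=-1, tree_method='hist', max_bin=256)
modelHist.fit(X_train_Reduced, y_train)
print("Accuracy (hist):", metrics.accuracy_score(y_test, modelHist.predict(X_test_Reduced)))

With 'hist', the per-split cost scales with the number of bins rather than the number of distinct feature values, so it should recover much of the pre-PCA speed.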
