Why does classifier (XGBoost) runtime increase "after PCA" compared to "before PCA"?
The short version:
I am comparing different classifiers on a dataset from Kaggle, and I also want to compare each classifier before and after applying PCA (from sklearn) in terms of accuracy and runtime. For some reason, the runtime of the classifiers (XGBoost and AdaBoost, to take two as examples) after PCA is approximately 3 times the runtime before PCA. My question is: why? Am I doing something wrong, or is this expected?
The long version:
My understanding of how to use PCA:
- Have normalized and cleaned datasets split into training and testing sets (using train_test_split).
- Fit PCA on X_train and transform it, saving the result to a new df.
- Using the fitted PCA, transform (without fitting) X_test.
- Run the classifier with the transformed X_train and X_test.
PS: I have checked that the number of dimensions is decreasing (from 21 to 17 when keeping 90% of the variance). The dataset has around 130000 entries and was taken from Kaggle. The code written to achieve this:
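from sklearn.decomposition import PCA
# Keep as many components as needed to explain 90% of the variance
pca = PCA(n_components=0.9)
X_train_Reduced = pca.fit_transform(X_train)
# Transform the test set with the PCA fitted on the training set only
X_test_Reduced = pca.transform(X_test)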
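The dimension check mentioned above can be sketched as follows. This is a minimal, self-contained illustration that uses synthetic random data as a stand-in for the Kaggle dataset (an assumption, since the real data is not shown here), so the number of components kept will differ from the 17 reported for the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): 1000 rows, 21 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Same PCA usage as in the question: keep 90% of the variance
pca = PCA(n_components=0.9)
X_train_Reduced = pca.fit_transform(X_train)
X_test_Reduced = pca.transform(X_test)

# Number of components actually kept and the variance they explain
kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()

print("columns before PCA:", X_train.shape[1])
print("columns after PCA:", X_train_Reduced.shape[1])
print("components kept:", kept)
print("variance explained: %.3f" % explained)
```

Both `n_components_` and `explained_variance_ratio_` are attributes of the fitted sklearn PCA object, so the same check can be run on the real X_train.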
Classifier (XGBoost) before the use of PCA:
import time
from sklearn import metrics
from xgboost import XGBClassifier

# Timing covers fitting, prediction and accuracy scoring
start_timeXGBoost = time.time()
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0, use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % (timeXGBoost))
Output of code:
Accuracy (XGBoost): 0.9655066214967662
Time taken to achieve result: 3.33561372756958 seconds
Classifier (XGBoost) after PCA:
# Same model and timing as before, but on the PCA-reduced data
start_timeXGBoost = time.time()
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0, use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train_Reduced, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test_Reduced)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % (timeXGBoost))
Output of code:
Accuracy (XGBoost): 0.93032029565753
Time taken to achieve result: 10.376214981079102 seconds
Another example (AdaBoost)
Classifier (AdaBoost) before PCA:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Timing covers fitting, prediction and accuracy scoring
start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % (timeAdaBoost))
Output of code:
Accuracy (AdaBoost): 0.9575762242069603
Time taken to achieve result: 103.38761949539185 seconds
Classifier (AdaBoost) after PCA:
# Same model and timing as before, but on the PCA-reduced data
start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train_Reduced, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test_Reduced)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % (timeAdaBoost))
Output of code:
Accuracy (AdaBoost): 0.9141515244841392
Time taken to achieve result: 295.6763050556183 seconds
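Each timing above covers fitting, prediction and accuracy scoring together, so it does not show which step slows down after PCA. A minimal sketch of timing fit and predict separately is given below. The helper `time_fit_predict` and the `MajorityClassifier` stand-in are hypothetical names introduced only for illustration, so that the snippet runs without the dataset or any extra libraries.

```python
import time


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (predictions, fit_seconds, predict_seconds) for a fit/predict model."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    predictions = model.predict(X_test)
    t2 = time.perf_counter()
    return predictions, t1 - t0, t2 - t1


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


# Tiny made-up data, only to show the helper running end to end
preds, fit_s, predict_s = time_fit_predict(
    MajorityClassifier(), [[0], [1], [2]], [1, 1, 0], [[3], [4]]
)
print("fit: %.6f s, predict: %.6f s" % (fit_s, predict_s))
```

The same helper should work with the XGBClassifier and AdaBoostClassifier objects above, since it only relies on their `fit` and `predict` methods. Calling it once with X_train and X_test and once with X_train_Reduced and X_test_Reduced would show whether the extra time is spent in fitting or in prediction.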
I would very much appreciate any help in the matter of understanding what I have done wrong (or right).
Thank you all in advance