Decision Tree taking too long to execute
I am training a Decision Tree Regressor on a relatively small dataset. The dimensions of my train and test sets are (34164, 10) and (8514, 10). Here is the relevant code:
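import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Target is the log of the price column
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)
# Split the remaining columns by dtype
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']
# Mean-impute numeric columns; impute and one-hot encode categorical columns
num_trans = SimpleImputer(strategy = 'mean')
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')),
                              ('onehotencode', OneHotEncoder(handle_unknown = 'ignore'))])
preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols),
                                            ('num', num_trans, num_cols)])
# Note: criterion = 'mae' is named 'absolute_error' from scikit-learn 1.0 onward
dtr_model = DecisionTreeRegressor(random_state = 69, criterion = 'mae')
dtr_pipe = Pipeline(steps = [('preproc', preproc), ('model', dtr_model)])
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2,
                                                    random_state = 69)
# BASELINE MODEL
cross_dtr_score = -1 * cross_val_score(dtr_pipe, train_x, train_y, cv = 5,
                                       n_jobs = -1, scoring = 'neg_mean_absolute_error')
base_dtr_score = cross_dtr_score.mean()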
The problem is that it takes too long to run, even for the baseline model: more than 15 minutes. This is the first time I have faced this problem; tree-based models usually do not take anywhere near this long, and the train and test sets are not large. So why does something as simple as a baseline model take so long to run?