Decision Tree taking too long to execute

I am training a Decision Tree Regressor on a relatively small dataset. The train and test sets have shapes (34164, 10) and (8514, 10). Here is the relevant code:

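import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# data2 is the full DataFrame; the target is the log of the price
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)

# Split the feature columns by dtype
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']

# Numeric columns: mean imputation. Categorical columns: mode imputation, then one-hot encoding
num_trans = SimpleImputer(strategy = 'mean')
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')),
                              ('onehotencode', OneHotEncoder(handle_unknown = 'ignore'))])

preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols),
                                            ('num', num_trans, num_cols)])

# Note: criterion 'mae' is called 'absolute_error' in scikit-learn >= 1.0
dtr_model = DecisionTreeRegressor(random_state = 69, criterion = 'mae')

dtr_pipe = Pipeline(steps = [('preproc', preproc), ('model', dtr_model)])

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2,
                                                    random_state = 69)

# BASELINE MODEL
# cross_val_score returns negated MAE, so multiply by -1 to get positive errors
cross_dtr_score = -1 * cross_val_score(dtr_pipe, train_x, train_y, cv = 5,
                                       n_jobs = -1, scoring = 'neg_mean_absolute_error')
base_dtr_score = cross_dtr_score.mean()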

The problem is that it takes too long to run, even for the baseline model. This is the first time I have faced this problem, as tree-based models usually do not take this long, and the train and test sets are not huge. So why does something as simple as a baseline model take so long? By a long time I mean more than 15 minutes!



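For anyone facing a similar problem: the cause is the tree's split criterion, not the scoring metric passed to cross_val_score. With criterion = 'mae', DecisionTreeRegressor needs the median of the samples on each side of every candidate split, which is much slower than the default mean squared error criterion, where only running sums are needed. That is why I was seeing such long execution times. I switched to mean squared error and the execution time decreased drastically. So if the choice of criterion is not a restriction, I'd advise going with mean squared error for faster training.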
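To see the difference on your own machine, here is a minimal timing sketch. It uses synthetic data rather than the dataset from the question, so the array sizes and the target formula are assumptions chosen only for illustration. It also picks the criterion names by scikit-learn version, since 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in version 1.0.

```python
import time

import numpy as np
import sklearn
from sklearn.tree import DecisionTreeRegressor

# Criterion names were renamed in scikit-learn 1.0:
# 'mse' -> 'squared_error', 'mae' -> 'absolute_error'.
major = int(sklearn.__version__.split('.')[0])
SQUARED, ABSOLUTE = ('squared_error', 'absolute_error') if major >= 1 else ('mse', 'mae')

# Synthetic stand-in data (an assumption, not the data from the question).
rng = np.random.RandomState(69)
X = rng.rand(2000, 10)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=2000)


def fit_seconds(criterion):
    """Fit one unpruned tree with the given criterion and return the wall-clock time."""
    model = DecisionTreeRegressor(random_state=69, criterion=criterion)
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = {c: fit_seconds(c) for c in (SQUARED, ABSOLUTE)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

The exact numbers depend on hardware and scikit-learn version. Note that scoring = 'neg_mean_absolute_error' in cross_val_score can stay as it is: the scoring metric is computed once per fold on the predictions and is cheap, whereas the criterion is evaluated at every candidate split during training.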
