Decision Tree taking too long to execute

Question

spectre

December 1, 2021 09:46

I am training a Decision Tree Regressor on a relatively small dataset. The shapes of my train and test sets are (34164, 10) and (8514, 10). Here is the relevant code:

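import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# data2 is a pandas DataFrame loaded earlier (not shown)
y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)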

num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']

num_trans = SimpleImputer(strategy = 'mean')
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')), 
                          ('onehotencode', OneHotEncoder(handle_unknown = 'ignore'))])

preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols), 
                                        ('num', num_trans, num_cols)])


# In scikit-learn >= 1.0 this criterion is spelled 'absolute_error'
dtr_model = DecisionTreeRegressor(random_state = 69, criterion = 'mae')

dtr_pipe = Pipeline(steps = [('preproc', preproc), ('model', dtr_model)])

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2,
                                                    random_state = 69)


# BASELINE MODEL
cross_dtr_score = -1 * cross_val_score(dtr_pipe, train_x, train_y, cv = 5,
                                    n_jobs = -1, scoring = 'neg_mean_absolute_error')
base_dtr_score = cross_dtr_score.mean()

The problem is that this takes too long to run, even for the baseline model: more than 15 minutes! This is the first time I have run into this, as tree-based models usually do not take anywhere near this long, and the train and test sets are not large. So why does something as simple as a baseline model take so long?

Topic decision-trees cross-validation databases

Category Data Science

spectre · Accepted Answer · December 1, 2021 09:46

For anyone facing a similar problem: the cause is the MAE criterion. With criterion = 'mae', the tree has to evaluate the mean absolute error for every candidate split while it is being built, and that is much more expensive to compute than the mean squared error. Hence the long execution times. I switched to mean_squared_error and the execution time decreased drastically. So if the choice of criterion is not a restriction, I'd advise going with mean_squared_error for faster training.
