How to avoid a MemoryError when reading a CSV in chunks with pandas pd.read_csv and fitting a DecisionTreeRegressor model with GridSearchCV?
I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a data set sourced from a 20-million-row, 12-column CSV file. I read the file in chunks of 500,000 rows (via chunksize) and, for each chunk, compute the R-squared score on a training/test split of that chunk. This works up to iteration #20.
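For scale: 500,000 rows x 12 columns x 8 bytes is only about 48 MB per chunk at float64, so each chunk should be small by itself. A minimal sketch of the chunked-reading pattern I am using, with explicit dtypes added as an assumption on my part (they prevent pandas from inferring wider or object dtypes while parsing):

import numpy as np
import pandas as pd

cols = ['ROW', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'RT']
# Assumption: every column is numeric, so float32 halves the per-chunk
# footprint (500,000 x 12 x 4 bytes, roughly 24 MB) versus default float64.
dtypes = {name: np.float32 for name in cols}

chunker = pd.read_csv('Test_Data_Set_Regression.csv', names=cols,
                      dtype=dtypes, chunksize=500000)
for i, piece in enumerate(chunker, start=1):
    # Report the shape and actual memory footprint of each chunk.
    print(i, piece.shape, piece.memory_usage(deep=True).sum())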
sklearn.__version__: 0.19.0
pandas.__version__: 0.20.3
numpy.__version__: 1.13.1
The GridSearchCV() instance uses a parameter grid with max_depth set to the values [4, 6].
After iteration #20, the Anaconda Python interpreter raises a MemoryError from inside the numpy module. The traceback shows the failure happens while pandas materializes the next chunk (np.empty in _stack_arrays), i.e., before any model is trained. This is the exception:
Score for model trained on Test Dataset: -0.000287864727209
Best Parameters: {'max_depth': 4}
Best Cross-Validation Accuracy: -0.00037759422675
Traceback (most recent call last):
File "ipython-input-1-a28a1b71d60d", line 1, in module
runfile('C:/Kal/Stat-Work/Stat-Code/SciKit/Final/DecisionTreeRegression-MaximumDepthFour.py', wdir='C:/Kal/Stat-Work/Stat-Code/SciKit/Final')
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Kal/Stat-Work/Stat-Code/SciKit/Final/DecisionTreeRegression-MaximumDepthFour.py", line 21, in module
for piece in chunker:
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 978, in __next__
return self.get_chunk()
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1042, in get_chunk
return self.read(nrows=size)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1023, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 275, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 411, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 5506, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4309, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4381, in form_blocks
int_blocks = _multi_blockify(int_items)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4450, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4493, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
This is the code:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Read the dataset into DataFrame chunks from the Test Regression CSV file.
chunker = pd.read_csv('C:/Kal/Stat-Work/Stat-Code/SciKit/Test_Data_Set_Regression.csv',
                      names=['ROW', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'RT'],
                      chunksize=500000, low_memory=False)

for piece in chunker:
    # Create Training and Test Datasets from the current chunk.
    X_train, X_test, y_train, y_test = train_test_split(
        piece[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']],
        piece['RT'], random_state=0)

    param_grid = {
        'max_depth': [4, 6]
    }

    # Instantiate GridSearchCV with the DecisionTreeRegressor model, the
    # parameter grid to search (param_grid), and 5-fold cross-validation
    # (KFold; stratification does not apply to regression targets).
    grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)

    # fit() runs cross-validation for each parameter combination in
    # param_grid: 5 folds x 2 candidates = 10 fits, plus one refit on the
    # whole training split, so each chunk trains 11 trees.
    grid_search.fit(X_train, y_train)

    # Print the best score, best parameters, and the test score.
    print("Score for model trained on whole Training Dataset: ", grid_search.score(X_train, y_train))
    # Evaluate the generalization performance on the held-out Test Dataset.
    print("Score for model trained on Test Dataset: ", grid_search.score(X_test, y_test))
    print("Best Parameters: ", grid_search.best_params_)
    # Note: best_score_ is R-squared for a regressor, not accuracy.
    print("Best Cross-Validation Accuracy: ", grid_search.best_score_)
Questions:
- Please point to ways to overcome the MemoryError exception.
- What is the best way to implement a DecisionTreeRegressor model on a cluster of four machines, each with 16 GB of RAM and a 2.5-GHz CPU (Linux or Windows)? I see memory errors at scale even with the DecisionTreeRegressor model. An SVM model does not even finish computing beyond 20,000 rows per chunk; the Python interpreter shows hung CPUs, although SVM is known to be computationally expensive. Is there a way around such memory errors for decision-tree and ensemble models when combined with pandas? Pandas is our in-memory data-analytics engine.
The dataset is particularly large: the source database holds an actual total of 100 million rows, produced almost every day. Running cross-validation against the entire 100-million-row dataset is computationally onerous.
I have come up with this bagging-like scheme, in which the dataset is broken into samples of 500,000 rows each and the MSE is computed for each sample; the goal is to compute the average MSE across all samples.
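For concreteness, here is a minimal sketch of that averaging scheme, assuming one fixed-depth tree per 500,000-row sample (the max_depth=4 choice and the use of sklearn.metrics.mean_squared_error are my assumptions, not part of the code above):

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
mse_per_sample = []

chunker = pd.read_csv('Test_Data_Set_Regression.csv',
                      names=['ROW'] + features + ['RT'], chunksize=500000)
for piece in chunker:
    X_train, X_test, y_train, y_test = train_test_split(
        piece[features], piece['RT'], random_state=0)
    # One tree per 500,000-row sample, as in the bagging-like scheme above.
    tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
    mse_per_sample.append(mean_squared_error(y_test, tree.predict(X_test)))

# Average the per-sample MSEs across all chunks.
print("Mean MSE across samples:", np.mean(mse_per_sample))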
This question is important because it deals with scale, which is an everyday computational problem in ML algorithms. I would also appreciate critical answers on specific aspects of my code above rather than downvotes without reason. Thank you.
Topic ensemble-modeling decision-trees scikit-learn pandas python
Category Data Science