How to avoid a MemoryError when using Pandas pd.read_csv with GridSearchCV for a DecisionTreeRegressor model?

I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a data set sourced from a 20 million row, 12-dimensional CSV file. I read the data set in chunks of 500,000 rows (via chunksize) and computed the R-squared score on the training/test split in each 500,000-row iteration, up to iteration #20.

sklearn.__version__: 0.19.0 
pandas.__version__: 0.20.3 
numpy.__version__: 1.13.1

The GridSearchCV() instance uses a parameter grid with max_depth set to the values [4, 6].

I then see a memory error in the numpy module, with the Anaconda Python interpreter throwing an exception.

This is the exception after iteration #20:

Score for model trained on Test Dataset:  -0.000287864727209
Best Parameters:  {'max_depth': 4} 
Best Cross-Validation Accuracy:  -0.00037759422675 

Traceback (most recent call last):

  File "ipython-input-1-a28a1b71d60d", line 1, in module
    runfile('C:/Kal/Stat-Work/Stat-Code/SciKit/Final/DecisionTreeRegression-MaximumDepthFour.py', wdir='C:/Kal/Stat-Work/Stat-Code/SciKit/Final')

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
    execfile(filename, namespace)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Kal/Stat-Work/Stat-Code/SciKit/Final/DecisionTreeRegression-MaximumDepthFour.py", line 21, in module
    for piece in chunker:

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 978, in __next__
    return self.get_chunk()

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1042, in get_chunk
    return self.read(nrows=size)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1023, in read
    df = DataFrame(col_dict, columns=columns, index=index)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 275, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 411, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 5506, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4309, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4381, in form_blocks
    int_blocks = _multi_blockify(int_items)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4450, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)

  File "C:\Users\bkalahas\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4493, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)

MemoryError

This is the code:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Read the data set from the Test Regression CSV file in chunks of 500,000 rows.
chunker = pd.read_csv('C:/Kal/Stat-Work/Stat-Code/SciKit/Test_Data_Set_Regression.csv',
             names=['ROW', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'RT'],
             chunksize=500000, low_memory=False)

for piece in chunker:
    # Create the Training and Test data sets.
    X_train, X_test, y_train, y_test = train_test_split(
        piece[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']], piece['RT'], random_state=0)

    param_grid = {
                 'max_depth': [4, 6]
                 }

    # Instantiate GridSearchCV with the DecisionTreeRegressor model, the parameter
    # grid to search (param_grid), and the cross-validation strategy: 5-fold
    # cross-validation (plain KFold here, since this is regression).
    grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)

    # Call the fit method to run cross-validation for each combination
    # of parameters specified in param_grid.
    grid_search.fit(X_train, y_train)

    # Print the training score, the test score, the best parameters and the best CV score.
    print("Score for model trained on whole Training Dataset: ", grid_search.score(X_train, y_train))

    # Evaluate generalization performance by calling score on the Test data set.
    print("Score for model trained on Test Dataset: ", grid_search.score(X_test, y_test))
    print("Best Parameters: ", grid_search.best_params_)
    print("Best Cross-Validation Accuracy: ", grid_search.best_score_)

Questions:

  1. Please point to ways to overcome the Python memory error exception.
  2. What is the best way to implement a DecisionTreeRegressor model on a cluster of four 16-GB-RAM, 2.5-GHz machines (Linux or Windows)? I see memory errors at scale even with the DecisionTreeRegressor model. An SVM model does not even finish computing beyond 20,000 rows per chunk; the Python interpreter shows hung CPUs. But SVM is known to be computationally expensive. Is there a way around such memory errors for decision tree and ensemble models when combined with Pandas? Pandas is our in-memory data analytics engine.

The data set is particularly huge (an actual total of 100 million rows in the source database, produced almost every day). Running cross-validation against the entire 100 million row data set is computationally onerous.

I have come up with this Bagging-like scheme, where the data set is broken into samples of 500,000 rows each and the MSE is computed for each sample. The goal is to compute the average MSE across all samples, as sketched below.
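A minimal sketch of that per-sample averaging scheme, assuming the same column layout as the CSV above (the chunk_mses list and the fixed max_depth=4 are illustrative choices, not part of the original code):

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

FEATURES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

chunk_mses = []
chunker = pd.read_csv('C:/Kal/Stat-Work/Stat-Code/SciKit/Test_Data_Set_Regression.csv',
                      names=['ROW'] + FEATURES + ['RT'], chunksize=500000)

for piece in chunker:
    # Train/test split within each 500,000-row sample.
    X_train, X_test, y_train, y_test = train_test_split(
        piece[FEATURES], piece['RT'], random_state=0)
    model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
    # Record this sample's test-set MSE.
    chunk_mses.append(mean_squared_error(y_test, model.predict(X_test)))

print("Average MSE across all samples:", np.mean(chunk_mses))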

This question is important since it deals with scale, which is an everyday computational problem in ML algorithms. I would also appreciate critical answers on the various aspects of my code above, rather than down-votes without a reason. Thank you.



To answer your questions:

  1. The first way out is what you already did:

    Think about the data structure. Can you split it without losing information? Are there logical "clusters" or break points in your data? If this still causes issues, it is worth investigating further. In your example, why does the 20th iteration fail? There seems to be enough memory for one iteration, which means that, in principle, you should be able to iterate over all of your chunks.

    To understand what is going on, you can track your memory profile (see the tracemalloc sketch after this list). If you know what causes the memory error, you can explicitly save snapshots to disk or free memory, although I have experienced ownership issues between Python and underlying C/C++ classes there.

    The poor man's approach could also look like this: iterate over the N chunks, write each chunk's results to disk, repeat until all rows have been processed, and then combine the results (see the second sketch after this list).

  2. Do you mean a cluster with 4 nodes of 16 GB RAM each, or your local machine with 4 threads?

    Generally speaking, as seanv507 mentioned, find a (scalable) solution that works for a small sample of your data, then scale up to larger sets. Make sure that your memory allocation does not exceed system limits.
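A minimal sketch of the memory tracking mentioned in point 1, using the standard-library tracemalloc module (the CSV path and 500,000-row chunk size mirror the question; the per-chunk model fitting is elided):

import tracemalloc

import pandas as pd

tracemalloc.start()

chunker = pd.read_csv('C:/Kal/Stat-Work/Stat-Code/SciKit/Test_Data_Set_Regression.csv',
                      chunksize=500000)
for i, piece in enumerate(chunker):
    # ... fit and score the model on this chunk ...
    current, peak = tracemalloc.get_traced_memory()
    print("chunk %d: current=%.1f MB, peak=%.1f MB" % (i, current / 1e6, peak / 1e6))

tracemalloc.stop()

If the reported usage grows from chunk to chunk, a reference to earlier chunks is being kept alive somewhere; if it is flat, the failure is more likely a single oversized allocation.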
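And a sketch of the poor man's approach: process each chunk independently, persist the per-chunk result to disk, and combine at the end. The results_*.csv naming and the per-chunk describe() computation are placeholders for whatever you actually compute per chunk:

import glob

import pandas as pd

# Pass 1: process each chunk and write its result to disk.
chunker = pd.read_csv('C:/Kal/Stat-Work/Stat-Code/SciKit/Test_Data_Set_Regression.csv',
                      chunksize=500000)
for i, piece in enumerate(chunker):
    result = piece.describe()              # placeholder for the real per-chunk computation
    result.to_csv('results_%04d.csv' % i)
    del piece, result                      # drop references so memory can be reclaimed

# Pass 2: combine the per-chunk results once all rows have been processed.
combined = pd.concat(pd.read_csv(f, index_col=0)
                     for f in sorted(glob.glob('results_*.csv')))

This keeps only one chunk in memory at a time, at the cost of a second pass over the intermediate files.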
