Using sklearn knn imputation on a large dataset

I have a large dataset (~1 million rows by 400 features) and I want to impute the missing values using sklearn's KNNImputer.

Trying this straight off the bat, I hit memory problems, but I think I can solve this by chunking my dataset... I was hoping someone could confirm my method is sound and that I haven't missed any gotchas.

The sklearn KNNImputer has a fit method and a transform method, so I believe that if I fit the imputer instance on the entire dataset, I could then in theory go through the dataset in chunks, or even row by row, imputing all the missing values with the transform method and then reconstructing the newly imputed dataset.
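
Something like this is what I have in mind (a rough sketch; chunk_size is just a placeholder value and n_neighbors=5 is the sklearn default):

import numpy as np
from sklearn.impute import KNNImputer

# "data" is the full (n_rows, 400) array with NaNs marking missing values
imputer = KNNImputer(n_neighbors=5)
imputer.fit(data)

chunk_size = 10_000  # placeholder, tune to the available memory
imputed_chunks = []
for start in range(0, data.shape[0], chunk_size):
    chunk = data[start:start + chunk_size]
    imputed_chunks.append(imputer.transform(chunk))

imputed_data = np.vstack(imputed_chunks)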

I'm wondering whether there's an issue with this method regarding the chunk size, or is the transformation of each row independent of the others?

50% of the dataset rows are fully populated... would it be better in terms of computation to fit the imputer object on only this portion of the dataset?
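
If that approach makes sense, I imagine fitting on the complete rows only would look something like this (just a sketch; complete_mask is my own name):

import numpy as np
from sklearn.impute import KNNImputer

# Fit the imputer only on the fully populated rows...
complete_mask = ~np.isnan(data).any(axis=1)   # rows with no missing values
imputer = KNNImputer(n_neighbors=5)
imputer.fit(data[complete_mask])

# ...then transform the whole dataset in chunks as above.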



You could use a memmap:

import numpy as np
from tempfile import mkdtemp
import os.path as path

# Create a file to back the memmap (or point this at an existing
# .dat file that already contains your dataset).
filename = path.join(mkdtemp(), 'newfile.dat')

# Supposing your data is loaded in a variable named "data":
fp = np.memmap(filename, dtype='float32', mode='w+', shape=data.shape)
fp[:] = data[:]

You can check the full numpy.memmap documentation; the code above is based on it.

This way you:

  • only change the declaration of your data matrices, keeping your code as clean as possible
  • use NumPy's built-in ndarray subclass without having to explicitly manage data retrieval from disk
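
If the imputed output is also too big to hold in memory, a rough sketch of combining this with chunked imputation could look like the following (chunk_size is a placeholder, and imputer is assumed to be an already fitted KNNImputer instance):

chunk_size = 10_000  # placeholder, tune to your memory budget
out = np.memmap(path.join(mkdtemp(), 'imputed.dat'), dtype='float32',
                mode='w+', shape=fp.shape)

for start in range(0, fp.shape[0], chunk_size):
    # transform one slice at a time and write it straight to disk
    out[start:start + chunk_size] = imputer.transform(fp[start:start + chunk_size])

out.flush()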
