Using sklearn KNN imputation on a large dataset
I have a large dataset (~1 million rows by 400 features) and I want to impute the missing values using sklearn's KNNImputer.
Trying this off the bat, I hit memory problems, but I think I can work around them by chunking my dataset. I was hoping someone could confirm that my method is sound and that I'm not missing any gotchas.
Since KNNImputer has separate fit and transform methods, I believe that if I fit the imputer instance on the entire dataset, I could then go through the dataset in chunks, or even row by row, imputing the missing values with transform, and reconstruct a fully imputed dataset at the end.
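Roughly what I have in mind is the sketch below (the file name, chunk size, and n_neighbors are just placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("data.csv")  # placeholder path to my dataset

imputer = KNNImputer(n_neighbors=5)
imputer.fit(df)  # fit once on the entire dataset

# Transform in chunks instead of all at once to limit peak memory.
chunk_size = 10_000
imputed_chunks = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    imputed_chunks.append(imputer.transform(chunk))

imputed = np.vstack(imputed_chunks)  # reconstruct the full imputed array
```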
Is there an issue with this approach regarding the chunk size, or is the transform of each row independent of the other rows in its chunk?
About 50% of the rows are fully populated. Would it be better, computationally, to fit the imputer on only that portion of the dataset?
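If so, I assume (continuing from the snippet above) it would look something like this, with the transform still done in chunks:

```python
# Fit only on the rows with no missing values, then transform everything.
complete_rows = df.dropna()
imputer = KNNImputer(n_neighbors=5).fit(complete_rows)
```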
Topic data-imputation scikit-learn
Category Data Science