Advice on imputing temperature data with StatsModels MICE
This may be a dumb question but I can't figure out how to actually get the values imputed using StatsModels MICE back into my data. I have a dataframe (dfLocal) with hourly temperature records for five neighboring stations (LOC1:LOC5) over many years and I'd like to impute the missing data for any given site. Following the examples I have:
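import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(dfLocal)
fml = 'LOC1 ~ LOC2 + LOC3 + LOC4 + LOC5'
mi = mice.MICE(fml, sm.OLS, imp)  # renamed from `mice` so the module is not shadowed
results = mi.fit(10, 10)  # 10 burn-in cycles, 10 imputed data sets
print(results.summary())
# MICEData drops all-NaN rows and resets the index, so restore dfLocal's index
dfLocal.dropna(axis=0, how='all', inplace=True)
imp.data = imp.data.set_index(dfLocal.index)
# In this case I only want to fill one specific set of missing data,
# hence gapStart and gapEnd
dfLocal.loc[gapStart:gapEnd, 'LOC1'] = imp.data['LOC1']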
My understanding of MICE is broadly that missing values are imputed multiple times and then combined to find the best value from the many. The only way I've found to actually get any numbers out of the above code is with imp.data, but I'm afraid that might just be one of the individual imputations before they're combined? All I can seem to get from fitting the model (results), though, is the summary?
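To make the concern concrete, here is a minimal sketch of one way to look at several imputed data sets rather than only the last one. It runs on synthetic data standing in for dfLocal (the frame `df`, the gap rows, and the names `draws`, `mean_fill`, and `spread` are all assumptions made for illustration). It calls MICEData.update_all() repeatedly, copies the LOC1 column after each cycle, and averages the copies. Averaging the draws is just one simple choice for a point fill, not something statsmodels does for you; as far as I can tell, MICE.fit pools the fitted model's parameters, not the imputed values.

```python
import numpy as np
import pandas as pd
from statsmodels.imputation import mice

# MICEData draws from NumPy's global random state, so seed it for repeatability
np.random.seed(0)

# Synthetic stand-in for dfLocal: five correlated "stations" (not real data)
n = 200
base = np.random.normal(15.0, 5.0, size=n)
df = pd.DataFrame(
    {f"LOC{i}": base + np.random.normal(0.0, 1.0, size=n) for i in range(1, 6)}
)

# Knock out a block of LOC1 values to mimic a gap in the record
df.loc[50:69, "LOC1"] = np.nan
mask = df["LOC1"].isna().to_numpy()

imp = mice.MICEData(df)

n_burnin, n_imputations = 10, 10
imp.update_all(n_burnin)  # burn-in cycles, results discarded

draws = []
for _ in range(n_imputations):
    imp.update_all()  # one more full imputation cycle
    # imp.data is overwritten on every cycle, so keep a copy of this draw
    draws.append(imp.data["LOC1"].to_numpy().copy())

draws = np.vstack(draws)        # shape: (n_imputations, n_rows)
mean_fill = draws.mean(axis=0)  # per-row average across the imputed data sets
spread = draws.std(axis=0)      # per-row spread; zero where LOC1 was observed

# Fill only the rows that were actually missing
filled = df["LOC1"].to_numpy().copy()
filled[mask] = mean_fill[mask]
```

The `spread` array gives a rough sense of how much the individual imputations disagree for each missing hour, which a single snapshot of imp.data cannot show.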
I'm far from a statistician (and not much of a programmer either), so I've been reading through the code for mice.MICE and other resources on general MICE applications, but I'd appreciate any guidance on this as I can't find much online about using statsmodels' MICE. Normally I'd post some data on Gist, but the full set is a bit large. That said, I'll upload it if y'all think it would help.
Thanks!