Advice on imputing temperature data with StatsModels MICE

Question

plytheman

April 29, 2022 18:01

This may be a dumb question but I can't figure out how to actually get the values imputed using StatsModels MICE back into my data. I have a dataframe (dfLocal) with hourly temperature records for five neighboring stations (LOC1:LOC5) over many years and I'd like to impute the missing data for any given site. Following the examples I have:

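import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(dfLocal)
fml = 'LOC1 ~ LOC2 + LOC3 + LOC4 + LOC5'
# Use a separate name so the imported mice module is not shadowed
mice_model = mice.MICE(fml, sm.OLS, imp)
results = mice_model.fit(10, 10)  # 10 burn-in cycles, then 10 imputations
print(results.summary())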

dfLocal.dropna(axis=0, how='all', inplace=True)
imp.data = imp.data.set_index(dfLocal.index)

# In this case I only want to fill one specific set of missing data,
# hence gapStart and gapEnd
dfLocal.loc[gapStart:gapEnd, 'LOC1'] = imp.data[fillSite]

My understanding of MICE is broadly that missing values are imputed multiple times and then combined to find the best value from the many. The only way I've found to actually get any numbers out of the above code is with imp.data but I'm afraid that might just be one of the individual imputations before they're combined? All I can seem to get from fitting the model (results), though, is the summary?

I'm far from a statistician (and not much of a programmer either) so I've been reading through the code for mice.MICE and other resources on general MICE applications, but I'd appreciate any guidance on this as I can't find much about using statsmodels' MICE online. Normally I'd post some data on Gist but the full set is a bit large. That said, I'll upload it if y'all think it would help.

Thanks!

Topic data-imputation time-series python statistics

Category Data Science

Ben Reiniger · Accepted Answer · July 3, 2019 02:41

MICE does generate several datasets, but it does not then combine these datasets. Rather, it fits your model on each of those datasets and combines those models. If you really need an imputed dataset, you could just choose one or combine them in whatever way makes sense for your problem (or you might be better off with another method).

Now, for the statsmodels implementation, imp.data only keeps track of the latest imputed set [1]. To get all of the datasets, you can loop through the updates yourself (calling imp.update_all() and saving imp.data each time) rather than using fit, as in an example in [2].
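A rough sketch of that loop, again on made-up data (`demo`, `gap`, and the iteration counts are assumptions for illustration only). Averaging the draws at the end is just one possible way to combine them, not something MICE prescribes, and it will understate the variability of the missing values:

```python
import numpy as np
import pandas as pd
from statsmodels.imputation import mice

# Synthetic stand-in for dfLocal (hypothetical data)
rng = np.random.default_rng(1)
n = 200
regional = rng.normal(15, 5, size=n)
demo = pd.DataFrame(
    {f"LOC{i}": regional + rng.normal(0, 1, size=n) for i in range(1, 6)}
)
gap = np.zeros(n, dtype=bool)
gap[50:70] = True
demo.loc[gap, "LOC1"] = np.nan

imp_demo = mice.MICEData(demo)
n_burnin, n_imputations = 10, 5  # arbitrary choices for the sketch

imp_demo.update_all(n_burnin)  # let the chain settle before keeping anything
imputed_sets = []
for _ in range(n_imputations):
    imp_demo.update_all()  # one more full cycle over all variables
    imputed_sets.append(imp_demo.data.copy())  # copy: data is updated in place

# One column per imputation; observed rows are identical across columns
loc1_draws = pd.concat([d["LOC1"] for d in imputed_sets], axis=1)
loc1_mean = loc1_draws.mean(axis=1)  # a simple point estimate for the gap
loc1_sd = loc1_draws.std(axis=1)     # spread across imputations
```

Note that MICEData drops all-missing rows and resets the index, so the rows of `imputed_sets` line up with the original frame only after the same dropna and re-indexing step shown in the question.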
