How to compare two multivariate methods for filling NA values
In the Titanic dataset, I used two methods to fill the missing (NA) values in Age. The first is regression using Lasso:
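from sklearn.linear_model import Lasso
# AgefillnaModel_X is assumed to be the DataFrame of predictor columns taken from DF,
# and ageNaIn the index of the rows in DF where Age is missing (both defined earlier).
AgefillnaModel = Lasso(copy_X=False)
AgefillnaModel_X.dropna(inplace=True)
y = DF.Age.dropna(inplace=False)
AgefillnaModel.fit(AgefillnaModel_X, y)
DF.loc[ageNaIn, 'Age'] = AgefillnaModel.predict(DF.loc[ageNaIn, AgefillnaModel_X.columns])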
The second method uses IterativeImputer() from sklearn.impute:
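import pandas as pd
# IterativeImputer is experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Setting the random_state argument for reproducibility
imputer = IterativeImputer(random_state=42)
imputed = imputer.fit_transform(DF)
# Keep the original index so the imputed rows still line up with DF
df_imputed = pd.DataFrame(imputed, columns=DF.columns, index=DF.index)
round(df_imputed, 2)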
Now, how can I decide which one is better?
Here is a scatter plot of the resulting Age values against Sex: [figure not included]
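One common way to decide between two imputation methods is to hide some Age values that are actually known, impute them with each method, and compare the predictions against the hidden truth. The following is only a minimal sketch of that idea, not a definitive procedure. The function name `compare_age_imputers`, the `holdout_frac` parameter, and the synthetic `demo` frame are made up for illustration and are not the Titanic data. The sketch also assumes that every column is numeric and that only the target column contains missing values.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso


def compare_age_imputers(df, target="Age", holdout_frac=0.2, seed=42):
    """Hide a fraction of the known target values, impute them with both
    methods, and return the RMSE of each method on the hidden values."""
    rng = np.random.default_rng(seed)
    known = df.index[df[target].notna()]
    hidden = rng.choice(known, size=int(len(known) * holdout_frac), replace=False)
    truth = df.loc[hidden, target].to_numpy()

    # Copy of the data in which the held-out values are masked as missing
    masked = df.copy()
    masked.loc[hidden, target] = np.nan
    features = [c for c in df.columns if c != target]

    # Method 1: Lasso fitted on the rows whose target is still observed
    train = masked[masked[target].notna()]
    lasso = Lasso().fit(train[features], train[target])
    lasso_pred = lasso.predict(masked.loc[hidden, features])

    # Method 2: IterativeImputer applied to the whole masked frame
    imputer = IterativeImputer(random_state=seed)
    imputed = pd.DataFrame(
        imputer.fit_transform(masked), columns=df.columns, index=df.index
    )
    iter_pred = imputed.loc[hidden, target].to_numpy()

    def rmse(pred):
        return float(np.sqrt(np.mean((pred - truth) ** 2)))

    return {"lasso": rmse(lasso_pred), "iterative": rmse(iter_pred)}


# Synthetic stand-in data (hypothetical columns and values, for illustration only)
data_rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Pclass": data_rng.integers(1, 4, n).astype(float),
    "SibSp": data_rng.integers(0, 4, n).astype(float),
    "Fare": data_rng.uniform(5, 100, n),
})
demo["Age"] = 50 - 8 * demo["Pclass"] - 2 * demo["SibSp"] + data_rng.normal(0, 5, n)
demo.loc[data_rng.choice(n, 40, replace=False), "Age"] = np.nan  # truly missing rows

scores = compare_age_imputers(demo)
print(scores)
```

The method with the lower RMSE on the hidden values reproduces known ages more closely. Repeating the comparison over several seeds gives a steadier estimate than a single split. If the imputed Age is later fed to a classifier, comparing the downstream cross-validation score under each imputation is another reasonable criterion.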