How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values. I've tried sklearn's SimpleImputer with all of its strategies, the KNNImputer with several values of k, and the IterativeImputer with various combinations of imputation order, estimator, and number of iterations. My question is: how do I evaluate the imputation techniques and choose the best one for my data? I can't run a baseline model and evaluate its performance, because I'm not familiar with balancing the data and …
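One common way to compare imputers without a downstream model is to hide values you actually know, impute them, and score the reconstruction error. A minimal sketch of that idea, using synthetic stand-in data rather than the asker's 29-feature set:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))          # stand-in for rows with no missing values

# Hide 10% of the entries at random to create an artificial missingness mask
mask = rng.random(X_full.shape) < 0.10
X_missing = X_full.copy()
X_missing[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_missing)
    # Score only the entries that were artificially hidden
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    print(f"{name}: RMSE on masked entries = {rmse:.3f}")
```

The imputer with the lowest error on the masked entries is the best candidate, assuming the artificial missingness resembles the real missingness mechanism.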
Category: Data Science

How to compare between two methods of multivariate to filling NA

In the Titanic dataset, I tried two methods to fill the missing Age values. The first is regression using Lasso:

```python
from sklearn.linear_model import Lasso

AgefillnaModel = Lasso(copy_X=False)
AgefillnaModel_X.dropna(inplace=True)
y = DF.Age.dropna(inplace=False)
AgefillnaModel.fit(AgefillnaModel_X, y)
DF.loc[ageNaIn, 'Age'] = AgefillnaModel.predict(DF.loc[ageNaIn, AgefillnaModel_X.columns])
```

The second method uses IterativeImputer from sklearn.impute:

```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Setting the random_state argument for reproducibility
imputer = IterativeImputer(random_state=42)
imputed = imputer.fit_transform(DF)
df_imputed = pd.DataFrame(imputed, columns=DF.columns)
round(df_imputed, 2)
```

Now, how can I decide which one is better? Here is the resulting scatter of Age …
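The two methods can be compared on rows where Age is actually known: hide a portion of those ages, run both methods, and score each against the hidden truth. A sketch of that comparison, with a small synthetic frame standing in for the question's `DF`:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
DF = pd.DataFrame({
    "Age": rng.uniform(1, 80, 300),
    "Fare": rng.uniform(5, 100, 300),
    "Pclass": rng.integers(1, 4, 300).astype(float),
})

# Hide 20% of the known ages and remember the truth
hide = rng.random(len(DF)) < 0.20
truth = DF.loc[hide, "Age"].copy()
DF.loc[hide, "Age"] = np.nan

# Method 1: Lasso regression trained on the complete rows
X = DF.drop(columns="Age")
model = Lasso().fit(X[~hide], DF.loc[~hide, "Age"])
pred_lasso = model.predict(X[hide])

# Method 2: IterativeImputer over the whole frame
imputed = IterativeImputer(random_state=42).fit_transform(DF)
pred_iter = pd.DataFrame(imputed, columns=DF.columns).loc[hide, "Age"]

for name, pred in [("Lasso", pred_lasso), ("IterativeImputer", pred_iter)]:
    rmse = np.sqrt(np.mean((np.asarray(pred) - truth.values) ** 2))
    print(f"{name}: RMSE = {rmse:.2f}")
```

Whichever method reconstructs the hidden ages with lower error is the better choice for this frame, under the assumption that held-out ages are missing in the same way as the real ones.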
Category: Data Science

Missing population values in census data

I have population data from Census.gov: total US population by age, by year, from 1940 through 2010. Depending on the range of decades, the data is missing discrete population values for ages greater than a certain cutoff; instead, an aggregate amount is provided that represents all ages greater than the cutoff. Specifically, it follows this pattern:

1940 to 1979: discrete data from 0 to 84, and an aggregate for ages 85 and greater
1980 to 1999: discrete data from 0 to …
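One simple way to split such an aggregated tail into single ages is proportional allocation: borrow the tail's age distribution from a period where the detail is observed and scale it to the aggregate total. A sketch with made-up numbers (the real distribution would come from a decade with discrete 85+ data):

```python
import numpy as np

agg_85_plus = 1200.0                            # aggregate count to distribute
ref = np.array([400.0, 300.0, 200.0, 100.0])    # observed counts for ages 85-88 in a later decade
shares = ref / ref.sum()                        # reference age distribution of the tail
estimated = agg_85_plus * shares                # proportional allocation of the aggregate
```

This preserves the aggregate total exactly while imposing the reference decade's shape on the missing detail, which is an assumption worth stating alongside any results.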
Category: Data Science

Missing value Imputation in dataset

I have two separate files for testing and training. In the training data, I am dropping rows that contain too many missing values. But in the test data, I cannot afford to drop rows, so I have chosen to impute the missing values using a KNN approach. My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? As in, neighbors …
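The usual pattern is to fit the imputer on the training data and only transform the test data, so that each test row is imputed from training neighbors rather than from other test rows. A minimal sketch with synthetic arrays:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_test = rng.normal(size=(20, 4))
X_test[0, 2] = np.nan                     # a missing value in the test set

imputer = KNNImputer(n_neighbors=5)
imputer.fit(X_train)                      # neighbors come from the training set
X_test_imp = imputer.transform(X_test)    # test rows are imputed independently of each other
```

This also avoids leaking test-set statistics into the preprocessing, which matters if the test file is meant to simulate unseen data.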
Category: Data Science

Using sklearn knn imputation on a large dataset

I have a large dataset, roughly 1 million rows by 400 features, and I want to impute the missing values using sklearn's KNNImputer. Trying this off the bat, I hit memory problems, but I think I can solve this by chunking my dataset. I was hoping someone could confirm that my method is sound and that I haven't hit any gotchas. The sklearn KNNImputer has a fit method and a transform method, so I believe that if I fit the imputer instance on …
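The fitted KNNImputer keeps the fit data internally, so chunking the transform bounds the size of the chunk-versus-fit-data distance matrix, which is where the memory goes. A small sketch of the chunked pattern (fitting on the full data here; a representative sample could be used if even the fit is too large):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.05] = np.nan

imputer = KNNImputer(n_neighbors=5).fit(X)

# Transform in fixed-size chunks so only one chunk's distance matrix
# is held in memory at a time
chunks = []
for start in range(0, X.shape[0], 200):
    chunks.append(imputer.transform(X[start:start + 200]))
X_imputed = np.vstack(chunks)
```

Because each chunk is imputed against the same fitted data, the result is identical to a single full-array transform, just computed piecewise.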
Category: Data Science

Should I Impute target values?

I am new to data science and I am currently playing around a bit. Data exploration and preparation is really tedious, even though I use pandas. I managed to impute missing values in the independent variables: for numerical data I used the imputer with the mean strategy, and for one categorical variable I used the LabelEncoder and then imputed with the mode strategy. But now I face the issue that the dependent variable $y$ also contains missing values. Should I delete those lines …
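The common advice is to drop rows where the target itself is missing rather than impute it, since imputed targets would train the model on fabricated labels. A tiny sketch with placeholder column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                   "y":  [10.0, np.nan, 30.0, np.nan]})

labelled = df.dropna(subset=["y"])       # keep only rows with a known target
X, y = labelled[["x1"]], labelled["y"]
print(len(labelled))  # 2
```

The rows with missing $y$ can still be useful later as unlabelled data (e.g. for semi-supervised methods), but not for ordinary supervised training.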
Category: Data Science

Advice on imputing temperature data with StatsModels MICE

This may be a dumb question, but I can't figure out how to actually get the values imputed using statsmodels MICE back into my data. I have a dataframe (dfLocal) with hourly temperature records for five neighboring stations (LOC1:LOC5) over many years, and I'd like to impute the missing data for any given site. Following the examples, I have:

```python
imp = mice.MICEData(dfLocal)
fml = 'LOC1 ~ LOC2 + LOC3 + LOC4 + LOC5'
mice = mice.MICE(fml, sm.OLS, imp)
results = …
```
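The `MICEData` object keeps its working, imputed copy of the frame in the `data` attribute, so after running imputation cycles the filled-in values can be read back from there. A sketch with synthetic data standing in for `dfLocal`:

```python
import numpy as np
import pandas as pd
from statsmodels.imputation import mice

rng = np.random.default_rng(0)
dfLocal = pd.DataFrame(rng.normal(size=(100, 3)),
                       columns=["LOC1", "LOC2", "LOC3"])
dfLocal.loc[rng.random(100) < 0.1, "LOC1"] = np.nan

imp = mice.MICEData(dfLocal)
imp.update_all(5)                 # run 5 imputation cycles
filled = imp.data                 # DataFrame with the missing values imputed
```

Note that `imp.data` holds one draw from the imputation process; for proper multiple imputation, the MICE analysis would be repeated over several such draws.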
Category: Data Science

How to implement single Imputation from conditional distribution?

In [*], page 264, a method of drawing a missing value from a conditional distribution $P(\bf{x}_{mis}|\bf{x}_{obs};\theta)$ is described. I did not find any code implementation of this approach. My question is: how can it be implemented? Should we integrate the distribution with respect to an assumed interval of $\bf{x}_{mis}$? Or is this just an intuitive mathematical representation that should be understood, while the implementation looks different? [*] Theodoridis, S., & Koutroumbas, K. "Pattern Recognition." Fourth Edition, 9781597492720, 2008
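For a concrete case: if the model assumed for $\theta$ is a multivariate Gaussian, the conditional $P(\bf{x}_{mis}|\bf{x}_{obs};\theta)$ is itself Gaussian with a closed-form mean and variance, so "drawing from the conditional" means sampling from that Gaussian directly, with no integration over $\bf{x}_{mis}$. A minimal two-variable sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])                  # theta: mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # theta: covariance matrix

x_obs = 1.5                                # observed first coordinate
# Closed-form conditional of x2 given x1 = x_obs:
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x_obs - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]

x_mis = rng.normal(cond_mean, np.sqrt(cond_var))   # one single-imputation draw
```

For non-Gaussian models the same idea applies, but the draw may require a sampler (e.g. rejection sampling or MCMC) when the conditional has no closed form.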
Category: Data Science

Is there a way to impute missing values by clustering, regression and stochastic regression?

I'd like to know if there are any libraries that allow imputation by clustering, regression, or stochastic regression. So far, I've done imputation by mean, median, and KNN. I'm trying to evaluate the best imputation method for a small dataset (Iris, in this case). I had to deliberately create NaN values, since the Iris set has none. My code for KNN imputation:

```python
import pandas as pd
import numpy as np
import random
from fancyimpute import KNN

data = pd.read_csv("D:/Iris_classification/train.csv")
mat = …
```
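Stochastic regression imputation in particular is easy to build from scikit-learn pieces: regress the incomplete column on the complete ones, predict the missing entries, and add Gaussian noise with the residual standard deviation so the imputed values keep a realistic spread. A self-contained sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 200)   # column to impute

miss = rng.random(200) < 0.15
y_obs = y.copy()
y_obs[miss] = np.nan

# Fit the regression on complete rows only
reg = LinearRegression().fit(X[~miss], y[~miss])
resid_sd = np.std(y[~miss] - reg.predict(X[~miss]))

# Deterministic prediction plus residual-scale noise = stochastic regression
y_obs[miss] = reg.predict(X[miss]) + rng.normal(0, resid_sd, miss.sum())
```

Dropping the noise term gives plain (deterministic) regression imputation; the noise is what prevents the imputed column's variance from shrinking artificially.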
Category: Data Science

Should I impute the missing values before the train-validation split?

Validation is supposed to provide an unbiased evaluation of a model fit on the training data. In that case, imputation before the train-validation split could cause an indirect data leakage, because the data that is supposed to act as test data is already contaminated by the imputation. So the correct approach would be to calculate the statistics (mean, mode) with only the training data and fill the missing values of both the training and validation data with them. That, for every partition of training …
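The reasoning above is exactly what sklearn's fit/transform split encodes: fit the imputer on the training split only, then apply the fitted statistics to both splits. A minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_val = train_test_split(X, test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy="mean").fit(X_train)  # statistics from train only
X_train_imp = imputer.transform(X_train)
X_val_imp = imputer.transform(X_val)                   # no validation data leaks in
```

Wrapping the imputer in a `Pipeline` with the model extends the same guarantee to every partition in cross-validation automatically.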
Category: Data Science

K-Fold cross validation and data leakage

I want to do K-fold cross-validation, and I also want to do normalization or feature scaling within each fold. So let's say we have k folds. At each step, we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to fit the feature scaling and data imputation on that training set and then apply the same transformation to that validation set. I want to do this at each step. I am trying …
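This per-fold refitting is exactly what a `Pipeline` passed to `cross_val_score` does: on each fold, the imputer and scaler are fit on that fold's training portion and then applied to its validation portion. A compact sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)   # preprocessing refit per fold, no leakage
```

Writing the folds by hand with `KFold` works too, but the pipeline version makes the no-leakage property hard to get wrong.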
Category: Data Science

How to impute using simple imputer (custom function)

I am imputing my data using SimpleImputer from sklearn, and I want to test many different ways of applying transformations to the data. For example, for logistic regression I would like to: remove NaNs and replace them with the mode; replace +inf with the max and -inf with the min; then use StandardScaler. For XGBoost I would instead like to simply replace -inf/+inf with very large positive or negative numbers. I have been playing with the sklearn Pipeline and I would like to know how …
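One way to keep the per-model preprocessing configurable is a small `Pipeline` per model, using `FunctionTransformer` for the custom inf handling. A sketch of the logistic-regression variant; the sentinel values and function name are illustrative assumptions, not the asker's code:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clip_infs(X):
    # Replace +/-inf with large finite sentinels; NaN is left for the imputer
    return np.nan_to_num(X, nan=np.nan, posinf=1e9, neginf=-1e9)

logreg_prep = Pipeline([
    ("definf", FunctionTransformer(clip_infs)),
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode replacement
    ("scale", StandardScaler()),
])

X = np.array([[1.0, np.inf], [np.nan, 2.0], [3.0, -np.inf], [1.0, 2.0]])
Xt = logreg_prep.fit_transform(X)
```

A second pipeline for XGBoost would keep only the `clip_infs` step; swapping whole pipelines per model is usually cleaner than one pipeline with many switches.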
Category: Data Science

Time Series Data Missing Value Treatment

I have hourly time series data for a solar plant covering 3 years (2019, 2020, 2021). I have a categorical feature named WWCode, which has 54 unique values; WWCode is a weather condition code. The WWCode feature is entirely missing for 2019 except December, and there are no missing values at all in the other years. I am thinking about how to treat these missing values. I first thought about deleting the feature, since its correlation with the …
Category: Data Science

Handling missing values in medical data

I have a medical dataset that contains maternal and foetal data during pregnancy. There are some missing values in the dataset that I am unsure how to handle. Here is a short example of my dataset:

id  insulin  ultrasound_AC
0   33       2651
1            2743
2   29

Patient 0 was prescribed insulin at 33 weeks gestation, patient 2 at 29 weeks, whereas patient 1 was not prescribed insulin, hence the missing value. Similarly, patient 0's foetus had an ultrasound abdominal circumference …
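When "missing" actually means "the event did not happen" (no insulin prescribed, no scan taken), an explicit indicator column plus a neutral fill value often represents that better than statistical imputation. A sketch using the question's column names; the fill value of 0 is an assumption to mark "never prescribed":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2],
                   "insulin": [33.0, np.nan, 29.0],
                   "ultrasound_AC": [2651.0, 2743.0, np.nan]})

# Indicator: was insulin prescribed at all?
df["insulin_prescribed"] = df["insulin"].notna().astype(int)
df["insulin"] = df["insulin"].fillna(0)      # 0 marks "never prescribed"
print(df["insulin_prescribed"].tolist())  # [1, 0, 1]
```

Tree-based models in particular can use the indicator and the filled value jointly; mean-imputing here would instead invent a prescription week for patients who were never prescribed insulin.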
Category: Data Science

Right order for Data preparation in Machine Learning

For the below-mentioned steps of data preparation:

1. Outlier detection/treatment
2. Data imputation
3. Data scaling/standardisation
4. Class balancing

there are two sub-questions:

1. Should each of these steps be performed after the test/train split?
2. Should it be done on the test data?

I would appreciate an explanation for each step individually.
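A sketch of the usual answer for steps 2 and 3: fit each statistic-bearing step on the training split only and reuse the fitted objects on the test split, while class balancing (step 4) modifies the training data alone and never touches the test set. Synthetic data stands in for a real frame:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

imp = SimpleImputer().fit(X_tr)              # imputation: train statistics only
X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)
sc = StandardScaler().fit(X_tr)              # scaling: train statistics only
X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)
# Class balancing (e.g. SMOTE or undersampling) would now resample X_tr, y_tr only;
# the test split keeps its natural class distribution.
```

Outlier treatment follows the same logic: thresholds are derived from the training data and applied, unchanged, to the test data.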
Category: Data Science

Merge two dataframes on multiple columns, only if not NaN

Given two Pandas dataframes, how can I use the second dataframe to fill in missing values, given multiple key columns?

Left (Col1, Col2, Key1, Key2, Extra1):
["A", "B", 1.10, 1.11, "Alice"]
["C", "D", 2.10, 2.11, "Bob"]
[np.nan, np.nan, 3.10, 3.11, "Charlie"]

Right (Col1, Col2, Key1, Key2):
[np.nan, np.nan, 1.10, 1.11]

Expected result (Col1, Col2, Key1, Key2, Extra1):
["A", "B", 1.10, 1.11, "Alice"]   # left df has more non-NaNs, so leave it
["C", "D", 2.10, 2.11, "Bob"]     # unmatched row should still exist
+ …
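One way to fill only the missing cells from a second frame is to index both frames by the key columns and use `DataFrame.combine_first`, which prefers the left frame's non-NaN values and keeps unmatched rows. A sketch with its own small frames (not the question's exact data):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"Col1": ["A", "C", np.nan],
                     "Col2": ["B", "D", np.nan],
                     "Key1": [1.10, 2.10, 3.10],
                     "Key2": [1.11, 2.11, 3.11],
                     "Extra1": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"Col1": ["X"], "Col2": ["Y"],
                      "Key1": [3.10], "Key2": [3.11]})

# Align on the key columns; left's non-NaN cells win, right fills the gaps
merged = (left.set_index(["Key1", "Key2"])
              .combine_first(right.set_index(["Key1", "Key2"]))
              .reset_index())
```

Rows present only in the left frame survive unchanged, and only cells that were NaN on the left are taken from the right.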
Category: Data Science

Comparing two models with different (naive) baseline

I would like to compare a model with listwise deletion to a model with multiple imputation. However, the model with listwise deletion has a majority class of 70%, while the model with multiple imputation has a majority class of 60%. The class balances differ because the first model has deleted part of the observations. My accuracy results are 0.75 for the first model and 0.67 for the second model. Can I conclude that the second model performs …
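Raw accuracies over different class balances are not directly comparable; a baseline-normalized skill score (accuracy gain over the majority baseline, scaled by the headroom above it) puts both models on the same footing. A sketch using the question's own figures:

```python
# Model A (listwise deletion): 75% accuracy vs a 70% majority baseline
# Model B (multiple imputation): 67% accuracy vs a 60% majority baseline
acc_a, base_a = 0.75, 0.70
acc_b, base_b = 0.67, 0.60

skill_a = (acc_a - base_a) / (1 - base_a)   # improvement over baseline, normalized
skill_b = (acc_b - base_b) / (1 - base_b)
print(round(skill_a, 3), round(skill_b, 3))  # 0.167 0.175
```

By this measure the two models are nearly equivalent; for a firmer comparison, both should ideally be evaluated on the same held-out rows (those that survive listwise deletion).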
Category: Data Science

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas dataframes and returns unlabeled numpy arrays, which means that the SimpleImputer can drop some features at will but has no way to communicate which features have been dropped to the caller. I've been trying to come up with a workaround, but all of them are extremely hackish and unreliable. Is there something I'm missing?
Category: Data Science

How can I make the K-NN imputer produce a binary outcome (yes/no, 1 or 0) instead of decimal values (e.g. 0.75, 0.6)?

I am trying to impute some missing categorical values using the K-NN imputer, but after imputation the missing values are replaced with decimal numbers. I want to use K-NN as a classifier, so that the imputed output is only one of two outcomes (0 or 1). Does anyone know how to perform K-NN imputation as a classifier using the scikit-learn library? I am a newbie to data science and have already read the documentation, but it was no help.
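For a binary column there are two options: round the KNNImputer output, or impute with an actual classifier (`KNeighborsClassifier`) trained on the rows where the column is known; the latter is a true majority vote and always yields 0 or 1. A sketch of the classifier approach on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)       # binary column, some values to be hidden
y[rng.random(100) < 0.1] = np.nan

# Train a KNN classifier on rows where the binary column is known
miss = np.isnan(y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X[~miss], y[~miss])
y[miss] = clf.predict(X[miss])        # predictions are exactly 0 or 1
```

KNNImputer averages the neighbors' values (hence 0.75, 0.6), so it is only appropriate for categorical columns if its output is thresholded afterwards.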
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.