How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values. I've tried sklearn's SimpleImputer with all of its strategies, the KNNImputer with several values of k, and the IterativeImputer with various combinations of imputation order, estimator, and number of iterations. My question is: how do I evaluate the imputation techniques and choose the best one for my data? I can't run a baseline model and evaluate its performance, because I'm not familiar with balancing the data and …
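One common way to compare imputers without a downstream model is to hide values you actually know, impute them, and score the reconstruction error. A minimal sketch of that idea, using synthetic stand-in data rather than the asker's 29-feature set:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))          # stand-in for rows with no missing values

# Hide 10% of the entries at random to create an artificial missingness mask
mask = rng.random(X_full.shape) < 0.10
X_missing = X_full.copy()
X_missing[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_missing)
    # Score only the entries that were artificially hidden
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    print(f"{name}: RMSE on masked entries = {rmse:.3f}")
```

The imputer with the lowest error on the masked entries is the best candidate, assuming the artificial missingness resembles the real missingness mechanism.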
Category: Data Science

How to compare between two methods of multivariate to filling NA

In the Titanic dataset, I tried two methods to fill the missing Age values. The first is regression using Lasso:

```python
from sklearn.linear_model import Lasso

AgefillnaModel = Lasso(copy_X=False)
AgefillnaModel_X.dropna(inplace=True)
y = DF.Age.dropna(inplace=False)
AgefillnaModel.fit(AgefillnaModel_X, y)
DF.loc[ageNaIn, 'Age'] = AgefillnaModel.predict(DF.loc[ageNaIn, AgefillnaModel_X.columns])
```

The second method uses IterativeImputer from sklearn.impute:

```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Setting the random_state argument for reproducibility
imputer = IterativeImputer(random_state=42)
imputed = imputer.fit_transform(DF)
df_imputed = pd.DataFrame(imputed, columns=DF.columns)
round(df_imputed, 2)
```

Now, how can I decide which one is better? Here is the resulting scatter of Age …
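The two methods can be compared on rows where Age is actually known: hide a portion of those ages, run both methods, and score each against the hidden truth. A sketch of that comparison, with a small synthetic frame standing in for the question's `DF`:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
DF = pd.DataFrame({
    "Age": rng.uniform(1, 80, 300),
    "Fare": rng.uniform(5, 100, 300),
    "Pclass": rng.integers(1, 4, 300).astype(float),
})

# Hide 20% of the known ages and remember the truth
hide = rng.random(len(DF)) < 0.20
truth = DF.loc[hide, "Age"].copy()
DF.loc[hide, "Age"] = np.nan

# Method 1: Lasso regression trained on the complete rows
X = DF.drop(columns="Age")
model = Lasso().fit(X[~hide], DF.loc[~hide, "Age"])
pred_lasso = model.predict(X[hide])

# Method 2: IterativeImputer over the whole frame
imputed = IterativeImputer(random_state=42).fit_transform(DF)
pred_iter = pd.DataFrame(imputed, columns=DF.columns).loc[hide, "Age"]

for name, pred in [("Lasso", pred_lasso), ("IterativeImputer", pred_iter)]:
    rmse = np.sqrt(np.mean((np.asarray(pred) - truth.values) ** 2))
    print(f"{name}: RMSE = {rmse:.2f}")
```

Whichever method reconstructs the hidden ages with lower error is the better choice for this frame, under the assumption that held-out ages are missing in the same way as the real ones.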
Category: Data Science

Missing population values in census data

I have population data from Census.gov: total US population by age, by year, from 1940 through 2010. Depending on the range of decades, the data is missing discrete population values for ages greater than a certain cutoff; instead, an aggregate amount is provided that represents all ages greater than the cutoff. Specifically, it follows this pattern:

1940 to 1979: discrete data from 0 to 84, and an aggregate for ages 85 and greater
1980 to 1999: discrete data from 0 to …
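One simple way to split such an aggregated tail into single ages is proportional allocation: borrow the tail's age distribution from a period where the detail is observed and scale it to the aggregate total. A sketch with made-up numbers (the real distribution would come from a decade with discrete 85+ data):

```python
import numpy as np

agg_85_plus = 1200.0                            # aggregate count to distribute
ref = np.array([400.0, 300.0, 200.0, 100.0])    # observed counts for ages 85-88 in a later decade
shares = ref / ref.sum()                        # reference age distribution of the tail
estimated = agg_85_plus * shares                # proportional allocation of the aggregate
```

This preserves the aggregate total exactly while imposing the reference decade's shape on the missing detail, which is an assumption worth stating alongside any results.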
Category: Data Science

Missing value Imputation in dataset

I have two separate files for testing and training. In the training data, I am dropping rows that contain too many missing values. But in the test data, I cannot afford to drop rows, so I have chosen to impute the missing values using a KNN approach. My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? As in, neighbors …
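The usual pattern is to fit the imputer on the training data and only transform the test data, so that each test row is imputed from training neighbors rather than from other test rows. A minimal sketch with synthetic arrays:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_test = rng.normal(size=(20, 4))
X_test[0, 2] = np.nan                     # a missing value in the test set

imputer = KNNImputer(n_neighbors=5)
imputer.fit(X_train)                      # neighbors come from the training set
X_test_imp = imputer.transform(X_test)    # test rows are imputed independently of each other
```

This also avoids leaking test-set statistics into the preprocessing, which matters if the test file is meant to simulate unseen data.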
Category: Data Science

Using sklearn knn imputation on a large dataset

I have a large dataset, roughly 1 million rows by 400 features, and I want to impute the missing values using sklearn's KNNImputer. Trying this off the bat, I hit memory problems, but I think I can solve this by chunking my dataset. I was hoping someone could confirm that my method is sound and that I haven't hit any gotchas. The sklearn KNNImputer has a fit method and a transform method, so I believe that if I fit the imputer instance on …
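The fitted KNNImputer keeps the fit data internally, so chunking the transform bounds the size of the chunk-versus-fit-data distance matrix, which is where the memory goes. A small sketch of the chunked pattern (fitting on the full data here; a representative sample could be used if even the fit is too large):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.05] = np.nan

imputer = KNNImputer(n_neighbors=5).fit(X)

# Transform in fixed-size chunks so only one chunk's distance matrix
# is held in memory at a time
chunks = []
for start in range(0, X.shape[0], 200):
    chunks.append(imputer.transform(X[start:start + 200]))
X_imputed = np.vstack(chunks)
```

Because each chunk is imputed against the same fitted data, the result is identical to a single full-array transform, just computed piecewise.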
Category: Data Science

Should I Impute target values?

I am new to data science and I am currently playing around a bit. Data exploration and preparation is really tedious, even though I use pandas. I managed to impute missing values in the independent variables: for numerical data I used the imputer with the mean strategy, and for one categorical variable I used the LabelEncoder and then imputed with the mode strategy. But now I face the issue that the dependent variable $y$ also contains missing values. Should I delete those lines …
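The common advice is to drop rows where the target itself is missing rather than impute it, since imputed targets would train the model on fabricated labels. A tiny sketch with placeholder column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                   "y":  [10.0, np.nan, 30.0, np.nan]})

labelled = df.dropna(subset=["y"])       # keep only rows with a known target
X, y = labelled[["x1"]], labelled["y"]
print(len(labelled))  # 2
```

The rows with missing $y$ can still be useful later as unlabelled data (e.g. for semi-supervised methods), but not for ordinary supervised training.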
Category: Data Science

Advice on imputing temperature data with StatsModels MICE

This may be a dumb question, but I can't figure out how to actually get the values imputed using statsmodels MICE back into my data. I have a dataframe (dfLocal) with hourly temperature records for five neighboring stations (LOC1:LOC5) over many years, and I'd like to impute the missing data for any given site. Following the examples, I have:

```python
imp = mice.MICEData(dfLocal)
fml = 'LOC1 ~ LOC2 + LOC3 + LOC4 + LOC5'
mice = mice.MICE(fml, sm.OLS, imp)
results = …
```
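The `MICEData` object keeps its working, imputed copy of the frame in the `data` attribute, so after running imputation cycles the filled-in values can be read back from there. A sketch with synthetic data standing in for `dfLocal`:

```python
import numpy as np
import pandas as pd
from statsmodels.imputation import mice

rng = np.random.default_rng(0)
dfLocal = pd.DataFrame(rng.normal(size=(100, 3)),
                       columns=["LOC1", "LOC2", "LOC3"])
dfLocal.loc[rng.random(100) < 0.1, "LOC1"] = np.nan

imp = mice.MICEData(dfLocal)
imp.update_all(5)                 # run 5 imputation cycles
filled = imp.data                 # DataFrame with the missing values imputed
```

Note that `imp.data` holds one draw from the imputation process; for proper multiple imputation, the MICE analysis would be repeated over several such draws.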
Category: Data Science

How to implement single Imputation from conditional distribution?

In [*], page 264, a method of drawing a missing value from a conditional distribution $P(\bf{x}_{mis}|\bf{x}_{obs};\theta)$ is described. I did not find any code implementation of this approach. My question is: how can it be implemented? Should we integrate the distribution with respect to an assumed interval of $\bf{x}_{mis}$? Or is this just an intuitive mathematical representation that should be understood, while the implementation looks different? [*] Theodoridis, S., & Koutroumbas, K. "Pattern Recognition." Fourth Edition, 9781597492720, 2008
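For a concrete case: if the model assumed for $\theta$ is a multivariate Gaussian, the conditional $P(\bf{x}_{mis}|\bf{x}_{obs};\theta)$ is itself Gaussian with a closed-form mean and variance, so "drawing from the conditional" means sampling from that Gaussian directly, with no integration over $\bf{x}_{mis}$. A minimal two-variable sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])                  # theta: mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # theta: covariance matrix

x_obs = 1.5                                # observed first coordinate
# Closed-form conditional of x2 given x1 = x_obs:
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x_obs - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]

x_mis = rng.normal(cond_mean, np.sqrt(cond_var))   # one single-imputation draw
```

For non-Gaussian models the same idea applies, but the draw may require a sampler (e.g. rejection sampling or MCMC) when the conditional has no closed form.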
Category: Data Science

Is there a way to impute missing values by clustering, regression and stochastic regression?

I'd like to know if there are any libraries that allow imputation by clustering, regression, or stochastic regression. So far, I've done imputation by mean, median, and KNN. I'm trying to evaluate the best imputation method for a small dataset (Iris, in this case). I had to deliberately create NaN values, since the Iris set has none. My code for KNN imputation:

```python
import pandas as pd
import numpy as np
import random
from fancyimpute import KNN

data = pd.read_csv("D:/Iris_classification/train.csv")
mat = …
```
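Stochastic regression imputation in particular is easy to build from scikit-learn pieces: regress the incomplete column on the complete ones, predict the missing entries, and add Gaussian noise with the residual standard deviation so the imputed values keep a realistic spread. A self-contained sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 200)   # column to impute

miss = rng.random(200) < 0.15
y_obs = y.copy()
y_obs[miss] = np.nan

# Fit the regression on complete rows only
reg = LinearRegression().fit(X[~miss], y[~miss])
resid_sd = np.std(y[~miss] - reg.predict(X[~miss]))

# Deterministic prediction plus residual-scale noise = stochastic regression
y_obs[miss] = reg.predict(X[miss]) + rng.normal(0, resid_sd, miss.sum())
```

Dropping the noise term gives plain (deterministic) regression imputation; the noise is what prevents the imputed column's variance from shrinking artificially.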
Category: Data Science

Should I impute the missing values before the train-validation split?

Validation is supposed to provide an unbiased evaluation of a model fit on the training data. In that case, imputation before the train-validation split could cause an indirect data leakage, because the data that is supposed to act as test data is already contaminated by the imputation. So the correct approach would be to calculate the statistics (mean, mode) with only the training data and fill the missing values of both the training and validation data with them. That, for every partition of training …
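The reasoning above is exactly what sklearn's fit/transform split encodes: fit the imputer on the training split only, then apply the fitted statistics to both splits. A minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_val = train_test_split(X, test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy="mean").fit(X_train)  # statistics from train only
X_train_imp = imputer.transform(X_train)
X_val_imp = imputer.transform(X_val)                   # no validation data leaks in
```

Wrapping the imputer in a `Pipeline` with the model extends the same guarantee to every partition in cross-validation automatically.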
Category: Data Science

K-Fold cross validation and data leakage

I want to do K-fold cross-validation, and I also want to do normalization or feature scaling within each fold. So let's say we have k folds. At each step, we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to fit the feature scaling and data imputation on that training set and then apply the same transformation to that validation set. I want to do this at each step. I am trying …
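This per-fold refitting is exactly what a `Pipeline` passed to `cross_val_score` does: on each fold, the imputer and scaler are fit on that fold's training portion and then applied to its validation portion. A compact sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)   # preprocessing refit per fold, no leakage
```

Writing the folds by hand with `KFold` works too, but the pipeline version makes the no-leakage property hard to get wrong.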
Category: Data Science

How to impute using simple imputer (custom function)

I am imputing my data using SimpleImputer from sklearn, and I want to test many different ways of applying transformations to the data. For example, for logistic regression I would like to: remove NaNs and replace them with the mode; replace +inf with the max and -inf with the min; then use StandardScaler. For XGBoost I would instead like to simply replace -inf/+inf with very large positive or negative numbers. I have been playing with the sklearn Pipeline and I would like to know how …
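One way to keep the per-model preprocessing configurable is a small `Pipeline` per model, using `FunctionTransformer` for the custom inf handling. A sketch of the logistic-regression variant; the sentinel values and function name are illustrative assumptions, not the asker's code:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clip_infs(X):
    # Replace +/-inf with large finite sentinels; NaN is left for the imputer
    return np.nan_to_num(X, nan=np.nan, posinf=1e9, neginf=-1e9)

logreg_prep = Pipeline([
    ("definf", FunctionTransformer(clip_infs)),
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode replacement
    ("scale", StandardScaler()),
])

X = np.array([[1.0, np.inf], [np.nan, 2.0], [3.0, -np.inf], [1.0, 2.0]])
Xt = logreg_prep.fit_transform(X)
```

A second pipeline for XGBoost would keep only the `clip_infs` step; swapping whole pipelines per model is usually cleaner than one pipeline with many switches.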
Category: Data Science

Time Series Data Missing Value Treatment

I have hourly time series data for a solar plant covering 3 years (2019, 2020, 2021). I have a categorical feature named WWCode, which has 54 unique values; WWCode is a weather condition code. The WWCode feature is entirely missing for 2019 except December, and there are no missing values at all in the other years. I am thinking about how to treat these missing values. I first thought about deleting the feature, since its correlation with the …
Category: Data Science

Handling missing values in medical data

I have a medical dataset that contains maternal and foetal data during pregnancy. There are some missing values in the dataset that I am unsure how to handle. Here is a short example of my dataset:

id  insulin  ultrasound_AC
0   33       2651
1            2743
2   29

Patient 0 was prescribed insulin at 33 weeks gestation, patient 2 at 29 weeks, whereas patient 1 was not prescribed insulin, hence the missing value. Similarly, patient 0's foetus had an ultrasound abdominal circumference …
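When "missing" actually means "the event did not happen" (no insulin prescribed, no scan taken), an explicit indicator column plus a neutral fill value often represents that better than statistical imputation. A sketch using the question's column names; the fill value of 0 is an assumption to mark "never prescribed":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2],
                   "insulin": [33.0, np.nan, 29.0],
                   "ultrasound_AC": [2651.0, 2743.0, np.nan]})

# Indicator: was insulin prescribed at all?
df["insulin_prescribed"] = df["insulin"].notna().astype(int)
df["insulin"] = df["insulin"].fillna(0)      # 0 marks "never prescribed"
print(df["insulin_prescribed"].tolist())  # [1, 0, 1]
```

Tree-based models in particular can use the indicator and the filled value jointly; mean-imputing here would instead invent a prescription week for patients who were never prescribed insulin.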
Category: Data Science

Right order for Data preparation in Machine Learning

For the below-mentioned steps of data preparation:

1. Outlier detection/treatment
2. Data imputation
3. Data scaling/standardisation
4. Class balancing

there are two sub-questions:

1. Should each of these steps be performed after the test/train split?
2. Should it be done on the test data?

I would appreciate an explanation for each step individually.
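A sketch of the usual answer for steps 2 and 3: fit each statistic-bearing step on the training split only and reuse the fitted objects on the test split, while class balancing (step 4) modifies the training data alone and never touches the test set. Synthetic data stands in for a real frame:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

imp = SimpleImputer().fit(X_tr)              # imputation: train statistics only
X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)
sc = StandardScaler().fit(X_tr)              # scaling: train statistics only
X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)
# Class balancing (e.g. SMOTE or undersampling) would now resample X_tr, y_tr only;
# the test split keeps its natural class distribution.
```

Outlier treatment follows the same logic: thresholds are derived from the training data and applied, unchanged, to the test data.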
Category: Data Science

Merge two dataframes on multiple columns, only if not NaN

Given two Pandas dataframes, how can I use the second dataframe to fill in missing values, given multiple key columns?

Left (Col1, Col2, Key1, Key2, Extra1):
["A", "B", 1.10, 1.11, "Alice"]
["C", "D", 2.10, 2.11, "Bob"]
[np.nan, np.nan, 3.10, 3.11, "Charlie"]

Right (Col1, Col2, Key1, Key2):
[np.nan, np.nan, 1.10, 1.11]

Expected result (Col1, Col2, Key1, Key2, Extra1):
["A", "B", 1.10, 1.11, "Alice"]   # left df has more non-NaNs, so leave it
["C", "D", 2.10, 2.11, "Bob"]     # unmatched row should still exist
+ …
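One way to fill only the missing cells from a second frame is to index both frames by the key columns and use `DataFrame.combine_first`, which prefers the left frame's non-NaN values and keeps unmatched rows. A sketch with its own small frames (not the question's exact data):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"Col1": ["A", "C", np.nan],
                     "Col2": ["B", "D", np.nan],
                     "Key1": [1.10, 2.10, 3.10],
                     "Key2": [1.11, 2.11, 3.11],
                     "Extra1": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"Col1": ["X"], "Col2": ["Y"],
                      "Key1": [3.10], "Key2": [3.11]})

# Align on the key columns; left's non-NaN cells win, right fills the gaps
merged = (left.set_index(["Key1", "Key2"])
              .combine_first(right.set_index(["Key1", "Key2"]))
              .reset_index())
```

Rows present only in the left frame survive unchanged, and only cells that were NaN on the left are taken from the right.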
Category: Data Science

Comparing two models with different (naive) baseline

I would like to compare a model with listwise deletion to a model with multiple imputation. However, the model with listwise deletion has a majority class of 70%, while the model with multiple imputation has a majority class of 60%. The class balances differ because the first model has deleted part of the observations. My accuracy results are 0.75 for the first model and 0.67 for the second model. Can I conclude that the second model performs …
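Raw accuracies over different class balances are not directly comparable; a baseline-normalized skill score (accuracy gain over the majority baseline, scaled by the headroom above it) puts both models on the same footing. A sketch using the question's own figures:

```python
# Model A (listwise deletion): 75% accuracy vs a 70% majority baseline
# Model B (multiple imputation): 67% accuracy vs a 60% majority baseline
acc_a, base_a = 0.75, 0.70
acc_b, base_b = 0.67, 0.60

skill_a = (acc_a - base_a) / (1 - base_a)   # improvement over baseline, normalized
skill_b = (acc_b - base_b) / (1 - base_b)
print(round(skill_a, 3), round(skill_b, 3))  # 0.167 0.175
```

By this measure the two models are nearly equivalent; for a firmer comparison, both should ideally be evaluated on the same held-out rows (those that survive listwise deletion).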
Category: Data Science

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas dataframes and returns unlabeled numpy arrays, which means that the SimpleImputer can drop some features at will but has no way to communicate which features have been dropped to the caller. I've been trying to come up with a workaround, but all of them are extremely hackish and unreliable. Is there something I'm missing?
Category: Data Science

How can I make the K-NN imputer produce a binary outcome (yes/no, 1 or 0) instead of decimal values (e.g. 0.75, 0.6)?

I am trying to impute some missing categorical values using the K-NN imputer, but after imputation the missing values are replaced with decimal numbers. I want to use K-NN as a classifier, so that the imputed output is only one of two outcomes (0 or 1). Does anyone know how to perform K-NN imputation as a classifier using the scikit-learn library? I am a newbie to data science and have already read the documentation, but it was no help.
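For a binary column there are two options: round the KNNImputer output, or impute with an actual classifier (`KNeighborsClassifier`) trained on the rows where the column is known; the latter is a true majority vote and always yields 0 or 1. A sketch of the classifier approach on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)       # binary column, some values to be hidden
y[rng.random(100) < 0.1] = np.nan

# Train a KNN classifier on rows where the binary column is known
miss = np.isnan(y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X[~miss], y[~miss])
y[miss] = clf.predict(X[miss])        # predictions are exactly 0 or 1
```

KNNImputer averages the neighbors' values (hence 0.75, 0.6), so it is only appropriate for categorical columns if its output is thresholded afterwards.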
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.