What happens if a dataset contains different "groups" that follow different linear models? For example, imagine that, examining the scatterplot of a certain feature $x_i$ against $y$, we can see that some points follow a linear relationship with a coefficient $\beta_A<0$ while other points clearly have $\beta_B>0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical …
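A hedged sketch of that idea in Python (the column names `x_i`, `group`, `y` and the slopes are made up for illustration): add the group as a categorical feature and interact it with $x_i$ so each population gets its own slope.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Toy data: group "A" responds negatively to x_i, group "B" positively (illustrative values only).
    rng = np.random.default_rng(0)
    n = 200
    group = rng.choice(["A", "B"], size=n)
    x = rng.normal(size=n)
    y = np.where(group == "A", -2.0 * x, 3.0 * x) + rng.normal(scale=0.5, size=n)
    df = pd.DataFrame({"x_i": x, "group": group, "y": y})

    # One dummy per extra group plus an interaction term gives each population its own slope.
    X = pd.get_dummies(df[["x_i", "group"]], columns=["group"], drop_first=True)
    X["x_i:group_B"] = X["x_i"] * X["group_B"]

    model = LinearRegression().fit(X, df["y"])
    print(dict(zip(X.columns, model.coef_)))  # slope of A is the x_i coefficient; B adds the interaction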
In the Titanic dataset, I used two methods to fill the Age NAs. The first one is regression using Lasso:

    from sklearn.linear_model import Lasso

    AgefillnaModel = Lasso(copy_X=False)
    # AgefillnaModel_X presumably holds the predictor columns (defined earlier); keep complete rows for fitting
    AgefillnaModel_X.dropna(inplace=True)
    y = DF.Age.dropna(inplace=False)
    AgefillnaModel.fit(AgefillnaModel_X, y)
    # ageNaIn presumably indexes the rows where Age is missing
    DF.loc[ageNaIn, 'Age'] = AgefillnaModel.predict(DF.loc[ageNaIn, AgefillnaModel_X.columns])

and the second method uses IterativeImputer() from sklearn.impute:

    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    # Setting the random_state argument for reproducibility
    imputer = IterativeImputer(random_state=42)
    imputed = imputer.fit_transform(DF)
    df_imputed = pd.DataFrame(imputed, columns=DF.columns)
    round(df_imputed, 2)

Now, how can I decide which one is better? Here is the result of the scattered Age …
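One way to decide, sketched under the assumption that `DF` is the numeric frame used above: mask a share of the known Age values, impute them, and compare reconstruction error. The loop below uses a median fill as a stand-in baseline against `IterativeImputer`; the same scheme applies unchanged to the Lasso approach.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, SimpleImputer
    from sklearn.metrics import mean_squared_error

    # Assumption: DF is an all-numeric DataFrame with an 'Age' column containing some NaNs.
    known = DF[DF.Age.notna()].copy()

    # Hide 20% of the known ages so each imputer can be scored on how well it recovers them.
    rng = np.random.default_rng(42)
    mask = rng.random(len(known)) < 0.2
    truth = known.loc[mask, "Age"].copy()
    holdout = known.copy()
    holdout.loc[mask, "Age"] = np.nan

    for name, imputer in [("median", SimpleImputer(strategy="median")),
                          ("iterative", IterativeImputer(random_state=42))]:
        imputed = imputer.fit_transform(holdout)
        pred = imputed[mask.nonzero()[0], holdout.columns.get_loc("Age")]
        print(name, "RMSE:", mean_squared_error(truth, pred) ** 0.5)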
I am a newbie to machine learning and I am trying to apply SVD to the MovieLens dataset for movie recommendation. I have a movie-user matrix where the row is the user id, the column is the movie id, and the value is the rating. Now, I would like to normalize the movie-user matrix (subtract each user's mean rating from the data) and then pass the normalized matrix to scipy.sparse.linalg svds as follows:

    from scipy.sparse.linalg import svds
    U, sigma, …
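A minimal sketch of that pipeline, assuming `R` is a dense user-by-movie array with 0 where a rating is missing and `k = 50` latent factors (both are assumptions):

    import numpy as np
    from scipy.sparse.linalg import svds

    # Assumption: R is a dense (n_users x n_movies) array, 0.0 marking "not rated".
    rated = R != 0
    user_means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

    # Subtract each user's mean from their observed ratings only, leaving unrated cells at 0.
    R_centered = np.where(rated, R - user_means[:, None], 0.0)

    # Truncated SVD with k latent factors (svds requires k < min(R.shape)).
    k = 50
    U, sigma, Vt = svds(R_centered, k=k)

    # Reconstruct and add the user means back to get rating predictions.
    pred = U @ np.diag(sigma) @ Vt + user_means[:, None]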
I have population data from Census.gov: total US population by age by year, from 1940 through 2010. Depending on the range of decades, the data is missing discrete population values for ages greater than a certain cutoff; instead, an aggregate amount is provided that represents all ages greater than the cutoff. Specifically, it follows this pattern: 1940 to 1979: discrete data from 0 to 84 and an aggregate for ages 85 and greater; 1980 to 1999: discrete data from 0 to …
I am dealing with a dataset of categorical data that looks like this:

       content_1  content_2  content_4  content_5  content_6
    0        NaN        0.0        0.0        0.0        NaN
    1        NaN        0.0        0.0        0.0        NaN
    2        NaN        NaN        NaN        NaN        NaN
    3        0.0        NaN        0.0        NaN        0.0

These represent user downloads from an intranet, where a user is shown the opportunity to download a particular piece of content. 1 indicates a user seeing content and downloading it, 0 indicates a user seeing content and not …
When calculating correlations in R, e.g. via cor, is it better to treat missing data as NAs or as zeros? The latter would be regarded as numerically valid values, so I'd guess NA would be better?
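The question is about R's cor, but the effect is easy to demonstrate in any tool; a quick sketch with made-up pandas data, where pairwise NA handling plays the role of cor(..., use = "pairwise.complete.obs") and zero-filling plays the role of replacing NAs with 0:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=100)})
    df.loc[df.sample(frac=0.3, random_state=1).index, "y"] = np.nan

    # Pairwise deletion: NaN pairs are simply skipped when computing the correlation.
    print(df.corr().loc["x", "y"])

    # Zero-filling: the arbitrary zeros pull the estimated correlation away from the true value.
    print(df.fillna(0).corr().loc["x", "y"])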
A friend of mine has recently started working in RStudio and is interested in filling the NA values in different columns using the above-mentioned function. Also, since he intends to run a time series analysis on every column, what would be the correct approach?
I'm new to statistics, so sorry for any major lack of knowledge in the topic; I'm just doing a project for graduation. I'm trying to cluster a health dataset containing diseases (3456) and symptoms (25), grouping them by the number of events that occurred. My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now): So, I was wondering what the best way would be to cluster this …
In [*], page 264, a method is described for drawing a missing value from a conditional distribution $P(\mathbf{x}_{mis}|\mathbf{x}_{obs};\theta)$, which is defined as: I did not find any code implementation of this approach. My question is: how do I implement it? Should we integrate the distribution over an assumed interval of $\mathbf{x}_{mis}$? Or is this just an intuitive mathematical representation that should be understood, while the implementation is different? [*] Theodoridis, S., & Koutroumbas, K. (2008). Pattern Recognition (4th ed.). ISBN 9781597492720.
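The book leaves the distribution abstract, and no integration is needed once a parametric form is assumed: one simply draws a sample from the conditional. A sketch under a multivariate Gaussian assumption (my choice, not the authors'), where $P(\mathbf{x}_{mis}|\mathbf{x}_{obs};\theta)$ is again Gaussian with mean $\mu_{m} + \Sigma_{mo}\Sigma_{oo}^{-1}(\mathbf{x}_{obs}-\mu_{o})$ and covariance $\Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}$:

    import numpy as np

    def sample_missing(x, mu, Sigma, rng=None):
        """Draw the NaN entries of x from the conditional Gaussian given the observed entries."""
        if rng is None:
            rng = np.random.default_rng()
        miss = np.isnan(x)
        obs = ~miss
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        S_mm = Sigma[np.ix_(miss, miss)]
        # Conditional mean and covariance of the missing block given the observed block
        cond_mean = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
        cond_cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
        filled = x.copy()
        filled[miss] = rng.multivariate_normal(cond_mean, cond_cov)
        return filled

    # Example with assumed parameters theta = (mu, Sigma)
    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])
    print(sample_missing(np.array([0.4, np.nan, np.nan]), mu, Sigma))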
I am trying to build a pipeline in order to perform GridSearchCV to find the best parameters. I already split the data into train and validation sets and have the following code:

    column_transformer = make_pipeline(
        (OneHotEncoder(categories=cols)),
        (OrdinalEncoder(categories=X["grade"])),
        "passthrough")
    imputer = SimpleImputer(strategy='median')
    scaler = StandardScaler()
    model = SGDClassifier(loss='log', random_state=42, n_jobs=-1, warm_start=True)
    pipeline_sgdlogreg = make_pipeline(imputer, column_transformer, scaler, model)

When I perform GridSearchCV I get the following error: "cannot use median strategy with non-numeric data (...)". I do not understand why am …
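The error occurs because the median imputer is the first pipeline step, so it sees the raw categorical columns before they are encoded. A hedged sketch of one way around this, using ColumnTransformer so each dtype gets its own imputer (the dtype selectors and the most_frequent strategy are my assumptions, not part of the original code):

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Numeric columns: median-impute then scale; categorical columns: most-frequent impute then one-hot.
    numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    categorical = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                OneHotEncoder(handle_unknown="ignore"))

    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ])

    # loss="log_loss" in recent scikit-learn releases; older versions spell it loss="log"
    pipeline_sgdlogreg = make_pipeline(
        preprocess,
        SGDClassifier(loss="log_loss", random_state=42, n_jobs=-1, warm_start=True))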
According to my knowledge, before filling NaN values we have to check whether the data is missing because of MCAR, MAR or MNAR, which depends on how the features are correlated with each other, and then decide which method to apply. So, my question is: is it good practice to check the dependency between features with a chi-square test of independence? If not, please suggest what techniques to use or apply to fill NaN values. I will be …
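A hedged sketch of that check on a toy frame (the column names and values are made up): build a missingness indicator for one column and test whether it is independent of another observed feature.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Toy data: is 'income' more likely to be missing for one gender? (placeholder columns)
    df = pd.DataFrame({"gender": ["m", "f", "f", "m", "f", "m", "f", "m"],
                       "income": [50, np.nan, np.nan, 60, np.nan, 55, 48, np.nan]})

    # Contingency table between the missingness indicator and the other feature
    table = pd.crosstab(df["income"].isna(), df["gender"])
    chi2, p_value, dof, expected = chi2_contingency(table)

    # A small p-value is evidence that missingness depends on the other feature (against MCAR).
    # The test can suggest MAR but cannot rule out MNAR, which depends on the unobserved values themselves.
    print(p_value)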
I got a dataset that contains 50 features spanning 2009 to 2018, but one of the features has only been available since 2015 and cannot be recovered for earlier years. I am concerned that if I train a model on the whole dataset, the estimated coefficient of that sparse feature will be biased (the feature itself is not sparse; it's just that all the data from 2009-2014 is unavailable). Therefore, I would like to ask how you would deal with a feature that was …
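One common option, sketched here with placeholder names (`year` and `new_feature` are assumptions), is to keep the feature, flag the years where it is structurally missing, and fill the gap with a constant so the model can learn a separate effect for the pre-2015 period; another is to fit one model without the feature on the full range and a second with it on 2015-2018 and compare.

    import numpy as np
    import pandas as pd

    # Toy example: a feature observed only from 2015 onward.
    df = pd.DataFrame({"year": range(2009, 2019),
                       "new_feature": [np.nan] * 6 + [1.2, 0.8, 1.5, 0.9]})

    # Missingness indicator plus a constant fill keeps the column usable for the whole period
    # and lets the model attribute the pre-2015 rows to the indicator rather than a distorted coefficient.
    df["new_feature_missing"] = df["new_feature"].isna().astype(int)
    df["new_feature"] = df["new_feature"].fillna(0.0)
    print(df)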
I am trying to predict loan defaults with a fairly moderate-sized dataset. I will probably be using logistic regression and random forest. I have around 35 variables and one of them classifies the type of the client: company or authorized individual. The problem is that, for authorized individuals, some variables (such as turnover, assets, liabilities, etc) are missing, because an authorized individual should not have this stuff. Only a company can have turnover, assets, etc. What do I do in …
I have a dataset that contains several measures from various widgets on a daily basis. While the widgets remain relatively stable over time, sometimes there are legitimate reasons for one to disappear and another to appear in the data as a whole. Occasionally, a widget will just disappear and so the dataset is incomplete, invalidating the whole dataset for that day. What I am looking for is a method of comparing the current set of widgets with another set of …
I have a dataset consisting of M questionnaires and N students. Each students replied to some questionnaires. I would like to make the dataset better, by removing some questionnaires and/or some students. The goal is to optimize the dataset so we have as few "holes" as possible. To be clear, a hole in the dataset is when a student did not reply to a questionnaire. Let's say the number of "holes" in the dataset is H. We want H as …
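One simple baseline for this (a naive greedy heuristic, not an optimal algorithm): repeatedly drop whichever student or questionnaire currently has the largest fraction of holes until no holes remain, then inspect what survives. Sketch below, with a hole encoded as NaN.

    import numpy as np

    def greedy_dehole(responses):
        """responses: (students x questionnaires) array, np.nan marking a hole.
        Greedily drop the row or column with the highest fraction of holes until H == 0.
        This is a heuristic baseline; it does not guarantee the largest hole-free submatrix."""
        rows = list(range(responses.shape[0]))
        cols = list(range(responses.shape[1]))
        sub = responses.copy()
        while np.isnan(sub).any():
            row_frac = np.isnan(sub).mean(axis=1)
            col_frac = np.isnan(sub).mean(axis=0)
            if row_frac.max() >= col_frac.max():
                i = int(row_frac.argmax())
                sub = np.delete(sub, i, axis=0)
                del rows[i]
            else:
                j = int(col_frac.argmax())
                sub = np.delete(sub, j, axis=1)
                del cols[j]
        return rows, cols  # indices of the students and questionnaires that are kept

    # Tiny example: 4 students x 3 questionnaires with two holes
    data = np.array([[1, 2, np.nan],
                     [3, np.nan, 4],
                     [5, 6, 7],
                     [8, 9, 10]], dtype=float)
    print(greedy_dehole(data))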
I have a log dataset that contains 30+ features. One group of these features is of the following type: request_id, user_partyrole_id, authentication_id, user_login_key, and other such IP- and key-related features. I wonder what the best way is to handle missing values in such features, since IP addresses aren't numbers in the sense that we could calculate their mean value, for example. To give more context, the data is big: over 1 million rows. Also, can someone explain how …
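Since these are identifier-like categorical features rather than quantities, one common option (sketched with placeholder values) is to make "missing" its own category, or to replace the raw ID with an aggregate such as its frequency:

    import pandas as pd

    # Toy frame with an ID-like categorical feature.
    df = pd.DataFrame({"authentication_id": ["a17", None, "a17", "b02", None, "c33"]})

    # Option 1: treat "missing" as its own level instead of imputing a numeric value.
    df["authentication_id_filled"] = df["authentication_id"].fillna("MISSING")

    # Option 2: frequency encoding -- replace each ID by how often it occurs, with 0 for missing.
    counts = df["authentication_id"].value_counts()
    df["authentication_id_freq"] = df["authentication_id"].map(counts).fillna(0)
    print(df)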
I am currently working with a bunch of classification models, especially logistic regression, KNN, Naive Bayes, SVM, and decision trees, for my machine learning class. I know how to find and remove the missing values and the outliers. But I would like to know which of the above models would perform really badly if the outliers and missing values are not removed. Like, if I decide to leave the outliers and missing values in the dataset, which model should …
I'm running an LM model using the LMest package available in R. The dataset contains NO missing values.

    pct_miss(df_long)
    [1] 0
    n_miss(df_long)
    [1] 0

The lmest function with no covariates works fine. However, when I added covariates in the latentFormula, I got the following error message:

    Error in lmest(responsesFormula = responseA + responseB + responseC + responseD ~ :
      missing data in the covariates affecting the initial probabilities are not allowed

My code follows:

    LMmodel <- lmest(responsesFormula = responseA + …
I am working on a house pricing model, and I have a feature with values 0 or 1 indicating whether the rent price is capped by the government (houses with capped rents sell for much lower on average). When the rent is indeed capped, there is a second feature with the cap value. How do I deal with this second feature, knowing that it's missing for more than 80% of the data? Thanks in advance.
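Because the cap value only exists when the flag is 1, the values are structurally missing rather than unknown; one hedged option (placeholder column names below) is to fill the uncapped rows with a constant and rely on the 0/1 flag, possibly plus a flag-times-cap interaction, to tell the model when the cap applies:

    import numpy as np
    import pandas as pd

    # Toy frame: 'rent_capped' is the 0/1 flag, 'cap_value' only exists when the flag is 1.
    df = pd.DataFrame({"rent_capped": [1, 0, 0, 1, 0],
                       "cap_value": [850.0, np.nan, np.nan, 1200.0, np.nan]})

    # The cap is not "missing at random" for uncapped houses -- it simply does not exist,
    # so a constant fill keeps the column usable while the flag marks where it is meaningful.
    df["cap_value"] = df["cap_value"].fillna(0.0)
    df["capped_x_capvalue"] = df["rent_capped"] * df["cap_value"]
    print(df)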
I'm working with longitudinal data for a series of patients. Duration of follow-up at the patient level is non-uniform. Patients can either experience a discrete event (e.g., a heart attack) or never experience the event. This feature is of course binary. Additionally, patients that have experienced an event (e.g., the first heart attack) can also continue to experience more events (e.g., subsequent heart attacks). Each event is anchored to an event date which will be compared to when the patient was …