What happens if a dataset contains different "groups" that follow different linear models? For example, imagine that, examining the scatterplot of a certain feature $x_i$ against $y$, we can see that some points follow a linear relationship with a coefficient $\beta_A<0$ while other points clearly have $\beta_B>0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical …
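A hedged sketch of that idea in Python (the column names `x_i`, `group`, `y` and the slopes are made up for illustration): add the group as a categorical feature and interact it with $x_i$ so each population gets its own slope.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Toy data: group "A" responds negatively to x_i, group "B" positively (illustrative values only).
    rng = np.random.default_rng(0)
    n = 200
    group = rng.choice(["A", "B"], size=n)
    x = rng.normal(size=n)
    y = np.where(group == "A", -2.0 * x, 3.0 * x) + rng.normal(scale=0.5, size=n)
    df = pd.DataFrame({"x_i": x, "group": group, "y": y})

    # One dummy per extra group plus an interaction term gives each population its own slope.
    X = pd.get_dummies(df[["x_i", "group"]], columns=["group"], drop_first=True)
    X["x_i:group_B"] = X["x_i"] * X["group_B"]

    model = LinearRegression().fit(X, df["y"])
    print(dict(zip(X.columns, model.coef_)))  # slope of A is the x_i coefficient; B adds the interaction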
In the Titanic dataset, I used two methods to fill the Age NAs. The first one is regression using Lasso:

    from sklearn.linear_model import Lasso

    AgefillnaModel = Lasso(copy_X=False)
    # AgefillnaModel_X presumably holds the predictor columns (defined earlier); keep complete rows for fitting
    AgefillnaModel_X.dropna(inplace=True)
    y = DF.Age.dropna(inplace=False)
    AgefillnaModel.fit(AgefillnaModel_X, y)
    # ageNaIn presumably indexes the rows where Age is missing
    DF.loc[ageNaIn, 'Age'] = AgefillnaModel.predict(DF.loc[ageNaIn, AgefillnaModel_X.columns])

and the second method uses IterativeImputer() from sklearn.impute:

    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    # Setting the random_state argument for reproducibility
    imputer = IterativeImputer(random_state=42)
    imputed = imputer.fit_transform(DF)
    df_imputed = pd.DataFrame(imputed, columns=DF.columns)
    round(df_imputed, 2)

Now, how can I decide which one is better? Here is the result of the scattered Age …
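One way to decide, sketched under the assumption that `DF` is the numeric frame used above: mask a share of the known Age values, impute them, and compare reconstruction error. The loop below uses a median fill as a stand-in baseline against `IterativeImputer`; the same scheme applies unchanged to the Lasso approach.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, SimpleImputer
    from sklearn.metrics import mean_squared_error

    # Assumption: DF is an all-numeric DataFrame with an 'Age' column containing some NaNs.
    known = DF[DF.Age.notna()].copy()

    # Hide 20% of the known ages so each imputer can be scored on how well it recovers them.
    rng = np.random.default_rng(42)
    mask = rng.random(len(known)) < 0.2
    truth = known.loc[mask, "Age"].copy()
    holdout = known.copy()
    holdout.loc[mask, "Age"] = np.nan

    for name, imputer in [("median", SimpleImputer(strategy="median")),
                          ("iterative", IterativeImputer(random_state=42))]:
        imputed = imputer.fit_transform(holdout)
        pred = imputed[mask.nonzero()[0], holdout.columns.get_loc("Age")]
        print(name, "RMSE:", mean_squared_error(truth, pred) ** 0.5)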
I am a newbie to machine learning and I am trying to apply SVD to the MovieLens dataset for movie recommendation. I have a movie-user matrix where the row is the user id, the column is the movie id, and the value is the rating. Now, I would like to normalize the movie-user matrix (subtract each user's mean rating from the data) and then pass the normalized matrix to scipy.sparse.linalg svds as follows:

    from scipy.sparse.linalg import svds
    U, sigma, …
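A minimal sketch of that pipeline, assuming `R` is a dense user-by-movie array with 0 where a rating is missing and `k = 50` latent factors (both are assumptions):

    import numpy as np
    from scipy.sparse.linalg import svds

    # Assumption: R is a dense (n_users x n_movies) array, 0.0 marking "not rated".
    rated = R != 0
    user_means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

    # Subtract each user's mean from their observed ratings only, leaving unrated cells at 0.
    R_centered = np.where(rated, R - user_means[:, None], 0.0)

    # Truncated SVD with k latent factors (svds requires k < min(R.shape)).
    k = 50
    U, sigma, Vt = svds(R_centered, k=k)

    # Reconstruct and add the user means back to get rating predictions.
    pred = U @ np.diag(sigma) @ Vt + user_means[:, None]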
I have population data from Census.gov: total US population by age by year, from 1940 through 2010. Depending on the range of decades, the data is missing discrete population values for ages greater than a certain cutoff; instead, an aggregate amount is provided that represents all ages greater than the cutoff. Specifically, it follows this pattern: 1940 to 1979: discrete data from 0 to 84 and an aggregate for ages 85 and greater; 1980 to 1999: discrete data from 0 to …
I am dealing with a dataset of categorical data that looks like this:

       content_1  content_2  content_4  content_5  content_6
    0        NaN        0.0        0.0        0.0        NaN
    1        NaN        0.0        0.0        0.0        NaN
    2        NaN        NaN        NaN        NaN        NaN
    3        0.0        NaN        0.0        NaN        0.0

These represent user downloads from an intranet, where a user is shown the opportunity to download a particular piece of content. 1 indicates a user seeing content and downloading it, 0 indicates a user seeing content and not …
When calculating correlations in R, e.g. via cor, is it better to treat missing data as NAs or as zeros? The latter would be regarded as numerically valid values, so I'd guess NA would be better?
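The question is about R's cor, but the effect is easy to demonstrate in any tool; a quick sketch with made-up pandas data, where pairwise NA handling plays the role of cor(..., use = "pairwise.complete.obs") and zero-filling plays the role of replacing NAs with 0:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=100)})
    df.loc[df.sample(frac=0.3, random_state=1).index, "y"] = np.nan

    # Pairwise deletion: NaN pairs are simply skipped when computing the correlation.
    print(df.corr().loc["x", "y"])

    # Zero-filling: the arbitrary zeros pull the estimated correlation away from the true value.
    print(df.fillna(0).corr().loc["x", "y"])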
A friend of mine has recently started working in RStudio and is interested in filling the NA values in different columns using the above-mentioned function. Also, since he intends to run a time series analysis on every column, what would be the correct approach?
I'm new to statistics, so sorry for any major lack of knowledge in the topic; I'm just doing a project for graduation. I'm trying to cluster a health dataset containing diseases (3456) and symptoms (25), grouping them by the number of events that occurred. My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now): So, I was wondering what the best way would be to cluster this …
In [*], page 264, a method is described for drawing a missing value from a conditional distribution $P(\mathbf{x}_{mis}|\mathbf{x}_{obs};\theta)$, which is defined as: I did not find any code implementation of this approach. My question is: how do I implement it? Should we integrate the distribution over an assumed interval of $\mathbf{x}_{mis}$? Or is this just an intuitive mathematical representation that should be understood, while the implementation is different? [*] Theodoridis, S., & Koutroumbas, K. (2008). Pattern Recognition (4th ed.). ISBN 9781597492720.
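The book leaves the distribution abstract, and no integration is needed once a parametric form is assumed: one simply draws a sample from the conditional. A sketch under a multivariate Gaussian assumption (my choice, not the authors'), where $P(\mathbf{x}_{mis}|\mathbf{x}_{obs};\theta)$ is again Gaussian with mean $\mu_{m} + \Sigma_{mo}\Sigma_{oo}^{-1}(\mathbf{x}_{obs}-\mu_{o})$ and covariance $\Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}$:

    import numpy as np

    def sample_missing(x, mu, Sigma, rng=None):
        """Draw the NaN entries of x from the conditional Gaussian given the observed entries."""
        if rng is None:
            rng = np.random.default_rng()
        miss = np.isnan(x)
        obs = ~miss
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        S_mm = Sigma[np.ix_(miss, miss)]
        # Conditional mean and covariance of the missing block given the observed block
        cond_mean = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
        cond_cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
        filled = x.copy()
        filled[miss] = rng.multivariate_normal(cond_mean, cond_cov)
        return filled

    # Example with assumed parameters theta = (mu, Sigma)
    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])
    print(sample_missing(np.array([0.4, np.nan, np.nan]), mu, Sigma))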
I am trying to build a pipeline in order to perform GridSearchCV to find the best parameters. I already split the data into train and validation sets and have the following code:

    column_transformer = make_pipeline(
        (OneHotEncoder(categories=cols)),
        (OrdinalEncoder(categories=X["grade"])),
        "passthrough")
    imputer = SimpleImputer(strategy='median')
    scaler = StandardScaler()
    model = SGDClassifier(loss='log', random_state=42, n_jobs=-1, warm_start=True)
    pipeline_sgdlogreg = make_pipeline(imputer, column_transformer, scaler, model)

When I perform GridSearchCV I get the following error: "cannot use median strategy with non-numeric data (...)". I do not understand why am …
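The error occurs because the median imputer is the first pipeline step, so it sees the raw categorical columns before they are encoded. A hedged sketch of one way around this, using ColumnTransformer so each dtype gets its own imputer (the dtype selectors and the most_frequent strategy are my assumptions, not part of the original code):

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Numeric columns: median-impute then scale; categorical columns: most-frequent impute then one-hot.
    numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    categorical = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                OneHotEncoder(handle_unknown="ignore"))

    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ])

    # loss="log_loss" in recent scikit-learn releases; older versions spell it loss="log"
    pipeline_sgdlogreg = make_pipeline(
        preprocess,
        SGDClassifier(loss="log_loss", random_state=42, n_jobs=-1, warm_start=True))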
According to my knowledge, before filling NaN values we have to check whether the data is missing because of MCAR, MAR or MNAR, which depends on how the features are correlated with each other, and then decide which method to apply. So, my question is: is it good practice to check the dependency between features with a chi-square test of independence? If not, please suggest what techniques to use or apply to fill NaN values. I will be …
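A hedged sketch of that check on a toy frame (the column names and values are made up): build a missingness indicator for one column and test whether it is independent of another observed feature.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Toy data: is 'income' more likely to be missing for one gender? (placeholder columns)
    df = pd.DataFrame({"gender": ["m", "f", "f", "m", "f", "m", "f", "m"],
                       "income": [50, np.nan, np.nan, 60, np.nan, 55, 48, np.nan]})

    # Contingency table between the missingness indicator and the other feature
    table = pd.crosstab(df["income"].isna(), df["gender"])
    chi2, p_value, dof, expected = chi2_contingency(table)

    # A small p-value is evidence that missingness depends on the other feature (against MCAR).
    # The test can suggest MAR but cannot rule out MNAR, which depends on the unobserved values themselves.
    print(p_value)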
I got a dataset that contains 50 features spanning 2009 to 2018, but one of the features has only been available since 2015 and cannot be recovered for earlier years. I am concerned that if I train a model on the whole dataset, the estimated coefficient of that sparse feature will be biased (the feature itself is not sparse; it's just that all the data from 2009-2014 is unavailable). Therefore, I would like to ask how you would deal with a feature that was …
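One common option, sketched here with placeholder names (`year` and `new_feature` are assumptions), is to keep the feature, flag the years where it is structurally missing, and fill the gap with a constant so the model can learn a separate effect for the pre-2015 period; another is to fit one model without the feature on the full range and a second with it on 2015-2018 and compare.

    import numpy as np
    import pandas as pd

    # Toy example: a feature observed only from 2015 onward.
    df = pd.DataFrame({"year": range(2009, 2019),
                       "new_feature": [np.nan] * 6 + [1.2, 0.8, 1.5, 0.9]})

    # Missingness indicator plus a constant fill keeps the column usable for the whole period
    # and lets the model attribute the pre-2015 rows to the indicator rather than a distorted coefficient.
    df["new_feature_missing"] = df["new_feature"].isna().astype(int)
    df["new_feature"] = df["new_feature"].fillna(0.0)
    print(df)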
I am trying to predict loan defaults with a fairly moderate-sized dataset. I will probably be using logistic regression and random forest. I have around 35 variables and one of them classifies the type of the client: company or authorized individual. The problem is that, for authorized individuals, some variables (such as turnover, assets, liabilities, etc) are missing, because an authorized individual should not have this stuff. Only a company can have turnover, assets, etc. What do I do in …
I have a dataset that contains several measures from various widgets on a daily basis. While the widgets remain relatively stable over time, sometimes there are legitimate reasons for one to disappear and another to appear in the data as a whole. Occasionally, a widget will just disappear and so the dataset is incomplete, invalidating the whole dataset for that day. What I am looking for is a method of comparing the current set of widgets with another set of …
I have a dataset consisting of M questionnaires and N students. Each students replied to some questionnaires. I would like to make the dataset better, by removing some questionnaires and/or some students. The goal is to optimize the dataset so we have as few "holes" as possible. To be clear, a hole in the dataset is when a student did not reply to a questionnaire. Let's say the number of "holes" in the dataset is H. We want H as …
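One simple baseline for this (a naive greedy heuristic, not an optimal algorithm): repeatedly drop whichever student or questionnaire currently has the largest fraction of holes until no holes remain, then inspect what survives. Sketch below, with a hole encoded as NaN.

    import numpy as np

    def greedy_dehole(responses):
        """responses: (students x questionnaires) array, np.nan marking a hole.
        Greedily drop the row or column with the highest fraction of holes until H == 0.
        This is a heuristic baseline; it does not guarantee the largest hole-free submatrix."""
        rows = list(range(responses.shape[0]))
        cols = list(range(responses.shape[1]))
        sub = responses.copy()
        while np.isnan(sub).any():
            row_frac = np.isnan(sub).mean(axis=1)
            col_frac = np.isnan(sub).mean(axis=0)
            if row_frac.max() >= col_frac.max():
                i = int(row_frac.argmax())
                sub = np.delete(sub, i, axis=0)
                del rows[i]
            else:
                j = int(col_frac.argmax())
                sub = np.delete(sub, j, axis=1)
                del cols[j]
        return rows, cols  # indices of the students and questionnaires that are kept

    # Tiny example: 4 students x 3 questionnaires with two holes
    data = np.array([[1, 2, np.nan],
                     [3, np.nan, 4],
                     [5, 6, 7],
                     [8, 9, 10]], dtype=float)
    print(greedy_dehole(data))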
I have a log dataset that contains 30+ features. One group of these features is of the following type: request_id, user_partyrole_id, authentication_id, user_login_key, and other such IP- and key-related features. I wonder what the best way is to handle missing values in such features, since IP addresses aren't numbers in the sense that we could calculate their mean value, for example. To give more context, the data is big: over 1 million rows. Also, can someone explain how …
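Since these are identifier-like categorical features rather than quantities, one common option (sketched with placeholder values) is to make "missing" its own category, or to replace the raw ID with an aggregate such as its frequency:

    import pandas as pd

    # Toy frame with an ID-like categorical feature.
    df = pd.DataFrame({"authentication_id": ["a17", None, "a17", "b02", None, "c33"]})

    # Option 1: treat "missing" as its own level instead of imputing a numeric value.
    df["authentication_id_filled"] = df["authentication_id"].fillna("MISSING")

    # Option 2: frequency encoding -- replace each ID by how often it occurs, with 0 for missing.
    counts = df["authentication_id"].value_counts()
    df["authentication_id_freq"] = df["authentication_id"].map(counts).fillna(0)
    print(df)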
I am currently working with a bunch of classification models, especially logistic regression, KNN, Naive Bayes, SVM, and decision trees, for my machine learning class. I know how to find and remove the missing values and the outliers. But I would like to know which of the above models would perform really badly if the outliers and missing values are not removed. Like, if I decide to leave the outliers and missing values in the dataset, which model should …
I'm running an LM model using the LMest package available in R. The dataset contains NO missing values.

    pct_miss(df_long)
    [1] 0
    n_miss(df_long)
    [1] 0

The lmest function with no covariates works fine. However, when I added covariates in the latentFormula, I got the following error message:

    Error in lmest(responsesFormula = responseA + responseB + responseC + responseD ~ :
      missing data in the covariates affecting the initial probabilities are not allowed

My code follows:

    LMmodel <- lmest(responsesFormula = responseA + …
I am working on a house pricing model, and I have a feature with values 0 or 1 indicating whether the rent price is capped by the government (houses with capped rents sell for much lower on average). When the rent is indeed capped, there is a second feature with the cap value. How do I deal with this second feature, knowing that it's missing for more than 80% of the data? Thanks in advance.
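Because the cap value only exists when the flag is 1, the values are structurally missing rather than unknown; one hedged option (placeholder column names below) is to fill the uncapped rows with a constant and rely on the 0/1 flag, possibly plus a flag-times-cap interaction, to tell the model when the cap applies:

    import numpy as np
    import pandas as pd

    # Toy frame: 'rent_capped' is the 0/1 flag, 'cap_value' only exists when the flag is 1.
    df = pd.DataFrame({"rent_capped": [1, 0, 0, 1, 0],
                       "cap_value": [850.0, np.nan, np.nan, 1200.0, np.nan]})

    # The cap is not "missing at random" for uncapped houses -- it simply does not exist,
    # so a constant fill keeps the column usable while the flag marks where it is meaningful.
    df["cap_value"] = df["cap_value"].fillna(0.0)
    df["capped_x_capvalue"] = df["rent_capped"] * df["cap_value"]
    print(df)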
I'm working with longitudinal data for a series of patients. Duration of follow-up at the patient level is non-uniform. Patients can either experience a discrete event (e.g., a heart attack) or never experience the event. This feature is of course binary. Additionally, patients that have experienced an event (e.g., the first heart attack) can also continue to experience more events (e.g., subsequent heart attacks). Each event is anchored to an event date which will be compared to when the patient was …