This question concerns question 4h of this textbook exercise. It asks for future predictions from a chosen TSLM model that involves a dummy variable created endogenously (if I'm using that term right) from certain time points. My main code is as follows. The main problem I've encountered is that when I use forecast() on my model, it gives an error message. This is very confusing, because shouldn't my modified data already include the dummy variables? Hence, the model …
I have a doubt about the right way to use or represent categorical variables with only two values, like "sex". I have checked different sources, but I was not able to find any solid reference. For example, if I have the variable sex, I usually see it in this form: id sex: 1 male, 2 female, 3 female, 4 male. So I found that one can use dummy variables like this (https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/) and also …
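A minimal sketch of the two usual representations for a two-level variable, using hypothetical data matching the example above (the column names `id` and `sex` are from the question; everything else is illustrative):

```python
import pandas as pd

# Hypothetical frame matching the example in the question.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "sex": ["male", "female", "female", "male"]})

# Option 1: a single 0/1 indicator, which is enough for a two-level variable.
df["sex_male"] = (df["sex"] == "male").astype(int)

# Option 2: the same thing via get_dummies; drop_first=True keeps
# k-1 = 1 column for the k = 2 levels.
dummies = pd.get_dummies(df[["sex"]], drop_first=True)
print(dummies)
```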
I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem for which I cannot see the obvious best answer. For each individual we can have multiple instances of this categorical variable $X$. When such cases happen with numerical variables I usually take the max/mean/min depending on context. …
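One hedged sketch of the categorical analogue of that max/mean/min aggregation, assuming long-format data with hypothetical columns `id` and `X`: dummy-encode each instance, then aggregate per individual.

```python
import pandas as pd

# Hypothetical long-format data: several rows (instances of X) per individual.
df = pd.DataFrame({"id": [1, 1, 2, 2, 2, 3],
                   "X":  ["a", "b", "a", "a", "c", "b"]})

# Dummy-encode each instance, then aggregate per individual.
# max() yields a multi-hot vector ("did this level ever occur?");
# mean() would instead give the share of instances with that level.
multi_hot = pd.get_dummies(df["X"]).groupby(df["id"]).max()
print(multi_hot)
```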
I have a dataset with 50+ dummy-coded variables that represent the purchases of an individual customer. Columns represent the products, and the cell values are 0 or 1, depending on whether the product has been purchased in the past by this customer (rows = customers) or not. Now I want to predict how well these purchases predict the loyalty of the customer (a continuous variable in years). I am struggling to find an appropriate prediction model. What would you suggest for such a prediction …
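A minimal sketch of one baseline for this setup, on synthetic stand-in data (the shapes and the choice of Ridge are assumptions, not part of the question): a regularized linear regression handles many correlated 0/1 columns more gracefully than plain OLS.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 customers, 50 binary purchase indicators,
# loyalty in years as the continuous target.
X = rng.integers(0, 2, size=(1000, 50))
true_w = rng.normal(size=50)
y = X @ true_w + rng.normal(scale=0.5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("R^2 on held-out customers:", model.score(X_te, y_te))
```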
I am new to the field of data science and need help with the following: I am working on a data set that consists of both categorical and numerical values. First I concatenated the two files (train and test) to apply the EDA steps to them, then I did the EDA steps on the resulting data set, applied one-hot encoding, and split the data. I am getting the following message; it seems that there is an inconsistency between …
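This kind of inconsistency typically appears when train and test are dummy-encoded separately and end up with different columns. A minimal sketch of one common fix, with hypothetical frames standing in for the question's data:

```python
import pandas as pd

# Hypothetical stand-ins for the question's train and test sets.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "yellow"]})  # unseen level "yellow"

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Force the test matrix onto the training columns: unseen levels are
# dropped, missing levels are added as all-zero columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_train.columns.tolist() == X_test.columns.tolist())  # True
```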
I've got the following problem: when I trained my model, I created my dummy variables (before the train-test split) in the following way: dummy <- dummyVars(formula = CLASS_INV ~ ., data = campaign_spending_final_imputed, fullRank = TRUE); dummy %>% saveRDS('model/dummy.rds') # I save it to use later; campaign_spending_final_dummy <- predict(dummy, newdata = campaign_spending_final_imputed) %>% as.data.frame() %>% mutate(CLASS_INV = campaign_spending_final$CLASS_INV). The model was trained and tested successfully. Now I want to test it on 'real world' data, and I want to create dummy variables …
I'm having an issue that I can't explain and am hoping I am missing something simple. I have a large dataset of shape (45 million+, 51) and am loading it in for some analyses (classifiers, deep learning; basically just trying a few different things as research for work). I take a few steps when I load it in: dropna() to get rid of all rows with an NA (only about 6K out of the 45M); use pandas get_dummies() to change a …
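At that row count, dense dummies are often the memory bottleneck. A hedged sketch of the same two steps on a small stand-in frame, using get_dummies' `sparse` and `dtype` options to shrink the result (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 45M-row frame described in the question.
df = pd.DataFrame({"cat": ["a", "b", "c"] * 1000,
                   "num": np.arange(3000)})

df = df.dropna()

# sparse=True stores each dummy as a SparseArray, which can cut memory
# dramatically when most entries are 0; uint8 avoids 64-bit integers.
dummies = pd.get_dummies(df, columns=["cat"], sparse=True, dtype=np.uint8)
print(dummies.dtypes)
```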
I intend to predict the age of customers using some features. There are some categorical features that I need to convert to dummy variables before the modelling stage. Since the datasets are so big (millions of rows), when I used StringIndexer in PySpark to get dummies from first names, I got the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 399, …
I am doing feature engineering right now for my classification task. In my dataframe I have a column with text messages. I decided to create a binary feature that depends on whether or not the text contains the words "call", "phone", "mobile", "@gmail", "mail", "facebook". But now I wonder: should I create separate binary features for each word (or group of words), or one for all of them? How can I check which solution is better? Is there any metric and …
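A minimal sketch of both options side by side, with a hypothetical `text` column (checking which is better would then come down to cross-validating a model on each feature set):

```python
import pandas as pd

df = pd.DataFrame({"text": ["call me tomorrow",
                            "check your mail",
                            "nothing relevant here"]})

words = ["call", "phone", "mobile", "@gmail", "mail", "facebook"]

# Option 1: one binary feature for the whole group of words.
pattern = "|".join(words)
df["contact_any"] = df["text"].str.contains(pattern, case=False).astype(int)

# Option 2: a separate binary feature per word; a model (or feature
# selection under cross-validation) can then decide which ones matter.
for w in words:
    df[f"has_{w}"] = df["text"].str.contains(w, case=False).astype(int)

print(df)
```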
What will be the effect on a machine learning model if the dataset has two exactly identical columns (a correlation of exactly 1)? One thing that comes to mind is that if two columns are exactly the same, it is as if we are multiplying the column by 2, so a regression model should be okay, since it can take care of the problem by dividing the coefficient by 2. So here is the second question: if models can work fine, then …
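A small sketch of what actually happens to the coefficients in that case (synthetic data; the specific split of the weight depends on the solver, and scikit-learn's least-squares solver returns the minimum-norm solution):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)

# Duplicate the column exactly (correlation 1).
X = np.hstack([x, x])

# OLS has no unique solution here; the minimum-norm solution splits
# the weight evenly across the two identical copies.
model = LinearRegression().fit(X, y)
print(model.coef_)  # roughly [1.5, 1.5] instead of [3, 0]
```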
This question is about an implementation aspect of scikit-learn's DecisionTreeClassifier(). How do I get the feature names ranked in descending order from the feature_importances_ returned by DecisionTreeClassifier()? The problem is that the input features to the classifier are not the original ones; they are the numerically encoded ones from pandas' get_dummies. For example, take the mushroom dataset from the UCI repository. Features in the dataset include cap_shape, cap_surface, cap_color, odor, etc. pandas' get_dummies encodes …
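Since get_dummies returns a DataFrame, its column names line up positionally with feature_importances_, so zipping and sorting recovers readable names. A minimal sketch with toy stand-in data for the mushroom features:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the dummy-encoded mushroom features.
X = pd.get_dummies(pd.DataFrame({
    "cap_shape": ["bell", "flat", "bell", "convex"],
    "odor":      ["almond", "foul", "foul", "none"],
}))
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Columns and importances share the same order, so zip them and sort.
ranked = sorted(zip(X.columns, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```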
I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable from the dummy variables in a hierarchical way. For example, in my sample data below the rank of my variables is as follows: Medical, NoID, and Refuse. E.g., if a visit was canceled due to both medical and lack-of-ID reasons, I would like to recode …
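One way to express that hierarchy, sketched in pandas with hypothetical 0/1 columns named after the question's reasons: np.select evaluates its conditions in order, so listing them by rank makes the result mutually exclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical multiple-response dummies; a visit can have several reasons.
df = pd.DataFrame({"Medical": [1, 0, 0, 1],
                   "NoID":    [1, 1, 0, 0],
                   "Refuse":  [0, 1, 1, 0]})

# Conditions are checked in rank order (Medical > NoID > Refuse), so the
# first match wins and the recoded variable is mutually exclusive.
df["reason"] = np.select(
    [df["Medical"] == 1, df["NoID"] == 1, df["Refuse"] == 1],
    ["Medical", "NoID", "Refuse"],
    default="Other",
)
print(df)
```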
I was wondering whether having dummy variables while scaling the other variables could harm my model. In particular, I have implemented a Random Forest regressor using scikit-learn, but my data model consists of a set of dummy variables and 2 numerical variables. I approached it this way: convert the categoricals into dummy variables; separate the numerical variables; scale the numerical variables (from the previous step) with scikit-learn's StandardScaler; join the dummies and the numerical variables; split into train and test; train the model …
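A hedged sketch of the same steps expressed as a single pipeline (column positions and data are invented for illustration); ColumnTransformer scales only the numeric columns and passes the dummies through untouched. For tree ensembles such as Random Forests, the scaling is harmless but also unnecessary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Columns 0-1 numeric, columns 2-4 hypothetical 0/1 dummies.
X = np.hstack([rng.normal(size=(200, 2)),
               rng.integers(0, 2, size=(200, 3))])
y = rng.normal(size=200)

# Scale only the numeric columns; leave the dummies as they are.
pre = ColumnTransformer([("num", StandardScaler(), [0, 1])],
                        remainder="passthrough")
model = Pipeline([("pre", pre),
                  ("rf", RandomForestRegressor(random_state=0))]).fit(X, y)
print(model.score(X, y))
```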
As a beginner in machine learning, I'm looking into the one-hot encoding concept. Unlike in statistics, where you always want to drop the first level to have k-1 dummies (as discussed here on SE), it seems that some models need to keep it and have k dummies. I know that having k levels can lead to collinearity problems, but I am not aware of any problem caused by having k-1 levels. But since pandas.get_dummies() has its drop_first argument set to False by default, this …
So I'm going through a machine learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the information in the dropped column can be inferred from the other columns, we don't really lose anything by doing that. This course does not explain what the dummy trap exactly is, however, nor does it give any examples of how the trap manifests itself. At first I assumed that …
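For the record, a tiny sketch of how the trap manifests concretely (invented two-level example): with an intercept, the k dummy columns sum exactly to the constant column, so the design matrix is rank-deficient and the regression coefficients are not uniquely determined.

```python
import numpy as np

sex = np.array(["male", "female", "female", "male"])

# k = 2 dummies plus an intercept column.
male = (sex == "male").astype(float)
female = (sex == "female").astype(float)
intercept = np.ones(4)

X_full = np.column_stack([intercept, male, female])
X_drop = np.column_stack([intercept, male])  # k-1 coding

# male + female always equals the intercept column, so the full matrix
# is rank-deficient (the "trap"); dropping one level restores full rank.
print(np.linalg.matrix_rank(X_full))  # 2, not 3
print(np.linalg.matrix_rank(X_drop))  # 2 == number of columns
```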
I have a categorical variable with N factor levels (e.g. gender has two levels) in a classification problem. I have converted it into dummy variables (male and female). I have to use a neural network (nnet) for classification. I have two options: include only N-1 dummy variables in the input data (e.g. include either male or female), as we do in statistical models; or include all N dummy variables (e.g. include both male and female). Can someone please highlight the …
All, I have trained a model using xgboost. Some of the features are one-hot encoded, e.g. currency, where it is either GBP or USD. It seems that when I output the feature importance, GBP and USD were in 7th and 8th place respectively. Now I would like to use the model to predict defaulter or not on Australian accounts; however, the currency for these is AUD. Therefore, when I apply my feature engineering it will create a column …
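One hedged sketch of how this unseen-level situation is often handled at encoding time, using scikit-learn's OneHotEncoder rather than whatever encoder the original pipeline used (the data here is invented): with handle_unknown="ignore", a new level such as AUD is encoded as an all-zero row instead of raising, which keeps the feature matrix the trained model expects. Whether all-zero currency dummies are acceptable for AUD rows is a separate modelling question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["gbp"], ["usd"], ["gbp"]])
new = np.array([["aud"], ["usd"]])

# Unseen levels (AUD) become an all-zero row rather than an error.
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
print(enc.transform(new).toarray())
```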
I'm new to data science. I have data I want to work on; I omitted the extra columns and converted it to 4 columns (Product, Date, Market, Demand). In these data, Product and Market are strings, and I know that to work on these data I must convert them. I want to convert the strings to dummy variables, but this doesn't seem sensible because I have 64 fruits in the Product column. I am confused and I don't know what I can …
I have a dataset that contains a few variables whose values do not change. Some of these variables are non-numeric (for example, every value of one variable is the value 5), and a few are real-valued but likewise all the same. When standardizing the variables so that each has zero mean and variance 1, these variables give NaN values. Therefore, is it OK to exclude such variables (whether categorical or real-valued) that contain constant values from the …
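The NaNs arise because a constant column has a standard deviation of 0, so standardization computes 0/0. A minimal sketch of two ways to drop such columns first (toy data; column names are invented):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [5, 5, 5, 5],          # constant, numeric
                   "b": ["x", "x", "x", "x"],  # constant, non-numeric
                   "c": [1.0, 2.0, 3.0, 4.0]})

# Works for any dtype: keep only columns with more than one distinct value.
keep = df.loc[:, df.nunique() > 1]
print(keep.columns.tolist())  # ['c']

# Numeric-only alternative: VarianceThreshold removes zero-variance columns.
vt = VarianceThreshold(threshold=0.0)
print(vt.fit_transform(df[["a", "c"]]))
```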
In a Random Forest context, do I need to set up dummies/OneHotEncoder in a dataset where features/variables are numerical but refer to some kind of category? Let's say I have the following variables, where Y is the variable I want to predict and the X's are features. I will focus on X1. It's numerical but refers to a specific category (i.e. 1 refers to math, 2 refers to literature, and 3 to history). Do I need to apply OneHotEncoder (or the dummy approach) for …
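If one does decide to encode, the mechanical wrinkle is that get_dummies skips numeric columns by default, so an integer-coded category like X1 must be named explicitly. A minimal sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 3, 1, 2],   # 1=math, 2=literature, 3=history
                   "X2": [0.5, 1.2, 0.7, 0.9, 1.1]})

# get_dummies ignores numeric columns unless they are listed in columns=,
# so pass X1 explicitly (or cast it to a categorical/string dtype first).
encoded = pd.get_dummies(df, columns=["X1"], prefix="subject")
print(encoded.head())
```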