Is there a way to forecast a time series multiple linear regression using externally made dummy variables?

This question concerns exercise 4h of this textbook. It asks for future predictions from a chosen TSLM model that involves an exogenously made dummy variable (if I'm using the term right) based on certain time points. My main code is as follows. The main problem I've encountered is that when I use forecast() on my model, it gives an error message. This is very confusing, because shouldn't my modified data already include the dummy variables? Hence, the model …
Category: Data Science
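The question is about R's TSLM/forecast(), but the underlying issue is language-independent: a regression with a hand-made dummy regressor can only forecast if the dummy is also constructed for the *future* time points and passed in with the new data. A minimal Python sketch of the same situation, using synthetic data and scikit-learn rather than the R API from the question:

```python
# Sketch: forecasting a linear trend model with an externally made
# intervention dummy. The future design matrix must contain future values
# of the dummy, too - omitting them is what typically triggers the error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t = np.arange(100)
dummy = (t >= 60).astype(float)          # hand-made dummy: "on" from t=60
y = 2.0 + 0.1 * t + 5.0 * dummy + rng.normal(0, 0.1, 100)

X = np.column_stack([t, dummy])
model = LinearRegression().fit(X, y)

# Forecast horizon: supply future time points AND future dummy values.
t_future = np.arange(100, 110)
dummy_future = np.ones(10)               # the dummy stays "on" after t=60
X_future = np.column_stack([t_future, dummy_future])
forecast = model.predict(X_future)
```

In the R fable workflow the same idea applies: the `new_data` tsibble handed to `forecast()` must carry a column with the future dummy values.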

what would be the correct representation of categorical variables like sex?

I'm unsure about the right way to use or represent categorical variables with only two values, like "sex". I have checked different sources, but I was not able to find any solid reference. For example, if I have the variable sex, I usually see it in this form:

id  sex
1   male
2   female
3   female
4   male

I found that one can use dummy variables like this (https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/) and also …
Category: Data Science
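For a binary categorical, a single 0/1 indicator column is enough; two full dummy columns would be redundant because each is exactly 1 minus the other. A small sketch with the table from the question:

```python
# A binary categorical like "sex" needs only one 0/1 column.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "sex": ["male", "female", "female", "male"]})

# Option 1: explicit mapping to a single indicator column
df["sex_male"] = (df["sex"] == "male").astype(int)

# Option 2: get_dummies with drop_first=True yields the same single column
dummies = pd.get_dummies(df["sex"], prefix="sex", drop_first=True)
```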

How to deal with a potentially multi-valued categorical variable

I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem for which I cannot see an obvious best answer. For each individual we can have multiple instances of the categorical variable $X$. When such cases happen with numerical variables, I usually take the max/mean/min depending on context. …
Category: Data Science

Dummy Predictors / Continuous Dependent Variable

I have a dataset with 50+ dummy-coded variables that represent the purchases of an individual customer. Columns represent the products and the cell values are 0 or 1, depending on whether the product has been purchased in the past by this customer (rows = customers) or not. Now I want to predict how these purchases predict customer loyalty (a continuous variable in years). I am struggling to find an appropriate prediction model. What would you suggest for such a prediction …
Category: Data Science
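A plain (regularized) linear regression is a reasonable baseline here: with 0/1 predictors and a continuous target, each coefficient is the estimated change in loyalty years associated with having bought that product. A sketch on synthetic data (the product effects below are invented):

```python
# Baseline: regularized regression on binary purchase indicators.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10)).astype(float)  # 10 products, 0/1
true_effects = np.linspace(0.0, 2.0, 10)              # synthetic "effects"
y = 1.0 + X @ true_effects + rng.normal(0, 0.2, 200)  # loyalty in years

model = Ridge(alpha=1.0).fit(X, y)
```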

Inconsistency between y and X sizes in the train/test split

I am new to the field of data science and need help with the following: I am working on a data set that consists of both categorical and numerical values. First I concatenated the two files (train and test) to apply the EDA steps to them; then I did the EDA steps on the full data set, applied one-hot encoding, and split the data. I am getting the following message; it seems that there is an inconsistency between …
Category: Data Science
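A common cause of "inconsistent numbers of samples" errors is encoding the concatenated train+test frame and then splitting it back at the wrong boundary, or dropping rows in X but not in y. Splitting by the original row count keeps X and y aligned. Column names below are hypothetical:

```python
# Encode train+test together, then split back by the original train length.
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"], "target": [1, 0, 1]})
test = pd.DataFrame({"color": ["green", "blue"]})

combined = pd.concat([train.drop(columns="target"), test], ignore_index=True)
combined_encoded = pd.get_dummies(combined)

n_train = len(train)
X_train = combined_encoded.iloc[:n_train]   # same length as y_train
X_test = combined_encoded.iloc[n_train:]
y_train = train["target"]
```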

Dummy variables for unseen data in R

I have the following problem: when I trained my model, I created my dummy variables (before the train-test split) in the following way:

dummy <- dummyVars(formula = CLASS_INV ~ ., data = campaign_spending_final_imputed, fullRank = TRUE)
dummy %>% saveRDS('model/dummy.rds') # I save it to use it later
campaign_spending_final_dummy <- predict(dummy, newdata = campaign_spending_final_imputed) %>%
  as.data.frame() %>%
  mutate(CLASS_INV = campaign_spending_final$CLASS_INV)

The model was trained and tested successfully. Now I want to test it on 'real world' data and I want to create dummy variables …
Category: Data Science

Pandas get_dummies() rows dropping after joining back with X

I'm having an issue that I can't explain and am hoping I am missing something simple. I have a large dataset of shape (45M+, 51) and am loading it in for some analyses (classifiers, deep learning, basically just trying a few different things as research for work). I take a few steps when I load it in: dropna() to get rid of all rows with an NA (only about 6K out of the 45M); use pandas get_dummies() to change a …
Category: Data Science
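Rows often appear to "drop" after joining dummies back because pandas joins align on the index, and an earlier dropna() leaves gaps in it. Resetting the index after dropna (or building the dummies from the same cleaned frame) keeps everything aligned. A minimal reproduction with toy data:

```python
# Index misalignment after dropna() is the usual culprit for lost rows.
import pandas as pd

df = pd.DataFrame({"cat": ["a", None, "b", "a"], "val": [1.0, 2.0, 3.0, 4.0]})
clean = df.dropna().reset_index(drop=True)   # index is 0..n-1 again

dummies = pd.get_dummies(clean["cat"], prefix="cat")
joined = pd.concat([clean.drop(columns="cat"), dummies], axis=1)
```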

How to get dummy variables from "first name"

I intend to predict the age of customers using some features. There are some categorical features that I need to convert to dummy variables before the modelling stage. Since the datasets are so big (millions of rows), when I used StringIndexer in pyspark to get dummies from first names, I got the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 399, …
Category: Data Science
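A feature like "first name" has near-unbounded cardinality, so one-hot encoding it blows up both the feature space and (as here) the Spark job. A common alternative is the hashing trick: map each name into a fixed number of buckets, which is what Spark's FeatureHasher does. A pure-Python sketch of the idea (bucket count is arbitrary):

```python
# The hashing trick: fixed-width encoding for an unbounded category set.
import hashlib

N_BUCKETS = 1024

def hash_bucket(name: str, n_buckets: int = N_BUCKETS) -> int:
    """Map an arbitrary string to one of n_buckets stable bucket ids."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

buckets = [hash_bucket(n) for n in ["Alice", "Bob", "Alice"]]
```

Whether a first name should be a predictor of age at all is a separate question worth asking before engineering it.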

Should I create a single feature for each specific word I find in text, or one for all of them?

I am doing feature engineering right now for my classification task. In my dataframe I have a column with text messages. I decided to create a binary feature which depends on whether or not the text contains the words "call", "phone", "mobile", "@gmail", "mail", or "facebook". But now I wonder whether I should create separate binary features for each word (or group of words), or one for all of them. How can I check which solution is better? Is there any metric and …
Category: Data Science
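Both designs take only a few lines, so one practical answer is to build both and compare them with cross-validation on the actual classifier. A sketch with made-up messages:

```python
# Per-word binary features vs. one combined flag for any keyword.
import pandas as pd

keywords = ["call", "phone", "mobile", "@gmail", "mail", "facebook"]
msgs = pd.Series(["call me later", "see you at lunch", "my mail is down"])

per_word = pd.DataFrame(
    {f"has_{w}": msgs.str.contains(w, regex=False).astype(int) for w in keywords}
)
combined = (per_word.sum(axis=1) > 0).astype(int)  # one flag for any keyword
```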

What is the effect on a machine learning regression model if the dataset has two identical columns?

What will be the effect on a machine learning model if the dataset has two identical columns (an exact correlation of 1)? One thing that comes to mind is that if two columns are exactly the same, it is like multiplying the column by 2, so in a regression model it may be okay since the model can take care of that by halving the coefficient. So here is the second question: if models can work fine, then …
Category: Data Science
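Whatever a given model does with them, exact duplicate columns add no information, so the safe move is to detect and drop them up front. One pandas idiom:

```python
# Drop columns whose values exactly duplicate an earlier column.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})

# T.duplicated() flags columns (rows of the transpose) seen before
deduped = df.loc[:, ~df.T.duplicated()]
```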

How to obtain original feature names after using one-hot encoding

This question is about an implementation aspect of scikit-learn's DecisionTreeClassifier(). How do I get the feature names, ranked in descending order, from the feature_importances_ returned by the scikit-learn DecisionTreeClassifier()? The problem is that the input features to the classifier are not the original ones - they are numerically encoded ones from pandas DataFrame get_dummies. For example, I take the mushroom dataset from the UCI repository. Features in the dataset include cap_shape, cap_surface, cap_color, odor, etc. pandas get_dummies encodes …
Category: Data Science
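Since get_dummies names encoded columns "&lt;feature&gt;_&lt;level&gt;", importances can be summed back per original feature by matching each column to its source feature. One caveat: the default "_" separator is ambiguous when feature names themselves contain underscores (like cap_color), so matching against the known list of original names is safer than naive splitting. Importance values below are invented for illustration:

```python
# Aggregate dummy-column importances back to the original feature names.
import pandas as pd

importances = pd.Series({
    "odor_foul": 0.40, "odor_none": 0.25,
    "cap_color_red": 0.20, "cap_color_white": 0.15,
})

original = ["odor", "cap_color"]  # known original feature names

def source_feature(col: str) -> str:
    # longest original-feature prefix followed by the "_" separator
    matches = [f for f in original if col.startswith(f + "_")]
    return max(matches, key=len)

by_feature = importances.groupby(source_feature).sum().sort_values(ascending=False)
```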

Use dummy variables to create a rank variable (R)

I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable from the dummy variables in a hierarchical way. For example, in my sample data below the rank of my variables is as follows: Medical, NoID, and Refuse. E.g., if a visit was canceled due to both medical and lack-of-ID reasons, I would like to recode …
Category: Data Science
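The question is asked for R, but the pattern is the same in any language: evaluate the conditions in priority order and let the first true one win, which enforces the Medical > NoID > Refuse hierarchy. A Python sketch with np.select (column names guessed from the question):

```python
# First true condition wins, implementing the hierarchy of reasons.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Medical": [1, 0, 0, 0],
                   "NoID":    [1, 1, 0, 0],
                   "Refuse":  [0, 1, 1, 0]})

df["reason"] = np.select(
    [df["Medical"] == 1, df["NoID"] == 1, df["Refuse"] == 1],
    ["Medical", "NoID", "Refuse"],
    default="Other",
)
```

In R the equivalent is dplyr's case_when(), which also short-circuits on the first matching condition.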

Dummy Variables and Scaling in Regression Problems

I was wondering if having dummy variables and scaling the other variables could hurt my model. In particular, I have implemented a Random Forest Regressor using scikit-learn, but my data consists of a set of dummy variables and 2 numerical variables. I approached it this way:

1. Convert categorical variables into dummy variables
2. Separate the numerical variables
3. Scale the numerical variables (from step 2) with scikit-learn's StandardScaler
4. Join the dummies and the numerical variables
5. Split into train and test sets
6. Train the model …
Category: Data Science
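Worth noting that a Random Forest does not require scaling at all (tree splits are invariant to monotonic transformations). But if you do scale, the separate/scale/join steps above collapse into one ColumnTransformer that standardizes only the numeric columns and passes the dummies through. Column names below are illustrative:

```python
# Scale only the numeric columns; leave the dummy columns untouched.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"num1": [1.0, 2.0, 3.0],
                   "num2": [10.0, 20.0, 30.0],
                   "cat_a": [1, 0, 0],
                   "cat_b": [0, 1, 1]})

ct = ColumnTransformer(
    [("scale", StandardScaler(), ["num1", "num2"])],
    remainder="passthrough",  # dummies pass through unchanged
)
X = ct.fit_transform(df)
```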

In which cases shouldn't we drop the first level of categorical variables?

As a beginner in machine learning, I'm looking into the one-hot encoding concept. Unlike in statistics, where you always want to drop the first level to have k-1 dummies (as discussed here on SE), it seems that some models need to keep it and have k dummies. I know that having k levels can lead to collinearity problems, but I'm not aware of any problem caused by having k-1 levels. But since pandas.get_dummies() has its drop_first argument set to False by default, this …
Category: Data Science
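The mechanics of the two encodings in a few lines: drop_first=True removes one level per feature, leaving k-1 columns, and the dropped level becomes the baseline encoded as all zeros.

```python
# k dummies vs. k-1 dummies for a 3-level categorical.
import pandas as pd

s = pd.Series(["red", "green", "blue", "green"])

full = pd.get_dummies(s)                      # k = 3 columns
reduced = pd.get_dummies(s, drop_first=True)  # k - 1 = 2 columns; "blue"
                                              # (first alphabetically) is
                                              # the all-zeros baseline
```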

What exactly is a dummy trap? Is dropping one dummy feature really a good practice?

So I'm going through a machine learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the info in the dropped column can be inferred from the other columns, we don't really lose anything by doing that. However, this course does not explain what the dummy trap exactly is, nor does it give any examples of how the trap manifests itself. At first I assumed that …
Category: Data Science
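The trap can be made concrete in a few lines: with all k dummies present, the columns sum to exactly 1 in every row, i.e. they perfectly reproduce the intercept column, so a linear model's design matrix is rank-deficient and its coefficients are no longer uniquely determined.

```python
# The dummy trap as rank deficiency of the design matrix.
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "b"])
D = pd.get_dummies(s).astype(float)

row_sums = D.sum(axis=1)           # always exactly 1.0: dummies sum to the
                                   # intercept column
X = np.column_stack([np.ones(len(s)), D.to_numpy()])
rank = np.linalg.matrix_rank(X)    # 3, not 4: perfect collinearity
```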

Should I include all dummy variables or N-1 dummy variables (keep one as reference) in neural networks

I have a categorical variable with N factor levels (e.g. gender has two levels) in a classification problem. I have converted it into dummy variables (male and female). I have to use a neural network (nnet) to classify. I have two options:

1. Include N-1 dummy variables in the input data (e.g. include either male or female). In statistical models, we use N-1 dummy variables.
2. Include all N dummy variables (e.g. include both male and female).

Can someone please highlight the …
Category: Data Science

How do tree-based methods deal with missing feature columns?

All, I have trained a model using xgboost. Some of the features are one-hot encoded, e.g. currency, where it is either GBP or USD. It seems that when I output the feature importance, GBP and USD were in 7th and 8th place respectively. Now I would like to use the model to predict defaulter or not on Australian data; however, the currency for these is AUD. Therefore, when I apply my feature engineering, it will create a column …
Category: Data Science
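The mechanical fix is to record the training-time dummy columns and reindex any new data to that exact column set, filling missing columns with 0; an unseen category like AUD then simply becomes all-zero currency dummies. Column names below are illustrative:

```python
# Align new data's dummy columns to the training-time column set.
import pandas as pd

train = pd.get_dummies(pd.DataFrame({"currency": ["gbp", "usd", "gbp"]}))
train_columns = train.columns            # currency_gbp, currency_usd

new = pd.get_dummies(pd.DataFrame({"currency": ["aud"]}))
new_aligned = new.reindex(columns=train_columns, fill_value=0)
```

Note the statistical caveat: the model never saw an all-zero currency row during training, so predictions for AUD data rest on extrapolation; retraining with AUD examples is the sounder fix.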

Problem with converting string to dummy variables

I'm new to data science. I have data I want to work on; I omitted extra columns and converted it to 4 columns (Product, Date, Market, Demand). In this data, Product and Market are strings, and I know that to work on this data I must convert them. I want to convert the strings to dummy variables, but this isn't practical because I have 64 fruits in the Product column. I am confused and I don't know what can …
Category: Data Science

How to handle fixed values for variables in pre-processing

I have a dataset which contains a few variables whose values do not change. Some of these variables are non-numeric, and a few are real-valued but constant (for example, every value of a variable is 5). When standardizing the variables so that each has zero mean and variance 1, these variables give NaN values. Is it therefore okay to exclude such constant-valued variables (irrespective of being categorical or real-valued) from the …
Category: Data Science
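Yes: a constant column carries zero information and breaks standardization (its standard deviation is 0, so dividing by it produces NaN). Detecting constants with nunique() works regardless of dtype:

```python
# Find and drop constant columns, numeric or not.
import pandas as pd

df = pd.DataFrame({"a": [5, 5, 5], "b": ["x", "x", "x"], "c": [1.0, 2.0, 3.0]})

constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
reduced = df.drop(columns=constant_cols)
```

scikit-learn's VarianceThreshold does the same for numeric matrices.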

OneHotEncoder and Random Forest

In a Random Forest context, do I need to set up dummies/OneHotEncoder in a dataset where features/variables are numerical but refer to some kind of category? Let's say I have the following variables, where Y is the variable I want to predict and the X's are features. I will focus on X1. It is numerical but refers to a specific category (i.e. 1 refers to math, 2 to literature and 3 to history). Do I need to apply OneHotEncoder (or the dummy approach) for …
Category: Data Science
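The point of encoding here is that 1/2/3 for math/literature/history is an arbitrary labeling, not a quantity; trees can often split around such codes, but one-hot encoding removes the fake ordering entirely. A minimal sketch of encoding X1:

```python
# Integer category codes -> readable dummies, removing the fake ordering.
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 3, 1]})  # 1=math, 2=literature, 3=history

encoded = pd.get_dummies(
    df["X1"].map({1: "math", 2: "literature", 3: "history"}),
    prefix="subject",
)
```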

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.