This question concerns question 4h of this textbook exercise. It asks for future predictions from a chosen TSLM model that involves a dummy variable created endogenously (if I'm using that term right) from certain time points. My main code is as follows. The main problem I've encountered is that when I use forecast() on my model, it gives an error message. This is very confusing, because shouldn't my modified data already include the dummy variables? Hence, the model …
I have a doubt about the right way to use or represent categorical variables with only two values, like "sex". I have checked different sources, but I was not able to find any solid reference. For example, if I have the variable sex, I usually see it in this form: id sex: 1 male, 2 female, 3 female, 4 male. So I found that one can use dummy variables like this (https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/) and also …
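A minimal sketch of the two usual representations for a two-level variable, using hypothetical data matching the example above (the column names `id` and `sex` are from the question; everything else is illustrative):

```python
import pandas as pd

# Hypothetical frame matching the example in the question.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "sex": ["male", "female", "female", "male"]})

# Option 1: a single 0/1 indicator, which is enough for a two-level variable.
df["sex_male"] = (df["sex"] == "male").astype(int)

# Option 2: the same thing via get_dummies; drop_first=True keeps
# k-1 = 1 column for the k = 2 levels.
dummies = pd.get_dummies(df[["sex"]], drop_first=True)
print(dummies)
```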
I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem for which I cannot see the obvious best answer. For each individual we can have multiple instances of this categorical variable $X$. When such cases happen with numerical variables I usually take the max/mean/min depending on context. …
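One hedged sketch of the categorical analogue of that max/mean/min aggregation, assuming long-format data with hypothetical columns `id` and `X`: dummy-encode each instance, then aggregate per individual.

```python
import pandas as pd

# Hypothetical long-format data: several rows (instances of X) per individual.
df = pd.DataFrame({"id": [1, 1, 2, 2, 2, 3],
                   "X":  ["a", "b", "a", "a", "c", "b"]})

# Dummy-encode each instance, then aggregate per individual.
# max() yields a multi-hot vector ("did this level ever occur?");
# mean() would instead give the share of instances with that level.
multi_hot = pd.get_dummies(df["X"]).groupby(df["id"]).max()
print(multi_hot)
```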
I have a dataset with 50+ dummy-coded variables that represent the purchases of an individual customer. Columns represent the products, and the cell values are 0 or 1, depending on whether the product has been purchased in the past by this customer (rows = customers) or not. Now I want to predict how well these purchases predict the loyalty of the customer (a continuous variable in years). I am struggling to find an appropriate prediction model. What would you suggest for such a prediction …
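A minimal sketch of one baseline for this setup, on synthetic stand-in data (the shapes and the choice of Ridge are assumptions, not part of the question): a regularized linear regression handles many correlated 0/1 columns more gracefully than plain OLS.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 customers, 50 binary purchase indicators,
# loyalty in years as the continuous target.
X = rng.integers(0, 2, size=(1000, 50))
true_w = rng.normal(size=50)
y = X @ true_w + rng.normal(scale=0.5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("R^2 on held-out customers:", model.score(X_te, y_te))
```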
I am new to the field of data science and need help with the following: I am working on a data set that consists of both categorical and numerical values. First I concatenated the two files (train and test) to apply the EDA steps to them, then I did the EDA steps on the resulting data set, applied one-hot encoding, and split the data. I am getting the following message; it seems that there is an inconsistency between …
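This kind of inconsistency typically appears when train and test are dummy-encoded separately and end up with different columns. A minimal sketch of one common fix, with hypothetical frames standing in for the question's data:

```python
import pandas as pd

# Hypothetical stand-ins for the question's train and test sets.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "yellow"]})  # unseen level "yellow"

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Force the test matrix onto the training columns: unseen levels are
# dropped, missing levels are added as all-zero columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_train.columns.tolist() == X_test.columns.tolist())  # True
```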
I've got the following problem: when I trained my model, I created my dummy variables (before the train-test split) in the following way: dummy <- dummyVars(formula = CLASS_INV ~ ., data = campaign_spending_final_imputed, fullRank = TRUE); dummy %>% saveRDS('model/dummy.rds') # I save it to use later; campaign_spending_final_dummy <- predict(dummy, newdata = campaign_spending_final_imputed) %>% as.data.frame() %>% mutate(CLASS_INV = campaign_spending_final$CLASS_INV). The model was trained and tested successfully. Now I want to test it on 'real world' data, and I want to create dummy variables …
I'm having an issue that I can't explain and am hoping I am missing something simple. I have a large dataset of shape (45 million+, 51) and am loading it in for some analyses (classifiers, deep learning; basically just trying a few different things as research for work). I take a few steps when I load it in: dropna() to get rid of all rows with an NA (only about 6K out of the 45M); use pandas get_dummies() to change a …
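At that row count, dense dummies are often the memory bottleneck. A hedged sketch of the same two steps on a small stand-in frame, using get_dummies' `sparse` and `dtype` options to shrink the result (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 45M-row frame described in the question.
df = pd.DataFrame({"cat": ["a", "b", "c"] * 1000,
                   "num": np.arange(3000)})

df = df.dropna()

# sparse=True stores each dummy as a SparseArray, which can cut memory
# dramatically when most entries are 0; uint8 avoids 64-bit integers.
dummies = pd.get_dummies(df, columns=["cat"], sparse=True, dtype=np.uint8)
print(dummies.dtypes)
```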
I intend to predict the age of customers using some features. There are some categorical features that I need to convert to dummy variables before the modelling stage. Since the datasets are so big (millions of rows), when I used StringIndexer in PySpark to get dummies from first names, I got the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 399, …
I am doing feature engineering right now for my classification task. In my dataframe I have a column with text messages. I decided to create a binary feature that depends on whether or not the text contains the words "call", "phone", "mobile", "@gmail", "mail", "facebook". But now I wonder: should I create separate binary features for each word (or group of words), or one for all of them? How can I check which solution is better? Is there any metric and …
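A minimal sketch of both options side by side, with a hypothetical `text` column (checking which is better would then come down to cross-validating a model on each feature set):

```python
import pandas as pd

df = pd.DataFrame({"text": ["call me tomorrow",
                            "check your mail",
                            "nothing relevant here"]})

words = ["call", "phone", "mobile", "@gmail", "mail", "facebook"]

# Option 1: one binary feature for the whole group of words.
pattern = "|".join(words)
df["contact_any"] = df["text"].str.contains(pattern, case=False).astype(int)

# Option 2: a separate binary feature per word; a model (or feature
# selection under cross-validation) can then decide which ones matter.
for w in words:
    df[f"has_{w}"] = df["text"].str.contains(w, case=False).astype(int)

print(df)
```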
What will be the effect on a machine learning model if the dataset has two exactly identical columns (a correlation of exactly 1)? One thing that comes to mind is that if two columns are exactly the same, it is as if we are multiplying the column by 2, so a regression model should be okay, since it can take care of the problem by dividing the coefficient by 2. So here is the second question: if models can work fine, then …
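A small sketch of what actually happens to the coefficients in that case (synthetic data; the specific split of the weight depends on the solver, and scikit-learn's least-squares solver returns the minimum-norm solution):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)

# Duplicate the column exactly (correlation 1).
X = np.hstack([x, x])

# OLS has no unique solution here; the minimum-norm solution splits
# the weight evenly across the two identical copies.
model = LinearRegression().fit(X, y)
print(model.coef_)  # roughly [1.5, 1.5] instead of [3, 0]
```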
This question is about an implementation aspect of scikit-learn's DecisionTreeClassifier(). How do I get the feature names ranked in descending order from the feature_importances_ returned by DecisionTreeClassifier()? The problem is that the input features to the classifier are not the original ones; they are the numerically encoded ones from pandas' get_dummies. For example, take the mushroom dataset from the UCI repository. Features in the dataset include cap_shape, cap_surface, cap_color, odor, etc. pandas' get_dummies encodes …
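Since get_dummies returns a DataFrame, its column names line up positionally with feature_importances_, so zipping and sorting recovers readable names. A minimal sketch with toy stand-in data for the mushroom features:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the dummy-encoded mushroom features.
X = pd.get_dummies(pd.DataFrame({
    "cap_shape": ["bell", "flat", "bell", "convex"],
    "odor":      ["almond", "foul", "foul", "none"],
}))
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Columns and importances share the same order, so zip them and sort.
ranked = sorted(zip(X.columns, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```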
I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable from the dummy variables in a hierarchical way. For example, in my sample data below the rank of my variables is as follows: Medical, NoID, and Refuse. E.g., if a visit was canceled due to both medical and lack-of-ID reasons, I would like to recode …
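One way to express that hierarchy, sketched in pandas with hypothetical 0/1 columns named after the question's reasons: np.select evaluates its conditions in order, so listing them by rank makes the result mutually exclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical multiple-response dummies; a visit can have several reasons.
df = pd.DataFrame({"Medical": [1, 0, 0, 1],
                   "NoID":    [1, 1, 0, 0],
                   "Refuse":  [0, 1, 1, 0]})

# Conditions are checked in rank order (Medical > NoID > Refuse), so the
# first match wins and the recoded variable is mutually exclusive.
df["reason"] = np.select(
    [df["Medical"] == 1, df["NoID"] == 1, df["Refuse"] == 1],
    ["Medical", "NoID", "Refuse"],
    default="Other",
)
print(df)
```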
I was wondering whether having dummy variables while scaling the other variables could harm my model. In particular, I have implemented a Random Forest regressor using scikit-learn, but my data model consists of a set of dummy variables and 2 numerical variables. I approached it this way: convert the categoricals into dummy variables; separate the numerical variables; scale the numerical variables (from the previous step) with scikit-learn's StandardScaler; join the dummies and the numerical variables; split into train and test; train the model …
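A hedged sketch of the same steps expressed as a single pipeline (column positions and data are invented for illustration); ColumnTransformer scales only the numeric columns and passes the dummies through untouched. For tree ensembles such as Random Forests, the scaling is harmless but also unnecessary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Columns 0-1 numeric, columns 2-4 hypothetical 0/1 dummies.
X = np.hstack([rng.normal(size=(200, 2)),
               rng.integers(0, 2, size=(200, 3))])
y = rng.normal(size=200)

# Scale only the numeric columns; leave the dummies as they are.
pre = ColumnTransformer([("num", StandardScaler(), [0, 1])],
                        remainder="passthrough")
model = Pipeline([("pre", pre),
                  ("rf", RandomForestRegressor(random_state=0))]).fit(X, y)
print(model.score(X, y))
```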
As a beginner in machine learning, I'm looking into the one-hot encoding concept. Unlike in statistics, where you always want to drop the first level to have k-1 dummies (as discussed here on SE), it seems that some models need to keep it and have k dummies. I know that having k levels can lead to collinearity problems, but I am not aware of any problem caused by having k-1 levels. But since pandas.get_dummies() has its drop_first argument set to False by default, this …
So I'm going through a machine learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the information in the dropped column can be inferred from the other columns, we don't really lose anything by doing that. This course does not explain what the dummy trap exactly is, however, nor does it give any examples of how the trap manifests itself. At first I assumed that …
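For the record, a tiny sketch of how the trap manifests concretely (invented two-level example): with an intercept, the k dummy columns sum exactly to the constant column, so the design matrix is rank-deficient and the regression coefficients are not uniquely determined.

```python
import numpy as np

sex = np.array(["male", "female", "female", "male"])

# k = 2 dummies plus an intercept column.
male = (sex == "male").astype(float)
female = (sex == "female").astype(float)
intercept = np.ones(4)

X_full = np.column_stack([intercept, male, female])
X_drop = np.column_stack([intercept, male])  # k-1 coding

# male + female always equals the intercept column, so the full matrix
# is rank-deficient (the "trap"); dropping one level restores full rank.
print(np.linalg.matrix_rank(X_full))  # 2, not 3
print(np.linalg.matrix_rank(X_drop))  # 2 == number of columns
```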
I have a categorical variable with N factor levels (e.g. gender has two levels) in a classification problem. I have converted it into dummy variables (male and female). I have to use a neural network (nnet) for classification. I have two options: include only N-1 dummy variables in the input data (e.g. include either male or female), as we do in statistical models; or include all N dummy variables (e.g. include both male and female). Can someone please highlight the …
All, I have trained a model using xgboost. Some of the features are one-hot encoded, e.g. currency, where it is either GBP or USD. It seems that when I output the feature importance, GBP and USD were in 7th and 8th place respectively. Now I would like to use the model to predict defaulter or not on Australian accounts; however, the currency for these is AUD. Therefore, when I apply my feature engineering it will create a column …
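One hedged sketch of how this unseen-level situation is often handled at encoding time, using scikit-learn's OneHotEncoder rather than whatever encoder the original pipeline used (the data here is invented): with handle_unknown="ignore", a new level such as AUD is encoded as an all-zero row instead of raising, which keeps the feature matrix the trained model expects. Whether all-zero currency dummies are acceptable for AUD rows is a separate modelling question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["gbp"], ["usd"], ["gbp"]])
new = np.array([["aud"], ["usd"]])

# Unseen levels (AUD) become an all-zero row rather than an error.
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
print(enc.transform(new).toarray())
```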
I'm new to data science. I have data I want to work on; I omitted the extra columns and converted it to 4 columns (Product, Date, Market, Demand). In these data, Product and Market are strings, and I know that to work on these data I must convert them. I want to convert the strings to dummy variables, but this doesn't seem sensible because I have 64 fruits in the Product column. I am confused and I don't know what I can …
I have a dataset that contains a few variables whose values do not change. Some of these variables are non-numeric (for example, every value of one variable is the value 5), and a few are real-valued but likewise all the same. When standardizing the variables so that each has zero mean and variance 1, these variables give NaN values. Therefore, is it OK to exclude such variables (whether categorical or real-valued) that contain constant values from the …
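The NaNs arise because a constant column has a standard deviation of 0, so standardization computes 0/0. A minimal sketch of two ways to drop such columns first (toy data; column names are invented):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [5, 5, 5, 5],          # constant, numeric
                   "b": ["x", "x", "x", "x"],  # constant, non-numeric
                   "c": [1.0, 2.0, 3.0, 4.0]})

# Works for any dtype: keep only columns with more than one distinct value.
keep = df.loc[:, df.nunique() > 1]
print(keep.columns.tolist())  # ['c']

# Numeric-only alternative: VarianceThreshold removes zero-variance columns.
vt = VarianceThreshold(threshold=0.0)
print(vt.fit_transform(df[["a", "c"]]))
```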
In a Random Forest context, do I need to set up dummies/OneHotEncoder in a dataset where features/variables are numerical but refer to some kind of category? Let's say I have the following variables, where Y is the variable I want to predict and the X's are features. I will focus on X1. It's numerical but refers to a specific category (i.e. 1 refers to math, 2 refers to literature, and 3 to history). Do I need to apply OneHotEncoder (or the dummy approach) for …
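If one does decide to encode, the mechanical wrinkle is that get_dummies skips numeric columns by default, so an integer-coded category like X1 must be named explicitly. A minimal sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 3, 1, 2],   # 1=math, 2=literature, 3=history
                   "X2": [0.5, 1.2, 0.7, 0.9, 1.1]})

# get_dummies ignores numeric columns unless they are listed in columns=,
# so pass X1 explicitly (or cast it to a categorical/string dtype first).
encoded = pd.get_dummies(df, columns=["X1"], prefix="subject")
print(encoded.head())
```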