Queries regarding feature importance for categorical features: Context: I have almost 185 categorical features, each with anywhere from 1 to 8 categories (most have 2, 3, or 4), plus nulls. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance OR feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs category) to analyse. What …
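A minimal sketch of the random-forest route, on a toy stand-in for the real data (column names and sizes here are made up; the real dataset would keep the top 60 instead of 5):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the real data: a few categorical columns with nulls.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    f"cat_{i}": rng.choice(["a", "b", "c", None], size=200) for i in range(10)
})
y = rng.integers(0, 2, size=200)

X = df.fillna("MISSING")                      # treat nulls as their own category
X_enc = OrdinalEncoder().fit_transform(X)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_enc, y)

# Rank features by impurity-based importance; with 185 real features,
# take head(60) instead of head(5).
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```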
How do I assign categorical values back to the train and test data after training and testing, using inverse_transform? The training and testing data contain encoded numerical values. So, how do I map those encoded values back to the original categorical values in the train and test datasets after training and testing? Please help me with this.
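A minimal sketch with sklearn's LabelEncoder, assuming one encoder was kept per column (the DataFrame and its values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size":  ["S", "M", "S", "L"]})

# Encode: keep one fitted encoder per column so each can be reversed later.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
print(df)           # numeric codes used for training/testing

# Decode: inverse_transform maps the codes back to the original categories.
for col, enc in encoders.items():
    df[col] = enc.inverse_transform(df[col])
print(df)           # original categorical values restored
```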
I have 2 databases with around 60,000 samples each. Both have the same features (same column names), which represent particular things as text or categories (turned into numbers). Each sample in a database is assumed to refer to a different particular thing. But some objects are represented in both databases, yet with somewhat different values in the same-name columns (like different open descriptions, or being classified as another category). The aim is to train a machine learning model …
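If the aim is matching records across the two databases, one common starting point is to turn candidate record pairs into similarity features. A sketch, with made-up column names and difflib as an illustrative string-similarity choice (not a full record-linkage pipeline):

```python
import difflib
import pandas as pd

dbb1 = pd.DataFrame({"name": ["acme corp", "foo ltd"], "category": [3, 7]})
dbb2 = pd.DataFrame({"name": ["acme corporation", "bar inc"], "category": [3, 2]})

def pair_features(row_a, row_b):
    return {
        "name_sim": difflib.SequenceMatcher(None, row_a["name"], row_b["name"]).ratio(),
        "category_match": int(row_a["category"] == row_b["category"]),
    }

# One candidate pair per combination; a real pipeline would block/filter first.
pairs = [pair_features(a, b) for _, a in dbb1.iterrows() for _, b in dbb2.iterrows()]
X = pd.DataFrame(pairs)   # feed X (plus labels for known matches) to a classifier
print(X)
```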
I have a problem. When an order comes in, I want to predict in how many days the customer will place their next order. I have already created my target variable, next_day_in_days, which specifies in how many days the customer will order again, and this is what I would like to predict. Since I have too few features, I want to do feature engineering. I would like to compute how many orders the customer has placed in the last 90 …
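A sketch of that rolling count in pandas, assuming an order table with customer_id and order_date columns (both names hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-02-20", "2023-05-01", "2023-01-10", "2023-01-15"]),
})
orders = orders.sort_values(["customer_id", "order_date"])

# For each order, count that customer's orders in the preceding 90 days
# (the window includes the current order; subtract 1 to exclude it).
orders["orders_last_90d"] = (
    orders.set_index("order_date")
          .groupby("customer_id")["customer_id"]
          .rolling("90D").count()
          .values
)
print(orders)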
I have four different feature selection techniques: Backwards Elimination, Lasso, feature_importances_, and Recursive Feature Selection. Each technique returns slightly different results. For example: Backwards Elimination: Spread Direction; Lasso: Spread Move and Spread; feature_importances_: Spread Percentage and Spread Money; Recursive: Spread Money. Is there a standard method for choosing features across different models? Should you choose the features that every model returns, or is there a preferred method for doing this?
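One simple, common heuristic (not an official standard) is to vote across methods and keep the features selected by a majority. A sketch using the results quoted in the question:

```python
from collections import Counter

selections = {
    "backwards_elimination": ["Spread Direction"],
    "lasso": ["Spread Move", "Spread"],
    "feature_importances_": ["Spread Percentage", "Spread Money"],
    "recursive": ["Spread Money"],
}

# Count how many methods picked each feature; keep those picked by >= 2.
votes = Counter(f for feats in selections.values() for f in feats)
majority = [f for f, v in votes.items() if v >= 2]
print(votes)       # Spread Money appears twice -> strongest consensus
print(majority)
```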
I have some n x m data and I want to ignore certain features. One idea I had is to mark those features as "missing", since XGBoost can handle missing values by default, e.g. using NaN when constructing the DMatrix:

```python
import numpy as np
import xgboost as xgb

n, m = 100, 10
X = np.random.uniform(size=(n, m))
y = (np.sum(X, axis=1) >= 0.5 * m).astype(int)

# ignore certain features: mark them as missing
X[:, 2:7] = np.nan

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
model = xgb.train(params={'objective': 'binary:logistic'}, dtrain=dtrain)
```

My …
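One way to sanity-check the idea, continuing from the model above: a feature that is all-NaN can never be chosen for a split, so it should not appear in the booster's importance scores.

```python
# Features f2..f6 were set to NaN everywhere, so they should be absent here.
print(model.get_score(importance_type='weight'))
```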
I need advice on the best way to represent the data below so it can be fed into an ML algorithm (yet to be decided). This is from the online order processing domain. An order consists of a variable number of items, and each item can be located in a variable number of different warehouses. The entire order, with multiple items and multiple warehouses per item, needs to be processed as one training sample. The goal is to …
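One common workaround for variable-length structures is to flatten the nested order -> items -> warehouses hierarchy into fixed-size aggregate features, so each order becomes one row. A sketch with made-up column names:

```python
import pandas as pd

lines = pd.DataFrame({            # one row per (order, item, warehouse)
    "order_id":  [1, 1, 1, 2],
    "item_id":   ["A", "A", "B", "C"],
    "warehouse": ["W1", "W2", "W1", "W3"],
    "qty":       [2, 1, 5, 3],
})

per_order = lines.groupby("order_id").agg(
    n_items=("item_id", "nunique"),
    n_warehouses=("warehouse", "nunique"),
    total_qty=("qty", "sum"),
)
print(per_order)   # one fixed-width feature vector per order
```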
I want to use VGG16 (or VGG19) for a voice clustering task. I read some articles which suggest using VGG (16 or 19) to build the embedding vector for the clustering algorithm. The process is to convert the wav file into an MFCC or an amplitude-vs-time plot and use that as the input to the VGG model. I tried it out with VGG19 (and weights='imagenet'). I got bad results, and I assume it's because I'm using VGG with the wrong weights …
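A sketch of the embedding step, assuming librosa and tensorflow are installed and 'voice.wav' is a placeholder file name. Note that ImageNet weights were learned on photos, so treating an MFCC as an image is a rough heuristic:

```python
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

signal, sr = librosa.load("voice.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)     # (40, frames)

# Scale to 0..255, replicate to 3 channels, resize to VGG's input size.
img = 255 * (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-9)
img = np.repeat(img[..., None], 3, axis=-1)
img = tf.image.resize(img, (224, 224)).numpy()

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
embedding = vgg.predict(preprocess_input(img[None, ...]))    # shape (1, 512)
# cluster the embeddings, e.g. with sklearn.cluster.KMeans
```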
I'm working on a propensity model, predicting whether customers will buy or not. While doing exploratory data analysis, I found that customers have a buying pattern: most customers repeat the purchase at a specific time interval. For example, some customers repeat purchases every four quarters, some every 8 or 12, etc. I have the purchase dates for these customers. What is the most useful feature I can create to capture this pattern in the data? I'm predicting whether in the next quarter …
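A sketch of interval-based features (column names are made up): the gap since the last purchase, the customer's typical inter-purchase gap, and their ratio. A ratio near 1 means "the customer is due again":

```python
import pandas as pd

p = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2021-01-01", "2022-01-01", "2023-01-01", "2022-06-01", "2022-09-01"]),
}).sort_values(["customer_id", "purchase_date"])

# Days between consecutive purchases, per customer.
p["gap_days"] = p.groupby("customer_id")["purchase_date"].diff().dt.days

snapshot = pd.Timestamp("2023-03-31")       # end of the quarter being scored
feats = p.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    median_gap=("gap_days", "median"),
)
feats["days_since_last"] = (snapshot - feats["last_purchase"]).dt.days
feats["due_ratio"] = feats["days_since_last"] / feats["median_gap"]
print(feats)
```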
I'm trying to solve a problem which is as follows: I need to train an autoencoder to extract useful features from text, and I will use the trained autoencoder in another model as a feature extractor. The goal is to teach the autoencoder to compress the information and then reconstruct the exact same string, so I treat it as a classification problem over each character. My dataset (a pandas Series of strings):

```
15298    some text...
1127     some text...
22270    more text...
...
Name: data, Length: 28235, dtype: object
```

…
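A minimal character-level autoencoder sketch in Keras: one-hot the characters, compress with an LSTM, then predict each character back with a softmax per position. Strings and layer sizes are toy placeholders:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

texts = ["some text", "more text"]
chars = sorted(set("".join(texts)))
idx = {c: i for i, c in enumerate(chars)}
maxlen = max(len(t) for t in texts)

# One-hot encode each string as a (maxlen, vocab) matrix.
X = np.zeros((len(texts), maxlen, len(chars)), dtype="float32")
for i, t in enumerate(texts):
    for j, c in enumerate(t):
        X[i, j, idx[c]] = 1.0

inp = layers.Input(shape=(maxlen, len(chars)))
code = layers.LSTM(32)(inp)                               # the embedding
dec = layers.RepeatVector(maxlen)(code)
dec = layers.LSTM(32, return_sequences=True)(dec)
out = layers.TimeDistributed(layers.Dense(len(chars), activation="softmax"))(dec)

ae = tf.keras.Model(inp, out)
ae.compile(optimizer="adam", loss="categorical_crossentropy")
ae.fit(X, X, epochs=5, verbose=0)          # reconstruct the input from itself

encoder = tf.keras.Model(inp, code)        # reuse this to extract features
print(encoder.predict(X).shape)            # (2, 32)
```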
Is a neural network (for example, an MLPClassifier in Python) able to learn to map completely (or very) different input feature sets to the same output class? Or is it better in this case to work with more than one output class and map those recognized output classes to the same class manually afterwards?
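A tiny demonstration of the first option: two well-separated input clusters both labeled class 1, with a third cluster in between labeled class 0. A single MLP can learn this many-regions-to-one-class mapping, with no manual class merging:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
a = rng.normal(loc=-5, scale=0.5, size=(200, 2))   # class 1, region A
b = rng.normal(loc=+5, scale=0.5, size=(200, 2))   # class 1, region B
c = rng.normal(loc=0,  scale=0.5, size=(200, 2))   # class 0, in between

X = np.vstack([a, b, c])
y = np.array([1] * 400 + [0] * 200)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
print(clf.fit(X, y).score(X, y))   # ~1.0: both regions map to class 1
```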
I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables. I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much. Each observation in my dataset can take multiple values for the CATEGORY feature; for instance, row 1 could have the value a, but row 2 could have the values a, b, c, d …
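One common technique here is the hashing trick. sklearn's FeatureHasher accepts a list of category strings per row, so multi-valued observations need no exploding, and the output width is fixed (here 2**10) regardless of the ~20,000 distinct values:

```python
from sklearn.feature_extraction import FeatureHasher

rows = [
    ["a"],                  # row 1: single value
    ["a", "b", "c", "d"],   # row 2: multiple values
]
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform(rows)          # sparse matrix, shape (2, 1024)
print(X.shape, X.nnz)
```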
I would like to train a machine learning model with several features as input, X[0]..X[4], and one output, Y. Every sample is a row like: X[0], X[1], X[2], X[3], X[4], Y. When every feature has exactly one value per sample, this is a normal machine learning problem. But now I would like X[3] to take multiple values; for example, sample 1's data is: X[0] | X[1] …
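One common workaround (a sketch with made-up values): collapse the multiple X[3] values into fixed summary statistics, so every sample is a single row again:

```python
import numpy as np
import pandas as pd

sample = {"X0": 1.0, "X1": 2.0, "X2": 3.0, "X3": [4.0, 7.0, 5.0], "X4": 6.0}

row = {k: v for k, v in sample.items() if k != "X3"}
row.update({
    "X3_mean": np.mean(sample["X3"]),   # replace the list with summaries
    "X3_min":  np.min(sample["X3"]),
    "X3_max":  np.max(sample["X3"]),
    "X3_n":    len(sample["X3"]),
})
print(pd.DataFrame([row]))   # fixed-width row usable by any standard model
```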
I have a computer-generated music project, and I'd like to classify short passages of music as "good" or "bad" via machine learning. I won't have a large training set: I'll start by generating 500 examples each of good and bad music, manually. These examples can be transposed and mirror-imaged to produce 12,000 examples each of good and bad. I have a way of extracting features from the music in an intelligent way that mimics the way a perceptive listener would …
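A sketch of the augmentation arithmetic on a toy pitch representation; "mirror-imaging" is taken here as time reversal (retrograde), though it could equally mean pitch inversion:

```python
import numpy as np

passage = np.array([60, 64, 67, 72])          # MIDI-like pitch numbers

augmented = []
for shift in range(12):                       # transpose to all 12 keys
    t = passage + shift
    augmented.append(t)                       # transposed contour
    augmented.append(t[::-1])                 # its mirror image

print(len(augmented))   # 24 variants per passage: 500 * 24 = 12,000
```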
Suppose I have an input which has 36 possible values, and I encode it as 36 inputs where exactly one of them is non-zero. What is the optimal value for the non-zero input? It may be: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]. Or: [36, 0, 0, ..., 0], [0, 36, 0, ..., 0], [0, 0, 36, ..., 0]. Or even: [6, 0, 0, ..., 0], [0, 6, 0, ..., 0], [0, 0, 6, ..., 0]. In order for this feature to have the same impact …
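A quick numeric check of how each choice scales. The "right" answer depends on the model, but two quantities are easy to compare: the per-sample L2 norm (which is just the non-zero value v) and the per-column standard deviation across all 36 one-hot columns, which is v*sqrt(35)/36 ≈ v/6. So v = 6 (that is, sqrt(36)) makes each column's std ≈ 1, matching standardized continuous features:

```python
import numpy as np

for v in (1, 6, 36):
    X = v * np.eye(36)                    # one sample per possible value
    print(v,
          np.linalg.norm(X[0]),           # per-sample L2 norm: v
          X.std(axis=0).mean().round(3))  # per-column std: ~v/6; v=6 -> ~0.986
```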
I am working on a linear regression problem. The features for my analysis have been selected using p-values and domain knowledge. After selecting these features, performance improved: the $R^2$ went from 0.25 to 0.85, and the $RMSE$ improved as well. But here is the issue: the features selected using domain knowledge have very high p-values (0.7, 0.9) and very low individual $R^2$ (0.002, 0.0004). Does it make sense to add such features even if your model shows improved performance? As far as I know, …
There is this notebook solving housing prices, https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook?scriptVersionId=12846740, and it had this bit of code. Can anyone explain how the addition, multiplication, and weights work?

```python
# combine related raw columns into aggregate features
features['YrBltAndRemod'] = features['YearBuilt'] + features['YearRemodAdd']
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] +
                                 features['1stFlrSF'] + features['2ndFlrSF'])
# half bathrooms get a 0.5 weight, i.e. each counts as half a full bathroom
features['Total_Bathrooms'] = (features['FullBath'] + (0.5 * features['HalfBath']) +
                               features['BsmtFullBath'] + (0.5 * features['BsmtHalfBath']))
features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] +
                              features['EnclosedPorch'] + features['ScreenPorch'] +
                              features['WoodDeckSF'])
```
I'd like to explore some interactions between my variables, but they're on different measurement scales. Would, for example, the absolute value of their difference after scaling make sense? From what I understand, scaling them to a 0-1 range relies heavily on their max and min values; from this assumption it seems to me that interactions between them would not make sense, since each value's position on its own scale would depend heavily on the observations.
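A sketch comparing the two scalings on made-up data: standardization (z-scores) is a common alternative to min-max precisely because min-max depends entirely on each variable's observed extremes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 10, 500),      # variable on one scale
                     rng.normal(0.5, 0.1, 500)])   # variable on another

Z = StandardScaler().fit_transform(X)              # both now mean 0, std 1
interaction = np.abs(Z[:, 0] - Z[:, 1])            # |difference| on a common scale
product = Z[:, 0] * Z[:, 1]                        # the classic interaction term
print(interaction[:3], product[:3])
```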
What is the best way to engineer features which can have more than one value? I want to parse this data and keep it in a pandas df for further analysis. For example, I have data on people's profiles consisting of Name, Age, Gender, Company, Degree. It is easy to keep Name, Age, and Gender, which each have a single specific value, but Company can have more than one value, like someone who worked at Google …
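A sketch (with made-up values) of keeping multi-valued fields as lists in the DataFrame and expanding them only when needed, via explode() for row-wise analysis or MultiLabelBinarizer for a fixed-width model input:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "Name":    ["Alice", "Bob"],
    "Age":     [30, 35],
    "Gender":  ["F", "M"],
    "Company": [["Google", "Amazon"], ["IBM"]],
    "Degree":  [["BSc", "MSc"], ["BSc"]],
})

print(df.explode("Company"))                 # one row per company, for analysis

mlb = MultiLabelBinarizer()
company_ohe = pd.DataFrame(mlb.fit_transform(df["Company"]),
                           columns=mlb.classes_, index=df.index)
print(df[["Name", "Age", "Gender"]].join(company_ohe))   # model-ready columns
```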
I am working on a dataset of ATP (Association of Tennis Professionals, men only) tennis games over several years. I want to predict the outcomes of tennis matches, and one way to do that is with a Bradley-Terry model, which is a probability model. I am asking how to do the feature selection or feature engineering (I am not talking about domain-knowledge FE) or preprocessing that must be applied before training the model.
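For context, the main "preprocessing" a Bradley-Terry model needs is the design matrix itself. A sketch fitting it as a logistic regression: each match is a row with +1 in the first player's column and -1 in the second's, labeled 1 if the first player won; the fitted coefficients are the player strengths (players and results here are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

players = ["Federer", "Nadal", "Djokovic"]
matches = [("Federer", "Nadal", 1), ("Nadal", "Djokovic", 0),
           ("Federer", "Djokovic", 1), ("Nadal", "Federer", 0)]

idx = {p: i for i, p in enumerate(players)}
X = np.zeros((len(matches), len(players)))
y = np.zeros(len(matches))
for r, (a, b, a_won) in enumerate(matches):
    X[r, idx[a]], X[r, idx[b]], y[r] = 1, -1, a_won

# No intercept: P(a beats b) = sigmoid(strength_a - strength_b)
bt = LogisticRegression(fit_intercept=False, C=10.0).fit(X, y)
print(dict(zip(players, bt.coef_[0].round(2))))   # relative strengths
```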