Queries regarding feature importance for categorical features: Context: I have almost 185 categorical features, each with anywhere from 1 to 8 categories (most have 2, 3, or 4), plus nulls. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance OR feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs category) to analyse. What …
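A minimal sketch of the random-forest route, on a toy stand-in for the real data (column names and sizes here are made up; the real dataset would keep the top 60 instead of 5):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the real data: a few categorical columns with nulls.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    f"cat_{i}": rng.choice(["a", "b", "c", None], size=200) for i in range(10)
})
y = rng.integers(0, 2, size=200)

X = df.fillna("MISSING")                      # treat nulls as their own category
X_enc = OrdinalEncoder().fit_transform(X)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_enc, y)

# Rank features by impurity-based importance; with 185 real features,
# take head(60) instead of head(5).
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```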
How do I assign categorical values back to the train and test data after training and testing, using inverse_transform? The training and testing data contain encoded numerical values. So, how do I map those encoded values back to the original categorical values in the train and test datasets after training and testing? Please help me with this.
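A minimal sketch with sklearn's LabelEncoder, assuming one encoder was kept per column (the DataFrame and its values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size":  ["S", "M", "S", "L"]})

# Encode: keep one fitted encoder per column so each can be reversed later.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
print(df)           # numeric codes used for training/testing

# Decode: inverse_transform maps the codes back to the original categories.
for col, enc in encoders.items():
    df[col] = enc.inverse_transform(df[col])
print(df)           # original categorical values restored
```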
I have 2 databases with around 60,000 samples each. Both have the same features (same column names), which represent particular things as text or categories (turned into numbers). Each sample in a database is assumed to refer to a different particular thing. But some objects are represented in both databases, yet with somewhat different values in the same-name columns (like different open descriptions, or being classified as another category). The aim is to train a machine learning model …
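If the aim is matching records across the two databases, one common starting point is to turn candidate record pairs into similarity features. A sketch, with made-up column names and difflib as an illustrative string-similarity choice (not a full record-linkage pipeline):

```python
import difflib
import pandas as pd

dbb1 = pd.DataFrame({"name": ["acme corp", "foo ltd"], "category": [3, 7]})
dbb2 = pd.DataFrame({"name": ["acme corporation", "bar inc"], "category": [3, 2]})

def pair_features(row_a, row_b):
    return {
        "name_sim": difflib.SequenceMatcher(None, row_a["name"], row_b["name"]).ratio(),
        "category_match": int(row_a["category"] == row_b["category"]),
    }

# One candidate pair per combination; a real pipeline would block/filter first.
pairs = [pair_features(a, b) for _, a in dbb1.iterrows() for _, b in dbb2.iterrows()]
X = pd.DataFrame(pairs)   # feed X (plus labels for known matches) to a classifier
print(X)
```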
I have a problem. When an order comes in, I want to predict in how many days the customer will place their next order. I have already created my target variable, next_day_in_days, which specifies in how many days the customer will order again, and this is what I would like to predict. Since I have too few features, I want to do feature engineering. I would like to compute how many orders the customer has placed in the last 90 …
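A sketch of that rolling count in pandas, assuming an order table with customer_id and order_date columns (both names hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-02-20", "2023-05-01", "2023-01-10", "2023-01-15"]),
})
orders = orders.sort_values(["customer_id", "order_date"])

# For each order, count that customer's orders in the preceding 90 days
# (the window includes the current order; subtract 1 to exclude it).
orders["orders_last_90d"] = (
    orders.set_index("order_date")
          .groupby("customer_id")["customer_id"]
          .rolling("90D").count()
          .values
)
print(orders)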
I have four different feature selection techniques: Backwards Elimination, Lasso, feature_importances_, and Recursive Feature Selection. Each technique returns slightly different results. For example: Backwards Elimination: Spread Direction; Lasso: Spread Move and Spread; feature_importances_: Spread Percentage and Spread Money; Recursive: Spread Money. Is there a standard method for choosing features across different models? Should you choose the features that every model returns, or is there a preferred method for doing this?
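One simple, common heuristic (not an official standard) is to vote across methods and keep the features selected by a majority. A sketch using the results quoted in the question:

```python
from collections import Counter

selections = {
    "backwards_elimination": ["Spread Direction"],
    "lasso": ["Spread Move", "Spread"],
    "feature_importances_": ["Spread Percentage", "Spread Money"],
    "recursive": ["Spread Money"],
}

# Count how many methods picked each feature; keep those picked by >= 2.
votes = Counter(f for feats in selections.values() for f in feats)
majority = [f for f, v in votes.items() if v >= 2]
print(votes)       # Spread Money appears twice -> strongest consensus
print(majority)
```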
I have some n x m data and I want to ignore certain features. One idea I had is to mark those features as "missing", since XGBoost can handle missing values by default, e.g. using NaN when constructing the DMatrix:

```python
import numpy as np
import xgboost as xgb

n, m = 100, 10
X = np.random.uniform(size=(n, m))
y = (np.sum(X, axis=1) >= 0.5 * m).astype(int)

# ignore certain features: mark them as missing
X[:, 2:7] = np.nan

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
model = xgb.train(params={'objective': 'binary:logistic'}, dtrain=dtrain)
```

My …
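One way to sanity-check the idea, continuing from the model above: a feature that is all-NaN can never be chosen for a split, so it should not appear in the booster's importance scores.

```python
# Features f2..f6 were set to NaN everywhere, so they should be absent here.
print(model.get_score(importance_type='weight'))
```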
I need advice on the best way to represent the data below so it can be fed into an ML algorithm (yet to be decided). This is from the online order processing domain. An order consists of a variable number of items, and each item can be located in a variable number of different warehouses. The entire order, with multiple items and multiple warehouses per item, needs to be processed as one training sample. The goal is to …
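One common workaround for variable-length structures is to flatten the nested order -> items -> warehouses hierarchy into fixed-size aggregate features, so each order becomes one row. A sketch with made-up column names:

```python
import pandas as pd

lines = pd.DataFrame({            # one row per (order, item, warehouse)
    "order_id":  [1, 1, 1, 2],
    "item_id":   ["A", "A", "B", "C"],
    "warehouse": ["W1", "W2", "W1", "W3"],
    "qty":       [2, 1, 5, 3],
})

per_order = lines.groupby("order_id").agg(
    n_items=("item_id", "nunique"),
    n_warehouses=("warehouse", "nunique"),
    total_qty=("qty", "sum"),
)
print(per_order)   # one fixed-width feature vector per order
```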
I want to use VGG16 (or VGG19) for a voice clustering task. I read some articles which suggest using VGG (16 or 19) to build the embedding vector for the clustering algorithm. The process is to convert the wav file into an MFCC or an amplitude-vs-time plot and use that as the input to the VGG model. I tried it out with VGG19 (and weights='imagenet'). I got bad results, and I assume it's because I'm using VGG with the wrong weights …
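A sketch of the embedding step, assuming librosa and tensorflow are installed and 'voice.wav' is a placeholder file name. Note that ImageNet weights were learned on photos, so treating an MFCC as an image is a rough heuristic:

```python
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

signal, sr = librosa.load("voice.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)     # (40, frames)

# Scale to 0..255, replicate to 3 channels, resize to VGG's input size.
img = 255 * (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-9)
img = np.repeat(img[..., None], 3, axis=-1)
img = tf.image.resize(img, (224, 224)).numpy()

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
embedding = vgg.predict(preprocess_input(img[None, ...]))    # shape (1, 512)
# cluster the embeddings, e.g. with sklearn.cluster.KMeans
```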
I'm working on a propensity model, predicting whether customers will buy or not. While doing exploratory data analysis, I found that customers have a buying pattern: most customers repeat the purchase at a specific time interval. For example, some customers repeat purchases every four quarters, some every 8 or 12, etc. I have the purchase dates for these customers. What is the most useful feature I can create to capture this pattern in the data? I'm predicting whether in the next quarter …
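A sketch of interval-based features (column names are made up): the gap since the last purchase, the customer's typical inter-purchase gap, and their ratio. A ratio near 1 means "the customer is due again":

```python
import pandas as pd

p = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2021-01-01", "2022-01-01", "2023-01-01", "2022-06-01", "2022-09-01"]),
}).sort_values(["customer_id", "purchase_date"])

# Days between consecutive purchases, per customer.
p["gap_days"] = p.groupby("customer_id")["purchase_date"].diff().dt.days

snapshot = pd.Timestamp("2023-03-31")       # end of the quarter being scored
feats = p.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    median_gap=("gap_days", "median"),
)
feats["days_since_last"] = (snapshot - feats["last_purchase"]).dt.days
feats["due_ratio"] = feats["days_since_last"] / feats["median_gap"]
print(feats)
```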
I'm trying to solve a problem which is as follows: I need to train an autoencoder to extract useful features from text, and I will use the trained autoencoder in another model as a feature extractor. The goal is to teach the autoencoder to compress the information and then reconstruct the exact same string, so I treat it as a classification problem over each character. My dataset (a pandas Series of strings):

```
15298    some text...
1127     some text...
22270    more text...
...
Name: data, Length: 28235, dtype: object
```

…
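A minimal character-level autoencoder sketch in Keras: one-hot the characters, compress with an LSTM, then predict each character back with a softmax per position. Strings and layer sizes are toy placeholders:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

texts = ["some text", "more text"]
chars = sorted(set("".join(texts)))
idx = {c: i for i, c in enumerate(chars)}
maxlen = max(len(t) for t in texts)

# One-hot encode each string as a (maxlen, vocab) matrix.
X = np.zeros((len(texts), maxlen, len(chars)), dtype="float32")
for i, t in enumerate(texts):
    for j, c in enumerate(t):
        X[i, j, idx[c]] = 1.0

inp = layers.Input(shape=(maxlen, len(chars)))
code = layers.LSTM(32)(inp)                               # the embedding
dec = layers.RepeatVector(maxlen)(code)
dec = layers.LSTM(32, return_sequences=True)(dec)
out = layers.TimeDistributed(layers.Dense(len(chars), activation="softmax"))(dec)

ae = tf.keras.Model(inp, out)
ae.compile(optimizer="adam", loss="categorical_crossentropy")
ae.fit(X, X, epochs=5, verbose=0)          # reconstruct the input from itself

encoder = tf.keras.Model(inp, code)        # reuse this to extract features
print(encoder.predict(X).shape)            # (2, 32)
```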
Is a neural network (for example, an MLPClassifier in Python) able to learn to map completely (or very) different input feature sets to the same output class? Or is it better in this case to work with more than one output class and map those recognized output classes to the same class manually afterwards?
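A tiny demonstration of the first option: two well-separated input clusters both labeled class 1, with a third cluster in between labeled class 0. A single MLP can learn this many-regions-to-one-class mapping, with no manual class merging:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
a = rng.normal(loc=-5, scale=0.5, size=(200, 2))   # class 1, region A
b = rng.normal(loc=+5, scale=0.5, size=(200, 2))   # class 1, region B
c = rng.normal(loc=0,  scale=0.5, size=(200, 2))   # class 0, in between

X = np.vstack([a, b, c])
y = np.array([1] * 400 + [0] * 200)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
print(clf.fit(X, y).score(X, y))   # ~1.0: both regions map to class 1
```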
I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables. I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much. Each observation in my dataset can take multiple values for the CATEGORY feature; for instance, row 1 could have the value a, but row 2 could have the values a, b, c, d …
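One common technique here is the hashing trick. sklearn's FeatureHasher accepts a list of category strings per row, so multi-valued observations need no exploding, and the output width is fixed (here 2**10) regardless of the ~20,000 distinct values:

```python
from sklearn.feature_extraction import FeatureHasher

rows = [
    ["a"],                  # row 1: single value
    ["a", "b", "c", "d"],   # row 2: multiple values
]
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform(rows)          # sparse matrix, shape (2, 1024)
print(X.shape, X.nnz)
```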
I would like to train a machine learning model with several features as input, X[0]..X[4], and one output, Y. Every sample is a row like: X[0], X[1], X[2], X[3], X[4], Y. When every feature has exactly one value per sample, this is a normal machine learning problem. But now I would like X[3] to take multiple values; for example, sample 1's data is: X[0] | X[1] …
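One common workaround (a sketch with made-up values): collapse the multiple X[3] values into fixed summary statistics, so every sample is a single row again:

```python
import numpy as np
import pandas as pd

sample = {"X0": 1.0, "X1": 2.0, "X2": 3.0, "X3": [4.0, 7.0, 5.0], "X4": 6.0}

row = {k: v for k, v in sample.items() if k != "X3"}
row.update({
    "X3_mean": np.mean(sample["X3"]),   # replace the list with summaries
    "X3_min":  np.min(sample["X3"]),
    "X3_max":  np.max(sample["X3"]),
    "X3_n":    len(sample["X3"]),
})
print(pd.DataFrame([row]))   # fixed-width row usable by any standard model
```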
I have a computer-generated music project, and I'd like to classify short passages of music as "good" or "bad" via machine learning. I won't have a large training set: I'll start by generating 500 examples each of good and bad music, manually. These examples can be transposed and mirror-imaged to produce 12,000 examples each of good and bad. I have a way of extracting features from the music in an intelligent way that mimics the way a perceptive listener would …
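A sketch of the augmentation arithmetic on a toy pitch representation; "mirror-imaging" is taken here as time reversal (retrograde), though it could equally mean pitch inversion:

```python
import numpy as np

passage = np.array([60, 64, 67, 72])          # MIDI-like pitch numbers

augmented = []
for shift in range(12):                       # transpose to all 12 keys
    t = passage + shift
    augmented.append(t)                       # transposed contour
    augmented.append(t[::-1])                 # its mirror image

print(len(augmented))   # 24 variants per passage: 500 * 24 = 12,000
```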
Suppose I have an input which has 36 possible values, and I encode it as 36 inputs where exactly one of them is non-zero. What is the optimal value for the non-zero input? It may be: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]. Or: [36, 0, 0, ..., 0], [0, 36, 0, ..., 0], [0, 0, 36, ..., 0]. Or even: [6, 0, 0, ..., 0], [0, 6, 0, ..., 0], [0, 0, 6, ..., 0]. In order for this feature to have the same impact …
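A quick numeric check of how each choice scales. The "right" answer depends on the model, but two quantities are easy to compare: the per-sample L2 norm (which is just the non-zero value v) and the per-column standard deviation across all 36 one-hot columns, which is v*sqrt(35)/36 ≈ v/6. So v = 6 (that is, sqrt(36)) makes each column's std ≈ 1, matching standardized continuous features:

```python
import numpy as np

for v in (1, 6, 36):
    X = v * np.eye(36)                    # one sample per possible value
    print(v,
          np.linalg.norm(X[0]),           # per-sample L2 norm: v
          X.std(axis=0).mean().round(3))  # per-column std: ~v/6; v=6 -> ~0.986
```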
I am working on a linear regression problem. The features for my analysis have been selected using p-values and domain knowledge. After selecting these features, performance improved: the $R^2$ went from 0.25 to 0.85, and the $RMSE$ improved as well. But here is the issue: the features selected using domain knowledge have very high p-values (0.7, 0.9) and very low individual $R^2$ (0.002, 0.0004). Does it make sense to add such features even if your model shows improved performance? As far as I know, …
There is this notebook solving housing prices, https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook?scriptVersionId=12846740, and it had this bit of code. Can anyone explain how the addition, multiplication, and weights work?

```python
# combine related raw columns into aggregate features
features['YrBltAndRemod'] = features['YearBuilt'] + features['YearRemodAdd']
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] +
                                 features['1stFlrSF'] + features['2ndFlrSF'])
# half bathrooms get a 0.5 weight, i.e. each counts as half a full bathroom
features['Total_Bathrooms'] = (features['FullBath'] + (0.5 * features['HalfBath']) +
                               features['BsmtFullBath'] + (0.5 * features['BsmtHalfBath']))
features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] +
                              features['EnclosedPorch'] + features['ScreenPorch'] +
                              features['WoodDeckSF'])
```
I'd like to explore some interactions between my variables, but they're on different measurement scales. Would, for example, the absolute value of their difference after scaling make sense? From what I understand, scaling them to a 0-1 range relies heavily on their max and min values; from this assumption it seems to me that interactions between them would not make sense, since each value's position on its own scale would depend heavily on the observations.
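A sketch comparing the two scalings on made-up data: standardization (z-scores) is a common alternative to min-max precisely because min-max depends entirely on each variable's observed extremes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 10, 500),      # variable on one scale
                     rng.normal(0.5, 0.1, 500)])   # variable on another

Z = StandardScaler().fit_transform(X)              # both now mean 0, std 1
interaction = np.abs(Z[:, 0] - Z[:, 1])            # |difference| on a common scale
product = Z[:, 0] * Z[:, 1]                        # the classic interaction term
print(interaction[:3], product[:3])
```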
What is the best way to engineer features which can have more than one value? I want to parse this data and keep it in a pandas df for further analysis. For example, I have data on people's profiles consisting of Name, Age, Gender, Company, Degree. It is easy to keep Name, Age, and Gender, which each have a single specific value, but Company can have more than one value, like someone who worked at Google …
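A sketch (with made-up values) of keeping multi-valued fields as lists in the DataFrame and expanding them only when needed, via explode() for row-wise analysis or MultiLabelBinarizer for a fixed-width model input:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "Name":    ["Alice", "Bob"],
    "Age":     [30, 35],
    "Gender":  ["F", "M"],
    "Company": [["Google", "Amazon"], ["IBM"]],
    "Degree":  [["BSc", "MSc"], ["BSc"]],
})

print(df.explode("Company"))                 # one row per company, for analysis

mlb = MultiLabelBinarizer()
company_ohe = pd.DataFrame(mlb.fit_transform(df["Company"]),
                           columns=mlb.classes_, index=df.index)
print(df[["Name", "Age", "Gender"]].join(company_ohe))   # model-ready columns
```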
I am working on a dataset of ATP (Association of Tennis Professionals, men only) tennis games over several years. I want to predict the outcomes of tennis matches, and one way to do that is with a Bradley-Terry model, which is a probability model. I am asking how to do the feature selection or feature engineering (I am not talking about domain-knowledge FE) or preprocessing that must be applied before training the model.
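For context, the main "preprocessing" a Bradley-Terry model needs is the design matrix itself. A sketch fitting it as a logistic regression: each match is a row with +1 in the first player's column and -1 in the second's, labeled 1 if the first player won; the fitted coefficients are the player strengths (players and results here are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

players = ["Federer", "Nadal", "Djokovic"]
matches = [("Federer", "Nadal", 1), ("Nadal", "Djokovic", 0),
           ("Federer", "Djokovic", 1), ("Nadal", "Federer", 0)]

idx = {p: i for i, p in enumerate(players)}
X = np.zeros((len(matches), len(players)))
y = np.zeros(len(matches))
for r, (a, b, a_won) in enumerate(matches):
    X[r, idx[a]], X[r, idx[b]], y[r] = 1, -1, a_won

# No intercept: P(a beats b) = sigmoid(strength_a - strength_b)
bt = LogisticRegression(fit_intercept=False, C=10.0).fit(X, y)
print(dict(zip(players, bt.coef_[0].round(2))))   # relative strengths
```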