Create features for each row or only for a specific value

I have a problem. When an order comes in, I want to predict in how many days the customer will place another order. I have already created my target variable next_day_in_days, which specifies in how many days the customer will place an order again. Since I have too few features, I want to do feature engineering. I would like to specify how many orders the customer has placed in the last 90 …
Category: Data Science
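
A minimal sketch of the per-row feature the excerpt above describes, assuming the truncated "last 90 …" means a trailing 90-day window and that each row is a (customer_id, order_date) pair. The function name, the tuple layout, and the sample orders are hypothetical. Only earlier orders are counted, so the feature does not look into the future of the row it belongs to.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def orders_in_trailing_window(orders, window_days=90):
    """For each (customer_id, order_date) row, count that customer's earlier
    orders dated within the previous `window_days` days (window start
    inclusive, same-day orders excluded)."""
    dates_by_customer = defaultdict(list)
    for customer_id, order_date in orders:
        dates_by_customer[customer_id].append(order_date)
    for dates in dates_by_customer.values():
        dates.sort()

    counts = []
    for customer_id, order_date in orders:
        dates = dates_by_customer[customer_id]
        window_start = order_date - timedelta(days=window_days)
        first_in_window = bisect_left(dates, window_start)
        first_not_earlier = bisect_left(dates, order_date)
        counts.append(first_not_earlier - first_in_window)
    return counts


# Made-up orders, for illustration only.
orders = [
    ("A", date(2021, 1, 1)),
    ("A", date(2021, 2, 15)),
    ("A", date(2021, 6, 1)),
    ("B", date(2021, 3, 10)),
]
window_counts = orders_in_trailing_window(orders)
```

The result has one value per input row, so it can be attached as a new column next to next_day_in_days.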

Reverse engineer PII sensitive data from Inceptionv3 pre-trained model generated features

I'm using the pre-trained Inceptionv3 to build out features from proprietary documents. Some of these documents contain sensitive PII data. I use the 2K output from the second-to-last layer as the feature vector. My question is: if a set (say 2000) of these 2K generated feature vectors is available to someone, can it be used to reverse engineer sensitive data such as SSN, date of birth, etc.? My thinking is that since Inceptionv3 was never trained with these proprietary documents, …
Category: Data Science

How to handle a feature vector that could be variable length?

I would like to train a machine learning model with several input features X[] and one output Y. For example, every sample has a data frame like this: X[0], X[1], X[2], X[3], X[4], Y. Let's say that for one sample each entry holds only one value: X[0], X[1], X[2], X[4], Y. This is a normal machine learning problem. But now suppose I would like X[3] to hold multiple values. For example, the data for sample 1 is: X[0] | X[1] …
Category: Data Science
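
One possible way to handle the multi-valued X[3] described above is to collapse it into a fixed number of summary statistics, so every sample ends up with the same length. This is only a sketch under that assumption; the choice of statistics (count, mean, min, max), the function names, and the sample values are hypothetical.

```python
from statistics import mean


def summarize(values):
    """Collapse a variable-length list into four fixed features:
    count, mean, min, max. An empty list maps to zeros."""
    if not values:
        return [0, 0.0, 0.0, 0.0]
    return [len(values), mean(values), min(values), max(values)]


def to_fixed_length(sample):
    """sample = (x0, x1, x2, x3_values, x4), where x3_values is a list
    of any length. Returns a flat list of 8 numbers."""
    x0, x1, x2, x3_values, x4 = sample
    return [x0, x1, x2, *summarize(x3_values), x4]


# Made-up samples: the first has one X[3] value, the second has three.
samples = [
    (1.0, 2.0, 3.0, [4.0], 5.0),
    (1.5, 2.5, 3.5, [4.0, 6.0, 8.0], 5.5),
]
rows = [to_fixed_length(s) for s in samples]
```

Summaries discard the order of the values; if order matters, padding to a fixed maximum length is an alternative.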

Hard time finding literature on feature clustering using Principal Component Analysis

I'm new to StackExchange, so I am sorry if this is not the right way to ask a question on StackExchange. For my thesis I wish to propose a method for future research on using PCA to cluster features (feature clustering) and then apply per-cluster PCA. I got the idea from this paper. But I have a hard time finding literature about PCA being used to cluster variables (not reduce variables). I could imagine that it is not …
Category: Data Science

LSTM for binary classification using multiple attributes

I haven't used neural networks for many years, so excuse my ignorance. I was wondering what the most appropriate way is to train an LSTM model on my dataset. I have 3 attributes as follows: Attribute 1: small int e.g., [123, 321, ...] Attribute 2: text sequence ['cgtaatta', 'ggcctaaat', ... ] Attribute 3: text sequence ['ttga', 'gattcgtt', ... ] Class label: binary [0, 1, ...] The length of each sample's attributes (2 or 3) is arbitrary; therefore I do …
Category: Data Science
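
Before text sequences of arbitrary length can be batched for a recurrent model, they are usually mapped to integer ids and padded to a common length. The sketch below shows that preprocessing step only, assuming the alphabet is limited to the four letters that appear in the quoted examples (a, c, g, t). The vocabulary, the padding id 0, and the function names are assumptions.

```python
PAD = 0
VOCAB = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed alphabet, 0 is reserved for padding


def encode(seq):
    """Map each character of a sequence to its integer id."""
    return [VOCAB[ch] for ch in seq]


def pad_batch(encoded_seqs, pad_value=PAD):
    """Right-pad every sequence to the length of the longest one and
    return the padded batch together with the original lengths."""
    max_len = max(len(s) for s in encoded_seqs)
    padded = [s + [pad_value] * (max_len - len(s)) for s in encoded_seqs]
    lengths = [len(s) for s in encoded_seqs]
    return padded, lengths


attribute_2 = ["cgtaatta", "ggcctaaat"]  # the two examples quoted in the question
encoded = [encode(s) for s in attribute_2]
padded, lengths = pad_batch(encoded)
```

Keeping the original lengths lets a model mask or pack the padded positions later.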

How to build multiple variable regression having a mix of numerical & categorical features?

There is a need to estimate Annual Average Daily Traffic Volume (AADT). We have a bunch of data about vehicles' speeds over several years. We have noticed that AADT depends on the average number of such samples during some time, so a regression model $Y = f(x_1)$ could help estimate the AADT. The problem is that there are other features affecting the dependency, which are both numerical $(x_2, \dots, x_k)$ and categorical $(c_1 = \text{data provider}, c_2 = \text{road class}, \dots, c_m)$. …
Category: Data Science
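
A common way to feed categorical features such as data provider and road class into a regression alongside numerical ones is one-hot (dummy) encoding. The sketch below builds such a design matrix by hand, dropping the first level of each categorical feature as the baseline. The column names, level names, and values are hypothetical, and this is not necessarily the approach the original question settled on.

```python
def build_design_matrix(rows, numeric_keys, categorical_keys):
    """Return (column_names, matrix) with numeric columns copied as floats
    and each categorical column expanded into 0/1 dummy columns.
    The first (alphabetically smallest) level of each categorical column
    is dropped and acts as the baseline."""
    levels = {k: sorted({r[k] for r in rows}) for k in categorical_keys}
    columns = list(numeric_keys)
    for k in categorical_keys:
        columns += [f"{k}={lvl}" for lvl in levels[k][1:]]

    matrix = []
    for r in rows:
        vec = [float(r[k]) for k in numeric_keys]
        for k in categorical_keys:
            vec += [1.0 if r[k] == lvl else 0.0 for lvl in levels[k][1:]]
        matrix.append(vec)
    return columns, matrix


# Made-up rows: x1 is the numerical sample count, the rest are categorical.
rows = [
    {"x1": 120, "provider": "P1", "road_class": "highway"},
    {"x1": 45, "provider": "P2", "road_class": "local"},
    {"x1": 80, "provider": "P1", "road_class": "local"},
]
columns, matrix = build_design_matrix(rows, ["x1"], ["provider", "road_class"])
```

The resulting matrix can be passed to any linear or tree-based regressor.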

Feature engineering before splitting

This is a sister post to the original closed post (here). Since the data transformation is done after data splitting, on the TRAINING data only, I wonder: wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick different portions of the training data. But I personally find this hard to accept: shouldn't data transformation be as invariant and generalizable as possible across different subsamplings of the dataset? Also, …
Category: Data Science
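
To make the "fit on the training data only" practice discussed above concrete, here is a small sketch using standardization as the transformation. The parameters (mean and standard deviation) are estimated from the training split and then reused unchanged on the test split. The function names and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Estimate the mean and population standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant feature
    return mu, sigma


def standardize(values, params):
    """Apply previously fitted parameters to any split."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [2.0, 4.0, 6.0, 8.0]  # made-up values
test = [5.0, 10.0]

params = fit_standardizer(train)
train_scaled = standardize(train, params)
test_scaled = standardize(test, params)  # no test statistics are used
```

A different training subsample would indeed give slightly different parameters, which is the dependency the question asks about; what this pattern guarantees is that no information from the test split enters the fitted transformation.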

Is there a multi-modal population based metaheuristic that is non-GA?

I have a feature set from which I want to select various combinations and permutations of the features. The length of a solution feature vector can range between, say, 5 and 20 features, and the ordering of the features is important, meaning that feature vector ABC is different from BCA, i.e. they are sequential and depend on each other's output. The goal is to find many near optimal solutions around optimal solutions and the solution space is …
Category: Data Science

Is there a way to combine multiple ML models where each use datasets with different features?

I have a dataset where some features (c, d) apply only when a feature (a) has a specific value. For example:

a, b, c, d
T, 60, 0x018, 3252002711
U, 167, ,
U, 67, ,
T, 66, 0x018, 15556

So I'm planning to split the dataset so that there are no missing values:

a, b, c, d
T, 60, 0x018, 3252002711
T, 66, 0x018, 15556

a, b
U, 167
U, 67

and then put these into individual models which combine …
Category: Data Science
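
The split described above can be sketched as follows, using the four rows quoted in the question. Rows are grouped by the value of a, and each group keeps only the columns that are populated for it. The function name and the use of None for empty cells are assumptions; how the per-group models are later combined is not covered here.

```python
from collections import defaultdict

# The four example rows from the question; None marks an empty cell.
rows = [
    {"a": "T", "b": 60, "c": "0x018", "d": 3252002711},
    {"a": "U", "b": 167, "c": None, "d": None},
    {"a": "U", "b": 67, "c": None, "d": None},
    {"a": "T", "b": 66, "c": "0x018", "d": 15556},
]


def split_by_value(rows, key):
    """Group rows by row[key], dropping empty cells from each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append({k: v for k, v in row.items() if v is not None})
    return dict(groups)


subsets = split_by_value(rows, "a")
```

At prediction time, the same key can be used to route each new row to the model trained on its subset.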

train-test split on forecasting a time series using external features

I have a question regarding the train-test split when forecasting a time series using features instead of the time series itself. I know that I should use a time-based train-test split if I use lagged values of the time series to predict, but I am wondering if that is also the case if I use an external feature. Suppose I try to forecast the watermelon consumption using only the temperature (X feature) instead of using the time series of watermelon consumption. Leaving …
Category: Data Science
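
For reference, a time-based split of the kind mentioned above can be sketched like this: every record before a cutoff date goes to training and everything from the cutoff onward goes to testing, so the model is never evaluated on dates that precede its training data. The record layout, the field names, and the values are made up for illustration.

```python
from datetime import date


def time_based_split(records, cutoff):
    """Split records chronologically: day < cutoff is train, the rest is test."""
    ordered = sorted(records, key=lambda r: r["day"])
    train = [r for r in ordered if r["day"] < cutoff]
    test = [r for r in ordered if r["day"] >= cutoff]
    return train, test


# Made-up daily records: temperature is the external feature,
# consumption is the target.
records = [
    {"day": date(2022, 7, 3), "temperature": 31.0, "consumption": 140},
    {"day": date(2022, 7, 1), "temperature": 28.5, "consumption": 120},
    {"day": date(2022, 7, 4), "temperature": 33.0, "consumption": 155},
    {"day": date(2022, 7, 2), "temperature": 29.0, "consumption": 125},
]
train, test = time_based_split(records, cutoff=date(2022, 7, 3))
```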

Finding attributes that make up dense clusters of fraudulent transactions

I have data about purchases customers made on my website. Some users later decline the purchase, a scenario I'd like to avoid. I have lots of data about these purchases, so I'd like to find clusters of users who share similar attributes and contain a high share of "decliner" users. I have labelled data about those users (as we know who later declined the payment). The problem is: how do I cluster them in a meaningful …
Category: Data Science
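
Because the data is labelled, one simple starting point (a swapped-in technique, not clustering in the strict sense) is to group purchases by combinations of attributes and rank the groups by their decline rate. The sketch below assumes each purchase is a dict with a 0/1 "declined" label; the attribute names and values are hypothetical.

```python
from collections import defaultdict


def decline_rate_by_segment(purchases, attributes):
    """Group purchases by the given attributes and return
    (segment, decline_rate) pairs sorted from highest to lowest rate."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [declined, total]
    for p in purchases:
        segment = tuple(p[a] for a in attributes)
        totals[segment][0] += p["declined"]
        totals[segment][1] += 1
    rates = {seg: declined / total for seg, (declined, total) in totals.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up purchases, for illustration only.
purchases = [
    {"country": "X", "payment": "card", "declined": 1},
    {"country": "X", "payment": "card", "declined": 1},
    {"country": "X", "payment": "wallet", "declined": 0},
    {"country": "Y", "payment": "card", "declined": 0},
    {"country": "Y", "payment": "card", "declined": 1},
]
ranked = decline_rate_by_segment(purchases, ["country", "payment"])
```

With real data, segments with very few purchases should be filtered out before ranking, since their rates are unreliable.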

Feature Map setup for Faster RCNN with resnet50 backbone

I'm trying to get an activation map using a Faster RCNN Resnet50 backbone, but am having issues getting the proper hook setup for output information. Most of the libraries, like gradcam, don't seem to have built-in support for faster rcnn setups. I think the flow for Faster RCNN requires something extra, but am unable to figure out what I need to hook into the model. Layer 4 is what I've concentrated on, as it's called out in numerous tutorials (which …
Category: Data Science

vertical or horizontal storage of timesteps in feature store

I'd like to use a feature store to store some time series, and I asked myself what the best way to store the timesteps is. Is it better to store each timestep horizontally and then do the windowing after collecting it from the feature store to create the feature vector? Or is it better to additionally store all timesteps in a column and do the windowing before storing it in the feature store? Personally, I think the better way is to do …
Category: Data Science
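
As an illustration of windowing after retrieval, the sketch below assumes the store returns one (timestep, value) pair per row and builds fixed-size sliding windows from them on the client side. The storage layout, the function name, and the values are assumptions, not a statement about any particular feature store.

```python
def sliding_windows(series, window_size, step=1):
    """Return every full window of `window_size` consecutive values."""
    return [
        series[i:i + window_size]
        for i in range(0, len(series) - window_size + 1, step)
    ]


# Made-up rows as they might come back from the store, not necessarily ordered.
stored = [(3, 12.5), (1, 10.0), (2, 11.0), (5, 13.0), (4, 12.0)]

values = [value for _, value in sorted(stored)]  # order by timestep first
windows = sliding_windows(values, window_size=3)
```

Windowing at read time keeps the stored data independent of the window size, at the cost of repeating this step for every consumer.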

How can I assess feature importance when determining whether missing data is MCAR or not?

I was reading some lecture notes on missing data and the author suggests the following approach to determine whether some variable is missing completely at random (MCAR) or not: Supervised Learning method: Code ‘missing’ as a new category. Run a supervised analysis (to predict a separate target variable) and check if ‘missing’ has an effect on the prediction of the response in the learned model. If the category ‘missing’ has an effect, this is evidence that data is not MCAR. …
Category: Data Science
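
The first step of the quoted approach, coding 'missing' as its own category, can be sketched as below. As a crude stand-in for the supervised analysis, the sketch then compares the mean of the target across categories; a real check would fit a model and inspect the effect of the 'missing' level. All names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def code_missing(values, label="missing"):
    """Replace None with an explicit category label."""
    return [label if v is None else v for v in values]


def mean_target_by_category(categories, targets):
    """Average the target within each category, including 'missing'."""
    grouped = defaultdict(list)
    for category, target in zip(categories, targets):
        grouped[category].append(target)
    return {category: mean(ts) for category, ts in grouped.items()}


# Made-up data: the target is noticeably higher where the feature is missing.
feature = ["red", None, "blue", None, "red", "blue"]
target = [1.0, 5.0, 2.0, 7.0, 1.5, 2.5]

coded = code_missing(feature)
summary = mean_target_by_category(coded, target)
```

A large gap between the 'missing' group and the others, as in this toy data, would be a hint against MCAR, not proof.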

How to insert two features in a model when a feature only applies to a certain group in the model

I'm building a machine learning model in Python to predict soccer player values. Consider the following feature columns of the dataframe:

[features]
position | goals | goals_conceded
-------- | ----- | --------------
Forward  | 23    | NaN
Defender | 2     | NaN
Defender | 4     | NaN
Keeper   | NaN   | 20
Keeper   | NaN   | 43

Since keepers don't usually score goals, they'll almost always have null values in the "goals" column, but they still can have this statistic, so it …
Category: Data Science
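
One option for columns that only apply to some positions is to fill the empty cells with a neutral value and add an indicator column that records where the value was absent. The sketch below applies that to the five rows quoted in the question. The fill value of 0.0, the indicator naming, and the function name are assumptions, and whether this suits the model is a judgment call.

```python
import math

# The five example rows from the question.
players = [
    {"position": "Forward", "goals": 23, "goals_conceded": math.nan},
    {"position": "Defender", "goals": 2, "goals_conceded": math.nan},
    {"position": "Defender", "goals": 4, "goals_conceded": math.nan},
    {"position": "Keeper", "goals": math.nan, "goals_conceded": 20},
    {"position": "Keeper", "goals": math.nan, "goals_conceded": 43},
]


def fill_with_indicator(rows, column, fill_value=0.0):
    """Return copies of the rows where NaN in `column` is replaced by
    `fill_value` and a 0/1 `<column>_missing` flag is added."""
    out = []
    for row in rows:
        new_row = dict(row)
        value = row[column]
        missing = isinstance(value, float) and math.isnan(value)
        new_row[f"{column}_missing"] = int(missing)
        if missing:
            new_row[column] = fill_value
        out.append(new_row)
    return out


prepared = fill_with_indicator(
    fill_with_indicator(players, "goals"), "goals_conceded"
)
```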

If a categorical feature only occurs a few times in a data set, should I drop it?

I have a data set of mostly categorical variables. When I one-hot encoded them, some of the features occurred less than 3% of the time. For instance, the Tech-support feature only occurs 928 times in a data set with 32561 samples, i.e. it only occurs 2.9% of the time. Is there a general cutoff point for when I should scrap these variables? I'm cleaning up this data set for binary logistic regression and an SVM. Thank you!
Category: Data Science
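
The share quoted above can be checked directly, and instead of dropping rare levels one alternative is to merge them into a single "Other" level before one-hot encoding. The sketch below assumes the 3% figure from the question as the threshold purely for illustration, not as a recommended cutoff; the function name and the sample values are hypothetical.

```python
from collections import Counter

# Share of the Tech-support level, from the counts given in the question.
share = 928 / 32561


def group_rare(values, min_share=0.03, other="Other"):
    """Replace every level whose relative frequency is below `min_share`
    with a single catch-all level."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, count in counts.items() if count / n < min_share}
    return [other if v in rare else v for v in values]


# Made-up column with one rare level (C appears in 2% of rows).
column = ["A"] * 60 + ["B"] * 38 + ["C"] * 2
grouped = group_rare(column)
grouped_counts = Counter(grouped)
```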

Training & Test feature shape is different from number of columns in dataset

I am making a Sequential neural network for regression with 3 dense layers, which will be trained on a simple dataset. But before I even get to the part of the code that executes the model, the shape of my features differs from the number of columns in the dataset. The columns of the dataset include: 1) one categorical "Name" column, which is one-hot encoded; 2) the other 20 columns, which are integers/floats. I have 21 features in my dataset. ValueError is telling me it …
Category: Data Science

combine two features into one

In an epidemic disease dataset covering 3 months, I have a feature (var dt_died) with the death dates of patients (800 people died out of all 12k unique subjects in this dataset, so obviously only dead subjects have data for this feature). I also have a feature (var dt_test_positive) that indicates the date of testing positive for the disease (with no missing values). I would like to combine these two features into one (var difference). If I just make the …
Category: Data Science
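
A small sketch of the combined feature, using the variable names from the question (dt_died, dt_test_positive, difference). It assumes a missing death date is represented as None and keeps that missingness explicit instead of inventing a number for surviving subjects. The record layout and the dates are made up.

```python
from datetime import date


def days_to_death(dt_test_positive, dt_died):
    """Days between testing positive and death, or None when no death
    date is recorded for the subject."""
    if dt_died is None:
        return None
    return (dt_died - dt_test_positive).days


# Made-up subjects: the second one has no recorded death date.
subjects = [
    {"dt_test_positive": date(2020, 3, 1), "dt_died": date(2020, 3, 15)},
    {"dt_test_positive": date(2020, 3, 5), "dt_died": None},
]
for s in subjects:
    s["difference"] = days_to_death(s["dt_test_positive"], s["dt_died"])
    s["died"] = int(s["dt_died"] is not None)  # explicit 0/1 outcome flag
```

Keeping a separate died flag next to difference avoids mixing "no death recorded" with a real number of days.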

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.