I have a problem: when an order comes in, I want to predict in how many days the customer will place another order. I have already created my target variable, next_day_in_days, which specifies in how many days the customer will order again, and this is what I would like to predict. Since I have too few features, I want to do feature engineering. I would like to compute how many orders the customer has placed in the last 90 …
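For concreteness, a minimal pandas sketch of that rolling 90-day count, assuming a hypothetical orders table with customer_id and order_date columns (both invented names):

```python
import pandas as pd

# Hypothetical orders table: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-04-01", "2023-01-20", "2023-03-15"]
    ),
})
orders = orders.sort_values(["customer_id", "order_date"])

# For each order, count the customer's orders in the preceding 90 days,
# excluding the current order itself (hence the -1).
orders["orders_last_90d"] = (
    orders.set_index("order_date")
          .groupby("customer_id")["customer_id"]
          .rolling("90D")
          .count()
          .values
    - 1
)
print(orders)
```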
I'm using the pre-trained InceptionV3 to build features from proprietary documents. Some of these documents contain sensitive PII data. I use the 2K-dimensional output of the second-to-last layer as the feature vector. My question is: if a set (say 2,000) of these 2K-dimensional feature vectors became available to someone, could they be used to reverse engineer the sensitive data, like SSNs, dates of birth, etc.? My thinking is that since InceptionV3 was never trained on these proprietary documents, …
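For reference, a minimal sketch of that extraction setup, assuming Keras/TensorFlow; pooling="avg" exposes the 2048-dimensional penultimate activations, and the document images here are random stand-ins:

```python
import numpy as np
import tensorflow as tf

# InceptionV3 without the classification head; global average pooling
# yields the 2048-d feature vector per image.
extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg"
)

# Dummy batch standing in for scanned document images (299x299 RGB).
images = np.random.rand(4, 299, 299, 3).astype("float32")
images = tf.keras.applications.inception_v3.preprocess_input(images * 255.0)

features = extractor.predict(images)
print(features.shape)   # (4, 2048)
```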
I would like to train a machine learning model with several features as input, X[], and one output, Y. Every sample is laid out like this: X[0], X[1], X[2], X[3], X[4], Y, where each feature holds a single value. That is a normal machine learning problem. But now suppose I would like X[3] to hold multiple values; for example, sample 1's data is: X[0] | X[1] …
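One common way to feed such a multi-valued feature to a standard model is multi-hot encoding. A minimal scikit-learn sketch, with column names invented to mirror the X[0]..X[4] layout:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "x0": [1.0, 2.0],
    "x1": [0.5, 0.1],
    "x3": [["a", "b"], ["b", "c", "d"]],   # the multi-valued feature
    "y":  [0, 1],
})

# Expand x3 into one binary column per distinct value it can hold.
mlb = MultiLabelBinarizer()
x3_encoded = pd.DataFrame(
    mlb.fit_transform(df["x3"]),
    columns=[f"x3_{v}" for v in mlb.classes_],
    index=df.index,
)

X = pd.concat([df[["x0", "x1"]], x3_encoded], axis=1)
print(X)
```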
I'm new to Stack Exchange, so I am sorry if this is not the right way to ask a question here. For my thesis I wish to propose a method for future research: using PCA to cluster features (feature clustering) and then applying per-cluster PCA. I got the idea from this paper. But I have a hard time finding literature about PCA being used to cluster variables (rather than to reduce them). I could imagine that it is not …
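To make the proposed pipeline concrete, here is a minimal sketch under my own reading of it (cluster the features by their loading vectors on the top principal components, then run a separate PCA inside each feature cluster); all names and sizes are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))            # 200 samples, 12 features

# Step 1: represent each feature by its loadings on the top components.
pca = PCA(n_components=4).fit(X)
loadings = pca.components_.T              # shape (12 features, 4 components)

# Step 2: cluster the features in loading space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(loadings)

# Step 3: per-cluster PCA on the original columns of each cluster.
for k in range(3):
    cols = np.where(labels == k)[0]
    sub = PCA(n_components=min(2, len(cols))).fit_transform(X[:, cols])
    print(f"cluster {k}: features {cols.tolist()}, reduced shape {sub.shape}")
```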
I haven't used neural networks for many years, so excuse my ignorance. I was wondering what the most appropriate way is to train an LSTM model on my dataset. I have 3 attributes as follows:

- Attribute 1: small int, e.g., [123, 321, ...]
- Attribute 2: text sequence, e.g., ['cgtaatta', 'ggcctaaat', ...]
- Attribute 3: text sequence, e.g., ['ttga', 'gattcgtt', ...]
- Class label: binary, e.g., [0, 1, ...]

The length of each sample's sequence attributes (2 and 3) is arbitrary; therefore I do …
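One possible shape for this, as a hedged Keras sketch: encode each string at character level, pad to a common length (masking the padding), and merge the two sequence branches with the numeric attribute; all sizes here are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = {"a": 1, "c": 2, "g": 3, "t": 4}   # 0 is reserved for padding
MAXLEN = 16

def encode(seqs):
    ids = [[VOCAB[ch] for ch in s] for s in seqs]
    return tf.keras.preprocessing.sequence.pad_sequences(ids, maxlen=MAXLEN)

seq2 = encode(["cgtaatta", "ggcctaaat"])
seq3 = encode(["ttga", "gattcgtt"])
num1 = np.array([[123.0], [321.0]])
y = np.array([0, 1])

in_num = layers.Input(shape=(1,))
in_s2 = layers.Input(shape=(MAXLEN,))
in_s3 = layers.Input(shape=(MAXLEN,))

def branch(x):
    # mask_zero=True lets the LSTM ignore the padded positions.
    x = layers.Embedding(input_dim=5, output_dim=8, mask_zero=True)(x)
    return layers.LSTM(16)(x)

merged = layers.concatenate([in_num, branch(in_s2), branch(in_s3)])
out = layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model([in_num, in_s2, in_s3], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit([num1, seq2, seq3], y, epochs=1, verbose=0)
```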
There is a need to estimate Annual Average Daily Traffic volume (AADT). We have a bunch of data about vehicles' speeds collected over several years. It has been noticed that AADT depends on the average number of such samples over some time window, so a regression model $Y = f(x_1)$ could help estimate the AADT. The problem is that there are other features affecting the dependency, both numerical $(x_2, \ldots, x_k)$ and categorical $(c_1 = \text{data provider}, c_2 = \text{road class}, \ldots, c_m)$. …
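A common way to fold both kinds of features into one regression is scikit-learn's ColumnTransformer; a minimal sketch with placeholder column names and data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "x1": [110.0, 95.0, 240.0, 180.0],        # avg number of speed samples
    "x2": [60.0, 55.0, 72.0, 66.0],           # another numeric feature
    "data_provider": ["A", "B", "A", "C"],
    "road_class": ["highway", "urban", "highway", "rural"],
    "aadt": [12000, 8000, 30000, 21000],
})

# Scale the numeric columns, one-hot encode the categoricals.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["x1", "x2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["data_provider", "road_class"]),
])

model = Pipeline([("pre", pre), ("reg", GradientBoostingRegressor())])
model.fit(df.drop(columns="aadt"), df["aadt"])
```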
In machine learning, if the data we are working on has, say, 6 features/variables, does that mean the prediction line/curve of our ML model is represented by a hexic (degree-6) polynomial equation? In short, is the degree of our prediction line/curve the same as the number of features in our data?
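The two quantities are independent, which a quick scikit-learn sketch can show: a plain linear model on 6 features is still degree 1, and the degree only grows if polynomial terms are added explicitly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))              # 6 features
y = X @ np.arange(1.0, 7.0) + rng.normal(size=100)

# Degree-1 hyperplane in 6 variables: number of features != degree.
linear = LinearRegression().fit(X, y)

# Same 6 features, but now an explicitly degree-2 surface.
X_quad = PolynomialFeatures(degree=2).fit_transform(X)
quadratic = LinearRegression().fit(X_quad, y)

print(X.shape[1], X_quad.shape[1])         # 6 original vs 28 expanded columns
```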
This is a sister post to the original closed post (here). Since the data transformation part is done after data splitting, on the TRAINING data only, I wonder: wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick different portions of the training data. But I personally find it hard to convince myself: shouldn't a data transformation be as invariant and generalizable as possible across different subsamplings of the dataset? Also, …
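The dependency is real and easy to demonstrate: a StandardScaler fit on two different training subsamples learns slightly different parameters. A quick sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(1000, 1))

# Two different 80% training subsamples -> two different fitted scalers.
for seed in (1, 2):
    X_train, _ = train_test_split(X, train_size=0.8, random_state=seed)
    scaler = StandardScaler().fit(X_train)
    print(f"seed={seed}: mean={scaler.mean_[0]:.4f}, std={scaler.scale_[0]:.4f}")
```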
I have a feature set from which I want to select various combinations and permutations of features. The length of a solution feature vector can range between, say, 5 and 20 features, and the ordering of the features is important, meaning that feature vector ABC is different from BCA; i.e., they are sequential and each depends on the previous one's output. The goal is to find many near-optimal solutions around the optimal solutions, and the solution space is …
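One family of approaches for order-sensitive selection like this is local search over permutations. A minimal hill-climbing sketch with a placeholder score() (the real objective would evaluate the sequential pipeline):

```python
import random

FEATURES = list("ABCDEFGHIJ")

def score(seq):
    # Placeholder objective; order-sensitive on purpose.
    return sum((i + 1) * (ord(f) - 64) for i, f in enumerate(seq)) % 97

def neighbours(seq):
    # All sequences reachable by swapping two positions.
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            s = list(seq)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

random.seed(0)
current = tuple(random.sample(FEATURES, 7))
for _ in range(50):
    best = max(neighbours(current), key=score)
    if score(best) <= score(current):
        break          # local optimum; restart from a new random sequence
    current = best
print(current, score(current))
```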
I have a dataset where some features (c, d) apply only when another feature (a) has a specific value. For example:

a, b, c, d
T, 60, 0x018, 3252002711
U, 167, ,
U, 67, ,
T, 66, 0x018, 15556

So I'm planning to split the dataset so that there are no missing values:

a, b, c, d
T, 60, 0x018, 3252002711
T, 66, 0x018, 15556

a, b
U, 167
U, 67

and then put these into individual models which combine …
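A minimal pandas sketch of that split-then-model plan, with a placeholder target column y and placeholder models:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

df = pd.DataFrame({
    "a": ["T", "U", "U", "T"],
    "b": [60, 167, 67, 66],
    "c": ["0x018", None, None, "0x018"],
    "d": [3252002711, None, None, 15556],
    "y": [1, 0, 1, 0],                      # placeholder target
})

df_t = df[df["a"] == "T"]                   # keeps b, c, d (no missing values)
df_u = df[df["a"] == "U"].drop(columns=["c", "d"])

# One model per partition; DummyClassifier stands in for the real models.
models = {
    "T": DummyClassifier().fit(df_t[["b", "d"]], df_t["y"]),
    "U": DummyClassifier().fit(df_u[["b"]], df_u["y"]),
}
```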
I have a question regarding the train-test split when forecasting a time series using features instead of the time series itself. I know that I should use a time-based train-test split if I use lagged values of the time series as predictors, but I am wondering whether that is also the case if I use an external feature. Suppose I try to forecast watermelon consumption using only the temperature (the X feature) instead of the watermelon time series itself. Leaving …
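For reference, scikit-learn's time-based split, where each fold trains on the past and validates on the future; the feature names are placeholders for the temperature example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

temperature = np.arange(100).reshape(-1, 1)       # X: external feature
consumption = np.arange(100) * 0.5                # y: watermelon consumption

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(temperature):
    # Each fold trains strictly on earlier timesteps than it validates on.
    model = LinearRegression().fit(temperature[train_idx], consumption[train_idx])
    print(f"train up to t={train_idx[-1]}, test t={test_idx[0]}..{test_idx[-1]}")
```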
I have data about purchases customers made on my website. Some users later decline the purchase, a scenario I'd like to avoid. I have lots of data about the purchases made on my website, so I'd like to find clusters of users who share similar attributes and are dense in "decliner"-type users. I have labelled data about those users (as we know who later declined the payment). The problem is: how do I cluster them in a meaningful …
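One simple recipe, sketched with scikit-learn: cluster on the purchase attributes only, then use the known labels to measure how decliner-dense each cluster is; the feature names are invented:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "order_value": rng.gamma(2.0, 50.0, 500),
    "n_past_orders": rng.poisson(3, 500),
    "declined": rng.integers(0, 2, 500),        # known label, not a feature
})

X = StandardScaler().fit_transform(df[["order_value", "n_past_orders"]])
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Decline rate per cluster: high-rate clusters are the decliner-dense ones.
print(df.groupby("cluster")["declined"].mean().sort_values(ascending=False))
```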
I'm trying to get an activation map using a Faster R-CNN ResNet50 backbone, but am having issues setting up the proper hook for the output information. Most of the libraries, like gradcam, don't seem to have built-in support for Faster R-CNN setups. I think the flow for Faster R-CNN requires something extra, but I'm unable to figure out what I need to hook into in the model. layer4 is what I've concentrated on, as it's called out in numerous tutorials (which …
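For the hook itself, a minimal sketch with torchvision's detection model: the ResNet trunk lives at model.backbone.body in the FPN variants, so layer4's activations can be captured with a forward hook (whether layer4 is the right CAM target is a separate question):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

activations = {}

def hook(module, inputs, output):
    # Stash layer4's feature map every time the backbone runs.
    activations["layer4"] = output.detach()

handle = model.backbone.body.layer4.register_forward_hook(hook)

with torch.no_grad():
    model([torch.rand(3, 480, 640)])

# (1, 2048, H', W'), where H', W' are the post-resize dims divided by 32.
print(activations["layer4"].shape)
handle.remove()
```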
I'd like to use a feature store to store some time series, and I asked myself what the best way to store the timesteps is. Is it better to store each timestep horizontally and then do the windowing after collecting the data from the feature store to create the feature vector? Or is it better to additionally store all timesteps in a column and do the windowing before writing to the feature store? Personally I think the better way is to do …
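To make the two layouts concrete, a pandas stand-in for the feature store: "long" keeps one row per timestep and windows at read time, while "wide" materialises the window into a column at write time:

```python
import pandas as pd

# "Long" layout: one row per timestep; windows are built at read time.
long_fmt = pd.DataFrame({
    "entity_id": ["a"] * 5,
    "ts": pd.date_range("2024-01-01", periods=5, freq="h"),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
read_time_window = long_fmt["value"].rolling(3).mean()   # built on retrieval

# "Wide" layout: the window is already materialised per row at write time.
wide_fmt = pd.DataFrame({
    "entity_id": ["a"] * 3,
    "ts": long_fmt["ts"].iloc[2:].to_list(),
    "last_3_values": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]],
})
print(wide_fmt)
```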
I was reading some lecture notes on missing data, and the author suggests the following approach to determine whether some variable is missing completely at random (MCAR) or not. Supervised learning method: code 'missing' as a new category. Run a supervised analysis (to predict a separate target variable) and check whether 'missing' has an effect on the prediction of the response in the learned model. If the category 'missing' has an effect, this is evidence that the data is not MCAR. …
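A minimal sketch of that check as I read it, on synthetic data: code missingness as its own category, fit a supervised model, and inspect the weight the model puts on the 'missing' level:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"color": rng.choice(["red", "blue", None], 500)})
# Make missingness informative on purpose (i.e., NOT MCAR).
df["target"] = (df["color"].isna() & (rng.random(500) < 0.8)).astype(int)

df["color"] = df["color"].fillna("missing")          # 'missing' as a category
X = pd.get_dummies(df["color"])
model = LogisticRegression().fit(X, df["target"])

# A large coefficient on the 'missing' column is evidence against MCAR.
print(dict(zip(X.columns, model.coef_[0].round(2))))
```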
Should all the features in a dataset be converted to the same data type? For instance, if all the features have numerical values, some int & some float, should they all be converted to float? What difference would this conversion make?
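In practice most libraries cast as needed, but the conversion does change memory use and, potentially, precision. A quick numpy/pandas look:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5, dtype=np.int64),
                   "b": np.linspace(0, 1, 5)})        # float64

print(df.dtypes)                      # mixed: int64 and float64
print(df.memory_usage(deep=True))

# Unifying to float32 halves the footprint but costs integer exactness
# for very large values and some decimal precision.
df32 = df.astype(np.float32)
print(df32.dtypes, df32.memory_usage(deep=True), sep="\n")
```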
I'm building a machine learning model in Python to predict soccer player values. Consider the following feature columns of the dataframe:

position | goals | goals_conceded
---------|-------|---------------
Forward  | 23    | NaN
Defender | 2     | NaN
Defender | 4     | NaN
Keeper   | NaN   | 20
Keeper   | NaN   | 43

Since keepers don't usually score goals, they'll almost always have null values in the "goals" column, but they still can have this statistic, so it …
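One common treatment for this kind of structurally missing value, sketched in pandas: fill with 0 and add an indicator column so the model can tell "didn't score" apart from "stat doesn't apply":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "position": ["Forward", "Defender", "Defender", "Keeper", "Keeper"],
    "goals": [23, 2, 4, np.nan, np.nan],
    "goals_conceded": [np.nan, np.nan, np.nan, 20, 43],
})

for col in ["goals", "goals_conceded"]:
    df[f"{col}_applies"] = df[col].notna().astype(int)   # 1 where the stat exists
    df[col] = df[col].fillna(0)

print(df)
```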
I have a data set of mostly categorical variables. When I one-hot encoded them, some of the resulting features occur less than 3% of the time. For instance, the Tech-support feature only occurs 928 times in a data set of 32,561 samples, i.e., it only occurs 2.9% of the time. Is there a general cutoff point for when I should scrap these variables? I'm cleaning up this data set for binary logistic regression and an SVM. Thank you!
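Rather than dropping rare levels outright, a common alternative is to collapse everything below a frequency threshold into an "Other" bucket before one-hot encoding. A minimal pandas sketch with a 3% cutoff:

```python
import pandas as pd

def collapse_rare(series, threshold=0.03):
    # Replace categories rarer than the threshold with "Other".
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    return series.where(~series.isin(rare), "Other")

occupation = pd.Series(
    ["Sales"] * 500 + ["Exec"] * 450 + ["Tech-support"] * 20 + ["Armed-Forces"] * 5
)
collapsed = collapse_rare(occupation)
print(pd.get_dummies(collapsed).sum())   # Tech-support and Armed-Forces -> Other
```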
I am making a Sequential neural network for regression with 3 dense layers, which will be trained on a simple dataset. But before I even get to the part of the code that executes the model, I am getting a different shape for my features than the number of columns in the dataset. The columns of the dataset include: 1) one categorical "Name" column, which is one-hot encoded; 2) the other 20 columns, which are integers/floats. I have 21 features in my dataset. The ValueError is telling me it …
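The likely culprit is that one-hot encoding the "Name" column expands it into one column per unique name, so the encoded matrix has more than 21 columns. Deriving input_shape from the encoded matrix avoids the mismatch; a minimal sketch with invented data:

```python
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    "Name": ["anna", "ben", "cara", "ben"],
    **{f"f{i}": [1.0, 2.0, 3.0, 4.0] for i in range(20)},
    "target": [10.0, 20.0, 30.0, 40.0],
})

X = pd.get_dummies(df.drop(columns="target"))   # 20 + n_unique_names columns
y = df["target"]
print(X.shape)                                  # (4, 23), not (4, 21)

model = tf.keras.Sequential([
    # Take the input size from the *encoded* matrix, not the raw column count.
    tf.keras.layers.Dense(32, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X.astype("float32"), y, epochs=1, verbose=0)
```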
In an epidemic disease dataset spanning 3 months, I have a feature (var dt_died) with the death dates of patients (800 people died out of the 12k unique subjects in this dataset, so obviously only dead subjects have data for this feature). I also have a feature (var dt_test_positive) with the date of testing positive for the disease (with no missing values). I would like to combine these two features into one (var difference). If I just make the …
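A minimal sketch of that combined feature in pandas: subtracting the two datetime columns yields a timedelta, and rows without a death date simply come out as NaT/NaN rather than raising an error:

```python
import pandas as pd

df = pd.DataFrame({
    "dt_test_positive": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10"]),
    "dt_died": pd.to_datetime(["2020-03-20", pd.NaT, "2020-04-02"]),
})

# Days from positive test to death; survivors get NaN in "difference".
df["difference"] = (df["dt_died"] - df["dt_test_positive"]).dt.days
print(df)
```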