I am working with Orange for my thesis, using logs and core data; however, since I am a beginner, I am a little stuck with the feature construction widget. Ultimately, I would like to combine different features so I can compare them. What kind of information should I put in the "Values" field for a categorical feature? Any examples would be really appreciated (the ones from Orange did not help me).
I'm working on a propensity model, predicting whether customers will buy or not. While doing exploratory data analysis, I found that customers have a buying pattern: most customers repeat their purchase at a fixed interval. For example, some customers repeat purchases every four quarters, some every 8 or 12 quarters, etc. I have the purchase dates for these customers. What is the most useful feature I can create to capture this pattern in the data? I'm predicting whether in the next quarter …
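One candidate feature for a pattern like this (a sketch, not the only option): each customer's typical gap between consecutive purchases, plus the time elapsed since their last purchase; their ratio tells the model how "due" a customer is. A minimal pure-Python sketch with a hypothetical purchase history:

```python
from datetime import date
from statistics import median

# Hypothetical purchase history: customer_id -> sorted purchase dates.
purchases = {
    "c1": [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 2)],  # ~yearly buyer
}

def interval_features(dates, as_of):
    """Median gap between consecutive purchases (days), recency, and 'due-ness'."""
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    typical = median(gaps) if gaps else None
    recency = (as_of - dates[-1]).days
    # How far into the typical cycle the customer currently is.
    due = recency / typical if typical else None
    return typical, recency, due

typical, recency, due = interval_features(purchases["c1"], date(2022, 7, 1))
```

A `due` value approaching 1.0 would mean the customer is near the end of their usual cycle, which is exactly the "repeats every N quarters" signal described above.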
I would like to train a machine learning model with several features as input, X[], and one output, Y. For example, every sample is a row like this: X[0], X[1], X[2], X[3], X[4], Y. When each feature of a sample holds only one value, this is a normal machine learning problem. But now, what if I would like X[3] to hold multiple values? For example, sample 1's data is: X[0] | X[1] …
I'm currently trying to create a few features to improve the performance of a model. One feature I would like to create is the difference in days between a customer's purchase and their previous one. Creating this feature is not a problem. However, I don't know which value to set when this is the customer's first purchase. Which value should I set and, more generally, how should I treat these cases? customer_id date_purchase …
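One common way to handle the first-purchase case (a sketch of one option among several): keep the gap as missing and add an explicit indicator column, rather than inventing a sentinel number the model could misread as a real gap. A dependency-free sketch with made-up data:

```python
from datetime import date

# Hypothetical rows: (customer_id, purchase_date), assumed sorted by date.
rows = [
    (1, date(2023, 1, 10)),
    (1, date(2023, 2, 1)),
    (2, date(2023, 3, 5)),
]

last_seen = {}
features = []
for cust, d in rows:
    prev = last_seen.get(cust)
    gap = (d - prev).days if prev is not None else None  # missing, not 0
    features.append({"customer_id": cust,
                     "days_since_prev": gap,
                     "is_first_purchase": prev is None})
    last_seen[cust] = d
```

The indicator column lets the model treat "no previous purchase" as its own state; models that cannot handle missing values would then need an imputation step on top of this.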
Specifically, what I am looking for are tools with functionality specific to feature engineering. I would like to be able to easily smooth, visualize, fill gaps, etc.: something similar to MS Excel, but with R as the underlying language instead of VB.
I have been trying to find some good algorithms for feature selection. Using correlation or other non-causal methods is not the right way to do feature selection here. I am searching for algorithms or libraries in Python that use causal effects for feature selection. The ones I have found so far only handle binary outcomes; I am working on a regression problem, so the outcome must be continuous. "Causality-Guided Feature Selection"
I have a problem where I am trying to classify the outcome of customer complaint cases. I already have several features, such as the type of item bought, the reason for the complaint, etc. I am trying to add a feature that represents how long a case is 'open' (meaning waiting for resolution), the logic being that a case that is 'open' for a long time is unlikely to have a positive outcome. The issue is, I am training my model on 'closed' cases, hence have …
We're training a binary classifier in AutoML, and one of the features consists of browser versions. Currently these versions are provided "normalized" to the model, according to the percentile of the browser that the current observation falls into. For example, if the percentiles of some specific browser's versions are:

percentile  version
p25         34
p50         45
p75         53
p99         70

then an observation with said browser and version=54 would be represented as:

p25  p50  p75  p99
1    1    1    0

My question …
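For reference, the encoding described in the question can be reproduced with a simple threshold comparison. A sketch of that scheme, with the percentile cut-offs hard-coded from the example above:

```python
# Percentile cut-offs for one hypothetical browser, taken from the example.
cutoffs = {"p25": 34, "p50": 45, "p75": 53, "p99": 70}

def encode_version(version):
    """Emit 1 per percentile the version exceeds, 0 otherwise."""
    return [int(version > v) for v in cutoffs.values()]

encoded = encode_version(54)  # above p25/p50/p75, below p99
```

This is a monotone binning: the number of 1s is effectively an ordinal rank of the version within that browser's distribution.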
Suppose there are 2000 movies and a company wants to recommend some movies (for example, at most 5) to each visitor. The objective is to learn to predict which movie will be selected if a specific set of movies is recommended.

   option-1  option-2  option-3  option-4   option-5   Selected-Movie
1. movie1    movie3    movie4                          movie4
2. movie3    movie4    movie100  movie1000  movie1001  movie1001
3. movie4    movie5    movie34                         movie34

Based on this data set, I want to learn when sample 1 is suggested …
Is there any resource with a list of feature engineering techniques? A mapping of data type, model, and feature engineering technique would be a gold mine.
I have extracted features from two types of signals. Prior to merging them into one feature vector, I computed an importance score for every feature within each signal type. I would like to weight the features according to those scores. Would the best way be to multiply every feature by its score and then concatenate the features of both signals, and should the data be normalized again after the multiplication? Or is there a different …
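The multiply-then-concatenate idea from the question can be sketched in a few lines (pure Python, with made-up feature values and scores). Whether to re-normalize afterwards depends on the downstream model: distance-based learners are scale-sensitive, tree-based ones largely are not.

```python
from math import sqrt

signal_a = [0.5, 1.2, -0.3]   # features from signal A (hypothetical)
signal_b = [2.0, 0.1]         # features from signal B (hypothetical)
scores_a = [0.9, 0.4, 0.7]    # importance scores, same order as features
scores_b = [0.6, 0.8]

# Weight each feature by its importance, then concatenate the two signals.
weighted = [f * s for f, s in zip(signal_a, scores_a)] + \
           [f * s for f, s in zip(signal_b, scores_b)]

# Optional: rescale the combined vector to unit length, so the weighting
# changes the relative emphasis of features, not the overall magnitude.
norm = sqrt(sum(x * x for x in weighted))
unit = [x / norm for x in weighted]
```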
I have a few thousand grayscale images, and I would like to generate a universal representation of the patterns within: a semantic/ordered composition of all features, so to speak. For instance, take 10,000 images of a dog and draw the archetypal dog. Does this task have a technical name, and is there a method out there specifically for such purposes? I guess this is similar to what happens during the training of a neural network; I just don't necessarily need …
I have the task of representing a user feature matrix. I have features like gender, age, etc., but I also have a multi-valued feature called "movies watched", which is essentially another table of movie names watched by that user, each with a numeric duration; the order of the movies does not matter here. Also, "movies watched" can range from 20 to 300 movies. What is the best way of representing "movies watched" as a feature vector?
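One straightforward, order-independent representation (a sketch, using a hypothetical fixed movie vocabulary): a fixed-length vector indexed by movie, holding the watch duration and zero for unwatched titles. Every user gets the same vector length, whether they watched 20 or 300 movies:

```python
# Hypothetical global movie vocabulary; its order fixes the vector layout.
vocab = ["movie_a", "movie_b", "movie_c", "movie_d"]
index = {m: i for i, m in enumerate(vocab)}

def movies_vector(watched):
    """watched: {movie_name: duration} -> fixed-length duration vector."""
    vec = [0.0] * len(vocab)
    for movie, duration in watched.items():
        if movie in index:          # silently drop out-of-vocabulary titles
            vec[index[movie]] = duration
    return vec

v = movies_vector({"movie_b": 95.0, "movie_d": 120.0})
```

With thousands of movies this vector becomes large and sparse; sparse storage, dimensionality reduction, or learned embeddings are common next steps.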
I have been playing with two-dimensional machine learning using pandas (trying to do something like this), and I would like to combine Lat/Long into a single numerical feature, ideally in a linear fashion. Is there a "best practice" for doing this?
Suppose we are asked to predict something given a set of features. How do we know whether that target is actually predictable? That is, how do we know whether there is some relation between the dependent variable and the independent features, or some patterns in the data that a machine learning algorithm could exploit? What if the target outcomes are just random? How do we test for this relationship before we start building ML/DL models?
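One common sanity check (a sketch, not a complete answer): compare a statistic computed on the real targets against the same statistic on shuffled targets. If the real value is not clearly outside the shuffled distribution, the apparent signal may just be noise. Here is a pure-Python permutation test on a simple correlation statistic:

```python
import random
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

def permutation_pvalue(xs, ys, n_perm=500, seed=0):
    """Fraction of shuffled targets whose |correlation| >= the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

x = list(range(30))
y_signal = [2 * v + 1 for v in x]      # perfectly predictable target
p = permutation_pvalue(x, y_signal)
```

The same idea extends to full models: refit on shuffled targets and compare cross-validated scores, at proportionally higher compute cost.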
Is it possible to reduce non-correlated multi-dimensional data over features to 1D data? A working option is pooling (mean/min/max) over an embedding vector (n samples of embeddings of m dimensions), e.g. converting many embeddings (n × m) to a list of means (1 × m). However, these all lose a lot of information (especially the relationships between features within a single embedding). This doesn't have to be a reduction (i.e. the resulting 1D vector can be larger than m). If it's …
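The pooling baseline mentioned above, as a minimal pure-Python sketch (n embeddings of dimension m). Concatenating several pool types is one cheap way to keep more information than a single mean, and the result is a 1D vector larger than m, which the question explicitly allows:

```python
def pool(embeddings):
    """n x m -> flat vector of length 3m: column-wise mean, min, max."""
    cols = list(zip(*embeddings))          # transpose to m columns
    means = [sum(c) / len(c) for c in cols]
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return means + mins + maxs

pooled = pool([[1.0, 4.0],
               [3.0, 0.0]])
```

This still discards within-embedding relationships; learned alternatives (e.g. attention-weighted pooling) address that at the cost of training a pooling module.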
I have a 3D graph like below: Ref: google images. It has 2 angles as X and Y, and the Z axis is the amplitude value (each 3D graph represents a pixel). I want to model this into some useful data structure, like a graph or a vector, based on parameters extracted from the 3D graph, so that I'll be able to feed it into a classification algorithm. But I'm unable to extract all the local minima/maxima or slopes. …
As is always the way, I stumbled across Tsallis entropy on SO while looking for something completely different. This soon led me to reading all sorts of interesting but terse academic papers. I am unfortunately a mere layman, and I still have one big unsolved question. The key input to Tsallis entropy is a probability array. What I don't understand is: how do you get it out of a time series? Allow me to give you a completely hypothetical example: I …
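One common route from a time series to a probability array (a sketch, and only one of several possible discretizations): bin the values into a histogram and normalize the counts, then feed the probabilities into the Tsallis formula S_q = (1 - Σ p_i^q) / (q - 1):

```python
def probabilities(series, n_bins=4):
    """Histogram the series into equal-width bins and normalize the counts."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0          # guard: all values identical
    counts = [0] * n_bins
    for v in series:
        i = min(int((v - lo) / width), n_bins - 1)   # clamp the max value
        counts[i] += 1
    return [c / len(series) for c in counts]

def tsallis_entropy(p, q=2.0):
    """S_q = (1 - sum(p_i^q)) / (q - 1); recovers Shannon entropy as q -> 1."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

p = probabilities([0.0, 1.0, 2.0, 3.0], n_bins=4)   # uniform: each bin 0.25
s = tsallis_entropy(p, q=2.0)
```

The bin count and bin scheme (equal-width vs. equal-frequency, or symbolic encodings like ordinal patterns) are modeling choices that strongly affect the resulting entropy.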
I have trained a model (Random Forest) and now I would like to use it to predict certain data on a particular day. I have a categorical column with some values (say a, b, c, d, e) over a period. On a particular day, only some of those values are present (say b, d). When converting them to one-hot encoding, I am using LabelEncoder and then the one-hot encoder. But if I give that column for label encoding, it labels only …
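A common fix for this (a sketch, not the only way): fix the full category list at training time, so prediction-day data with only b and d present still produces the same columns in the same order. In scikit-learn, `OneHotEncoder` with the `categories` and `handle_unknown` parameters does this; here is a dependency-free sketch of the idea:

```python
# Full category list, fixed at training time (hypothetical values a..e).
categories = ["a", "b", "c", "d", "e"]

def one_hot(value):
    """Always emits len(categories) columns, even if only b and d occur today."""
    return [int(value == c) for c in categories]

rows_today = ["b", "d"]                  # only two categories on this day
encoded = [one_hot(v) for v in rows_today]
```

Fitting the encoder per-day is the bug: the training-time vocabulary, not the day's data, must define the columns the Random Forest sees.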
I'm relatively new to ML/Statistical Analysis, and I'm facing a dataset structured like this:

person_id, pay, task, hours
1, 560, A, 3
1, 560, B, 5
2, 650, A, 7
3, 520, C, 6
3, 520, A, 2
...

meaning person 1 is cumulatively paid 560 to perform task A for 3 hrs and task B for 5 hrs; person 2 is paid 650 for task A for 7 hrs; person 3 is paid 520 for task C for 6 hrs and A for 2 hrs, etc. …
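One way to flatten data like this into one row per person (a sketch; whether hours-per-task columns are the right aggregation depends on the modeling goal): pivot `task` into columns with `hours` as the values, keeping `pay` as the per-person column. In pandas this would be a `pivot_table`; a dependency-free version of the same reshape:

```python
from collections import defaultdict

# Rows from the example: (person_id, pay, task, hours)
rows = [(1, 560, "A", 3), (1, 560, "B", 5), (2, 650, "A", 7),
        (3, 520, "C", 6), (3, 520, "A", 2)]

tasks = sorted({t for _, _, t, _ in rows})                      # ["A", "B", "C"]
people = defaultdict(lambda: {"pay": None, "hours": defaultdict(float)})
for pid, pay, task, hours in rows:
    people[pid]["pay"] = pay            # pay repeats per person; keep one copy
    people[pid]["hours"][task] += hours

# One row per person: pay plus hours for each task (0 if not performed).
table = {pid: [rec["pay"]] + [rec["hours"][t] for t in tasks]
         for pid, rec in people.items()}
```

The result has columns [pay, hours_A, hours_B, hours_C], which a standard tabular model can consume directly.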