Queries regarding feature importance for categorical features: Context: I have almost 185 categorical features, each with anywhere from 1 to 8 categories (most have 2, 3, or 4), plus nulls. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance OR feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs. category) to analyse. What …
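One way to get per-feature importances despite one-hot encoding is to sum the dummy-level importances back to their parent feature. A minimal sketch (the data frame and column names here are hypothetical stand-ins for the real 185 features):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frame standing in for the real data (names and values are hypothetical).
df = pd.DataFrame({
    "f1": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "f2": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "target": [0, 1, 0, 1, 1, 0, 1, 1],
})
X = pd.get_dummies(df[["f1", "f2"]])   # one indicator column per category
y = df["target"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sum dummy-level importances back to the original feature name.
imp = pd.Series(rf.feature_importances_, index=X.columns)
per_feature = imp.groupby(imp.index.str.split("_").str[0]).sum()
print(per_feature.sort_values(ascending=False))
```

Ranking `per_feature` and keeping the top 60 would then give a purely importance-based shortlist, to be cross-checked against business importance.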
I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these steps: …
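A basic version of such a model can be sketched as one-hot encoding plus a regressor. This is only an illustration (the fourth row and the regressor choice are assumptions, not from the question):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Rows mirroring the example above (the last row is invented for illustration).
jobs = pd.DataFrame({
    "Manager":      ["George", "George", "George", "Alice"],
    "City":         ["Brisbane", "Brisbane", "Perth", "Perth"],
    "Design":       ["BigKahuna", "SmallKahuna", "BigKahuna", "BigKahuna"],
    "ClientType":   ["Personal", "Business", "Investor", "Personal"],
    "TaskDuration": [10, 15, 7, 9],
})
X = pd.get_dummies(jobs.drop(columns="TaskDuration"))
y = jobs["TaskDuration"]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# New jobs must be encoded onto the same dummy columns as the training data.
new = pd.get_dummies(jobs.drop(columns="TaskDuration").iloc[[0]]).reindex(
    columns=X.columns, fill_value=0)
pred = model.predict(new)
print(pred)
```

The `reindex(columns=X.columns, fill_value=0)` step is the part that trips people up: without it, a new job with unseen categories produces a matrix whose columns don't line up with training.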
I want to detect anomalies in a bank data set using an unsupervised learning method. However, in this data set, all columns except time and amount are categorical, and about half of them have more than 90 percent missing values. The goal is to detect anomalies through unsupervised learning. I'm currently using an autoencoder to approach this, but I wonder whether it will work. Also, because the purpose is to detect whether data is abnormal when data comes …
I have a large dataset that is entirely categorical. I'm trying to train on it using xgboost, so I must first convert this categorical data to numerical. So far I've been using sparse.model.matrix() in the Matrix library, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same one that sparse.model.matrix returns. I know there is a way to force sparse.model.matrix to return identical output as the solution in …
I have a large dataset with around 200 features, consisting mostly of timeseries and categorical data, with some continuous. The dataset is extracted from a postal service. Small example of random (scrambled) entries:

shipment    delivery    cost   location         weight_kg
2020-04-22  2020-04-23  77.31  UK:66c54f531...  0.5
2020-04-23  2020-04-25  44.14  DK:22c54f531...  2.23
2020-04-24  2020-04-27  53.84  UK:66c54f531...  1.57
2020-04-25  2020-04-26  22.09  UK:66c54f531...

My first inclination was to make a demand-forecast model on shipment/count_monthly(shipment), but considering the amount of features, a multivariate case seemed more relevant. I …
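Two derived features suggest themselves from the example rows: transit time (delivery minus shipment) and the country prefix of the location code. A minimal sketch over the rows shown (the truncated location IDs are shortened here; the real hashes stay as in the data):

```python
import pandas as pd

# Minimal frame mirroring the example rows above.
df = pd.DataFrame({
    "shipment": ["2020-04-22", "2020-04-23", "2020-04-24"],
    "delivery": ["2020-04-23", "2020-04-25", "2020-04-27"],
    "cost": [77.31, 44.14, 53.84],
    "location": ["UK:66c54f531", "DK:22c54f531", "UK:66c54f531"],
    "weight_kg": [0.5, 2.23, 1.57],
})
df["shipment"] = pd.to_datetime(df["shipment"])
df["delivery"] = pd.to_datetime(df["delivery"])

# Engineered features: transit time in days, and the country prefix.
df["transit_days"] = (df["delivery"] - df["shipment"]).dt.days
df["country"] = df["location"].str.split(":").str[0]
print(df[["transit_days", "country"]])
```

The `country` column turns a near-unique identifier into a low-cardinality categorical, which is far friendlier to a multivariate model.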
I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables. I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much. Each observation in my dataset can take multiple values for the CATEGORY feature; for instance, row 1 could have the value a but row 2 could have the values a, b, c, d …
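One commonly used technique for this situation is the hashing trick, which maps an unbounded set of category values into a fixed-width sparse vector and handles multiple values per row natively. A minimal sketch (the example values and the 256-dimension choice are arbitrary):

```python
from sklearn.feature_extraction import FeatureHasher

# Each row is the set of CATEGORY values it takes (values here are hypothetical).
rows = [["a"], ["a", "b", "c", "d"], ["b", "e"]]

# Hash the ~20,000 possible categories into a fixed 256-dimensional space
# instead of creating one column per category.
hasher = FeatureHasher(n_features=256, input_type="string")
X = hasher.transform(rows)
print(X.shape)
```

The trade-off is hash collisions: two categories can land in the same bucket, which usually costs little accuracy compared with the savings in dimensionality.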
I have a huge dataset with categorical columns as features, and my target variable is also categorical. None of the values are ordinal, so I think it is best to use one-hot encoding. But I have one issue: my target variable has 90 classes, so if I one-hot encode it there will be 90 target columns, which becomes too complex. But as none of the values are ordinal, can I …
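One relevant point: the target usually does not need one-hot encoding at all. Most classifiers (sklearn's included) accept a single integer-encoded target column, and the integer codes carry no ordinal assumption for classification. A minimal sketch with hypothetical class labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target with many classes; sklearn classifiers accept an
# integer-encoded target directly, so 90 classes never need 90 columns.
y = ["cat_07", "cat_42", "cat_07", "cat_89"]
le = LabelEncoder()
y_enc = le.fit_transform(y)
print(y_enc)                         # one integer code per class
print(le.inverse_transform(y_enc))   # maps back to the original labels
```

One-hot encoding is then reserved for the nominal *input* features, where a model would otherwise misread integer codes as magnitudes.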
I have monthly snapshots (3 years) of all the contract data. It includes the following information:
Contract status [Categorical]: proposed, tracked, submitted, won, lost, etc.
Contract stage [Categorical]: prospecting, engaged, tracking, submitted, etc.
Duration of contract [Date/Time]: months and years
Bid start date [Date/Time]: date (but this changes when contracts are delayed)
Contract value [Numerical]: value of the contract in local currency
Future revenue projection [Numerical]: currency-value breakdown of revenue for the next 5 years (this value is …
My data has two columns which are highly correlated: if column1 has value ABC, column2 should be XYZ, i.e. ABC --> XYZ. If column2 has anything else, it's an anomaly. Likewise, there are thousands of such combinations. I already tried KModes clustering, where the number of clusters equals the number of unique values in column1. However, the clusters do not have equal density, so some bad data with high density is classified as normal and good data with low density is marked anomalous. I want …
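Since the rule is a near-deterministic column1 --> column2 mapping, a simpler alternative to clustering is to learn the dominant mapping from the (mostly clean) historical data and flag deviations. A minimal sketch, assuming bad values are the minority within each column1 group:

```python
import pandas as pd

# Toy data: "QQQ" is the bad value hiding under the ABC -> XYZ rule.
df = pd.DataFrame({
    "column1": ["ABC", "ABC", "ABC", "DEF", "DEF"],
    "column2": ["XYZ", "XYZ", "QQQ", "UVW", "UVW"],
})

# Learn the most frequent column2 per column1 value, then flag deviations.
expected = df.groupby("column1")["column2"].agg(lambda s: s.mode().iloc[0])
df["anomaly"] = df["column2"] != df["column1"].map(expected)
print(df)
```

This sidesteps the density problem entirely: each column1 value gets its own expected partner, so a rare but valid combination is judged within its own group rather than against global cluster densities.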
So, I've got a dataset in which almost all columns are categorical variables. The problem is that most of these categorical variables have very many distinct values. For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into multiple other columns using domain knowledge, e.g. into network class type, host type, and so on. However, wouldn't that make my dataset lose some …
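The suggested split can be sketched mechanically on the IP column. This is a purely illustrative octet split with made-up addresses; a real network-class or host-type mapping would need the domain knowledge the question mentions:

```python
import pandas as pd

# Hypothetical IP values; the real column has over a million unique addresses.
df = pd.DataFrame({"ip": ["192.168.0.1", "10.0.0.7", "192.168.1.9"]})

# Split the address into four lower-cardinality integer columns.
octets = df["ip"].str.split(".", expand=True).astype(int)
octets.columns = ["oct1", "oct2", "oct3", "oct4"]

# The /16 prefix alone has far fewer distinct values than the full address.
df["prefix16"] = octets["oct1"].astype(str) + "." + octets["oct2"].astype(str)
print(df.join(octets))
```

Some information is indeed lost (the exact host), but a million-level categorical is nearly useless to most models anyway, so the trade is usually worthwhile.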
I have a training dataset where the values of the "Output" column depend on three columns (which are categorical, with no ordering):

Inp1   Inp2      Inp3               Output
A,B,C  AI,UI,JI  Apple,Bat,Dog      Animals
L,M,N  LI,DO,LI  Lawn,Moon,Noon     Noun
X,Y,Z  LI,AI,UI  Xmas,Yemen,Zombie  Extras

Based on this training data, I need an ML algorithm to predict the output for any incoming data row, assigning the output of the most similar training row. The rows can keep increasing (hence get_dummies is creating a lot …
I'm currently getting more into the topic of GANs and generative models. I've understood how the generator and discriminator work together in optimization to generate synthetic samples. Now I'm wondering how the model learns to reflect the occurrence frequencies of the true entries in the case of categorical data. As an example, let's say we have two columns of entries: (1, A), (1, A), (2, A), (2, B). The model, when trained, would not only try to output real combinations, so e.g. …
I have a dataset:

Inp1   Inp2      Inp3            Output
A,B,C  AI,UI,JI  Apple,Bat,Dog   Animals
L,M,N  LI,DO,LI  Lawn,Moon,Noon  Noun
X,Y    AI,UI     Yemen,Zombie    Extras

For these values, I need to apply an ML algorithm, and hence need an encoding technique. I tried label encoding, but it encodes the entire cell to a single int, e.g.:

Inp1  Inp2  Inp3  Output
5     4     8     0

But I need a separate encoding for each value in a cell. How should I go about it? Inp1 Inp2 …
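Per-value encoding of multi-valued cells is exactly what sklearn's MultiLabelBinarizer does: split each cell on the comma and emit one indicator column per distinct value. A minimal sketch on the Inp1 column above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each cell of Inp1 holds several comma-separated values; encode each value
# with its own indicator column instead of labelling the whole cell.
inp1 = ["A,B,C", "L,M,N", "X,Y"]
mlb = MultiLabelBinarizer()
X = mlb.fit_transform([s.split(",") for s in inp1])
print(mlb.classes_)   # one column per distinct value
print(X)
```

The same transform applied per column (Inp1, Inp2, Inp3) yields a fixed-width binary matrix, regardless of how many values each cell holds.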
There is a need to estimate the Annual Average Daily Traffic Volume (AADT). We have a bunch of data about vehicles' speeds over several years. It has been noticed that AADT depends on the average number of such samples over some time period, so a regression model $Y = f(x_1)$ could help estimate the AADT. The problem is that there are other features affecting the dependency, which are both numerical $(x_2, \dots, x_k)$ and categorical $(c_1 = \text{data provider}, c_2 = \text{road class}, \dots, c_m)$. …
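A standard way to mix numerical regressors with categorical ones is a ColumnTransformer that passes the numeric columns through and one-hot encodes the categorical ones before the regression. A minimal sketch with entirely hypothetical values (the column names echo $x_1$, $c_1$, $c_2$ above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mix of one numeric regressor and two categorical ones.
df = pd.DataFrame({
    "x1":         [10.0, 20.0, 15.0, 30.0],
    "provider":   ["A", "A", "B", "B"],
    "road_class": ["highway", "urban", "urban", "highway"],
    "aadt":       [1000.0, 2100.0, 1500.0, 3200.0],
})
pre = ColumnTransformer([
    ("num", "passthrough", ["x1"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["provider", "road_class"]),
])
model = make_pipeline(pre, LinearRegression())
model.fit(df.drop(columns="aadt"), df["aadt"])
preds = model.predict(df.drop(columns="aadt"))
print(preds)
```

Effectively this fits a separate intercept shift per provider and per road class on top of the $Y = f(x_1)$ relationship; tree-based regressors can be swapped in if the dependency is non-linear.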
I have a Pandas data frame with columns from a survey containing the categorical values "Increased", "Decreased", and "Neutral". My question is: how can I assign specific numerical values to these categorical values, namely +1 for Increased, -1 for Decreased, and 0 for Neutral?
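This is a direct fit for `Series.map` with an explicit dictionary (the column name `response` is a placeholder for the actual survey column):

```python
import pandas as pd

# Map the three survey answers to signed scores.
df = pd.DataFrame({"response": ["Increased", "Decreased", "Neutral", "Increased"]})
scores = {"Increased": 1, "Decreased": -1, "Neutral": 0}
df["score"] = df["response"].map(scores)
print(df)
```

Any value not in the dictionary becomes NaN, which doubles as a quick check for unexpected answers in the survey column.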
I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0). All the X are categorical variables (strictly nominal, not ordinal), some with 8 levels, some with 20. The data is highly imbalanced: 0.5% of y is 1. I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked at some papers and saw examples using MCA, but since the input dimension is small, I don't …
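For the imbalance part, one common first step is reweighting the classes rather than resampling. A minimal sketch with synthetic nominal data (the 5% positive rate and logistic model here are stand-ins for the real 0.5% problem):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in: 8 nominal features, a rare positive class.
rng = np.random.default_rng(0)
X_raw = rng.choice(["a", "b", "c"], size=(1000, 8))
y = (rng.random(1000) < 0.05).astype(int)

# class_weight="balanced" upweights the rare positives during fitting.
X = OneHotEncoder().fit_transform(X_raw)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:3])
print(proba)
```

With 0.5% positives, evaluation also matters more than the model: precision-recall curves are far more informative than accuracy here.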
I face a classification task: a target feature is to be predicted from several features. I'm working with Python. My dataset includes 60 features, from which I picked 16 that I think could be relevant (many others are time stamps, for example). The problem is that most of these 16 features are categorical, and encoding them with get_dummies generates 886 features. The data also includes about 17 million observations. I am now wondering how to tackle this problem and …
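At 17 million rows, the issue with get_dummies is less the 886 columns than their dense representation. One-hot matrices are mostly zeros, so a sparse encoder keeps memory roughly proportional to rows × 16 rather than rows × 886. A tiny sketch of the idea:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# get_dummies builds a dense frame; OneHotEncoder emits a scipy sparse matrix
# by default, which scales to millions of rows for a mostly-zero design matrix.
X_raw = np.array([["a", "x"], ["b", "y"], ["a", "y"]])
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)
print(type(X), X.shape)
```

Most sklearn estimators (linear models, in particular) accept this sparse matrix directly, so the 886-column encoding never needs to be materialized densely.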
I have 4 job titles, for each of which I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False indicator of whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions belonging to different job titles? I'm very new to this topic, and all I could think of is using dummy …
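For a True/False outcome across 4 groups, the textbook test is a chi-squared test of independence on the 4x2 contingency table of counts. A minimal sketch (the counts below are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = job titles, cols = (mentions skill, doesn't).
table = np.array([
    [80, 120],   # title 1
    [95, 105],   # title 2
    [40, 160],   # title 3
    [70, 130],   # title 4
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

A small p-value means the skill-mention rate is not the same across the four titles; pairwise follow-up tests (with a multiple-comparison correction) can then locate which titles differ.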
Just wanted to check whether there are any obvious flaws with a custom encoding idea I have, for categorical features used with RandomForestClassifier or any tree-based classifier. As you know, sklearn can only handle numerically valued features, so categorical features must somehow be encoded as numerical values. The most recommended encoding techniques on the web are OneHotEncoding and OrdinalEncoding (and label encoding, but a lot of posts say this could make the model …
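For context on the tree-based case: arbitrary integer codes are usually acceptable for trees, because a tree can separate any subset of codes with enough splits, unlike a linear model that would read the codes as magnitudes. A minimal sketch of sklearn's OrdinalEncoder (values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Arbitrary integer codes per category; fine for tree models, risky for
# linear ones, which would treat the codes as ordered magnitudes.
X_raw = np.array([["red", "small"], ["blue", "large"], ["red", "large"]])
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = enc.fit_transform(X_raw)
print(X)
```

The `handle_unknown` setting maps categories unseen at fit time to -1 instead of raising, which matters once the encoder is applied to new data.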
In the dataset I need to analyse, I need to look at whether the effect of people's profession (3 categories) on their test scores (an effect I have already tested for and found) differs across levels of a second categorical variable (whether they work at home, in person, or a mixture), i.e. whether there is an interaction between profession and work mode. I'm struggling to wrap my head around how to do this in R...