Queries regarding feature importance for categorical features

Context: I have almost 185 categorical features, and these features have 1, 2, 3, 8, or sometimes 4 categories, as well as nulls. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance or on feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs. category) to analyse. What …
Category: Data Science

Model for predicting duration based on categorical data

I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these steps: …
Category: Data Science

An Unsupervised learning method suitable for large categorical data sets

I want to detect anomalies in a bank data set using an unsupervised learning method. However, in this data set, all columns except time and amount are categorical, and about half of them have more than 90 percent missing values. I'm currently using an autoencoder for this, but I wondered if this would work. Also, because the purpose is to detect whether data is abnormal when data comes …
Category: Data Science

faster alternatives to sparse.model.matrix?

I have a large dataset that is entirely categorical. I'm trying to train with it using xgboost, so I must first convert this categorical data to numerical. So far I've been using sparse.model.matrix() in the Matrix library, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same one that sparse.model.matrix returns. I know there is a way to force sparse.model.matrix to return identical output as the solution in …
Category: Data Science

Clustering algorithm for time series data with categorical dtypes

I have a large dataset with around 200 features, consisting mostly of time series and categorical data, with some continuous. The dataset is extracted from/by a postal service. Small example, random (scrambled) entries:

shipment    delivery    cost   location          weight_kg
2020-04-22  2020-04-23  77.31  UK:66c54f531....  0.5
2020-04-23  2020-04-25  44.14  DK:22c54f531....  2.23
2020-04-24  2020-04-27  53.84  UK:66c54f531....  1.57
2020-04-25  2020-04-26  22.09  UK:66c54f531....

My first inclination was to make a demand-forecast model on shipment/count_monthly(shipment), but considering the number of features, a multivariate case seemed more relevant. I …
Category: Data Science

Aggregating multiple encoded categorical values

I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables. I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much. Each observation in my dataset can take multiple values for the CATEGORY feature; for instance, row 1 could have the value a, but row 2 could have the values a, b, c, d …
Category: Data Science
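
One commonly used option for a multi-valued, high-cardinality feature like the CATEGORY column described above is the hashing trick: each value is hashed into one of a fixed number of buckets, so every row becomes a vector of the same length no matter how many values it holds. The sketch below is only an illustration under assumptions; the function name `hash_multi_valued` and the bucket count of 16 are hypothetical and not taken from the question.

```python
import hashlib


def hash_multi_valued(values, n_buckets=16):
    """Map a list of category strings to a fixed-length count vector.

    n_buckets is an illustrative choice; fewer buckets means more collisions.
    """
    vector = [0] * n_buckets
    for value in values:
        # md5 is used only as a stable hash (Python's built-in hash() is
        # randomized per process for strings).
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        vector[bucket] += 1
    return vector


# Row 1 has a single value, row 2 has several (as in the question).
row1 = hash_multi_valued(["a"])
row2 = hash_multi_valued(["a", "b", "c", "d"])
```

Both rows end up with 16 columns. The trade-off is that distinct categories can collide in the same bucket, which becomes more likely as the bucket count shrinks relative to the ~20,000 distinct values.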

XGBOOST with target column has categorical data and features also has categorical data

I have a huge dataset with categorical columns in the features, and my target variable is also categorical. None of the values are ordinal, so I think it is best to use one-hot encoding. But I have one issue: my target variable has 90 classes, so if I do one-hot encoding there will be 90 target columns and it will become too complex. But as none of the values are ordinal, can I …
Category: Data Science
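
For the target column specifically, XGBoost's multi-class objectives take a single column of integer class ids in the range 0 to num_class - 1, so the target does not have to be expanded into one column per class. A minimal sketch of building such ids follows; the helper name `encode_target` and the example labels are hypothetical and stand in for the 90 real classes.

```python
def encode_target(labels):
    """Map arbitrary class labels to integer ids 0..n_classes-1.

    The ids are identifiers only; no ordering between classes is implied.
    """
    classes = sorted(set(labels))
    class_to_id = {label: idx for idx, label in enumerate(classes)}
    encoded = [class_to_id[label] for label in labels]
    return encoded, classes


# Hypothetical labels for illustration.
y_raw = ["cat", "dog", "bird", "dog"]
y, classes = encode_target(y_raw)

# classes[i] recovers the original label for a predicted id i.
decoded = [classes[i] for i in y]
```

The `classes` list is kept so that predicted ids can be mapped back to the original labels.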

What are some good methods to forecast future revenue on categorical and value based data?

I have monthly snapshots (3 years) of all the contract data. It includes the following information: Contract status [Categorical]: proposed, tracked, submitted, won, lost, etc.; Contract stages [Categorical]: prospecting, engaged, tracking, submitted, etc.; Duration of contract [Date/Time]: months and years; Bid start date [Date/Time]: date (but this changes when the contracts are delayed); Contract value [Numerical]: value of the contract in local currency; Future revenue projection [Numerical]: currency value breakdown of revenue for next 5 years (this value is …
Category: Data Science

Anomaly detection using clustering of highly correlated Categorical data

My data has two columns, and both are highly correlated, e.g. if column1 has the value ABC, column2 should be XYZ, i.e. ABC-->XYZ. If column2 has anything else, it's an anomaly. Likewise, there are thousands of combinations. I already tried KModes clustering where the number of clusters = the number of unique values in column1. However, the clusters do not have equal density, hence some bad data with high density is classified as normal and good data with low density is marked anomalous. I want …
Category: Data Science
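
The ABC-->XYZ rule described above can also be checked without clustering, by counting which column2 value most often accompanies each column1 value and flagging rows that deviate. This is a sketch under a strong assumption that the question itself calls into doubt: it treats the majority pairing as the correct one, so it will not help where bad data dominates a given column1 value. The function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def flag_unexpected_pairs(rows):
    """Return rows whose column2 differs from the majority column2
    observed for the same column1 value."""
    counts = defaultdict(Counter)
    for left, right in rows:
        counts[left][right] += 1

    # Assumption: the most frequent partner of each column1 value is "normal".
    expected = {left: c.most_common(1)[0][0] for left, c in counts.items()}
    return [(left, right) for left, right in rows if expected[left] != right]


# Hypothetical sample data following the ABC-->XYZ pattern.
rows = [
    ("ABC", "XYZ"),
    ("ABC", "XYZ"),
    ("ABC", "QQQ"),
    ("DEF", "UVW"),
    ("DEF", "UVW"),
]
anomalies = flag_unexpected_pairs(rows)
```

Because the counts are kept per column1 value, a rare but consistent combination is not penalised for having low overall density.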

Dealing with multiple distinct-value categorical variables

So, I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of the categorical variables have very many distinct values. For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into multiple other columns using domain knowledge, i.e. splitting it into Network Class type, Host type and so on. However, wouldn't that make my dataset lose some …
Category: Data Science

Categorical data preprocessing for training an algorithm

I have a training dataset where the value of the "Output" column depends on three columns (which are categorical, with no ordering).

Inp1   Inp2      Inp3               Output
A,B,C  AI,UI,JI  Apple,Bat,Dog      Animals
L,M,N  LI,DO,LI  Lawn, Moon, Noon   Noun
X,Y,Z  LI,AI,UI  Xmas,Yemen,Zombie  Extras

So, based on this training data, I need an ML algorithm to predict the output for any incoming data row, such that the row is assigned the output of the training rows it is most similar to. The number of rows can keep increasing (hence get_dummies is creating a lot …
Category: Data Science

How do GANs learn category distributions

I'm currently getting more into the topic of GANs and generative models. I've understood how the Generator and Discriminator work together in optimization to generate synthetic samples. Now I'm wondering how the model learns to reflect the occurrence frequencies of the true entries in the case of categorical data. As an example, let's say we have two columns of entries (1, A), (1, A), (2, A), (2, B). The model, when trained, would not only try to output real combinations, so e.g. …
Category: Data Science
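
To make the frequencies in the example above concrete, the sketch below computes the empirical joint distribution of the four example rows. This is only the target that generated samples would have to match for the frequencies to be reflected; it is not a GAN and says nothing about how training reaches it. Variable names are illustrative.

```python
from collections import Counter

# The four example entries from the question.
samples = [(1, "A"), (1, "A"), (2, "A"), (2, "B")]

counts = Counter(samples)
total = len(samples)

# Empirical joint probability of each (column1, column2) combination.
joint = {pair: n / total for pair, n in counts.items()}

# Combinations that never occur in the data, such as (1, "B"),
# have probability zero under this empirical distribution.
p_unseen = joint.get((1, "B"), 0.0)
```

Here (1, A) accounts for half of the rows, while (2, A) and (2, B) account for a quarter each.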

Encode each comma separated value in Pandas

I have a dataset

Inp1   Inp2      Inp3              Output
A,B,C  AI,UI,JI  Apple,Bat,Dog     Animals
L,M,N  LI,DO,LI  Lawn, Moon, Noon  Noun
X,Y    AI,UI     Yemen,Zombie      Extras

For these values, I need to apply an ML algorithm, hence I need an encoding technique. I tried label encoding, but it encodes the entire cell to an int, e.g.

Inp1  Inp2  Inp3  Output
5     4     8     0

But I need a separate encoding for each value in a cell. How should I go about it? Inp1 Inp2 …
Category: Data Science
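
One way to get a separate code for each value in a cell, as asked above, is to split each cell on commas and map every distinct token to its own integer. The sketch below uses plain Python so it stands alone; the helper names `build_vocabulary` and `encode_cell` are hypothetical, and the same function could be applied to a DataFrame column with `Series.apply`.

```python
def build_vocabulary(cells):
    """Assign an integer id to every distinct comma-separated token,
    in order of first appearance."""
    vocab = {}
    for cell in cells:
        for token in cell.split(","):
            token = token.strip()  # tolerate "Lawn, Moon, Noon" style spacing
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode_cell(cell, vocab):
    """Encode one cell as a list of ids, one per value in the cell."""
    return [vocab[token.strip()] for token in cell.split(",")]


# The Inp1 column from the question.
inp1 = ["A,B,C", "L,M,N", "X,Y"]
vocab = build_vocabulary(inp1)
encoded = [encode_cell(cell, vocab) for cell in inp1]
```

Each cell becomes a list whose length equals the number of values it holds, so rows can differ in length. A model that needs fixed-width input would still require a further step, such as one indicator column per token.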

How to build multiple variable regression having a mix of numerical & categorical features?

There is a need to estimate Annual Average Daily Traffic Volume (AADT). We have a bunch of data about vehicles' speeds over several years. It has been noticed that AADT depends on the average number of such samples during some time, so a regression model $Y = f(x_1)$ could help in estimating the AADT. The problem is that there are other features affecting the dependency, which are both numerical $(x_2, \dots, x_k)$ and categorical $(c_1 = \text{data provider}, c_2 = \text{road class}, \dots, c_m)$. …
Category: Data Science

Model a classification problem with multiple categorical variables as input features only. Diff Model performance

I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0). But all the X are categorical variables (strictly nominal, not ordinal). Some have 8 levels, some have 20 levels. The data is highly imbalanced: 0.5% of y is 1. I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked up some papers and saw some examples using MCA, but since the input dimensions are small, I don't …
Category: Data Science

Which classification model to use on large, high-dimensional dataset?

I face a classification task: a target feature is to be predicted from several features. I'm working with Python. My dataset includes 60 features, from which I picked 16 which I think could be relevant (many others are time stamps, for example). The problem is that most of these 16 features are categorical; encoding them with get_dummies generates 886 features. The data also includes about 17 million observations. I am now wondering about how to tackle this problem and …
Category: Data Science

What kind of hypothesis testing in Python can be used to validate that 4 job titles are significantly different based on their skillset?

I have 4 job titles, for each of which I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False parameter indicating whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
Category: Data Science

Custom Encoding for Categorical Features - sklearn

Just wanted to check if there are any obvious flaws with a custom encoding idea I have - for categorical features used with RandomForestClassifier or any tree-based classifier. As all of you would know, sklearn can only handle numerically valued features, so categorical features must somehow be encoded to have numerical values. The most recommended encoding techniques on the web are - OneHotEncoding and OrdinalEncoding (and Label Encoding - but a lot of posts say this could make the model …
Category: Data Science

Whether effect of one categorical variable on a continuous variable depends on levels of another categorical variable

In the dataset I need to analyse, I need to look at whether the effect of people's profession (3 categories) on their scores on a test (I have already tested for this effect and found one) differs across levels of a second categorical variable (whether they work at home, in person, or a mixture). I'm struggling to wrap my head around how to do this in R...
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.