categorical-data

Queries regarding feature importance for categorical features

Pradip

June 2, 2022 09:08

Context: I have almost 185 categorical features, and these features have 1, 2, 3, or 8 categories (sometimes 4), plus nulls. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance or on feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs. category) to analyse. What …

Topic: feature-engineering feature-selection categorical-data

Category: Data Science

Model for predicting duration based on categorical data

Kadin

May 28, 2022 17:05

I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these steps: …

Topic: model-selection python predictive-modeling categorical-data

Category: Data Science

An unsupervised learning method suitable for large categorical data sets

HoonP

May 27, 2022 22:05

I want to detect anomalies in a bank data set using an unsupervised learning method. However, all columns in the data set except time and amount are categorical, and about half of them have more than 90 percent missing values. I'm currently using an autoencoder for this, but I wonder whether it will work. Also, because the purpose is to detect whether data is abnormal when data comes …

Topic: unsupervised-learning anomaly-detection categorical-data machine-learning

Category: Data Science

Faster alternatives to sparse.model.matrix?

Isaac T

May 26, 2022 08:01

I have a large dataset that is entirely categorical. I'm trying to train on it using xgboost, so I must first convert the categorical data to numerical. So far I've been using sparse.model.matrix() from the Matrix library, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same one that sparse.model.matrix returns. I know there is a way to force sparse.model.matrix to return the same output as the solution in …

Topic: representation r categorical-data

Category: Data Science

Clustering algorithm for time series data with categorical dtypes

Saurus

May 21, 2022 21:04

I have a large dataset with around 200 features, consisting mostly of time series and categorical data, with some continuous. The dataset is extracted from/by a postal service. Small example of random (scrambled) entries:

shipment    delivery    cost   location         weight_kg
2020-04-22  2020-04-23  77.31  UK:66c54f531....  0.5
2020-04-23  2020-04-25  44.14  DK:22c54f531....  2.23
2020-04-24  2020-04-27  53.84  UK:66c54f531....  1.57
2020-04-25  2020-04-26  22.09  UK:66c54f531....

My first inclination was to make a demand-forecast model on shipment/count_monthly(shipment), but considering the number of features, a multivariate case seemed more relevant. I …

Topic: time-series categorical-data clustering

Category: Data Science

Aggregating multiple encoded categorical values

Vishwa Kalyanaraman

May 20, 2022 05:05

I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables. I am currently using a dataset with a feature CATEGORY which has a cardinality of ~20,000. One-hot encoding does not make sense, as it would increase the feature space by too much. Each observation in my dataset can take multiple values for the CATEGORY feature; for instance, row 1 could have the value a but row 2 could have the values a, b, c, d …

Topic: feature-engineering encoding categorical-data machine-learning

Category: Data Science

XGBoost when both the target column and the features are categorical

Utkarsh Goyal

May 19, 2022 08:01

I have a huge dataset with categorical feature columns, and my target variable is also categorical. None of the values are ordinal, so I think it is best to use one-hot encoding. But I have one issue: my target variable has 90 classes, so if I one-hot encode it there will be 90 target columns and it will become too complex. But as none of the values are ordinal, can I …

Topic: one-hot-encoding xgboost categorical-data

Category: Data Science

What are some good methods to forecast future revenue on categorical and value based data?

RWS

May 16, 2022 15:00

I have monthly snapshots (3 years) of all the contract data. It includes the following information:
Contract status [Categorical]: Proposed, tracked, submitted, won, lost, etc.
Contract stages [Categorical]: Prospecting, engaged, tracking, submitted, etc.
Duration of contract [Date/Time]: months and years
Bid start date [Date/Time]: Date (but this changes when the contracts are delayed)
Contract value [Numerical]: Value of the contract in local currency
Future revenue projection [Numerical]: Currency value breakdown of revenue for next 5 years (this value is …

Topic: forecasting time-series feature-extraction categorical-data machine-learning

Category: Data Science

Anomaly detection using clustering of highly correlated Categorical data

viral kapadia

May 8, 2022 22:01

My data has two columns, and both are highly correlated, e.g. if column1 has value ABC, column2 should be XYZ, i.e. ABC-->XYZ. If column2 has anything else, it's an anomaly. Likewise, there are thousands of combinations. I already tried KModes clustering with the number of clusters equal to the number of unique values in column1. However, the clusters do not have equal density, so some bad data with high density is classified as normal and good data with low density is marked anomalous. I want …

Topic: anomaly-detection scikit-learn categorical-data clustering

Category: Data Science
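
The ABC-->XYZ rule described in the question above can also be checked without clustering. The following is a minimal sketch of a different, simpler technique: a per-value frequency lookup that treats the most common column2 value for each column1 value as the expected one. It assumes the correct pairing is the majority for every column1 value, which may not hold when bad data is dense; the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (column1, column2) rows; the values are illustrative only.
rows = [
    ("ABC", "XYZ"), ("ABC", "XYZ"), ("ABC", "XYZ"), ("ABC", "QQQ"),
    ("DEF", "UVW"), ("DEF", "UVW"),
]

# Count how often each column2 value appears for every column1 value.
counts = defaultdict(Counter)
for c1, c2 in rows:
    counts[c1][c2] += 1

# Assumption: the most frequent column2 value per column1 value is the expected one.
expected = {c1: ctr.most_common(1)[0][0] for c1, ctr in counts.items()}

# Flag rows whose column2 value differs from the expected value.
anomalies = [(c1, c2) for c1, c2 in rows if expected[c1] != c2]
print(anomalies)
```

Because each column1 value is judged only against its own counts, rare but consistent combinations are not penalised for having low density.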

Dealing with multiple distinct-value categorical variables

Abdullah Mohamed

May 7, 2022 21:02

So, I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of the categorical variables have a great many distinct values. For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into multiple other columns using domain knowledge, i.e. splitting it into Network Class type, Host type and so on. However, wouldn't that make my dataset lose some …

Topic: word-embeddings neural-network categorical-data machine-learning

Category: Data Science

Categorical data preprocessing for training an algorithm

spd

May 6, 2022 04:42

I have a training dataset where the values of the "Output" column depend on three columns (which are categorical, with no ordering).

Inp1   Inp2      Inp3                Output
A,B,C  AI,UI,JI  Apple,Bat,Dog       Animals
L,M,N  LI,DO,LI  Lawn, Moon, Noon    Noun
X,Y,Z  LI,AI,UI  Xmas,Yemen,Zombie   Extras

So, based on this training data, I need an ML algorithm to predict any incoming data row, such that the row is assigned the output of the training rows it is most similar to. The rows can go on increasing (hence get_dummies is creating a lot …

Topic: python-3.x prediction preprocessing categorical-data machine-learning

Category: Data Science

How do GANs learn category distributions

Guest

May 5, 2022 11:25

I'm currently getting more into the topic of GANs and generative models. I've understood how the generator and discriminator work together in optimization to generate synthetic samples. Now I'm wondering how the model learns to reflect the occurrence frequencies of the true entries in the case of categorical data. As an example, let's say we have two columns of entries (1, A), (1, A), (2, A), (2, B). The model, when trained, would not only try to output real combinations, so e.g. …

Topic: adversarial-ml generative-models gan categorical-data machine-learning

Category: Data Science

Encode each comma separated value in Pandas

spd

May 1, 2022 04:14

I have a dataset:

Inp1   Inp2      Inp3              Output
A,B,C  AI,UI,JI  Apple,Bat,Dog     Animals
L,M,N  LI,DO,LI  Lawn, Moon, Noon  Noun
X,Y    AI,UI     Yemen,Zombie      Extras

I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell as a single int, e.g.:

Inp1  Inp2  Inp3  Output
5     4     8     0

But I need a separate encoding for each value in a cell. How should I go about it?

Inp1 Inp2 …

Topic: categorical-encoding one-hot-encoding python-3.x pandas categorical-data

Category: Data Science
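
For the comma-separated cells in the question above, one possible reading of "a separate encoding for each value" is a multi-hot encoding with one indicator column per token. The sketch below assumes that reading (the desired output in the post is truncated, so this may not match it) and uses pandas' `Series.str.get_dummies`. Only `Inp1` and `Inp2` from the sample are shown, and the `Inp1_`/`Inp2_` prefixes are an illustrative choice.

```python
import pandas as pd

# Sample rows modelled on the question; only two input columns are shown.
df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
})

encoded_parts = []
for col in ["Inp1", "Inp2"]:
    # One 0/1 indicator column per distinct token in the comma-separated cell.
    dummies = df[col].str.get_dummies(sep=",")
    # Prefix with the source column so tokens from different columns stay distinct.
    encoded_parts.append(dummies.add_prefix(f"{col}_"))

# Combine the indicator columns from every input column side by side.
encoded = pd.concat(encoded_parts, axis=1)
print(encoded)
```

A repeated token inside one cell (such as `LI` in `LI,DO,LI`) still yields a single indicator set to 1, so this encoding records presence and not counts. Cells with spaces after the commas would need whitespace stripped first.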

How to build multiple variable regression having a mix of numerical & categorical features?

Артём Ощепков

April 26, 2022 22:02

There is a need to estimate Annual Average Daily Traffic Volume (AADT). We have a bunch of data about vehicles' speeds over several years. We have noticed that AADT depends on the average number of such samples during some time, so a regression model $Y = f(x_1)$ could help estimate the AADT. The problem is that there are other features affecting the dependency, which are both numerical $(x_2, \ldots, x_k)$ and categorical $(c_1 = data\ provider, c_2 = road\ class, \ldots, c_m)$. …

Topic: multivariate-distribution features regression categorical-data

Category: Data Science

How do I assign specific values to categorical variables

April 26, 2022 04:11

I have a Pandas data frame with columns within a survey with the following categorical values - "Increased, Decreased, Neutral". My question is how can I assign specific numerical values to these categorical values, namely +1 for Increased, -1 for Decreased and 0 for Neutral.

Topic: data visualization pandas python categorical-data

Category: Data Science
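
For the survey question above (+1 for Increased, -1 for Decreased, 0 for Neutral), one common approach is a dictionary lookup with `Series.map`. This is a minimal sketch; the column name `response` and the sample rows are assumptions, not taken from the post.

```python
import pandas as pd

# Hypothetical survey data; the column name "response" is an assumption.
df = pd.DataFrame({"response": ["Increased", "Decreased", "Neutral", "Increased"]})

# Explicit mapping from each category to the requested numerical value.
score_map = {"Increased": 1, "Decreased": -1, "Neutral": 0}

# Series.map looks each value up in the dict; values missing from the dict become NaN.
df["score"] = df["response"].map(score_map)

print(df)
```

To apply the same mapping to several survey columns, the `map` call can be repeated per column.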

Model a classification problem with multiple categorical variables as the only input features. Different model performance

Martin

April 24, 2022 21:47

I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0). But all the X are categorical variables (strictly nominal, not ordinal). Some have 8 levels, some have 20 levels. The data is highly imbalanced: 0.5% of y is 1. I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked up some papers and saw some examples using MCA, but since the input dimensions are small, I don't …

Topic: imbalanced-data machine-learning-model classification categorical-data

Category: Data Science

Which classification model to use on large, high-dimensional dataset?

Julian

April 20, 2022 16:04

I face a classification task: a target feature is to be predicted from several features. I'm working with Python. My dataset includes 60 features, from which I picked 16 that I think could be relevant (many others are time stamps, for example). The problem is that most of these 16 features are categorical, and encoding them with get_dummies generates 886 features. The data also includes about 17 million observations. I am now wondering about how to tackle this problem and …

Topic: classification python categorical-data

Category: Data Science

What kind of hypothesis testing in Python can be used to validate that 4 job titles are significantly different based on their skillset?

Justin Schmidt

April 18, 2022 12:41

I have 4 job titles, for each of which I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False flag indicating whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …

Topic: web-scraping hypothesis-testing scipy python categorical-data

Category: Data Science

Custom Encoding for Categorical Features - sklearn

ranger101

April 18, 2022 11:07

Just wanted to check if there are any obvious flaws with a custom encoding idea I have for categorical features used with RandomForestClassifier or any tree-based classifier. As you all know, sklearn can only handle numerical features, so categorical features must somehow be encoded as numerical values. The most recommended encoding techniques on the web are OneHotEncoding and OrdinalEncoding (and Label Encoding, but a lot of posts say this could make the model …

Topic: scikit-learn python categorical-data machine-learning

Category: Data Science

Whether effect of one categorical variable on a continuous variable depends on levels of another categorical variable

Luca Sorrentino

April 15, 2022 17:20

In the dataset I need to analyse, I need to look at whether the effect of people's profession (3 categories) on their scores on a test (I have already tested for this effect and found one) differs across levels of a second categorical variable (whether they work at home, in person, or a mixture). I'm struggling to wrap my head around how to do this in R...

Topic: r categorical-data

Category: Data Science
