XGBoost when both the target column and the features are categorical

I have a huge dataset with categorical columns among the features, and my target variable is also categorical. None of the values are ordinal, so I think it is best to use one-hot encoding. My issue is that the target variable has 90 classes, so one-hot encoding it would produce 90 target columns and make things too complex. But since none of the values are ordinal, can I …
Category: Data Science
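
Multiclass classifiers, XGBoost included, usually take the target as a single column of integer class ids rather than as one-hot columns, so 90 classes need not become 90 target columns. A minimal sketch of that mapping follows. The label values are hypothetical (the question's real classes are not shown), and scikit-learn's LabelEncoder does the same job:

```python
# Hypothetical class labels; the question's real 90 classes are not shown.
labels = ["cat", "dog", "bird", "dog", "cat"]

# Assign each distinct class an integer id (sorted for a stable mapping).
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}
id_to_class = {i: name for name, i in class_to_id.items()}

# Encode the target as one integer column.
y = [class_to_id[name] for name in labels]

# Decode predictions back to the original class names.
decoded = [id_to_class[i] for i in y]
```

The integers here are only identifiers for the classes; a classifier trained on them treats the target as categorical, not as an ordered quantity.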

Does one-hot encoding affect the chi-square test?

I am doing feature selection for a data science project, and one of the features is a high-cardinality categorical variable (for context, it's nationality). I know the chi-square test can handle a multiclass feature like mine, but I need to one-hot encode it (split the multiclass variable into multiple binary variables based on its values) to be able to feed it into my machine learning algorithm (Spark MLlib). My question is: does one-hot encoding affect the result of a …
Category: Data Science
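
One way to see the difference is to compute the Pearson chi-square statistic by hand. Testing the original k-category column against a c-class target is a single test on a k x c table with (k-1)(c-1) degrees of freedom, while testing each one-hot column is k separate 2 x c tests. The sketch below uses made-up counts for three hypothetical categories A, B, C against a binary target:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up counts: category -> [count with target=0, count with target=1]
counts = {"A": [30, 10], "B": [20, 20], "C": [10, 30]}

# One test on the full 3 x 2 table (the un-encoded feature).
full_stat = chi_square(list(counts.values()))

def one_vs_rest(category):
    """Statistic for the one-hot column 'is <category>' against the target."""
    inside = counts[category]
    outside = [sum(v[j] for k, v in counts.items() if k != category)
               for j in range(2)]
    return chi_square([inside, outside])

per_column = {c: one_vs_rest(c) for c in counts}
```

With these counts the full table gives 20.0, while the one-hot columns give 15.0, 0.0 and 15.0 for A, B and C, so the per-column tests answer a different question ("does this one category matter?") than the single test on the whole feature.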

Multiple Linear Regression on String Values

I'm using datasets that consist mostly of string values. The main goal of the project is to predict success. I can use OneHotEncoding to convert the string values to a numerical format, but there are a lot of distinct values. I'm using Multi Linear Regression, and the only numerical value is the output that my model is supposed to predict. Query 1: By using sklearn, when encoding the string values, should it not take the whole resources as …
Category: Data Science

Encode each comma separated value in Pandas

I have a dataset:

Inp1    Inp2       Inp3               Output
A,B,C   AI,UI,JI   Apple,Bat,Dog      Animals
L,M,N   LI,DO,LI   Lawn, Moon, Noon   Noun
X,Y     AI,UI      Yemen,Zombie       Extras

I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell as a single int, e.g.:

Inp1   Inp2   Inp3   Output
5      4      8      0

But I need a separate encoding for each value in a cell. How should I go about it?

Inp1 Inp2 …
Category: Data Science
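
One common approach to cells holding several comma-separated values is a multi-hot encoding: split each cell, build a vocabulary over all tokens in the column, and emit one 0/1 indicator per token. This is a sketch for the Inp1 column only, using the values from the question; scikit-learn's MultiLabelBinarizer offers the same behaviour:

```python
# Inp1 column from the question.
rows = ["A,B,C", "L,M,N", "X,Y"]

def split_cell(cell):
    """Split a comma-separated cell into clean tokens."""
    return [token.strip() for token in cell.split(",") if token.strip()]

# Vocabulary: every distinct token seen in the column, in sorted order.
vocabulary = sorted({token for cell in rows for token in split_cell(cell)})

def multi_hot(cell):
    """One 0/1 indicator per vocabulary token."""
    present = set(split_cell(cell))
    return [1 if token in present else 0 for token in vocabulary]

encoded = [multi_hot(cell) for cell in rows]
```

Each row becomes a vector with as many ones as the cell has distinct values, so "A,B,C" and "X,Y" no longer collapse to a single integer.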

How to use prediction model after onehot encoding?

I have created a prediction model for this dataset:

>>df.head()
  Service  Tasks  Difficulty     Hours
0     ABC     24           1  0.833333
1     CDE     77           1  1.750000
2     SDE     90           3  3.166667
3     QWE     47           1  1.083333
4     ASD     26           3  1.000000
>>df.shape
(998,4)
>>X = df.iloc[:,:-1]
>>y = df.iloc[:,-1].values
>>from sklearn.compose import ColumnTransformer
>>from sklearn.preprocessing import OneHotEncoder
>>ct = ColumnTransformer([("cat", OneHotEncoder(),[0])], remainder="passthrough")
>>X = ct.fit_transform(X)
>>x = X.toarray()
>>x = x[:,1:]
>>x.shape
(998,339)
>>from sklearn.ensemble import RandomForestRegressor
>>rf_model = RandomForestRegressor(random_state = 1)
>>rf_model.fit(x,y)

How can I …
Category: Data Science
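
The general idea is to encode new rows with the encoder that was fitted on the training data, so the columns line up with what the model saw. In the question's code that would mean calling ct.transform (not fit_transform) on the new rows and applying the same column slicing. The sketch below shows the principle in plain Python; SimpleOneHot and build_features are hypothetical helpers, and the new row's values are made up:

```python
class SimpleOneHot:
    """Tiny stand-in for a fitted one-hot encoder (illustration only)."""

    def fit(self, values):
        # Remember the categories seen at training time, in a fixed order.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # Unknown values become an all-zero row.
        return [[1 if v == c else 0 for c in self.categories_] for v in values]

# Service values from the question's df.head(); the full column is not shown.
train_services = ["ABC", "CDE", "SDE", "QWE", "ASD"]
encoder = SimpleOneHot().fit(train_services)

def build_features(service, tasks, difficulty):
    """Encode one new row with the same layout used for training."""
    return encoder.transform([service])[0] + [tasks, difficulty]

new_row = build_features("CDE", 50, 2)   # known service
unseen = build_features("ZZZ", 50, 2)    # service never seen in training
```

The resulting row has the same width and column order as the training matrix, which is what the model's predict call expects.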

Shall I use ordinal encoding or One-Hot-Encoding when using DBSCAN for content clustering on websites?

I want to group the preparation steps on cooking recipe websites into one cluster so I can distinguish them from the rest of the page. To achieve this, I extracted the DOM path (e.g. body->div->div->table->tr ....) for each text node of the website and one-hot encoded it before running the DBSCAN clustering algorithm. My hope was that DBSCAN would group not only 100% identical DOM paths into one common cluster, because sometimes one preparation step is e.g. in …
Category: Data Science

Encoding very large dataset to one-hot encoding matrix

I have a text corpus dataset with around 400 unique characters. The maximum row length is 3000. We have 20000 rows, so we would have a $20000\times3000\times400$ one-hot encoding matrix, which leads to a memory error because the size needed jumped to over 900 GB of RAM. There are dimensionality reduction techniques such as PCA and others, but other than that, what would you recommend in my case to overcome this issue? The text is …
Category: Data Science
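
A quick back-of-the-envelope calculation shows why the dense matrix is impractical and how much smaller an index representation is. The sketch takes the 20000 rows, length 3000 and 400 characters stated in the question; the dtypes are assumptions, and the actual footprint depends on which one is used:

```python
rows, max_len, vocab = 20_000, 3_000, 400

# Dense one-hot tensor: one entry per (row, position, character).
dense_entries = rows * max_len * vocab
dense_gb_uint8 = dense_entries * 1 / 1e9     # 1 byte per entry
dense_gb_float64 = dense_entries * 8 / 1e9   # 8 bytes per entry

# Index representation: one integer per (row, position).
# 400 distinct characters fit in a 2-byte unsigned integer.
index_entries = rows * max_len
index_mb_uint16 = index_entries * 2 / 1e6

def one_hot_batch(index_rows, vocab_size):
    """Expand a small batch of index rows to one-hot only when needed."""
    return [[[1 if i == idx else 0 for i in range(vocab_size)] for idx in row]
            for row in index_rows]
```

That is 24 billion entries for the dense tensor (24 GB at one byte each, 192 GB as float64) against 120 MB for the integer indices. Storing indices and expanding to one-hot per batch, or feeding the indices to an embedding layer, avoids materialising the full tensor.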

Trouble using OneHotEncoder on multiple categories

I'm trying to get the final pipeline on the titanic dataset (Example was taken from the 'Hands-on ML' book).

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_num))),
    ('imputer', SimpleImputer(strategy='median', fill_value='num', missing_values=np.nan)),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(list(df_cat))),
    ('imputer', SimpleImputer(strategy='most_frequent', fill_value='categorical', missing_values=np.nan)),
    ('cat_encoder', OneHotEncoder(sparse=False)),
])

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

df_prepared = full_pipeline.fit_transform(df)
df_prepared.shape
df_total = pd.DataFrame(df_prepared, columns=df.columns)
df_total

Where df_num = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] …
Category: Data Science

Will one hot encoding / unbalanced columns cause bias to Clustering Analysis?

I'm wondering if having too many columns about one certain feature is gonna cause bias to the clustering analysis. For example, if my dataset has columns = ['incoming calls', 'outgoing calls', 'missing calls', 'age'], and if I run clustering algorithms such as K-means or Mixture Model, will the clustering results be biased since it splits datasets mainly based on calls? Another example is if I have two categorical columns: color ('red','blue','green'), and shape ('circle','square'), after one hot encoding, color will …
Category: Data Science
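
For the color/shape example, it can help to look at what one-hot encoding does to Euclidean distance, which is what K-means uses. This small sketch uses the categories from the question and only illustrates the distance arithmetic; it is not a full answer to the bias question:

```python
import math

colors = ["red", "blue", "green"]
shapes = ["circle", "square"]

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def encode(color, shape):
    """Concatenate the one-hot vectors of both categorical columns."""
    return one_hot(color, colors) + one_hot(shape, shapes)

base = encode("red", "circle")

# Distance when only the color differs, only the shape differs, or both.
d_color = math.dist(base, encode("blue", "circle"))
d_shape = math.dist(base, encode("red", "square"))
d_both = math.dist(base, encode("blue", "square"))
```

In this toy case a mismatch in color and a mismatch in shape each contribute the same distance (the square root of 2), even though color occupies three columns and shape two. Several separate numeric columns measuring related things, such as the three call counts, are different: each adds its own term to the distance.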

One Hot Encoding where all sequences don't have all values

Is there a way (other than manually creating dictionaries) to one-hot encode sequences when not every value is present in each sequence? sklearn's OneHotEncoder and Keras's to_categorical only account for the values in the current sample so, for example, encoding DNA sequences of 'AT' and 'CG' would both be [[1, 0], [0, 1]]. However, I want A, T, C, and G to be accounted for in all sequences so 'AT' should be [[1, 0, 0, 0], [0, …
Category: Data Science
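
The usual fix is to define the alphabet up front instead of inferring it from each sequence. Below is a minimal sketch with the alphabet order A, T, C, G assumed from the question's example; scikit-learn's OneHotEncoder accepts a categories argument for the same purpose:

```python
# Fixed alphabet, declared once and shared by every sequence.
ALPHABET = "ATCG"
INDEX = {base: i for i, base in enumerate(ALPHABET)}

def one_hot_sequence(sequence):
    """Encode each base as a row with a single 1 at its alphabet position."""
    encoded = []
    for base in sequence:
        row = [0] * len(ALPHABET)
        row[INDEX[base]] = 1
        encoded.append(row)
    return encoded

at = one_hot_sequence("AT")
cg = one_hot_sequence("CG")
```

Because the column positions come from the fixed alphabet, 'AT' and 'CG' now get different encodings of the same width.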

How should I OneHotEncode a column of (8128 rows and) 2058 nuniques?

The title, pretty much. I just want to know the best and most efficient way to OneHotEncode a column with like 2058 nuniques. Doing a fit_transform of said column, I know I will get an array of 2058 (minus 1 when you drop first) columns. Is it the right approach? Apart from that, I have another column that has about 441 nuniques, so that's another headache I need to take care of. I know for a fact that the first …
Category: Data Science

One hot encoding with Keras

I have a large dataset (500k rows) where one column contains the weekday of the purchase. For now, it is in the 0-6 format (Mon-Sun) and I think I should one-hot encode it before training my sequential NN in Keras. It is not clear if I should do this in the dataset (transform the one column into 7 columns, manually or is there a Pandas function?) or I can just have that one column containing an array of length 7. …
Category: Data Science

Inconsistency between the numbers of X and y samples when splitting into train and test sets

I am new to the field of data science and need help with the following: I am working on a data set that consists of both categorical and numerical values. First I concatenated the two files (train and test) to apply the EDA steps, then I did the EDA steps on the combined data set, applied one-hot encoding, and split the data. I am getting the following message; it seems that there is an inconsistency between …
Category: Data Science

How to handle categorical variables with Random Forest using Scikit Learn?

One of the variables/features is the department id, which is like 1001, 1002, ..., 1218, etc. The ids are nominal, not ordinal, i.e., they are just ids; department 1002 is by no means higher than department 1001. I feed the feature to a random forest using Scikit Learn. How should I deal with it? Some people say to use one-hot encoding. However, others say that one-hot encoding degrades random forest's performance. Also, I do have over 200 departments, so I …
Category: Data Science

One-hot & interaction one-hot on multiple categorical

I was wondering if there is any value to creating combined features out of multiple categorical variables when the individual categorical variables are already one-hot encoded? Simple example: there is a variable P with categories {X, Y} and a variable Q with categories {Z, W}. After one-hot, we would have 4 variables: P.X, P.Y, Q.Z, and Q.W. In this scenario, I'm wondering if the algorithm (Xgboost or a deep neural network) would sufficiently learn interaction effects between these or is …
Category: Data Science
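
For reference, this is what the combined (interaction) features look like next to the individual one-hot columns, using the P and Q categories from the question. It is only a sketch of the construction and says nothing about whether a given model needs it; the `&` naming of crossed categories is an arbitrary choice:

```python
from itertools import product

P = ["X", "Y"]
Q = ["Z", "W"]

# Every (P, Q) combination becomes its own category.
cross_categories = [f"{p}&{q}" for p, q in product(P, Q)]

def encode(p, q):
    """Main-effect one-hot columns followed by interaction columns."""
    main = [int(p == c) for c in P] + [int(q == c) for c in Q]
    cross = [int(f"{p}&{q}" == c) for c in cross_categories]
    return main + cross

row = encode("X", "W")
```

The four main columns (P.X, P.Y, Q.Z, Q.W) are followed by four interaction columns, exactly one of which is set for each row.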

Anomaly detection for varying dictionary

I want to detect the anomaly in the processes taking up the most CPU percent. I receive the data as a time series of dictionary values like so:

   time                        process_most_cpu                                                                 cpu%
0  2022-02-22 21:04:57.021740  {'chromium-browse': 38.70,'python': 32.00,'mutter': 2.90,'python3': 1.60}  26.10
1  2022-02-22 21:05:32.836466  {'chromium-browse': 25.70,'mutter': 2.90,'python3': 1.60}  34.50
2  2022-02-22 21:05:55.558390  {'chromium-browse': 21.70,'python': 5.80,'mutter': 2.90,'python3': 1.50}  5.70
3  2022-02-22 21:07:01.069036  {'pip': 37.90,'chromium-browse': 19.30,'mutter': 2.90,'python3': 1.50}  11.70

I'm not sure how to detect the anomaly here as the processes keep on …
Category: Data Science
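
A common first step with dictionaries whose keys change over time is to align them to one fixed set of columns, so every sample becomes a vector of the same length that an anomaly detector can consume. The sketch uses three of the samples from the question and assumes that a process missing from a dictionary can be treated as 0.0 CPU, which may not hold if the dictionary only lists the top consumers:

```python
samples = [
    {"chromium-browse": 38.70, "python": 32.00, "mutter": 2.90, "python3": 1.60},
    {"chromium-browse": 25.70, "mutter": 2.90, "python3": 1.60},
    {"pip": 37.90, "chromium-browse": 19.30, "mutter": 2.90, "python3": 1.50},
]

# Union of all process names seen so far, in a stable order.
columns = sorted({name for sample in samples for name in sample})

# One fixed-width row per sample; missing processes are filled with 0.0
# (an assumption, see the note above).
matrix = [[sample.get(name, 0.0) for name in columns] for sample in samples]
```

New process names that appear later would extend the column set, so the matrix has to be rebuilt or the unknown names bucketed into a catch-all column.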
