How do I handle string feature while performing model generation

Question

nlper

January 8, 2022 12:28

I have data which looks like this:

shift_id  user_id  status  organization_id  location_id  department_id  open_positions  city      zip    role_id  specialty_id  latitude  longitude  years_of_experience
2         9        S       1                1            19             1               brooklyn  48001  2        9             42.643    -82.583
6         60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
9         61       S       12               19           20             1               new york  48001  1        7             42.643    -82.583
10        60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
21        3        S       1                1            19             1               pune      48001  1        2             46.753    -89.584    0
4         7        S       1                1            19             1               needham   2494   4        4             42.292    -71.246    2

It contains both string and numerical features.

I first want to perform feature elimination and then fit an SVM on it.

Here is my code to do it:

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

dataset = pd.read_csv("prod_data_for_ML.csv", header=0)

# Data pre-processing
data = dataset.drop('organization_id', axis=1)
#data = data.drop('status', axis=1)
#data = data.drop('city', axis=1)

# Fill NaN with the column median for these features
for col in ['zip', 'role_id', 'specialty_id', 'latitude', 'longitude']:
    data[col] = data[col].fillna(data[col].median())

# Fill years_of_experience with 0
data['years_of_experience'] = data['years_of_experience'].fillna(0)
target = dataset.location_id

# Perform recursive feature elimination
svm = SVR(kernel="linear")
rfe = RFE(svm, n_features_to_select=5, step=1)
rfe = rfe.fit(data, target)  # fails here: 'status' and 'city' hold strings
print(rfe.n_features_)
print(rfe.support_)

But because the status and city columns contain string values, it raises:

ValueError: could not convert string to float: 'S'

Having string features like these is common. What is the standard practice for handling this kind of scenario?

Topic descriptive-statistics scikit-learn pandas python

Category Data Science

Pluviophile · Accepted Answer · January 8, 2022 12:28

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model.

The performance of a machine learning model depends not only on the model and its hyperparameters but also on how we process the different types of variables and feed them to the model. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a necessary step. We need to convert these categorical variables to numbers so that the model can understand them and extract useful information.

Usually there are 2 kinds of categorical data:

Ordinal data: the categories have an inherent order. In the Job Change dataset, examples are: ['education_level', 'experience', 'company_size', 'last_new_job']
Nominal data: the categories do not have an inherent order. In the Job Change dataset, examples are: ['city', 'gender', 'enrolled_university', 'major_discipline', 'company_type', 'relevent_experience'] (binary data could be nominal or ordinal)

Generally:

When encoding ordinal data, one should retain the information about the order of the categories.

While encoding Nominal data, we have to consider the presence or absence of a feature. In such a case, no notion of order is present.

Types of categorical encoding techniques:

Backward Difference Coding
BaseN
Binary
CatBoost Encoder
Count Encoder
Generalized Linear Mixed Model Encoder
Hashing
Helmert Coding
James-Stein Encoder
Leave One Out
M-estimate
One Hot
Ordinal
Polynomial Coding
Sum Coding
Target Encoder
Weight of Evidence
Wrappers
Quantile Encoder
Summary Encoder

More details on these encoding techniques can be found in the category_encoders documentation.
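To give a feel for what two of the listed techniques do, here is a rough sketch of count encoding and target (mean) encoding written in plain pandas. It does not use the category_encoders API, and the library's own encoders may differ in details such as smoothing. The toy data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "city": ["pune", "brooklyn", "pune", "needham", "pune", "brooklyn"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Count encoding: replace each category with how often it occurs.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

# Target (mean) encoding: replace each category with the mean target value
# observed for it. Computed on the full frame here for brevity; in practice
# fit it on the training data only to avoid leaking the target.
means = df.groupby("city")["target"].mean()
df["city_target_mean"] = df["city"].map(means)

print(df)
```

Both encodings produce a single numeric column per feature, which can help when a column such as city has many distinct values and one-hot encoding would create too many columns.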

Useful Links

A Kaggle notebook - 11 Categorical Encoders and Benchmark on using the encoders
Github Link on CategoricalEncodingBenchmark
Categorical Encoding, feature-engineering-for-machine-learning with detailed explanation
CODING SYSTEMS FOR CATEGORICAL VARIABLES

Tasos · Accepted Answer · February 15, 2019 08:45

What you need is called one-hot encoding. There are two ways to do it: use OneHotEncoder from scikit-learn, as described in the scikit-learn documentation, or use get_dummies from pandas.

Example 1:

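import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One encoder per string column
status_encoder = OneHotEncoder()
city_encoder = OneHotEncoder()
X = status_encoder.fit_transform(df.status.values.reshape(-1, 1)).toarray()
Xm = city_encoder.fit_transform(df.city.values.reshape(-1, 1)).toarray()

# Append the status indicator columns
dfOneHot = pd.DataFrame(X, columns=["Status_" + str(i) for i in range(X.shape[1])], index=df.index)
df = pd.concat([df, dfOneHot], axis=1)

# Append the city indicator columns (sized by Xm, not X)
dfOneHot = pd.DataFrame(Xm, columns=["City_" + str(i) for i in range(Xm.shape[1])], index=df.index)
df = pd.concat([df, dfOneHot], axis=1)

# Drop the original string columns before fitting a model
df = df.drop(['status', 'city'], axis=1)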

Example 2:

# get_dummies replaces the listed columns with indicator columns and keeps
# all other columns, so no separate drop or join is needed
df = pd.get_dummies(data=df, columns=['status', 'city'])
