How do I handle string features while building a model?

I have data which looks like this

shift_id  user_id  status  organization_id  location_id  department_id  open_positions  city      zip    role_id  specialty_id  latitude  longitude  years_of_experience
2         9        S       1                1            19             1               brooklyn  48001  2        9             42.643    -82.583
6         60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
9         61       S       12               19           20             1               new york  48001  1        7             42.643    -82.583
10        60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
21        3        S       1                1            19             1               pune      48001  1        2             46.753    -89.584    0
4         7        S       1                1            19             1               needham   2494   4        4             42.292    -71.246    2

So it contains string features as well as numerical ones.

I first want to perform feature elimination and then fit an SVM on it.

Here is my code to do it.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

dataset = pd.read_csv("prod_data_for_ML.csv", header=0)

# Data pre-processing
data = dataset.drop('organization_id', axis=1)
#data = data.drop('status', axis=1)
#data = data.drop('city', axis=1)

# Fill NaN in these features with the column median
for col in ['zip', 'role_id', 'specialty_id', 'latitude', 'longitude']:
    data[col] = data[col].fillna(data[col].median())

# Fill years_of_experience with 0
data['years_of_experience'] = data['years_of_experience'].fillna(0)
target = dataset.location_id

# Perform recursive feature elimination
svm = SVR(kernel="linear")
rfe = RFE(svm, n_features_to_select=5, step=1)
rfe = rfe.fit(data, target)
print(rfe.n_features_)
print(rfe.support_)


But because the status and city columns contain string values, it raises:

ValueError: could not convert string to float: 'S'

String features like these are common. What is the standard practice for handling this kind of scenario?



Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model.

The performance of a machine learning model depends not only on the model and its hyperparameters but also on how we process and feed different types of variables to it. Since most machine learning models only accept numerical variables, preprocessing the categorical variables is a necessary step. We need to convert these categorical variables to numbers in a way that lets the model extract the information they carry.

Usually there are 2 kinds of categorical data:

  • Ordinal data: the categories have an inherent order. In the Job Change dataset, examples are: ['education_level', 'experience', 'company_size', 'last_new_job']

  • Nominal data: the categories have no inherent order. In the Job Change dataset, examples are: ['city', 'gender', 'enrolled_university', 'major_discipline', 'company_type', 'relevent_experience'] (binary data can be either nominal or ordinal)

Generally:

When encoding ordinal data, the encoding should retain the order of the categories.

When encoding nominal data, only the presence or absence of each category matters, so the encoding should not imply any order.
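
As a minimal sketch of that distinction, the snippet below encodes one ordinal and one nominal column with scikit-learn. The toy rows, the category labels and their order are assumptions made up for illustration, not values taken from the Job Change dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: 'education_level' is ordinal, 'city' is nominal.
df = pd.DataFrame({
    "education_level": ["Graduate", "High School", "Masters", "Graduate"],
    "city": ["pune", "brooklyn", "pune", "needham"],
})

# Ordinal: pass the category order explicitly so the integer codes respect it.
education_order = [["High School", "Graduate", "Masters"]]
ordinal_encoder = OrdinalEncoder(categories=education_order)
education_codes = ordinal_encoder.fit_transform(df[["education_level"]])

# Nominal: one indicator column per category, so no order is implied.
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
city_indicators = onehot_encoder.fit_transform(df[["city"]]).toarray()

print(education_codes.ravel())        # one integer code per row
print(onehot_encoder.categories_[0])  # column order of the indicators
print(city_indicators)
```

Without the explicit categories argument, OrdinalEncoder would order the labels alphabetically, which usually does not match the real ranking.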


Types of categorical encoding techniques:

  • Backward Difference Coding
  • BaseN
  • Binary
  • CatBoost Encoder
  • Count Encoder
  • Generalized Linear Mixed Model Encoder
  • Hashing
  • Helmert Coding
  • James-Stein Encoder
  • Leave One Out
  • M-estimate
  • One Hot
  • Ordinal
  • Polynomial Coding
  • Sum Coding
  • Target Encoder
  • Weight of Evidence
  • Wrappers
  • Quantile Encoder
  • Summary Encoder

More details on these encoding techniques can be found in the category_encoders documentation
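
To give a feel for two entries in that list, here is a rough pandas-only sketch of the idea behind count encoding and target (mean) encoding. It is not the category_encoders implementation (which, for example, applies smoothing in its target encoder); the toy data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: a nominal 'city' column and a numeric target.
df = pd.DataFrame({
    "city": ["pune", "test", "pune", "brooklyn", "test", "pune"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Count encoding: replace each category with how often it occurs.
city_counts = df["city"].map(df["city"].value_counts())

# Target (mean) encoding: replace each category with the mean target
# observed for that category. In practice the means should be computed
# on the training split only, to avoid leaking the target into the features.
city_target_mean = df.groupby("city")["target"].transform("mean")

encoded = df.assign(city_count=city_counts, city_target_mean=city_target_mean)
print(encoded)
```

Both encodings produce a single numeric column per feature, which can be preferable to one-hot encoding when a column such as city has many distinct values.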


What you need is called one-hot encoding. There are two ways to do it: use OneHotEncoder from scikit-learn, as described in the scikit-learn documentation, or use get_dummies from pandas.

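Example 1:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

status_encoder = OneHotEncoder()
city_encoder = OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the reshape
X = status_encoder.fit_transform(df.status.values.reshape(-1, 1)).toarray()
Xm = city_encoder.fit_transform(df.city.values.reshape(-1, 1)).toarray()

# Append the status indicator columns
dfOneHot = pd.DataFrame(X, columns=["Status_" + str(i) for i in range(X.shape[1])])
df = pd.concat([df, dfOneHot], axis=1)

# Append the city indicator columns (sized by Xm, not X)
dfOneHot = pd.DataFrame(Xm, columns=["City_" + str(i) for i in range(Xm.shape[1])])
df = pd.concat([df, dfOneHot], axis=1)

# Drop the original string columns so only numeric features remain
df = df.drop(['status', 'city'], axis=1)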

Example 2:

# get_dummies returns the whole frame with 'status' and 'city' already
# replaced by their indicator columns, so no drop or join is needed
df = pd.get_dummies(data=df, columns=['status', 'city'])
