How do I handle string features when building a model?
I have data which looks like this:
shift_id  user_id  status  organization_id  location_id  department_id  open_positions  city      zip    role_id  specialty_id  latitude  longitude  years_of_experience
2         9        S       1                1            19             1               brooklyn  48001  2        9             42.643    -82.583
6         60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
9         61       S       12               19           20             1               new york  48001  1        7             42.643    -82.583
10        60       S       12               19           20             1               test      68410  3        7             40.608    -95.856
21        3        S       1                1            19             1               pune      48001  1        2             46.753    -89.584    0
4         7        S       1                1            19             1               needham   2494   4        4             42.292    -71.246    2
So it contains both string and numerical features.
I want to perform feature elimination first and then fit an SVM on the result.
Here is my code:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

dataset = pd.read_csv("prod_data_for_ML.csv", header=0)

# Data pre-processing
data = dataset.drop('organization_id', axis=1)
#data = data.drop('status', axis=1)
#data = data.drop('city', axis=1)

# Fill NaN in these features with the column median
median_cols = ['zip', 'role_id', 'specialty_id', 'latitude', 'longitude']
data[median_cols] = data[median_cols].fillna(data[median_cols].median())

# Fill years_of_experience with 0
data['years_of_experience'] = data['years_of_experience'].fillna(0)

target = dataset.location_id

# Perform recursive feature elimination
svm = SVR(kernel="linear")
rfe = RFE(svm, n_features_to_select=5, step=1)
rfe = rfe.fit(data, target)
print(rfe.n_features_)
print(rfe.support_)
But since the columns status and city contain string values, it fails with:
ValueError: could not convert string to float: 'S'
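The message itself can be reproduced without any library. The snippet below is only a simplified illustration, on the assumption that the estimator has to turn every feature value into a float before fitting; the `to_floats` helper is hypothetical and is not part of scikit-learn.

```python
# Simplified illustration (an assumption about what happens internally):
# the estimator needs every feature value as a float.
# These are the first four values of the first data row; "S" is the status.
row = ["2", "9", "S", "1"]


def to_floats(values):
    """Hypothetical helper: convert each value to float, as a numeric model requires."""
    return [float(v) for v in values]


try:
    to_floats(row)
except ValueError as err:
    # The conversion stops at the first value that is not a number.
    message = str(err)
    print(message)
```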
Having string features like these is common. What is the standard practice for handling them?
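For reference, one commonly used option is one-hot encoding, where each distinct string value becomes its own 0/1 column. The following is a minimal sketch using `pd.get_dummies` on a small made-up frame that borrows a few columns and values from the sample above; whether this is the right encoding for the full dataset is an assumption, not something tested here.

```python
import pandas as pd

# Small hypothetical frame with a few of the columns from the sample data.
data = pd.DataFrame({
    "user_id": [9, 60, 61],
    "status": ["S", "S", "S"],
    "city": ["brooklyn", "test", "new york"],
    "zip": [48001, 68410, 48001],
})

# One-hot encode the string columns: each distinct value becomes a 0/1 column,
# so the resulting frame contains only numeric data.
encoded = pd.get_dummies(data, columns=["status", "city"], dtype=int)

print(encoded.columns.tolist())
```

In the sample shown, status is always S, so its encoded column is constant and dropping status altogether would lose nothing. A column like city with many distinct values would produce one new column per value, which may matter for the feature elimination step.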