How do I get the mean values that are greater than .5 for my model?

I am trying to build a classification model. One of the variables, specialty, has 200 values. Based on a previous post I saw, I decided to include only the values that have the highest mean, and I am thinking of a cutoff of greater than 0.5. How would I filter specialty to keep only the values whose mean is greater than 0.5? I am trying to get my final dataset ready for machine learning. Any advice is appreciated.

Topic categorical-encoding logistic-regression classification categorical-data

Category Data Science


So if I understand you correctly, you want to binarize your variable "specialty" (threshold it into a dummy variable) so that it goes from an interval-scaled variable to a binary variable, where 1 means > 0.5 and 0 means <= 0.5, correct?

Since you are working in Python, the following code creates a new variable that does that:

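import pandas as pd
import numpy as np

# df2 is your existing DataFrame; 'specialty' must be numeric for this to work.
# With right=True, values <= 0.5 map to 0 and values > 0.5 map to 1.
df2['specialty_binned'] = np.digitize(df2['specialty'], bins=[0.5], right=True)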

This creates a new column in your data frame called 'specialty_binned' that contains only 1s and 0s, with 1 marking values strictly above 0.5 in the original variable.
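If specialty is actually a categorical column (the question mentions 200 values) and the goal is to keep the specialties whose mean outcome is above 0.5, the thresholding above does not apply directly. Here is a rough sketch of that other reading. The column name `target` (a 0/1 outcome), the toy data, and the "other" label are all assumptions for illustration, not something from the original post:

```python
import pandas as pd

# Toy data; `target` is a hypothetical 0/1 outcome column (an assumption).
df = pd.DataFrame({
    "specialty": ["cardiology", "cardiology", "dermatology",
                  "dermatology", "oncology", "oncology"],
    "target":    [1, 1, 0, 1, 0, 0],
})

# Mean of the outcome within each specialty.
specialty_means = df.groupby("specialty")["target"].mean()

# Specialties whose mean is strictly greater than 0.5.
keep = specialty_means[specialty_means > 0.5].index

# Option 1: keep only the rows that belong to those specialties.
df_filtered = df[df["specialty"].isin(keep)]

# Option 2: keep every row, but lump all other specialties into "other".
df["specialty_reduced"] = df["specialty"].where(df["specialty"].isin(keep), "other")
```

Option 1 throws rows away, while option 2 keeps the full dataset and only reduces the number of categories. Which one fits depends on how the model will be used. If you go this route, it is probably safer to compute the means on the training split only, so that the outcome does not leak into the features.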
