What is the correct representation of categorical variables like sex?

I am unsure how to represent categorical variables that take only two values, such as sex. I have checked several sources, but I could not find a solid reference. For example, the variable sex usually appears in this form:

id  sex
1   male
2   female
3   female
4   male

So I found that one can use dummy variables like this:

(https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/)

and also in this way:

(https://stattrek.com/multiple-regression/dummy-variables.aspx)

So which is the more appropriate way to handle this variable, for example in a classification system? I am inclined to go with dummy variables, but I would like some opinions on it.

Thanks

Topic dummy-variables feature-selection

Category Data Science


There are three encoding options you can use for a variable like sex (gender):

  1. One-hot encoding: each category is mapped to its own binary variable containing either 0 or 1. Widely used when features are nominal.
  2. Dummy encoding: similar to one-hot encoding, but whereas one-hot encoding uses N binary variables for N categories, dummy encoding uses N-1 features to represent N categories. The omitted category acts as the reference.
  3. Effect encoding: also known as deviation encoding or sum encoding. Similar to dummy encoding, except that three values are used (1, 0, -1), with the reference category coded as -1 in every column.
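To make the difference between the three options concrete, here is a minimal sketch in plain Python applied to the four-row sex column from the question. The helper names (`one_hot`, `dummy`, `effect`) and the choice of "female" as the reference category are assumptions made for illustration; libraries such as pandas or scikit-learn provide their own encoders, which are not shown here.

```python
def one_hot(values, categories):
    """N columns for N categories: 1 where the row matches the category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def dummy(values, categories):
    """N-1 columns: the first category is the reference and becomes all zeros."""
    return [[1 if v == c else 0 for c in categories[1:]] for v in values]


def effect(values, categories):
    """N-1 columns like dummy coding, but the reference row is all -1."""
    reference = categories[0]
    width = len(categories) - 1
    return [
        [-1] * width if v == reference
        else [1 if v == c else 0 for c in categories[1:]]
        for v in values
    ]


sex = ["male", "female", "female", "male"]
categories = ["female", "male"]  # assumed ordering; "female" is the reference

one_hot_rows = one_hot(sex, categories)  # two columns: [is_female, is_male]
dummy_rows = dummy(sex, categories)      # one column: is_male
effect_rows = effect(sex, categories)    # one column: 1 for male, -1 for female

for row in zip(sex, one_hot_rows, dummy_rows, effect_rows):
    print(row)
```

For a two-valued variable, dummy and effect encoding both reduce to a single column; they differ only in whether the reference category is written as 0 or as -1.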

Do look into Encoding Categorical Variables

Do note that gender identity is not always binary (0 or 1).

There are many different gender identities, including male, female, transgender, gender neutral, non-binary, agender, pangender, genderqueer, two-spirit, third gender, and all, none or a combination of these.


This case can be simplified with a single boolean feature because the original variable sex is binary: it can only have values male or female.

This implies that the two values are complements of each other, so there is no need to keep both: a single indicator $X_1$ contains exactly as much information as the pair sex_male and sex_female.
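The redundancy can be checked directly. The following is a small sketch in plain Python, assuming the same four rows as in the question and using the column names sex_male and sex_female from above; it shows that either column can be rebuilt from the other.

```python
sex = ["male", "female", "female", "male"]

# Two one-hot columns for a two-valued variable.
sex_male = [1 if s == "male" else 0 for s in sex]
sex_female = [1 if s == "female" else 0 for s in sex]

# Every row has exactly one of the two flags set.
rows_sum_to_one = all(m + f == 1 for m, f in zip(sex_male, sex_female))

# So one column is enough: the other is just its complement.
recovered_female = [1 - m for m in sex_male]

print(rows_sum_to_one)                 # True
print(recovered_female == sex_female)  # True
```

Because sex_female is fully determined by sex_male, keeping both adds a perfectly collinear column, which is why a single boolean feature is sufficient here.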

Note that this simplification no longer works once the categorical variable can take more than two values.

Side note: sex is not always treated as a binary variable anymore; many surveys offer a third option such as "doesn't identify as binary".
