Is converting a categorical value into numerical needed to find a correlation?

I have a small dataset of 1300 observations x 20 features. They are all numerical but one, which is categorical; this was calculated independently and relates to each observation in any case.

I'm now attempting to find the correlation of each feature in my dataset, but a simple dataframe.corr() omits the categorical column from the calculation.

I have two choices as far as I can see:

  • Ignore the categorical feature, but then I cannot say whether the internal processes from which that feature derives its value are optimal or not
  • Convert it into a numerical feature

The categorical value looks like the school grading system:

 A: higher
 B: ...
 ...
 E: low

I don't think that converting it into a numerical feature would lose information about magnitude, so long as I know how that conversion has been made. But here's my crux.

  • Should I do something like A = 1 ... E = 5, or A = 5 ... E = 1?
  • Would the two different encodings affect the resulting correlation?

I've been seeing minimal differences. For instance, on the same dataset with the A starting at 1 I got Rating correlated to my Y variable at -0.33; when A starts at 5 it returns -0.32. What I noticed is the correlation varies, and goes positive the more I refine the dataset.

Also, bear in mind that I intend to use this dataset later for some linear regression, and to calculate the RMSE.

Any advice is welcome.

UPDATE:

I was able to play around with the dataset further, and forked it into two copies, replacing the Rating score in two ways:

  • With {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
  • With {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1}

The results are NOT what I would have expected (I expected opposite values), which means I am now more confused than before.

Dataset below for you to test:

    Index  Ranking  Rating Correlation  # Results  Label
    0      1         0.064138           840        PKW_A1
    1      3         0.087673           245        PKW_A1
    2      5        -0.028258           111        PKW_A1
    3      7         0.017542           117        PKW_A1
    4      9        -0.249403           77         PKW_A1
    5      11       -0.138552           51         PKW_A1
    6      13       -0.090198           41         PKW_A1
    7      15       -0.333333           18         PKW_A1
    8      17       -0.076830           17         PKW_A1
    9      19       -0.113594           24         PKW_A1
    10     1         0.027015           840        PKW_A5
    11     3         0.116202           245        PKW_A5
    12     5         0.134111           111        PKW_A5
    13     7         0.094221           117        PKW_A5
    14     9        -0.070592           77         PKW_A5
    15     11       -0.127137           51         PKW_A5
    16     13       -0.275387           41         PKW_A5
    17     15        0.092450           18         PKW_A5
    18     17        0.055994           17         PKW_A5
    19     19        0.081427           24         PKW_A5



Your categorical feature is an ordinal one, meaning that its levels carry an order. So you can convert it to a numerical feature; however, the distance between two consecutive levels is unknown, so whatever numbers you assign to that distance are a choice, not a measurement.

Example: High - Medium - Low could be encoded as 3 - 2 - 1, or 4 - 2 - 1, or 5 - 2 - 1. See? The order is preserved, but the difference between the levels is arbitrary.
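
To make the example concrete for the A to E grades in the question, here is a minimal sketch using pandas. The `grades` values and the three encoding dictionaries are made up for illustration (the real column is not shown in the question); the point is only that every order-preserving encoding yields the same ranks while the spacing differs.

```python
import pandas as pd

# Hypothetical grades; the real Rating column is not shown in the question.
grades = pd.Series(["A", "C", "B", "E", "D", "A"])

# Three illustrative encodings. All keep the order A > B > C > D > E,
# but each assumes a different gap between levels.
encodings = {
    "equal_steps": {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1},
    "wide_top": {"A": 10, "B": 4, "C": 3, "D": 2, "E": 1},
    "wide_bottom": {"A": 5, "B": 4, "C": 3, "D": 2, "E": -5},
}

# One numeric column per encoding.
encoded = pd.DataFrame({name: grades.map(m) for name, m in encodings.items()})

# Ranking removes the spacing and keeps only the order,
# so the three rank columns come out identical.
ranks = encoded.rank()

print(encoded)
print(ranks)
```

Which spacing is "right" cannot be read off the data; it is a modelling decision that should be stated alongside any result that depends on it.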

Also, you ought to know that the values you pick for your feature affect the correlation, precisely because you are free to pick any values you want for the categorical levels.
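
How much the choice matters depends on the kind of re-coding. The sketch below uses synthetic data (the `rating` and `y` series are generated here and are not the asker's dataset) to illustrate three facts: reversing equally spaced codes (1..5 versus 5..1) only flips the sign of Pearson's r, changing the spacing changes its magnitude, and a rank-based (Spearman) correlation is unaffected by spacing. Spearman is computed here as Pearson on ranks, which is its definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ordinal ratings, A (best) to E (worst). Illustration only.
levels = rng.integers(0, 5, size=200)
rating = pd.Series(["ABCDE"[i] for i in levels])

up = rating.map({"A": 1, "B": 2, "C": 3, "D": 4, "E": 5})
down = rating.map({"A": 5, "B": 4, "C": 3, "D": 2, "E": 1})
skewed = rating.map({"A": 10, "B": 4, "C": 3, "D": 2, "E": 1})

# Synthetic target, loosely related to the rating plus noise.
y = 0.5 * down + pd.Series(rng.normal(size=200))

# Pearson: a reversal flips the sign, unequal spacing changes the value.
pearson_up = up.corr(y)
pearson_down = down.corr(y)
pearson_skewed = skewed.corr(y)

# Spearman as Pearson on ranks: depends only on the order of the levels.
spearman_down = down.rank().corr(y.rank())
spearman_skewed = skewed.rank().corr(y.rank())

print(f"Pearson  A=1..E=5: {pearson_up:.4f}")
print(f"Pearson  A=5..E=1: {pearson_down:.4f}")
print(f"Pearson  skewed:   {pearson_skewed:.4f}")
print(f"Spearman A=5..E=1: {spearman_down:.4f}")
print(f"Spearman skewed:   {spearman_skewed:.4f}")
```

If the two mappings from the question give correlations that are not exact opposites, the difference is unlikely to come from the encoding alone. Something else probably differs between the two runs, such as the rows kept after refining or unmapped values turning into NaN. That is a guess, since the question does not show that code.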
