Encoding features like month and hour as categorical or numeric?

Is it better to encode features like month and hour as factor or numeric in a machine learning model?

On the one hand, numeric encoding seems reasonable, because time progresses forward (the fifth month is followed by the sixth). On the other hand, categorical encoding might be more reasonable because of the cyclic nature of years and days (the 12th month is followed by the first one).

Is there a general solution or convention for this?



To rephrase the answer provided by @raghu: one major difference between categorical and numerical features is whether the magnitudes of the numbers are comparable. Is 2019 bigger than 2018, or December (12) bigger than March (3)? Not really. While these numbers have a sequential order, their magnitudes are not meaningfully comparable. Thus, treating them as categorical values may make more sense.


The answer depends on the kind of relationship that you want to represent between the time feature and the target variable.

If you encode time as numeric, then you are imposing certain restrictions on the model. For a linear regression model, the effect of time becomes monotonic: the target will either increase or decrease with time. For decision trees, time values close to each other will be grouped together, because each split is a threshold on the numeric value.

Encoding time as categorical gives the model more flexibility, but in some cases, the model may not have enough data to learn well. One technique that may be useful is to group time values together into some number of sets, and use the set as a categorical attribute.

Some example groupings:

  • For month, group into quarters or seasons, depending on the use case, e.g. Jan-Mar, Apr-Jun, etc.
  • For hour-of-day, group into time-of-day buckets: morning, evening, etc.
  • For day-of-week, group into weekday and weekend.

Each of the above can also be used directly as a categorical attribute, given enough data. Further, groupings can be discovered through data analysis, to complement a domain-knowledge-based approach.
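The groupings listed above can be sketched in a few lines of plain Python. The function names and the exact bucket boundaries (for example, "morning" running from 6 to 12) are illustrative assumptions, not a convention; adjust them to the use case.

```python
def month_to_quarter(month: int) -> str:
    """Map a month number (1-12) to a quarter label such as 'Q1'."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"Q{(month - 1) // 3 + 1}"


def hour_to_part_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket; boundaries are assumed."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"  # covers 22-23 and 0-5


def day_to_week_part(day_of_week: int) -> str:
    """Map a day index (0 = Monday ... 6 = Sunday) to weekday/weekend."""
    if not 0 <= day_of_week <= 6:
        raise ValueError(f"day_of_week must be in 0..6, got {day_of_week}")
    return "weekend" if day_of_week >= 5 else "weekday"
```

The resulting labels can then be treated as an ordinary categorical attribute (for instance, one-hot encoded) with far fewer levels than the raw month or hour.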


I recommend using numerical features. Using categorical features essentially means that you don't consider distance between two categories as relevant (e.g. category 1 is as close to category 2 as it is to category 3). This is definitely not the case for hours or months.
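To make the point about distances concrete, here is a minimal sketch that assumes the categorical feature is one-hot encoded (one common choice; the helper names are hypothetical). Under that encoding, every pair of distinct months ends up the same Euclidean distance apart.

```python
import math


def one_hot(value: int, n_categories: int) -> list:
    """One-hot encode a 1-based category index."""
    vec = [0.0] * n_categories
    vec[value - 1] = 1.0
    return vec


def euclidean(a: list, b: list) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# January vs. February and January vs. July, with 12 month categories.
d_jan_feb = euclidean(one_hot(1, 12), one_hot(2, 12))
d_jan_jul = euclidean(one_hot(1, 12), one_hot(7, 12))

# Both distances equal sqrt(2): adjacency between months is lost.
print(d_jan_feb, d_jan_jul)
```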

However, the issue that you raise is that you want to represent hours and months in a manner where 12 is as close to 11 as it is to 1. In order to achieve that, I recommend going with what was suggested in the comments and using a sine/cosine function before using the hours/months as numerical features.
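A minimal sketch of the sine/cosine idea, using only the standard library: each value is mapped to a point on the unit circle, so values at the two ends of the cycle land next to each other. The function names `cyclical_encode` and `distance` are hypothetical helpers introduced for illustration.

```python
import math


def cyclical_encode(value: float, period: float) -> tuple:
    """Map a cyclic value to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Months use period 12, so December (12) sits between November and January.
dec, nov, jan = (cyclical_encode(m, 12) for m in (12, 11, 1))
d_dec_nov = distance(dec, nov)
d_dec_jan = distance(dec, jan)

# Hours use period 24, so hour 23 is close to hour 0 and far from hour 12.
d_23_0 = distance(cyclical_encode(23, 24), cyclical_encode(0, 24))
d_0_12 = distance(cyclical_encode(0, 24), cyclical_encode(12, 24))
```

Both components are needed as features: sine or cosine alone maps two different points in the cycle to the same value.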


Have you considered adding the (sine, cosine) transformation of the time-of-day variable? This ensures that, for example, hours 0 and 23 are close to each other, allowing the cyclical nature of the variable to show through.


It depends on which algorithm you're using.

If you're using tree-based algorithms like random forest, you can skip this question. Categorical encoding isn't necessary for tree-based algorithms.

For other algorithms, such as neural networks, I suggest trying both methods (continuous and categorical). The effect differs from one situation to another.


Because all the data you have is well defined, I would suggest categorical encoding, which is also easier to apply.
