How should I handle time-duration-based columns in classification?

For example, say I am trying to predict whether I will win my next pickleball game. Some features I have are the number of hits, how much water I’ve drunk, etc., and the duration of the match.

I’m asking specifically about ensemble models, but the question extends to other scenarios: what format would the duration column best be in (e.g. milliseconds, seconds, minutes as an integer, minutes as a float, one column for minutes and one column for seconds, etc.)?

Topic feature-engineering ensemble-modeling classification feature-extraction

Category Data Science


Ensemble techniques (bagging and boosting, e.g. of trees) make decisions by combining many weak learners, and each learner makes its decision by splitting on the values in the feature columns. This is basically how decision trees are built to classify.
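To make the splitting idea concrete, here is a minimal sketch of a single weak learner (a one-split "stump") choosing a threshold on a duration column. The helper names `gini` and `best_split` and the data are made up for illustration; real libraries do this far more efficiently. It also runs the same search on the durations expressed in seconds, minutes, and milliseconds to show how the unit affects the chosen split in this toy setup.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, left_indices) for the split with the lowest weighted Gini."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if best is None or score < best[0]:
            # Threshold is the midpoint between the two neighbouring values.
            best = (score, (lo + hi) / 2, frozenset(order[:k]))
    return best[1], best[2]


# Made-up data: match duration in seconds, and whether the game was won (1) or lost (0).
duration_s = [900, 1500, 1320, 2400, 1080, 2100, 1800, 2700]
won = [1, 1, 1, 0, 1, 0, 0, 0]

threshold_s, left_s = best_split(duration_s, won)

# The same durations expressed in minutes (float) and milliseconds (int).
threshold_min, left_min = best_split([s / 60 for s in duration_s], won)
threshold_ms, left_ms = best_split([s * 1000 for s in duration_s], won)

print(threshold_s, threshold_min, threshold_ms)
print("same rows on the left side:", left_s == left_min == left_ms)
```

In this sketch the threshold is simply rescaled with the unit, while the rows that fall on each side of the split stay the same, because a threshold split depends only on the ordering of the values.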

Categorical features: Intuitively, a categorical feature with X unique categories may need on the order of X splits before the ensemble arrives at a decision label.

Numerical features (especially floats): Here the weak learners may need to split many times to reach a decision, since a continuous value in theory allows infinitely many thresholds. That is why such features can contribute to overfitting.

Practical tip: One way to handle numerical features is to bin them, using a binning method that fits your use case. In other words, you turn the numerical values into categories that the ensemble can handle easily. For time-based columns such as timestamps, you can derive relevant features, e.g. month, day of week, day or night, hour, and more (guides on extracting features from a timestamp column are easy to find; see which ones fit the problem). Combined with other important features, these help the model find suitable patterns for classification. In my experience, this works.
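As a rough sketch of both ideas, the snippet below bins a duration given in seconds into coarse categories and derives a few calendar features from a match start time. The bin edges, labels, the night-time cutoff, and the function names `bin_duration` and `timestamp_features` are all assumptions chosen for illustration; pick values that suit your own data.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical bin edges in minutes; each bin includes its lower edge.
BIN_EDGES_MIN = [15, 30, 45]
BIN_LABELS = ["short", "medium", "long", "very_long"]


def bin_duration(seconds):
    """Map a duration in seconds to a coarse category."""
    minutes = seconds / 60
    return BIN_LABELS[bisect_right(BIN_EDGES_MIN, minutes)]


def timestamp_features(started_at):
    """Derive a few calendar features from a match start time."""
    return {
        "month": started_at.month,
        "day_of_week": started_at.weekday(),  # Monday = 0
        "hour": started_at.hour,
        # Assumed cutoff: 20:00 to 05:59 counts as night.
        "is_night": started_at.hour >= 20 or started_at.hour < 6,
    }


print(bin_duration(600))    # a 10-minute match
print(bin_duration(1800))   # a 30-minute match
print(timestamp_features(datetime(2024, 3, 9, 21, 15)))
```

Whether the binned column or the raw duration works better is something to check against your validation metric rather than assume.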

After all, building a model is a trial-and-error endeavor. You need to try various scenarios using the techniques above to see which one gives the best-performing model. Have a baseline metric so that you don't get trapped in endless training and feature engineering.
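One simple way to set such a baseline, sketched here with made-up labels and predictions, is to measure the accuracy of always predicting the most common class and require every feature-engineering variant to beat it. The function names and data are hypothetical; accuracy is only one possible metric.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))


# Made-up labels: 1 = won, 0 = lost.
y_train = [1, 1, 0, 1, 0, 1]
y_test = [1, 0, 1, 1, 0]

baseline = majority_baseline(y_train, y_test)

# Hypothetical predictions from one candidate feature setup.
candidate_pred = [1, 0, 1, 0, 0]
candidate = accuracy(y_test, candidate_pred)

print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
print("keep this variant" if candidate > baseline else "no better than baseline")
```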
