Clustering algorithm for time series data with categorical dtypes

Question

Saurus

May 21, 2022 21:04

I have a large dataset with around 200 features, consisting mostly of time series and categorical data, with some continuous features. The dataset comes from a postal service. Small example:

Random (scrambled) entries:

shipment      delivery      cost     location            weight_kg
2020-04-22    2020-04-23    77.31    UK:66c54f531....    0.5
2020-04-23    2020-04-25    44.14    DK:22c54f531....    2.23
2020-04-24    2020-04-27    53.84    UK:66c54f531....    1.57
2020-04-25    2020-04-26    22.09    UK:66c54f531....

My first inclination was to build a demand-forecast model on shipment/count_monthly(shipment), but given the number of features, a multivariate approach seemed more relevant. I am just not sure which additional features to add without the project becoming too generic (linear regression). My initial EDA showed variables with low correlation; the others were removed to avoid multicollinearity.

Instead, I then considered a clustering approach to explore the relations between the features in more detail. I am just not sure how to approach it with a dataset of this size and with time series. I have never really worked with that dtype, especially in combination with categorical dtypes. Any advice would be appreciated.

Edit: the various date columns (like shipment and delivery) are not chronological, and their values appear numerous times, so they cannot be time series either. This raises another question: does it even make sense to convert these columns to datetime objects?

Topic time-series categorical-data clustering

Category Data Science

Brian Spiering · Accepted Answer · February 3, 2022 14:55

Most clustering algorithms are designed to work with numeric features.

Those date-related columns can be converted to numeric features. Shattering is one way to do this: create separate features for year, month, and day.
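
A minimal sketch of shattering, using only Python's standard library and the shipment dates from the sample rows in the question. The helper name shatter_date and the output key names are illustrative assumptions, not something the dataset or any library defines.

```python
from datetime import date

def shatter_date(iso_string, prefix):
    """Split an ISO-formatted date string into numeric year/month/day features."""
    d = date.fromisoformat(iso_string)
    return {
        f"{prefix}_year": d.year,
        f"{prefix}_month": d.month,
        f"{prefix}_day": d.day,
    }

# Shipment dates taken from the sample rows in the question.
shipments = ["2020-04-22", "2020-04-23", "2020-04-24", "2020-04-25"]
shattered = [shatter_date(s, "shipment") for s in shipments]
print(shattered[0])
```

If the data lives in a pandas DataFrame, the same values should be available through the Series.dt.year, Series.dt.month, and Series.dt.day accessors once the column has been parsed with pd.to_datetime.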

Additional time-based features can be created, such as the length of a shipment (the time between the shipment and delivery dates).
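
As a sketch of one such feature, the snippet below computes the number of days between shipment and delivery for the sample rows in the question. It assumes that the length of a shipment means the delivery date minus the shipment date; the function name transit_days is hypothetical.

```python
from datetime import date

def transit_days(shipment, delivery):
    """Return the whole days between two ISO-formatted date strings."""
    return (date.fromisoformat(delivery) - date.fromisoformat(shipment)).days

# (shipment, delivery) pairs taken from the sample rows in the question.
rows = [
    ("2020-04-22", "2020-04-23"),
    ("2020-04-23", "2020-04-25"),
    ("2020-04-24", "2020-04-27"),
    ("2020-04-25", "2020-04-26"),
]
durations = [transit_days(s, d) for s, d in rows]
print(durations)  # [1, 2, 3, 1]
```

The result is a single numeric column, so it can be passed to a clustering algorithm alongside the shattered date parts.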
