Is there any clustering algorithm to find longest continuous subsequences?

I have data which contains access duration of some items.

Example:
t0~t5 is the access time duration, 1 means the items was accessed in the time duration, 0 means it wasn't.

ID,t0,t1,t2,t3,t4
0,0,0,1,1,1
1,0,1,1,1,1
2,0,1,1,0,0
3,1,1,0,0,1
4,1,1,0,0,1

In the above example, groups ID=0,1 are what I want.

ID=3,4 aren't because their distance is short but they are not continuous.

I tried KMeans and DBSCAN, they all cluster ID=3,4 as one group and it makes sense. But it doesn't do what I want.

Is there any possible pre-processing of data to reach what I want ?

Or I should use other analytic tool?

Topic dbscan python k-means clustering machine-learning

Category Data Science


It is probably worth to transform the data, such that each record is "continuous" (call it differently - for example "contiguous" because the term continuous has a widely known mathematical meaning), and if necessary make multiple copies.

K-means minimizes the sum of squares. I don't see how that is beneficial here.

There is generalized DBSCAN. You can define arbitrary neighbor predicates for it. For example, you could the define that neighbors (candidates for merging into the same cluster) must have a contiguous overlap of at least two active timepoints. Then consider whether this satisfies your notion of clusters because of the transitivity computed by DBSCAN.

My guess is that you'll rather want to e.g. extract all contiguous subsequences of a minimum length - say, 2 - of all records and simply count them to identify the most frequent subsequences. If you implement this with an efficient bit representation, then it will be very fast.


What might help is a custom distance computation as input to the clustering algorithm. These algorithms usually take Euclidean distance as a measure of dissimilarity.

You can try DBSCAN (in Python scikit-learn), with metric='precomputed' and 'X' as a custom distance matrix. You can construct this distance matrix to conform to your requirement. Eg: specify that nodes 3 and 4 have a large distance, even though they are equal.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.