Is there any clustering algorithm to find longest continuous subsequences?

Question

Is there any clustering algorithm to find longest continuous subsequences?

code_worker

2021年10月13日 16:16

I have data which contains access duration of some items.

Example:
t0~t5 is the access time duration, 1 means the items was accessed in the time duration, 0 means it wasn't.

ID,t0,t1,t2,t3,t4
0,0,0,1,1,1
1,0,1,1,1,1
2,0,1,1,0,0
3,1,1,0,0,1
4,1,1,0,0,1

In the above example, groups ID=0,1 are what I want.

ID=3,4 aren't because their distance is short but they are not continuous.

I tried KMeans and DBSCAN, they all cluster ID=3,4 as one group and it makes sense. But it doesn't do what I want.

Is there any possible pre-processing of data to reach what I want ?

Or I should use other analytic tool?

Topic dbscan python k-means clustering machine-learning

Category Data Science

Has QUIT--Anony-Mousse · Accepted Answer · 2019年3月30日 09:59

It is probably worth to transform the data, such that each record is "continuous" (call it differently - for example "contiguous" because the term continuous has a widely known mathematical meaning), and if necessary make multiple copies.

K-means minimizes the sum of squares. I don't see how that is beneficial here.

There is generalized DBSCAN. You can define arbitrary neighbor predicates for it. For example, you could the define that neighbors (candidates for merging into the same cluster) must have a contiguous overlap of at least two active timepoints. Then consider whether this satisfies your notion of clusters because of the transitivity computed by DBSCAN.

My guess is that you'll rather want to e.g. extract all contiguous subsequences of a minimum length - say, 2 - of all records and simply count them to identify the most frequent subsequences. If you implement this with an efficient bit representation, then it will be very fast.

raghu · Accepted Answer · 2019年3月29日 17:34

What might help is a custom distance computation as input to the clustering algorithm. These algorithms usually take Euclidean distance as a measure of dissimilarity.

You can try DBSCAN (in Python scikit-learn), with metric='precomputed' and 'X' as a custom distance matrix. You can construct this distance matrix to conform to your requirement. Eg: specify that nodes 3 and 4 have a large distance, even though they are equal.

Is there any clustering algorithm to find longest continuous subsequences?

About