Algorithm for segmentation of sequence data

Question

Algorithm for segmentation of sequence data

generall

2022年3月9日 23:17

I have a large sequence of vectors of length N. I need some unsupervised learning algorithm to divide these vectors into M segments.

For example:

K-means is not suitable, because it puts similar elements from different locations into a single cluster.

Update:

The real data looks like this:

Here, I see 3 clusters: [0..50], [50..200], [200..250]

Update 2:

I used modified k-means and got this acceptable result:

Borders of clusters: [0, 38, 195, 246]

Topic sequence clustering machine-learning

Category Data Science

Davide ND · Accepted Answer · 2019年12月5日 13:45

Just as a suggestion: you might try using the DBSCAN algorithm, as it often works much better than K-means for clustering

Otherwise, if you want to try something new for clustering and learn some interesting stuff, I suggest you try some Topological Data Analysis through persistent diagrams. I'mma leave you here a nice easy intro :)

https://towardsdatascience.com/persistent-homology-with-examples-1974d4b9c3d0

Kasra Manshaei · Accepted Answer · 2019年12月5日 12:58

Please see my comment above and this is my answer according to what I understood from your question:

As you correctly stated you do not need Clustering but Segmentation. Indeed you are looking for Change Points in your time series. The answer really depends on the complexity of your data. If the data is as simple as above example you can use the difference of vectors which overshoots at changing points and set a threshold detecting those points like bellow: enter image description here As you see for instance a threshold of 20 (i.e. $dx<-20$ and $dx>20$) will detect the points. Of course for real data you need to investigate more to find the thresholds.

Pre-processing

Please note that there is a trade-off between accurate location of the change point and the accurate number of segments i.e. if you use the original data you'll find the exact change points but the whole method is to sensitive to noise but if you smooth your signals first you may not find the exact changes but the noise effect will be much less as shown in figures bellow:

enter image description here

Conclusion

My suggestion is to smooth your signals first and go for a simple clustering mthod (e.g. using GMMs) to find an accurate estimation of the number of segments in signals. Given this information you can start finding changing points constrained by the number of segments you found from previous part.

I hope it all helped :)

Good Luck!

UPDATE

Luckily your data is pretty straightforward and clean. I strongly recommend dimensionality reduction algorithms (e.g. simple PCA). I guess it reveals the internal structure of your clusters. Once you apply PCA to the data you can use k-means much much easier and more accurate.

A Serious(!) Solution

According to your data I see the generative distribution of different segments are different which is a great chance for you to segment your time series. See this (original, archive, other source) which is probably the best and most state-of-the-art solution to your problem. The main idea behind this paper is that if different segments of a time series are generated by different underlying distributions you can find those distributions, set tham as ground truth for your clustering approach and find clusters.

For example assume a long video in which the first 10 minutes somebody is biking, in the second 10 mins he is running and in the third he is sitting. you can cluster these three different segments (activities) using this approach.

Nolatar · Accepted Answer · 2016年1月13日 00:10

K-means clustering is known to give local minima, depending on your initial initialisation of the cluster centres.

However, k-means segmentation can, I think, be solved globally, since we do not permute anything in finding the solution.

I can see from your comments that you did manage eventually to reach a segmentation. Would you be able to give some feedback, please? Is your solution the best solution? Or did you settle for a good enough solution?

Algorithm for segmentation of sequence data

Pre-processing

Conclusion

UPDATE

A Serious(!) Solution

About