Confusion about k-means clustering for data correlation

I am trying to think through my process before doing any real coding, but I got confused quickly.

Say I have 100 instruments and I know their price movements every day for a year. So I can create a movement matrix

A = [[I1-1,   I2-1,   ..., I100-1],    (I1-1 is the price of instrument 1 on day 1)
     [I1-2,   I2-2,   ..., I100-2],
     ...
     [I1-365, I2-365, ..., I100-365]]

Then, for each pair of instruments, I can calculate the correlation of their price movements over the whole year.

   C =[C1-2, C1-3,...C1-100,C2-3,....C99-100] (C1-2 is the price movement correlation between instrument 1 and 2 for the whole year)
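As a minimal sketch of this step, assuming the movement matrix is held in a NumPy array with one row per day and one column per instrument (the random data and the names `A`, `R`, `C` below are only illustrative), the pairwise correlations can be computed and flattened like this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 365 days x 100 instruments of daily price movements.
A = rng.normal(size=(365, 100))

# Full 100 x 100 correlation matrix; rowvar=False treats columns as variables.
R = np.corrcoef(A, rowvar=False)

# Flatten the strict upper triangle into C = [C1-2, C1-3, ..., C99-100].
i, j = np.triu_indices(R.shape[0], k=1)
C = R[i, j]

print(R.shape, C.shape)
```

Note that `C` has one entry per pair of instruments, not one entry per instrument.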

Then I would like to apply the k-means clustering algorithm to group the correlations into, say, 10 categories. In theory, this gives me 10 categories of instruments whose prices tend to move together.

However, the more I think about it, the less correct it seems. For example, suppose this is my correlation result:

 C =[0.35, 0.59,...0.88(C1-100),0.48,....0.99(C99-100)]

Couldn't k-means put C1-100 and C99-100 in one cluster, and C1-2, C1-3, and C2-3 in another?

As I read it, that would put instruments 1, 99, and 100 in one category and instruments 1, 2, and 3 in another. But I would like each instrument to belong to only one category, so it looks like there is a hole in my idea, or maybe my idea is totally wrong?
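The concern can be reproduced on a tiny made-up example. The sketch below assumes scikit-learn is available and uses invented correlation values for 4 instruments; it runs k-means on the flat list of correlations and then lists which instruments show up in each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up correlations for 4 instruments, one value per pair of instruments.
pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
C = np.array([0.90, 0.20, 0.85, 0.25, 0.30, 0.95])

# k-means on the 1-D list of correlation values: this clusters *pairs*.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(C.reshape(-1, 1)).tolist()

# Collect the instruments that appear in each cluster of pairs.
members = {c: set() for c in set(labels)}
for (a, b), c in zip(pairs, labels):
    members[c].update((a, b))

for c, instruments in members.items():
    print(f"cluster {c}: instruments {sorted(instruments)}")
```

Here every instrument appears in both clusters, because the clusters group pairs by how strong their correlation is, not instruments by how they move together.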

Topic python k-means clustering machine-learning

Category Data Science


You will not get what you seek this way, but you are on the right path. Use the correlation between two instruments as a measure of similarity, and then perform spectral clustering with this measure as the kernel.

Basically, you will start with your correlation matrix $R$, and build the corresponding Laplacian matrix $L$. The eigenvectors corresponding to the smallest eigenvalues of $L$ will give you a projection space in which you can perform k-means clustering.
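The steps above can be sketched as follows. This is only one possible reading, under several assumptions that are not part of the answer: negative correlations are clipped to zero so the similarities are non-negative, the unnormalized Laplacian $L = D - W$ is used, and the helper name `spectral_labels` and the synthetic three-group data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_labels(R, k, seed=0):
    """Cluster instruments from a correlation matrix R into k groups."""
    # Assumption: negative correlations count as "no similarity".
    W = np.clip(R, 0.0, None)
    np.fill_diagonal(W, 0.0)        # no self-loops
    D = np.diag(W.sum(axis=1))      # degree matrix
    L = D - W                       # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, vecs = np.linalg.eigh(L)
    embedding = vecs[:, :k]         # eigenvectors of the k smallest eigenvalues
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(embedding)

# Hypothetical data: 3 groups of 10 instruments, each driven by a shared factor.
rng = np.random.default_rng(0)
factors = rng.normal(size=(365, 3))
group = np.repeat(np.arange(3), 10)
A = factors[:, group] + 0.5 * rng.normal(size=(365, 30))

R = np.corrcoef(A, rowvar=False)
labels = spectral_labels(R, k=3)
print(labels)
```

Each instrument gets exactly one label, which is the property the flat list of pairwise correlations could not provide. Other ways of turning correlations into similarities (for example $(1 + R)/2$) or a normalized Laplacian are equally plausible choices.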

This technique is effective if you have a good similarity measure, but it only works for reasonably sized datasets (because you need the eigendecomposition of the $n \times n$ matrix).
