Question About Coming Up With My Own Function for a Distance Matrix (For Clustering)

I am currently working on implementing a clustering algorithm over millions of data entries describing the users of a mobile game.

A lot of the features I plan on using are unique to this game (data that can only be interpreted if one knows the game well), so I believe it is best for my data that I come up with a new function to generate the distance matrix I plan to feed into the various clustering algorithms later on.

My data is a mixture of continuous and categorical features, and instead of using Gower distance, for the reasons given above, I want to come up with my own formula.

For example, when comparing the download date (stored as a pandas datetime), I would like to cross-reference the download date with the time periods of the various promotions that were running, and adjust the similarity score for the download date accordingly.
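Here is a rough Python sketch of the kind of per-feature adjustment I have in mind; the promotion windows, the 30-day scale, and the bonus weight are all placeholder values, not real game data:

```python
import pandas as pd

# Placeholder promotion windows; the real dates would come from the game's records.
PROMOTIONS = [
    (pd.Timestamp("2019-06-01"), pd.Timestamp("2019-06-14")),
    (pd.Timestamp("2019-11-20"), pd.Timestamp("2019-12-01")),
]

def downloaded_during_promo(date: pd.Timestamp) -> bool:
    """Return True if the download date falls inside any promotion window."""
    return any(start <= date <= end for start, end in PROMOTIONS)

def download_date_similarity(a: pd.Timestamp, b: pd.Timestamp,
                             scale_days: float = 30.0) -> float:
    """Similarity in [0, 1]: closer dates are more similar, and two users who
    both downloaded during a promotion are treated as more alike."""
    gap_days = abs((a - b).days)
    sim = max(0.0, 1.0 - gap_days / scale_days)   # plain temporal closeness
    if downloaded_during_promo(a) and downloaded_during_promo(b):
        sim = min(1.0, sim + 0.25)                # promotion bonus (arbitrary weight)
    return sim
```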

Now, if I have 35 million users and around 20 features per user, is this a feasible way of calculating the distance matrix?

I'm asking because I've read that using Gower distance to compute the dissimilarity matrix for a large dataset is not feasible due to its time complexity.

I was wondering whether the method I described above is feasible in terms of time complexity, or whether I am just wasting my time.

Thank you. As someone who doesn't have much experience with data science and clustering, I would be grateful for any input!


Tags: distance, hierarchical-data-format, k-means, clustering, bigdata

Category: Data Science


As @Anony-Mousse pointed out, computing a similarity matrix is not practical for a large dataset. Memory complexity is tied to time complexity: if you want to read or write $n$ values, you need at least $n$ units of time to do it. Since a similarity matrix for $n$ elements has $n^2$ entries, any problem with at least quadratic complexity, in memory or in time, does not scale. You therefore need an $O(n)$ or $O(n \log n)$ algorithm. In your position I would recommend starting with a basic algorithm that handles mixed data: $K$-Prototypes, which has linear time complexity.
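If you work in Python, a minimal sketch using the kmodes package (one common $K$-Prototypes implementation; the toy data and column indices below are just placeholders) could look like this:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Toy mixed data: two numeric columns and one categorical column.
X = np.array([
    [12.5, 3.0, "android"],
    [80.0, 45.0, "ios"],
    [11.0, 2.0, "android"],
    [75.5, 50.0, "ios"],
], dtype=object)

# K-Prototypes combines k-means on the numeric features with k-modes on the
# categorical ones, in roughly O(n * k * iterations) time -- no n x n matrix.
kproto = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kproto.fit_predict(X, categorical=[2])  # indices of categorical columns
print(labels)
```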

If you want a custom distance measure, you can try the Clustering4Ever implementation of this algorithm, which allows you to plug in any metric that respects the library's design. The algorithm is available in Scala and in Scala/Spark to run on multiple machines.


It's trivial to see that you don't want to compute the distance matrix. Just compute the size of the entire matrix, even assuming zero memory overhead: 35 million × 35 million entries × 8 bytes per double, divided by 2 to exploit symmetry. That is something like 5 petabytes.
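Spelled out:

$$\frac{(3.5 \times 10^{7})^{2} \times 8\ \text{bytes}}{2} \approx 4.9 \times 10^{15}\ \text{bytes} \approx 5\ \text{PB}.$$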

Forget anything that is not linear. Try to shrink your data by weeding out the easy cases first: duplicates, single-time visitors where you have too little data to say much, etc.

Then begin experimenting with a manageable sample. If you find something on a sample, it might be usable on the entire data set, too. And if a method does not work on the samples, it likely will not work on the entire data set either (but you'll have to wait far less time to find out).
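For instance, a rough pandas sketch of that shrinking-and-sampling step; the file name and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical source file and columns -- adapt to your own schema.
users = pd.read_parquet("users.parquet")
users = users.drop_duplicates(subset="user_id")      # weed out duplicates
users = users[users["session_count"] > 1]            # drop one-time visitors with too little data
sample = users.sample(n=100_000, random_state=42)    # experiment on ~100k rows first
```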
