Scaling DBSCAN clustering - minHash?
Applying density-based clustering (DBSCAN) to $50k$ data points with about $2k$-$4k$ features, I achieve the desired results.
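For reference, my current setup is roughly the following (the `eps` and `min_samples` values are placeholders, not my tuned parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for my real feature matrix: ~50k points, a few thousand features
X = np.random.rand(50_000, 3_000)

# eps / min_samples are placeholder values
db = DBSCAN(eps=0.5, min_samples=10, metric="euclidean").fit(X)
labels = db.labels_  # label -1 marks noise points
```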
However, scaling this to $10$ million data points calls for a much more efficient approach: a naive DBSCAN implementation needs $O(n^2)$ time and memory for the pairwise distance matrix, which crushes my RAM (a dense float64 matrix for $10^7$ points would take $10^7 \times 10^7 \times 8$ bytes, i.e. roughly $800$ TB).
There must be some efficient sampling-based method to get around this, ideally something similar to minHash, but I'm not sure how to approach it, or whether a solution exists that works with the existing sklearn DBSCAN implementation. Any ideas?
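One partial idea I had (sketched below, again with placeholder parameters): sklearn's DBSCAN accepts a precomputed sparse neighborhood graph via `metric="precomputed"`, which avoids materializing the dense $n \times n$ matrix, but building that graph is still the expensive step at $10$ million points, so it doesn't solve the core problem:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

eps = 0.5  # placeholder value
X = np.random.rand(50_000, 3_000)  # stand-in for my data

# Sparse matrix storing only pairwise distances <= eps;
# absent entries are implicitly "not a neighbor"
graph = radius_neighbors_graph(X, radius=eps, mode="distance")

db = DBSCAN(eps=eps, min_samples=10, metric="precomputed").fit(graph)
labels = db.labels_
```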
Topic: dbscan, clustering, scalability, machine-learning
Category: Data Science