Scaling DBSCAN clustering - MinHash?

Applying density-based clustering (DBSCAN) to $50k$ data points with about $2k$-$4k$ features, I get the desired results.

However, scaling this to $10$ million data points calls for a much more efficient implementation, since DBSCAN needs $O(n^2)$ time and memory to compute the full distance matrix, which exhausts my memory.
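(For scale: a dense $10^7 \times 10^7$ distance matrix has $10^{14}$ entries, which at $8$ bytes each is on the order of $800$ TB.)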

There must be an efficient sampling-based method to overcome this, ideally something similar to MinHash, but I'm not sure how to approach it, or whether a solution exists that works with the existing sklearn DBSCAN implementation. Any ideas?



DBSCAN's runtime is $O(n)$ times the cost of a neighborhood query.

If you use an index such as LSH that can answer a neighborhood query in $O(1)$ (assuming a fairly even data distribution, with $O(1)$ neighbors per point), then DBSCAN runs in $O(n)$ overall, plus the time needed to build the index.
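For example, sklearn already lets you avoid the dense matrix entirely: precompute a sparse radius-neighborhood graph with a tree index and feed it to DBSCAN with `metric='precomputed'`. This is a minimal sketch; `X`, `eps`, and `min_samples` are placeholders you would replace with your own data and parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100_000, 50)  # placeholder data
eps = 0.5

# A tree index answers radius queries without a dense distance matrix;
# the resulting sparse graph only stores pairs within eps of each other.
nn = NearestNeighbors(radius=eps, algorithm="ball_tree").fit(X)
neighbor_graph = nn.radius_neighbors_graph(X, mode="distance")

# DBSCAN accepts a sparse precomputed graph: absent entries are treated
# as distances larger than eps, so memory scales with the number of edges.
labels = DBSCAN(eps=eps, min_samples=10,
                metric="precomputed").fit_predict(neighbor_graph)
```

Note that with thousands of features, tree indexes degrade toward brute-force search, which is exactly why approximate hashing schemes like LSH become attractive at that scale.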

So yes, MinHash indexes can work, provided they suit your data and distance: MinHash approximates Jaccard similarity, so your points need a meaningful set representation.
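If your points can be represented as token sets and Jaccard distance fits your problem, one way to wire a MinHash LSH index into the DBSCAN loop is sketched below, assuming the `datasketch` library. The function `dbscan_lsh` and all thresholds are illustrative, and LSH neighborhoods are approximate, so this yields an approximation of exact DBSCAN rather than the same result:

```python
from collections import deque
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    """Build a MinHash sketch of a set of string tokens."""
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

def dbscan_lsh(point_sets, threshold=0.6, min_samples=5, num_perm=128):
    """DBSCAN-style clustering where each eps-neighborhood query is
    answered approximately by a MinHash LSH index (Jaccard similarity)."""
    sketches = [minhash(s, num_perm) for s in point_sets]
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for i, m in enumerate(sketches):
        lsh.insert(i, m)

    NOISE, UNVISITED = -1, -2
    labels = [UNVISITED] * len(point_sets)
    cluster = 0
    for i in range(len(point_sets)):
        if labels[i] != UNVISITED:
            continue
        neighbors = lsh.query(sketches[i])  # approximate neighborhood, O(1)-ish
        if len(neighbors) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:                        # expand the cluster
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster         # noise reachable from a core: border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = lsh.query(sketches[j])
            if len(j_neighbors) >= min_samples:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

Since each point issues at most one LSH query during expansion, the total cost stays near-linear in $n$, at the price of the usual LSH false negatives (some true neighbors within the threshold may be missed).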
