DBSCAN - Space complexity of O(n)?

According to Wikipedia, "the distance matrix of size $\frac{(n^2-n)}{2}$ can be materialized to avoid distance recomputations, but this needs $O(n^2)$ memory, whereas a non-matrix based implementation of DBSCAN only needs $O(n)$ memory."

$\frac{(n^2-n)}{2}$ is the number of entries in the lower triangle of the distance matrix. However, the article says that a non-matrix-based implementation only requires $O(n)$ memory. How does that work? Regardless of the data structure you use, don't you still need all $\frac{(n^2-n)}{2}$ distance values? That would still be $O(n^2)$ space complexity, no? Is there something I'm missing here? I'm working with a huge dataset and I would really like to cut down on memory usage.
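For a sense of scale, here is a quick back-of-the-envelope calculation (the dataset size $n = 10^6$ is a hypothetical value, assuming 8-byte `float64` distances):

```python
# Memory needed to materialize the (n^2 - n) / 2 pairwise distances as float64.
n = 1_000_000                # hypothetical number of points
pairs = n * (n - 1) // 2     # unique pairwise distances (lower triangle)
gib = pairs * 8 / 2**30      # 8 bytes per float64, converted to GiB
print(f"{pairs:,} distances ≈ {gib:,.0f} GiB")
```

Even for a moderately large dataset, the full triangular matrix runs into the terabytes, which is why avoiding it matters.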

Tags: dbscan, clustering, scalability

Category: Data Science


You can run DBSCAN without storing the distances in a matrix. The drawback is that each time you visit a point, you have to recompute all the relevant distances, which costs more time. The space complexity, however, stays $O(n)$: the only things held in memory at any one time are the positions of the $n$ points, their labels, the neighbors of the current point, and, when one of those neighbors turns out to be a core point, the neighbors of that particular neighbor.
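A minimal sketch of such a matrix-free implementation (illustrative only; function and parameter names are my own, and a real implementation would use a spatial index such as a k-d tree to speed up the neighborhood queries):

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within eps of point i.

    Distances are recomputed on demand, so only O(n) values
    exist in memory at any time -- no distance matrix is stored."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    n = len(X)
    NOISE = -1
    labels = np.full(n, NOISE)          # O(n): cluster label per point
    visited = np.zeros(n, dtype=bool)   # O(n): visited flag per point
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        nb = region_query(X, i, eps)    # neighbors of the current point
        if len(nb) < min_pts:
            continue                    # noise for now; a later cluster may
                                        # still claim it as a border point
        labels[i] = cluster
        seeds = set(nb)                 # expansion frontier, at most n entries
        seeds.discard(i)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_nb = region_query(X, j, eps)  # neighbors of a neighbor
            if len(j_nb) >= min_pts:        # j is a core point: grow frontier
                seeds.update(k for k in j_nb if not visited[k])
        cluster += 1
    return labels
```

At every step the working set is just the label and visited arrays plus the neighbor lists of the current point and one of its neighbors, each of size at most $n$, so peak memory is $O(n)$; the price is that `region_query` scans all $n$ points on every call.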
