Scalable way to calculate betweenness centrality for a graph in Spark

I have a use case that requires calculating the betweenness centrality of nodes. I have tried GraphX with spark-betweenness, but it is a very long-running job. Has anyone successfully calculated the betweenness centrality of a large network with around 10 million vertices and 100 million edges?

Topic: networkx, graphs, apache-spark, social-network-analysis, machine-learning

Category: Data Science


Sorry, I do not think you can compute the exact betweenness centrality of nodes in a graph of this size: the standard exact algorithm (Brandes') runs in $O(n \cdot m)$ time, where $n$ is the number of nodes and $m$ the number of edges.
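As a rough back-of-envelope check with the numbers from the question, that works out to

$$n \cdot m = 10^{7} \times 10^{8} = 10^{15}$$

elementary shortest-path steps, which is far beyond what you can expect to finish in a reasonable time even on a sizeable cluster.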

The good news is that you can approximate it, in a way that also benefits from parallel computation. Computing betweenness centrality relies on counting the shortest paths between every pair of nodes. Instead of doing this exhaustively, you can (randomly) select a sample of pivot nodes, compute the shortest paths from each pivot to all other nodes, and use the resulting counts to approximate the betweenness scores. The more pivots you select, the better the approximation, but it is empirically quite good even with a small sample. See the sketch below.
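Here is a minimal single-machine sketch of this pivot-sampling idea using networkx (one of the question's tags), whose `betweenness_centrality` function accepts a `k` parameter that does exactly this sampling. The graph is a small synthetic stand-in for your real data, and `k=256` is an arbitrary sample size, not a recommendation. Since each pivot only needs an independent single-source shortest-path computation, the same idea parallelizes naturally (for example, one pivot per Spark task), which is how it can be scaled to your graph size.

```python
import networkx as nx

# Build (or load) the graph; a small random graph stands in for the real data here.
G = nx.fast_gnp_random_graph(n=100_000, p=0.0001, seed=42)

# Exact betweenness is O(n*m), infeasible at 10M nodes / 100M edges.
# Approximate betweenness: sample k pivot nodes, run one single-source
# shortest-path computation per pivot, and scale the path counts up.
approx_bc = nx.betweenness_centrality(G, k=256, normalized=True, seed=42)

# Larger k -> better approximation, at the cost of k single-source computations.
top10 = sorted(approx_bc.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)
```

In a distributed setting the structure stays the same: broadcast the graph (or a partitioned representation of it), map over the sampled pivots so each worker handles the shortest-path computation for its pivots, and reduce the per-pivot dependency scores into the final approximate centralities.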
