Is this Neo4j comparison to RDBMS execution time correct?
Background: Following is from the book Graph Databases, which covers a performance test mentioned in the book Neo4j in Action:
Relationships in a graph naturally form paths. Querying, or traversing, the graph involves following paths. Because of the fundamentally path-oriented nature of the datamodel, the majority of path-based graph database operations are highly aligned with the way in which the data is laid out, making them extremely efficient. In their book Neo4j in Action, Partner and Vukotic perform an experiment using a relational store and Neo4j.
The comparison shows that the graph database is substantially quicker for connected data than a relational store.Partner and Vukotic’s experiment seeks to find friends-of-friends in a social network, to a maximum depth of five. Given any two persons chosen at random, is there a path that connects them which is at most five relationships long? For a social network containing 1,000,000 people, each with approximately 50 friends, the results strongly suggest that graph databases are the best choice for connected data, as we see in Table 2-1.
Table 2-1. Finding extended friends in a relational database versus efficient finding in Neo4j
Depth RDBMS Execution time (s) Neo4j Execution time (s) Records returned 2 0.016 0.01 ~2500 3 30.267 0.168 ~110,000 4 1543.505 1.359 ~600,000 5 Unfinished 2.132 ~800,000
At depth two (friends-of-friends) both the relational database and the graph database perform well enough for us to consider using them in an online system. While the Neo4j query runs in two-thirds the time of the relational one, an end-user would barely notice the the difference in milliseconds between the two. By the time we reach depth three (friend-of-friend-of-friend), however, it’s clear that the relational database can no longer deal with the query in a reasonable timeframe: the thirty seconds it takes to complete would be completely unacceptable for an online system. In contrast, Neo4j’s response time remains relatively flat: just a fraction of a second to perform the query—definitely quick enough for an online system.
At depth four the relational database exhibits crippling latency, making it practically useless for an online system. Neo4j’s timings have deteriorated a little too, but the latency here is at the periphery of being acceptable for a responsive online system. Finally, at depth five, the relational database simply takes too long to complete the query. Neo4j, in contrast, returns a result in around two seconds. At depth five, it transpires almost the entire network is our friend: for many real-world use cases, we’d likely trim the results, and the timings.
Questions are:
- Is this a reasonable test to emulate what one might except to find in a social network? (Meaning do real social networks normally have nodes with approximately 50 friends for example; seems like the "rich get richer" model would be more natural for social networks, though might be wrong.)
- Regardless of the naturalness of the emulation, is there any reason to believe the results are off, or unreproducible?
Category Data Science