Is this Neo4j comparison to RDBMS execution time correct?

Background: Following is from the book Graph Databases, which covers a performance test mentioned in the book Neo4j in Action:

Relationships in a graph naturally form paths. Querying, or traversing, the graph involves following paths. Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are highly aligned with the way in which the data is laid out, making them extremely efficient. In their book Neo4j in Action, Partner and Vukotic perform an experiment using a relational store and Neo4j.

The comparison shows that the graph database is substantially quicker for connected data than a relational store. Partner and Vukotic’s experiment seeks to find friends-of-friends in a social network, to a maximum depth of five. Given any two persons chosen at random, is there a path connecting them that is at most five relationships long? For a social network containing 1,000,000 people, each with approximately 50 friends, the results strongly suggest that graph databases are the best choice for connected data, as we see in Table 2-1.

Table 2-1. Finding extended friends in a relational database versus efficient finding in Neo4j

Depth   RDBMS Execution time (s)    Neo4j Execution time (s)     Records returned
2       0.016                       0.01                         ~2500    
3       30.267                      0.168                        ~110,000 
4       1543.505                    1.359                        ~600,000 
5       Unfinished                  2.132                        ~800,000

At depth two (friends-of-friends) both the relational database and the graph database perform well enough for us to consider using them in an online system. While the Neo4j query runs in two-thirds the time of the relational one, an end-user would barely notice the difference in milliseconds between the two. By the time we reach depth three (friend-of-friend-of-friend), however, it’s clear that the relational database can no longer deal with the query in a reasonable timeframe: the thirty seconds it takes to complete would be completely unacceptable for an online system. In contrast, Neo4j’s response time remains relatively flat: just a fraction of a second to perform the query—definitely quick enough for an online system.

At depth four the relational database exhibits crippling latency, making it practically useless for an online system. Neo4j’s timings have deteriorated a little too, but the latency here is at the periphery of being acceptable for a responsive online system. Finally, at depth five, the relational database simply takes too long to complete the query. Neo4j, in contrast, returns a result in around two seconds. At depth five, it transpires that almost the entire network is our friend: for many real-world use cases, we’d likely trim the results, and the timings.

Questions are:

  • Is this a reasonable test to emulate what one might expect to find in a social network? (Meaning, do real social networks normally have nodes with approximately 50 friends? It seems like the "rich get richer" model would be more natural for social networks, though I might be wrong.)
  • Regardless of the naturalness of the emulation, is there any reason to believe the results are off, or unreproducible?
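On the first question, a quick way to see how a "rich get richer" network differs from a uniform 50-friends-each network is to simulate one. The sketch below is plain Python with no graph library; the network size and per-node attachment count are arbitrary choices picked so the mean degree lands near 50. It grows a preferential-attachment network and prints its degree statistics, illustrating how the mean, median, and maximum degree pull apart under that model:

```python
import random

def preferential_attachment(n_people, m_links):
    """Grow a network where each newcomer links to m_links existing
    people, chosen proportionally to current degree ("rich get richer")."""
    degree = [0] * n_people
    # Endpoint pool: each person appears once per incident link,
    # so uniform sampling from it is degree-proportional sampling.
    pool = []
    # Seed with a small fully connected clique of m_links + 1 people.
    for i in range(m_links + 1):
        for j in range(i + 1, m_links + 1):
            degree[i] += 1; degree[j] += 1
            pool += [i, j]
    for new in range(m_links + 1, n_people):
        targets = set()
        while len(targets) < m_links:
            targets.add(random.choice(pool))
        for t in targets:
            degree[new] += 1; degree[t] += 1
            pool += [new, t]
    return degree

random.seed(42)
deg = preferential_attachment(10_000, 25)   # mean degree close to 50
mean = sum(deg) / len(deg)
median = sorted(deg)[len(deg) // 2]
print(f"mean degree   {mean:.1f}")
print(f"median degree {median}")
print(f"max degree    {max(deg)}")
```

The mean comes out near 50, but the distribution is heavily right-skewed: the median sits well below the mean and a handful of early "hub" nodes accumulate degrees in the hundreds or thousands, which is quite different from everyone having roughly 50 friends.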



There are good/fast ways to model graphs in an RDBMS, and dumb/slow ways.

  • Some use clever indexing and stored procedures, trading CPU load and tuned temp tables on RAM disks for faster graph-retrieval speed.

  • Some use precomputed graph paths (this may be less feasible in a social-network scenario, but in a tree where the majority of nodes are leaves, it's a pretty good space-for-time tradeoff).

  • Some simply compute in a loop, using an un-tuned, un-indexed temp table. From the numbers thrown around in the article, that smells like what they did (30-second performance on a fairly smallish dataset).

    For example, I have my own tree computation.

    • It is encapsulated in a highly-tuned stored proc

    • While it's running on an enterprise-sized Sybase ASE 15 dataserver, that server is shared with a couple of terabytes of data from all the other enterprise apps, some much more data-hungry than mine, and isn't dedicated solely to executing my queries.

    • I did not have access to the main speedup tool, a temp table on a RAM disk.

    • A representative set of data I was retrieving that seems to somewhat match theirs was a 150,000-node subtree out of a 2.5M-node full forest dataset (unlimited tree depth, which varies between 5 and 15, but a smaller average arity per node than the 50 friends listed in the experiment).

    • I tuned it to the point where this query takes ~30-45 seconds. It most certainly does NOT exhibit the exponential slowdown that the figures in the question seem to indicate for their RDBMS performance, which is doubly strange given that there is no exponential growth in the result set (and which, from personal experience, reeks of an un-tuned index on a temp table).
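The loop-with-temp-tables approach described above can be sketched in a few lines of SQLite. This is an illustrative toy, not the book's benchmark: a made-up network of 5,000 people with 20 random friends each, expanded breadth-first with one indexed self-join per depth level, keeping the frontier and the already-seen set in temp tables:

```python
import random
import sqlite3
import time

random.seed(1)
N_PEOPLE, N_FRIENDS = 5_000, 20

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE friend (person INTEGER, friend INTEGER)")
# Random friendships; duplicates and self-links are harmless for the sketch.
rows = [(p, random.randrange(N_PEOPLE))
        for p in range(N_PEOPLE) for _ in range(N_FRIENDS)]
db.executemany("INSERT INTO friend VALUES (?, ?)", rows)
# Without this index, every expansion below is a full table scan.
db.execute("CREATE INDEX idx_friend_person ON friend (person)")

def extended_friends(start, max_depth):
    """Count people reachable within max_depth hops of `start`."""
    db.executescript("""
        DROP TABLE IF EXISTS seen;
        DROP TABLE IF EXISTS frontier;
        CREATE TEMP TABLE seen (person INTEGER PRIMARY KEY);
        CREATE TEMP TABLE frontier (person INTEGER PRIMARY KEY);
    """)
    db.execute("INSERT INTO seen VALUES (?)", (start,))
    db.execute("INSERT INTO frontier VALUES (?)", (start,))
    for _ in range(max_depth):
        # One indexed join per level: expand frontier, drop known people.
        db.executescript("""
            CREATE TEMP TABLE next AS
            SELECT DISTINCT f.friend AS person
            FROM frontier fr JOIN friend f ON f.person = fr.person
            WHERE f.friend NOT IN (SELECT person FROM seen);
            DELETE FROM frontier;
            INSERT INTO frontier SELECT person FROM next;
            INSERT OR IGNORE INTO seen SELECT person FROM next;
            DROP TABLE next;
        """)
    return db.execute("SELECT COUNT(*) - 1 FROM seen").fetchone()[0]

t0 = time.perf_counter()
found = extended_friends(0, 4)
print(f"depth 4: {found} extended friends "
      f"in {time.perf_counter() - t0:.3f}s")
```

Even this untuned in-memory toy stays far from half-hour territory at depth four, which supports the point that the shape of the RDBMS-side implementation dominates the result.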

So this comparison is most likely incorrect and based on poor RDBMS-side design, although, as the previous answer noted, it is impossible to ascertain without them open-sourcing 100% of their code and table definitions.
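As an aside, the precomputed-paths idea mentioned above is commonly implemented as a "closure table": every ancestor-to-descendant pair of a tree is stored up front, so fetching a whole subtree becomes a single indexed lookup. A minimal sketch in Python with SQLite, using a hypothetical five-node tree (each insert copies the parent's ancestor rows forward):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER)")
db.execute("""CREATE TABLE closure (
    ancestor INTEGER, descendant INTEGER, depth INTEGER,
    PRIMARY KEY (ancestor, descendant))""")

def add_node(node_id, parent):
    db.execute("INSERT INTO node VALUES (?, ?)", (node_id, parent))
    # Every ancestor of the parent is also an ancestor of the new node.
    if parent is not None:
        db.execute("""INSERT INTO closure
            SELECT ancestor, ?, depth + 1
            FROM closure WHERE descendant = ?""", (node_id, parent))
    # Each node is its own ancestor at depth 0.
    db.execute("INSERT INTO closure VALUES (?, ?, 0)", (node_id, node_id))

# Tiny example tree: 0 -> (1, 2), 1 -> (3, 4)
add_node(0, None)
add_node(1, 0); add_node(2, 0)
add_node(3, 1); add_node(4, 1)

# The whole subtree under node 1 is one indexed lookup, no recursion.
subtree = [r[0] for r in db.execute(
    "SELECT descendant FROM closure WHERE ancestor = 1 ORDER BY descendant")]
print(subtree)
```

The space cost is one row per (ancestor, descendant) pair, which is exactly the space-for-time tradeoff described: cheap when most nodes are leaves, expensive in densely connected graphs like a social network.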


Looking at this document called Anatomy of Facebook, I note that the median friend count is 100. Looking at the cumulative distribution plot, I would bet the average is higher, near 200. So 50 does not seem to be the best number here. However, I think this is not the main issue.

The main issue is the lack of information on how the database was used.

It seems reasonable that a data store designed specifically for graph structures would be more efficient than a traditional RDBMS. However, even if RDBMSs are no longer the trendiest choice of data store, these systems have evolved continuously in a race with growing dataset sizes. There are various possible designs, various ways of indexing data, improvements related to concurrency, and so on.

To conclude: regarding reproducibility, I think the study lacks a proper description of how the database schema was designed. I would not expect a relational database to dominate on such queries, but I would expect that, with a well-tuned design, the differences would not be so massive.
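How much the schema design matters is easy to demonstrate: the same lookup goes from a full-table scan to an index search the moment a suitable index exists. A minimal SQLite illustration (table and column names are made up for the example), comparing the query plan before and after adding a covering index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE friend (person INTEGER, friend INTEGER)")
db.executemany("INSERT INTO friend VALUES (?, ?)",
               [(1, 2), (1, 3), (2, 3)])

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output is the readable detail.
    return " | ".join(row[-1]
                      for row in db.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT friend FROM friend WHERE person = 1"
before = plan(query)
db.execute("CREATE INDEX idx_person_friend ON friend (person, friend)")
after = plan(query)
print("before:", before)   # full scan of the friend table
print("after: ", after)    # search using the covering index
```

On a million-row friendship table queried in a loop, the difference between these two plans per expansion step is exactly the kind of gap that can turn a sub-second query into the thirty-second figures quoted in the question.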
