Monday 28 January 2013

Demining the “Join Bomb” with graph queries

For the past couple of months, and even more so since the beer post, people have been asking me a question that I have been struggling to answer myself for quite some time: what is so nice about the graphs? What can you do with a graph database that you could not, or only at great pains, do in a traditional relational database system. Conceptually, everyone understands that this is because of the inherent query power in a graph traversal - but how to make this tangible? How to show this to people in a real and straightforward way?

And then Facebook Graph Search came along, along with it’s many crazy search examples - and it sort of hit me: we need to illustrate this with *queries*. Queries that you would not - or only with a substantial amount of effort - be able to do in traditional database system - and that are trivial in a graph. 

This is what I will be trying to do in this blog post, using an imaginary dataset that was inspired by the Telecommunications industry. You can download the dataset here, but really it is very simple: a number of “general” data elements (countries, languages, cities), a number of “customer” data elements (person, company) and a number of more telecom-related data elements (operators - I actually have the full list of all mobile operators in the countries in the dataset coming from here and here, phones and conference call service providers).

So to start of with: what would this data set look like in a relational model?


What is immediately clear is that there is *so* much overhead in the model. In order to query anything meaningful from this normalised rdbms, you *need* to implement these things called “join tables”. And really: these things stink. It is just an example of what a poor match the relational model is to the real world - and the complexity it introduces when you start using it. 

Compare this to the elegance of the graph model:


It is such a good match to reality - it is just great. But the beauty is not just in the model - it’s in what you do with the model, in the queries.

So let’s how we could ask some very interesting, Facebook style, queries of this model:
  • Find person in London who speaks more than one language and who owns an iPhone5
  • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator
  • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator in the city that he lives in
  • Find a personnot living in the Germany, who roams to more than 2 countries and who emails peoplewho live in London
  • Find a person who roams to more than 2 countries and who lives in the USA and uses aConferenceCallService there

These are all very realistic queries that could serve real business purposes (pattern recognition, recommendation engines, fraud detection, new product suggestions, etc), and that would be terribly ugly to implement in a traditional system, and surprisingly elegant on a graph. To do that, we would use our favourite graph query language, Cypher, to describe our patterns and get the data out.

So let’s explore a couple of examples with some real world queries.


The first thing to realise here is the relevance of a very important concept in the world of databases, and more specifically so in the world of Graph Databases - the use of Indexes. In a traditional database, indexes are expensive but indispensable tools to quickly find the appropriate records in a table using a “key”. And when joining two tables, the indexes on both tables would need to be scanned completely and recursively to find *all* the data elements fitting the query criteria. This is why “joins” are so expensive computationally - and this is also why graph queries are so incredibly fast for join-intensive requests. The thing is, that in a graph database, you *only* use the index on the data *once*, at the start of the query - to find your starting points of the “traversals”. Once you have the starting points, you can just “walk the network” and find the next data element by hopping along the edges/relationships, and NOT using any indexes. This is what we call “index-free adjacency” - and it is a fundamental concept in understanding graph traversals.

In the example below you can see that we are using three index lookups (depicted in green, and I even added a nice little parachute symbol to illustrate what we are doing here) to “parachute” or land into the network, and start walking the graph from there.


The query above is to look for a city where someone from Neo Technology lives that speaks English and has Three as his operator in the city that he lives in.

// These are the three parachutes, landing by doing an index lookup for nodes using the node_auto_index of Neo4j.
START
  neo=node:node_auto_index(name="Neo Technology"),
  english=node:node_auto_index(name="English"),
  three=node:node_auto_index(name="3")

// Here we describe the pattern that we are looking for. From the three starting points, we are looking for a city that has very specific, directed relationships that need to match this pattern.
MATCH
  (person)-[:LIVES_IN]->(city)-[:LOCATED_IN]->(country),
  (person)-[:HAS_AS_HOME_OPERATOR]->(three)-[:OPERATES_IN]->(country),
  (person)-[:SPEAKS]->(english),
  (person)-[:WORKS_FOR]->(neo)

// We return the city’s name and the person’s name as a result set from this query.
RETURN city.nameperson.name

// and order them by the name of the city
ORDER BY city.name;


And here’s another example:


Here are are looking for two people in the same countries but on different home operators that call, mail or text each other.

// Here we use just one index lookup to find a “country” and then we start looking for the pattern.
START 
  country=node:node_auto_index(name="Country")

// The pattern in this case is quite a complex one, we quite a few hops on the different relationship types.
MATCH
  (samecountry)-[:IS_A]->(country),
  (person)-[:LIVES_IN]-()-[:LOCATED_IN]-(samecountry),
  (otherperson)-[:LIVES_IN]-()-[:LOCATED_IN]-(samecountry),
  (person)-[:HAS_AS_HOME_OPERATOR]->(operator),
  (otherperson)-[:HAS_AS_HOME_OPERATOR]->(otheroperator)

      
// Here we limit the results to a specific condition that has to be applied.
WHERE 
  (otherperson)-[:CALLS|TEXTS|EMAILS]-(person)
  AND operator <> otheroperator

// And here we return the distinct set of name of the person and the countries’ name.
RETURN DISTINCT person.namesamecountry.name;


I am hoping that you can see that these kinds of queries, which directly address the nodes and relationships rather than going through endless join-tables, is a much cleaner way to pull this kind of data from the database.

The nice thing about this way of querying is that in principle, its performance is extremely scalable and constant: we will not suffer the typical performance degradation that relational databases suffer when doing lots of joins over very long tables. The reason for this is simple: because we only use the indexes to find the starting points, and because the other “joins” will be done implicitly by following the relationships from node to node, we can actually know that performance will remain constant as the dataset grows. Finding the starting point may slow down (a bit) as the index grows, but exploring the network will typically not - as we know that not everything will be connected to everything, and the things unconnected to the starting nodes will simply “not exist” from the perspective of the running query. 

Obviously there a lot more things to say about graph queries, but with these simple examples above, I hope to have given you a nice introduction as to where exactly the power of graph traversals is - and it’s in these complex, join-intensive queries.

Yours sincerely,

Rik Van Bruggen.

5 comments:

  1. Thanks Rik for the video and blog. New to Neo4j and loving it.

    ReplyDelete
    Replies
    1. Super great to hear. Just reach out if you need anything else.

      Delete
  2. Rik, Thank you for the commentary. Especially useful resource for conceptual understanding before I dive into the primary literature. The images in the blog post are currently broken. Could you please repair the links? -Mark

    ReplyDelete
    Replies
    1. Fixed. See also http://neo4j.com/blog/demining-the-join-bomb-with-graph-queries/ ...

      Delete