Friday, 21 April 2017

Autocompleting Neo4j - part 2/4 of a Googly Q&A

So in the previous post, I explained my plan of doing a series of blogposts around the most frequently asked Google questions as recorded and suggested by Google's Autocomplete feature.
We'll start this week with the most asked question of all - which I get all the time from users and customers - and it's the inevitable "scale" question. Let's do this.

1. Does Neo4j Scale

Let's start at the beginning, with the first question that lots of people ask: "Does Neo4j scale?" Interesting. That should not surprise anyone in an age of "big data", right? Let's tackle that one.

To me, this is one of the trickiest and most difficult things to answer - for the simple reason that "to scale" can mean many different things to many different people. However, I think there are a couple of distinct things that people mean with the question - at least that's my experience. So let's try to go through those - noting that this is by no means an exhaustive discussion on "scalability" - just my 0,02 Euros.

a. Scale = lots of data volume

Size does matter - right? Or does it? Most graph applications are not really about volume - they are more about complexity and connectedness, and that's probably still the sweet spot for Neo4j. But: big data is here, and it's here to stay - so what do I do if I really do have a LOT of data and I want to use Neo4j to master all of it... Can I do it?

Well: there used to be a time when Neo4j had some (admittedly, very high - but still) hard limits with regards to the maximum numbers of nodes and relationships that the database could ever store - even if you had the biggest box on this blue earth to run it on. There were - and there still are - very good reasons for those limits: Neo needed to find a balance between the size of the addressing space and the overhead that it represented, and put a stake in the ground with regards to that compromise. This meant that there had to be an upper limit: every Neo4j database would have a certain maximum amount of addressable data.

In version 3 of the Enterprise version of Neo4j, however, my colleagues were able to remove those upper limits (or actually: make them ridiculously high so that you could store all the atoms in the universe in Neo4j if you really wanted to) and still preserve all of the good purposes that these limits served. So today, I think this aspect of "scale" is a non-issue for Neo4j users: you can store as much data in a Neo4j machine as your hardware will allow. And there's hardware out there that can fit *a LOT* of data.
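To make that trade-off between addressing space and overhead a bit more tangible, here is a small back-of-the-envelope sketch in Python. The bit widths and record counts are illustrative assumptions of mine, not a description of Neo4j's actual store format. The only point is that every extra bit in a record ID doubles what you can address, but also has to be paid for in every pointer you store.

```python
def addressable_records(id_bits: int) -> int:
    """Number of distinct records that an ID of `id_bits` bits can address."""
    return 2 ** id_bits


def pointer_overhead_bytes(id_bits: int, pointers_per_record: int, records: int) -> int:
    """Bytes spent purely on pointers, rounding each ID up to whole bytes.

    Real store formats pack bits more cleverly; this is a naive estimate.
    """
    bytes_per_pointer = (id_bits + 7) // 8
    return bytes_per_pointer * pointers_per_record * records


# Hypothetical ID widths, chosen only for illustration.
for bits in (32, 35, 48, 64):
    print(f"{bits}-bit IDs can address {addressable_records(bits):,} records")

# Assumed workload: one billion records holding four pointers each.
for bits in (35, 64):
    gib = pointer_overhead_bytes(bits, 4, 10**9) / 2**30
    print(f"{bits}-bit IDs cost roughly {gib:.1f} GiB in pointers alone")
```

Wider IDs buy you more room, narrower IDs keep the records small: that is the compromise where a stake had to be put in the ground.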

b. Scale = lots of reads

If, however, what you mean with "scale" is something totally different, specifically the ability to serve lots and lots of clients that need to read data from the graph database, then Neo4j really has a great solution for you. With the new causal clustering architecture that was added in 3.1 and refined in the upcoming 3.2, Neo4j Enterprise has some amazing features to really let you scale out your Neo4j servers, and handle virtually unlimited amounts of read operations. The clustering architecture splits your Neo4j servers into two groups:
  • core servers: dedicated to safeguarding your data, making sure that data can never be written unsafely
  • edge servers: dedicated to enabling you to have lots of clients read from the cluster, and to letting you add more replicas as you, your application and its read query volume grow.
This architecture really is a very modern and state-of-the-art way of making sure that the database *cannot lose data*, and at the same time, maintains its operational efficiency and scaling-out characteristics.
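The division of labour between the two groups can be sketched as a toy router in Python. To be clear: this is not the Neo4j driver API, the server names are made up, and treating the first core server as the one that takes the writes is a simplifying assumption. It only shows why adding edge servers adds read capacity without touching the write path.

```python
from itertools import cycle


class ToyClusterRouter:
    """Toy model: writes go to the core group, reads rotate over the edge servers."""

    def __init__(self, core_servers, edge_servers):
        if not core_servers:
            raise ValueError("at least one core server is required")
        # Assumption for this sketch: the first core server accepts the writes.
        self.write_target = core_servers[0]
        # Fall back to the core servers for reads if there are no edge servers.
        self._readers = cycle(edge_servers or core_servers)

    def route(self, is_write: bool) -> str:
        """Return the name of the server that should handle the next operation."""
        return self.write_target if is_write else next(self._readers)


router = ToyClusterRouter(["core-1", "core-2", "core-3"], ["edge-1", "edge-2"])
reads = [router.route(is_write=False) for _ in range(4)]
write = router.route(is_write=True)
print("reads went to:", reads)
print("write went to:", write)
```

In this model, growing read volume is handled by passing a longer list of edge servers, while the core group stays exactly as it was.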

c. Scale = lots of writes

Hand on heart, I would still claim that Neo4j has a very sound approach and architecture for scaling out write operations. Sure, we can still improve stuff (e.g. memory management, batching of writes, etc.) but really we have a very performant and optimized write channel ... to the graph. Which is kind of a fundamental point, right: we are writing to a connected structure, to a graph - which means that when we add data to the database, we need to not just write the data - but also connect it up to the rest of the graph. Fundamentally, we need to do more work at write time than the average NoSQL store - which we readily accept because we believe that we will win that time back at read time.

But as always, it's still a good idea to really understand the workload: do you really need to connect the data (or is it enough to just store the exceptional connections)? Can you batch the write operations? Are you trying to do too much work in one go for this machine? It does not hurt to be mindful of that - and that will allow us to scale writes just fine on Neo4j. You still have to know what you're doing though - like with any complex system. If you don't, then don't be surprised to hear a big BOOMing noise.
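As an illustration of the "can you batch the write operations?" question, here is a minimal Python sketch. It only does the chunking, and the Cypher statement is shown as a plain string: the `Person` label, the `KNOWS` relationship type and the `src`/`dst` keys are hypothetical names that I made up for the example, and the batch size is something you would tune for your own machine.

```python
from typing import Dict, Iterable, Iterator, List

# One statement handles a whole list of rows, passed in as the `rows` parameter.
# Label, relationship type and property names are hypothetical.
BATCH_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (a:Person {id: row.src}) "
    "MERGE (b:Person {id: row.dst}) "
    "MERGE (a)-[:KNOWS]->(b)"
)


def batches(rows: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `size` rows, so that each list can be one transaction."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


rows = [{"src": i, "dst": i + 1} for i in range(10)]
for chunk in batches(rows, size=4):
    # In a real application, this is where BATCH_QUERY would be sent to the
    # database with {"rows": chunk} as parameters, in a single transaction.
    print(len(chunk), "rows in this transaction")
```

The idea is simply to avoid both extremes: one transaction per row is a lot of overhead, and one gigantic transaction for everything is the "too much work in one go" scenario.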

d. Scale = lots of complex real-time queries

Then of course there's another definition of scale, which would be right up Neo4j's alley. If what you mean with scale is that you want your database to be running lots of complex real-time queries, then oh yeah - that's what Neo4j loves to do best.

The key words here in my book are:
  • complex queries: Neo4j is a graph database. Graph databases are good for highly connected domains, and running queries over these domains that require lots of different entity types to be evaluated. That's what makes a query complex. In a traditional database that would require your database to do lots of complex join calculations - and guess what: Neo4j does not need to do those. More complexity <> more joins in Neo4j. It's way better at executing these types of workloads.
  • real-time queries: Neo4j is a graph database, which means that it is really geared to answer your requests immediately, as in, between a web-request and a web-response. The query is executed immediately, in real-time, and brings back the results immediately. It's not really geared to analytical workloads - although it can of course be used for some and can actually be quite efficient at them.
The more complexity, and the more real-time the query patterns of your use case are, the better the "scalability" characteristics of Neo4j will compare to other - no doubt solid and dependable - database infrastructures.
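To show what "no joins" means in practice, here is a toy traversal in Python. It is a sketch of the general idea only, not of Neo4j's internals: the little social graph is made up, and every node simply holds direct references to its neighbours, so each hop is a lookup from the node you are standing on rather than a join over whole tables.

```python
from collections import deque

# Hypothetical social graph, stored as adjacency lists.
FRIENDS = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan", "eve"],
    "dan": ["bob", "cat", "fay"],
    "eve": ["cat"],
    "fay": ["dan"],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map every reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start node is not part of the answer
    return distance


print(within_hops(FRIENDS, "ann", 2))
```

Going one level deeper only means expanding the nodes at the edge of what you already found, which is why the work tracks the part of the graph you touch rather than the total amount of data stored.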

e. Scale = sharding

Finally, I would like to address one of the most common - but in many ways also one of the most bizarre - definitions of scale. I cannot leave it out of this blogpost, as I get the question every other day from lots of smart people. So let's talk about it.

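First, let's try to define sharding. As always, we go to a wikipedia page where a lot of very solid info is posted, the summary being that a shard is a horizontal partition of the data in a database, with each shard held on a separate database server instance in order to spread the load.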

So in my own words, the question of sharding basically means: "does Neo4j automagically distribute the data in the graph over an arbitrary number of commodity hardware machines?", and to that, the simple and straightforward answer is of course that it doesn't. Neo4j does not shard data automatically - there, I've said it.

But let's try to move beyond the simplicities of that question, and try to have a real answer. I'll do so in a couple of points:
  1. When you say sharding, you effectively say that you want to put different parts of the connected graph structure on different machines. Of course, you can do that manually today (Neo4j does not mind it at all - manual sharding is a fine and common usage pattern for it), but the question is how would you automatically decide where to "cut" the graph? How do you partition the graph? How do you decide which part of the graph goes to which machine? That, my friends, is a VERY tough - if not impossible and NP-hard - problem to solve in the generic case. Really, what you need is some more intelligence about the domain model that you are trying to store in Neo4j, and then you may be able to make a more informed, semi-automatic, decision on what data to put where. And even if you do that, it still remains a very arbitrary decision to take - and you will likely be taking it at the expense of deep traversal power. Deep traversals - the sweet spot query pattern for Neo4j - will at some point inevitably hit a machine boundary and start slowing down, dramatically. It's just natural. Is that really what you want?
  2. When you say sharding, you effectively mean that you want to have a fully distributable database that can have data on lots of different machines and still behave as one system, right? Well - how do you want to do transactions then? We have discussed before that we really believe that transactions are essential to graph database management systems because corruption tends to spread in a graph... so do you really want us to give that up just to get that sharding thing that you talk about?

    Not to beat a dead horse - but the philosophy of Neo4j is all about the fact that transactions are important, and distributed transactions are very hard. Only Google has really been able to implement something vaguely similar to it - with their cloud-based Spanner service - and we all know that they have hardware, networking, datacenter, and other skills that average organisations just don't have access to.
For these two reasons, Neo4j has essentially always opted to remain "single image", so that all the members of a Neo4j Causal Cluster are guaranteed to have the same data AND to be transactional. It still seems to us that that is the best thing to do in today's computer science environment. One day we may change that - and one day we may evolve the Causal Cluster into a fully distributable database - but that day is not here yet. And for all the use cases that Neo4j excels at today (which all require fast, deep traversals in real time) it remains to be seen if it really would be the best option.
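The first of those two points, about where to "cut" the graph, can be made concrete with a small Python sketch. It uses the most naive automatic strategy I can think of, hashing the node id to pick a machine, on a made-up graph. This is an illustration of the problem, not a description of anything Neo4j does: the function names, the graph and the machine counts are all assumptions for the example.

```python
import zlib


def shard_of(node: str, machines: int) -> int:
    """Naive hash partitioning: a stable hash of the node id picks the machine."""
    return zlib.crc32(node.encode("utf-8")) % machines


def cut_edges(edges, machines):
    """Edges whose two endpoints end up on different machines."""
    return [(a, b) for a, b in edges if shard_of(a, machines) != shard_of(b, machines)]


def network_hops(path, machines):
    """How often a traversal along `path` has to cross a machine boundary."""
    return sum(
        1 for a, b in zip(path, path[1:]) if shard_of(a, machines) != shard_of(b, machines)
    )


# Hypothetical graph and traversal path.
EDGES = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
    ("cat", "dan"), ("cat", "eve"), ("dan", "fay"),
]
PATH = ["ann", "cat", "dan", "fay"]

for machines in (1, 2, 4):
    crossing = len(cut_edges(EDGES, machines))
    print(f"{machines} machine(s): {crossing} of {len(EDGES)} edges cross a boundary, "
          f"{network_hops(PATH, machines)} network hop(s) on the sample path")
```

A hash knows nothing about the domain, so it has no way of keeping connected nodes together, and every relationship that crosses a boundary turns a cheap local hop into a network round trip. Doing better requires knowledge of the domain model, which is exactly the point made above.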

So, in each of these perspectives, I think we can claim that Neo4j does in fact scale very well. Of course it does not do everything, of course it will suck at some things and excel at others - show me a system that does not have to deal with these trade-offs.

But if I were to summarize the answer to the autocomplete question: Does Neo4j scale? Yes, it definitely does. We have the users and the customers to prove it.

In the next post, we'll continue the Autocomplete questions with more answers - that will hopefully not take such a long-winded answer :)

As always, feedback very welcome.

