Here's the transcript of our conversation:
RVB: 00:02.563 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology. And here I am again recording the second podcast of this year. I know it's only two months into the year, so I've been slacking, but I [laughter]--AJ: 00:15.346 You've been picking up the pace again.RVB: 00:16.082 Yeah, picking up the pace again. And for the second episode, I have invited a returning guest to our podcast, and that's my friend and colleague, Alistair Jones, from the Neo Technology engineering team. Hi, Alistair.AJ: 00:28.380 Hi, Rik.RVB: 00:29.089 Hey, thank you for making the time. I know you're a busy man these days, so thanks for taking the time. Alistair, the reason why I invited you back is because I know you've been hard at work in the engineering team, on some of the really big, new features in Neo4j. 3.1 was released at GraphConnect San Francisco last year. Or, no, it was actually announced and was released a little bit later, but one of the biggest new features in Neo4j 3.1 was the new clustering architecture, right?AJ: 01:03.036 Yep.RVB: 01:03.119 And that was what you and your team were working on?AJ: 01:05.054 Yeah, it was a really big thing for us, actually. So I've been working on this area for nearly two years, actually, on this new clustering architecture. And as you know, Neo4J is a clustered database designed to run over multiple servers. And we've had clustering in place for six or seven years in Neo. This is the biggest change we've ever made, by miles. It's a huge, huge upgrade of all the technology around the clustering.RVB: 01:40.760 Wow. I remember like in version 1.8 it was like Zookeeper that was doing some of the work.AJ: 01:46.629 Yeah, we had a small change in the 1.9 release back in the day.RVB: 01:54.312 Back in the day, yes.AJ: 01:55.544 This is a much bigger release in 3.1.RVB: 02:00.387 So what's it all about?AJ: 02:01.882 So the first part of it is getting up to date. So the world around us has moved on, and one of the great things about Neo is that we can take research from academia and actually apply it. So reasonably recent stuff that, if you read the academic papers and blog about it, we read all those, and some of those things we can put them fairly quickly into the product. So, for us, this time, it was doing the Raft protocol, which is a consensus algorithm. So what that means is getting agreement between participants, so computers in this case. We--
NOTE: Diego reacted to this part of the podcast with a super-cool tweet:RVB: 02:52.316 Members of the cluster, right?AJ: 02:53.396 Yeah. So, in this case, it can be different services in the class are getting consensus between those servers, when the servers themselves and the communication between servers is potentially unreliable. So you need to account for the unreliability in the design. Now, we know a little bit about consensus algorithms because previously, back in that 1.9 release, we implemented Paxos. And, at the time, that was the kind-of state-of-the-art thing to do. Raft, you could argue, at some theoretical level, is the same thing, but it's much more clearly structured. Raft is--RVB: 03:38.960 You mean a consensus protocol, right?AJ: 03:40.209 Yeah, yeah, exactly. So it's from Diego Ongaro, who's the lead researcher in this area, and it's really impressive how it's described.
@rvanbruggen @apcj nice. I found an error though: I'm not "the leading" anything.— Diego Ongaro (@ongardie) March 17, 2017
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!It's actually aimed to be simple to understand and to explain. And that makes it really good to implement because you can be very clear about what you've done. You can see the direction that you've gone in. So we've changed from one consensus algorithm to another.RVB: 04:13.411 Yep. Which is a big change [crosstalk].AJ: 04:14.871 Which is a big change, but architecturally it's totally different, because previously we were using Paxos to agree on membership of the cluster. So actually a very small amount of data. Not that many servers. They don't go that often. Now what we're doing is we're using Raft, and we're using it for every single transaction in the database. So every single node, relationship, property you create in the database it goes through the Raft protocol. You've got consensus across the cluster. And what that means is that every single change is agreed to by a majority of the cluster, so no matter what happens in terms of loss of connectivity or failure of the minority of the servers, still, the cluster as a whole agrees on what the state is as you move forward, so--RVB: 05:06.995 Sounds a bit like open heart surgery to me.AJ: 05:09.230 Yeah, it's quite a major change, but it's actually really nice. Once you've got that super solid foundation, you can build a whole load of things on top of it. So it's extremely solid for-- it's like the most reliable we could make it, and it stores every single transaction in this replicated log across all members of the cluster. And also as the membership changes, that's agreed to with protocol as well, so you know every time who the people were, who the servers were. People were allowed to [inaudible] transactions and to get them committed. So the whole thing's very tightly integrated into the core of the clustering.RVB: 05:52.520 So I would never claim that I understand everything about it, but what I've read is that it's very different architecturally in terms of-- previously we had masters and slaves, now we talk about cores and edges, right?AJ: 06:05.636 Yeah. The second part of this is that we were aiming to have much larger clusters than people had previously been running in Neo. Neo's been around for a long time. And, previously, people used to think of having 3, 5, 10 servers being a large database cluster. Now people want to run hundreds of servers, and we have customers and users running 200 servers in a database cluster. We want to be able to get higher than that, and the consensus algorithm that we were using before, the design of it, or perhaps the membership, yeah, it had a sort of limit on the-- or do we say kind of--? It was hard to get to that scale.AJ: 06:55.530 And the reason is that all of the servers had to be aware of each other and what they were doing at any stage to basically make sure that they hadn't disappeared. So that led to heartbeats going from every server to every other server, and that ultimately gets very expensive when you have a large number of servers. It also gets very difficult when you're committing across the majority of the servers because you have to wait for a large number of them to come back before you can say, "Yes, this is now safely committed."AJ: 07:30.229 So just having one huge cluster of Raft servers is not a good design for that kind of hundreds of servers or thousands of servers. So we came up with a new architecture. And what we do now is we divide the cluster into two groups. We mark some of the servers as being in what we call a call. Call servers participate in Raft and they are about safety. They're about storing your data durably. Secondly, we have a lot of potentially much larger group of read replicas. And these are servers that are for running your queries on, and--RVB: 08:14.895 Read queries, not write queries.AJ: 08:16.298 Yeah, yeah, read queries. So you don't have to worry about safety here, and the idea is these are about-- they're disposable, where you can scale them up and down; when your web traffic is high a certain time of day, have more and more of them.RVB: 08:30.725 Just have more of them, yep.AJ: 08:30.721 [inaudible] your cloud instances when it's quieter, and you can adapt to the shape of your traffic with the read replicas. What's interesting is that the name read is that we're doing more service than reading. Why does that make sense in a--? How does that help you in a database, have more read only things? Surely you need them more so to write. Well, that's because of the shape of graph data. It's because, actually, when we look at the-- I'll show you, because it's a nice slide [laughter] with audio only. You're looking at a slide that shows kind of how we see people do stuff with graphs, and what you notice is that the right [inaudible] updates tend to be quite small.RVB: 09:17.172 Local [crosstalk]?AJ: 09:17.605 Yeah, very, very local. Like, two or three nodes in relationships, up to maybe 100 things in a transaction, whereas on the read side - the whole point of graphs is to really fast, and people go a long way - they traverse along the graph in a read transaction. So they're doing hundreds of thousands of relationships in one transaction. Now, that's very fast, but it still takes resources. It takes memory bandwidths, it takes CPU to run these queries. And that's what people are really hammering their graph with, thousands of these, each very big, queries. And that's an enormous amount of computational load. We want to spread that across a lot of servers, and this is a way to do it - have loads of re-replicas that can handle that traffic for you. So it is really helping you in the kind of [inaudible] applications. It's a very specific architecture to the type of system that we're building.RVB: 10:13.028 Pretty cool. And so, as I understand it, the core is-- so they're all about the safety, and about writing to the graphs, and the age servers are all about reading. Is there any downside to this? Is this good news show all the way around, or are there some things that we should take care with?AJ: 10:34.656 So there's one thing that's just like-- a challenge here for people when they're deploying these type of applications, is that the transaction's being pushed out from the core, out to the B replicates, and there's some delay in that happening. It's very small, but there is some delay. So people call this eventual consistency, and this is something that we're aware of. And lots of modern sort of web systems that you get into this kind of eventual consistency situation. An example of this that could kind of catch you out is, say you're a user, you create an account, or you make a booking, that's a right transaction. It updates the graph. Then when you come to refresh your page, you try another operation and it's a read only operation, maybe you hit a read replica that hasn't quite seen your update, so, as a user, it almost appears like the thing you just did has disappeared, like you've gone back in time. There's a bit of a--RVB: 11:45.015 It's [crosstalk] read your own writes problem.AJ: 11:46.385 Yeah, I can't really-- so what we did at the same time as this, is we actually added a whole new feature that became the name of the whole clustering architecture. So this is what I like to call causal clustering, because we added in a feature of causal consistency.RVB: 12:08.446 Tell me more about that, because I don't know what that means [laughter].AJ: 12:10.836 Okay, Rik. So causal consistency. So it's actually something that's been-- again, from research, there's some academic and industry research in this area, but it's not very commonly implemented. There are only a handful of other implementations out there, and what it's about is trying to represent what causally has happened in the user's application. So the cause and effects of the changes that you've made.AJ: 12:45.088 Practically, it's very easy to use. What happens is that when you update the graph or when you touch the graph in any way, the database can give you a bookmark. And this bookmark represents the latest thing that you've changed or the latest thing that you've seen in the database. And then when you make another request to any other server in the cluster, you can supply that bookmark that's saying bookmark, and the database will make sure that it has at least as up-to-date a state as the bookmark represents. So the bookmark is just a little string and it comes back to your database driver into your application code. You can store it in your application server, or you can hold onto it temporarily while you make another inquiry, or you can send it all the way back to the client. You can send it back to your web browser or your mobile device, and route it back, ultimately, to the database.RVB: 13:46.373 So that basically assures that the client of the database always takes into consideration everything that it calls [crosstalk]?AJ: 13:54.484 Yeah, it prevents you from going back in time--RVB: 13:56.751 Ah, yeah, that's it.AJ: 13:56.890 --is what it does. And it supports a totally stateless architecture - everything between the user and the database. The database is storing state. Why should you need to store it anywhere else? So this is [inaudible]. Your sessions, you don't need to worry about sophisticated routing. Just have stateless application servers, pass your bookmark around, and you get causal consistency. That's the idea.RVB: 14:28.017 Wow.AJ: 14:28.699 And we've tried to make this even easier to use by building some of the primitives. The kind of passing backwards and forwards keeping track of things is built into the database drivers. So in 3.0, we introduced--RVB: 14:42.424 BOLT drivers, right?AJ: 14:42.490 Yeah, the BOLT drivers. So they initially supported native language drivers in your [crosstalk]--RVB: 14:48.561 Right. And so the new version of the driver supports this bookmarking--AJ: 14:51.515 Exactly, yeah.RVB: 14:52.721 --and that gives us the causal consistency.AJ: 14:54.338 The causal consistency, yeah. Exactly.RVB: 14:56.476 So let's talk a little bit about the future. What's coming up? What are you working on now, and what keeps you up at night, and [laughter]---?AJ: 15:03.404 Yeah. Well, [crosstalk]. I mean, it's kind of following on logically from where we are now, so the next stage of this is to be-- it's that kind of how people actually deploy this stuff. And these days, not just a cluster of servers that are using it to run a database. It's also servers across multiple data centres and multiple regions around the world. Around the country, all around the world. So that's what the cloud environment's been very easy to do, to have geographic distribution. And we are taking account of that feature in the product, or that server usage in the product. So what we're going to do is make the clustering aware of data centres and how they're organised, and allow the client to give hints about how might be the best way to serve it. So that means that you can do your reads from a server that's very close to you, with a low latency, and you can support fault tolerance across data centres when one of them goes away, or explicitly recover in a disaster recovery zone. All of these different operational scenarios. So--RVB: 16:24.512 Is that something that's coming up in the next couple of versions of Neo4j or--?AJ: 16:27.024 Yeah, yeah. So in the next couple of versions, that's the stuff that's going on. And, again, it's to be seamless all the way through the driver, so you write your application once for Neo4j on your laptop, and then it should move forward [inaudible].RVB: 16:46.117 That's very cool. I have one more question. Don't you miss the visualisation stuff that you were doing before [laughter]?AJ: 16:52.896 Yeah. So I always miss the visualisation. I try to devote my spare time to get back into it every now and then, so--RVB: 17:03.147 Very cool. Well, thank you so much for spending your time, Alistair. I mean, we want to keep these podcasts fairly short, but I'm sure we'll include a bunch of links to the documentation and the blog post that we wrote about this topic. I really appreciate you making the time, and look forward to seeing what's up next.AJ: 17:21.907 Thanks very much.RVB: 17:23.060 Thank you. Bye.
All the best
Rik