Tuesday 26 May 2015

Podcast Interview with Johan Svensson, CTO of Neo Technology

One of the people at Neo4j that is not often on stage, but always there in the background, is our CTO, Johan Svensson. One of the many "silent forces" behind the project, you could call him - and I have gotten to know him as a very knowledgeable and thoughtful person, with a great sense of humor that gets even better as the evening progresses :) ... So here is another great conversation to share with you - hope you enjoy it.

Here's the transcript of our conversation:
RVB: Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and we are recording another podcast again. Yippee. And today I'm on a Skype call with Johan, Johan Svensson, from Neo. Hi Johan. 
JS: Hi, Rik. How are you? 
RVB: I'm very well, and you? The sun is shining over here. 
JS: I'm well, thanks. 
RVB: [chuckles] So Johan-- 
JS: It's snowing in Malmø. 
RVB: Okay. Johan, you've been one of the founders of Neo, right? But lots of people might not know you - would you mind introducing yourself a little bit? 
JS: Sure. As you said, I'm one of the founders of Neo4j and I'm currently the CTO of Neo Technology. I've been working with Neo basically since 2002, I would say, as it's been a long time now, and-- yeah. 
RVB: That is a long time, right? How did it start? How did you guys get started with Neo? 
JS: Me, Emil, and Peter were working at this other company where we were building a content management system, and we had a lot of trouble pushing the data we wanted to store into a relational database. I was mostly working at the-- what we call the-- I think we call it the kernel team, the core team or something, trying to get data in and out of the database. And it turned out that the things we tried to model weren't a very good fit for a relational database, so that's where this new model came from. I was not initially part of all the-- it was mostly Peter and Emil who actually came up with this new model, and then I got started working with it when we tried to build the system that could handle this. 
RVB: Was that really for performance reasons and stuff like that? Or what was the main reason for deciding, "We need something new here"? 
JS: I think it was two-fold. One thing was performance and the other thing was modelling capabilities. The way we solved it in the system before we had Neo was basically to store everything lazily and read everything up into memory on start-up. So my first project, when I started working at the company, was to actually optimise start-up time. It was 4 hours at the time and we got it down to 30 minutes.
RVB: No way [chuckles].
JS: Yeah. But I mean it became clear that the tool we were using was not the right tool, and we had lots of hierarchies. Sometimes the hierarchy could have multiple parents, which makes it a graph. We didn't think of it as a graph back then - we spoke more about networks - and these hierarchies interlinked with each other in various ways. So it became many dimensions and really, really hard to get into rows and columns. 
RVB: How difficult was it for you to create the minimal product, so to speak? Was it months? Was it years? How much time did you spend on the first versions? 
JS: We started, and first we did just a few proof of concept versions that I was not part of, using EJBs, and what we got from that was basically that the model works really, really well - it solves our modelling problems - but probably didn't solve our performance problems. So then we tried to do a new version more directly on top of Postgres, and that still didn't work out for us. And then I had been experimenting on my own, because Java had just released Java NIO, a new way of doing IO. So I had been experimenting with building a native solution for storing graphs, and it turned out that that one performed much, much better. 
RVB: That was when you started to really try to have your own file system format and all those types of things, is that what I'm hearing? 
JS: Exactly, yes. We started building that, I believe in-- was it mid 2002 or maybe early 2002, I can't remember, and then we put the first system in production in 2003.
RVB: And when did it start taking the shape that Neo4j has today, like a database, like a full-on database? When would you say was the first version of Neo as a database?
JS: Well, you could argue that the first version that we put in production was a database. I mean, it had all the requirements, but on the other hand it was very early and built quite fast, so-- it always takes many years before you have a stable database. I actually believe-- what's his name? Curt Monash says that it takes at least five years to build a database. We had lots of problems with it in the beginning, of course, but it was only hosting our own system so we could easily handle that. Then we saw that this was absolutely technology that we could use in other projects, and we could even create a product around it. But there was no real way of doing that back then, because object-oriented databases had just failed, so there was no one challenging the relational databases, so we didn't do that. But then Dynamo came around and things started happening in 2006. That's when we actually spun out the IP into a separate company, and started Neo Technology. 
RVB: That's when we started surfing the NOSQL wave, right [chuckles]? 
JS: Yes, well that came a few years later, I guess, before someone put a name on that. But, yeah. 
RVB: Essentially, yeah. So you mentioned already that it was around performance and modelling, those were the two things. Are there any other things that you think are super great about Neo graph databases today, or why you think that people should be looking at it right now? 
JS: I think it enables people to solve problems that they haven't been able to solve before. Basically, any field you look at today that stores data in an old-fashioned way is not making use of its data. The thing that always comes to mind when I think about this is actually healthcare. I think that we could do a lot of things in that area to help the world, or help doctors make better diagnoses, and so on. There are so many things we can do. The data is already there, we're just not making sense of it. 
RVB: We're not making the connections. 
JS: Exactly. So if we start doing that, we're going to have a much better society, I would say. 
RVB: Yeah, absolutely. So that also sort of brings me to my last question - where is this going? Where do you see Neo as a technology going, but also where do you see the industry going? Any interesting comments about that? 
JS: Yeah. I wouldn't be doing this unless I was convinced that we have something big. I actually think that the majority of data will soon be stored in graph databases. And that soon, that maybe two, three, four, five, ten years, I don't know but I think that's where we are going. And when it comes to technology, I have a lot of things [laughter] on my mind. I don't know how technical you want to get there, but-- 
RVB: Well, just the big things, right. 
JS: We just released 2.2 which is a very nice improvement. I think it's our most solid release so far. It basically lays the foundation for us to do a lot of work that we have wanted to do for a long time but have not been able to do because of old legacy things. Some of the code that I wrote back in 2003 is still there, but it's getting less and less and less. Right now, I would say that we have come to a point where I think we can accelerate a lot of the things we want to build. 2.3 is going to be something that improves both stability and performance over 2.2. Then we have three releases coming that will introduce some great things, specifically around how to interact with the product, but also continuing the internal work that you don't see so much of as a user - the product surface doesn't change that much, but, still, there are going to be a lot of changes, and much of this is actually driven from the hardware level. If you look at how a computer looks today and how it will look tomorrow, that's very different from what it looked like 10 years ago, or 20, 30 years ago, when many of the other databases were designed and built. So, I think we see great things coming. 
RVB: Fantastic. Well, Johan, thank you so much for coming on the podcast. I know there are so many things we could talk about, but we want to keep these fairly short. I really appreciate you making the effort to come online. Thank you so much for doing that. 
JS: You're welcome. Thanks. 
RVB: All right, have a good one. Bye. 
JS: Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best


Sunday 24 May 2015

Cycling Tweets Part 5: querying the CycleTweets

After parts 1 to 4 of this series where we have mostly been loading and tweaking data for the Cycling Tweets, now it's time to have some fun querying this dataset with Cypher.

All of the queries are on Github, but let me just walk you through a couple more interesting ones.

Warming up the "Cypher Muscles"

Let's start with something easy, just to get the "Cypher Muscles" going:
//degree of handles
match (h:Handle)-[:TWEETS]->(t:Tweet)
return h.name, h.realname, count(t)
order by count(t) DESC
limit 10
This query gives us the "degree" (the number of "TWEETS" relationships of a "Handle" node) of the nodes in our dataset, and a first indication of what to look for further on:
Turns out that this is quite interesting. Very few of the "top gun" cyclists seem to be the top Tweeters. The only one that really stands out, I think, is Luca Paolini - the others are basically excellent riders, but not the "top guns". At least not in my opinion/experience of the sport.
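By the way, the degree count itself is easy to sanity-check outside the database. Here's a little Python sketch that does the equivalent of the query above on a handful of made-up (handle, tweet) pairs - the handles and counts are placeholders, not the real data:

```python
from collections import Counter

# hypothetical (handle, tweet_id) pairs standing in for (:Handle)-[:TWEETS]->(:Tweet)
tweets = [
    ("@lucapaolini", "t1"), ("@lucapaolini", "t2"), ("@lucapaolini", "t3"),
    ("@kristoff87", "t4"), ("@kristoff87", "t5"),
    ("@tomboonen1", "t6"),
]

# degree = number of outgoing TWEETS relationships per handle
degree = Counter(handle for handle, _ in tweets)

# top handles, highest degree first (mirrors ORDER BY count(t) DESC LIMIT 10)
top = degree.most_common(10)
print(top)  # [('@lucapaolini', 3), ('@kristoff87', 2), ('@tomboonen1', 1)]
```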

So let's take a look at the #hashtags. Which ones are most mentioned in Tweets?
//most mentioned handles or hashtags
match (h)-[:MENTIONED_IN]->(t:Tweet)
return h.name, labels(h), count(t)
order by count(t) DESC
limit 10
And the result is kind of obvious:
The big races like Paris-Roubaix and the "Ronde van Vlaanderen" (#RVV, or "Tour of Flanders") are the top mentions.

Using the NodeRanks

As you may remember from part 4 of this blogpost series, we used the GraphAware framework to calculate the PageRank of the different nodes in our dataset. So let's take a look at that:
 You immediately see the "big guns" (like Tour de France winners Alberto Contador, Chris Froome, Cadel Evans) pop out. But this does not say a lot - so we want to do a bit more exploration of these top riders and see what they are really connected to.
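For the curious: the NodeRank module computes this incrementally inside the server, but the underlying idea is the textbook PageRank power iteration. A tiny Python sketch on an invented three-handle FOLLOWS graph shows the principle (the handles "a", "b", "c" are made up):

```python
# Minimal power-iteration PageRank on a toy FOLLOWS graph.
# This is just the textbook algorithm, not how the GraphAware module is implemented.
def pagerank(edges, damping=0.85, iterations=50):
    nodes = {n for e in edges for n in e}
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or list(nodes)  # dangling nodes spread their rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

follows = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(follows)
# "c" is followed by two handles, so it should rank highest
print(max(ranks, key=ranks.get))  # c
```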

"Impossible is Nothing" with the power of WITH

To paraphrase my dear friend Jim Webber, "Impossible is Nothing" with Cypher and Neo4j.

Because here's the thing. When I was starting to do this little project, I did not know what I was going to find. I really didn't. Of course I knew a bit about Neo4j, I am a fan of cycling, but still... it was all kind of an experiment, a jump into the unknown. So this is where I fell in love - again - with Neo4j, Cypher, and the way it allows you to interactively and iteratively explore your data - hunting out the hidden insights that you may not have thought of beforehand.

A key tool in this was the "WITH" clause in Cypher. From the manual:
The WITH clause allows query parts to be chained together, piping the results from one to be used as starting points or criteria in the next.
So that means that I can basically iteratively query my dataset, and use the result of one iteration as input for the next iteration. Which is very powerful in my opinion. So here's what I did with the "top ranked" nodes:

  • First I explored which other nodes are connected to these top-ranked ones, using WITH:
//what is connected to the top NodeRanked handles
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)-[r*..2]-()
return h, r
limit 50

This gave me a nice little overview:
However, because I had to "LIMIT" the result, it felt as if I was artificially skewing the view. So let's take a second pass at this.

  • Second, I looked at the labels of the nodes that are directly connected to the single one top ranked node:

//what is connected to the top NodeRanked handles at depth 1
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)--(connected)
return labels(connected), count(connected)
limit 25
And I can do something very similar by just tweaking the query to find out what is connected at depth 2 or 3...

//what is connected to the top NodeRanked handles at depth 3
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)-[*..2]-(connected)
return labels(connected), count(connected)
order by count(connected) DESC
The order of the result is a bit different then:

So that gave me a good feel for the dataset. Again: I think it's mostly this interactive query capability that makes it so interesting.

Betweenness on a subgraph

So then I thought back to some work that I did last year to try and implement Betweenness Centrality in Cypher. The result of that was clearly that it was pretty easy to do, but... that it would be very expensive to do so on a large dataset. I think this would be a prime candidate for another GraphAware component :) ... but let's see if we can use WITH to

  • first find a subgraph of interesting suspect nodes
  • then calculate the betweenness on these suspect nodes
Turns out that this was pretty straightforward. Here's the query:

//betweenness centrality for the top ranked nodes - query using UNWIND
//first we create the subgraph that we want to analyse
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 50
//we store all the nodes of the subgraph in a collection, and pass it to the next query
WITH COLLECT(h) AS handles
//then we unwind this collection TWICE so that we get a product of rows (2500 in total)
UNWIND handles as source
UNWIND handles as target
//and then finally we calculate the betweenness on these rows
MATCH p=allShortestPaths((source)-[:TWEETS|MENTIONED_IN*]-(target))
WHERE id(source) < id(target) and length(p) > 1
UNWIND nodes(p)[1..-1] as n
WITH n.realname as Name, count(*) as betweenness
WHERE Name is not null
RETURN Name, betweenness
ORDER BY betweenness DESC;
Here's the result:

As you can see - this is quite interesting. It's clear that there are a number of lesser known riders that are very "between" the top guns (in terms of PageRank).
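The logic of that Cypher statement - enumerate all shortest paths per pair, count the interior nodes - can be mimicked in a few lines of Python. This is only an illustrative sketch on a made-up four-node path graph, not the production approach:

```python
from collections import deque, Counter
from itertools import combinations

def all_shortest_paths(adj, source, target):
    # BFS that remembers every predecessor lying on a shortest path
    dist, preds = {source: 0}, {source: []}
    queue = deque([source])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                preds[m] = [n]
                queue.append(m)
            elif dist[m] == dist[n] + 1:
                preds[m].append(n)
    if target not in dist:
        return []
    paths = []
    def unwind(node, tail):
        if node == source:
            paths.append([source] + tail)
        else:
            for p in preds[node]:
                unwind(p, [node] + tail)
    unwind(target, [])
    return paths

# toy undirected graph a - b - c - d
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
betweenness = Counter()
for s, t in combinations(adj, 2):
    for path in all_shortest_paths(adj, s, t):
        betweenness.update(path[1:-1])  # interior nodes only, like nodes(p)[1..-1]
print(betweenness.most_common())
```

The two middle nodes "b" and "c" come out on top - they are the ones "between" everything else, just like the lesser known riders above.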

Wrapping it up with some pathfinding

So last but not least, we need to do some pathfinding on this dataset. In my experience, that always gives away some interesting insights.

So let's experiment with two very well known riders, Tom Boonen (former world champ and winner of the Tour of Flanders and Paris Roubaix multiple times) and Alexander Kristoff (this year's winner of the Tour of Flanders). Here's the simple query:
//the link between Boonen and Kristoff
match (h1:Handle {name:"@kristoff87"}), (h2:Handle {realname:"BOONEN Tom"}),
p = allshortestpaths((h2)-[*]-(h1))
return p
The result is:

But then my suspicion is that the Teams that these riders belong to are actually really important. So let's take a look:
//the link between Boonen and Kristoff and their teams
match (h1:Handle {name:"@kristoff87"}), (h2:Handle {realname:"BOONEN Tom"}),
p = allshortestpaths((h2)-[*]-(h1))
with nodes(p) as Nodes
unwind Nodes as Node
match (Node)--(t:Team)
return Node, t
As you can see we are using the same principle as above: WITH ties it all together.
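Under the hood, a query like allshortestpaths boils down to a breadth-first search over the graph. Here's a small Python sketch of that idea; the graph and its connections are fabricated for illustration, not taken from the real dataset:

```python
from collections import deque

# BFS shortest path standing in for allShortestPaths between two handles.
def shortest_path(adj, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists

# invented handle/team graph for the demo
adj = {
    "@kristoff87": ["Team Katusha"],
    "Team Katusha": ["@kristoff87", "@lucapaolini"],
    "@lucapaolini": ["Team Katusha", "@tomboonen1"],
    "@tomboonen1": ["@lucapaolini"],
}
print(shortest_path(adj, "@tomboonen1", "@kristoff87"))
# ['@tomboonen1', '@lucapaolini', 'Team Katusha', '@kristoff87']
```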

That's about it, folks. There are so many other things that I would love to do with this dataset (Community detection is high on my wishlist) - but I think 5 parts to a blogpost series is probably enough :) ...

I guess you could have seen from this series of blogposts that I am a bit into Cycling, and that I enjoy working on this stuff with Neo4j. It's been a lot of fun - and a bit of effort - to get all of this done, but overall... I am pretty happy with the result.

Please let me know what you thought of it too - would love to get feedback.



Thursday 21 May 2015

Cycling Tweets Part 4: Ranking the Nodes

In the previous couple of blogposts in this series (here's part 1, part 2 and part 3 for you), I have explained how I got into the Cycling Twitterverse, how I imported data from a mix of sources (CQ Ranking, TwitterExport, and a Python script talking to the Twitter API), and thereby constructed a really interesting graph around Cycling.

There are so many more things to do with this dataset. But in this post, I want to explore something that I have been wanting to experiment with for a while: the GraphAware Framework. Michal and his team have been doing some really cool stuff with us in the past couple of years, not least the creation of a couple of very nice add-ons/plugins to the Neo4j server.

One of these modules is the "NodeRank" module. This implements the famous "PageRank" algorithm that made Google what it is today.
It does this in a very smart way - and also very unobtrusively, utilising only excess capacity on your Neo4j server. It's really easy to use. All you need to do is

  • drop the runtimes in the Neo4j ./plugins directory
  • activate the runtimes in the Neo4j.properties file that you find in your Neo4j ./conf directory. 
Here's what I added to my server (also available on github):

//Add this to the <your neo4j directory>/conf/neo4j.properties after adding
//graphaware-noderank- and
//graphaware-server-enterprise-all-
//to <your neo4j directory>/plugins directory

com.graphaware.runtime.enabled=true

#NR becomes the module ID:
com.graphaware.module.NR.1=com.graphaware.module.noderank.NodeRankModuleBootstrapper

#optional number of top ranked nodes to remember, the default is 10
com.graphaware.module.NR.maxTopRankNodes=50

#optional damping factor, which is a number p such that a random node will be selected at any step of the algorithm
#with the probability 1-p (as opposed to following a random relationship). The default is 0.85
com.graphaware.module.NR.dampingFactor=0.85

#optional key of the property that gets written to the ranked nodes, default is "nodeRank"
com.graphaware.module.NR.propertyKey=nodeRank

#optionally specify nodes to rank using an expression-based node inclusion policy, default is all business (i.e. non-framework-internal) nodes
com.graphaware.module.NR.node=hasLabel('Handle')

#optionally specify relationships to follow using an expression-based relationship inclusion policy, default is all business (i.e. non-framework-internal) relationships
com.graphaware.module.NR.relationship=isType('FOLLOWS')

#TR becomes the module ID:
com.graphaware.module.TR.2=com.graphaware.module.noderank.NodeRankModuleBootstrapper

#optional number of top ranked nodes to remember, the default is 10
com.graphaware.module.TR.maxTopRankNodes=50

#optional damping factor, which is a number p such that a random node will be selected at any step of the algorithm
#with the probability 1-p (as opposed to following a random relationship). The default is 0.85
com.graphaware.module.TR.dampingFactor=0.85

#optional key of the property that gets written to the ranked nodes, default is "nodeRank"
com.graphaware.module.TR.propertyKey=topicRank

#optionally specify nodes to rank using an expression-based node inclusion policy, default is all business (i.e. non-framework-internal) nodes
com.graphaware.module.TR.node=hasLabel('Hashtag')

#optionally specify relationships to follow using an expression-based relationship inclusion policy, default is all business (i.e. non-framework-internal) relationships
com.graphaware.module.TR.relationship=isType('MENTIONED_IN')
As you can see from the above, I have two instances of the NodeRank module active. 
  1. The first attempts to get a feel for the importance of "Nodes" (in this case, the nodes with label "Handle") by calculating the nodeRank along the "FOLLOWS" relationships. After just half an hour of "ranking" we get a pretty good feel:

    This seems to be confirming - in my humble opinion - some of the more successful riders in April, for sure. But also confirms that the "big names" (Contador, Froome, Cancellara) are attracting their share of Twitter activity no matter what.
  2. The second does the same for the "Topics" (in this case, the nodes with the label "Hashtag") along the "MENTIONED_IN" relationships.

    The classic races are clearly "top of mind" in the Twitterverse! But upon investigation I have also found that there are a lot of confusing #hashtags out there that make it difficult to understand the really important ones. Would love to investigate a bit more there.
Like I said before, the GraphAware framework is really interesting. It gives you the opportunity to make stuff that you could also do in Cypher more easily, faster, and more consistently. I really liked my experience with it.

Hope this was useful for you - as always feedback is very very welcome.



Tuesday 19 May 2015

Podcast Interview with Nigel Small, Neo Technology

Waw. Seems like I have recorded 22 (!) podcast episodes in the past 2 months - that's pretty sweet! So here's another one that will make you smile: great conversation with the inimitable Nigel Small, aka Technige, aka Neonige. You may know Nigel from his work on the superb Python language driver for Neo4j, py2neo. What you may not know is that he was one of the original (co)inventors of Graph Karaoke, and that he is a generally super sweet and smart guy. He's currently working on some super interesting stuff at Neo's engineering team - but let's have him explain that himself:

Here's the transcription of our conversation
RVB: Hello everyone. My name is Rik - Rik Van Bruggen from Neo Technology, and here we are again recording another podcast session. Today I am joined by Nigel Small all the way from the UK. Hi Nigel.
NS: Hello Rik. 
RVB: Hey. 
NS: How are you doing? 
RVB: I'm doing very well, and you? 
NS: Yeah, not too bad. Thank you. 
RVB: The sun is shining over here. I hope it is over there as well. 
NS: It's pretty bright here as well, actually. 
RVB: Fantastic. Nigel, welcome to the podcast. We always talk about a couple short things here. The first thing is, who are you? 
NS: Well, Nigel Small [chuckles]. I joined Neo Technology last year - last August. And that was after being a groupie for about three years prior to that. I built one of the Python drivers, so I've been hanging around the community for some time, gathering users for the driver and gradually getting more and more into the database itself. 
RVB: Absolutely. Well, you know, py2neo is very popular, it seems, right? That's a-- 
NS: It's definitely become a lot more popular than I ever expected. It kind of fell out; it was an accident really but [laughing] it's become reasonably successful. I'm quite pleased. 
RVB: Fantastic. Would you mind telling us a little bit how you got into graphs? And why you got into graphs and, of course, why do you get into py2neo? 
NS: All right. Well, it all started due to Jim Webber. 
RVB: Oh, no. Not Jim again. 
NS: Yes. His name keeps cropping up. I worked with Jim briefly when he was consulting in a previous life, and we stayed in touch, and I remember having a conversation with him at some point about databases, and him telling me that the old relational type of table-based databases were a bit passé and I needed to look at [chuckles] these graph databases. So, having no knowledge really of what these were and no knowledge of graph theory at all - it's not something I'd ever come across - I spent some time looking into it and decided to try to apply it to my family tree, which was a hobby of mine at the time. So I started looking at how I could store some of my family tree data in a graph. Python was the language which I enjoyed using anyway, from a hobby point of view. So I played around with the REST interface, which was quite new at the time, wrote a few bits of Python code to get some data in and out, and ended up getting rather distracted by the mechanism for actually putting data in and out and forgot about the family tree side of it [chuckles]. And ended up developing those bits of code into what's now py2neo. 
RVB: Oh, wow. So it's basically a wrapper around the REST API that you built, right?
NS: Exactly. Yeah. It's-- 
RVB: What's called the language driver. 
NS: Absolutely. One of the early ones. I think because the REST interface was reasonably new at that time, I was one of the early pioneers I think of writing drivers. 
RVB: It's called a guinea pig, Nigel [laughter]. 
NS: [laughter] It's been rewritten several times since to correct a lot of the errors I made in the early days. 
RVB: Oh, okay. So what do you like about working with the graph database? At the first instance, what attracted you? 
NS: I think the fact that it was something different. It was good to get my head into something that was a lot different to anything else I'd used before. I'd worked very heavily with databases for some time - I'd worked as a DBA and programmer for about 15 years prior to that - but had only ever been exposed to standard tables. So it was nice to get my head into something else and see what it was like. It was a challenge to start with, because as I say, I knew no graph theory at all. I didn't really, at first, see quite how this was going to apply to the vast majority of data that I'd ever worked with before, because I was still thinking very much in tables. It took quite some time to undo everything that I already knew and reapply it to graphs. But now I think I'm looking around at most bits of data-- I was recently putting together some political data for a session that I'm doing and realising that it actually fits very, very naturally when you're talking about politicians who belong to a particular party and who stood in a particular election. All of those things are very much objects you can represent as nodes with relationships between them. And a graph now feels very natural for most kinds of data modelling. 
RVB: Yeah. Absolutely. Yeah. To be honest, it's funny that you mention the family tree. Hierarchies are graphs, right? I actually did a family tree of my own one day and I discovered I was Dutch. [laughter] Which was a hilarious meetup presentation actually. 
NS: Was that a good or a bad thing? I don't have much opinion on-- 
RVB: Let's talk about something else [laughter]. 
NS: Okay [laughter]. 
RVB: So let's talk about where is it going, Nigel. I mean, you've been working on some really exciting stuff at Neo. Where do you see graph databases in general and some of the work that you've recently been doing as an engineer at Neo-- where do you see that going? 
NS: Well, the work I've been doing for the past few months has been on what we've loosely branded our new remoting project. Given that I've come in with some knowledge of drivers and the interaction between the client and the server, it's been quite nice to fall into a project that's very closely related to that - to rebuild a lot of the protocol and the client-server capabilities for the database itself. So we're looking-- 
RVB: Is that an alternative to REST then? Or what is-- 
NS: It will be eventually, yes. We're looking at something that's going to be, hopefully, well-- more performant. Something that's much more in the order of magnitude of performance that we see in the embedded database. Ultimately, yes, it will replace a large number of the use cases for the REST interface. I don't know whether we'll end up replacing everything, because there are still some good uses for having an HTTP interface with a very low barrier of entry. But for the vast majority of applications, I think we'll end up using our new protocol. And one of the things I particularly want to do is to try to level out the experience across languages.
Traditionally, Neo's been very Java-centric for obvious reasons. This is where it came from. This is its back-- but coming from a Python world I want to make sure that we've got the same kind of performance capabilities in Python, and then the same in PHP and Ruby and all the other languages that we want to be able to connect to Neo. You almost shouldn't have to know that the underlying database has been written in Java. It doesn't matter what stack you're using - you're going to find that Neo performs blisteringly fast regardless. 
RVB: So that's the first point of evolution, right? Where we're going with the binary protocol like that. That's a big new thing, right? 
NS: Absolutely. 
RVB: Any other new things that you see coming up on the horizon that you think are really exciting? 
NS: There's a lot of work going on with the big graph side of things. So not only are we making access faster, but there's a team working on scalability as well, making sure that we can add new servers and make things perform [chuckles] in a linear way - faster with each server that you add. So I think the capabilities of the platform itself are growing very, very rapidly and I think we're going to see a lot more installs. It's going to become a much more mainstream product than it has been in the past. 
RVB: Yeah, absolutely. Well, thank you so much, Nigel, for taking your time to come on the podcast. It was a pleasure talking to you. 
NS: Thank you. 
RVB: I'm sure everyone will have a chance to meet you at GraphConnect, right? 
NS: Yes. I'm going to be hovering around London on the 6th, 7th and 8th. The 7th is the GraphConnect day, but on the 6th we have an ecosystem day where I'm doing a couple of sessions talking about the new remoting project as well. It will be good to see as many people as possible. 
RVB: Yes, super. Thanks, Nigel. Talk to you soon man. 
NS: Great stuff, Rik. Thanks very much. Bye bye. 
RVB: Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best


Sunday 17 May 2015

Cycling Tweets Part 3: Adding "Friends" to the CyclingGraph

In this 3rd part of this blogpost series about Cycling (you can find part 1 and part 2 in earlier posts) we are going to take the existing Neo4j dataset a bit further. We currently have the CQ Ranking metadata in there, the tweets that we exported all connected up to the riders' handles, and then we analysed the tweets for @handle and #hashtag mentions. We got this:

Now my original goal included having a social graph in there too: the "friends" relationships for different twitterers could be interesting too. Friends are essentially two-way follow-relationships - where two handles follow each other, thereby indicating some kind of closer relationship. It's neatly explained over here. So how to get to those?
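In other words: given plain one-way FOLLOWS relationships, "friends" are the pairs that appear in both directions. A quick Python sketch with made-up handles shows the idea:

```python
# "Friends" as mutual follows: given one-way FOLLOWS edges, keep only the
# pairs that appear in both directions. Handles are invented for the demo.
follows = {
    ("@rider_a", "@rider_b"), ("@rider_b", "@rider_a"),  # mutual -> friends
    ("@rider_a", "@rider_c"),                            # one-way only
}

friends = {tuple(sorted(pair)) for pair in follows if pair[::-1] in follows}
print(friends)  # {('@rider_a', '@rider_b')}
```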

Well, I did some research, and while there are multiple options, my conclusion was that you would really need a script that talks to the Twitter API. And since we also know that IANAP (I Am Not A Programmer), I would probably need a little help from my friends.

Friends to the rescue: my first python script

Turns out that my friend and colleague Mark Needham had already done some work on a very similar topic: he had developed a set of Python scripts that used the Tweepy library for reading from the Twitter API, and Nigel Small's Py2Neo for writing to Neo4j.  So I started looking at these and found them surprisingly easy to follow.

So I took a dive into the deep end, and started to customize Mark's scripts. I actually spent some time going through a great Python course at Codecademy, but really my tweaks to Mark's script could have been done without that too. His original script had two interesting arguments that I decided to morph:

  • --download-all-user-profiles
    I tweaked this one to "download all user friends" from the users.csv file. The new command is below.
  • --import-profiles-into-neo4j
    I tweaked this one to "import all friends into neo4j" from the .json files in the ./friends directory. The new command is also below.
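For illustration, here is a minimal sketch of how two such flags could be wired up with Python's argparse. The flag names match the commands shown in this post, but the dispatch structure is my own assumption, not how Mark's twitter.py is actually organised:

```python
import argparse

# Sketch only: two boolean flags matching the commands shown in this post.
parser = argparse.ArgumentParser(prog="twitter.py")
parser.add_argument("--download-all-user-friends", action="store_true",
                    help="download the friends of every handle in users.csv")
parser.add_argument("--import-friends-into-neo4j", action="store_true",
                    help="import the ./friends/*.json files into Neo4j")

# argparse turns the dashes into underscores on the result object:
args = parser.parse_args(["--download-all-user-friends"])
print(args.download_all_user_friends)  # True
print(args.import_friends_into_neo4j)  # False
```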

I have put my final script over here for you to take a look. In order to use it, you do need to put a few things in place: register an App at Twitter, and generate credentials for the script to work with:

That way, our python script can read stuff directly from the Twitter API. Don't forget to "source" the credentials, as explained on Mark's readme.
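"Sourcing" the credentials simply exports them as environment variables, which the script can then read at startup. A minimal sketch of that pattern, with hypothetical variable names (check Mark's readme for the real ones):

```python
import os

# Hypothetical variable names - the actual ones are defined in Mark's readme.
REQUIRED = ["TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET"]

def read_credentials(env=None):
    """Collect the credentials from the environment, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise SystemExit("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Example with dummy values instead of the real environment:
creds = read_credentials({name: "dummy" for name in REQUIRED})
print(sorted(creds.keys()) == sorted(REQUIRED))  # True
```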

2 new command-line arguments

Mark's script basically uses a number of different command-line arguments to do stuff. I decided to add two arguments. The first argument I added was

python twitter.py --download-all-user-friends 

This one talks to the Twitter API, and downloads the friends of all the users that it found in the users.csv file.  I generated that file based on the CQ ranking spreadsheet that I had created earlier.
As you can see, it pauses when the "rate limit" is reached - this is standard Tweepy functionality. The output is a ./friends directory full of .json files. Here's an example of such a file (tomboonen1.json)
In these .json files there is a "friends" field. Using the second part of the twitter.py script, we can then import these friends into our existing Neo4j CyclingTweets database using the following Cypher statement (note that the {variables} are parameters supplied by the Python script, the rest is pure Cypher):
MATCH (p:Handle {name: '@'+lower({screenName})})
SET p.twitterId = {twitterId}
WITH p
WHERE p is not null
UNWIND {friends} as friendId
MATCH (friend:Handle {twitterId: friendId})
MERGE (p)-[:FOLLOWS]->(friend)
So essentially this finds the screenName (aka the "handle"), adds the twitterId to the screenName, and then adds the "FOLLOWS" relationships between that handle and the friends of that handle. Pretty sweet.
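To make that concrete, here is a small Python sketch of how one of the ./friends/*.json files could be turned into the parameters that the statement above consumes. The field names ("screen_name", "id", "friends") are assumptions based on the tomboonen1.json example, and actually sending the query to Neo4j (e.g. via Py2Neo) is left out:

```python
import json

# Assumed shape of a ./friends/*.json file, modelled on tomboonen1.json.
profile_json = """{
  "screen_name": "tomboonen1",
  "id": 123456789,
  "friends": [111, 222, 333]
}"""

profile = json.loads(profile_json)

# The three parameters the Cypher statement expects, one dict per file:
params = {
    "screenName": profile["screen_name"],
    "twitterId": profile["id"],
    "friends": profile["friends"],
}
print(params["screenName"])    # tomboonen1
print(len(params["friends"]))  # 3
```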

So let's run the script, but this time with a different command line argument, and with a running Neo4j server in the background that the script could talk to:

python twitter.py --import-friends-into-neo4j

After a couple of minutes (if that), this is done, and we have a shiny new graph that includes the FOLLOWS relationships:
This is pretty much what I set out to create in the first place, but thanks to the combination of the import (part 2) and this Python script - I have actually got a whole lot more info in my dataset. Some very cool stuff.

Hope you liked this 3rd part of this blogpost series. There's so much more we could do - so look out for part 4 soon!



Friday 15 May 2015

Podcast Interview with Nicole White, Neo Technology

Here's another fantastic episode of our Neo4j Graph Database Podcast: I had a super nice late-night (for me) conversation with Nicole White, a colleague of mine in our San Mateo office. She is a Data Scientist at Neo, which means she helps us out with a lot of our internal data questions - and develops some fantastic tools for that. She also frequently speaks at conferences and meetups, and writes stuff over here, here (super cool Flask tutorial btw!) and here.

Here's the episode for you:

Here's the transcription of our conversation
RVB: Hello everyone. My name is Rik - Rik Van Bruggen - from Neo Technology, and here I am again recording another episode of our Neo4j Graph Database podcast. With me tonight is - all the way from California - Nicole White, from Neo. Hi, Nicole. 
NW: Hi Rik, how are you? 
RVB: I'm very well. And yourself? 
NW: Very good. 
RVB: Very good. Well, it's late at night for me, it's still afternoon for you, but I thought I'd take the opportunity to talk to you a little bit because-- well, maybe you can explain that, yourself? Who are you, and what do you do with Neo? Do you mind explaining that to our listeners? 
NW: Right. Yeah. My name is Nicole White, I'm a data scientist at Neo4j, and we actually use Neo4j internally to hold all of our data that we collect - marketing, sales, product usage. Particularly with Neo4j, I'm using Neo4j to  perform common data science tasks, but Neo4j is our data storage solution. All of our data sits in one spot, one nice clean spot, and thus it's very easy to answer some of the complicated questions that we weren't able to answer before. Actually, all of our tools that I've built out internally are built on top of Neo4j, which is probably my favorite part about my job - is that I get to use Neo4j, I don't have to touch SQL ever, I just get to write cypher all day long which is super, super fun. So with regards to Neo4j, that's who I am. But I just recently graduated from grad school with a degree in statistics, and before that I got an undergraduate degree in economics and math. Just hailed from Austin, Texas, moved here to California, San Mateo, ten months ago, I think, is when I started. I'm coming up on my first year here at Neo. 
RVB: Okay. Well, this sounds like we're eating our own dog food, right? Using Neo for a-- 
NW: Yes, we are. We actually just upgraded to 2.2. All of our systems were just upgraded to 2.2. 
RVB: Fantastic. How did you get into Neo, Nicole? I mean, you must have started using that at grad school or at university, or how did you get into it? 
NW: Yeah. It was actually the GraphGist Challenge. It was the very first one. I saw it on Twitter. Someone who I was following re-tweeted a Neo4j tweet about the GraphGist challenge, and so I looked at the page and I saw a GraphGist. I think the first GraphGist I saw was something about doctors and prescriptions or something, and I saw Cypher and I was like, "This looks really cool." And, of course, there is an opportunity to make money so I was all about it. I looked at Cypher-- 
RVB: Typical student, right [chuckles]? 
NW: I know, right [chuckles]. At the time that I came across these GraphGists, I was actually working on a project with a flights data set in school - it was the Bureau of Transportation Statistics' data on delayed flights across all US domestic airports. I had that all in an Oracle database - a SQL database - and I was doing just some pretty basic analysis on it for a school project, and then as soon as I saw Neo4j-- as soon as I saw Cypher, I already knew that a lot of my SQL queries would be so much easier in Cypher. I was already seeing that I would prefer to have Neo4j. So I moved it all to Neo4j, and then I also created that GraphGist of the flights, and that's the first data set that I learned Neo4j on and learned Cypher on.
RVB: Yeah, fantastic. So you mentioned that you thought, you know-- 
NW: The GraphGist on. 
RVB: Yeah. You mentioned that you thought that it would be a lot simpler than in SQL. Did that turn out to be true? Is that-- 
NW: Yeah. 
RVB: --one of the things that you like about it, or where does the love for Neo come from? 
NW: That has to be the first thing that hit me, was there are some SQL queries that I really struggled with. There was one - it was so simple to say in English. It was just like,  "I want to see airports that are, by definition, span multiple states." Because some airports are technically-- some over in the DC area, they technically sit in several states somehow, and writing that query in SQL was strangely hard. I had to use a partition by and something weird. I remember it was that query specifically, and when I saw Cypher, I was like, "That's going to be super easy," and it was. I took a SQL query that was probably, like, 20 lines and really hard to read, and put it into Cypher. And that's what I love about Neo4j, is that you can take a question that you've posed in English and very easily translate it into Cypher and vice versa. Like, I can take a Cypher query and then translate it back to English very easily, even if it's a data set I've never seen before, a Cypher query I've never seen before. I can easily scan through it and say, "This is what they're doing," in English, whereas when someone sends you a SQL query and that-- particularly with a data set you haven't worked with, translating it back to English is really hard. Just trying to scan everything that's going on in the cycle-- or in the SQL query, I think is very difficult. So I think from a collaboration standpoint, Neo4j is super awesome. Because I got a few of my classmates to work with me on this so I'm putting all the flights data into Neo4j, and just collaborating across queries was much simpler because we could understand what-- 
RVB: Because of the readability, yeah. 
NW: The readability is just a huge factor for me, and I think that's probably what I love most about Neo4j, is Cypher, I would have to say [chuckles]. 
RVB: Yeah, very cool. You mentioned earlier that you were using it for data science and I believe you're also doing a lot of talks on integrating Neo with R right, with the R project. Can you tell us a little bit more about that maybe? 
NW: Yeah, so I wrote the R driver for Neo4j. It's called RNeo4j, and I use that internally here a lot as well, in addition to Python. Python and Neo4j do all the heavy work, and then any reporting, or analysis, or charting visualization stuff, I'll spin up my R driver and pull Neo4j data into R for fancier statistics stuff, which we've been doing recently for some new projects that we've just started here at work. The R driver, essentially, is just a wrapper for the REST API and it will pull Cypher query results into your R environment very easily, and then you can-- and that opens up a lot of doors for analysis purposes. But yeah, I've been doing a lot of talks around that. I do a meet-up on the R driver probably like once every couple of months here in the Bay Area-- 
RVB: You should do that in Europe, Nicole. I mean-- 
NW: I should [laughter]. I'll be in Europe soon. I'll be there for a GraphConnect London, so I'll probably do something with Mark while I'm over there, because he uses the R driver probably more than I do. If you look at his blog [chuckles]-- 
RVB: Yeah, exactly. So maybe wrapping up, Nicole, where do you think this is going? Where do you hope, where do you want it to go, and where do you think it will go? The evolution of graph databases is so quick these days-- but what do you think [chuckles] is coming at us right now? 
NW: I just think we get to look forward to just a huge improvement in user experience from a user standpoint. I've been a user of Neo4j for a little bit over a year now and it's just crazy how quickly they improved the user experience. Just from 1.9, I think, is when I first saw it, 2.0 was huge, just the Neo4j browser has gotten so much better with the 2.2 release. 
RVB: Absolutely, yeah. 
NW: There's just so many nice, convenient-- like they're subtle, the changes are subtle, but when you're a super-heavy user of the Neo4j, they really stand out. They've made some subtle changes to the Neo4j browser that I really like. I think, I'm mostly looking forward to the huge improvements in user experience that are most likely continuing to come. Also, I think from my standpoint as well, the whole import process for Neo4j is going to continue getting more awesome because I feel like import was our biggest weakness when I first encountered Neo4j. We didn't have a lot of really easy-to-use tools. And within a year, now we have load CSV, which is really easy, and then we have the 2.2 import tool, which is really easy and super-fast. I feel like that's also what I'm looking forward to, is continued improvements on the import part of Neo4j, because that's the first part you're going to encounter, right? As a new user, the first thing you're going to do is import your data, so I'm really happy that we've been putting so much work into that part. The whole import experience has gotten much better. 
RVB: I could not agree more [chuckles]. As you know, I'm in sales, and I know how important this is. Very good, thank you so much Nicole for coming online and doing this recording with me, I really appreciate it. 
NW: Thanks for having me. 
RVB: It was great having you on the podcast and I really appreciate it. Thank you again, and yeah, I look forward to seeing you at GraphConnect. 
NW: Yeah, I look forward to it as well. Have a good one. 
RVB: See you, bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Thursday 14 May 2015

Cycling Tweets Part 2: Importing into Neo4j

So after completing the first part of this blogpost series, I had put together a bit of infrastructure to easily import data into Neo4j. All the stuff was now in CSV files and ready to go:
So I got out my shiny new Neo4j 2.2.1, and started using Load CSV for getting the data in there. Essentially there were three steps:

  • Importing the metadata about the riders and their twitter handles
  • Importing the actual tweets
  • Processing the actual tweets
So let's go through this one by one. We will be using the following model to do so:

1. Importing the Cycling metadata into Neo4j

I wrote a couple of Cypher statements to import the data from CQ ranking:

//add some metadata
//country info
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1390098748" as csv
create (c:Country {code: csv.Country, name: csv.FullCountry, cq: toint(csv.CQ), rank: toint(csv.Rank), prevrank: toint(csv.Prev)});

//team info
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1244447866" as csv
merge (tc:TeamClass {name: csv.Class})
with csv, tc
match (c:Country {code: csv.Country})
merge (tc)<-[:IN_CLASS]-(t:Team {code: trim(csv.Code), name: trim(csv.Name), cq: toint(csv.CQ), rank: toint(csv.Rank), prevrank: toint(csv.Prev)})-[:FROM_COUNTRY]->(c);

//twitter handle info
using periodic commit 500
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=0" as csv
match (c:Country {code: trim(csv.Country)})
merge (h:Handle {name: trim(csv.Handle), realname: trim(csv.Name)})-[:FROM_COUNTRY]->(c);

//rider info
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1885142986" as csv
match (h:Handle {realname: trim(csv.Name)}), (t:Team {code: trim(csv.Team)})
set h.Age=toint(csv.Age)
set h.CQ=toint(csv.CQ)
set h.UCIcode=csv.UCIcode
set h.rank=toint(csv.Rank)
set h.prevrank=toint(csv.Prev)
create (h)-[:RIDES_FOR_TEAM]->(t);

//add the indexes and a uniqueness constraint
create index on :Handle(name);
create index on :Hashtag(name);
create index on :Tweet(text);
create index on :Handle(nodeRank);
create constraint on (h:Handle) assert h.twitterId is unique;

As you can see, I also added some indexes. The entire script is also on Github.

The graph surrounding Tom Boonen now looked like this:

Once I had this, I could start adding the actual twitter info. That's next.

2. Importing the tweet data into Neo4j

As we saw previously, I had one CSV file for every day now. So how to iterate through this? Well, I did it manually, and created a version of this query for every day between April 1st and 30th.

//get the handles from the csv file
//this should not do anything - as the handles have already been loaded above
using periodic commit 500
load csv with headers from "file:<yourpath>/20150401.csv" as csv
with csv
where csv.Username<>[]
merge (h:Handle {name: '@'+lower(csv.Username)});

//connect the tweets to the handles
using periodic commit 500
load csv with headers from "file:<your path>/20150401.csv" as csv
with csv
where csv.Username<>[]
merge (h:Handle {name: '@'+lower(csv.Username)})
merge (t:Tweet {text: lower(csv.Tweet), id: toint(csv.TweetID), time: csv.TweetTime, isretweet: toint(csv.IsReTweet), favorite: toint(csv.Favorite), retweet: toint(csv.ReTweet), url: csv.`Twitter URL`})<-[:TWEETS]-(h);
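Since the only thing that changes between runs is the date in the filename, the 30 daily filenames could also be generated with a few lines of Python - a sketch, assuming the same yyyymmdd naming:

```python
from datetime import date, timedelta

def day_files(start, end):
    """Yield one yyyymmdd.csv filename per day, inclusive of both ends."""
    d = start
    while d <= end:
        yield "%s.csv" % d.strftime("%Y%m%d")
        d += timedelta(days=1)

files = list(day_files(date(2015, 4, 1), date(2015, 4, 30)))
print(len(files))  # 30
print(files[0])    # 20150401.csv
print(files[-1])   # 20150430.csv
```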

This file is also on Github, of course. I ran this query 30 times, replacing 20150401 with 20150402 etc etc... The result looked like this:
But obviously this is incomplete: we only have the tweets issued by specific riders now - and we would really like to know who and what they mentioned - in other words, extract the handles and hashtags from the tweets. Let's do that!

3. Processing the tweets: Extract the handles and the hashtags

I created two queries to do this - they are also on Github:
//extract handles from tweet text and connect tweets to handles
match (t:Tweet)
WITH t, split(t.text," ") as words
UNWIND words as handles
with t, handles
where left(handles,1)="@"
with t, handles
merge (h:Handle {name: lower(handles)})
merge (h)-[:MENTIONED_IN]->(t);

//extract hashtags from tweet text and connect tweets to hashtags
match (t:Tweet)
WITH t, split(t.text," ") as words
UNWIND words as hashtags
with t, hashtags
where left(hashtags,1)="#"
with t, hashtags
merge (h:Hashtag {name: upper(hashtags)})
merge (h)-[:MENTIONED_IN]->(t);
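For illustration, the tokenisation that these two Cypher statements apply can be sketched in plain Python: split the tweet text on spaces, and keep the tokens that start with "@" (lower-cased, like the Handle nodes) or "#" (upper-cased, like the Hashtag nodes):

```python
def extract_mentions(text):
    """Split a tweet on spaces and collect @handles and #hashtags."""
    handles, hashtags = [], []
    for word in text.split(" "):
        if word.startswith("@"):
            handles.append(word.lower())   # Handle names are stored lower-case
        elif word.startswith("#"):
            hashtags.append(word.upper())  # Hashtag names are stored upper-case
    return handles, hashtags

handles, hashtags = extract_mentions("great ride with @tomboonen1 at #ParisRoubaix")
print(handles)   # ['@tomboonen1']
print(hashtags)  # ['#PARISROUBAIX']
```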
And that's when we start to see the twitter network unfold: multiple riders tweeting and mentioning each other:

That's about it for this part 2. In the next section we will go into how we can enrich this dataset with more data about the connectedness between riders. Who is following who?

I hope you have liked this series so far. As always, feedback very welcome.