Friday 29 January 2016

Podcast Interview with Stuart Begg and Matt Byrne, Independent Contractors for Sensis

This week I was able to connect with two lovely gentlemen who have been doing some truly inspiring work with Neo4j on a project in Australia. Stuart and Matt have been talking and writing about the project for a while now, and have been very forthcoming about sharing their lessons learnt with the Neo4j community. That's always a great thing to watch, so I literally could not wait to chat with them - and so we did. Here's the recording:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone, my name is Rik, Rik Van Bruggen from Neo Technology, and here I am recording the longest-distance podcast episode ever in the history of this podcast - all the way to Australia. Stuart and Matt, thank you guys for joining us. 
SB: 00:18 Thanks Rik. 
RVB: 00:19 It's great to have you on. I know it's early for you guys and late for me, but it's great that you're making the time. I've read some of the blog posts that you guys have been publishing on Neo, but it might be a good idea for you to introduce yourselves, if you don't mind? 
SB: 00:36 Hi Rik, it's Stuart here, Stuart Begg. I'm a contractor developing software for Sensis. We've been using Neo for a while now, and I'm here with my colleague Matt Byrne. 
MB: 00:47 I'm Matt Byrne, also contracting out of Sensis in Melbourne. Been on the same project as Stu. 
RVB: 00:54 Absolutely, very cool. And you guys have been using Neo for a long time? How long have you been working with Neo4j by now? 
SB: 01:01 Well, the project kicked off about two years ago, but we've been actively using Neo for the last 18 months, after a selection process at the very beginning of the project to look at various technologies - and Neo was the technology that was chosen. 
RVB: 01:16 And I suspect that my dear friend Jim Webber was part of that process, wasn't he? 
SB: 01:23 Yes, Jim was part of that process in the very early stages. He was the one that kind of seeded the idea in my mind in particular when he came to Sensis quite a few years ago actually and talked about the benefits of Neo Technology as an alternative to other more traditional database technologies. And I was very interested in what he had to say. Then when we started looking at this project, Neo was a very good fit for the style of application that we were building. 
RVB: 01:49 Well that gives me a perfect segue into the first question that I typically ask on this podcast, which is why? Why did you guys get into this? And what made it such a good fit? 
SB: 02:03 It's a good fit for a couple of different reasons. The project that we were working on is around content management within the organisation. We're a directories company, so we have a lot of content that's supplied by the various customers. But in many ways it's kind of a configuration and management system: the company sells products, and those products are backed by the content. We've gone through a number of iterations of trying to work out a better and different way of managing the content within the backend system, and have looked at various things, from relational database systems - there was some early work done on that, but we ran into some modelling and performance issues - to hierarchical data systems. But when we got down to it and looked at the data in more detail, it was more of a network model. We were trying to reduce duplication within the data. We were seeing that there were various pieces of data that we used in multiple different ways, the same piece of data used in multiple different contexts, and it's just that the graph made everything so much easier to model in the first instance, and also for the business to understand how we were actually using the data. And the visualisation tools that come with Neo as a standard feature make it really easy to get your message across to not only the end users, but also the developers that are building the system itself. Anything you'd like to add to that? 
MB: 03:29 No, you were part of that selection [?] [chuckles]. 
RVB: 03:32 [chuckles] Well it's so cool, right, because I don't know if you guys know this, but the way Neo got founded in the late 90s, early 2000s was for a content management system as well. So it's history repeating itself, so to speak, and exactly what you were saying, right? It's a modelling fit, it's a performance fit. All of those types of things seem to come back quite regularly, which is super cool. So tell us a little bit more about these performance advantages that you guys wrote about in the blog post. There were some significant improvements there I gather, no? 
MB: 04:14 Yeah. So basically we're retrieving a sub-graph of information at a time. So we've got all this dynamically specified amount of data, and for us to be able to model that in relational would just kill us. So in the blog post the motivation was really to show how you could tune Neo, and to show all the tools available to you. So we put some stuff in there as an example of what we've done, but it was really to help people get started and have a bit more confidence in it. I think we only found Michael Hunger's article while we were writing our one, because at the time it was just a Google Doc, I think. Then he published his, and that helped us. It really helped get things going for us. 
RVB: 05:06 Your article was really good because it also illustrates a common problem, with time-based versioning and all of that stuff. You were inspired by Ian Robinson, right? One of Jim's mates. 
MB: 05:20 Yeah. I came onto the project a bit after - Stu did the hard work, him and another guy, to select the technology, and obviously a bold move to do it with such a core system. And then when we came on, the time-based thing of Ian Robinson's was great, it fit us. But it also challenges you. You have to bake that in at a low level and it can affect performance. So we were tackling that. 
RVB: 05:53 I think everyone should read your blog post. There are some very valuable experiences and lessons to be learned from it, I think. Thanks for writing it up. It's really helping our community, I think. So guys, where is this going? Where is your project going? And where do you think the technology's going to be going in the next couple of years? Any perspectives on that? 
SB: 06:15 I think from the project's perspective, it's been particularly successful. We've met the business requirements and also, I guess, the user requirements of providing a much richer experience for the users, and a much easier and faster turn-around from concept to delivery of the sort of products that we're looking at delivering for Sensis as an organisation. But it would be good to have other companies and other individuals understand some more about the graph technology. And that's one of the reasons why Matt and I are keen to talk about our experience with the product, and I think that there is a good fit for this kind of technology in a lot of different areas that people may not have thought about. Even in these times of the NoSQL database, I think there are still a lot of people who are very familiar with SQL-style databases and maybe don't think beyond that. So it would be good for people to have this as another option within their tool kit for providing different kinds of solutions to the sorts of problems that they're dealing with. 
MB: 07:24 Just to add to that, we've put the article out, and a guy Mark in Melbourne's organised Melbourne meet-ups. We're trying to put out Sensis as a strong example of how it can work. But it'd be nice to see other examples in Australia come out, because it gives companies in Australia more confidence to adopt these technologies. And we're just contractors, so we'd like to help the next company if that came up. But in terms of Neo, to have confidence in the technology we also wanted to highlight any drawbacks that there are, just so that you know what you're in for. Cypher is a fantastic query language, it's very easy to use. But it's young compared to SQL. So that's still evolving. A lot more sugar in that, a lot more power coming, and there's always things that we hear. So I'm always keen to see the enhancements to that coming in new versions. And more indexing power. It'd be nice to have indexes on relationships, because there are examples where we do want to query things that way. So some of those things. But all the work that's come in, it's just fantastic to see it improve so much between releases. 
SB: 08:49 And the - I was just going to say, the feedback that we get from Neo for the sorts of requests that we make, and just the ways that we're using the technology - the feedback from the Neo guys has been terrific all the way through this process as well. 
RVB: 09:07 That's super great to hear. Just to address what you were saying there, Matt: there are so many things that we want to do in the product, it is a very long list of things. The cool thing is, with Neo we've actually now hired a very significant engineering team. We had an all-hands call earlier today with all the Neo employees, and our engineering team is over 40 people now, which is super super great, because that allows us to accelerate those new feature developments as well, right? There's wonderful stuff coming. I can tell you that [chuckles]. 
MB: 09:46 And the good thing about dealing with a company like Neo, compared to our traditional database vendors or any other - what people love to call - enterprises, is that we do get this great feedback and we do feel like our feedback's taken into account. We realise there's only so much that can be done in releases, so we definitely don't feel hard done by. It's great. 
RVB: 10:11 Super. Cool. Well guys, thank you so much for coming on the podcast. I'm going to wrap it up here because I want to keep these episodes digestible and short for people's commutes. So thank you again, I really appreciate it, and I look forward to meeting you guys in person one day. 
SB: 10:32 That will be great, thanks-- 
MB: 10:32 That'll be great. Cheers-- 
SB: 10:34 Look forward to it. 
RVB: 10:34 Thank you guys. Cheers, bye. 
MB: 10:36 Thanks, bye. 
SB: 10:37 Cheers.
Subscribing to the podcast is easy: just add the RSS feed or add us on iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday 20 January 2016

Podcast Interview with Paul Jongsma, Webtic

NO! The Graphistania Podcast is NOT DEAD, in spite of what you may have thought. We have lots of cool interviews coming up (and are always looking for more - so please contact me if you have more ideas), and we are going to start the "not so new year" with a nice conversation with a dear community member from the Netherlands, Paul Jongsma. Paul has been around the block a few times: he's a seasoned web developer who runs his own consultancy, doing projects with lots of cool and innovative technology - among which Neo4j, of course. So we got together and talked, and here's the new episode:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik Van Bruggen. I am very happy to start the New Year with another beautiful Neo4j podcast. Graphistania needs to be populated right so we're going to do another chat here today. I've invited a very, very dear community member from the Netherlands from this chat and that is Paul Jongsma. Hi Paul.
PJ: 00:26 Hi.
RVB: 00:27 Hey. Thanks for coming online. I really appreciate it.
PJ: 00:31 Glad to be here.
RVB: 00:32 Yeah, cool. Paul, we always start the podcast with just very simply, who are you, what do you do, what's the relationship between you and the wonderful world of graphs?
PJ: 00:43 I'll try to do that in a couple of seconds. I'm Paul. In a previous life I was co-founder of XS4ALL, one of the first public Internet providers in Holland, and after doing that for a couple of years, I left that company to create my own, called WebTic, in which I have, ever since it was founded, tried to disclose information on the web via websites, web applications, whatever you want to name it. I've been doing that since 1996, so I've been on the road for quite a bit.
RVB: 01:22 Yeah, absolutely. So when you say disclosing information on the web, that means building websites? What types of tools do you typically use for building your web applications?
PJ: 01:35 Until recently, mainly-- back in 96, there was only a couple of ways of building a website--
RVB: 01:46 CGI Scripts!
PJ: 01:48 CGI scripts and no databases. So I was one of the first to actually try to hook up databases to the web. There were few tools at that time, and one that comes to mind is Mini-SQL, which is not even a predecessor of MySQL, at least timeline-wise. We've been doing projects involving databases almost exclusively. We weren't building websites which were static. There was always a technical challenge behind it.
RVB: 02:24 Yeah, an interactive component.
PJ: 02:26 Interactive components, or automation components. For instance, we've been working on the website for the Department of...
RVB: 02:38 Looking for the English word [laughter].
PJ: 02:40 The English word, exactly [laughter]. The Statistics Department of Amsterdam. They obviously have loads of information within their systems, and they want to publish it on the web. What we did was build a system where basically all the information which they already have can be collected, a script can be run on that, and then a website comes out on the other end. Updates are all done automatically, so they just add the files which they want to publicise, they push the button, and they're on the web. There is no need for internal knowledge of how to convert stuff for the web - it's all automatic. That kind of engine of converting it to the web, and making that engine work, is the area of interest we mainly--
RVB: 03:32 Cover. So how did you get to graphs. Tell us about that.
PJ: 03:36 Yeah, that's quite an interesting story. Obviously we've been working with SQL databases for quite a bit of time. Back in the day, Mini-SQL. We've been using MySQL for ages now. A couple of projects we also did in Postgres instead of MySQL, but mostly SQL-based. And for a larger project we've done recently, called Historiana - a website about history in a European context, trying to create new and effective ways of teaching history to both students and teachers, to facilitate the teacher and give tools to both - we ran into some modeling problems. Because as you can imagine, there are a lot of relations to be documented when you talk about history. You've got persons. You've got events. You've got locations. And as with anything in the human world, you could connect anything to anything basically, and that requirement was also there. So we built a system where you actually could connect basically everything to everything within the context of SQL. And that gave--
RVB: 04:57 That must have been fun [laughter].
PJ: 04:59 That gave very interesting results, and after a couple of hundred records in various parts of the database, the interface came to a grinding halt. Obviously, because if you want to relate 100 items to 100 items to 100 items, the queries which you have to build are horrendous, the user interface becomes horrendous, and even worse, the timing becomes horrendous.
RVB: 05:28 Now was that the project that you were using Neo4j for then? [crosstalk]
PJ: 05:31 Oh yeah, I did not know Neo4j at that time, but my antennae picked up the keyword "graph database", so I started looking into those, because obviously there should be a solution for a problem like the history problem. Graph databases seemed like they might be a solution.
RVB: 06:00 You're not alone there, right? I mean, recently there was this beautiful video from the Codex example. I don't remember the guy's name, but it was beautiful, I thought.
PJ: 06:09 Yeah, which-- the Australian guy. You should add that link to the--
RVB: 06:15 I will. I will.
PJ: 06:17 -- because interestingly enough, there was a lot of overlap between the two, at least in the words used and the concepts used within both projects. It's amazing how history--
RVB: 06:32 So what was the main benefit then? Was it modeling then or also performance? What was the main driver then for looking at the graph databases?
PJ: 06:41 The main driver was the ability to connect stuff to stuff: a person should be able to connect to an event, and when you look at that person you should see the events, but when you are at the event, you should be able to ask which persons are connected to this event, and you will see that relation. Doing that in SQL is sort of possible, within certain limits, but the ease with which a graph database allows you to do that is so much better than in a SQL database.
RVB: 07:21 When I first got to know you in the Amsterdam community - and I will always quote you on this - you said that once you work with Neo4j, SQL databases feel like a useful sin [laughter]. I think that's a fantastic way of putting it, to be honest.
PJ: 07:36 They are. They are. Obviously there is room for other systems as well, but the relation between the web and a graph database is so much more logical a model than any other, because any page you're at on a website which is explorative is always about: OK, there is this thing, and how does it relate to other things? And that question can be answered by Neo within a couple of milliseconds, and you'll be able to render the results of that page in real time, instead of doing queries up front and caching the results and things like that. There are all kinds of technical tricks to make it work with SQL, but your life becomes so much easier when you use the right technology for the right job.
RVB: 08:30 Agreed. So last but not least, where do you think it's going Paul? Where do you think the world of graph databases is headed and how do you plan to use it in the future?
PJ: 08:44 I think more and more people will discover the world of graph databases--
RVB: 08:51 See the light [laughter]?
PJ: 08:53 See the light, and make their lives easier by using them. That is one thing I think is going to happen. I've always been a front-runner with technology. People always ask, "Why are you picking that?" Just because I thought it was a good tool. So I think behind me there are a lot of people picking it up, and I hope that the new Neo releases will take web development even more in mind. There are a couple of things you might want to make easier, like relating to files in a file system and stuff like that, that would make life easier, but I think we're getting there [crosstalk]. There are interesting developments, like the binary interface and stuff like that.
RVB: 09:51 Absolutely. Well, thank you so much Paul. It was a joy to talk to you again. I wish you lots of luck and happiness in 2016, and hopefully we'll see the adoption of graphs take off together, right? And I'll see you in the Amsterdam community very, very soon.
PJ: 10:11 Yes.
RVB: 10:12 Thank you, Paul.
PJ: 10:13 Okay. No problem.
RVB: 10:14 Thank you. Bye bye.
PJ: 10:14 Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us on iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday 13 January 2016

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

After having created the GraphBlogGraph in a Google Spreadsheet in part 1, and having imported it into Neo4j in part 2, we can now start having some fun analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore some initial things that I am sure you could then elaborate and extend upon.

Let’s start with a simple query

// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
return b.name, p.title, count(r)
order by count(r) desc
Run this in the Neo4j browser and we get:

or just return the graphical result with a slightly different query:

match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
//aggregate per page first: carrying r and p2 through the WITH would make count(r) always 1
with b, p, count(r) as count
order by count DESC
limit 5
//then expand the busiest pages back into their actual links
match (b)--(p)-[r:LINKS_TO]->(p2:Page)
return b,p,r,p2

And then you start to see that Max De Marzi is actually the “king of linking”: he links his pages to other web pages a lot (which is actually very good for search-engine-optimization).

A quick visit to one of Max’ pages does actually confirm that: there are a lot of cool, bizarre, but always interesting links on Max’ blogposts:
So let’s do another query. Let’s look at the different links that exist between blogposts of our blog-authors. Are they actually quoting/referring to one another or not? Let’s do

//links between blogposts
MATCH p=((b1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog))
RETURN p;

and then we actually find that there are some links - but not that many.
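
If you want to put a number on “not that many”, a small counting variant of the same pattern tallies the cross-blog links per pair of blogs. Just a sketch - and note that the undirected match sees every link from both ends, so each pair shows up twice, mirrored:

//count the cross-blog links per pair of blogs
//(each link is counted once from each blog's side, so pairs appear mirrored)
match (b1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog)
where b1 <> b2
return b1.name, b2.name, count(*) as links
order by links desc;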


Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*]-(b3))
return p2 as paths

Then we actually see some more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages - and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:

Interesting, right? I think so, at least!
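
By the way: the allshortestpaths call above is unbounded, which is fine on a small graph like this one, but on a bigger dataset you may want to cap the search depth explicitly. A variant sketch, assuming the same two blog names as above:

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
//cap the variable-length expansion at 4 hops
p2 = allshortestpaths((b1)-[*..4]-(b3))
return p2 as paths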

Then let’s do some more playing around, looking at the most linked to pages:

//Which pages are being linked to most
match ()-[r:LINKS_TO]->(p:Page)
return p.url, count(r)
order by count(r) DESC
limit 10;

That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:

Last but not least, I wanted to revisit an old and interesting way of running PageRank on Neo4j using Cypher (not using the Graphaware NodeRank module, that is). I blogged about it some time ago, and it’s actually really interesting and easy to do. Here’s the query:

//do 50 rounds of rank-spreading
UNWIND range(1,50) AS round
//in each round, pick a random ~10% sample of the pages as starting points...
MATCH (n:Page)
WHERE rand() < 0.1
//...and give a point of rank to every page reached by a path of up to 10 hops
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1

This does 50 iterations of PageRank, using a 0.1 damping factor and a maximum depth of 10. Running it is surprisingly quick:
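
One thing to keep in mind: the rank property keeps accumulating across runs. So if you ever want to start a run from scratch, a quick reset first does the trick - a minimal housekeeping sketch:

//reset the ranks before a fresh PageRank run
match (n:Page)
where n.rank is not null
remove n.rank;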

If you do that a couple of times, and even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:
match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;
Confirming the “spider in the web” theory that I mentioned above. Michael rules the links!


All of these queries are of course on Github for you to play around with. I would love to hear your thoughts on these three blogposts, and I hope that they were as fun for you to read as they were for me to write.

All the best.

Rik

Monday 11 January 2016

The GraphBlogGraph: 2nd blogpost out of 3

Importing the GraphBlogGraph into Neo4j

In the previous part of this blog-series about the GraphBlogGraph, I talked a lot about creating the dataset for what I wanted: a graph of blogs about graphs. I was able to read the blog-feeds of several cool graphblogs with a Google spreadsheet function called “IMPORTFEED”, and scrape their pages with another function called “IMPORTXML”. So now I have the sheet ready to go, and we also know that with a Google spreadsheet, it is really easy to download it as a CSV file:

You then basically get a URL for the CSV file (from your browser’s download history):

and that gets you ready to start working with the CSV file:
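
Before doing any real importing, it's worth sanity-checking that Neo4j can actually read the file. A quick preview query like this one returns the first few rows without writing anything to the graph - the URL here is just a placeholder for your own sheet's CSV export link:

//preview the first few rows of the CSV file
load csv with headers from "https://docs.google.com/spreadsheets/.../export?format=csv" as csv
return csv
limit 5;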

I can work with that CSV file in Cypher’s LOAD CSV command, as we know. All we really need is to come up with a solid Graph Model to do what we want to do. So I went to Alistair’s Arrows, and drew out a very simple graph model:



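In Cypher pattern terms the model is tiny: pages are PART_OF a blog, and pages LINKS_TO other pages. Once some data is loaded, a quick summary query like this sketch shows the model in action, per blog:

//the model in a nutshell: (:Page)-[:PART_OF]->(:Blog) and (:Page)-[:LINKS_TO]->(:Page)
match (p:Page)-[:PART_OF]->(b:Blog)
optional match (p)-[:LINKS_TO]->(p2:Page)
return b.name, count(distinct p) as pages, count(p2) as links;
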
So that basically get’s me ready to start working with the CSV files in Cypher. Let’s run through the different import commands that I ran to do the imports. All of those are on github of course, but I will take you through them here too...

First create the indexes

create index on :Blog(name);
create constraint on (p:Page) assert p.url is unique;
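
The unique constraint on :Page(url) is not just a safety net, by the way: it is backed by an index, and that index is what keeps the MERGE statements further down fast. A quick way to check that the lookup actually uses it - the URL here is just an example:

//check that lookups on :Page(url) are index-backed
profile match (p:Page {url:"http://blog.bruggen.com/"}) return p;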

Then manually create the blog-nodes:

create (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"});
create (n:Blog {name:"Neo4j Blog", url:"http://neo4j.com/blog"});
create (n:Blog {name:"JEXP Blog", url:"http://jexp.de/blog/"});
create (n:Blog {name:"Armbruster-IT Blog", url:"http://blog.armbruster-it.de/"});
create (n:Blog {name:"Max De Marzi's Blog", url:"http://maxdemarzi.com/"});
create (n:Blog {name:"Will Lyon's Blog", url:"http://lyonwj.com/"});

I could have done that from a CSV file as well, of course. But hey - I have no excuse - I was lazy :) … Again…
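
For completeness, the non-lazy version would look something like this - a sketch that assumes a hypothetical extra sheet with Name and URL columns:

//the non-lazy alternative: load the blog nodes from a CSV file
load csv with headers from "https://docs.google.com/spreadsheets/.../export?format=csv" as csv
merge (b:Blog {name: csv.Name, url: csv.URL});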

Then I can start with importing the pages and links for the first (my own) blog, which is at blog.bruggen.com and has a feed at blog.bruggen.com/feeds/posts/default:

//create the Bruggen blog entries
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
match (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"})
create (p:Page {url: csv.URL, title: csv.Title, created: csv.Date})-[:PART_OF]->(b);

This just creates the 20 leaf nodes from the Blog node. The fancy stuff happens next, when I read the “Links” column, holding the “****”-separated links to other pages, split it up into individual links, and merge the pages and create the links to them. I use some fancy Cypher magic that I have also used before for Graph Karaoke: I read the cell, split it into parts and put those into a collection, and then unwind the collection and iterate through it using an index:

//create the link graph
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
with csv.URL as URL, csv.Links as row
unwind row as linklist
//split the "****"-separated cell into a collection of trimmed links
with URL, [l in split(linklist,"****") | trim(l)] as links
//iterate over the collection by index; the -2 skips the last element
//(an empty string when the cell ends with a trailing "****" separator)
unwind range(0,size(links)-2) as idx
merge (l:Page {url:links[idx]})
with l, URL
match (p:Page {url: URL})
merge (p)-[:LINKS_TO]->(l);

So this first MERGEs the new pages (finding them if they already exist, creating them if they don’t) and then MERGEs the links to those pages. This creates a LOT of pages and links, because - like with every blog - there are a lot of hyperlinks that are the same on every page of the blog (essentially the “template” links that are used over and over again).
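
That MERGE behaviour is also what makes the script safe to re-run: a second pass finds the existing nodes and relationships instead of duplicating them. A tiny illustration, with a made-up URL:

//MERGE is idempotent: run this twice and you still end up with exactly one node
merge (p:Page {url:"http://example.com/some-post"})
return p;
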
And as you can see it looks a little bit like a hairball when you look at it in the Neo4j Browser:
So in order to make the rest of our GraphBlogGraph explorations a bit more interesting, I decided that it would be useful to do a bit of cleanup on this graph. I wrote a couple of Cypher queries that remove the “uninteresting”, redundant links from the Graph:

//remove the redundant links
//linking to pages with same url (eg. archive pages, label pages...)
match (b:Blog {name:"Bruggen"})<-[:PART_OF]-(p1:Page)-[:LINKS_TO]->(p2:Page)
where p2.url starts with "http://blog.bruggen.com"
and not ((b)<-[:PART_OF]-(p2))
detach delete p2;

//linking to other posts of the same blog
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"})<-[:PART_OF]-(p2:Page),
(p1)-[lt:LINKS_TO]-(p2)
delete lt;

//linking to itself
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"}),
(p1)-[lt:LINKS_TO]-(p1)
delete lt;

//linking to the blog provider (Blogger)
match (p:Page)
where p.url contains "//www.blogger.com"
detach delete p;

This turned out to be pretty effective: running these queries weeds out a lot of “not so very useful” links between nodes in the graph.
And the cleaned-up store looks a lot better and is much more workable.
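
To see how much the cleanup actually removed, a simple before-and-after count is handy - a sketch you can run once before and once after the cleanup queries:

//how many pages and links are in the graph right now?
match (p:Page)
optional match (p)-[r:LINKS_TO]->()
return count(distinct p) as pages, count(r) as links;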

If you take a look at the import script on github, you will see that there’s a similar script to the one above for every one of the blogs that we set out to import. Copy and paste them into the browser one by one, run them in the neo4j shell, or use LazyWebCypher, and have fun:
So that’s it for the import part. Now there’s only one thing left to do, in Part 3/3 of this blogpost series, and that is to start playing around with some cool queries. Look for that post in the next few days.

Hope this was interesting for you.

Cheers

Rik