Thursday, 3 March 2016

The Neo4j Knowledge Graph

A couple of days ago, I wrote a graphgist about creating a true Knowledge Graph for the Neo4j ecosystem, based on the fantastic Awesome Neo4j resource created by our friends at Neueda/Neueda4j. You can access it in a separate window over here.


In this post, however, I will go into a bit more detail about how I created that graph.

Google Spreadsheet is my friend

I mentioned already that I started from the awesome Awesome Neo4j GitHub resource. And while it's a great idea to manage pages etc. collaboratively on GitHub, I can't help but feel that there should be other, nicer ways of structuring that information. So I spent a couple of hours converting that information into a spreadsheet (which is publicly accessible over here):

This sheet contains:
  • info about the resource (name and comments)
  • the URL where you can find the resource
  • info about the author (individual or organisation) that created/manages the resource
So it's a very, very easy graph model:



So all I needed to do was import that sheet into Neo4j. Easy...

Importing the Google Spreadsheet with Load CSV

As we know by now, it's really easy to download a Google spreadsheet as a CSV file, and then it is pretty darn easy to import that CSV into Neo4j with Load CSV. I have two versions of that load script.
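
As a rough illustration only, and not the actual load script, an import along these lines could look like the sketch below. The file location, the column headers (Resource, URL, Comments, Author, Tags) and the CREATED relationship type are assumptions; the labels, the name/url/comments properties and the TAGGED_AS relationship are taken from the queries and comments further down.

```cypher
// Hypothetical file location: point this at the CSV exported from the sheet.
LOAD CSV WITH HEADERS FROM "file:///awesome-neo4j.csv" AS csv
// Skip rows without a resource or author name, since MERGE rejects null keys.
WITH csv
WHERE csv.Resource IS NOT NULL AND csv.Author IS NOT NULL
// One Resource node per distinct resource name (column names are assumed).
MERGE (r:Resource {name: csv.Resource})
  SET r.url = csv.URL, r.comments = csv.Comments
// One Author node per distinct author name.
MERGE (a:Author {name: csv.Author})
// CREATED is an assumed relationship type, used here for illustration.
MERGE (a)-[:CREATED]->(r)
WITH r, csv
// Split the comma-separated tag cell and drop empty entries.
UNWIND [w IN split(csv.Tags, ",") WHERE trim(w) <> "" | trim(w)] AS tagName
MERGE (t:Tag {name: tagName})
MERGE (r)-[:TAGGED_AS]->(t);
```

Using MERGE rather than CREATE throughout means that re-running the import should not duplicate nodes or relationships.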

The result is not a very big graph of course:


And now we can do some nice querying on it - just for fun!

Querying the Neo4j KnowledgeGraph

Obviously there are many different queries that we could run on an interesting graph like this. I have put a couple of them on Github as well. Here they are:

//Find some Authors, Resources and Tags
MATCH p = ((a:Author)--(r:Resource)--(t:Tag))
return p
limit 25

This gives you an initial sample of the graph:

Then we can explore a couple of specific graph neighborhoods:

//Find some Authors, Resources and Tags connected to Rik or Max
MATCH (t:Tag)--(r:Resource)--(a:Author)
where a.name contains "Rik" or a.name contains "Max"
return t,r,a

This gets us the following:


And then we can also "recreate" a spreadsheet-like view of the graph:

//find some resources and authors
MATCH (r:Resource)--(a:Author)
where a.name contains "Rik" or a.name contains "Max"
return distinct a.name as Author, r.name as Resource, r.url as URL, r.comments as Description
order by Author;

This gets us the following (a pity that the URLs don't get hyperlinked like they do on the graphgist):


And then finally, let's look at some pathfinding - always interesting:

//find some paths between books and blogs
match (t1:Tag {name:"book"}), (t2:Tag {name:"blog"}),
p = allshortestpaths ( (t1)-[*]-(t2))
return p
limit 10

As usual, we end up with Michael Hunger again :)) 


So there you go. A first attempt at creating another graph-based knowledge repository for all things Neo4j.  Hope you guys enjoyed that. I know I did :))

Cheers

Rik

3 comments:

  1. Hi, it's a great blog and I've learned a lot, but there's one thing that confuses me.
    unwind row as text
    with Resource, [w in split(text,", ") | trim(w)] as words
    unwind range(0,size(words)-2) as idx
    MATCH (r:Resource {name: Resource})
    MERGE (t1:Tag {name:words[idx]})
    MERGE (t2:Tag {name:words[idx+1]})
    MERGE (r)-[:TAGGED_AS]->(t1)
    MERGE (r)-[:TAGGED_AS]->(t2);

    This is the part of the code that confuses me. How do you do the tagging? How does this part of the code ensure that all the items in the tag cell are visited? Can you give me a little hint? Thank you

  2. What this does is look at the "Tags" column of the spreadsheet (https://docs.google.com/spreadsheets/d/1X6DpFZoS01V1crgRED4dRz2UkbiYR8FJMPf9xey9Lwc) and then create tags, and relationships between tags and resources, by iterating through the "tag cell" of the spreadsheet.

    So for example, if a "tag cell" contains the following tags

    code, rdbms, tool, integration, import

    separated by commas, then the script above splits them up into individual tags (using the split(text,", ") command), then uses the number of tags (via size(words)-2) as the upper bound of an index to iterate over, and then merges the individual tags and the relationships.
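
    As a small illustration (not part of the load script itself), the following standalone query runs just the split-and-iterate step on that example cell, so you can see which pair of tags each idx value produces:

    ```cypher
    // Standalone sketch: the literal string stands in for one "tag cell".
    WITH "code, rdbms, tool, integration, import" AS text
    WITH [w IN split(text, ", ") | trim(w)] AS words
    // range() is inclusive on both ends, so idx runs from 0 to size(words)-2.
    UNWIND range(0, size(words)-2) AS idx
    RETURN idx, words[idx] AS t1, words[idx+1] AS t2;
    ```

    With the five tags above, idx runs from 0 to 3 and each row pairs a tag with its right-hand neighbour (code/rdbms, rdbms/tool, tool/integration, integration/import), so every tag is visited at least once. One caveat: for a cell holding a single tag, size(words)-2 is -1 and range(0,-1) is empty, so this pattern would produce no rows for that cell.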

    Hope that's clearer?

    Rik

    Replies
    1. Thank you very much! I've figured it out!
