Thursday, 30 July 2015

Hierarchies and the Google Product Taxonomy in Neo4j

Quite some time ago, I wrote a blog post about using Neo4j for managing and calculating hierarchies. That post was later also used in my book, as it proved very useful for explaining one of the key use cases for Neo4j: impact analysis and simulation. So it should be pretty clear by now that HIERARCHIES ARE GRAPHS, right? I think so :) ...

Hierarchical Product Taxonomy

Recently, I was preparing for a very cool brown-bag session at a client's offices, and I wanted to include a demonstration around product taxonomies. These structures are typically presented as some kind of hierarchy/tree on many eCommerce websites, and are very familiar to online users. So I wanted to find a taxonomy, and here Google immediately came to the rescue. I found this page on the Google Merchant Center.

You can follow the link to the Excel file, and boom - there's your product Taxonomy for you.

In Excel I quickly converted it to CSV, so that I could import it into Neo4j pretty easily. The CSV file is of course on GitHub.

Importing the Google Product Taxonomy

From the Excel file, I could easily see that this taxonomy is 7 levels deep. So when I wanted to do the import, I chose to do 7 passes through the CSV file, one per level. The import statements are all on GitHub too (just copy/paste into the Neo4j-shell to try it out yourself - it will grab the CSV file for you). Here are the first couple of statements, generating the "top" of the hierarchy:

//Create the taxonomy Top:
create (t:Taxonomy {name:"Google Product Taxonomy"});

//top of the tree: Cat1
load csv with headers from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/Product Categories/taxonomy-with-ids.en-US.csv" as csv fieldterminator ';'
with distinct csv.Cat1 as Cat1, csv.Cat2 as Cat2, toInt(csv.ID) as ID
where Cat2 is null
match (t:Taxonomy {name:"Google Product Taxonomy"})
merge (c:Cat1:Category {name: Cat1, id: ID})-[:PART_OF]->(t);

// (cat2)-[:PART_OF]->(cat1)
load csv with headers from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/Product Categories/taxonomy-with-ids.en-US.csv" as csv fieldterminator ';'
with distinct csv.Cat1 as Cat1, csv.Cat2 as Cat2, csv.Cat3 as Cat3, toInt(csv.ID) as ID
where Cat2 is not null AND Cat3 is null
match (c1:Cat1 {name: Cat1})
merge (c2:Cat2:Category {name: Cat2, id: ID})-[:PART_OF]->(c1);

As you can see from the above, we are running through the CSV file, and then
  • connecting the top-level categories (labeled as Cat1) to the "taxonomy node". You could indeed imagine having multiple taxonomies living happily side by side in your graphdb.
  • ensuring that the "current level" categories are not null, so that we don't need to worry about empty nodes.
  • ensuring that the "next level" categories are "null" so that we only have to create the nodes at that particular level of categorisation.
  • assigning two labels to every taxonomy element: one for the "level" that they are in ("Cat1", "Cat2", "Cat3", etc), and one ("Category") for clarifying that this part of the graph is in fact the taxonomy.
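
The remaining levels follow the same pattern. As a rough sketch only (the full statements are the ones on GitHub), the third pass could look like the statement below. It assumes the CSV has a Cat4 column named in line with the Cat1 to Cat3 columns used above, and it matches the parent on both its own name and its Cat1 parent, which is my own addition to guard against a Cat2 name that occurs under more than one top-level category.

```cypher
// Sketch, not the original statement: (cat3)-[:PART_OF]->(cat2)
// Assumes a Cat4 column exists in the CSV header.
load csv with headers from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/Product Categories/taxonomy-with-ids.en-US.csv" as csv fieldterminator ';'
with distinct csv.Cat1 as Cat1, csv.Cat2 as Cat2, csv.Cat3 as Cat3, csv.Cat4 as Cat4, toInt(csv.ID) as ID
// only rows that describe a level-3 category: Cat3 filled in, Cat4 empty
where Cat3 is not null AND Cat4 is null
// find the parent by its own name and by the name of its Cat1 parent
match (c2:Cat2 {name: Cat2})-[:PART_OF]->(:Cat1 {name: Cat1})
merge (c3:Cat3:Category {name: Cat3, id: ID})-[:PART_OF]->(c2);
```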

We have also created the appropriate indexes beforehand, so that the MERGE operations would go smoothly. 
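
The exact index definitions are not listed in this post, so the following is only a guess at what they could look like: one index per level label on the name property, since that is what the MATCH and MERGE clauses look up. The syntax is the one from the Neo4j 2.x era that this post was written in; more recent Neo4j versions use the form CREATE INDEX FOR (c:Cat1) ON (c.name) instead.

```cypher
// Assumed index definitions, not taken from the original import script.
create index on :Cat1(name);
create index on :Cat2(name);
create index on :Cat3(name);
// ... and so on, up to and including Cat7
```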

Easy! As you can see below, the hierarchy was created very nicely:



Example queries on the Taxonomy Tree

I've also just done some trivial queries on the Taxonomy Tree. They are also on GitHub. Here's one where I grab a Cat7 node (a leaf at the bottom of the tree), and race up to the top:
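
A minimal sketch of such a query, using only the labels and the PART_OF relationship from the import statements (the actual query on GitHub may differ):

```cypher
// Pick one leaf category and follow PART_OF all the way up to the taxonomy node.
match path = (leaf:Cat7)-[:PART_OF*]->(t:Taxonomy {name:"Google Product Taxonomy"})
return path
limit 1;
```

The variable-length pattern [:PART_OF*] is what lets a single statement climb the tree without knowing in advance how many levels sit in between.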


And here's another lovely example of how you could find the links between two parts of the taxonomy. In this particular case you start from two parts of the taxonomy that have nothing in common below the root ("Clarinet Barrels" and "Paintball Hoppers"), and run right up to the top:
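
One way to express this, again as a sketch rather than the query from GitHub, is to climb from each category to the shared taxonomy node. It assumes both category names occur exactly as written in the imported data.

```cypher
// Climb from each category up to the common taxonomy node.
// Two separate MATCH clauses are used so that the two paths may share
// relationships if the categories happen to have a common ancestor.
match p1 = (:Category {name:"Clarinet Barrels"})-[:PART_OF*]->(t:Taxonomy)
match p2 = (:Category {name:"Paintball Hoppers"})-[:PART_OF*]->(t)
return p1, p2;
```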

But in the real world you could probably see how you could really link actual products into the taxonomy and find out meaningful information. Like in this example, where you try to find out the links between an Apple iPhone and a Samsung Chromebook.
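
To make that idea concrete, here is a purely hypothetical sketch. The Product label, the IN_CATEGORY relationship type and the two category names are all assumptions of mine for illustration; none of them come from the import above, and the category names would need to be replaced by ones that really exist in the taxonomy.

```cypher
// Hypothetical model: products hang off taxonomy categories.
// "Mobile Phones" and "Laptops" are assumed category names.
match (phones:Category {name:"Mobile Phones"}), (laptops:Category {name:"Laptops"})
merge (p1:Product {name:"Apple iPhone"})-[:IN_CATEGORY]->(phones)
merge (p2:Product {name:"Samsung Chromebook"})-[:IN_CATEGORY]->(laptops);

// Find the shortest connection between the two products through the taxonomy.
match (p1:Product {name:"Apple iPhone"}), (p2:Product {name:"Samsung Chromebook"}),
      path = shortestPath((p1)-[:IN_CATEGORY|PART_OF*]-(p2))
return path;
```

The path that comes back passes through the lowest category the two products have in common, which is the kind of meaningful information you would be after.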

That's about it for now. I hope you found this useful, and welcome any questions and comments as usual.

Cheers

Rik

2 comments:

  1. Hi Rik, this is great and really fun to play around with!
    But I'm curious about how you get the browser to display the different "Cat" levels in different colours? I only get the "Category" styling on all my nodes

    /torbjørn

    Replies
    1. Normally you should be able to click on the Label names at the top of the "pane", and then at the bottom the different styling options would appear. Choose those and then you can normally customize them...

      See http://neo4j.com/developer/guide-neo4j-browser/#_styling_neo4j_browser_visualization for some info? Let me know if that helps.

      Rik
