Showing posts with label graph.

Monday, 6 April 2020

Graphistania 2.0 - Episode 6 - The One with the CovidGraph

So, when I started working with Graphs in 2012, one of the first community use cases that I encountered was all about biotech. I met a few people from the University of Ghent, who were working on some amazing protein interaction networks - and it was fascinating. Over the years, we have done quite a few activities on this, and we have kind of built a nice life sciences and healthcare community around Neo4j. Some amazing work is being done there.

One of the most amazing cases out there has been that of the German Center for Diabetes Research, who have been scouring the scientific universe for ways of finding cures for diabetes. Watch this brief video or read this article to learn more about it:

Why am I telling you this? Well, with the Covid-19 pandemic sweeping around the globe, and many of us being affected in small or big ways, our Neo4j Graph Community has been doing the most interesting things to try and apply the "power of the graph" to this complex and intricate problem. Take a look at covidgraph.org for their work. When I learned about it, I immediately thought about talking to some of the "chief instigators" and inviting them for a podcast interview - which we made happen at record speed :) ...

So here it is: a chat about Covid-19, and about how graphs will help us make sense of the data. Let's hope it proves to be useful.

Friday, 27 March 2020

Supply Chain Management with graphs: part 3/3 - some SCM analytics

I've been looking forward to writing this: this is the last of 3 blogposts that I have been planning to write for weeks about my experiments with a realistic Supply Chain Management dataset. There are two posts before this one:
  • In the first post I found and wrangled a dataset into my favourite graph database, Neo4j
  • In the second post I got acquainted with the dataset in a bit more detail, and I was able to do some initial querying on it to figure out what patterns I might be able to expose.
In this third and last post, I would like to get a bit more analytical with the dataset, and do some more detailed investigation in order to better understand some typical SCM questions. Note that I am far from a Supply Chain specialist - I barely understand the domain, and therefore I will probably be asking some silly questions initially. But bear with me - and let's explore and learn, right?

Wednesday, 25 March 2020

Supply Chain Management with graphs: part 2/3 - some querying

So in the previous post, we got introduced to a dataset that I have been wanting to get into Neo4j for a long time: a Supply Chain Management dataset. Read up about it over here, but the long and short of it is that we got ourselves into the situation where we have an up and running Neo4j database with 38 different multi-echelon supply chains. Result!

As a quick reminder, here's what the data model looked like after the import:

Or visually:


Data validation and profiling

The first thing to do when you have a shiny new dataset like that is, of course, to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file, the 38 of them are quite different from one another. So let's do some profiling:

match (n) return distinct labels(n), count(*)
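Label counts are just the start, of course - a quick extra profiling query along the same lines (my addition) counts the relationships by type:

//profile the relationship types as well
match ()-[r]->()
return distinct type(r), count(*)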

Saturday, 21 March 2020

Supply Chain Management with graphs: part 1/3 - data wrangling and import

Alright, I have been putting the writing of this blogpost off for too long. Finally, on this sunny Saturday afternoon when we are locked inside our homes because of the Covid-19 pandemic, I think I'll try to make a dent in it - I have a lot of stuff to share already.

The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown number) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs. Supply Chains are graphs. So why not store and retrieve these chains with a graph database? Seems obvious.
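To make the pathfinding idea concrete, here's a minimal sketch - with hypothetical (:Location) nodes and :DELIVERS_TO relationships carrying a cost property, not the actual dataset we will import below:

//hypothetical example: fewest-hop path between two locations, and its total cost
match p = shortestpath((a:Location {name:"Factory"})-[:DELIVERS_TO*]->(b:Location {name:"Store"}))
return p, reduce(total = 0, r in relationships(p) | total + r.cost) as totalCost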

We've also had lots of examples of people trying to solve supply chain management problems in the past. Take a look at some of these examples:
And of course some of these presentations from different events that we organised:
So I had long thought that it would be great to have some kind of a demo dataset for this use case. Of course it's not that difficult to create something hypothetical yourself - but it's always more interesting to work with real data - so I started to look around.

Monday, 16 March 2020

Graphistania 2.0 - Episode 5 - This Month in Neo4j

Friends.

These are interesting times. These are difficult times, but we can deal with them together, as a community, as a graph. So that's why we were super happy that, just as Belgium was going into lockdown last week, we were able to record another Graphistania podcast episode for you, talking about the world in general, but also covering some of the amazing graph use cases that drifted over our screens in the past month, in the This Week in Neo4j (TWIN4J) newsletter.

There were actually many things to talk about, in terms of fascinating graph use cases, and I will highlight only the most striking ones here.
  • Our friends at Kineviz did some really interesting and timely work on COVID-19 temporal and spatial data visualization. This stuff is really important to understand, as pandemic spreads clearly follow graph patterns. Read Connected if you are not convinced.
  • Worth highlighting: Bloodhound, the Windows network penetration testing tool built on Neo4j, had a new release that you might want to take a look at. If you are not familiar with Bloodhound yet, you may also want to check out my interview with the Bloodhound crew on this podcast a while back.
  • We published this fun little thing called a Neo4j Treasure Map - check it out!
  • We also have a Winegraph! It's a great example of importing data from the web using Norconex.
  • Some interesting stuff on using Neo4j for Gene ID mapping: take a look!
  • Another example of enriching graphs with Wikidata, from the one and only Mark Needham: look at Mark's blog over here!
  • Don't forget: we introduced the Neo4j Graph Data Science plugin, with examples from the "Graph Algorithms" book.
  • A really interesting tweet about a visualisation of the US Supreme Court as a graph db... Would love to see more like that.
  • And for some fun: Pokégraph: Gotta Graph 'Em All!
  • Some important stuff: we did a great 4.0 webinar that gives you a lot of info on what to expect in the new version of Neo4j.
  • There was a great update to NeoMap: visualizing shortest paths with neomap ≥ 0.4.0 and the Neo4j Graph Data Science plugin.
Those were the most important ones. So let's talk about these now - I am sure there's a lot of cool stuff here for everyone!

Wednesday, 3 August 2016

Graphs @ Radiolab

So I go for my morning run the other day, and I put on my 2nd dearest podcast (after Graphistania, of course) - Radiolab. They have the most amazing stories that make me laugh, cry, read and research - and guess what: this episode is about GRAPHS!

Listen to this episode, telling the amazing story of connectedness between soil, fungi, trees, and animals... aka the Wood Wide Web. The "internet of fungus", as the Beeb calls it.

Check out this TED talk too, or this article about the "Intelligent Plant" and the connections that exist there.

If ever we needed more proof:

(GRAPHS)-[:ARE]->(EVERYWHERE),

even on your daily podcast!

Cheers

Rik

Thursday, 16 June 2016

Roadtripping for openCypher

This week, Andrés Taylor and I have been on the road to talk to our beloved Neo4j community about openCypher, our effort to deliver a full and open specification of the industry’s most widely adopted graph database query language: Cypher. It's been a fun and crazy couple of days, with Amsterdam on Tuesday, Paris on Wednesday - and today, I believe, is Thursday, so we must be in London :) ... We are doing a similar talk tonight in our London office...

Wednesday, 13 January 2016

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

After having created the GraphBlogGraph in a Google Spreadsheet in part 1, and having imported it into Neo4j in part 2, we can now start having some fun analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore some initial things that I am sure you could then elaborate and extend upon.

Let’s start with a simple query

// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
return b.name, p.title, count(r)
order by count(r) desc
Run this in the Neo4j browser and we get:

or just return the graphical result with a slightly different query:

match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
//aggregate per page first - carrying r and p2 through the aggregation would make every count equal to 1
with b, p, count(r) as count
order by count DESC
limit 50
match (b)--(p)-[r:LINKS_TO]->(p2:Page)
return b,p,r,p2

And then you start to see that Max De Marzi is actually the “king of linking”: he links his pages to other web pages a lot (which is actually very good for search engine optimization).

A quick visit to one of Max’s pages does actually confirm that: there are a lot of cool, bizarre, but always interesting links on Max’s blogposts:
So let’s do another query. Let’s look at the different links that exist between blogposts of our blog-authors. Are they actually quoting/referring to one another or not? Let’s do

//links between blogposts
MATCH p=((n1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog))
RETURN p;

and then we actually find that there are some links - but not that many.
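To put a number on “not that many”, here's a quick count variant (my addition) that tallies the cross-links between each pair of blogs:

//count the links between each pair of blogs
match (b1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog)
where b1 <> b2
return b1.name, b2.name, count(*) as links
order by links desc;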


Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*]-(b3))
return p2 as paths

Then we actually see some more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages - and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:

Interesting, right? I think so, at least!
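By the way, if you want to cap that exploration at a given depth explicitly, the variable-length pattern can be bounded - a small variant of the query above:

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*..4]-(b3))
return p2 as paths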

Then let’s do some more playing around, looking at the most linked-to pages:

//Which pages are being linked to most
match ()-[r:LINKS_TO]->(p:Page)
return p.url, count(r)
order by count(r) DESC
limit 10;

That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:
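A small tweak of that query (my addition, not from the original post) counts distinct linking pages instead of raw links, which irons out pages that link to the same target over and over:

//which pages are being linked to from the most distinct pages
match (p1:Page)-[:LINKS_TO]->(p:Page)
return p.url, count(distinct p1) as inboundPages
order by inboundPages desc
limit 10;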

Last but not least, I wanted to revisit an old and interesting way of running PageRank on Neo4j using Cypher (rather than the GraphAware NodeRank module). I blogged about it some time ago, and it’s actually really interesting and easy to do. Here’s the query:

UNWIND range(1,50) AS round
MATCH (n:Page)
WHERE rand() < 0.1
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1

This does 50 iterations of PageRank, using a 0.1 damping factor and a maximum depth of 10. Running it is surprisingly quick:

If you do that a couple of times, and even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:
match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;
This confirms the “spider in the web” theory that I mentioned above: Michael rules the links!
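If you want to re-run the experiment from a clean slate, a little helper query (my addition) wipes the scores first:

//reset the rank scores before a fresh run
match (n:Page)
where n.rank is not null
remove n.rank;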


All of these queries are of course on Github for you to play around with. Would love to hear your thoughts on these three blogposts, and hope that they were as fun for you to read as they were for me to write.

All the best.

Rik

Monday, 11 January 2016

The GraphBlogGraph: 2nd blogpost out of 3

Importing the GraphBlogGraph into Neo4j

In the previous part of this blog-series about the GraphBlogGraph, I talked a lot about creating the dataset for what I wanted: a graph of blogs about graphs. I was able to read the blog-feeds of several cool graphblogs with a Google spreadsheet function called “ImportFEED”, and scrape their pages using another function called “ImportXML”. So now I have the sheet ready to go, and we also know that with a Google spreadsheet, it is really easy to download that as a CSV file:

You then basically get a URL for the CSV file (from your browser’s download history):

and that gets you ready to start working with the CSV file:

I can work with that CSV file in Cypher’s LOAD CSV command, as we know. All we really need is to come up with a solid Graph Model to do what we want to do. So I went to Alistair’s Arrows, and drew out a very simple graph model:



So that basically gets me ready to start working with the CSV files in Cypher. Let’s run through the different import commands that I ran to do the imports. All of those are on github of course, but I will take you through them here too...

First, create the index and the uniqueness constraint:

create index on :Blog(name);
create constraint on (p:Page) assert p.url is unique;

Then manually create the blog-nodes:

create (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"});
create (n:Blog {name:"Neo4j Blog", url:"http://neo4j.com/blog"});
create (n:Blog {name:"JEXP Blog", url:"http://jexp.de/blog/"});
create (n:Blog {name:"Armbruster-IT Blog", url:"http://blog.armbruster-it.de/"});
create (n:Blog {name:"Max De Marzi's Blog", url:"http://maxdemarzi.com/"});
create (n:Blog {name:"Will Lyon's Blog", url:"http://lyonwj.com/"});

I could have done that from a CSV file as well, of course. But hey - I have no excuse - I was lazy :) … Again…
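For completeness, the CSV route would have looked something like this - a sketch, assuming a hypothetical blogs.csv file with Name and URL columns:

//create the blog nodes from a csv file instead
load csv with headers from "file:///blogs.csv" as csv
create (b:Blog {name: csv.Name, url: csv.URL});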

Then I can start with importing the pages and links for the first (my own) blog, which is at blog.bruggen.com and has a feed at blog.bruggen.com/feeds/posts/default:

//create the Bruggen blog entries
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
match (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"})
create (p:Page {url: csv.URL, title: csv.Title, created: csv.Date})-[:PART_OF]->(b);

This just creates the 20 leaf nodes from the Blog node. The fancy stuff happens next, when I read from the “Links” column, which holds the “****”-separated links to other pages, split them up into individual links, and merge the pages and create the links to them. I use some fancy Cypher magic that I have also used before for Graph Karaoke: I read the cell, split it into parts that go into a collection, and then unwind the collection and iterate through it using an index:

//create the link graph
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
with csv.URL as URL, csv.Links as row
unwind row as linklist
//split the "****"-separated cell into a collection of trimmed urls
with URL, [l in split(linklist,"****") | trim(l)] as links
//iterate through the collection by index
unwind range(0,size(links)-2) as idx
//find-or-create each linked page, then link the source page to it
MERGE (l:Page {url:links[idx]})
WITH l, URL
MATCH (p:Page {url: URL})
MERGE (p)-[:LINKS_TO]->(l);

So this first MERGEs the new pages (finds them if they already exist, creates them if they do not yet exist) and then MERGEs the links to those pages. This creates a LOT of pages and links because, of course - like with every blog - there are a lot of hyperlinks that are the same on every page of the blog (essentially the “template” links that are used over and over again).
And as you can see it looks a little bit like a hairball when you look at it in the Neo4j Browser:
So in order to make the rest of our GraphBlogGraph explorations a bit more interesting, I decided that it would be useful to do a bit of cleanup on this graph. I wrote a couple of Cypher queries that remove the “uninteresting”, redundant links from the Graph:

//remove the redundant links
//linking to pages with same url (eg. archive pages, label pages...)
match (b:Blog {name:"Bruggen"})<-[:PART_OF]-(p1:Page)-[:LINKS_TO]->(p2:Page)
where p2.url starts with "http://blog.bruggen.com"
and not ((b)<-[:PART_OF]-(p2))
detach delete p2;
//linking to other posts of the same blog
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"})<-[:PART_OF]-(p2:Page),
(p1)-[lt:LINKS_TO]-(p2)
delete lt;

//linking to itself
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"}),
(p1)-[lt:LINKS_TO]-(p1)
delete lt;

//linking to the blog provider (Blogger)
match (p:Page)
where p.url contains "//www.blogger.com"
detach delete p;

Which turned out to be pretty effective. When I run these queries I weed out a lot of “not so very useful” links between nodes in the graph.
And the cleaned-up store looks a lot better and is much more workable.
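A quick sanity check (my addition), run before and after the cleanup, makes the effect visible:

//how many pages and links are left?
match (p:Page)
with count(p) as pages
match ()-[r:LINKS_TO]->()
return pages, count(r) as links;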

If you take a look at the import script on github, you will see that there’s a similar script to the one above for every one of the blogs that we set out to import. Copy and paste them into the browser one by one, run them in the neo4j shell, or use LazyWebCypher, and have fun:
So that’s it for the import part. Now there’s only one thing left to do, in Part 3/3 of this blogpost series, and that is to start playing around with some cool queries. Look for that post in the next few days.

Hope this was interesting for you.

Cheers

Rik

Thursday, 7 January 2016

The GraphBlogGraph: 1st blogpost out of 3

Making the GraphBlogGraph

For quite a few years now, I have been hosting my own blog at blog.bruggen.com. It’s been quite an interesting experience, I must say. A long time ago, I started blogging as kind of a personal diary, but… then Facebook and Twitter happened, and it seemed kind of redundant at the time. Then I got to work for Neo4j, and got stuck into the Neo4j community, and I “restarted” my blog to write about my life and work in the Neo4j community. It’s been a very, very fun ride.


So this past Christmas period I had to do a bunch of work for my Orienteering club, and I do most of that work (registering club members, registering for races, managing billing etc.) in Google Sheets. And I came across a couple of really interesting things that I did not know existed - and that I thought would make a super cool graph application. The two things were:
  • an easy and automated way to read a blog “feed” (in Atom or RSS) and put the items into a Google Sheet. This is called the “ImportFEED” function:
    here's the manual - it’s a really interesting piece of functionality.
  • an easy and automated way to parse XML (and therefore, HTML pages) and extract information from that XML using XPATH. This function is called ImportXML:

So my idea was basically very simple: why don’t I use this functionality to read the feeds from a couple of Neo4j-centric blogs that I know (using ImportFEED), and then use the URLs of the pages in the feed to scrape the HTML of the blogpost page with ImportXML, and extract the hyperlinks (<a href=”...”> tags in HTML). That way I could basically look at the graph of links between the different blogs, and see if I could discover anything interesting...


So I did. I will publish a couple of blogposts (!) in the next few days to explain the story.

Reading the GraphBlog-feeds

I got to work. I created a google sheet (which is publicly available for you to view and copy if you want), and listed some of the blogs that I would be interested in.

Name | URL | Feed
Rik Van Bruggen | http://blog.bruggen.com | http://blog.bruggen.com/feeds/posts/default
Michael Hunger | http://jexp.de/blog | http://jexp.de/blog/feed/
Stefan Armbruster | http://blog.armbruster-it.de | http://blog.armbruster-it.de/feed/
Neo4j.com | http://neo4j.com/blog | http://neo4j.com/feed/
Max De Marzi | http://maxdemarzi.com | http://maxdemarzi.com/feed/
Mark Needham | http://www.markhneedham.com/blog/ | http://feeds.feedburner.com/markneedham
Will Lyon | http://www.lyonwj.com/ | http://www.lyonwj.com/atom.xml

I had some others on the list (Linkurio.us blog, GraphAware blog, GrapheneDB blog) but I could not immediately find the feeds of these blogs… maybe some day :)) …


So the next thing I did was use ImportFEED to load the data of these feeds into a sheet of the workbook. The feeds actually look like this:
But with the ImportFEED function, it is really trivial to get that into a workable format. I used the following three formulae to load the created date (“items created”), the title (“items title”) and the URL (“items url”) of the last 20 posts in the feed into three columns:


=importfeed("http://blog.bruggen.com/feeds/posts/default","items created",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items title",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items url",TRUE, 20)


The result was actually super cool: a sheet for every blog, with date, title and url information for that particular blog.


Crawling the GraphBlog-pages

Next, I wanted to do some webpage crawling/scraping/whatever you want to call it, with ImportXML. That’s where the following formula comes in:


=IMPORTXML(D2, "//a/@href")


Which gives me an array like so:


Now, what I obviously want to do later on is import these things into a graph database, so I really wanted to get all of these links together into one “big” cell. I decided to use a JOIN function to do that: with the following JOIN, I can put all these links into a single cell of the spreadsheet, separated by a delimiter (“****” in this case):


=join("****",sort(unique(IMPORTXML(D2, "//a/@href"))))


This way, we get one long text string in each of these cells:


//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Label&widgetId=Label1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=PageList&widgetId=PageList1&action=editWidget&sectionId=crosscol****http://blog.bruggen.com/****http://blog.bruggen.com/2013_01_01_archive.html****http://blog.bruggen.com/2013_03_01_archive.html****http://blog.bruggen.com/2013_04_01_archive.html****http://blog.bruggen.com/2013_05_01_archive.html****http://blog.bruggen.com/2013_06_01_archive.html****http://blog.bruggen.com/2013_07_01_archive.html****http://blog.bruggen.com/2013_08_01_archive.html****http://blog.bruggen.com/2013_09_01_archive.html****http://blog.bruggen.com/2013_10_01_archive.html****http://blog.bruggen.com/2013_11_01_archive.html****http://blog.bruggen.com/2013_12_01_archive.html****http://blog.bruggen.com/2014_01_01_archive.html****http://blog.bruggen.com/2014_02_01_archive.html****http://blog.bruggen.com/2014_03_01_archive.html****http://blog.bruggen.com/2014_04_01_archive.html****http://blog.bruggen.com/2014_05_01_archive.html****http://blog.bruggen.com/2014_06_01_archive.html****http://blog.bruggen.com/2014_07_01_archive.html****http://blog.bruggen.com/2014_08_01_archive.html****http://blog.bruggen.com/2014_09_01_archive.html****http://blog.bruggen.com/2014_10_01_archive.html****http://blog.bruggen.com/2014_11_01_archive.html****http://blog.bruggen.com/2014_12_01_archive.html****http://blog.bruggen.com/2015_01_01_archive.html****http://blog.bruggen.com/2015_02_01_archive.html****http://blog.bruggen.com/2015_03_01_archive.html****http://blog.bruggen.com/2015_04_01_archive.html****http://blog.bruggen.com/2015_05_01_archive.html****http://blog.bruggen.com/2015_06_01_archive.html****http://blog.bruggen.com/2015_07_01_archive.html****http://blog.bruggen.com/2015_08_01_archive.html****http://blog.bruggen.com/2015_09_01_archive.html****http://blog.bruggen.com/2015_10_01_archive.html****http://blog.bruggen.com/2015_11_01_archive.html**** etc etc etc


Which is fine, because I know how to split this cell into individual “blog links” again. What I have now is a spreadsheet containing the blog feed, and all of the links that go from the individual blog pages to other pages. Nice!
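As a tiny preview of what's to come: in Cypher, unpicking such a cell is a one-liner - an illustrative snippet with made-up urls:

return [l in split("http://a.com****http://b.com****http://c.com","****") | trim(l)] as links;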


In the next part of this series, I will be importing that spreadsheet into Neo4j, and then we can start playing around with it.


I hope you enjoyed this blogpost so far. I will publish part 2 in a few days, for sure.


Cheers

Rik