Showing posts with label apoc. Show all posts

Monday, 20 December 2021

Cognitive biases in Neo4j

I am an economist/engineer. I studied "Commercial Engineering" in Belgium in the nineties, and was quite an avid learner of economic theories large and small at the time. I did, however, always find myself somewhat uneasy with economists' insistence on the rationality of homo economicus, as I knew, and observed all around me, that people were far from rational. That's why, ever since I learned of its existence, I have been a big fan of the field of behavioral economics - which actually tries to formulate economic theories that are realistic, and oftentimes, irrational. I fondly remember first reading Dan Ariely's Predictably Irrational, and learning about some of the crazy biases that he observed and described. And Nobel-prize-winning Daniel Kahneman has been a hero for decades. I think about the Framing Effect and Prospect Theory almost on a daily basis.

It all started with a tweet

So you can imagine my excitement when I learned about this tweet:

Monday, 6 December 2021

Revisiting contact tracing with Neo4j 4.4's transaction batching capabilities

Yes! It's been a few months, but Saint Nicholas just brought us a brand new and shiny release of Neo4j 4.4 to play with. One of the key features is a generic transaction batching capability, similar to what we have been using in apoc.periodic.iterate, but now built right into the core of the database. It is referred to as the CALL { ... } IN TRANSACTIONS capability - and of course it is a really interesting feature.

So in this article I will be revisiting this blogpost, but without the need for APOC's apoc.periodic.iterate procedure. Let's see how that goes.
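As a first taste, the new syntax looks something like this - a minimal sketch, where the Person label and the processed property are just placeholders:

```cypher
// batch an update over all Person nodes, committing every 10000 rows
// (run with ":auto" in Neo4j Browser, as it needs an implicit transaction)
MATCH (p:Person)
CALL {
  WITH p
  SET p.processed = true
} IN TRANSACTIONS OF 10000 ROWS
```

The inner CALL block is executed in separate transactions of 10000 rows each, which keeps the memory footprint of big updates under control - exactly what apoc.periodic.iterate used to do for us.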

Create a synthetic contact tracing graph - size of Antwerp

The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html to see how that went. The key thing to recall there is that I was using the fantastic faker plugin. You can download it yourself from the github page. Installation is super easy - just make sure the config is updated too, and that you have whitelisted fkr.* just like you do with gds.* and apoc.*.

As with the previous post, I will be pushing the scale up to the size of my home city of Antwerp, Belgium. And crucially, we will not even use APOC - we will use the transaction batching instead.

Tuesday, 5 October 2021

ReBeerGraph: importing the Belgian BeerGraph straight from a Wikipedia HTML page

I have written about beer a few times, also on this blog. I have figured out a variety of ways to import The Wikipedia Page with all the Belgian beers into Neo4j over the years. Starting with a spreadsheet-based approach, then importing it from a Google Sheet (with its great automated .csv export facilities), and then building on that functionality to automatically import the Wikipedia page into a spreadsheet using some funky IMPORTHTML() functions in Google Sheets.

But: all of the above have started to crumble recently. The Wikipedia page is actually kind of a difficult thing to parse automatically, as it splits the dataset up into many different HTML tables (which means I really need to import multiple datasets), and it also seems like Wikipedia has added an additional column to its data (the "Timeframe" or "Period" in which a beer was brewed), which has lots of missing, and therefore empty, cells. All of that messes up the IMPORTHTML() Google Sheet that I had been using to automatically import the page into a gsheet.

So: I had been on the lookout for a different way of doing the automated import. And recently, while I was working on another side project (of course), I actually bumped into it. That's what I want to cover and demonstrate here: importing the data, directly from the page, without intermediate steps, automatically, using the apoc.load.html functionality.
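To give you a feel for it, a call along these lines pulls the rows out of the page's tables. Note that the URL and the CSS selector here are assumptions - the actual selector depends on the page's current markup:

```cypher
// load the HTML page and select all rows of the wikitable-styled tables
CALL apoc.load.html(
  "https://nl.wikipedia.org/wiki/Lijst_van_Belgische_bieren",
  {rows: "table.wikitable tr"}
) YIELD value
UNWIND value.rows AS row
RETURN row.text
LIMIT 10
```

Each selected element comes back as a map with its tag name, attributes and text, which we can then pick apart with regular Cypher.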

Saturday, 24 April 2021

Making sense of the news with Neo4j, APOC and Google Cloud NLP

Recently I was talking to one of our clients who was looking to put in place a knowledge graph for their organisation. They were specifically interested in better monitoring and making sense of the industry news for their organisation. There's a ton of solutions to this problem, and some of them seem like really simple, out-of-the-box toolsets that you could implement just by handing over your credit card details - off-the-shelf stuff. No doubt that could be an interesting approach, but I wanted to demonstrate to them that it could be much more interesting to build something - on top of Neo4j. I figured that it really could not be too hard to create something meaningful and interesting - so I whipped out my Cypher skills and started cracking to see what I could do. Let me take you through that.

The idea and setup

I wanted to find an easy way to aggregate data from a particular company or topic, and import that into Neo4j. Sounds easy enough, and there are actually a ton of commercial providers out there that can help with that. I ended up looking at Eventregistry.org, a very simple tool - that includes some out of the box graphyness, actually - that allows me to search for news articles and events on a particular topic.

So I went ahead and created a search phrase for specific article topics (in this case "Database", "NoSQL", and "Neo4j") on the Eventregistry site, and got a huge number of articles (46k!) back. 
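Once the articles are in the graph, APOC's Google Cloud NLP integration can extract entities from them. A sketch of what that can look like - assuming Article nodes with a body property holding the text, and an API key passed in as a parameter; adjust to your own setup:

```cypher
// run Google Cloud NLP entity extraction over the article bodies,
// writing (:Article)-[:HAS_ENTITY]->(...) structures back into the graph
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
  key: $apiKey,                        // your Google Cloud NLP API key
  nodeProperty: "body",                // property that holds the text
  writeRelationshipType: "HAS_ENTITY",
  write: true
}) YIELD graph
RETURN graph
```

The extracted entities then become first-class nodes that we can query, aggregate and link across articles.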

Monday, 29 March 2021

Part 1/3: Wikipedia Clickstream analysis with Neo4j - the data import

Alright, here's a project that has been a long time in the making. As you may know from reading this blog, I have had an interest, a fascination even, with all the wonderful use cases that the graph ecosystem holds. To me, it continues to be such a fantastic thing to be able to work in - graphs are everywhere, and more and more people are waking up to the fact that they really should look at their data as a network, and leverage the important relationships that are often hidden from plain sight.

One of these use cases that has been intriguing me for years, literally, is clickstream analysis. In fact, I wrote about this already back in 2013 - amazing when you think about it. Other people, like our friends at Snowplow Analytics, have been writing about this as well, but somehow the use case has been snowed under a little maybe. With this blogpost, I want to illustrate why I think that this particular use case - which is really a typical pathfinding application when you think about it, is such a great fit for Neo4j.

A real dataset: Wikipedia clickstream data

This crazy journey obviously started with finding a good dataset. There's quite a few of them around, but I wanted to find something realistic, representative and useful. So after some digging around I found the fantastic site of Wikimedia, where they actually make all aggregated clickstream data of Wikipedia's pages available in a structured way. You can just download the latest zipped-up files from their website. In this blogpost, I worked with the February 2021 data, which you can find over here.

When you download that file, you will find a tab-separated text file that includes the following 4 fields:
  • prev: the previous page that the navigation came from
  • curr: the current page that the navigation came into
  • type: the description of the type of navigation that was occurring. There are several possible values here:
    • link: a regular link between pages
    • external: a link from an external page to the current page
    • other: a different type - which can occur if people try to hide their navigation patterns
  • n: the number of occurrences of the (prev, curr) pair - so the number of times this navigation took place.
So this is the dataset that we want to import into Neo4j. But - we need to do one tiny little fix: we need to escape the " characters that are in the dataset. To do that, I just opened the file in a text editor (eg. TextEdit on OSX) and did a simple Find/Replace of " with "". That takes care of it.
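With that fixed, the import itself can be sketched like this. The Page label, LINKS_TO relationship and property names are my own choices, and the file is assumed to sit in Neo4j's import directory:

```cypher
// import the (prev, curr, type, n) tuples, committing in batches
USING PERIODIC COMMIT 50000
LOAD CSV FROM "file:///clickstream-enwiki-2021-02.tsv" AS line FIELDTERMINATOR '\t'
MERGE (prev:Page {title: line[0]})
MERGE (curr:Page {title: line[1]})
MERGE (prev)-[r:LINKS_TO]->(curr)
SET r.type = line[2],
    r.n = toInteger(line[3])
```

The FIELDTERMINATOR clause is the key detail here, since the file is tab-separated rather than comma-separated.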

Friday, 27 March 2020

Supply Chain Management with graphs: part 3/3 - some SCM analytics

I've been looking forward to writing this: this is the last of 3 blogposts that I have been planning to write for weeks about my experiments with a realistic Supply Chain Management Dataset. There's two posts before this one:
  • In the first post I found and wrangled a dataset into my favourite graph database, Neo4j
  • In the second post I got acquainted with the dataset in a bit more detail, and I was able to do some initial querying on it to figure out what patterns I might be able to expose.
In this third and last post I would like to get a bit more analytical with the dataset, and do some more detailed investigation in order to better understand some typical SCM questions. Note that I am far from a Supply Chain specialist - I barely understand the domain, and therefore I will probably be asking some silly questions initially. But bear with me - and let's explore and learn, right?

Wednesday, 25 March 2020

Supply Chain Management with graphs: part 2/3 - some querying

So in the previous post, we got introduced to a dataset that I have been wanting to get into Neo4j for a long time: a Supply Chain Management dataset. Read up about it over here, but the long and short of it is that we got ourselves into the situation where we have an up and running Neo4j database with 38 different multi-echelon supply chains. Result!

As a quick reminder, here's what the data model looked like after the import:

Or visually:


Data validation and profiling

The first thing to do when you have a new shiny dataset like that, is of course to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file that they are quite different between the 38 of them. So let's do some profiling:

match (n) return distinct labels(n), count(*)
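And the equivalent profiling query for the relationships:

```cypher
match ()-[r]->() return distinct type(r), count(*)
```

Between those two, you get a quick census of the whole graph before diving into anything more specific.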

Saturday, 21 March 2020

Supply Chain Management with graphs: part 1/3 - data wrangling and import

Alright, I have been putting the writing of this blogpost off for too long. Finally, on this sunny Saturday afternoon where we are locked inside our homes because of the Covid-19 pandemic, I think I'll try to make a dent in it - I have a lot of stuff to share already.

The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown quantity) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs, if you think about it. Supply Chains are graphs. So why not store and retrieve these chains with a graph database? Seems obvious.
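To make the pathfinding point concrete, a query like this one would trace how two points in a chain connect. The Location label, DELIVERS_TO relationship type and names are hypothetical - substitute your own model:

```cypher
// find the shortest chain of deliveries between two locations
MATCH (origin:Location {name: "Supplier A"}),
      (destination:Location {name: "Customer Z"}),
      p = shortestPath((origin)-[:DELIVERS_TO*]-(destination))
RETURN p, length(p) AS hops
```

That single pattern - variable-length, cheap to traverse - is exactly what relational joins struggle with.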

We've also had lots of examples of people trying to solve supply chain management problems in the past. Take a look at some of these examples:
And of course some of these presentations from different events that we organised:
So I had long thought that it would be great to have some kind of a demo dataset for this use case. Of course it's not that difficult to create something hypothetical yourself - but it's always more interesting to work with real data - so I started to look around.

Monday, 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost.  Part 1 and Part 2 are also published.)

Alright this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There's two posts that came before this article:
Now we are going to spend some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of some data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you that have been following Neo4j for some time may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for its indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

As I started to think about some text-oriented queries, I quickly realised that I would need an index on the Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:
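For fulltext search, Neo4j 3.5 and later actually provide dedicated procedures instead of the schema index syntax above. A sketch - the index name here is my own choice:

```cypher
// create a Lucene-backed fulltext index on the Email text property
CALL db.index.fulltext.createNodeIndex("emailText", ["Email"], ["text"]);

// and query it with a Lucene query string
CALL db.index.fulltext.queryNodes("emailText", "benghazi")
YIELD node, score
RETURN node.text, score
ORDER BY score DESC
LIMIT 10;
```

Unlike a schema index, this gives us tokenized, scored matching over the full body of the emails.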


Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost.  Part 1 and Part 3 are also published.)

In the previous post around the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little more easy to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db
match (n) return labels(n), count(n)

and: 

//what rels are in the db
MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start losing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler from https://install.graphapp.io/ and install it into your Neo4j Desktop really quickly.

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole where I came across the corpus of emails that were published in the aftermath of the 2016 US presidential elections. They have been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kaggle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's this graphgist, Linkurio.us' blogpost, as well as Ryan Boyd's older developer advocate article. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple datamodel that we can work with, which would look something like this:


Monday, 2 December 2019

Part 4/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright here goes part 4 of 4 of my work on the Carrefour shopping receipts dataset. I realize we have come quite a way - and for me too there has been a lot to talk about and explore in these blogposts. Even then I feel like there's a ton of other interesting questions that we could ask and answer - but that would lead us too far.
Just to recap:


Now, in this final part of this series, I want to see if we can do some more analytical work with this dataset, for example by applying some algorithms to it. More specifically, I want to use some of our graph similarity algorithms to figure out which products are supposedly similar to one another - and do that along multiple axes. 

People have written long and complicated doctorates about the best way to calculate and establish similarities in graphs - and most of it is very much beyond me and my reptile math brain. But one thing is clear: many of the algorithms have very different approaches to doing this, and there are good reasons for wanting to choose or abandon one or the other. However, in our daily Neo4j work, we have seen some particularly interesting results with the Jaccard similarity algorithm, which is part of the algos plugin to Neo4j.

Jaccard similarity

The simple explanation of what Jaccard similarity does, is that it calculates a coefficient that compares members of two sets to see which members are shared and which are distinct. So it's a measure of similarity for two sets of data - with a range from 0% (not similar at all) to 100% (identical). Higher scores mean higher similarity between the two populations. Jaccard similarity is sometimes referred to as "Intersection over Union", because it is calculated as J(A,B) = |A ∩ B| / |A ∪ B|:


I borrowed most of this explanation from the inevitable Wikipedia of course. You can find the Neo4j algo library that contains this algorithm over here.
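Against the Carrefour model, running Jaccard with the algos plugin looks roughly like this - treating each product's collection of tickets as its "set". The similarity cutoff is an arbitrary choice of mine:

```cypher
// compare products by the set of tickets they appear on
MATCH (p:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)
WITH {item: id(p), categories: collect(id(t))} AS productData
WITH collect(productData) AS data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5})
YIELD item1, item2, similarity
RETURN algo.getNodeById(item1).description AS product1,
       algo.getNodeById(item2).description AS product2,
       similarity
ORDER BY similarity DESC
LIMIT 10
```

Two products that appear on largely the same tickets will score close to 1, which is exactly the "bought by the same people" notion of similarity we are after.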

Friday, 29 November 2019

Part 3/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.

In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.

And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:

//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchSize:50000, iterateList: true, parallel: false})

In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that
nCr = n(n - 1)(n - 2) ... (n - r + 1)/r! = n! / r!(n - r)!
I would have 53588! / (2! * 53586!) = 1435810078 combinations of products possible. See the StatTrek website for the calculator :) ... on top of that I realised that ALL of these combinations are probably not that interesting for us - maybe we should try to make this a bit more specific?


Thursday, 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:
In summary, we have
  • about 585k shopping tickets in the dataset, 
  • that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
  • from 2 different Carrefour malls, 
  • from 66k different Carrefour clients
  • with about 53k different products
This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!
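For instance, a first query along these lines finds the most popular products - using the labels and relationship type from the model above, with the description property name as an assumption:

```cypher
// top 10 most frequently bought products
MATCH (t:Ticket)-[:TICKET_HAS_PRODUCT]->(p:Product)
RETURN p.description AS product, count(t) AS nrOfTickets
ORDER BY nrOfTickets DESC
LIMIT 10
```

From there we can slice the same pattern by client, by mall, or over time.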

Wednesday, 27 November 2019

Part 1/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright here we go again. In an effort to do some more writing, blogging, podcasting, for our wonderful Neo4j community, I wanted to get back into a routine of playing with some more datasets in Neo4j. A couple of weeks ago I was able to play a bit with a small dataset from Colruyt Group, and I wrote about it over here. And I don't know exactly how it happened, but in some weird way I got my hands on another retailer's data assignment - this time from Carrefour.

You will notice that this will be another series of blogs: there's just too much stuff here to put into one simple post. So after having done all the technical prep for this article, it seems most logical to split it into 4 parts:

  1. part 1 (this article) will cover the data modeling, the import of the dataset, and some minor wrangling to get the dataset into a workable format.
  2. part 2 (to follow) will cover a couple of cool queries to acquaint ourselves with the dataset.
  3. part 3 (to follow) will cover a specific - and quite complicated - investigation into the product combinations that people have been buying at Carrefour - to see if we can find some patterns in there.
  4. part 4 (to follow - and this is the final part) will look at some simple graph algorithms for analytics that we ran.

That should be plenty of fun for all of us. So let's get right into it.

The Carrefour Basket dataset

As I finished up the Colruyt article referenced above, I was actually originally just looking for some spatial information on other supermarket chains' positioning of shops in Belgium. I wanted to see if I could create some simple overlay views of where which shops were - and started browsing the interweb for data on supermarket locations. That very quickly led to something completely different: I found this website for TADHack Global ("Telecom Application Developer Hackathon", apparently, is what it stands for), a 2016 event where people could investigate different datasets and use them to hack together some cool stuff. In that 2016 event, there was an assignment from Carrefour: the Carrefour Delighting Customers Challenge Basket Data set.

Wednesday, 28 November 2018

Working with the ICIJ Medical Devices dataset in Neo4j

Just last weekend our friends at the ICIJ published another really interesting case of investigative journalism - tracking down and publishing the quite absurd and disturbing practices of the medical devices industry. The entire case with all of the developing stories can be found at https://medicaldevices.icij.org/ - take a look as it really is quite fascinating. Of course that meant that I wanted to see what that data looked like in Neo4j, and if I could have a play. I didn't have time for a full detailed exploration yet - but hopefully this will also give others the opportunity to chime in. So let's see.

The Medical Devices dataset as a graph

This turned out to be surprisingly easy. Just download the Zip file from the ICIJ website: https://medicaldevices.icij.org/download/icij-imddb-2018-11-25.zip, unzip this, and then we get 3 comma-separated-values files:
  • one for the Devices that are being reported on
  • one for the Events that are being reported (whenever something happens to a device, e.g. a recall, that is logged and reported)
  • one for the Manufacturers of the medical devices.
That's easy enough.
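A first import sketch could look like this. The file names, column headers and relationship type here are assumptions - check the headers of the actual .csv files in the zip:

```cypher
// load the manufacturers first...
LOAD CSV WITH HEADERS FROM "file:///manufacturers.csv" AS row
MERGE (m:Manufacturer {id: row.id})
SET m.name = row.name;

// ...then the devices, linked to their manufacturer
LOAD CSV WITH HEADERS FROM "file:///devices.csv" AS row
MERGE (d:Device {id: row.id})
SET d.name = row.name
WITH d, row
MATCH (m:Manufacturer {id: row.manufacturer_id})
MERGE (d)-[:MANUFACTURED_BY]->(m);
```

The Events file would follow the same pattern, linking each event to the device it concerns.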

Wednesday, 31 October 2018

Data Lineage in Neo4j - an elaborate experiment

For the past couple of years, I have had a LOT of conversations with users and customers of Neo4j that have been looking at graph databases for solving Data Lineage problems. Now, at first, that seemed like a really fancy new word used only by hipster technovangelists to try to appear interesting, but once I drilled into it, I found that it’s actually something really interesting and a really cool application of graph databases. Read more on the background of it on wikipedia (as always), or just live with this really simple definition:
“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.”
That’s easy enough. Fact is that it’s a really big problem for large organisations - specifically financial institutions as they have to comply with regulations like the Basel Committee on Banking Supervision's standard number 239 - which is all about assuring data governance and risk reporting accuracy.

Here’s a couple of really nice articles and videos that should really give you quite a bit of background.
 

Friday, 20 October 2017

Podcast Interview with Marco Falcier and Alberto d'Este, Neo4j Versioner

Just before GraphConnect, I thought I would publish another podcast on a subject that many people have been pondering - and even struggling with. Ever since Ian Robinson wrote about it on his blog and in the O'Reilly Graph Databases book, it's been a great topic of interest for many people in many different use cases: how can I keep track of versions in a graph? How can I look at the state of the graph at a particular moment in time - in other words, travel through time? Aleksa Vukotic presented some of this in a real-world application too, but the guests on this episode of the podcast decided they wanted a more generic solution - and so they got out their coding hats and got cracking. Here's my conversation with them:
Here's the transcript of our conversation:
RVB: 00:02.467 Hello everyone. My name is Rik. Rik Van Bruggen from Neo Technology. And I keep making the same mistake. It's no longer Neo Technology. It's Neo4j now. That's our name. And here we are recording another weekly podcast for the Graphistania podcast. And tonight, I have two people from Italy on the other side of this Skype call. And I'm really jealous of them because they're in the lovely Venice, north of Italy. And that's Marco Falcier and Alberto d'Este. Hello guys.

Thursday, 28 September 2017

Podcast Interview with Tomasz Bratanic

The following podcast recording is a funny one. First of all because it took me and Tomasz like 3 retries and multiple scheduling rounds to get this thing done - and secondly because of who Tomasz is and what he is doing in our graph community. As you will read below, Tomasz is not just a novice to graphs - he's a novice to IT in general. A year ago, he was still filling his days with online poker - and now he is implementing and enhancing some of the coolest parts of Neo4j's APOC libraries. DEFINITELY worth a listen - it's an inspiring story.



Friday, 2 December 2016

Exploring the Paris Terrorist Attack network - part 3/3

Previously, on this blog, I had started writing about how we could get some of the data published by a local Belgian newspaper, De Standaard, on the Paris Terrorist Attack Network into Neo4j. In
  • Part 1, we talked about loading the raw JSON data into Neo4j, and then in
  • Part 2, we cleaned up some of the data for easy querying in Neo4j. 
So that's where we are. To wrap things up, I just wanted to illustrate some of the results and queries in Neo4j around some of the most interesting figures in this Terrorist network. I started some of my explorations around a widely reported terrorist, and Belgian national, called Salah Abdeslam.


So let's take a look at Salah in Neo4j.