Bruggen Blog: pole

Showing posts with label pole. Show all posts

Monday, 28 September 2015

Part 3/3: Querying the Global Terrorism Database (aka my POLE database) in Neo4j

In part 1 and part 2 of this blog series, I talked about how we at Neo4j are seeing more and more customers using graphs to model, store and manage their Person-Object-Location-Event relationships. It's a really great use case, and I have been toying with the Global Terrorism Database (GTD) to try and illustrate what graphs bring to the table there. In the first post I explained some of the concepts, and in the second post I was able to import the GTD into Neo4j without too many hickups. So now, in this third and final post, we are ready to do some POLE querying. Excited to see what we can find out.

Start Simple - in Antwerp

You will find all of these queries on Github, of course. But let's take you through some of my experiments, one by one.

We start out by trying to find my hometown, Antwerp, Belgium, in the GTD. In order to do that, we just want to make sure that we have all the indexes in place on our database:

That looks ok. We have indexes on :city(name) in place, so we can do something like this:

match (c:City {name:"Antwerp"})-[r]-()
return c,r;

Looks pretty simple enough, but the result was a bit disturbing. Look at the graph below:

Seems like there are multiple "Antwerps" in the Database, and that is a bit of a pity.

This caused me to look a bit into some of the data quality aspects of the GTD, and I did uncover a bit of an issue. If I run this query:

match (c1:City)--(ps1:ProvState)-->(c:Country),
(c2:City)-->(ps2:ProvState)-->(c)
where c1.name = c2.name
and ps1.name <> ps2.name
return distinct c1.name, c2.name ps1.name, ps2.name, c.name;

Then I unfortunately get this HUGE resultset with 27274 rows of essentially badly entered data.

As part of this blog series I did not have the time or inclination to try and correct the data in any way, but it feels like we could do some data cleansing there.

So let's proceed to find some more terrorist events in my home country:

match (e:Event)-->(ci:City)-->(ps:ProvState)-->(c:Country {name:"Belgium"})
return c.name, e.id;

Good! Only 21 terrorist events in 44 years of history - that's ok. Happy to live in a safe place like that, especially if I compare it to some of the other countries: let's group them and count the number of events per country:

match p=(e:Event)-->(ci:City)-->(ps:ProvState)-->(co:Country)
return distinct co.name as Country, count(e) as NRofEvents
order by NRofEvents DESC

Clearly Iraq and Pakistan are not the safest places, but I was actually surprised to see the UK, the USA and the Philippines in the "top 10". Interesting.

Then of course you can also look at some interesting stats.

Terrible Terrorism stats - the graph does not lie

Let's say we would like to know more about specific dates and find out which dates had the MOST terrorist events, or which had the deadliest events. Querying for that is pretty easy, if we understand that the event-ID that is part of the event-nodes is structured as YYYYMMDDxxxx where xxxx is a sequentially incremented event id... Let's look at a query tp find the deadliest day in this GTD:

match (e:Event)
return left(str(e.id),8), sum(e.nkill), sum(e.nwound), count(e)
order by sum(e.nkill) desc
limit 10

The result is

Or if I just want to look at the sheer number of events, then I just change the ordering:

So it's pretty darn clear that June 14th and June 15th of 2014 were pretty darn special. Let's take a closer look.

match (e:Event)
where left(str(e.id),8)="20140615" and e.nkill is not null
return e
order by e.nkill desc
limit 1

Exploring this a bit further in the Browser gives me:

Now I don't want to bore you with too many details, but... this one is quite amazing: the Camp Speicher massacre, committed by ISIS in Tikrit, Iraq. The story is unbelievable and sad, as told by one of its survivers here and here. Warning: shocking stuff.

After that depressing find, I actually started to dig around with some specific attacks in mind. So let's look at these.

Digging into some specific terrorist events

So the first one I decided to take a look at - not sure why, but anyway - was the Oklahoma City Bombing. I don't know that much about that event, but I do vividly remember the pictures on TV and the unbelievable story behind it - as if it was taken from a road movie. So let's look that one up.

First we try to find the city:

match (c:City {name:"Oklahoma City"}) return c;

By then using the Browser we can very quickly identify the event (which has an id of 199504190004 ) and look at some of the details:

match (e:Event {id: 199504190004}) return e;

And then we can start to zoom into the surrounding graph:

Interesting, and intuitive.

So then, of course, we can take a look at another event - one of the most hideous and cruel attacks ever: 9/11. To do this, we need to take another look at the data and find the event. Using the same mechanism as above, I will take a look at the specific events on that sad day:

match (e:Event)
where left(str(e.id),8)="20010911"
return e

So very quickly, we then find out that there were a number of different attaches on that day:

So let's focus on the ones that happened in the USA by simply expanding our pattern:

match (e2:Event)
where left(str(e2.id),8)="20010911"
with e2 as Event
match p = allshortestpaths ((Event)-[*..3]-(c:Country {name:"United States"}))
return p

Immediately we see the 4 related attacks on the same day:

Now last but not least, let's see if we can find interesting connections. Because of the data quality issues that we uncovered above, the connections highlighted here may not be perfect or that interesting, but the principle I think could be super interesting in a true, well-maintained POLE database.

match (e2:Event)
using scan e2:Event
where left(str(e2.id),8)="20010911"
with e2 as Event
match p = allshortestpaths ((Event)-[*..3]-(c:Country {name:"United States"}))
With Event
limit 1
match p2 = allshortestpaths((e1:Event {id: 199504190004})-[*]-(Event))
return p2
limit 10

As you can see below, you do get an interesting view and a number of "suggested" links between different events - look at the "similar" events in the middle for example.

No doubt that in a true POLE database application, that would be a great source of inspiration for further exploration.

That's about it what I have to share on my journey with POLE databases, the Global Terrorism Database and Neo4j. I hope it was useful for you - I certainly found it a very interesting excercise.

As always - feedback and comments would be very welcome.

Cheers

Rik

Friday, 25 September 2015

Part 2/3: loading the Global Terrorism Database into Neo4j

In the previous post in this series, I explained what we were trying to do by taking the Global Terrorism Database and using it to create a Neo4j graph database based POLE database.

As you may remember, the GTD is a really big Excel file. So the first thing that I did was to do a minor clean-up operation on the XL file (there were a few columns in the file that did not make sense to import, and that were really causing me a lot of pain during the import process), and then I basically converted it into a nice CSV file that I could use with Cypher's LOAD CSV commands.

Creating the import script

Now, I have done this sort of thing before, and I can tell you - importing a dataset like this is never trivial. One thing that I have learned over the years with lots of trial and error, is that it is usually a good idea to import data in lots of smaller incremental steps. Rather than trying to do everything at once, I essentially for this particular GTD import task created 50 queries: 50 iterations through a csv file, piecing together the different parts of my graph model with different columns of my CSV file.

The entire import statement is on github, you can run through it pretty easily.

Running the import script

Now, you could run this import script very nicely in the Neo4j browser. There's one problem though: the browser only accepts one statement at a time, and that would mean a lot of copying and pasting. That's why I very often times still revert to the good-old neo4j-shell. Copying and pasting the entire import statement works like a charm:

Although I really should mention one thing that I learned during this excercise. The thing is, that because I actually import the data piece by piece, run by run, I had originally also created the relevant Neo4j indexes right before every query. Waw. That turned out to be kind of a problem.

Schema await!!!

The thing is, that the Neo4j Cypher query planner relies on these indexes to create the most efficient query execution plan for an operation. And what I noticed in this excercise is that my import queries were sometimes literally taking F-O-R-E-V-E-R, in simple situations where really they should not. I really had to do a bit of digging there and ask a little help from my friends, but I ended up finding out that the problem was simply that the INDEXES, which I had created in the statement right before the import operation, where not online yet at the moment when I was doing the import. The Cypher query planner, in the absense of index information, the proceed to do the import in the most inefficient way possible, doing full graph scans time and time again for the simplest of operations.

Two really easy solutions to this:

create your indexes all the way at the top of your import script, not between import statements. Probably a best practice.
add a very simple command after you create the index, to force for it to be "ONLINE" before the next statement is executed:

neo4j-sh (?)$ schema await;

That was a very useful trick to learn - I hope it will be useful for you too.

The result of the Import

After running this script, we end up loading data into the model that we sketched earlier.

After a few minutes of importing, the meta-graph is exactly the same as the model we set out to populate, and is now ready for querying.

That's what we will do in the next and final blogpost of this series: do some real POLE database queries on the Global Terrorism Database. Should be fun. Already looking forward!

Hope this was useful for you.

Cheers

Rik

Monday, 21 September 2015

Part 1/3: Experimenting with a POLE, the Global Terrorism Database, and Neo4j

In the past couple of weeks and months, I have been having a lot of fun at Neo4j working with different clients. One thing struck me however (maybe it's a coincidence, but still): we have come across an impressive amount of customers that all had very similar requirements: they were looking to use Neo4j as the foundation architecture for a next-generation POLE database. A what? A P-O-L-E database.

What is a POLE, exactly?

I guess everyone has their own definition and wants to create yet-another-vague-acronym, but the common case seems to be that it's like a "case management" tool for specific types of government agencies that want to look at the links between Persons, Objects, Locations, and Events. Some of the cases are to be found in police forces, government (tax / social service) agencies, immigration authorities, etc ... They all have that same requirement of being able to analyse and link different entities together, like so (or similar):

Naturally, most of these clients are not about to share their privacy-sensitive data with us very often. And I would still want to have some kind of a story and demonstration to explain how we could help. So I went looking for some interesting datasets, and ... before I knew it I found something really interesting.

The Global Terrorism Database

As mentioned above, one of the key areas where people will try to understand the connections between the POLEs, is in police/intelligence work. In fact, we have noticed that many of the Neo4j use cases that we have worked on are in this domain. So where to find interesting data around topics like that...

Like in so many cases I can't exactly reconstruct how I got there, but in the end I found the Global Terrorism Database (GTD). They seem to be very strict about their ownership of the data, so here's some legalese for you:

the data was provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START). (2015). Global Terrorism Database [Data file]. Retrieved from http://www.start.umd.edu/gtd.

And I must say: they did an unbelievable job. The interface below is super interesting to play around with in the first place.

Then after some playing around I quickly noticed I could actually download the dataset from this page over here.

As you can see, it provides a couple of different documents. The most important ones are

a big, tall and wide Excel file.
a Codebook that explains the meaning of the different data elements in the Excel file.

Opening up the file takes a bit longer than on average, but works fine on my machine. It's about 140000 lines long, and I-don't-know-how-many columns (a lot) wide.

So that's when I started to take a few good looks at the data, and found that actually it is a pretty great example of a POLE database. It contains information about

Events: the 140000 terrorist attacks from 1970 until 2014.
Objects: the weapons / systems / objects used during these attacks
The Persons / Groups of persons (usually) performing the attacks
The Location of the attacks (by region, country, province/state, city, gps-coordinates)

And actually a bit more than that. So the data is actually a bit more than a "simple" POLE, and so I thought that it would be an even better fit for a a potential Graph Model then.

Creating a GTD POLE model for Neo4j

So after a bit of examination and experimentation in Excel, I ended up drawing out the following Graph Model for the Global Terrorism Database:

As usual, with a graph, it feels like a very natural and simple way to talk about the data. So then all I needed to do was to convert the Excel file into a Neo4j database. That should be interesting. So in part 2, we will attempt to load this data into Neo4j.

Hope this was interesting so far!

Cheers

Rik

Bruggen Blog

Pages