Monday, 21 September 2015

Part 1/3: Experimenting with a POLE, the Global Terrorism Database, and Neo4j

Over the past couple of weeks and months, I have been having a lot of fun at Neo4j working with different clients. One thing struck me, however (maybe it's a coincidence, but still): we have come across an impressive number of customers with very similar requirements. They were all looking to use Neo4j as the foundation architecture for a next-generation POLE database. A what? A P-O-L-E database.

What is a POLE, exactly?

I guess everyone has their own definition and wants to create yet-another-vague-acronym, but the common case seems to be that it's like a "case management" tool for specific types of government agencies that want to look at the links between Persons, Objects, Locations, and Events. Some of these cases are found in police forces, government (tax / social service) agencies, immigration authorities, etc. They all have that same requirement of being able to analyse and link different entities together, like so (or similar):

Naturally, most of these clients are not about to share their privacy-sensitive data with us. But I still wanted some kind of story and demonstration to explain how we could help. So I went looking for some interesting datasets, and ... before I knew it I found something really interesting.

The Global Terrorism Database

As mentioned above, one of the key areas where people try to understand the connections between the POLEs is police/intelligence work. In fact, we have noticed that many of the Neo4j use cases that we have worked on are in this domain. So where to find interesting data around topics like that?

As in so many cases, I can't exactly reconstruct how I got there, but in the end I found the Global Terrorism Database (GTD). They seem to be very strict about their ownership of the data, so here's some legalese for you:
The data was provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START). (2015). Global Terrorism Database [Data file]. Retrieved from http://www.start.umd.edu/gtd.
And I must say: they did an unbelievable job. The interface below is super interesting to play around with in the first place.

Then after some playing around I quickly noticed I could actually download the dataset from this page over here.



As you can see, it provides a couple of different documents. The most important ones are

  • a big, tall and wide Excel file. 
  • a Codebook that explains the meaning of the different data elements in the Excel file.

Opening up the file takes a bit longer than average, but it works fine on my machine. It's about 140,000 lines long, and I-don't-know-how-many columns (a lot) wide.

So that's when I started to take a few good looks at the data, and found that actually it is a pretty great example of a POLE database. It contains information about

  • Events: the 140,000 terrorist attacks from 1970 until 2014.
  • Objects: the weapons / systems / objects used during these attacks.
  • Persons: the persons / groups of persons (usually) performing the attacks.
  • Locations: where the attacks took place (by region, country, province/state, city, GPS coordinates).

And actually a bit more than that. So the data is a bit more than a "simple" POLE, and I thought that would make it an even better fit for a potential Graph Model.
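
To make that POLE split a bit more concrete, here is a minimal Python sketch of how one flat spreadsheet row could be pulled apart into its Person, Object, Location and Event parts. Everything in it is an assumption for illustration: the column names (`eventid`, `gname`, `weaptype1_txt` and so on) should be checked against the Codebook, the `row_to_pole` helper is hypothetical, and the sample row is invented, not real GTD data. It is also not the graph model itself, just a way to look at the four groups of columns.

```python
from typing import Any, Dict


def row_to_pole(row: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split one flat row into its four POLE parts.

    The column names are assumptions; verify them against the Codebook.
    Missing columns simply come back as None.
    """
    return {
        # Person: the group (usually) performing the attack
        "person": {"group": row.get("gname")},
        # Object: the weapon / system used
        "object": {"weapon_type": row.get("weaptype1_txt")},
        # Location: from coarse (region) to fine (coordinates)
        "location": {
            "region": row.get("region_txt"),
            "country": row.get("country_txt"),
            "province_state": row.get("provstate"),
            "city": row.get("city"),
            "latitude": row.get("latitude"),
            "longitude": row.get("longitude"),
        },
        # Event: the attack itself
        "event": {
            "id": row.get("eventid"),
            "year": row.get("iyear"),
            "month": row.get("imonth"),
            "day": row.get("iday"),
        },
    }


# Invented sample row, for illustration only (not taken from the dataset).
sample_row = {
    "eventid": 1,
    "iyear": 2000,
    "imonth": 1,
    "iday": 15,
    "region_txt": "Example Region",
    "country_txt": "Example Country",
    "provstate": "Example Province",
    "city": "Example City",
    "latitude": 0.0,
    "longitude": 0.0,
    "gname": "Example Group",
    "weaptype1_txt": "Example Weapon Type",
}

pole = row_to_pole(sample_row)
```

Each of the four parts is a candidate for its own kind of node in a graph, with the Event in the middle linking the other three together.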

Creating a GTD POLE model for Neo4j

So after a bit of examination and experimentation in Excel, I ended up drawing out the following Graph Model for the Global Terrorism Database:



As usual, with a graph, it feels like a very natural and simple way to talk about the data. So then all I needed to do was to convert the Excel file into a Neo4j database. That should be interesting. So in part 2, we will attempt to load this data into Neo4j.

Hope this was interesting so far!

Cheers

Rik
