Today is August 4th, 2014. For most people, that date probably does not mean a lot - but for many people in Europe it probably does - especially if you are from Germany, Belgium, the UK, France - or any of the countries in Central/Western Europe. And for most people across the globe, it probably should mean more than it does - because it is the 100th anniversary of the start of World War 1, when Germany invaded Belgium, violated its neutrality, and Britain declared war on Germany.
I have never lived through a war. I have lived a very comfortable, safe life in Antwerp, Belgium for the past 40 years. But every now and then I go to the Saint Sixtus abbey in Westvleteren to ... indulge in some beer, of course, and then we almost pass through Flanders Fields.
Yes indeed, home of the poppy remembrance symbol - not that we really have that many, at least not today.
Some of these war remembrance monuments are truly, truly moving symbols of pacifism. I took my kids to visit Tyne Cot, as an example - and it was a day to remember. You cannot unsee 20000 graves.
We also took them to the "trench of death" this spring, and to the war remembrance museum in Diksmuide: the Museum on the Yser. Call me stupid, but I also want to take them to the In Flanders Fields museum this summer. I have been putting it off because I know it will scare the cr@p out of them - but I think it's one of those little things that I need to tell my kids: War is NOT nice. Not.
Especially with the daily pictures that we see of the Ukraine Unrest, or worse still, the atrocious war in Gaza... War seems not-so-far away. Will we ever learn?
So earlier this month, I was idling around on the net, thinking about stuff like this, and I came across the Correlates of War website. This project was founded in 1963 by J. David Singer, a political scientist at the University of Michigan, and has been documenting the different wars (or "disputes", as he calls them) in a structured way. Which of course brings me to the meat of this blogpost: the WarGraph. Wouldn't this be a great dataset to look into in Neo4j?
Working the data
I started, of course, by looking into the publicly available Correlates of War datasets. There's more than one that we need to import: one for Countries, one for Disputes, and then there's some interesting "meta-data" around religions (in the countries) and Material Capabilities (of the countries). To start working with this data, I put everything together into a bigger Google spreadsheet, which really is turning out to be my go-to tool these days.
The Graph of Wars Model
Needless to say, we needed to create a graph model of the data before we could do anything meaningful. Here's what I ended up with.
Let me take you through this to explain:
- Country nodes have a lot of interesting metadata: the CoW people have assembled a lot of data on countries' economic and military capabilities. They have yearly data since the early 1800s, but for simplicity's sake I only imported the 2007 data:
  - Iron and Steel Production
  - Military Expenditures (GBP or USD)
  - Military Personnel (thousands)
  - Primary Energy Consumption
  - Total Population
  - Urban Population
  - Composite Index of National Capability score: a measure computed by summing all observations on each of the 6 capability components above, converting each state's absolute component to a share of the international system, and then averaging across the 6 components.
- Some metadata needs to be imported in order to make the model more graphy / normalised:
  - Different kinds of Outcomes are a subgraph, and labeled as such
  - Different kinds of Settlements are a subgraph, and labeled as such
  - Different fatality levels are a subgraph, and labeled as such
  - Different kinds of "Highest levels of Action" in the dispute (HiAct) are a subgraph, and labeled as such
  - Different kinds of hostility levels - which are related to the "highest levels of action" - are also a subgraph, and labeled as such.
- In order to work with the Years in the graph, I have also connected them to one another in an "in-graph index", aka a timeline. I did something similar with my beergraph a while ago.
- Religion data is also imported: again, there's a lot of interesting data since the 1800s, but I only imported recent data from 2010.
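An in-graph year index like the one described above could be built roughly as follows. This is only a sketch, not the actual import script: the `Year` label, the `name` property and the `PRECEDES` relationship match the queries later in this post, but the year range is an illustrative assumption, and `UNWIND` needs Neo4j 2.1 or later.

```cypher
// Sketch only: the 1816-2010 range is an assumption, not the range used in the real import.
// Step 1: create one Year node per year, storing the year as an integer in "name".
UNWIND range(1816, 2010) AS y
MERGE (:Year {name: y});

// Step 2: chain consecutive years together into a timeline.
MATCH (y1:Year), (y2:Year)
WHERE y2.name = y1.name + 1
MERGE (y1)-[:PRECEDES]->(y2);
```

The two statements are meant to be run one after the other. Once the chain exists, a variable-length `PRECEDES` pattern can walk from one year to any later year.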
If you think you need a more detailed explanation of what the different data elements mean, you can always go back to the codebook for more info.
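As a reading aid, the Composite Index of National Capability score described in the list above can be written as a formula. The notation is just shorthand for this post and is not taken from the codebook: x_{i,k} stands for country i's value on capability component k, and the inner sum runs over all countries j in the international system for that year.

```latex
\mathrm{CINC}_i \;=\; \frac{1}{6} \sum_{k=1}^{6} \frac{x_{i,k}}{\sum_{j} x_{j,k}}
```

Each fraction is a country's share of the world total for one component, and the score is the plain average of those six shares.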
Import the WarGraph into Neo4j
Then all I had to do was import the data. I used a combination of two techniques here:
- I used the spreadsheet method to import the metadata. It just did not make sense to go through the motions of prepping the CSV files for these small metadata sets.
- I used LOAD CSV straight from the Google spreadsheet for the actual data.
The detailed overview of the import process is in the gist over here. It's not difficult at all.
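To give an idea of what the LOAD CSV part can look like, here is a minimal sketch. It is not the statement from the gist: the URL and the column names (`dispute`, `country`) are hypothetical placeholders, and it assumes the spreadsheet has been published to the web as CSV. The labels and the `PARTICIPATES_IN` relationship are the ones used in the queries below.

```cypher
// Sketch only: the URL and the column names "dispute" and "country" are hypothetical.
// Each CSV row is assumed to describe one country taking part in one dispute.
LOAD CSV WITH HEADERS FROM "https://example.com/participants.csv" AS line
// MERGE avoids creating the same Dispute or Country node twice.
MERGE (d:Dispute {name: line.dispute})
MERGE (c:Country {short: line.country})
MERGE (c)-[:PARTICIPATES_IN]->(d);
```

Keep in mind that LOAD CSV reads every field as a string, so numeric columns would need converting before they can be used in calculations.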
Querying the WarGraph
Once the data is in Neo4j, we can use the Neo4j browser to start looking at some data with simple Cypher queries:
Let's see if we can find the USA (as a country) in this dataset:
MATCH (n:Country {short:"USA"})-[r]-()
RETURN n,r
LIMIT 10
Seems correct. Now let's start looking at some "war" related information. Let's find the countries that have been involved in the most disputes:
MATCH (n:Country)-[r:PARTICIPATES_IN]->(d:Dispute)
RETURN n.name, count(r)
ORDER BY count(r) desc
LIMIT 10;
Interesting. The US and the UK are up there. But so are Germany, France... and Israel (a country that did not exist until not that long ago).
We can then slice and dice this data a bit more. Let's look at the countries with the most disputes "per capita", i.e. relative to their population size. Here's the query:
MATCH (n:Country)-[r:PARTICIPATES_IN]->(d:Dispute)
WHERE n.totalpop is not null
WITH n, count(r) as NrOfDisputes
RETURN n.name, n.totalpop, NrOfDisputes, 1.0*NrOfDisputes/n.totalpop as DisputesPerCapita
ORDER BY DisputesPerCapita desc
LIMIT 10
There's a little trick here with the "1.0*" in the query. It forces Cypher into floating point arithmetic before it gets to the division. If you do it any later, the query will fail. Thanks to Alistair for helping me with that.
Let's now take a look at the time dimension, by running along my in-graph year-index. Let's look at the disputes in the first half of the 20th century:
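A tiny stand-alone illustration of why the "1.0*" matters, using literal numbers instead of the real data. It assumes, as in the query above, that both the dispute count and the population are integers:

```cypher
// Both operands are integers, so Cypher does integer division: the result is 1.
RETURN 3 / 2 AS integerDivision;

// Multiplying by 1.0 first makes the left-hand side a float, so the result is 1.5.
RETURN 1.0 * 3 / 2 AS floatDivision;
```

With a handful of disputes divided by a population that is much larger, the integer version would always come out as 0, which is not much of a ranking.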
MATCH (y1:Year {name:1900})-[:PRECEDES*..51]->(y2),
(d:Dispute)-[:STARTED_IN]->(y2)
RETURN distinct d.name as Dispute, y2.name as StartYear;
Or let's see whether Religion has anything to do with warfare. Would it?
Here are the 10 countries with the most religious adherents, and the number of disputes they participated in:
MATCH (n:Religion)<-[r:HAS_ADHERENTS]-(c:Country)
WHERE n.name <> "Non-religious"
WITH distinct c.short as country, r.number as nrofadherents
ORDER BY nrofadherents DESC
LIMIT 10
WITH country
MATCH (c:Country {short: country})-[:PARTICIPATES_IN]->(d:Dispute)
RETURN distinct c.short, c.name, count(d);
And then last but not least, let's see if we can explore some paths. Just as an example, nothing more, we can take a look at some links between two countries, like for example the USA and Israel:
MATCH (u:Country {short:"USA"}), (i:Country {short:"ISR"}),
p = allshortestpaths((u)-[r*]-(i))
RETURN p;
No world-shocking data in any of these queries, but still very interesting stuff to play around with. All the queries are in the gist over here.
As usual, I would welcome your feedback. I hope this was useful, and that we will all make graphs, not war.
Rik