Monday 22 October 2018

Poring over Power Plants: Global Power Emissions Database in Neo4j

In the past couple of weeks, I have been looking to some interesting datasets for the Utility sector, where Networks or Graphs are of course in very, VERY abundant supply. We have Electricity Networks, Gas Networks, Water Networks, Sewage Networks, etc etc - that all form these really interesting graphs that our users can. Lots of users have specialised, GIS based tools to manage these networks - but when you think about it there are so many interesting things that we could do if ... we would only store the network as a network - in Neo4j of course.
So I started looking for some datasets, and maybe I am not familiar with this domain, but I did not really find anything too graphy. But I did find a different dataset that contained a lot of information about Power Plants - and their emissions. Take a look at this website:
and then you can download the Excel workbook from over here. It's not that big - and of course the first thing I did was to convert it into a Google Sheet. You can access that sheet over here:

There's two really interesting tabs in this dataset:
  1. the sheet containing the fuel types: this gives you a set of classifications of the fuel that is used in the different power plants around the globe
  2. the list of 30,5k power plants from around the world that generate different levels of power from different fueltypes. While doing so, they also generate different levels of emissions, of course, and that data is also clearly mentioned in this sheet. Note that the dataset does not include any information on Nuclear plants - as they don't really have any "emissions" other than water vapour and... the nuclear waste of course.
So let's get going - let's import this dataset into Neo4j.

The Global Power Emissions Database data mode


Here's the model that I created based on the dataset:

It's a pretty simple structure, as you can see. So the first thing I did before I wanted to proceed is to create the indexes to back up this model:
//create indexescreate index on :Classification(name);create index on :SubClassification(name);create index on :FuelType(name);create index on :Country(name);create index on :Plant(name);create index on :Plant(id);
That's easy enough. Thanks to the multi-statement editor in the Neo4j browser I can just copy and paste that:
and be done with it. Now let's start the import.

Importing the Fuel Types

Let's start with the FuelType tab of the workbook / Google Sheet. As we know, we can actually use Cypher's LOAD CSV to download and import the Google Sheet. The two tabs that we are interested in are to be found at
So let's get that going. Here's the script to import the different fuel types and their classifications and sub-classifications in Neo4j:

//FROM FUELTYPE TAB 

//add Top-level classificationload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=1877461938" as csvwith distinct csv.Classification as Classificationmerge (c:Classification {name: Classification});
 

//add FuelTypesload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=1877461938" as csvmerge (ft:FuelType {name: csv.IEA_Fuel_Type});
 

//add Sub-level Classification and connect to the Classificationload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=1877461938" as csvmatch (c:Classification {name: csv.Classification})merge (sc:SubClassification {name: csv.SubClassification, fullname: csv.Full_Name})-[:SUBCLASSIFICATION_OF]-(c);
 

//connect the FuelTypes to the SubClassificationload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=1877461938" as csvmatch (sc:SubClassification {name: csv.SubClassification}), (ft:FuelType {name: csv.IEA_Fuel_Type})merge (sc)-[:HAS_FUEL_TYPE]->(ft);
The import only takes a very short while:
And then we immediately get some interesting subgraphs:
Now we can start importing the second tab of the sheet.

Importing Countries, Plants and their Classifications

Now we apply the same mechanism, from a different .csv URL -  to get the countries in:


//Import the Countriesload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=321984711" as csvwith distinct csv.Country as countrymerge (c:Country {name: country});
And then import the energy plants and connect them to the countries. Because there are more than 30k plants, we will use a periodic commit for this:
//Connect the plants in countriesusing periodic commitload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=321984711" as csvmatch (c:Country {name: csv.Country})create (c)<-[:LOCATED_IN]-(p:Plant {id: toInt(csv.ID), name: csv.Plant_Name, number_of_units: toInt(csv.Number_of_units), capacity: toFloat(csv.Capacity_in_MW), co2: toFloat(csv.CO2_in_Mg), so2: toFloat(csv.SO2_in_Mg), NOx: toFloat(csv.NOx_in_Mg), PM25: toFloat(csv.PM25_in_Mg)});
And then the final import step is going to be to connect the Plants to the top-level Classifications:
//Connect plants with classificationusing periodic commitload csv with headers from "https://docs.google.com/spreadsheets/d/1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk/export?format=csv&id=1iBfZ52mQ5FFq0Mg3Q3WICb9vfzL1aAVQNjhUDXPa3Sk&gid=321984711" as csvmatch (cl:Classification {name: csv.Classification})match (p:Plant {id: toInt(csv.ID)})create (p)-[:HAS_CLASSIFICATION]->(cl);
And then we get the following graph stats loaded:
and we can start looking at it a little bit.
Now for some queries.

Querying the Global Power Emission Database in Neo4j

Of course it's kind of tough to look at this data and get any meaningful data without first doing some aggregations. As luck will have it, graph databases and Neo4j don't necessarily have a great reputation for these types of aggregations - as these types of queries are not very "graph local" usually. But hey, wanted to try it anyway, and see what it would give.

Let's start with a simple one:

//summary of plants per countrymatch (c:Country)--(p:Plant)--(cl:Classification)return c.name, cl.name, count(p)order by c.name, cl.name asc;
This returns in no time - even more so when you make it specific to a Country. Amazing to learn that we still have/had (the dataset is kind of outdated, from 2010) coal-fueled plants in Belgium:

Another great report that I thought was interesting, as it is featured in the original spreadsheet (on a tab named "GPED_v1.0_Country Level") is the summary of emissions per country. It's a ridiculously simple Cypher query to surface that:

//summary of emissions per countryMATCH (p:Plant)--(n:Country)return n.name, sum(p.co2), sum(p.so2), sum(p.NOx), sum(p.PM25)order by n.name;
And we get a really nice, real-time aggregation back from our database: 
This leads me to believe that - at least at this scale - aggregations and Neo4j are not necessarily such a bad match - it literally takes milliseconds.

Where are the big polluters?

Next, I wanted to get a feel - for my own education of where the big polluters are, at least in terms of emissions for the power plants in this database. So a couple of interesting queries will give us some insight:
//largest co2 polluting plantsMATCH (p:Plant)--(conn)return p,connorder by p.co2 desclimit 25;

Look at this:
Exactly: COAL SUCKS.

You can do the same or similar queries for the NOx and SO2 emissions, and will see very similar results.

Where are the most / least efficient plants?

Then we try to get a feel for the plant efficiency: what kind of power (in terms of MegaWatts) are these plants producing, for what kind of pollution emissions? Let's look at the least efficient, and the most efficient plants:
//least efficient plantsmatch path = ((c:Classification)--(p:Plant)--(co:Country))return co.name, c.name, p.name, p.co2, p.capacity, p.co2/p.capacityorder by p.co2/p.capacity desclimit 25;
And look at the results - some surprisingly close ones:

Then we look at the most efficient plants - and we can also look at this in a graphy way:

//most efficient but co2 > 0match path = ((c:Classification)--(p:Plant)--(co:Country))where p.co2 > 0return pathorder by p.co2/p.capacity asclimit 25;
And again, we get some interesting results - USA leading the charge here!

There's a lot of other stuff that we could do with this dataset, but for now I will keep it at this.

Where to find more?

All of these load scripts and queries are nicely available on Github:

I even created a graphgist - but unfortunately the dataset is a bit to big to be rendered as such. That's why my friends at the Neo4j Cloud team (thanks!) actually allowed me to create a publicly available Neo4j instance that has the data in there already. You can browse to

https://bit.ly/globalpoweremissionsdb

and use the following username / password:

  • user: reader
  • pwd: reads
As you can imagine, this user does not have any write authorisations, but you can hammer it with queries as much as you want.

Hope this was a useful post - and look forward to hearing your feedback!

Cheers

Rik

No comments:

Post a Comment