Wednesday 24 April 2013

Gmail Email analysis with Neo4j - and spreadsheets

A bunch of different graphistas have pointed out to me in recent months that there is something funny about Graphs and email. Specifically, about graphs and email analysis. From my work in previous years at security companies, I know that Email Forensics is actually big business. Figuring out who emails whom, about what topics, with what frequency, at what times - is important. Especially when the proverbial sh*t hits the fan and fraud comes to light - like in the Enron case. How do I get insight into email traffic? How do I know what was communicated to who? And how do I get that insight, without spending a true fortune?


So a couple of days ago I came across an article or two that features Gmail Meter. This is of course not one of these powerful enterprise-ready forensics tools, but it is interesting. Written by Romain Vialard, it provides you with a really straightforward way to get all kinds of stats and data about your use of Gmail. It does so using a Google Apps Script, that is available to anyone using Google Docs’ spreadsheet functionality. In this blog post, we’ll take a look at how we can actually use Romain’s output, and generate a neo4j database that will allow us to visually and graph-ically explore the email traffic in our gmail inboxes - without doing doing any coding of course. Because I don’t know how to do that - but I do do spreadsheets, as you know by now.


Using Gmail Meter to create the dataset



The first thing you need to do to get going is to get Gmail Meter installed. To do that, you just create a Google Doc Spreadsheet, and insert Gmail Meter from the Script Gallery.




You will need to give it permission to analyse your mailbox (and potentially remove that when you want to uninstall it), but that’s easy. More instructions can be found over here should you need that. But really it’s dead easy. Once you have given it permission that, the Apps Scripts starts churning away at your mailbox.


The end result is a Google spreadsheet with two tabs:
  • the first tab (Sheet1) contains information about which email addresses you have been exchanging with, and how many emails you have been exchanging with them (sending, and receiving)
  • the second tab (Sheet2) contains more information about the conversations, number of words, etc.
Now all we need to do is create a neo4j database based on this data - and that too is really easy.

Importing the Gmail Meter data into Neo4j


For the graph import that I will illustrate here, we will only be using the first sheet. Basically just getting to grips with
  • the people that we are emailing to
  • the people that we are receiving emails from
  • the frequency that we are emailing to/from these contacts
There’s definitely more data here - but let’s start with this.

The way I have done it - which is probably not the only way to do it, but still - is to add two sheets to the workbook coming from Gmail Meter.
  • “Graph”: to convert the data from GmailMeter’s first worksheet into nodes and relationships
  • “Cypher”: to generate the Cypher statements that we can use to generate the Neo4j database and start playing around .
I have shared the worksheet over here - so please take a look and customize it whereever necessary. You should end up with something like this sheet:


Note that - as mentioned above - in the Graph sheet we have, additional to the nodes and relationships, also added the number of emails to the “EMAIL” relationship as a property/weight. This will come in handy later when visualising the email traffic. In a larger graph you could actually use these weights for pathfinding algorithms as well - in case you would want to find out the volume of email traffic between two persons at different places in the graph, through other people.

You see that I am using the exact same techniques of my spreadsheet import blog post to generate the Cypher statements required. All I need to do after that is to put all statements into the “Cypher” worksheet, wrap it with a transaction - and copy paste that into the neo4j shell of my empty neo4j database. That will execute the queries, insert all nodes, relationships and properties, and voila - we have our Gmail Meter Graph!

Exploring the Gmail Meter Graph


Once we have generated the database, we can explore it in all the ways that we can with Neo4j’s traditional visualisation/querying tools. First thing I did of course is to look at it with Webadmin. That gives you some ideas already, but things get a lot more interesting when you can visualise the weights (= numbers of emails between two persons).


To do that, I plugged another visualisation tool on top of the database to get a feel for the weight of certain relationships. Our friends at Linkurio.us actually have some very neat  (and above all: simple AND powerful) ways to do this - as you can see below it immediately gave me an idea of where the traffic is coming from and going to.




Obviously we can explore the network, and also query it with Cypher for interesting relationships:
  • find out who I am emailing,

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail-[email:EMAIL]->otherperson
RETURN
id(otherperson), email.number, otherperson.name;

  • find out who is mailing me more than 4 emails,

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson
WHERE
email.number > 4
RETURN
id(otherperson), email.number, otherperson.name;

  • find out who is at the same time sending mail to me, and receiving mail from me.

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson,
mymail-[email2:EMAIL]->otherperson
WHERE
email.number > 2
AND
email2.number > 2
RETURN
id(otherperson), email.number as From, email2.number as To,

otherperson.name;

  • And of course Cypher offers many more interesting possibilities…

I hope you understand that - because we are only looking at data from one and only one mailbox, the dataset’s power is quite limited. But I am hoping you get the point that the graph exploration of this dataset is great. I did a little experiment where I actually put my professional email data and my personal email data (both use Gmail) into one neo4j database - and that was really interesting.

For those of you wanting more detailed info on this topic, I would encourage you to take a look at the wonderful Graph Databases book, that has a specific chapter about email analysis.

Hoping this was useful - it sure was a learning experience (again) for me.