Wednesday 4 November 2020

Making sense of 2020's mad calendar with Neo4j


As we enter November 2020, I - like many people I assume - can't help but feel quite "betwattled" by all of the events taking place this year. I took some time last weekend to look at all the crazy events that happened ... starting with pretty normal January and February, moving slowly to ominous March, and then living the weird, (semi-) locked down lives that we have been living until this very day I write this, which is the day after the bizarre US elections.

In any case, I decided to have some fun while reflecting about all this. And in my world, that means playing with data, using my favourite tools... CSV files, Google Sheets, and of course, Neo4j. Let me take you for a ride.

Starting out with my calendar

The starting point of all this is of course my Google Calendar - which is buried in online calls and meetings these days. 

So in order to work with that data, I had to go to the Google support page and learn how to export it into an "ICS" file. This turned out to be easy enough. ICS file format takes a little getting used to:But it is pretty easy to convert that into a CSV file using a tool like this one. The result is a flat file structure that is very useable:

So in order to work with that, I just put that file into the <your neo4j directory>/import directory to start playing around with it. As you can imagine, I will not be sharing that file with you publicly, but I think you can understand the process if you want to try this for your own calendar.

However, to make it more straightforward to follow this blogpost, I have created a sample file that completely corresponds to my actual import. It's in a Google Sheet over here, and that then obviously also corresponds to a CSV download that you could work with straight away in your queries.  

So let's start with the import queries first.

Importing the calendar into Neo4j

Once the calendar is in CSV format, the import is super straightforward. I just run this query:
load csv with headers from "file:/calendarfull.csv" as csv
create (e:Event)
set e = csv
set e.starttime = datetime(e.starttime)
set e.endtime = datetime(e.endtime)
set e.createdtime = datetime(e.createdtime)
set e.lastmodifiedtime = datetime(e.lastmodifiedtime);
and
a few short whiles later the data is in Neo4j. Note that I converted the date/time fields into datetime data types in Neo4j, so that we can do some more manipulations on it later.

Once the data is in, we can also add our indexes:
create index on :Event(starttime);
create index on :Event(endtime);
create index on :Event(createdtime);
create index on :Event(lastmodifiedtime);
create index on :Event(summary);
create index on :Event(description);
create index on :Company(name);
create constraint on (p:Person)
assert p.email is unique;
create index on :ConferenceProvider(name);
and now we should be ready to do some data wrangling. Currently we only have one entity/label in the graph, our "Events", but we obviously want to extract some more graphy data out of this. Let's see how we can do that. I call it "Graphifying". You can quote me on that.

Graphifying the imported data

After having done the basic import, there's a couple of clean up operations that we need to do first. The first thing I am doing for my purposes here, is to throw out all the older and personal events. We do that with a simple query:
MATCH (e:Event)
WHERE e.starttime.year < 2020 OR e.calendartype = "personal"
DELETE e;
Then there's some clean-up that needs to be done. First, I will correct the fact that there are Events in the data that don’t have an organiser. I will just manually give them my own details as organiser with this query:
match (e:Event)
where e.organizer is null
set e.organizer = "mailto:rik@neo4j.com";
And then lastly before we start the actual "graphification" process, let me consolidate all the email addresses, remove the "mailto:" prefix, and make these updates both in the "organizer" and the "attendee" fields of the Event nodes:
match (e:Event)
set e.organizer = replace(e.organizer,"mailto:","")
set e.attendee = replace(e.attendee,"mailto:","")
set e.organizer = replace(e.organizer,"@neotechnology.com", "@neo4j.com")
set e.organizer = replace(e.organizer,"@neo4j.org", "@neo4j.com")
set e.attendee = replace(e.attendee,"@neotechnology.com", "@neo4j.com")
set e.attendee = replace(e.attendee,"@neo4j.org", "@neo4j.com");
These three steps proceed very quickly as you can imagine:


Now we can proceed to create a true calendar graph. Let's create the people nodes now, and link them to the events. We use this query to create the "organizers":
match (e:Event)
merge (p:Person {email: e.organizer})
create (p)-[:ORGANIZES]->(e);
and the we just need to look at the "attendee" field of the event. This field contains a list of people's email addresses, and therefore we need to split the field up, and unwind that collection. This is how we remove the attendees from the events and connect them:
match (e:Event)
with e, split(e.attendee,";") as CollectionOfAttendees
unwind CollectionOfAttendees as IndividualAttendee
merge (p:Person {email: IndividualAttendee})
merge (p)-[:ATTENDS]->(e);
The second query took a bit longer on my dataset, but still very doable:


and now the schema starts to look a bit more interesting:


So let's continue this a bit.

Next thing we could graphify, is to look what the companies are that we have in the attendees. This is not that easy, as all we have to work with are the email addresses. But let's take a stab at it. I am first going to split the email addresses into the user (before the @) and company (after the @). We are only going to care about the company info - but we are going to exclude the domain names that are clearly not companies. We do this by making a list of "non corporate" domain names that we will exclude. 

The query for this process looks like this:
MATCH (p:Person)
WITH distinct split(p.email,"@")[1] as domain, p
WHERE not domain in ['email.com','163.com','free.fr','gmail.com','group.calendar.google.com','resource.calendar.google.com','qq.com','hotmail.com','mail.com','outlook.com','skynet.be','vanbruggen.be']
merge (c:Company {name: toUpper(domain)})
merge (p)-[:WORKS_FOR]->(c);
And then we just need to remove the top level domain names from the Company nodes, with the following query:
MATCH (c:Company)
set c.name = replace(c.name,".COM","")
set c.name = replace(c.name,".ORG","")
set c.name = replace(c.name,".BE","")
set c.name = replace(c.name,".NL","")
set c.name = replace(c.name,".FR","")
set c.name = replace(c.name,".CO.UK","")
set c.name = replace(c.name,".GOV.UK","")
set c.name = replace(c.name,".CO","")
set c.name = replace(c.name,".DK","")
set c.name = replace(c.name,".NO","")
set c.name = replace(c.name,".IO","");
and then merge some companies that could still possibly have a duplicate. For example in my data I had people from "deloitte.com" and "deloitte.dk" in my data - and as a consequence I ended up with two DELOITTE nodes that needed to be merged. That's where the apoc library can really help, specifically the apoc.refactor.mergeNodes procedure. Here's what I did:
match (c1:Company), (c2:Company)
where id(c1) < id(c2)
and c1.name = c2.name
with collect([c1,c2]) as companies
unwind companies as company
call apoc.refactor.mergeNodes(company,{properties:"discard",mergeRels:true}) yield node
return node;
So now we look at the the updated model, and see the graphification to take hold:


Another thing that I was quite interested in, is to see which meetings were actually happening online, using an online conferencing tool. So I decided to try to pull these out of the "description" fields in the Events - that's where you typically find the details of the conferencing provider (a URL, a meeting id, or something like that). Looking at a sample of the Events, I decided that the URL would be the most reliable thing to look for, so I wrote a few queries that allow me to inspect the description field and then extract the ConferenceProvider nodes. Here's how that works:
match (e:Event)
where e.description contains "meet.google.com"
merge (cp:ConferenceProvider {name:"Google Meet"})
merge (e)-[:USES_CP]->(cp);

match (e:Event)
where e.description contains "neo4j.zoom.us"
merge (cp:ConferenceProvider {name:"Zoom"})
merge (e)-[:USES_CP]->(cp);

match (e:Event)
where e.description contains "teams.microsoft.com"
merge (cp:ConferenceProvider {name:"Microsoft Teams"})
merge (e)-[:USES_CP]->(cp);

match (e:Event)
where toUpper(e.description) contains "GOTOMEETING"
merge (cp:ConferenceProvider {name:"LogMeIn GoToMeeting"})
merge (e)-[:USES_CP]->(cp);

match (e:Event)
where toUpper(e.description) contains "WEBEX"
merge (cp:ConferenceProvider {name:"Cisco WebEx"})
merge (e)-[:USES_CP]->(cp);
Running these queries is really quick:
And gives us a new extended model, as well as some very cool queries that we can start to run.

Querying the CalendarGraph

For example, let's look at the number of meetings that I did per ConferenceProvider
match (e:Event)--(cp:ConferenceProvider)
return cp.name, count(cp);
You can see how I started switching more to Zoom.
and there's only a handful of meetings without a ConferenceProvider:
match (e:Event)
where not ((e)--(:ConferenceProvider))
return count(e);

It's also interesting to see how this evolved over the year: we can look at the number of Conference calls per month:
match (e:Event)--(cp:ConferenceProvider)
return distinct e.starttime.month as month, count(cp)
order by month asc;
As you can tell: I went from 60-70 calls in january/february, to more than double that as the Covid-19 lockdowns started to take hold.
Fun fact for you: August 24th was not a great day in the life of Rik: when I run
match (e:Event) return distinct date(e.starttime), count(e) order by
count(e) desc limit 10;
I see
And that takes me back to a nightmare series of 20 meetings, starting at 8am until 10pm. Lovely.

Inferring new relationships

One last simple set of queries that I wanted to take a look at, was to look at the relationships between the attendees. If I see the same attendees on a calendar invite, that means that I can infer that these attendees have "met" eachother during the meeting. We can also then tally the number of meetings, and create a new weighted relationship between the Person nodes. Let me show you how I tried this, and how I created the new MEETS_WITH relationship:
call apoc.periodic.iterate(
"match (p1:Person)-[:ATTENDS]->(e:Event)<-[:ATTENDS]-(p2:Person)
where id(p1)<id(p2)
return p1,p2,e",
"merge (p1)-[mwr:MEETS_WITH]->(p2)
    on create set mwr.frequency = 1
    on match set mwr.frequency = mwr.frequency + 1",
{batchsize:1000, parallel:false});
This takes a little time to run - as the number of combinations between all attendees can be quite large. But once it finished, we can use it to then query that to find the most frequent meeters:
match (p1:Person)-[mw:MEETS_WITH]->(p2:Person) return p1.email,
mw.frequency, p2.email order by mw.frequency desc limit 10;
and I can see the "usual suspects" at the top:
I have also tried to run some data science queries on the dataset, but did not find anything really interesting to share with you. I will just point you to Neuler, so that you can play around with this yourself really easily.

Hope this was an interesting post for you - any feedback or questions just lat me know!

All the best

Rik

No comments:

Post a Comment