Bruggen Blog: Cycling Tweets Part 2: Importing into Neo4j

Thursday, 14 May 2015

Cycling Tweets Part 2: Importing into Neo4j

So after completing the first part of this blogpost series, I had put together a bit of infrastructure to easily import data into Neo4j. All the stuff was now in CSV files and ready to go:

So I got out my shiny new Neo4j 2.2.1, and started using Load CSV for getting the data in there. Essentially there were three steps:

Importing the metadata about the riders and their twitter handles: importing the metadata
Importing the actual tweets
Processing the actual tweets

So let's go through this one by one. We will be using the following model to do so:

1. Importing the Cycling metadata into Neo4j

I wrote a couple of Cypher statements to import the data from CQ ranking:


//add some metadata
//country info
load csv with headers from 
"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1390098748" as csv
create (c:Country {code: csv.Country, name: csv.FullCountry, cq: toint(csv.CQ), rank: toint(csv.Rank), prevrank: toint(csv.Prev)});
 
//team info
load csv with headers from"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1244447866" as csv
merge (tc:TeamClass {name: csv.Class})
with csv, tc
match (c:Country {code: csv.Country})
merge (tc)<-[:IN_CLASS]-(t:Team {code: trim(csv.Code), name: trim(csv.Name), cq: toint(csv.CQ), rank: toint(csv.Rank), prevrank: toint(csv.Prev)})-[:FROM_COUNTRY]->(c);
 
//twitter handle info
using periodic commit 500load csv with headers from 
"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=0" as csv
match (c:Country {code: trim(csv.Country)})
merge (h:Handle {name: trim(csv.Handle), realname: trim(csv.Name)})-[:FROM_COUNTRY]->(c);
 
//rider info
load csv with headers from 
"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0/export?format=csv&id=1lLD2I_czto1iA1OjCMAZZxnYLAVsngBgjT5c0xuvpJ0&gid=1885142986" as csv
match (h:Handle {realname: trim(csv.Name)}), (t:Team {code: trim(csv.Team)})
set h.Age=toint(csv.Age)
set h.CQ=toint(csv.CQ)
set h.UCIcode=csv.UCIcodeset h.rank=toint(csv.Rank)
set h.prevrank=toint(csv.Prev)
create (h)-[:RIDES_FOR_TEAM]->(t);
 
//add the index on Handle
create index on :Handle(name);
create index on :Hashtag(name);
create index on :Tweet(text);
create index on :Handle(nodeRank);
create constraint on (h:Handle) assert h.twitterId is unique;

As you can see, I also added some indexes. The entire script is also on Github.

The graph surrounding Tom Boonen now looked like this:

Once I had this, I could start adding the actually twitter info. That's next.

2. Importing the tweet data into Neo4j

As we saw previously, I had one CSV file for every day now. So how to iterate through this? Well, I did it manually, and create a version of this query for every day between April 1st and 30th.


//get the handles from the csv file
//this should not do anything - as the handles have already been loaded above
using periodic commit 500load csv with headers from "file:<yourpath>/20150401.csv" as csv
with csv
where csv.Username<>[]
merge (h:Handle {name: '@'+lower(csv.Username)});
  //connect the tweets to the handles
using periodic commit 500load csv with headers from "file:<your path>/20150401.csv" as csv
with csv
where csv.Username<>[]
merge (h:Handle {name: '@'+lower(csv.Username)})
merge (t:Tweet {text: lower(csv.Tweet), id: toint(csv.TweetID), time: csv.TweetTime, isretweet: toint(csv.IsReTweet), favorite: toint(csv.Favorite), retweet: toint(csv.ReTweet), url: csv.`Twitter URL`})<-[:TWEETS]-(h);

This file is also on Github, of course. I ran this query 30 times, replacing 20150401 with 20150402 etc etc... The result looked like this:

But obviously this is incomplete: we only have the tweets issued by specific riders now - and we would really like to know who and what they mentioned - in other words extract the handles and hashtags from the tweets. Let\s do that!

3. Processing the tweets: Extract the handles and the hashtags

I created two queries to do this - they are also on Github:


//extract handles from tweet text and connect tweets to handles
match (t:Tweet)
WITH t,split(t.text," ") as words
UNWIND words as handles
with t,handles
where left(handles,1)="@"with t, handles
merge (h:Handle {name: lower(handles)})
merge (h)-[:MENTIONED_IN]->(t);
 
//extract hashtags from tweet text and connect tweets to hashtags
match (t:Tweet)
WITH t,split(t.text," ") as words
UNWIND words as hashtags
with t,hashtags
where left(hashtags,1)="#"with t, hashtags
merge (h:Hashtag {name: upper(hashtags)})
merge (h)-[:MENTIONED_IN]->(t);

And that's when we start to see the twitter network unfold: multiple riders tweeting and mentioning eachother:

That's about it for this part 2. In the next section we will go into how we can enrich this dataset with more data about the connectedness between riders. Who is following who?

I hope you have liked this series so far. As always, feedback very welcome.

Cheers

Rik

Bruggen Blog

Pages

Thursday, 14 May 2015

Cycling Tweets Part 2: Importing into Neo4j

1. Importing the Cycling metadata into Neo4j

2. Importing the tweet data into Neo4j

3. Processing the tweets: Extract the handles and the hashtags

No comments:

Post a Comment

Labels

Blogarchive

Metricool