Sunday 17 May 2015

Cycling Tweets Part 3: Adding "Friends" to the CyclingGraph

In this 3rd part of this blogpost series about Cycling (you can find part 1 and part 2 in earlier posts) we are going to take the existing Neo4j dataset a bit further. We currently have the CQ Ranking metadata in there, the tweets that we exported all connected up to the riders' handles, and then we analysed the tweets for @handle and #hashtag mentions. We got this:

Now my original goal included having a social graph in there too: the "friends" relationships for different twitterers could be interesting too. Friends are essentially two-way follow-relationships - where two handles follow each other, thereby indicating some kind of closer relationship. It's neatly explained over here. So how to get to those?

Well, I did some research, and while there are multiple options, my conclusion was that really you would need to have a script that would talk to the twitter API. And since we also know that IANAP (I AM Not A Programmer), I would probably need a little help from my friends.

Friends to the rescue: my first python script

Turns out that my friend and colleague Mark Needham had already done some work on a very similar topic: he had developed a set of Python scripts that used the Tweepy library for reading from the Twitter API, and Nigel Small's Py2Neo for writing to Neo4j.  So I started looking at these and found them surprisingly easy to follow.

So I took a dive at the deep end, and started to customize Mark's scripts. I actually spent some time going through a great python course at Codecademy, but really my tweaks to Mark's script could have been done without that too. His original script had two interesting arguments that I decided to morph:

  • --download-all-user-profiles
    I tweaked this one to "download all user friends" from the users.csv file. The new command is below.
  • --import-profiles-into-neo4jI tweaked this one to "import all friends into neo4j" from the .json files in the ./friends directory. The new command is also below.

In order to use this, you do need to put in placeI have put my final script over here for you to take a look. In order to use it, you have to register an App at Twitter, and generate credentials for the script to work with:

That way, our python script can read stuff directly from the Twitter API. Don't forget to "source" the credentials, as explained on Mark's readme.

2 new command-line arguments

Mark's script basically uses a number of different command-line arguments to do stuff. I decided to add two arguments. The first argument I added was

python twitter.py --download-all-user-friends 

This one talks to the Twitter API, and downloads the friends of all the users that it found in the users.csv file.  I generated that file based on the CQ ranking spreadsheet that I had created earlier.
As you can see, it pauses when the "rate limit" is reached - this is standard Tweepy functionality. The output is a ./friends directory full of .json files. Here's an example of such a file (tomboonen1.json)
In these .json files there is a "friends" field. Using the second part of the twitter.py script, we can then import these friends to our existing Neo4j CyclingTweets database using the following Cypher statement (note that the {variables} in green are Python variables, the rest is pure Cypher):
"""
MATCH (p:Handle {name: '@'+lower(
{screenName})})
SET p.twitterId =
{twitterId}
WITH p
WHERE p is not null
UNWIND {friends} as friendId
MATCH (friend:Handle {twitterId: friendId})
MERGE (p)-[:FOLLOWS]->(friend)
"""
So essentially this finds the screenName (aka the "handle"), adds the twitterId to the screenName, and then adds the "FOLLOWS" relationships between that handle and the friends of that handle. Pretty sweet.

So let's run the script, but this time with a different command line argument, and with a running Neo4j server in the background that the script could talk to:

python twitter.py --import-friends-into-neo4j


After a couple of minutes (if that), this is done, and we have a shiny new graph that includes the FOLLOWS relationships:
This is pretty much what I set out to create in the first place, but thanks to the combination of the import (part 2) and this Python script - I have actually got a whole lot more info in my dataset. Some very cool stuff.

Hope you liked this 3rd part of this blogpost series. There's so much more we could do so - so look out for part 4 soon!

Cheers

Rik

No comments:

Post a Comment