Thursday, 7 January 2016

The GraphBlogGraph: 1st blogpost out of 3

Making the GraphBlogGraph

For quite a few years now, I have been hosting my own blog at blog.bruggen.com. It’s been quite an interesting experience I must say. Long time ago, I started blogging as kind of a personal diary kind of thing, but… then Facebook and Twitter happened, and it seemed kind of redundant at the time. Then I got to work for Neo4j, and got stuck into the Neo4j community, and I “restarted” my blog to write about my life and work in the Neo4j community. It’s been a very, very fun ride.


So this past Christmas period I had to a bunch of work for my Orienteering club, and I do most of that work (registering club members, registering for races, managing billing etc) in Google Sheets. And I came across acouple of really interesting things that I did not knew existed - and that I thought would make a super cool Graph application. The two things were:
  • an easy and automated way to read a blog “feed” (in Atom or RSS) and put the items into a Google Sheet. This is called the “ImportFEED” function:
    here's the manual - it’s a really interesting piece of functionality.
  • an easy and automated way to pars XML (and therefore, HTML pages) and extract information from that XML using XPATH. This function is called ImportXML:

So my idea was basically very simple: why don’t I use this functionality to read the feeds from a couple of Neo4j-centric blogs that I know (using ImportFEED), and then use the URLs of the pages in the feed to scrape the HTML of the blogpost page with ImportXML, and extract the hyperlinks (<a href=”...”> tags in HTML). That way I could basically look at the graph of links between the different blogs, and see if I could discover anything interesting...


So I did. I will publish a couple of blogposts (!) in the next few days to explain the story.

Reading the GraphBlog-feeds

I got to work. I created a google sheet (which is publicly available for you to view and copy if you want), and listed some of the blogs that I would be interested in.

NameURLFeed
Rik Van Bruggenhttp://blog.bruggen.comhttp://blog.bruggen.com/feeds/posts/default
Michael Hungerhttp://jexp.de/bloghttp://jexp.de/blog/feed/
Stefan Armbrusterhttp://blog.armbruster-it.dehttp://blog.armbruster-it.de/feed/
Neo4j.comhttp://neo4j.com/bloghttp://neo4j.com/feed/
Max De Marzihttp://maxdemarzi.comhttp://maxdemarzi.com/feed/
Mark Needhamhttp://www.markhneedham.com/blog/http://feeds.feedburner.com/markneedham
Will Lyonhttp://www.lyonwj.com/http://www.lyonwj.com/atom.xml

I had some others on the list (Linkurio.us blog, GraphAware blog, GrapheneDB blog) but I could not immediately find the feeds of these blogs… maybe some day :)) …


So the next thing I did was I used ImportFEED to load the data of these feeds into a sheet of the workbook. The feeds actually look like this:
But with the ImportFEED function, it is really trivial to get that into a workable format. I used the following three formulae to load the created date (“items created”), the title (“items title”) and the URL (“items url”) of the last 20 posts in the feed into three colums:


=importfeed("http://blog.bruggen.com/feeds/posts/default","items created",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items title",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items url",TRUE, 20)


The result was actually super cool: a sheet for every blog that had date, title and url information for this particular blog.


Crawling the GraphBlog-pages

Then, next, I wanted to do some webpage crawling/scraping/whatever you want to call it with ImportXML. So that’s why I have the following formula:


=IMPORTXML(D2, "//a/@href")


Which is giving me and array like so:


Now, what I obviously want to do later on is import these things into a graph database, so I really wanted to get all of these links together into a “big” cell. So I decided to use a JOIN function to do that:
Whit the following JOIN I can actually put all these links into a cell of the spreadsheet, separated by a delimiter (“****” in this case):


=join("****",sort(unique(IMPORTXML(D2, "//a/@href"))))


By doing it this way, each of these cells we get a long piece of text string:


//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Label&widgetId=Label1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=PageList&widgetId=PageList1&action=editWidget&sectionId=crosscol****http://blog.bruggen.com/****http://blog.bruggen.com/2013_01_01_archive.html****http://blog.bruggen.com/2013_03_01_archive.html****http://blog.bruggen.com/2013_04_01_archive.html****http://blog.bruggen.com/2013_05_01_archive.html****http://blog.bruggen.com/2013_06_01_archive.html****http://blog.bruggen.com/2013_07_01_archive.html****http://blog.bruggen.com/2013_08_01_archive.html****http://blog.bruggen.com/2013_09_01_archive.html****http://blog.bruggen.com/2013_10_01_archive.html****http://blog.bruggen.com/2013_11_01_archive.html****http://blog.bruggen.com/2013_12_01_archive.html****http://blog.bruggen.com/2014_01_01_archive.html****http://blog.bruggen.com/2014_02_01_archive.html****http://blog.bruggen.com/2014_03_01_archive.html****http://blog.bruggen.com/2014_04_01_archive.html****http://blog.bruggen.com/2014_05_01_archive.html****http://blog.bruggen.com/2014_06_01_archive.html****http://blog.bruggen.com/2014_07_01_archive.html****http://blog.bruggen.com/2014_08_01_archive.html****http://blog.bruggen.com/2014_09_01_archive.html****http://blog.bruggen.com/2014_10_01_archive.html****http://blog.bruggen.com/2014_11_01_archive.html****http://blog.bruggen.com/2014_12_01_archive.html****http://blog.bruggen.com/2015_01_01_archive.html****http://blog.bruggen.com/2015_02_01_archive.html****http://blog.bruggen.com/2015_03_01_archive.html****http://blog.bruggen.com/2015_04_01_archive.html****http://blog.bruggen.com/2015_05_01_archive.html****http://blog.bruggen.com/2015_06_01_archive.html****http://blog.bruggen.com/2015_07_01_archive.html****http://blog.bruggen.com/2015_08_01_archive.html****http://blog.bruggen.com/2015_09_01_archive.html****http://blog.bruggen.com/2015_10_01_archive.html****http://blog.bruggen.com/2015_11_01_archive.html**** etc etc etc


Which is fine, because I know how to split this cell into individual “blog links” again. What I have now is a spreadsheet containing the blog feed, and all of the links that go from the individual blog pages to other pages. Nice!


In the next section I will be importing that spreadsheet into Neo4j, and then we can start playing around with it.


I hope you enjoyed this blogpost so far. I will publish part 2 in a few days, for sure.


Cheers

Rik

3 comments:

  1. Hello Rik!
    Thanks for mentioning us. There is an issue with the atom feed link on our blog. We'll fix it ASAP.
    In the meantime the URL is http://www.graphenedb.com/blog/atom.xml ;)

    ReplyDelete
  2. Thanks for the link, Alberto! I have added it to the spreadsheet - and will add the import statements soon... Cheers!

    ReplyDelete