
Thursday, 6 October 2022

A Graph Database and a Dadjoke walk into a bar...


I just published a blogpost series with 6 different articles about me having fun with Dadjokes, in an unusual sort of way. Here are the links to the articles:

All of the queries etc. are put together in this markdown document. I plan to make a Neo4j Guide out of this as well in the next few days, so that it becomes easier to use.

Hope you will have as much fun with it as I did. 

Rik

DadjokeGraph Part 6/6: Closing: some cool Dadjoke Queries

A Graph Database and a Dadjoke walk into a bar...

Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.

How many times does a joke get tweeted?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;

How many times does a joke get favorited?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;

DadjokeGraph Part 2/6: Importing the Dadjokes into the Dadjoke Graph

A Graph Database and a Dadjoke walk into a bar...

This means that we want to convert the spreadsheet that we created before, or the .csv version of it, into a Neo4j Database.

Here's how we go about this. First things first: let's set up the indexes that we will need later on in this process:

CREATE INDEX tweet_index FOR (t:Tweet) ON (t.Text);
CREATE INDEX dadjoke_index FOR (d:Dadjoke) ON (d.Text);

Assuming the .csv file mentioned above is in the import directory of the Neo4j server, we can use load csv to create the initial dataset:

LOAD CSV WITH HEADERS FROM "file:///vicinitas_alldadjokes_user_tweets.csv" AS csv
CREATE (t:Tweet)
SET t = csv;

Import the Vicinitas .csv file

Or: if you want to create the graph straight from the Google Sheet:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;

DadjokeGraph Part 3/6: Taking on the real disambiguation of the jokes

A Graph Database and a Dadjoke walk into a bar...

We noticed that many of the Tweet nodes referred to the same jokes - and we resolved that already above. But this query makes us understand that we actually still have some work to do:

MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
RETURN path;

The Amazon Dadjokes

We will come back to that example below.

We now notice that there are quite a few Dadjoke nodes that are slightly different, yet very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, but start with a strategy that is based on string metrics.
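
As a hedged illustration of that idea, here's what such a comparison could look like using APOC's Levenshtein similarity - the threshold and the pairwise approach are my assumptions, not the original queries:

// Hedged sketch: score every pair of Dadjoke texts with APOC's Levenshtein similarity.
// Note: this is a cartesian product, so it is only feasible for smallish datasets.
MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1) < id(dj2)
WITH dj1, dj2, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS similarity
WHERE similarity > 0.7   // illustrative threshold
RETURN dj1.Text, dj2.Text, similarity
ORDER BY similarity DESC;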

DadjokeGraph Part 4/6: Adding NLP and Entity Extraction to prepare for further disambiguation

A Graph Database and a Dadjoke walk into a bar...

As we can see in the pyjamazon example from before, the disambiguation of our Dadjokes has come a long way - but it is not yet complete. Hence we call the graph to the rescue here, and take it a final step further - a step that will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and at making interesting and amazing recommendations on the back of that.

Here's what we are going to do:

  1. we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
  2. then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.

So let's start with the start.
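
A minimal sketch of what that APOC call could look like - the configuration details here are assumptions on my part, with $apiKey standing for your own Google Cloud API key and "HAS_ENTITY" an assumed relationship type:

// Hedged sketch: extract entities from every joke via the Google Cloud NLP service
MATCH (dj:Dadjoke)
WITH collect(dj) AS jokes
CALL apoc.nlp.gcp.entities.graph(jokes, {
    key: $apiKey,                          // your Google Cloud NLP API key
    nodeProperty: "Text",                  // the property holding the joke text
    writeRelationshipType: "HAS_ENTITY",   // assumed relationship type
    write: true                            // persist the Entity nodes and links
})
YIELD graph
RETURN graph;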

DadjokeGraph Part 5/6: Disambiguation using Graph Data Science on the NLP-based Entities

A Graph Database and a Dadjoke walk into a bar...

The next, and final, step of our dadjoke journey is to take the disambiguation to the next level by applying Graph Data Science metrics to the new, NLP-enriched structure that we have in our graph database. The basic idea is that, while the text similarity of these jokes may be quite far apart, their structural similarity may still be quite high, based on the connectivity between the joke and its (NLP-based) entities.

Calculating the Jaccard similarity metric using GDS

To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia. The index is defined as the size of the intersection divided by the size of the union of the sample sets, which is very well illustrated on that Wikipedia page. I have used Neuler (the no-code graph data science playground that you can easily add to your Neo4j Desktop installation) to generate the code below - but you can easily run this in the Neo4j Browser as well.
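
The generated code would be along these lines - a hedged sketch, since the projected graph name, labels and relationship type are assumptions on my part:

// project the Dadjoke-Entity structure into an in-memory graph (GDS 2.x syntax)
CALL gds.graph.project('dadjokes', ['Dadjoke', 'Entity'], 'HAS_ENTITY');

// Node Similarity uses Jaccard by default: shared entities divided by the union of entities
CALL gds.nodeSimilarity.stream('dadjokes')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).Text AS Joke1,
       gds.util.asNode(node2).Text AS Joke2,
       similarity
ORDER BY similarity DESC
LIMIT 10;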

DadjokeGraph Part 1/6: Building a Dadjoke database - from nothing

A Graph Database and a Dadjoke walk into a bar...

I am a dad. Happily married, with 3 wonderful kids of 13, 17 and 19 years old. So all teenagers - with all the positive and not so positive experiences that can come with that. And in our family, I have been accused of using dadjokes - preferably at very awkward or inconvenient times. I actually like dadjokes. A lot.

So I follow a few accounts on social media that post these jokes. Like for example baddadjokes, dadjokeman, dadsaysjokes, dadsjokes, groanbot, punsandoneliner, randomjokesio, and thepunnyworld - and there are many others. These are all in this list, should you be interested. It's a very funny list.

Dadjokers List on Twitter

Wednesday, 4 November 2020

Making sense of 2020's mad calendar with Neo4j


As we enter November 2020, I - like many people, I assume - can't help but feel quite "betwattled" by all of the events taking place this year. I took some time last weekend to look at all the crazy events that happened... starting with a pretty normal January and February, moving slowly to an ominous March, and then living the weird, (semi-)locked-down lives that we have been living until the very day I write this, which is the day after the bizarre US elections.

In any case, I decided to have some fun while reflecting about all this. And in my world, that means playing with data, using my favourite tools... CSV files, Google Sheets, and of course, Neo4j. Let me take you for a ride.

Starting out with my calendar

The starting point of all this is of course my Google Calendar - which is buried in online calls and meetings these days. 

Friday, 27 March 2020

Supply Chain Management with graphs: part 3/3 - some SCM analytics

I've been looking forward to writing this: this is the last of 3 blogposts that I have been planning to write for weeks about my experiments with a realistic Supply Chain Management Dataset. There's two posts before this one:
  • In the first post I found and wrangled a dataset into my favourite graph database, Neo4j
  • In the second post I got acquainted with the dataset in a bit more detail, and I was able to do some initial querying on it to figure out what patterns I might be able to expose.
In this third and last post I would like to get a bit more analytical with the dataset, and do some more detailed investigation in order to better understand some typical SCM questions. Note that I am far from a Supply Chain specialist - I barely understand the domain, and therefore I will probably be asking some silly questions initially. But bear with me - let's explore and learn, right?

Wednesday, 25 March 2020

Supply Chain Management with graphs: part 2/3 - some querying

So in the previous post, we got introduced to a dataset that I have been wanting to get into Neo4j for a long time: a Supply Chain Management dataset. Read up about it over here, but the long and short of it is that we got ourselves into the situation where we have an up and running Neo4j database with 38 different multi-echelon supply chains. Result!

As a quick reminder, here's what the data model looked like after the import:

Or visually:


Data validation and profiling

The first thing to do when you have a new shiny dataset like that, is of course to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file that they are quite different between the 38 of them. So let's do some profiling:

match (n) return distinct labels(n), count(*)
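
You could profile the relationship types the same way - for example:

// same idea, but for the relationships
MATCH ()-[r]->()
RETURN DISTINCT type(r), count(*);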

Saturday, 21 March 2020

Supply Chain Management with graphs: part 1/3 - data wrangling and import

Alright, I have been putting the writing of this blogpost off for too long. Finally, on this sunny Saturday afternoon where we are locked inside our homes because of the Covid-19 pandemic, I think I'll try to make a dent in it - I have a lot of stuff to share already.

The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown number) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs, if you think about it. Supply Chains are graphs. So why not store and retrieve these chains with a graph database? Seems obvious.
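
Just to illustrate the pathfinding point - a hedged sketch with hypothetical :Location nodes and :DELIVERS_TO relationships, not the actual model of this dataset:

// find the shortest delivery route between two hypothetical locations
MATCH (supplier:Location {name: "Supplier A"}),
      (customer:Location {name: "Customer Z"}),
      p = shortestPath((supplier)-[:DELIVERS_TO*]->(customer))
RETURN p, length(p) AS hops;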

We've also had lots of examples of people trying to solve supply chain management problems in the past. Take a look at some of these examples:
And of course some of these presentations from different events that we organised:
So I had long thought that it would be great to have some kind of a demo dataset for this use case. Of course it's not that difficult to create something hypothetical yourself - but it's always more interesting to work with real data - so I started to look around.

Monday, 22 October 2018

Poring over Power Plants: Global Power Emissions Database in Neo4j

In the past couple of weeks, I have been looking at some interesting datasets for the Utility sector, where Networks or Graphs are of course in very, VERY abundant supply. We have Electricity Networks, Gas Networks, Water Networks, Sewage Networks, etc. etc. - all of which form really interesting graphs. Lots of users have specialised, GIS-based tools to manage these networks - but when you think about it, there are so many interesting things that we could do if ... we would only store the network as a network - in Neo4j, of course.
So I started looking for some datasets, and maybe I am just not familiar enough with this domain, but I did not really find anything too graphy. But I did find a different dataset that contained a lot of information about Power Plants - and their emissions. Take a look at this website:
and then you can download the Excel workbook from over here. It's not that big - and of course the first thing I did was to convert it into a Google Sheet. You can access that sheet over here:

There's two really interesting tabs in this dataset:
  1. the sheet containing the fuel types: this gives you a set of classifications of the fuel that is used in the different power plants around the globe
  2. the list of 30.5k power plants from around the world that generate different levels of power from different fuel types. While doing so, they also generate different levels of emissions, of course, and that data is also clearly mentioned in this sheet. Note that the dataset does not include any information on Nuclear plants - as they don't really have any "emissions" other than water vapour and... the nuclear waste, of course.
So let's get going - let's import this dataset into Neo4j.
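
A hedged sketch of what that import could look like - the column names ("plant_name", "country", "fuel_type") are assumptions on my part, not the actual headers of the sheet:

// load the power plants and connect them to their fuel type
LOAD CSV WITH HEADERS FROM "file:///power_plants.csv" AS csv
MERGE (f:FuelType {name: csv.fuel_type})
CREATE (p:PowerPlant {name: csv.plant_name, country: csv.country})
CREATE (p)-[:USES_FUEL]->(f);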

Wednesday, 20 July 2016

Graphing the Tour de France - part 3/3

In the past two blogposts I have been creating and importing some nice Tour de France 2016 data. It's a small dataset, for sure, and this is by no means a realistic graph application - but perhaps we can still have some fun exploiting the data with some cypher queries. That's what we'll try now. I have put all of the example queries together in this gist, so please feel free to play around with it :) ... let's take you through it.

Is the model really there?

First and foremost, let's verify the model that we wanted to put in place, with yet another AAPOC (Awesome APOC), and compare it to the model we thought we were going to get.
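
That verification can be done with APOC's meta procedure - presumably something like:

// draw the actual metamodel of the database
CALL apoc.meta.graph();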

Monday, 18 July 2016

Graphing the Tour de France - part 2/3

In a previous blog post, I created a couple of Google spreadsheets with some of the results data of the 2016 Tour de France. These spreadsheets can be very easily downloaded as two comma-separated files that hold the data:
I will be updating the stages.csv files as the Tour progresses, so we can keep updating the graph as well.

Creating a model

To import these CSV files into Neo4j, I actually went through multiple iterations of the model. Here's two of them that I wanted to share with you - not because of the fact that one of them would be "right" and the other one would be "wrong", but because it really reflects the fact that your use case - the questions that you want to ask of your data and what you want to be doing with the data - is going to determine the model. Underlined. In Bold. Because it's so important.

Thursday, 14 July 2016

Graphing the Tour de France - part 1/3

Alright, it's time to come out of the closet. I have to admit, over the past couple of years, I have turned into a bit of a cycling geek. I love watching the races in Flanders in spring, the legendary "ride through hell" from Paris to Roubaix, and of course, now, in summertime, the big tours of Italy, France and Spain. I have grown quite addicted to it - and have taken to riding my own bike a couple of times a week as well... it's a ton of fun. Last year I did a fun experiment in a series of 5 blog posts about the Professional Cycling twitterverse, but this year, I had something else thrown into my lap. Here's what happened.

Thursday, 3 March 2016

The Neo4j Knowledge Graph

A couple of days ago, I wrote a graphgist about creating a true Knowledge Graph for the Neo4j ecosystem, based on the fantastic Awesome Neo4j resource created by our friends at Neueda/Neueda4j. You can access it in a separate window over here.


In this post however, I will go into a bit more detail about how I went about creating that graph.

Google Spreadsheet is my friend

I mentioned already that I started from the awesome Awesome Neo4j github resource. And while it's a great idea to manage pages etc. collaboratively on Github, I can't help but feel like there should be other, nicer ways of structuring that information. So I spent a couple of hours converting that information into a spreadsheet (which is publicly accessible over here):

This sheet contains 
  • info about the resource (name and comments)
  • the URL where you can find the resource
  • info about the author (individual or organisation) that created/manages the resource
So it's a very, very easy graph model:



So all I needed to do was import that sheet into Neo4j. Easy...

Importing the Google Spreadsheet with Load CSV

As we know by now, it's really easy to download a Google spreadsheet as a CSV file, and then it is pretty darn easy to import that CSV into Neo4j with Load CSV. I have two versions of that load script:
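
Roughly, such a load script would look like this - a hedged reconstruction, as the column names and relationship types here are my assumptions rather than the original script:

// load the resources and connect them to their authors and tags
LOAD CSV WITH HEADERS FROM "file:///awesome_neo4j.csv" AS csv
MERGE (a:Author {name: csv.Author})
MERGE (t:Tag {name: csv.Tag})
CREATE (r:Resource {name: csv.Resource, url: csv.URL, comments: csv.Comments})
CREATE (a)-[:CREATED]->(r)
CREATE (r)-[:TAGGED_WITH]->(t);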

The result is not a very big graph of course:


And now we can do some nice querying on it - just for fun!

Querying the Neo4j KnowledgeGraph

Obviously there are many different queries that we could run on an interesting graph like this. I have put a couple of them on Github as well. Here they are:

//Find some Authors, Resources and Tags
MATCH p = ((a:Author)--(r:Resource)--(t:Tag))
return p
limit 25

Gives you an initial sample of the graph:

Then we can explore a couple of specific graph neighborhoods:

//Find some Authors, Resources and Tags connected to Rik or Max
MATCH (t:Tag)--(r:Resource)--(a:Author)
where a.name contains "Rik" or a.name contains "Max"
return t,r,a

this gets us this one:


And then we can also "recreate" a spreadsheet-like view of the graph:

//find some resources and authors
MATCH (r:Resource)--(a:Author)
where a.name contains "Rik" or a.name contains "Max"
return distinct a.name as Author, r.name as Resource, r.url as URL, r.comments as Description
order by Author;

This gets us (pity that the URLs don't get hyperlinked like they do on the graphgist):


And then finally, let's look at some pathfinding - always interesting:

//find some paths between books and blogs
match (t1:Tag {name:"book"}), (t2:Tag {name:"blog"}),
p = allshortestpaths ( (t1)-[*]-(t2))
return p
limit 10

As usual, we end up with Michael Hunger again :)) 


So there you go. A first attempt at creating another graph-based knowledge repository for all things Neo4j.  Hope you guys enjoyed that. I know I did :))

Cheers

Rik

Wednesday, 24 February 2016

What's a graph without a laugh? Multi-lingual Graph Karaoke is HERE!

Indeed. If you follow this blog and some of my work with Neo4j a little bit, then I am hoping you know that I like to have FUN with Neo4j as well. It's actually quite amazing that you CAN in fact have fun with something as "boring" as a database - but hey, I really think you can.

I have been doing Graph Karaoke for a long time (I still blame Nigel) - and that has been a ton of fun along the way. Over the years, I gathered quite a playlist:
And now with GraphConnect Europe coming up in April, I felt like we would really need some more songs to spice up the conference. But: being in Europe - WHY would we limit ourselves to boring ENGLISH songs all the time? So I figured there had to be a way to do Graph Karaoke in a multi-lingual way. And that's what I will be doing in the next few months - and today is the first episode of "MultiLingual Graph Karaoke".

Google Sheets to the rescue!

We'll start with a nice little Dutch song that I really like, by the world-famous (!) group "Doe Maar!". The song is "Dansmuziek", and has a lovely vibe to it, I think.

I grabbed the lyrics online, and put them in a Google spreadsheet. It looks like this:
As you can see (access the sheet yourself over here), I have one column in there with the lyrics in the original, Dutch language, but I also have a couple of other columns that are automatically translated into English, German and French using the Googletranslate function of Google Sheets. I simply do:

=GOOGLETRANSLATE(C2,"nl","en")
=GOOGLETRANSLATE(C2,"nl","fr")
=GOOGLETRANSLATE(C2,"nl","de")


and these lyrics get automatically translated into the other languages. Obviously the translations will not be perfect, but hey, at least you get to do multi-lingual graph karaoke then! Yey!

Once I have this, I can download the spreadsheet as a CSV file, and then I can start working with it in Cypher. I have created an Import gist on Github - so you can basically run it yourself and import the graph at your own convenience.
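
A hedged sketch of the word-by-word part of that import - the column names ("Seq", "Dutch") are assumptions on my part, and the real gist does more (all the languages, plus linking consecutive words):

// split each Dutch sentence into words, and create one node per word
LOAD CSV WITH HEADERS FROM "file:///dansmuziek.csv" AS csv
WITH toInteger(csv.Seq) AS seq, split(csv.Dutch, " ") AS words
UNWIND range(0, size(words) - 1) AS idx
CREATE (w:Dutch {seq: seq, idx: idx, name: words[idx]});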

Creating the Karaoke video

Once we have that, it's child's play to actually create the Karaoke video. A few very simple queries suffice:

match (w:Dutch {seq:1})
return w;




to find a sentence in a specific language. Or

match (w {seq:1})
return w;



to find a sentence in any language. And then finally also a "tabular" representation that would be easy to read:

match (w {seq:4})
where not ("SongSentence"in labels(w))
return w.seq as Sequence, labels(w) as Language, collect(w.name) as Sentence
order by Language[0]


All of that then brings me to the following - slightly stupid, I agree - slightly wonderful result:


That's it! I plan to make a few more of these before GraphConnect, and publish them here as well. If you have any "song requests" in your own language, then PLEASE let me know.

Cheers

Rik

Monday, 11 January 2016

The GraphBlogGraph: 2nd blogpost out of 3

Importing the GraphBlogGraph into Neo4j

In the previous part of this blog-series about the GraphBlogGraph, I talked a lot about creating the dataset for what I wanted: a graph of blogs about graphs. I was able to read the blog-feeds of several cool graphblogs with a Google spreadsheet function called “ImportFEED”, and scrape their pages using another function called “ImportXML”. So now I have the sheet ready to go, and we also know that with a Google spreadsheet, it is really easy to download that as a CSV file:

You then basically get a URL for the CSV file (from your browser’s download history):

and that gets you ready to start working with the CSV file:

I can work with that CSV file in Cypher’s LOAD CSV command, as we know. All we really need is to come up with a solid Graph Model to do what we want to do. So I went to Alistair’s Arrows, and drew out a very simple graph model:



So that basically gets me ready to start working with the CSV files in Cypher. Let's run through the different import commands that I ran to do the imports. All of those are on github of course, but I will take you through them here too...

First create the indexes

create index on :Blog(name);
create constraint on (p:Page) assert p.url is unique;

Then manually create the blog-nodes:

create (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"});
create (n:Blog {name:"Neo4j Blog", url:"http://neo4j.com/blog"});
create (n:Blog {name:"JEXP Blog", url:"http://jexp.de/blog/"});
create (n:Blog {name:"Armbruster-IT Blog", url:"http://blog.armbruster-it.de/"});
create (n:Blog {name:"Max De Marzi's Blog", url:"http://maxdemarzi.com/"});
create (n:Blog {name:"Will Lyon's Blog", url:"http://lyonwj.com/"});

I could have done that from a CSV file as well, of course. But hey - I have no excuse - I was lazy :) … Again…

Then I can start with importing the pages and links for the first (my own) blog, which is at blog.bruggen.com and has a feed at blog.bruggen.com/feeds/posts/default:

//create the Bruggen blog entries
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
match (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"})
create (p:Page {url: csv.URL, title: csv.Title, created: csv.Date})-[:PART_OF]->(b);

This just creates the 20 leaf nodes from the Blog node. The fancy stuff happens next, when I read the “Links” column, holding the “****”-separated links to other pages, split them up into individual links, and merge the pages and create the links to them. I use some fancy Cypher magic that I have also used before for Graph Karaoke: I read the cell, split it into parts that I put into a collection, and then unwind the collection and iterate through it using an index:

//create the link graph
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
with csv.URL as URL, csv.Links as row
unwind row as linklist
with URL, [l in split(linklist,"****") | trim(l)] as links
unwind range(0,size(links)-2) as idx
MERGE (l:Page {url:links[idx]})
WITH l, URL
MATCH (p:Page {url: URL})
MERGE (p)-[:LINKS_TO]->(l);

So this first MERGEs the new pages (finds them if they already exist, creates them if they do not yet exist) and then MERGEs the links to those pages. This creates a LOT of pages and links, because of course - like with every blog - there’s a lot of hyperlinks that are the same on every page of the blog (essentially the “template” links that are used over and over again).
And as you can see it looks a little bit like a hairball when you look at it in the Neo4j Browser:
So in order to make the rest of our GraphBlogGraph explorations a bit more interesting, I decided that it would be useful to do a bit of cleanup on this graph. I wrote a couple of Cypher queries that remove the “uninteresting”, redundant links from the Graph:

//remove the redundant links
//linking to pages with same url (eg. archive pages, label pages...)
match (b:Blog {name:"Bruggen"})<-[:PART_OF]-(p1:Page)-[:LINKS_TO]->(p2:Page)
where p2.url starts with "http://blog.bruggen.com"
and not ((b)<-[:PART_OF]-(p2))
detach delete p2;

//linking to other posts of the same blog
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"})<-[:PART_OF]-(p2:Page),
(p1)-[lt:LINKS_TO]-(p2)
delete lt;

//linking to itself
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"}),
(p1)-[lt:LINKS_TO]-(p1)
delete lt;

//linking to the blog provider (Blogger)
match (p:Page)
where p.url contains "//www.blogger.com"
detach delete p;

Which turned out to be pretty effective: when I run these queries, I weed out a lot of “not so very useful” links between nodes in the graph.
And the cleaned-up store looks a lot better, and is a lot more workable.

If you take a look at the import script on github, you will see that there's a similar script to the one above for every one of the blogs that we set out to import. Copy and paste them one by one into the browser or the neo4j shell, or use LazyWebCypher, and have fun:
So that’s it for the import part. Now there’s only one thing left to do, in Part 3/3 of this blogpost series, and that is to start playing around with some cool queries. Look that post in the next few days.

Hope this was interesting for you.

Cheers

Rik

Thursday, 7 January 2016

The GraphBlogGraph: 1st blogpost out of 3

Making the GraphBlogGraph

For quite a few years now, I have been hosting my own blog at blog.bruggen.com. It's been quite an interesting experience, I must say. A long time ago, I started blogging as kind of a personal diary thing, but… then Facebook and Twitter happened, and it seemed kind of redundant at the time. Then I got to work for Neo4j, got stuck into the Neo4j community, and “restarted” my blog to write about my life and work in that community. It's been a very, very fun ride.


So this past Christmas period I had to do a bunch of work for my Orienteering club, and I do most of that work (registering club members, registering for races, managing billing etc.) in Google Sheets. And I came across a couple of really interesting things that I did not know existed - and that I thought would make for a super cool Graph application. The two things were:
  • an easy and automated way to read a blog “feed” (in Atom or RSS) and put the items into a Google Sheet. This is called the “ImportFEED” function:
    here's the manual - it’s a really interesting piece of functionality.
  • an easy and automated way to parse XML (and therefore, HTML pages) and extract information from that XML using XPATH. This function is called ImportXML:

So my idea was basically very simple: why don’t I use this functionality to read the feeds from a couple of Neo4j-centric blogs that I know (using ImportFEED), and then use the URLs of the pages in the feed to scrape the HTML of the blogpost page with ImportXML, and extract the hyperlinks (<a href=”...”> tags in HTML). That way I could basically look at the graph of links between the different blogs, and see if I could discover anything interesting...


So I did. I will publish a couple of blogposts (!) in the next few days to explain the story.

Reading the GraphBlog-feeds

I got to work. I created a google sheet (which is publicly available for you to view and copy if you want), and listed some of the blogs that I would be interested in.

Name | URL | Feed
Rik Van Bruggen | http://blog.bruggen.com | http://blog.bruggen.com/feeds/posts/default
Michael Hunger | http://jexp.de/blog | http://jexp.de/blog/feed/
Stefan Armbruster | http://blog.armbruster-it.de | http://blog.armbruster-it.de/feed/
Neo4j.com | http://neo4j.com/blog | http://neo4j.com/feed/
Max De Marzi | http://maxdemarzi.com | http://maxdemarzi.com/feed/
Mark Needham | http://www.markhneedham.com/blog/ | http://feeds.feedburner.com/markneedham
Will Lyon | http://www.lyonwj.com/ | http://www.lyonwj.com/atom.xml

I had some others on the list (Linkurio.us blog, GraphAware blog, GrapheneDB blog) but I could not immediately find the feeds of these blogs… maybe some day :)) …


So the next thing I did was I used ImportFEED to load the data of these feeds into a sheet of the workbook. The feeds actually look like this:
But with the ImportFEED function, it is really trivial to get that into a workable format. I used the following three formulae to load the created date (“items created”), the title (“items title”) and the URL (“items url”) of the last 20 posts in the feed into three columns:


=importfeed("http://blog.bruggen.com/feeds/posts/default","items created",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items title",TRUE, 20)
=importfeed("http://blog.bruggen.com/feeds/posts/default","items url",TRUE, 20)


The result was actually super cool: a sheet for every blog that had date, title and url information for this particular blog.


Crawling the GraphBlog-pages

Then, next, I wanted to do some webpage crawling/scraping/whatever you want to call it with ImportXML. So that’s why I have the following formula:


=IMPORTXML(D2, "//a/@href")


Which gives me an array like so:


Now, what I obviously want to do later on is import these things into a graph database, so I really wanted to get all of these links together into one “big” cell. So I decided to use a JOIN function to do that:
With the following JOIN, I can put all these links into a single cell of the spreadsheet, separated by a delimiter (“****” in this case):


=join("****",sort(unique(IMPORTXML(D2, "//a/@href"))))


By doing it this way, each of these cells gets one long text string:


//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=Label&widgetId=Label1&action=editWidget&sectionId=sidebar-right-1****//www.blogger.com/rearrange?blogID=4466865603389367352&widgetType=PageList&widgetId=PageList1&action=editWidget&sectionId=crosscol****http://blog.bruggen.com/****http://blog.bruggen.com/2013_01_01_archive.html****http://blog.bruggen.com/2013_03_01_archive.html****http://blog.bruggen.com/2013_04_01_archive.html****http://blog.bruggen.com/2013_05_01_archive.html****http://blog.bruggen.com/2013_06_01_archive.html****http://blog.bruggen.com/2013_07_01_archive.html****http://blog.bruggen.com/2013_08_01_archive.html****http://blog.bruggen.com/2013_09_01_archive.html****http://blog.bruggen.com/2013_10_01_archive.html****http://blog.bruggen.com/2013_11_01_archive.html****http://blog.bruggen.com/2013_12_01_archive.html****http://blog.bruggen.com/2014_01_01_archive.html****http://blog.bruggen.com/2014_02_01_archive.html****http://blog.bruggen.com/2014_03_01_archive.html****http://blog.bruggen.com/2014_04_01_archive.html****http://blog.bruggen.com/2014_05_01_archive.html****http://blog.bruggen.com/2014_06_01_archive.html****http://blog.bruggen.com/2014_07_01_archive.html****http://blog.bruggen.com/2014_08_01_archive.html****http://blog.bruggen.com/2014_09_01_archive.html****http://blog.bruggen.com/2014_10_01_archive.html****http://blog.bruggen.com/2014_11_01_archive.html****http://blog.bruggen.com/2014_12_01_archive.html****http://blog.bruggen.com/2015_01_01_archive.html****http://blog.bruggen.com/2015_02_01_archive.html****http://blog.bruggen.com/2015_03_01_archive.html****http://blog.bruggen.com/2015_04_01_archive.html****http://blog.bruggen.com/2015_05_01_archive.html****http://blog.bruggen.com/2015_06_01_archive.html****http://blog.bruggen.com/2015_07_01_archive.html****http://blog.bruggen.com/2015_08_01_archive.html****http://blog.bruggen.com/2015_09_01_archive.html****http://blog.bruggen.com/2015_10_01_archive.html****http://blog.bruggen.com/2015_11_01_archive.html**** etc etc etc


Which is fine, because I know how to split this cell into individual “blog links” again. What I have now is a spreadsheet containing the blog feed, and all of the links that go from the individual blog pages to other pages. Nice!


In the next blogpost in this series I will be importing that spreadsheet into Neo4j, and then we can start playing around with it.


I hope you enjoyed this blogpost so far. I will publish part 2 in a few days, for sure.


Cheers

Rik