The source data that we found in part 1 is in .csv format, which means that it basically looks tabular:
Luckily, we nowadays have some fantastic tools to import these files without writing any code at all, using the all-new Neo4j Data Importer. After drawing a few nodes and relationships, I was able to do the basic import. It was super quick, and returned after just a few seconds:
I am of course sharing the Data Importer config (model and data) as a zip file as well.
As usual, there is still a bit of messiness in the data, so I had to do some wrangling to get a better/richer model.
First, we would want to split the two parents of a Scholar into different fields:
:auto MATCH (s:Scholar)
CALL {
  WITH s
  SET s.parent1 = trim(split(s.parents,"/")[0])
  SET s.parent2 = trim(split(s.parents,"/")[1])
} IN TRANSACTIONS OF 1000 ROWS;

// remove the brackets, introduce comma
:auto MATCH (s:Scholar)
CALL {
  WITH s
  SET s.parent1 = replace(s.parent1," [",",")
  SET s.parent1 = replace(s.parent1,"]","")
  SET s.parent2 = replace(s.parent2," [",",")
  SET s.parent2 = replace(s.parent2,"]","")
} IN TRANSACTIONS OF 1000 ROWS;

// extract the IDs
:auto MATCH (s:Scholar)
CALL {
  WITH s
  SET s.parent1_id = trim(split(s.parent1,",")[1])
  SET s.parent1 = trim(split(s.parent1,",")[0])
  SET s.parent2_id = trim(split(s.parent2,",")[1])
  SET s.parent2 = trim(split(s.parent2,",")[0])
} IN TRANSACTIONS OF 1000 ROWS;
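To see what these three steps do to a single value, it can help to run the same string functions on a literal first, without touching the graph. This is just a sketch: the sample string below is made up, and assumes the "Name [id] / Name [id]" shape that the queries above imply. The real s.parents values may look different.

```cypher
// Hypothetical sample value, NOT taken from the dataset
WITH "Parent One [101] / Parent Two [202]" AS parents
// Step 1: split the two parents on the slash
WITH trim(split(parents, "/")[0]) AS p1,
     trim(split(parents, "/")[1]) AS p2
// Step 2: remove the brackets, introduce a comma
WITH replace(replace(p1, " [", ","), "]", "") AS p1_clean,
     replace(replace(p2, " [", ","), "]", "") AS p2_clean
// Step 3: separate the name from the ID
RETURN trim(split(p1_clean, ",")[0]) AS parent1,
       trim(split(p1_clean, ",")[1]) AS parent1_id,
       trim(split(p2_clean, ",")[0]) AS parent2,
       trim(split(p2_clean, ",")[1]) AS parent2_id;
```

Note that the extracted IDs are strings, so they will only match scholar_indx values that are stored as strings as well.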
This then allows us to create relationships between Scholar nodes that have other Scholar nodes as parents:
MATCH (s:Scholar)
WHERE s.parent1_id IS NOT NULL
WITH s
MATCH (parent:Scholar)
WHERE parent.scholar_indx = s.parent1_id
MERGE (s)-[:CHILD_OF]->(parent);
MATCH (s:Scholar)
WHERE s.parent2_id IS NOT NULL
WITH s
MATCH (parent:Scholar)
WHERE parent.scholar_indx = s.parent2_id
MERGE (s)-[:CHILD_OF]->(parent);
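Not every extracted parent ID will necessarily point to a Scholar that exists in the dataset. A quick sanity check, sketched here using only the property names from the queries above, counts the scholars whose parent1_id did not resolve to a node:

```cypher
// Count scholars with a parent1_id that matches no Scholar node
MATCH (s:Scholar)
WHERE s.parent1_id IS NOT NULL
  AND NOT EXISTS {
    MATCH (p:Scholar)
    WHERE p.scholar_indx = s.parent1_id
  }
RETURN count(s) AS unresolved_parent1;
```

The same check works for parent2_id by swapping the property name.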
The next step is to create the marriage relationships between Scholar nodes. To do that, we first have to split the s.spouse property and store the result as s.listofspouses:
:auto MATCH (s:Scholar)
CALL {
  WITH s
  SET s.listofspouses = split(replace(s.spouse," ",""),",")
} IN TRANSACTIONS OF 1000 ROWS;
Next, we UNWIND the s.listofspouses and get a list of scholar_indx properties that we can match and use to create the [:MARRIED_TO] relationships.
MATCH (s:Scholar)
UNWIND s.listofspouses as scholarspouse
WITH s, replace(split(scholarspouse,"[")[1],"]","") as scholarspouse_id
WHERE scholarspouse_id IS NOT NULL
MATCH (scholarspousenode:Scholar {scholar_indx: scholarspouse_id})
MERGE (s)-[:MARRIED_TO]->(scholarspousenode);
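The spouse parsing can be tried out on a literal in the same way. Again, this is only a sketch: the sample string is made up, and assumes a comma-separated list of "Name [id]" entries, which is what the queries above imply.

```cypher
// Hypothetical sample value, NOT taken from the dataset
WITH "Spouse A [11], Spouse B [22]" AS spouse
// Strip the spaces and split on the comma, as done for s.listofspouses
UNWIND split(replace(spouse, " ", ""), ",") AS scholarspouse
// Take whatever sits between the square brackets as the ID
RETURN scholarspouse,
       replace(split(scholarspouse, "[")[1], "]", "") AS scholarspouse_id;
```

Because the MERGE is directed, two scholars that each list the other as a spouse will end up with a [:MARRIED_TO] relationship in both directions.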
And then finally, we can create the teacher/student relationships between Scholar nodes:
// Students: link each listed student to this scholar
MATCH (s:Scholar)
UNWIND s.students_inds AS student
MATCH (st:Scholar {scholar_indx: student})
MERGE (st)-[:STUDENT_OF]->(s);
// Teachers: run as a separate statement, so that scholars without any matched students are not skipped
MATCH (s:Scholar)
UNWIND s.teachers_inds AS teacher
MATCH (tea:Scholar {scholar_indx: teacher})
MERGE (tea)-[:TEACHER_OF]->(s);
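To get a feel for what all of this wrangling has produced, a simple count per relationship type is a useful check. This sketch makes no assumptions beyond the graph that the queries above have built:

```cypher
// Count the relationships in the graph, grouped by type
MATCH ()-[r]->()
RETURN type(r) AS relationship_type, count(r) AS total
ORDER BY total DESC;
```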
After having done all of these manipulations, we actually can look at some really interesting subgraphs:
Note: there is some additional data in the dataset (and included in the (:Scholar) nodes), like areas of interest and tags. For the purpose of this exercise, which is about the Narrator networks and the chains of narration for each Hadith, this is not as interesting, and therefore we are not splitting that information off into separate nodes and relationships. It would be trivial to do so, but it is unnecessary at this point.
In the next blogpost, we will go and import the actual Hadiths that are being narrated into our graph.
Looking forward already!
Rik
PS: as always, all the code/queries are available on GitHub!
PPS: you can find all the parts in this blogpost on the following links