Monday, 6 April 2020

Graphistania 2.0 - Episode 6 - The One with the CovidGraph

So, when I started working with Graphs in 2012, one of the first community use cases that I encountered was all about biotech. I met a few people from the University of Ghent, who were working on some amazing protein interaction networks - and it was fascinating. Over the years, we have done quite a few activities on this, and we have kind of built a nice life sciences and healthcare community around Neo4j. Some amazing work is being done there.

One of the most amazing cases out there has been the use case of the German Center for Diabetes Research, who have been scouring the scientific universe for ways of finding cures for diabetes. Look at this brief video or read this article to learn more about it.

Why am I telling you this? Well, with the Covid-19 pandemic sweeping around the globe, and many of us being affected in small or big ways, our Neo4j Graph Community has been doing the most interesting things to try and apply the "power of the graph" to this complex and intricate problem. Take a look at covidgraph.org for their work. When I learned about it, I immediately thought about talking to some of the "chief instigators" and inviting them for a podcast interview - which we made happen at record speed :) ...

So here it is: a chat about Covid-19, and about how graphs will help us make sense of the data. Let's hope it proves to be useful.




As always, you can find the transcription of our conversation below:
RVB 00:00:00.853 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here we are recording a very, very special episode of our Graphistania podcast. We haven't done this in quite some time. For the past couple of months, I've been doing this podcast together with my dear friend Stefan Wendin, who is also on this recording. But today, we also have some guests invited, and those guests are going to talk to us a little bit about a topic that I think most of us are very much involved with today. We're recording this episode on April 3rd, and that means that most of Europe is in a lockdown situation for COVID-19. Not the most joyous experience for most of us, especially in view of the human tragedy behind it, but we have actually some really interesting stuff happening in our graph community on this topic, and that is the fact that some of our community members have launched some initiatives to provide more insights into this epidemic, into this pandemic, and we're going to talk about some of that. So with that, I'd like to invite our dear guests-- I'll introduce our dear guests here. That's Alexander Jarasch and Martin Preusse. And maybe you guys can introduce yourself. Maybe I'll start with you, Alexander; can you introduce yourself a little bit? 
AJ 00:01:29.129 Yeah. So my name is Alexander. I'm the head of data and knowledge management at the German Center for Diabetes Research, and I'm working with Neo4j graphs. 
RVB 00:01:40.589 And as you've introduced yourself to me personally, you're a little bit of a showhorse for Neo4j, in general. I've seen you on stage a couple of times, and you've got some great use cases with Neo4j, but you're not alone today. You've got Martin with you, right? Martin Preusse. Martin, can you introduce yourself?
MP 00:02:00.288 Sure. Good to be here. I'm Martin. I'm a computational biologist. I started using Neo4j in my PhD and that was, I think, 2013 or '14; I don't really recall, actually. I've been working with Neo4j ever since. And right now, I have an independent consultancy implementing Neo4j in data integration projects in medical research and pharmaceutical companies. And coincidentally, Alex is one of my clients. 
RVB 00:02:26.667 Okay. Fantastic. Very good. And so the reason why we're kind of doing this special edition of the Graphistania podcast is this initiative that you guys started called CovidGraph, right? Covidgraph.org, if I'm not mistaken; that's where you guys are sharing some of your work. But it might be good to understand a little bit where that came from. What's the background here? Where did you guys start, and why did you start doing this? Can one of you maybe explain that? I don't know who wants to take it first. 
MP 00:03:05.568 Maybe I start because two or three weeks ago, I started collecting some datasets in Neo4j randomly, actually, because collecting data in Neo4j, this is what I do. And I started with the case data from Johns Hopkins University. I suppose everyone knows that: the global case data by country and by region in some countries, etc. I just put it in and combined it with population data to see things like percentage of infected, etc. And this is how it started, and then more and more of these public datasets popped up. And the most interesting one was a publication dataset, so a curated set of around 50,000 cited publications from different sources that are relevant for COVID-19. And it's super interesting, and then I contacted Alex because in the project we're working on together, we're also working with text and getting text into a graph and getting something out of the text, etc. And then, yeah, it exploded.
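A minimal sketch of that first step, combining case counts with population data to get a percentage of infected per region. It uses plain Python dictionaries rather than Neo4j. The region names, the numbers and the `percent_infected` helper are all hypothetical placeholders, not the project's actual data or code.

```python
# Hypothetical placeholder figures, NOT real case or population data.
cases_by_region = {"Region A": 1_200, "Region B": 45_000}
population_by_region = {
    "Region A": 600_000,
    "Region B": 9_000_000,
    "Region C": 2_000_000,  # no case data available for this region
}


def percent_infected(cases, population):
    """Return confirmed cases as a percentage of population, per region.

    Only regions that appear in both inputs are reported.
    """
    shared = cases.keys() & population.keys()
    return {
        region: 100.0 * cases[region] / population[region]
        for region in sorted(shared)
    }


rates = percent_infected(cases_by_region, population_by_region)
for region, rate in rates.items():
    print(f"{region}: {rate:.3f}% of the population infected")
```

Regions without a match in both datasets are simply left out, which is one possible choice. A real integration would probably want to flag them instead.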
RVB 00:04:08.265 Fantastic. And this is what we now know as CovidGraph, right? So this is the-- it's a public project that people can take a look at, if I'm understanding this correctly? And what do you aim to do with this? What's the goal here? Why do you think this is relevant? 
MP 00:04:27.155 I mean, there's so much data now out there, but the people interested in these datasets, they cannot actually use it. I mean, clinical researchers, etc., they don't have time and not the IT and coding and data skills to get the stuff they need out of these datasets. So I think it's important to integrate everything and to make it accessible, to actually give the right piece of information that's encoded there somehow to the people who need it right now. And I think this is what we want to do. 
RVB 00:04:57.392 What type of information is that? Or what type of information are you putting into this graph? 
MP 00:05:03.385 The community grew quickly, and we have a couple of different people loading different datasets. So the next step after the publications were patents. I mean, there are a lot of patents around that mention the coronavirus and mention potential drugs for the coronavirus and doing things like connecting relevant information from publication, so certain genes that are important for the virus and the process and drugs from patents that target these genes. These are the kind of links that we need to understand the mechanisms behind the disease and to get new ideas and new hypotheses how to treat it. 
RVB 00:05:42.509 Interesting. And what's been the reaction? Have you had some interesting reactions from the industry, the people that have been reaching out to you?
MP 00:05:51.959 Yeah, yeah, sure. Of course, we try to elaborate the network we have and contact a lot of people, but also random, unknown people from pharmaceutical companies popped up, contacted us, and they're definitely interested. And also, they started contributing. 
RVB 00:06:09.965 Maybe this is my ignorance talking, but how is this similar to diabetes research [laughter]? Is it really that similar? 
AJ 00:06:20.392 On a technical level, yes, it's pretty similar, and this is what Martin already mentioned. We are working together on another use case for diabetes, and we're doing more or less the same, studying texts on diabetes and learning something from them. And the crucial point here is not only to integrate data but also to connect it so that the researchers get the information much more easily and on a visual basis. And so, technically, it's the same, and there is even some evidence that diabetes has something to do with it or-- let's say the infection rate or infection chance is higher if you are a diabetic patient.
RVB 00:07:08.204 I think I read somewhere that people with existing conditions like diabetes are much more vulnerable for COVID, right? Is that true? 
AJ 00:07:19.054 That's correct, and the problem is that we don't know why that is, and this is one of the use cases. We want to find evidence or any clues that we can find in the text data that we can explain why people, in general, are getting infected, why diabetic patients have more problems. Yeah, so this is one of the goals. 
RVB 00:07:45.669 Interesting. So what's the plan for CovidGraph? Do you have a plan, or are you just going along as you go, or are there any specific milestones that you're going after? 
MP 00:07:59.822 Of course, we have a plan. Ha-ha. No. We have a plan. We have a couple of ideas. And I think one thing that we want to do and I think that sets us apart from a lot of different-- from some of the other data initiatives right now is we want to work with users. So we don't just want to collect stuff in Neo4j because it's cool technology and to market a certain technology or something. We want to work with actual users who need these datasets right now. We started doing this already, and the prototype application that is able to answer specific questions is on its way, and we're going to publish it soon. 
RVB 00:08:38.618 Wow. That's fantastic. And what does that application do? Is it understandable for a non-scientific person like myself? 
MP 00:08:47.586 I think so, yes. So it's a network search thing. So in our dataset, we have text datasets, right, publications and patents. And what you can do is you can search for genes that are mentioned in these texts and publications and patents because in the end, a gene is sort of the central concept of biology. In the end, a gene defines mechanisms, etc., and what happens when the virus enters a cell. So the gene is the point of reference, and connecting, for example, a publication through a gene to a patent would mean that the drug that is mentioned in the patent is somehow connected to what's explained in this publication.
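The idea of connecting a publication to a patent through a shared gene can be sketched in a few lines of Python. This is an in-memory illustration only, not the CovidGraph data model or its Neo4j queries. The document identifiers, the gene names and the `link_through_genes` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical "document mentions gene" facts: (document type, document id, gene).
mentions = [
    ("publication", "PUB-1", "GENE_X"),
    ("publication", "PUB-2", "GENE_Y"),
    ("patent", "PAT-1", "GENE_X"),
    ("patent", "PAT-2", "GENE_Z"),
]


def link_through_genes(mentions):
    """Return (publication, gene, patent) triples that share a gene mention."""
    pubs_by_gene = defaultdict(set)
    patents_by_gene = defaultdict(set)
    for kind, doc_id, gene in mentions:
        target = pubs_by_gene if kind == "publication" else patents_by_gene
        target[gene].add(doc_id)

    links = set()
    for gene, pubs in pubs_by_gene.items():
        for pub in pubs:
            # .get() avoids creating empty entries for genes without patents.
            for patent in patents_by_gene.get(gene, ()):
                links.add((pub, gene, patent))
    return sorted(links)


for pub, gene, patent in link_through_genes(mentions):
    print(f"{pub} -[mentions]-> {gene} <-[mentions]- {patent}")
```

In a graph database the same question becomes a two-hop path through the gene node, which is the kind of traversal graphs are good at.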
RVB 00:09:26.296 Oh, yeah. And that might then-- for example, that might lead you to finding drugs that are relevant to COVID which we may already have but may not use yet? 
MP 00:09:38.499 Exactly. Because repurposing of existing drugs is a big thing because if it's FDA-approved already, then you know that it's safe and you know the side effects, etc. So if there's something around that is approved already that you might test or use against COVID-19, that would be great.
RVB 00:09:57.937 There were a lot of reports, at least, in the Belgian press of a malaria drug that was suspected to be of use for COVID, right?
AJ 00:10:07.790 Exactly. Yeah. So, as sad as it is, but coronavirus-19 is just a coronavirus like any other, and there are drugs already out there or some other patents or some other publications giving information, and this is where we try to learn something from it. I always compare it a little bit to the use case from NASA where they wanted to learn something from the failed missions to the moon. So there's a lot of text data, and you have to basically read it and nobody is able to read all this stuff anymore. We have more than 50,000 publications, so nobody is going to read that. And so you have to have a technique or a certain-- yeah, a certain visualisation where you can automatically analyse and then provide the user with some more condensed information.
RVB 00:11:01.925 Are there any specific techniques that you use for this? I'm thinking NLP or maybe even some graph algorithms. Are there any things like that that you're specifically thinking of using in this? 
AJ 00:11:17.928 In the last two weeks, during the last weeks, we sort of sorted the data and connected the data. And now, we have the databases and now, we are starting to use the Lucene index for fast lookup of genes in the text. We tried some NLP techniques, and yes, the graph data science package would be now the next thing to test and to find some cool insights that we haven't seen so far. 
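To give a feel for what a fast lookup of genes in text means, here is a toy inverted index in Python. It is a stand-in for the idea only: it is not Lucene, not Neo4j's full-text index, and not the project's code. The documents, the gene symbols and both helper functions are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map every lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def find_documents(index, gene_symbol):
    """Return the sorted ids of documents that mention the gene symbol."""
    return sorted(index.get(gene_symbol.lower(), set()))


# Hypothetical placeholder texts and gene symbols.
documents = {
    "doc-1": "Expression of GENE1 changes after infection.",
    "doc-2": "A compound targeting gene1 and GENE2.",
}

index = build_index(documents)
print(find_documents(index, "GENE1"))
print(find_documents(index, "GENE2"))
```

The index is built once, after which each lookup is a single dictionary access instead of a scan over all texts. That is the property that matters when there are tens of thousands of publications.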
RVB 00:11:52.458 We can only look forward to the results of that, of course. Some really interesting things there. So you mentioned this prototype application. Are there other things afterwards, like end-user facing, you said? But any major things that you're looking for or things that you might need help for, potentially? I don't know. 
AJ 00:12:15.893 Yeah. So I think we definitely need help in building applications. So we have a couple of people that are going to collect more use cases with potential end-users to define that, but building applications is key. And if there's someone, I don't know, who knows GRANDstack maybe and would be able to implement a GRANDstack interface on top of the graph, I think that would make building applications much easier, and that would be great. And any help in this direction would be fantastic.
RVB 00:12:50.170 I saw that there were a bunch of other partners involved with CovidGraph, right, people like yWorks and PRODYNA and who was-- Structr, I think, was also involved, wasn't it? 
AJ 00:13:03.461 [crosstalk]. Yeah. All three of them from the very beginning. And without them, we wouldn't be here. We wouldn't have. So yWorks is building the application, the prototype application, and they managed to do that next to their day job in a couple of days; Structr built the website, and PRODYNA worked a lot on data loading and data integration. So they did the work with us.
RVB 00:13:30.287 Impressive, impressive. Great. Well, I mean, I don't have any other questions, but the one big, silent person in this room has been Stefan [laughter]. He's always so vocal otherwise and now, all of a sudden, he's just stupefied. 
SW 00:13:47.266 Yeah. I don't know what happened. I just practised on being quiet, for once. No, but I think for me also, just shining a little bit on this is thank you for doing this, not only for me as a person but for mankind and also, again, showing again how it is to do things, right? And I think this is why I also reached out to you guys before I went and we talked about this. And I think it's so beautiful that you just started doing this, and then people come to you because you do things. And this is how a network scales. And this is also the call-out, so you ask for people familiar with GRANDstack to help you out. So, yeah, let's leverage our networks, and let's do things because doing things is not about talking about doing things. It's about actually doing. So, yeah, I think it's super interesting to see how this develops, and I'm super interested to see the data science library being used on this. Immediately when you asked about it, I started to come up with ideas of which one you want to run and so on. So, now, I'm basically going to derail my whole afternoon thinking of this [laughter], but that's going to be fun, right? This is what life is about, coming together and saving the world. So thank you.
RVB 00:15:04.628 Thank you also from my side, and not just for doing this podcast, but for the work, in general. I think we're going to wrap up here. We're going to put some links to all of your work in the transcription of the podcast, obviously, as we do always. And then, I'm going to wish you and colleagues and your friends a lot of health but also good luck with this project. Thank you for doing that, and I hope to talk to you soon again. 
AJ 00:15:32.645 Thank you very much. 
MP 00:15:32.819 Thank you so much. 
RVB 00:15:33.481 Thanks, guys. 
SW 00:15:34.548 Thank you.
Subscribing to the podcast is easy: just add the RSS feed, find us on Spotify, or add us in iTunes! Hope you'll enjoy it!

All the best, stay safe, keep your distance - but STAY CONNECTED!

Rik
