But for now we will just have a great conversation about his work. More interesting links below in the transcription - as usual.
Here's the transcript of our conversation:
RVB: 00:02.882 Hello, everyone. My name is Rik Van Bruggen from Neo Technology and today I am recording another episode for our Graphistania podcast. And I'm being joined by Daniel Himmelstein from University of Pennsylvania. You're a postdoc fellow there, right, Daniel?
DH: 00:19.828 That is true. I just got my PhD in San Francisco and then moved east to Philadelphia.
RVB: 00:26.980 Fantastic, why don't you introduce yourself a little bit, Daniel, and your work and your relationship to the wonderful world of graphs.
DH: 00:34.620 Okay, I guess I could introduce myself with my Twitter description which is, "Digital craftsman of the biodata revolution."
RVB: 00:44.286 Wow, that sounds great [chuckles].
DH: 00:48.558 Wow [laughter]. What I really do is-- I'm a scientist working on integrating a lot of medical data and making predictions about biology and disease. It's an exciting time, because there's so much data that's becoming available, and we need ways to organize and store that data and learn from it. And that's where Neo4j has filled the gap for us.
RVB: 01:15.906 How did you get into the world of Neo4j? How did you get to know us?
DH: 01:20.650 I work with what I call hetnets, and a hetnet I define as a network with multiple node or relationship types. And when I started doing this research about four or five years ago, I looked at Neo4j a little bit, but it didn't quite suit my needs then. I don't think Cypher was mature at that point, which is a query language. So I wrote a little package in Python to work with graphs with multiple types of relationships, because a lot of the built-in Python packages, or more mature packages, didn't really do a good job representing types on a network. So that's how I got interested in it. Then several years down the road, I reevaluated Neo4j and I said this will solve a lot of the problems we were having. It'll take a huge development burden off of our shoulders, and we're going to be part of this great ecosystem.
RVB: 02:29.816 You met some of our people at a meet-up in San Francisco - Nicole White and those types of people, right?
DH: 02:39.084 That's right. It was a fun meet-up. And it just really clicked with me, because Nicole was going over the basic concepts, like how each relationship has a type, each node has one or more labels, edges are directed. I was like, "Wow, this is what we need. This is a database for hetnets." Even though I don't think anyone-- I asked Nicole, "Do you know the term hetnet?" and she didn't. I think in Neo4j speak, you call it a property graph.
RVB: 03:09.286 Yes. Well, hetnet - I'm from Belgium, and my mother tongue is Dutch, and hetnet means "the net." [laughter] So "het" is the-- how do you say it here? Is the equivalent of "the." [chuckles] So that's a bit funny in my language, but-- [laughter]
DH: 03:30.122 I like it.
RVB: 03:31.121 Yeah, exactly - "the net." So, can you tell us a little bit more about it? Why is it such a good fit for hetnets? You describe it in your GraphGists, and you made a public instance of Neo4j available which I'll obviously link to from the podcast, but why is it such a good fit, Daniel?
DH: 03:51.768 Yeah, so, I guess, to answer that I'll tell you a little bit more about what we're doing. We're trying to encode as much of the knowledge produced by biomedical research in the past 50 years as possible. So we take data from millions of medical and scientific studies and we condense it into a network. And traditionally, people have done this, but they've done it with a single type of node and single type of relationship. So, for example, people would make networks with genes and they would connect the genes if they interacted inside of a cell. But obviously, biology's very complex. And given that complexity, it helps to model it with the actual diversity of types that are involved in health and disease.
RVB: 04:42.522 Can you give an example of the different types of interactions?
DH: 04:46.171 Yeah. So what we've created is something we call Hetionet. Version 1.0 has 11 different types of nodes and 24 types of relationships. So what these would be, would be like a compound or a drug, so that's something like aspirin. Then we have diseases, so a disease would be multiple sclerosis, diabetes, et cetera. We have the symptoms of diseases. We have the side effects of compounds. And those are all node types. But then we'll have relationships. So, for example, the compound is known to cause different side effects. And that's information that's actually extracted from the drug labels - the little package you get on the inside of your medication when you pick it up from the pharmacy. And then, of course, we have genes. So in the past decade, there's been a lot of research on how different compounds affect genes in your body. Does giving someone a drug or compound make more or less of a given gene? So we have that type of relationship. We also have a relationship for which genes does a compound target in the body. So, how are the compounds designed to act?
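To make the idea of typed nodes and typed relationships concrete, here is a minimal Python sketch of a hetnet. It is not Hetionet's actual schema or Daniel's package: the class names `Node` and `Hetnet`, the relationship names `TARGETS` and `CAUSES`, and the placeholder gene and side effect are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # node type, e.g. "Compound" or "Disease"
    name: str


@dataclass
class Hetnet:
    # Adjacency map: source node -> list of (relationship type, target node).
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_edge(self, source, rel_type, target):
        """Add a directed, typed relationship from source to target."""
        self.edges[source].append((rel_type, target))

    def neighbors(self, source, rel_type):
        """Return the targets reachable from source via one relationship type."""
        return [t for r, t in self.edges.get(source, []) if r == rel_type]


# Toy data only: the gene and side effect are placeholders, not real findings.
aspirin = Node("Compound", "Aspirin")
gene_a = Node("Gene", "GENE_A")
side_effect_x = Node("SideEffect", "SIDE_EFFECT_X")

net = Hetnet()
net.add_edge(aspirin, "TARGETS", gene_a)
net.add_edge(aspirin, "CAUSES", side_effect_x)

print(net.neighbors(aspirin, "TARGETS"))
```

Because each edge carries its type, the same compound can be asked different questions (what it targets, what it causes) without mixing the answers, which is the point of moving beyond a single node and relationship type.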
RVB: 06:11.524 So, you model all this information in a graph-- in a property graph, in a hetnet? And what are the types of questions that you want to ask of that? Is it about drug interaction, or is it about new treatment paradigms, or what's the end goal there?
DH: 06:28.729 Yes. The question that we've been asking most recently is, "Can we systematically learn why drugs work?" So, traditionally drug development is often very serendipitous. So, people observe that a drug has a certain effect. Oftentimes, for a lot of the main pharmaceutical therapies, it is not entirely known why they work, just that they were observed to have a positive effect on a disease. Traditional pharmacology, when actually looking at why compounds work, or why drugs work, is done on a single drug-disease level. So, they look at a single therapy and try to understand why it works. But we're looking for patterns across all drugs that work. So, from a machine-learning perspective, what makes compound-disease pairs that actually are efficacious? What makes them different from non-efficacious compound-disease pairs?
RVB: 07:34.190 Wow, that sounds like there could be a lot of potential there. A lot of new drugs that could be re-purposed or new applications. Is that what you're looking for?
DH: 07:46.543 Totally. So, the end result of our algorithm is we make about 200,000 predictions, and each one of those predictions is for a compound-disease pair, and we give a probability that we think that compound-disease pair represents a treatment. So, if you're interested, you can go to our website and you can browse by a compound or disease and see all of the predictions. Actually, what's cool is that when you have a specific prediction you're interested in, you can click on it and it takes you to a guide in our public Neo4j browser. So you can see what parts of the network contribute to that prediction. The specific network paths that we think provide evidence or support that a drug treats a disease.
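The notion of "network paths" connecting a compound to a disease can be sketched in a few lines of Python. This is only an illustration of path enumeration on a tiny typed network, not the prediction algorithm discussed here: the function `find_paths`, the relationship names, and the toy nodes are hypothetical, and edges are treated as traversable in both directions as a simplifying assumption.

```python
def find_paths(edges, start, goal, max_length=3):
    """Enumerate simple paths from start to goal with at most max_length edges.

    edges is a list of (source, relationship_type, target) triples.
    Each path is a list alternating node names and relationship types.
    """
    # Build an adjacency map that can be walked in either direction.
    adjacency = {}
    for source, rel_type, target in edges:
        adjacency.setdefault(source, []).append((rel_type, target))
        adjacency.setdefault(target, []).append((rel_type, source))

    paths = []

    def walk(node, visited, trail):
        if node == goal:
            paths.append(trail)
            return
        # The trail alternates nodes and relationship types, so the
        # number of edges used so far is (len(trail) - 1) // 2.
        if (len(trail) - 1) // 2 >= max_length:
            return
        for rel_type, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor}, trail + [rel_type, neighbor])

    walk(start, {start}, [start])
    return paths


# Toy network with placeholder names; none of these are real findings.
toy_edges = [
    ("Compound X", "TARGETS", "Gene A"),
    ("Disease Y", "ASSOCIATES", "Gene A"),
    ("Compound X", "CAUSES", "Side Effect Z"),
]

for path in find_paths(toy_edges, "Compound X", "Disease Y"):
    print(" - ".join(path))
```

In this toy network the only connection between the compound and the disease runs through the shared gene; a path of that kind is the sort of evidence one could inspect visually in a graph browser.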
RVB: 08:37.137 I've seen that. I thought that was so well done. Congratulations on that. Really, really cool, actually. So this sounds like a mountain of gold. Is this all in the public domain, or is this just academic research, or does it have business applications as well?
DH: 08:55.247 So, we're part of an open science movement where we release all the code for what we do under open source licenses, we release all the data as openly as possible. So everything, if possible, is put into the public domain, and we're really looking to get people to use the research we make. It's fine if they profit off of it, that would be great. We just want to produce something that people find useful. I guess, because I'm a publicly funded scientist, I get to do [chuckles] what I want and make it available for free.
RVB: 09:29.303 I think that is just so admirable and we really, really applaud you for that. We were talking about it earlier, right? So this podcast is going to be published under a Creative Commons license as well because that's how you want to publish your work. I really applaud that; that's fantastic. Really, we appreciate it.
DH: 09:47.535 Thanks, yeah. I guess [chuckles] it may just be a selfish thing that I like when my work is reused [laughter].
RVB: 09:54.100 No, I think it's a-- especially in the type of data that you're dealing with and this type of research that you're doing. I mean, this could save lives, right? I think it's important that people do stuff like that and congratulate you on that. Really, we do.
DH: 10:12.112 Thanks, yeah. Well, I've also experienced it from both sides, because we had to take data from about 30 different resources to integrate it into Hetionet. And a lot of them would have licenses, even though they were publicly funded academic research projects, that made it really hard to integrate the data. So that taught me the hard way the importance of having permissive open licenses.
RVB: 10:38.555 So let's talk about the future, Daniel. Where is this going, what are your plans with graphs and with Neo4j? Where do you want to take this?
DH: 10:48.302 Yeah, so right now, Hetionet has about 2.5 million edges or relationships. And I'd like to not only grow that number but start to get more meaningful edges. So I think we can grow the network quite a bit, and we can look at new applications. So we were predicting whether a compound treats a disease, but we could also predict, say, new side-effects of compounds, or we could start to get a more nuanced algorithm. So part of my work is developing algorithms on these hetnets, so that's also of interest. As far as Neo4j goes, I've been really excited about the guide technology. So you briefly mentioned that, but we have this public Neo4j instance which lets anyone just go to the URL - which is neo4j.het, which is H-E-T, dot I-O, and then immediately see a Neo4j browser with our network in it, and we have guides which are like a little kind of web page, or HTML tutorial that just shows up naturally in the browser and can inform you about the network. So, I think that really will help biologists and pharmacologists interact with their network to have these guides.
RVB: 12:23.428 Well, I'll put some links to this, with the podcast transcription. So hopefully you'll get some people visiting it. And I really thought it was very impressive what you did there and much more impressive than-- I did a beer guide [chuckles]--
DH: 12:40.783 I think I've seen that.
RVB: 12:44.274 Which is a lot less interesting, but that's the only thing I know anything about. So--
DH: 12:52.192 I did see on one of the previous podcasts. I think it was a network of movies, it was like date night. Two people would put in the movies they liked and they would find an intermediate movie. That was cool.
RVB: 13:06.729 Yeah. That actually got a Webby award recently and the guy is from-- I forgot the name. Ben Nussbaum was the guy that I interviewed about it, there on the west coast. Well, Daniel, thank you so much for coming online and doing this interview with me and I wish you best of luck with all your research, and hopefully it will lead to lots of new treatments and new interesting research. Thank you so much and hope to meet you one day at one of the GraphConnect conferences, perhaps.
DH: 13:43.515 Yeah, totally. I'm excited. I think the whole community is developing so quickly. We use Docker to deploy our cloud instance and the Neo4j support there is good. Just a really fast-moving project. So, exciting.
RVB: 14:02.672 Great. Thank you so much. I want to keep this digestible and short, so I'm going to wrap up here and I'll talk to you soon.
DH: 14:10.268 Okay. Toodle-oo.
RVB: 14:11.482 Toodle-oo [chuckles], exactly. Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!
All the best