Wednesday, 18 November 2015

Podcast Interview with Felienne Hermans, TU Delft

Almost two years ago, my colleagues at Neo alerted me to the fact that there was this really cool lady in the Netherlands that had an even cooler story about Neo4j and Spreadsheets. Curious as I am, I invited Felienne to come and talk about this at our Amsterdam meetup, and we have been in touch with her and her department ever since. So naturally, we also invited her to come on our Podcast, and as usual :), we had a great conversation. Hope you enjoy:

Here's the transcript of our conversation:
RVB: 00:01 Hello everyone. This is Rik, Rik Van Bruggen from Neo Technology. Today, we're recording another podcast episode that I've been looking forward to for a long time, actually, because I'm a true fan, and not a groupie yet, but a fan of Felienne Hermans from Delft University. Hi, Felienne.
FH: 00:19 Hi, Rik. 
RVB: 00:20 Hey. Great to have you on the podcast. It's fantastic. We've seen you talk at a number of conferences and at the meet-up before. But most people won't have seen that yet, so maybe you can introduce yourself?
FH: 00:33 Sure. I'm Felienne Hermans. I'm assistant professor at Delft University of Technology where I run the Spreadsheet Lab. That's a research group of the university that researches spreadsheets, obviously. 
RVB: 00:45 Obviously, exactly. That also sort of hints at the relationship with graphs, I think, right? 
FH: 00:53 Yeah. People that have seen my talk or maybe googled me know that the title of my talk that I usually give is "Spreadsheets are graphs".

So, we use Neo4j to do graph analysis on spreadsheets. We do all sorts of analysis, actually. The whole idea of our research is that spreadsheets are actually "code". And then if it's code, you need an IDE. So you need to analyze all sorts of constructions within your code so you can see, maybe, do you have code smells in your spreadsheets? Maybe does your spreadsheet need refactoring? All the things you typically do on source code for analysis, you should also do on your spreadsheet, and this is where we use Neo4j. 
RVB: 01:34 If I recall, you actually did some work on proving that spreadsheets are Turing complete, right? 
FH: 01:41 Yeah, I did. To make my point that spreadsheets are code because then, people think it's funny. If I say, "Hey, I do research on spreadsheets and I've wrote a PhD dissertation on spreadsheets," people laugh in my face often [chuckles]. "Really, can you do dissertation on that?" It's software engineering. I'd say, "Yeah." But actually, people don't believe me. To prove my point - indeed I implemented it - a Turing machine in a spreadsheet using only formulas, ensure that they are Turing complete and that makes them as powerful as any other programming language. That should stop people from laughing at me. 
RVB: 02:17 [chuckles] Exactly. It's funny, but it's also really, really interesting. I think fundamentally, it's a very interesting approach and that's why I think people also love your talks. It's a very interesting topic. So, why did you get into graphs then, Felienne? Tell us more about that. How did you get into that relationship between spreadsheets and graphs? 
FH: 02:40 As I said, we do smell detection, as well. And for that, we initially stored information in a database and we stored information, for instance, what cell relates to what other cell? Because if you want to calculate a smell like "feature envy", and source code feature envy would be a method that uses a lot of fields from a different class. So you can see in a similar way that in a spreadsheet, a formula that uses a lot of cells from a different worksheet has the feature and the smell. It should actually go in the other worksheet. So in order to save that type of information, you need to store what cell relates to what other cell. And initially, I never thought about what database do I use. 
FH: 03:26 In my mind, a few years ago, database was just a synonym for a SQL server. I was in Microsoft world where we make plugins for Excel, so database is just SQL. Same thing. I didn't think about it. I just dropped all my stuff in database, aka SQL. And initially, that worked fine. Some analyses you can really easily do but at one point, you want to really deeply understand how all the cells relate to each other because you want to measure the health of the spreadsheet. So we got horrible queries. SQL queries of like one A4 sheet of paper. Very, very complicated. But still I thought databases are just hard. I didn't really think about it until I saw a talk from one of your colleagues, Amanda, at Build Stuff in Vilnius. I saw her talk, and then it was really like a light bulb above my head. Bing. This is what I need. 
FH: 04:23 And a few weeks later, I was onsite at a customer for weeks so I wasn't bothered by students or colleagues so I could really program for a while.  I thought, "Okay. This is my chance. Let's try and get my data into Neo4j and see how it will improve, what type of analysis that it would make easier." So that's what I did. When I tried, lots of the analyses - how many hopes are there between these two cells or what is the biggest span within the graph - obviously are very easy to answer in Neo4j. So that's how we changed some of our analyses queries to run on the Neo4J database because it was so much easier to explore the data in a graph way because spreadsheets are graphs. There's a whole graph layer underneath the grid-like interface. That was really easy to analyze. 
RVB: 05:16 That is such a great story and such a great summary of why it's such a great fit. I guess most people don't think of it that way. But effectively, what you're doing is like dependency analysis, right? 
FH: 05:28 Yeah. That's what we're doing. How did cells depend on each other? 
RVB: 05:31 Super interesting. Is that something that you currently already use? I know you've been developing software on top of Neo4j, right? Is that already something that people can look at? 
FH: 05:45 No. Currently we only look at it within our own research group. The analyses we do is for us as researchers. It's not user-facing, so we have a smell detection tool that is somewhat user-facing where spreadsheet users can upload their spreadsheet and they get some analysis in the browser, but that is no less advanced than the analysis we use. They're still using the SQL back-end because users typically don't really want to explore their spreadsheet in a way that we want to explore spreadsheets if we're researching. A spreadsheet user is not going to ask himself the question, "Hey, how wide are my cells connected?" That's really more a research tool. 
RVB: 06:31 I understand. Totally. What are your plans around that, Felienne? Are you still expanding that work or is that something that is still under development then? Where is this going, you think? 
FH: 06:42 Obviously, if you say smells and you say refactoring. We've done lots of work on the smells. Even though we keep adding new smells, we feel that we have covered the smells area pretty nicely. Then, the next step of course would be refactoring the smells. If I know that this cell suffers from feature envy, it is jealous of that nice worksheet where all the cells are-- that he is using that formula. You want to move the formula to the other worksheet so that it's nicely close together to the cells that he's using. So these type of refactorings - moving cells in order to improve the graph structure - is something that we are looking at. 
FH: 07:21 One of my PhD students is currently looking at comparing the underlying graph, so where are the cells connected to each other? Compare that look on the spreadsheet to where are the cells in the worksheet? If you have a big cluster of cells, they're all referring to each other but they are physically located on two different worksheets, that's probably not ideal for the user because then you have to switch back and forth. And the other way around is true, as well. If you have a worksheet where there are two clusters of cells relating to each other, maybe it would be better to give each of these clusters their own worksheets. So these are the type of refactorings that we are looking into. If you have a big disparity between how your spreadsheet is lay-outed, and how your graph connections are, then this is very smelly and you should do something to improve the structure. There are still a lot of graphs also, in the refactoring future that we see. 
RVB: 08:23 That, again, sounds so interesting. I think we could have a lot of joy - spreadsheet joy - because of that. I would love to see that. Very cool. Any other topics that you think would be relevant for our podcast listeners, Felienne?  Or anything else that comes to mind? 
FH: 08:44 Yes. One other thing, one final thing. I like to pitch my research a little bit if-- 
RVB: 08:49 Of course. Yeah. 
FH: 08:51 One of the things that we're also looking at is looking at what in a spreadsheet are labels and what in a spreadsheet is data? And especially how do they relate to each other? If you want to, for instance, generate documentation from a spreadsheet, or help a user understand a spreadsheet, it's very important to know if you have-- you take a random formula from a spreadsheet, what is it calculating? Is this the turnover of January, or is this the sales of blue shoes? And sometimes it's easy because, again, your layout matches the formulas. Sometimes you can just walk up the column or down the row to get the label. Sometimes the layout is a little bit more complicated. One of the things that we are working on is trying to make an algorithm - semiautomatic, and maybe with some user assistance, or entirely automated - where you can pick a random cell and then it will give you what it is, what is semantically happening in that cell. Can I add links to your story, as well? 
RVB: 09:57 Yes, you can. Yeah. 
FH: 09:59 Okay. So let's share a link of-- we did an online game where we gave people a random spreadsheet and a random cell, and they had to click the labels. We used that in an online course that I taught and we got 150,000 data points out of that game. We're currently analyzing that data to see what patterns are there in labeling. What usually is described by users as the labels of cells, and we hope that from that, we can generate or synthesize an algorithm that can do that for us. 
RVB: 10:28 Super cool. Let's put together a couple of links to your talks, but also to your research on the blog post that goes with this podcast. And then I'm sure people will love reading about it and will also love to hear about your future work. 
FH: 10:45 Thanks. 
RVB: 10:45 It will be great to keep in touch. Super. Thank you so much for coming on the Podcast, Felienne. I really appreciate it, and I look forward to seeing one of your talks, blog posts, whatever, in the near future. 
FH: 10:59 No problem. 
RVB: 11:00 All right. Have a nice day. 
FH: 11:01 Bye. 
RVB: 11:01 Bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


No comments:

Post a Comment