Alright, this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There are two posts that came before this article:
- in the first post we focused on importing and refactoring the data in Neo4j
- in the second post we spent some time analysing the dataset with some algorithms and some specific pattern matching queries
Let's get right to it.
Fulltext querying of Emails
Those of you who have been following Neo4j for some time may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for its indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance, as the newer indexes are much leaner and better optimized than the older Lucene-based structures. Read more about indexing in the Neo4j manual.
So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do