Tuesday, 16 June 2020

What VAT Fraud Detection and Contact Tracing have in common

In the previous blogpost we already illustrated in some detail that the contact tracing graph that we built, has a lot of similarities with a product recommendation system graph. We focused on a the Person-Visit-Place triangle that we had built in our Contact Tracing Graph data model, and converted the red and yellow bits into a Person-Purchase-Product triangles.
There is of course another part to the contact tracing graph that is also very interesting: the Person-Meets-Person subgraph. We derived that graph from the original contact tracing graph, by assuming that if a Person had visited a Place at the same time as another person, they would have been likely to have had a meeting there. This Person-Meets-Person subgraph was the basis for most of our graph analytics.


Let's now explore how we can use that subgraph to illustrate the similarities of the contact tracing use cases with another excellent graph use case: fraud detection. We will use the hypothetical case of a VAT Fraud carousel, aka "Missing trader fraud".  

Creating the Person-to-Person meeting graph with faker

To do so, we will create the Person-to-Person meeting graph with the neo4j faker plugin as well. Here's the script for this, in two steps:

//Create 5000 Persons using faker
foreach (i in range(1,5000) |
    create (p:Person { id : i })
    set p += fkr.person('1940-01-01','2020-05-15')
    set p.healthstatus = fkr.stringElement("Sick,Healthy")
    set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    set p.birthDate = datetime(p.birthDate)
    set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    set p.name = p.fullName
    remove p.fullName
);

//Create 15000 MEETS relationships using faker
match (p:Person)
with collect(p) as persons
call fkr.createRelations(persons, "MEETS" , persons, "1-n") yield relationships as meetsRelations1
call fkr.createRelations(persons, "MEETS" , persons, "1-n") yield relationships as meetsRelations2
call fkr.createRelations(persons, "MEETS" , persons, "1-n") yield relationships as meetsRelations3
with meetsRelations1+meetsRelations2+meetsRelations3 as meetsRelations
unwind meetsRelations as meetsRelation
set meetsRelation.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
set meetsRelation.endtime = meetsRelation.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
set meetsRelation.meettime = duration.between(meetsRelation.starttime,meetsRelation.endtime)
set meetsRelation.meettimeinseconds=meetsRelation.meettime.seconds;

Note that the seconds step (where we create the relationships) does not use a faker function, but calls a faker procedure instead. This all runs in a few seconds:

And has a very simple recursive data model, as we had planned.

Now, what we would like to do is to convert the Person-Meets-Person model into a Company-Invoices-Company model, with a small extension that reifies the Invoice as a separate entity. We could have imagined having a "Meeting" entity in the Contact Tracing graph as well.

Transforming the Person-Meets-Person graph into a Company-INVOICES-Company graph

The transformation is pretty easy:
//set the indexes
create index on :Person(name);
create index on :Company(vatnumber);
create index on :Company(name);
create index on :Invoice(amount);
create index on :Invoice(invoicedatettime);

//Make Companies out of Persons
match (p:Person)
set p:Company
remove p:Person
set p.name = toUpper(p.name)+', '+fkr.stringElement("Ltd.,Inc.")
set p.vatnumber = fkr.stringElement("BE,NL,SE,GB,FR,DE")+' '+fkr.code(' #### #### ####')
remove p.confirmedtime
remove p.healthstatus;

//Make Invoices out of meetings
match (c1:Company)-[m:MEETS]->(c2:Company)
create (c1)-[invrel:INVOICES {amount: toInteger(round(rand()*10000)+1), invoicedatettime: datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")}]->(c2)
create (c1)-[:SENDS_INVOICE]->(invnode:Invoice)<-[:RECEIVES_INVOICE]-(c2)
set invnode = invrel
delete m;

//remove Person index
drop index on :Person(name);

The script runs very fast:

And this is the result:


Now we are ready to use the newly created graph, and illustrate how we can use this to detect VAT fraud rings.

Start ringfinding in the new graph

Let's start with the simples possible ring: 

//simplest ring - 2 hops
match path = (c:Company)-[:INVOICES]->(c1)-[:INVOICES]->(c)
return path
limit 1

This gives this result:

And now we can gradually make this a bit harder:

//5 hop ring
match path = (c:Company)-[:INVOICES]->(c1)-[:INVOICES*4..4]->(c)
return path
limit 1

This gives this result:


Then if we want to make it even harder, you will find that after some time you will want to start thinking a bit more about the performance of the query, as it can get pretty complicated after a short series of hops across the synthetic dataset.

1st optimisation: use apoc.periodic.iterate

If we want to increase the ring size, we will start finding that our synthetic dataset will start finding a **LOT** of rings. This could obviously point to a data problem in our dataset: it may not be as representative of the real world as we originally thought. But if we assume for a moment that that is not the issue, then we can easily understand that if there are a LOT of rings, the performance of the query may drop significantly, as all the possibilities have to be explored before returning results. 

This is where we can actually wrap our ring pathfinding query with an apoc procedure called apoc.periodic.iterate.

//find and persist rings up to 8 hops
CALL apoc.periodic.iterate("MATCH (c:Company) return c",
"match path = (c)-[:INVOICES]->(c2:Company)-[:INVOICES*..7]->(c) set c.ring = id(c) set c.ringsize = length(path)", 
{batchSize:10, parallel:true});

This will very quickly finish, and then we can show the result like this:

//visualise the longest fraud ring -- with cypher only
match (c:Company)
where c.ringsize is not null
with c, c.ringsize as ringsize
order by ringsize desc
limit 1
match path = (c)-[:INVOICES]->(c2:Company)-[:INVOICES*..7]->(c)
return path
limit 1;

Gives this VAT carousel as a result:

This is already interesting - it's unlikely that we would spot a carousel like this, of this length, with the naked eye. But let's explore this even further.

2nd optimization: using the apoc.path.expander

For detecting deeper / larger rings, we would actually want to be a bit more prescriptive about how the Neo4j graph database would explore the graph looking for the ring pattern. Specifically, it would be much nicer if we could avoid the "breadth first search" technique that Cypher normally uses. We can do this using an apoc procedure, called the "Path expander". More details on this very powerful set of tools over here.

Here's how we use this to find larger / deeper rings: 

//using the apoc.path.expander to find rings up to 20 hops
match (c:Company)-[:INVOICES]->(nb:Company)
WITH c,nb
call apoc.path.expandConfig(nb, {
relationshipFilter: "INVOICES>",
    minLevel: 2,
    maxLevel:20,
    terminatorNodes: [c],
    bfs: false, 
    uniqueness: "NODE_GLOBAL",
    limit: 1
})
yield path
set c.ring = id(c)
set c.ringsize = length(path)
return count(*);

The query above uses a very specific configuration map, which details that 
  • the query should not use breadth first search
  • the uniqueness should be global for the entire traversal pattern. This is described over here.
Running this takes little while on our dataset, but once it finishes, we can run a similar query as the one above:

//visualise the longest fraud ring -- with cypher only
match (c:Company)
where c.ringsize is not null
with c, c.ringsize as ringsize
order by ringsize desc
limit 1
match path = (c)-[:INVOICES]->(c2:Company)-[:INVOICES*20..20]->(c)
return path
limit 1;

This gets us:
I hope to have illustrated with this example that the Contact Tracing graph is a very solid graph use case, and is very similar to other cases in terms of the patterns that we are looking for. As such, we should use this as much as possible for our Covid-19 and other contact tracing - as it will certainly equip us with better and more complete predictive tools.

Hope this was useful. All of the scripts mentioned in this post are of course available on Github. Please take it for a spin an let me know what you think.

All the best

Rik

No comments:

Post a Comment