Careem has grown rapidly in popularity, exceeding 50 million user accounts today. It has also recently expanded beyond ride-hailing into other services such as payments, food, groceries, and more in its bid to become the Middle East’s leading Super App. Unfortunately, this has also made Careem a prime target for fraudulent activity. Fraudsters are constantly looking for new loopholes to exploit, creating accounts with faked identities, and finding new ways to hijack genuine accounts. In our data science and analytics backed Risk team, we needed more advanced ways to detect and stop fraud losses that could damage both our revenue and our brand reputation.

We see a wide variety of fraud, and criminals keep finding new loopholes to bypass the specific measures we put in place against existing fraud patterns. Traditionally, tackling these different kinds of fraudulent activity was a never-ending game of cat-and-mouse. We would often create rules or machine learning models for each specific type of fraud, but this was problematic on two levels:

  1. It only allowed us to identify and block an account after the fraud had been committed and detected, which means the money had already been lost.
  2. Fraudsters were quickly able to move on and find a new loophole to exploit once an existing fraud pattern had been detected.

A Smarter Way

It was clear that we needed a smarter and faster way to detect fraudulent activities and stop them before the act was committed. Instead of continuously creating very specific tools to detect very specific fraud patterns, we wanted to build an intelligent system that was almost a blanket detection mechanism over all users.

After a lot of experimentation, we decided to focus on the identity of users, and came up with a powerful way to outsmart attempts at identity fraud. We opted to use a graph structure as a way of mapping together the different aspects and data points of each user’s identity, and more importantly, the characteristics shared across the identities of different users. This would then allow us to detect potentially fraudulent patterns in real time across user and account activity.

An example of a cluster of fraudulent accounts in the CrazyWall graph

We chose AWS and the automated real-time analysis and monitoring capabilities of Amazon Neptune [1], in part because it is a managed service. At Careem we were already using AWS for most of our cloud computing and data warehousing operations, so we opted to stay in the same environment for this fraud prevention project. We also quickly realised that we preferred the Gremlin query language [2] supported by Amazon Neptune over query languages such as Cypher used by other graph database providers. Gremlin allows developers to query the graph from a range of programming languages, with our preference being Python. This made it very easy to create the project’s core codebase in Python, interacting with the graph through a Python-based Gremlin interface, Gremlin-Python [3].
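
As a rough illustration (the endpoint URL, node labels, and property keys below are placeholders rather than our actual schema), connecting to Neptune and running a Gremlin traversal from Python looks something like this:

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    # Placeholder Neptune cluster endpoint
    conn = DriverRemoteConnection("wss://your-neptune-cluster:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # Example traversal: how many user accounts share a given device fingerprint?
    shared_accounts = (
        g.V().has("device", "fingerprint", "abc123")
             .in_("USES_DEVICE")
             .hasLabel("user")
             .count()
             .next()
    )

    conn.close()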

The Identity Graph

To initially create this graph, we took Careem’s user base, along with many different historical data points describing each user’s identity. All of these users and data points were inserted into the graph as nodes, and any relationships between users and aspects of their identities were connected through edges, which in turn exposed relationships between different accounts that share identity attributes. This historical snapshot was the starting point of the graph.

Next, we wanted this graph to be “living”, growing as new accounts were created at Careem and as existing accounts performed actions that might change the perception of their identity. To achieve this, we leveraged our ‘Event Processor’ mechanism, a project developed by the engineers in our team which receives and processes data about the actions and transactions performed by users anywhere on Careem’s platform. These events include things such as setting up a new account, logging in to an existing account, and so on. Our ‘Event Processor’ was connected to the graph through our Python interface so that all of these events could be streamed into the graph, inserting or mutating data in real time. On average, data is added to or updated in the graph more than 100,000 times per day.
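
As a minimal sketch of what such a streamed update might look like (the event fields, labels, and keys are illustrative, not Careem’s actual schema), a login event could be upserted into the graph roughly as follows:

    from gremlin_python.process.graph_traversal import __

    def on_login_event(g, event):
        user_id = event["user_id"]               # hypothetical event fields
        device_fp = event["device_fingerprint"]

        # Create the device node only if it does not exist yet (fold/coalesce upsert pattern).
        device = (
            g.V().has("device", "fingerprint", device_fp)
                 .fold()
                 .coalesce(__.unfold(),
                           __.addV("device").property("fingerprint", device_fp))
                 .next()
        )

        # Link the user to the device, unless that edge already exists.
        (
            g.V().has("user", "user_id", user_id).as_("u")
                 .V(device.id)
                 .coalesce(__.inE("USES_DEVICE").where(__.outV().as_("u")),
                           __.addE("USES_DEVICE").from_("u"))
                 .iterate()
        )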

Now that the graph was populated with historical data and updating in real time, CrazyWall was born. The name comes from the idea of detectives placing all of the clues they have about a particular case on a wall, and connecting them together with strings in an attempt to link meaningful clues together in order to find the suspects and solve a case. These are known as crazy walls. In the same way, our graph can now link fraudsters together and detect suspicious activity occurring across accounts.

A detective’s crazy wall

Detecting Patterns by Focusing on Identity

The end goal for CrazyWall was full automation: first in evolving the graph, and second in detecting fraudsters in real time. With the former achieved, we turned to the latter. For the initial version of the project we kept the fraud detection capabilities as simple as possible, the idea being that a minimal tool would set a foundation we could gradually improve upon.

This first version consisted of a basic rule-based approach for detecting fraud in the graph, a module we have named Enola. Each time a user performed an action that added to or changed their structure in the graph, we would query that user in the graph, returning what we call their “cluster” – the set of all nodes and edges connected to the user’s account in some way. The structural data for this cluster was then passed through our predefined rules, which classified whether the user is likely to be fraudulent, based solely on how their identity looks in the graph’s structure and how it is connected to that of other accounts.
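
A minimal sketch of what such a check might look like, assuming illustrative labels, a fixed traversal depth, and a toy threshold (the production rules are richer than this):

    from gremlin_python.process.graph_traversal import __
    from gremlin_python.process.traversal import T

    def get_cluster(g, user_id, depth=2):
        # All nodes reachable from the user's account within `depth` hops.
        return (
            g.V().has("user", "user_id", user_id)
                 .repeat(__.both().simplePath())
                 .emit()
                 .times(depth)
                 .dedup()
                 .elementMap()
                 .toList()
        )

    def is_suspicious(cluster, max_linked_accounts=5):
        # Illustrative rule: too many distinct accounts sharing this identity cluster.
        linked_accounts = [n for n in cluster if n[T.label] == "user"]
        return len(linked_accounts) > max_linked_accounts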

When an account is flagged as potentially fraudulent, it is either automatically blocked, if the data shows it is historically an untrustworthy account, or flagged for manual review if it is a trustworthy or high-value account, such as that of a corporate customer. This trust-based decision is aided by another of our projects, Trust Tiering, which uses a clustering algorithm to group accounts based on similarities in their transaction history on the platform. These groups are then assigned different levels of trust.
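
The exact features and clustering algorithm behind Trust Tiering are not detailed here, but the general idea can be sketched with k-means over a few per-account transaction-history aggregates (all names, features, and the choice of k-means itself are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def assign_trust_tiers(accounts, n_tiers=4):
        # accounts: list of dicts of illustrative per-account aggregates
        features = np.array([
            [a["account_age_days"], a["completed_trips"], a["chargebacks"], a["avg_spend"]]
            for a in accounts
        ])
        X = StandardScaler().fit_transform(features)
        tier_labels = KMeans(n_clusters=n_tiers, n_init=10, random_state=0).fit_predict(X)
        return {a["account_id"]: int(t) for a, t in zip(accounts, tier_labels)}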

Since the deployment of CrazyWall v1, a considerable number of accounts have been automatically blocked by the system, with a decent initial benchmark precision that we can now aim to improve upon.

Improving Intelligence

After the initial success and strong foundation of v1 of the project, we are now working on CrazyWall v2, with the goal of improving on Enola’s naive level of intelligence by introducing machine learning.

We are now adding roughly 10x more data to the graph, in terms of node and edge types as well as properties within each node and edge. For modelling, we needed something powerful enough to process and train on this extensive amount of data, learning both from the values of the properties within nodes and edges and from the all-important structural information of the graph. Relational Graph Convolutional Networks (RGCNs) [4], a relatively recently proposed architecture for applying deep learning to relational graphs, fit the bill perfectly. RGCNs can be trained on vast graphs for many different tasks – node classification in our case. The model learns by aggregating the underlying structure of the nodes and edges in the neighbourhood of each node being classified, up to a certain depth, while also using node and edge properties as additional input features.

RGCN architecture
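
As a generic sketch of the kind of model this implies (not the exact architecture Neptune ML ultimately selects for us; layer sizes and features are illustrative), a two-layer RGCN node classifier can be put together with DGL:

    import dgl
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from dgl.nn import RelGraphConv

    class RGCNNodeClassifier(nn.Module):
        def __init__(self, in_feats, hidden_feats, num_classes, num_rels):
            super().__init__()
            self.conv1 = RelGraphConv(in_feats, hidden_feats, num_rels)
            self.conv2 = RelGraphConv(hidden_feats, num_classes, num_rels)

        def forward(self, g, feats, etypes):
            h = F.relu(self.conv1(g, feats, etypes))
            return self.conv2(g, h, etypes)   # per-node class logits

    # Tiny synthetic example: 4 nodes, 3 edges of 2 relation types, 8-dim node features.
    g = dgl.graph(([0, 1, 2], [1, 2, 3]))
    etypes = torch.tensor([0, 1, 0])
    feats = torch.randn(4, 8)
    model = RGCNNodeClassifier(in_feats=8, hidden_feats=16, num_classes=2, num_rels=2)
    logits = model(g, feats, etypes)          # shape: (4, 2)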

A regular GCN takes the entire graph as input (or batches of subgraphs in practice) and performs a convolution on each node, combining the node itself (through a self-connection) with its neighbouring nodes and their properties, so that each node’s representation becomes an aggregation of its own properties and those of its neighbours. All node properties are multiplied by a learnable weight matrix during the convolution, and the outputs at each convolutional layer are passed through an activation function. Each additional layer aggregates nodes at a further depth: after each added layer, every node represents an aggregation of its neighbours, which are themselves aggregations of their neighbours, and so on. An RGCN differs in that the convolutions happen per edge type, over the neighbourhood connected by each edge type, with a separate weight matrix for each. The results of each edge-type convolution are then aggregated, by summing, to obtain the node representation.

The representation of each node i at the (l+1)th layer is computed as:

h_{i}^{l+1}=\sigma\left ( W_{0}^{l} h_{i}^{l} + \sum_{r\in R} \sum_{j\in N_{i}^{r}} \frac {1} {\left | N_{i}^{r} \right |} W_{r}^{l} h_{j}^{l} \right )

where:

  • W_{0}^{l} is the weight matrix for self-connections
  • h_{i}^{l} is node i’s representation at the previous layer
  • N_{i}^{r} is the set of neighbouring nodes of i through edge type r \in R
  • \frac {1} {\left | N_{i}^{r} \right |} is a normalization constant
  • W_{r}^{l} is the weight matrix for edge type r \in R
  • \sigma is a non-linear activation function
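
To make the propagation rule above concrete, here is a direct (unoptimised) transcription into plain NumPy, with ReLU as an example of \sigma:

    import numpy as np

    def rgcn_layer(h, neighbours, W0, Wr, activation=lambda x: np.maximum(x, 0.0)):
        """
        One RGCN layer, as in the equation above (illustrative, not optimised).
        h:          (N, d_in) array of node representations at layer l
        neighbours: dict mapping relation r -> list (length N) of neighbour-index lists N_i^r
        W0:         (d_in, d_out) self-connection weight matrix
        Wr:         dict mapping relation r -> (d_in, d_out) weight matrix for that relation
        """
        out = h @ W0                                         # self-connection term W0 h_i
        for r, neigh_per_node in neighbours.items():
            for i, neigh in enumerate(neigh_per_node):
                if neigh:                                    # N_i^r may be empty
                    out[i] += (h[neigh] @ Wr[r]).sum(axis=0) / len(neigh)   # 1/|N_i^r| sum of W_r h_j
        return activation(out)                               # sigma(...)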

Neptune ML [5] allows us to export and transform the data from the graph into the format needed to train the model, to run a full hyperparameter optimisation (HPO) and training session on AWS SageMaker [6] to find the optimal RGCN architecture for our specific graph, and to deploy the model and expose it through an endpoint for inference. This endpoint can then be used within a Gremlin query to classify any desired node in the graph on demand with low latency. The entire training, HPO, and deployment pipeline is executed through a few simple curl commands, with progress logs and metric reporting available through SageMaker’s UI. Under the hood, Neptune ML uses the open source Deep Graph Library (DGL) [7] to construct and train the models.
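
As a hedged sketch of what on-demand inference then looks like (the endpoint name and the predicted property key below are placeholders), the deployed node-classification endpoint can be invoked from within a Gremlin traversal, reusing the gremlin-python traversal source g from earlier:

    user_id = "12345"   # example account identifier

    prediction = (
        g.with_("Neptune#ml.endpoint", "crazywall-node-classification")
         .V().has("user", "user_id", user_id)
         .properties("is_fraud").with_("Neptune#ml.classification")
         .value()
         .next()
    )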

We are currently in the process of running this pipeline to train an optimal RGCN model for our graph, which we hope to test and deploy in the coming weeks. The objective for this new model is improved recall compared to v1 of the project – correctly detecting many more of the fraudulent accounts among all the users analysed by the system – while improving our current fraud prediction precision.

We are also collaborating directly with AWS’s Neptune team to develop an improved RGCN architecture that will have even more powerful learning capabilities than the current architecture.

References

[1] Amazon Web Services, Inc., 2021, Amazon Neptune

[2] JanusGraph, 2021, Gremlin Query Language

[3] JanusGraph, 2021, Gremlin-Python

[4] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, Max Welling, 2017, Modeling Relational Data with Graph Convolutional Networks

[5] Amazon Web Services, Inc., 2021, Amazon Neptune ML

[6] Amazon Web Services, Inc., 2021, Amazon SageMaker

[7] Minjie Wang and Da Zheng and Zihao Ye and Quan Gan and Mufei Li and Xiang Song and Jinjing Zhou and Chao Ma and Lingfan Yu and Yu Gai and Tianjun Xiao and Tong He and George Karypis and Jinyang Li and Zheng Zhang, 2019, Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

 

Written by Kevin O’Brien, Senior Data Scientist in Careem’s Integrity team, in collaboration with AWS teams.