Graph Analytics for a more risk-based approach

September 26, 2022 General

Shutterstock 386857171

On September, 8th 2022 De Nederlandsche Bank (DNB) published a study titled ‘‘Van herstel naar balans’’ [1], insisting that banks and regulators should make their anti-money-laundering (AML) approach more ‘risk-based’. An indispensable part of this transition will be efficient use of data-driven innovations, such as Machine Learning (ML) methods. Ideally, these methods increase the reliability of classification, whilst adhering to discrimination, privacy, data-protection and explainability requirements. Following DNB, RiskQuest believes that Graph Analytics naturally extends existing ML applications. The aim of this blog is twofold: to take a closer look at modern graph implementations for AML; and to pay attention to graph processing systems.

Graph Analytics combines the mathematics from graph theory with specialised computer systems known as graph processing systems. Why require specialised systems? Well, imagine the amount of existing links in a group of ten individuals. For ten people, there are forty-five possible links. Every link could either exist or not, so that accounts to a total of 245(i.e. over 35 trillion) possible graphs. Hence, specialised systems have been developed to manage the vast connectedness that is inherent to relational data.

Unlike traditional ML methods, Graph Analytics is able to discover new patterns by analysing relations rather than individuals. In the context of money-laundering, Graph Analytics has a natural capacity to address the various transactional relations between entities (e.g. people, brokers & exchanges). Furthermore, a very powerful property of the graph paradigm is that the practitioner is free to enforce his own beliefs onto the structure of the graph. A graph could be constructed where a single account is represented as a node, and a single transaction between two nodes is represented as an edge. However, one could also cluster accounts into entities based on chosen properties (e.g. being in the same company) and describe an edge as aggregated transaction traffic over time. Once the graph has been defined, numerous graph implementations will be at the practitioners disposal. A few notable examples are:

  • centrality algorithms like PageRank [4] perform outlier detection by scoring nodes based on their transaction behaviours.

  • Community detection algorithms like Weakly Connected Components can be used to detect money-laundering where users share properties such as addresses or social security numbers. Do they need shared properties, or can we also establish communities through transactions?

  • Graph embeddings are the result of transforming the graph into lower dimensional vectors. Ideally, these vectors capture relevant graph information, such as graph topology and vertex-to-vertex relationships. The embeddings could be used for making visualisations or predictions on a node level.

In the aforementioned study, DNB highlights that recently banks were able to expose an underground banking system with Graph Analytics. In such a banking system, money is transferred through informal rather than formal mechanisms, making it ideal for money-laundering or terrorists activity. The banking system, consisting of 200 suspects, was unmasked by identifying subcomponents through client relations and related activities. With the acquired information, risk-indicators concerning underground banking systems were extracted. These indicators have since then proven to be very effective, by improving the prediction performance of new models by considerable amounts.

Graph Processing Systems

Before we cover graph processing systems, it is important to define the term itself. Specifically, we define it as systems capable of computing structural properties of graphs as a whole. An example of a structural property would be the PageRank [4] score of nodes in a graph. As these structural properties are related to entire (sub-)graphs, computing them efficiently becomes non-trivial, even more so when considering large graphs (e.g. with billions of nodes and edges). As such a benchmark, called Graphalytics [5], which covers the most used graph processing algorithms has been devised to compare the performance of different graph processing systems.

Many systems have been built to provide graph analytics capabilities [6,7,8]. However, most of the systems presented in literature might not be immediately applicable in an industrial setting. Thus, we investigated the systems that apply to that domain. For large-scale graph processing, Apache Giraph has been used at Facebook to scale to trillions of edges. Similarly, Apache Spark has had two graph processing engines developed for it, initially GraphX and then GraphFrames. For streaming applications, Apache Flink with its Gelly graph processing engine can be used. Recently, Alibaba has also ventured into building a graph processing framework, called GraphScope which provides a complete end-to-end solution for graph workloads, from initial data exploration and graph analytics to graph machine learning.

The Elliptic Data Set

In order to get a feeling for the applications of Graph Analytics, we ran some experiments on the Elliptic Data Set [8]. The Elliptic Data Set, published by ‘Elliptic’, is a sub-graph of the bitcoin graph, consisting of 203,769 nodes, 234,355 edges, and 166 node features. A node is deemed “licit” / “illicit” if the corresponding transaction has been created by an entity that belongs to either a licit (exchanges, wallet providers, miners, financial service providers, etc.) or an illicit (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.) category respectively.

We used GraphScope to analyse The Elliptic Data Set. GraphScope’s Graph Interactive Engine was used to do initial exploratory data analysis and select a subgraph to analyse. We then used the Graph Analytics Engine to scale every node based on the PageRank algorithm (as can be seen in Figure 1). This visualisation brings out the natural ability of Graph Analytics to represent relational data in a pleasant way. For illustrative purposes, we have colored the nodes according to their label; licit (illicit) nodes were colored green (red), and the unknown nodes were colored beige. Inspecting the nodes, which have been colored and scaled, gives insight into typical entity behaviour. Most notably, Figure 1 shows that entities with a high PageRank score seem less likely to be fraudulent, which has been hypothesised and tested in [7]. In this paper, the writers found encouraging results in building classifiers with features derived from Strongly Connected Component SCC, PageRank, and shortest path algorithms on transaction data from ABN-AMRO. However, would classification based on graph properties also succeed for the bitcoin data of Elliptic?


Figure 1: Elliptic subgraph at timestamp 32. Nodes are coloured based on whether its label is licit (green), illicit (red), or unknown (beige). Furthemore, nodes are sized according to their PageRank score


The publication of The Elliptic Data Set coincided with a tutorial paper from both ‘Elliptic’ Data Scientists and researchers from the MIT-IBM Watson AI Lab [8]. This paper, titled: ‘Anti-money-laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics.’ sought to compare standard classification methods (Logistic Regression, Random Forest, Multi Layer Perceptron) with the more advanced Graph Convolutional Networks (GCN). Moreover, the analysis in [8] compares the effects of incorporating graph information into standard classification methods. Two important conclusions can be drawn from this paper, namely:

  • Standard classification methods all benefit from additional features based on GCN embeddings;

  • Not GCN, but a Random Forest with GCN-produced embeddings produced the best results.

As noted in [8], these conclusions suggest combining Graph Analytics (such as GCN) with more standard approaches like Decision Trees or Random Forest models. Similar to the underground-banking case in [1], this demonstrates that classifiers built out of graph properties are very promising, and carry great potential in complementing existing AML approaches.

Although Graph Analytics is a nascent domain and the systems are still being developed, RiskQuest shares DNB’s vision regarding the potential of Graph Analytics as a competitive technology. We covered some of the systems that can be used for graph processing today and we explored the Elliptic dataset with GraphScope as our main tool. This project is only the first step in evaluating Graph Analytics for AML and further research is vital, especially with relevant AML data. At RiskQuest, we envision taking into account graph information to be, not only a first step, but also a good step towards a more ‘risk-based’ approach.



References

[1] DNB. Van herstel naar balans (DNB), 2022.

[2] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1998.

[3] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardto, Hassan Chafio, Mihai Capotă, Narayanan Sundaram, Michael Anderson, et al. LDBC Graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. Proceedings of the VLDB Endowment, 9(13):1317–1328, 2016.

[4] Miguel E Coimbra, Alexandre P Francisco, and Luís Veiga. An analysis of the graph processing landscape. Journal of Big Data, 8(1):1–41, 2021.

[5] Safiollah Heidari, Yogesh Simmhan, Rodrigo N Calheiros, and Rajkumar Buyya. Scalable graph processing frameworks: A taxonomy and open challenges. ACM Computing Surveys (CSUR), 51(3):1–53, 2018.

[6] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M Tamer Özsu. The ubiquity of large graphs and surprising challenges of graph processing: extended survey. The VLDB journal, 29(2):595–618, 2020.

[7] Molloy, Ian & Chari, Suresh & Finkler, Ulrich & Wiggerman, Mark & Jonker, Coen & Habeck, Ted & Park, Youngja & Jordens, Frank & Schaik, Ron. (2017). Graph Analytics for Real-Time Scoring of Cross-Channel Transactional Fraud. 22-40. 10.1007/978-3-662-54970-4_2.

[8] Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Weidele, Claudio Bellei, Tom Robinson, and Charles Leiserson. Anti-money-laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. 07 2019.