Explainable AI models for detecting fraudulent transactions, is it possible?

January 29, 2024 General


As our lives become more digital, online transactions have surged, bringing along an unfortunate companion – fraud. Besides the devastating consequences fraud can have on individuals, it has a major impact on society as a whole as well. In the US, identity theft alone leads to an annual loss of a staggering 6.1 billion dollars. This underscores the critical need to combat fraud, and banks play a major role in detecting transactional fraud.

But how does that work? Picture this: you are a banker in charge of detecting as many fraudulent transactions as possible, and the stakes are high. In the bank's data systems you can find all kinds of information about a transaction, such as the date, time, amount, and type, along with information about the customer as well as the recipient. How can you tell what makes a transaction shady? Are there patterns you can infer? The magic buzzword that comes to mind these days is of course 'AI', short for Artificial Intelligence. AI can be an excellent tool for complex pattern recognition. 'Problem solved!', you may think. However, it is not as simple as it seems.

Why explainable AI is a must

Financial institutions in the European Union must adhere to the General Data Protection Regulation (GDPR), which means they are obligated to justify their decisions to legal authorities and customers. And it just so happens that most AI models do not comply with such regulations. Complex AI models are often referred to as 'black boxes': you give the model an input (for example, transaction data), the model does something with it – but we don't know what (the black box) – and it produces an output ('this transaction is (not) fraudulent!').

Therefore, banks often stick to very basic techniques to identify fraud or money laundering, such as business rules. Fortunately, more advanced models such as Random Forests are also growing in popularity. A Random Forest consists of multiple decision trees (see Figure 1 for an example of a decision tree), where each tree generates an output (fraudulent/legitimate), also called a vote. The final decision is based on a majority vote: if most trees output 'fraudulent', the transaction is classified as such. A Random Forest is essentially a forest of decision trees, which is not very interpretable in itself either, but it does the trick and it complies with the GDPR.

Figure 1: Example of a simple decision tree
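The majority vote described above can be sketched in a few lines of Python. This is purely illustrative; real Random Forest implementations (for example scikit-learn's) handle the voting internally:

```python
from collections import Counter

def majority_vote(tree_outputs):
    # Each decision tree casts one vote: 'fraudulent' or 'legitimate'.
    # The forest's final classification is the most common vote.
    return Counter(tree_outputs).most_common(1)[0][0]

votes = ["fraudulent", "legitimate", "fraudulent", "fraudulent", "legitimate"]
decision = majority_vote(votes)  # 'fraudulent' (3 votes against 2)
```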

Is this where it ends for AI in the financial sector? Are we doomed to stop innovating and let AI flourish only in other fields? Fortunately, the answer is no. A recent paper by Visbeek et al. [1] introduced Deep Symbolic Classification (DSC), a novel framework that uses deep reinforcement learning (a subfield of AI) and is specifically designed for explainable and transparent fraud detection. The mechanisms behind DSC are not explainable in themselves, as they rely on deep learning tools, but DSC does generate explainable decision algorithms that can be used in practice and comply with the GDPR. How? We will delve into that below.

Mathematical decision rules

DSC can generate decision rules that determine whether a transaction is fraudulent or not, without the use of big, complex forests. A question you may ask is how simple decision rules can work at least as well as a combination of multiple decision trees. This is because the decision rules generated by DSC can include mathematical operators that can describe complex patterns.

Imagine that fraudulent behavior adheres to the following pattern: a transaction is likely to be fraudulent if the amount is high and it is summer or winter. This is a seasonal dependency, which likely does not have hard cut-offs. The probability of a transaction being fraudulent is probably low in mid-spring, then gradually increases until a peak in midsummer, after which it gradually decreases again. The likelihood curve would probably have the following shape:

Figure 2: Illustrative example of a seasonal effect on the probability of a transaction being fraudulent

As you can see, this may be a very complex pattern to map to simple decision rules. Say you want to classify each transaction as fraudulent if the likelihood is higher than 0.8. For each day, you then need to determine the likelihood, and you could define something like:

  • If amount > 10,000 euro AND
    [(day > 118 AND day < 155) OR (day > 300 AND day < 338)],
    then classify the transaction as fraudulent.

But what if you want to change the likelihood threshold from 0.8 to 0.9 because this leads to better results? Then you need to calculate everything again. And what happens when the time-dependent patterns become more complex?

By introducing mathematical operators, this becomes a lot more straightforward. The shape above can be approximated by the sine function probability = sin(0.0344 * days). The decision rule could then look like the following:

  • If amount > 10,000 euro AND sin(0.0344 * days) > 0.8,
    then classify the transaction as fraudulent.

In this example the value of 0.8 can be easily adjusted. Note that more complex functions can also be approximated with mathematical operators, and without them, the number of decision rules may increase exponentially, at the expense of explainability.
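As a sketch, the rule above can be written as a small function in which the threshold is a single parameter. The amount cut-off, frequency, and threshold are the illustrative values from the text, not values learned by DSC:

```python
import math

def is_fraudulent(amount, day, threshold=0.8):
    # Seasonal likelihood modelled by a sine; the frequency 0.0344 is
    # roughly 4*pi/365, giving two peaks per year (summer and winter).
    seasonal = math.sin(0.0344 * day)
    # Illustrative rule: high amount combined with a seasonal peak.
    return amount > 10_000 and seasonal > threshold

is_fraudulent(20_000, day=46)   # True: near a seasonal peak
is_fraudulent(20_000, day=100)  # False: seasonal likelihood is low
```

Changing the threshold from 0.8 to 0.9 is now a one-argument change, rather than recomputing the day intervals by hand.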

How does the model learn mathematical decision rules?

The interesting part about DSC is that it leverages the predictive power of AI tools to generate mathematical decision rules that can capture complex patterns. The model learns these rules via reinforcement learning, a machine learning technique that follows the fundamental principle of trial and error. In essence, a decision rule receives a reward based on its performance on the dataset, and the model seeks to optimize this reward.
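The actual DSC framework trains a neural policy with deep reinforcement learning; as a greatly simplified stand-in, the reward-driven trial-and-error loop can be sketched with random search over candidate rules (the rule representation and reward function here are placeholders):

```python
import random

def search_best_rule(candidate_rules, reward, n_trials=200, seed=42):
    # Trial and error: try candidate rules, score each with the reward
    # function, and keep the best-scoring rule found so far. DSC replaces
    # this blind sampling with a learned policy that proposes better
    # candidates over time.
    rng = random.Random(seed)
    best_rule, best_reward = None, float("-inf")
    for _ in range(n_trials):
        rule = rng.choice(candidate_rules)
        score = reward(rule)
        if score > best_reward:
            best_rule, best_reward = rule, score
    return best_rule

# Toy reward: prefer thresholds close to 0.8
rules = [0.5, 0.6, 0.7, 0.8, 0.9]
best = search_best_rule(rules, reward=lambda t: -abs(t - 0.8))
```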

Therefore, defining a reward function is a crucial aspect. But how do we determine whether a decision rule is deemed ‘good’ or ‘bad’? A simplistic approach might involve using the accuracy of the resulting decision rules on the dataset. However, this method can yield misleading results, particularly in highly skewed datasets typical of fraud detection scenarios where fraudulent transactions are a small fraction, often around 0.1%.

  • For instance, consider a decision rule that always classifies all transactions as legitimate. Despite capturing no fraudulent pattern at all, this rule could achieve an accuracy of 99.9%. Therefore, the accuracy does not adequately reflect how good a decision rule actually is. This shows the need for a more nuanced reward function to account for datasets that deal with high class imbalance.

In these kinds of cases, the metrics precision and recall offer a more realistic perspective. Precision helps us avoid mistaking legitimate transactions for fraud, ensuring customers have a good experience because their transactions will not be blocked due to an incorrect model classification. Recall, on the other hand, is about catching as many actual fraudulent transactions as possible to minimize missed criminal activity. So for a model, you want both the precision and recall scores to be high.

Visbeek et al. adopt the F1-score as the reward function to strike the right balance between these considerations. Formally, the F1-score is the harmonic mean of precision and recall, which means it is only high when both precision and recall are high. It is defined as:

F1 = 2 × (precision × recall) / (precision + recall)

The authors show that by utilizing the F1-score as the reward function, the challenges associated with (high) class imbalance are mitigated.
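Putting these ideas together, a small sketch shows how the always-legitimate classifier scores near-perfect accuracy but an F1-score of zero (the toy dataset and the division-by-zero handling are our own additions):

```python
def evaluate(y_true, y_pred):
    # Confusion counts for the positive (fraudulent) class, encoded as 1
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# 999 legitimate (0) and 1 fraudulent (1) transaction: ~0.1% fraud
y_true = [0] * 999 + [1]
always_legit = [0] * 1000
accuracy, f1 = evaluate(y_true, always_legit)
# accuracy == 0.999, yet f1 == 0.0: no fraud was caught
```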

This approach also eliminates the need for the problematic over- and undersampling techniques conventionally employed to address class imbalance. Undersampling removes legitimate transactions from the dataset until the number of legitimate and fraudulent transactions is even; this is undesirable, as you lose a lot of useful information. Oversampling adds fraudulent transactions to the dataset (for example, by replicating the small number of fraudulent transactions) until the classes are even; a big risk of this method is overfitting on the small number of fraudulent transactions, which means that other kinds of fraudulent patterns will never be captured. By utilizing the F1-score, DSC can work on imbalanced datasets without using these problematic techniques.
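For reference, the two resampling techniques DSC avoids can be sketched as follows (list-based toy versions; production code would typically use a library such as imbalanced-learn):

```python
import random

def undersample(legitimate, fraudulent, seed=0):
    # Randomly drop legitimate transactions until both classes are even.
    # Downside: most of the legitimate data (and its information) is lost.
    rng = random.Random(seed)
    return rng.sample(legitimate, len(fraudulent)) + list(fraudulent)

def oversample(legitimate, fraudulent, seed=0):
    # Replicate fraudulent transactions until both classes are even.
    # Downside: risk of overfitting on the few replicated fraud cases.
    rng = random.Random(seed)
    extra = [rng.choice(fraudulent) for _ in range(len(legitimate))]
    return list(legitimate) + extra
```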

What is in it for stakeholders?

What makes DSC an interesting framework for banks? The first two aspects have already been touched upon in this blog: its explainability and the fact that it does not require over- or undersampling. You may wonder how DSC can be perceived as more explainable than the Random Forest models currently employed at banks.

  • Imagine you work at the customer service desk of a bank, and a customer who is really into regulations wants to know why her transaction was blocked, referring to the GDPR. When using Random Forests, you must somehow explain that the decision is based on the majority vote of several decision trees, where each tree has its own rules that you may also have to explain. When using DSC, you can explain that the transaction satisfies a simple set of rules and is therefore classified as fraudulent (the exact rules and thresholds cannot be given, as fraudsters might otherwise misuse them).

Another aspect that makes DSC desirable is that it can output several sets of decision rules of varying complexity. The complexity of a set of rules is linked to (1) the number of rules, and (2) the complexity of the mathematical operators (you can imagine that a sine function is less explainable than a plus sign). There is often a trade-off here: decision rule sets with higher complexity may lead to better performance, but at the cost of explainability. With this option, transaction risk modelers and other stakeholders within the bank can choose the set of decision rules that best aligns with their interests and priorities. This further highlights the practical applicability and versatility of DSC in the financial domain.
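This selection step can be pictured as follows. The rule sets, complexity scores, and F1 values below are hypothetical, purely to illustrate the trade-off:

```python
# Hypothetical output of DSC: rule sets with a complexity score
# (number of rules plus operator complexity) and a validation F1-score.
rule_sets = [
    {"rules": ["amount > 10000"],
     "complexity": 1, "f1": 0.55},
    {"rules": ["amount > 10000", "sin(0.0344 * day) > 0.8"],
     "complexity": 4, "f1": 0.72},
    {"rules": ["amount > 10000", "sin(0.0344 * day) > 0.8", "log(balance) < 7"],
     "complexity": 7, "f1": 0.78},
]

def pick_rule_set(rule_sets, max_complexity):
    # Pick the best-performing rule set the stakeholder still deems explainable
    candidates = [rs for rs in rule_sets if rs["complexity"] <= max_complexity]
    return max(candidates, key=lambda rs: rs["f1"]) if candidates else None

choice = pick_rule_set(rule_sets, max_complexity=5)  # the 2-rule set
```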

Performance of DSC compared to other models

In the original DSC paper, the authors report the performance of DSC alongside that of some other models on an open-source transaction dataset. Below, we show a subset of the relevant results. Note that XGBoost is a state-of-the-art model on this dataset, meaning it currently achieves the best results. However, it lacks explainability and is therefore not really applicable at financial institutions. Furthermore, we show the results of the simpler k-NN model, for which undersampling the dataset was a necessity.

Table: Accuracy, precision, recall, and F1-score of DSC, Random Forest, XGBoost, and k-NN* on the open-source transaction dataset (see [1] for the exact figures).

*Note that for k-NN, the dataset was first subject to random undersampling.

At first glance, it becomes clear that accuracy is indeed not a reliable measure, as it is high for all models (even those that do not perform well). The F1-score of DSC is slightly lower than that of Random Forest and XGBoost, mainly due to a slightly lower precision. However, at a value of 0.95, 95% of the transactions flagged as fraudulent actually are. The recall score equals that of Random Forest, indicating that two thirds of the fraudulent transactions are detected. This is a relatively high number in practice, as most fraudulent transactions go unnoticed.

The bottom line is that while DSC performs slightly worse than Random Forest and the state-of-the-art XGBoost (although the difference is quite marginal), it surpasses them in terms of explainability. The authors also outline future directions for improving the performance of DSC, as the framework is still in its early stages.


In this blog, we have explored Deep Symbolic Classification as an explainable AI model for fraud detection. We have delved into some key aspects of DSC, such as its explainable decision rules, its performance, and its benefits for practical use at banks. For more information about its underlying mechanisms, please refer to the original paper [1].

DSC stands out for its ability to generate transparent decision rules grounded in mathematical expressions. This generation process utilizes the predictive power of AI, through reinforcement learning—a trial-and-error optimization technique. A critical strength of DSC lies in its adeptness at addressing class imbalance without resorting to traditional over- or undersampling techniques. It provides a robust solution for datasets with skewed class distributions.

Moreover, the output of DSC enables stakeholders to make informed trade-offs between accuracy and complexity. This flexibility allows them to select a set of decision rules that aligns most closely with their priorities and preferences. The demonstrated comparable performance of DSC with state-of-the-art models underscores its promise as a practical AI tool in the domain of fraud detection.

In case of any questions feel free to reach out to Samantha Visbeek at [email protected].


[1] Visbeek, S., Acar, E., & den Hengst, F. (2023). Explainable Fraud Detection with Deep Symbolic Classification. arXiv [Cs.LG]. Retrieved from http://arxiv.org/abs/2312.0058...