As AI solutions grow more capable, they play an increasingly important role in decision making—which raises the well-known concern that models might not be fair. A credit-card fraud detector might be biased when it intervenes and blocks transactions; loan applicants might be rejected after transaction profiling. In the Netherlands, discrimination and unfairness are taken so seriously that Rutte’s third cabinet resigned over the unfair treatment of childcare-benefit recipients (the Toeslagenaffaire, early 2021). Could similar misery arise between a bank and its exited clients, if those clients were to compare notes and discover that none of them are Dutch? Read on to find out how to ensure you are making the right choices.
Fortunately, just like the model’s accuracy can be measured, so can its bias. However, there are several challenges:
- free choice in the definition of fairness,
- the “is”/“ought” problem (predictions versus truth),
- correcting bias,
- balancing different interests, and
- using sensitive data.
Our previous blog about Discrimination and Free Will discussed the fine line between behaviour that’s a free choice and behaviour that’s related to group characteristics. This blog considers free choice from a different perspective: how do we determine what is fair?
Sidenote: Article 1 of the Dutch constitution
All persons in the Netherlands shall be treated equally in equal circumstances. Discrimination on the grounds of religion, belief, political opinion, race or sex or on any other grounds whatsoever shall not be permitted.
Anti-discrimination laws are clearly written, but clearly not by data scientists. Let’s phrase fairness in the context of a binary decision: you either get a “yes”, or you don’t. Algorithmic fairness is the field of data science that gives us the tools to quantify fairness. Unfortunately, there are multiple ‘definitions’ of fairness, and they are mutually incompatible. In other words, there’s a choice to be made. Here are three options (loosely based on the guidelines from the UK authority ICO):
- Demographic parity: decisions should be representative of the population.
  - The distribution of people that get a “yes” should look just like the distribution of all people. So, within statistical error, just as many males as females should get a “yes”.
- Error parity: decisions should be fair towards those who deserve it.
  - Does everyone who should get a “yes” have an equal chance of getting it? Are we not holding out on males or females who ought to get a “yes”?
- Equal calibration: decisions should be correct, fairly for everyone.
  - For everyone that did get a “yes”, should they have gotten it, with equal probability? Are we not giving males or females more “yes”es than we ought to?
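All three criteria can be read off a labelled dataset directly. A minimal sketch, assuming entirely made-up toy data (the `fairness_metrics` helper and every number below are hypothetical, not part of any real pipeline):

```python
import numpy as np

# Hypothetical toy data: group (0 = female, 1 = male),
# y_true (the decision that "ought" to be), y_pred (the model's decision).
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
y_true = rng.integers(0, 2, size=1000)
# Noisy predictions: correct ~80% of the time.
y_pred = np.where(rng.random(1000) < 0.8, y_true, 1 - y_true)

def fairness_metrics(y_true, y_pred, group):
    """Per-group rates behind the three fairness definitions."""
    out = {}
    for g in np.unique(group):
        m = group == g
        out[int(g)] = {
            # Demographic parity compares P(yes) across groups.
            "selection_rate": y_pred[m].mean(),
            # Error parity compares P(yes | ought yes): the true-positive rate.
            "tpr": y_pred[m][y_true[m] == 1].mean(),
            # Equal calibration compares P(ought yes | yes): the precision.
            "precision": y_true[m][y_pred[m] == 1].mean(),
        }
    return out

for g, rates in fairness_metrics(y_true, y_pred, group).items():
    print(g, {k: round(v, 3) for k, v in rates.items()})
```

Each definition then demands that its rate be (nearly) equal across groups—and, as noted above, you generally cannot satisfy all three at once.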
As the great philosopher Kant wrote, ethics are laws by which to exercise free choice. In this case, we look for principles on which to base our definition of fairness. If the question is: “Which fairness definition should we consider?”, the answer could be: “All of them!”—which drives us into compatibility issues. As we will see, to truly make the fair call, we’ll want to shift focus from statistical results to their broader impact.
Broadening our understanding, the “is”/”ought” problem is terminology also hijacked from ethical philosophy (where it means something else). In this case, we mean that the decision that “is” is not the decision it “ought” to be. Example: a bank account holder is exited due to a false suspicion of fraud. The problem: we may not even know what the decision “ought” to be. In fraud cases, we can sometimes use existing analysis outcomes as the ground truth. Those outcomes “ought” to be—now we can analyse the fairness of model decisions against this ground truth to calculate error parity and equal calibration.
But if we don’t have such labels, we don’t know what “ought” to be, and we can only calculate demographic parity for bias quantification. Asking our algorithm to output the population distribution is a very strong requirement, however, and one we may not intend. Maybe, in some way, males “ought” to get a “yes” more often than females. For instance, while we roughly have as many males as females, public data shows that males are four times more often convicted of money laundering than females.
While this statistic raises concerns of its own, accepting it for a moment as our reality, we expect our transaction monitoring (TM) system to also trigger more often on males than on females—not because they are males, of course, but because risky transactions are apparently more often executed by males. If the TM system’s male/female split lies somewhere between the population’s 50/50 and the 80/20 conviction split reported by the CBS (Statistics Netherlands), we have no reason to suspect unfair treatment.
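That sanity check is simple enough to write down. A toy illustration, with entirely made-up alert counts (the 80/20 anchor stands in for the conviction statistic mentioned above):

```python
# Hypothetical share of TM alerts triggered on males, compared against
# the population split (50/50) and the conviction split (~80/20 male/female,
# since males are convicted roughly four times as often).
population_male_share = 0.50
conviction_male_share = 0.80

alerts_male, alerts_female = 660, 340  # made-up alert counts
alert_male_share = alerts_male / (alerts_male + alerts_female)

# Heuristic from the text: a share between the two anchors gives
# no immediate reason to suspect unfair treatment.
in_expected_band = population_male_share <= alert_male_share <= conviction_male_share
print(f"male alert share: {alert_male_share:.2f}, within band: {in_expected_band}")
```

In practice you would add a statistical test (the observed share carries sampling error), but the idea is the same: compare against a defensible reference distribution, not blindly against 50/50.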
However, what if our bias quantification shows unfair treatment under one of the above definitions? This may not be a problem if the impact of an unfair decision is negligible—but sometimes, it will be problematic. Under a standard optimisation approach like maximum likelihood, we expect optimal model performance, but this may amplify existing bias in your data. It may also negatively influence humans interacting with the algorithmic decision. Technical solutions are to re-train your model under constrained optimisation, or to add a penalty term for any disparity. This will be specific to the model you’re optimising and hence may impact your modelling approach from the start. (Contact us to discuss details.)
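As a sketch of the penalty-term approach: below is a hand-rolled logistic regression whose loss adds a term proportional to the squared selection-rate gap between groups. Everything here is illustrative—synthetic data, made-up penalty weight—not a production recipe:

```python
import numpy as np

# Synthetic data where the feature is correlated with the sensitive group.
rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)                    # sensitive attribute
x = rng.normal(size=(n, 2)) + group[:, None]     # features shifted by group
y = (x[:, 0] + 0.5 * rng.normal(size=n) > 0.5).astype(float)
X = np.column_stack([x, np.ones(n)])             # add intercept column

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit(X, y, group, lam, lr=0.1, steps=2000):
    """Gradient descent on: BCE + lam * (mean p | g=1  -  mean p | g=0)^2."""
    w = np.zeros(X.shape[1])
    g1, g0 = group == 1, group == 0
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad_bce = X.T @ (p - y) / len(y)
        gap = p[g1].mean() - p[g0].mean()
        dp = p * (1 - p)                         # derivative of sigmoid
        grad_gap = ((X[g1] * dp[g1, None]).mean(axis=0)
                    - (X[g0] * dp[g0, None]).mean(axis=0))
        w -= lr * (grad_bce + 2 * lam * gap * grad_gap)
    return w

w_plain = fit(X, y, group, lam=0.0)    # standard maximum likelihood
w_fair = fit(X, y, group, lam=10.0)    # with demographic-parity penalty
for name, w in [("plain", w_plain), ("fair", w_fair)]:
    p = sigmoid(X @ w)
    gap = abs(p[group == 1].mean() - p[group == 0].mean())
    acc = ((p > 0.5) == y).mean()
    print(f"{name}: accuracy={acc:.3f}, selection gap={gap:.3f}")
```

The penalised fit trades a little accuracy for a smaller selection-rate gap—exactly the performance/fairness trade-off discussed next.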
Inevitably, imposing fairness on an unfair model leads to performance reduction—which could be costly. Data science can quantify the difference and provide solution options. However, a true assessment of what’s fair should consider all parties involved. For instance:
- clients and prospects (i.e. involving your Legal and Privacy offices),
- management levels and their personal responsibility,
- operational cost to the bank (e.g. your Operations department),
- risks of reputational damage (i.e. ORM), and
- societal impact and the role of the government.
Working with all involved, the data scientist can clarify the impacts of each option. But the final decision might call for management to strike a balance between ethical goals, costs, and responsibility. It’s here that we see that ethics is really about our choices.
The impact of unfairness depends a lot on the situation. An example of strong intervention: some U.S. universities maintain the policy that the freshman distribution should resemble the population distribution to some degree (demographic parity). Seeing how universities prepare a new generation for important societal positions, this seems a way to try and balance out society itself. It may not be the bank’s role to intervene like that—but should the bank just accept as a fact that money launderers are more often male than female?
(But alas, then we also won’t find out if females are actually just much better at laundering money.)
Our conclusions can be summed up as follows.
- ‘Being fair’ is a conscious effort. It can call for nontrivial choices, especially if your use case is already susceptible to bias.
- Hence, we view fairness as a balance struck very deliberately. Your choices here don’t exist in a vacuum; they impact your business and society.
- We use algorithmic fairness to guide understanding—but there is not one fairness algorithm to rule them all. The final call requires you to have a holistic understanding of all perspectives.
- We involve the right decision makers and stakeholders to make that final call.
When it really matters, data science is a necessary, but not sufficient, element on which to base your decisions. That’s why we’re both modeler and consultant.
Sidenote: GDPR special category data
It’s only fair to include the European GDPR in our consideration, as it protects our privacy with respect to the sensitive data we usually think about in the context of discrimination. However, the GDPR currently also presents a problem for fairness, since it’s very strict about processing special category data, such as religion, even for purposes of bias quantification. The UK authority ICO has taken the position that bias quantification serves “substantial public interest” and hence falls under a GDPR exemption, meaning UK data scientists can process such sensitive data strictly for this purpose. In the Netherlands, the AP (Autoriteit Persoonsgegevens) has not clarified this issue yet, meaning Dutch data scientists could potentially break privacy law if they use special category data even for bias quantification. The best we can do in the current situation is a maximum effort: doing as much as we can to mitigate unfairness, then weighing the residual discrimination risks against the positive impact of our model.
If you want to join our RiskQuest team, please check our current job openings here