Is it possible to read hundreds of annual reports within a day?

June 03, 2021 Data Science


In the Crime Scene Investigation series, the lead detective understands the content of a complex file by taking just one quick glance at it. Evidently, it doesn’t work this way in practice, but increased computational power and more advanced techniques allow for much quicker analyses of large amounts of text.

As quantitative risk managers we are used to working a lot with numbers and complex models. However, large amounts of text, such as legislation and regulation, annual reports and model documentation, are also a significant part of our job.

Natural language processing (NLP) plays a major role in analyzing texts and in providing quick insights into the trends appearing within them. In this blog we give one example of applying an NLP technique, but there are numerous other possibilities to apply such techniques in our field of work.

Natural language processing

This section gives a short introduction to NLP and, more specifically, to the Latent Dirichlet Allocation (LDA) model. The LDA model is a statistical model that is able to analyze many observations efficiently and to find latent groups in these observations[1].

The aim of this blog is to find latent topics in the risk sections of annual reports of banks. LDA is, however, also applicable in other fields of science, e.g. microbiology or behavioral sciences.

The LDA topic model assumes that the topic proportions in the paragraphs of annual reports follow a Dirichlet distribution. The Dirichlet distribution is a multivariate probability distribution whose realizations are themselves probability distributions: each draw is a vector of non-negative weights that sum to one.
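To make the idea of a distribution over distributions concrete, the sketch below draws one sample from a Dirichlet distribution using only the Python standard library. It relies on the standard construction of normalising independent Gamma draws. The function name `sample_dirichlet`, the three topics and the concentration value 0.5 are assumptions chosen for illustration; they do not come from the analysis in this blog.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed, only so that reruns give the same draw

# Three hypothetical topics with a symmetric concentration parameter of 0.5.
topic_mix = sample_dirichlet([0.5, 0.5, 0.5], rng)

print(topic_mix)       # three non-negative topic weights
print(sum(topic_mix))  # 1.0 up to floating-point rounding
```

Every draw can be read as the topic proportions of one paragraph. A concentration parameter below one tends to give draws in which a few topics dominate.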

In our specific case this means that the topics in a paragraph follow a probability distribution and that the words forming a topic also follow a probability distribution (see subsection: Generative process Latent Dirichlet Allocation in blue). Training the LDA topic model results in groups of words which, each with a certain probability, form a topic. Giving such a group of words a topic name is not performed by the algorithm and therefore remains human work. Figure 1 shows an illustration of groups of words with corresponding probabilities that form a topic. The illustration shows, for example, that the probability of observing the word “risk” in the topic “Risk Management” is four times larger than the probability of observing the word “appetite” in this topic.

Figure 1: Word probabilities of the three topics risk management (red), capital management (blue) and funding & liquidity (green). Note that words which belong to a certain topic can also be used to form another topic.
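The generative view behind LDA can be sketched in a few lines: draw topic proportions for a paragraph, then for every word position first draw a topic and then draw a word from that topic. The word probabilities in `TOPIC_WORDS` are made-up illustrative values, not the values of Figure 1, and they are fixed here for simplicity, whereas in the full model they are themselves drawn from a Dirichlet distribution. All names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical word probabilities per topic; each inner dict sums to 1.
# Note that "capital" appears in two topics.
TOPIC_WORDS = {
    "risk management": {"risk": 0.4, "appetite": 0.1, "framework": 0.3, "capital": 0.2},
    "capital management": {"capital": 0.5, "adequacy": 0.3, "CET1": 0.2},
    "funding & liquidity": {"liquidity": 0.4, "funding": 0.4, "stress": 0.2},
}

def generate_paragraph(n_words, alpha, rng):
    """Generate (word, topic) pairs for one paragraph following the LDA recipe."""
    topics = list(TOPIC_WORDS)
    # Step 1: draw the topic proportions of this paragraph.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Step 2: draw a topic for this word position.
        topic = rng.choices(topics, weights=theta)[0]
        # Step 3: draw a word from the word distribution of that topic.
        vocab = TOPIC_WORDS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append((word, topic))
    return theta, words

theta, words = generate_paragraph(n_words=8, alpha=0.5, rng=random.Random(1))
print(theta)
print(words)
```

Training an LDA model amounts to running this recipe in reverse: given only the observed words, estimate the topic proportions and the word probabilities that most plausibly produced them.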

When the LDA algorithm is applied to the annual reports, each word in a paragraph is assigned to a latent topic; see Figure 2 for an example where words belonging to three different topics have been marked.

Subsequently, each paragraph will have a distribution of the topics appearing in that paragraph. A paragraph that contains the words “Capital”, “Adequacy” and “CET1” has a high probability of getting the topic “Capital management” assigned to it, whereas a paragraph with the words “Liquidity”, “Funding” and “Stress” is more likely to be related to “Funding & Liquidity”. The last step in the process is to link the topic with the highest probability to the paragraph concerned.

Figure 2: Two paragraphs of the chapter Risk and capital management from the 2018 annual report of ING Bank. The marked words are linked to the topics risk management (red), capital management (blue) and funding & liquidity (green). Note that many more topics could have been present in these specific paragraphs.
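The step from word-level topic assignments to a single topic per paragraph can be sketched as follows. This is a simplified illustration that uses plain relative frequencies; an actual LDA implementation would additionally smooth these proportions with the Dirichlet prior. The function name `paragraph_topic` and the example assignments are hypothetical.

```python
from collections import Counter

def paragraph_topic(word_topics):
    """Return (dominant_topic, distribution) from per-word topic assignments."""
    counts = Counter(word_topics)
    total = sum(counts.values())
    # Relative frequency of each topic among the marked words of the paragraph.
    distribution = {topic: count / total for topic, count in counts.items()}
    # The paragraph is linked to the topic with the highest probability.
    dominant = max(distribution, key=distribution.get)
    return dominant, distribution

# Hypothetical topic assignments for the marked words of one paragraph.
assignments = [
    "capital management",
    "capital management",
    "risk management",
    "capital management",
    "funding & liquidity",
]

dominant, dist = paragraph_topic(assignments)
print(dominant)  # capital management
print(dist)
```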

Application to annual reports

Banks report on several types of risk in their annual reports. To find trends in the risk types appearing in these annual reports, we have implemented the LDA topic model and applied it to the risk sections of annual reports of large banks. The LDA model has been trained on 298 annual reports of banks in the European Union and the United Kingdom. For each bank, the annual reports of the years 2016, 2017 and 2018 have been analysed, provided they were publicly available.
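For readers who want to see what training an LDA model can look like, below is a toy collapsed Gibbs sampler written with the standard library only. It is a minimal sketch of one common estimation approach and not the implementation used for the analysis in this blog; the function `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four miniature "paragraphs" are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                       # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                                     # tokens per topic

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment of this token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates: topic proportions per document, word probabilities per topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + n_vocab * beta) for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi

# Four made-up miniature "paragraphs", already tokenised and lower-cased.
docs = [
    "capital adequacy cet1 capital buffer".split(),
    "liquidity funding stress liquidity".split(),
    "capital cet1 buffer adequacy".split(),
    "funding liquidity stress funding".split(),
]
theta, phi = lda_gibbs(docs, n_topics=2)
for k, dist in enumerate(phi):
    # Most probable words per topic; naming the topic remains human work.
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

On real annual reports, a preprocessing step (tokenisation, removal of stop words) and a tested library implementation would normally take the place of this toy sampler.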

With the trained model, a topic is assigned to each paragraph of each annual report. To highlight differences between the topics appearing in the annual reports, the reports have been divided according to two categorizations: the first divides the annual reports by reporting year; the second divides them by the credit rating that S&P has assigned to the obligations of the country where the bank is located. The categorizations have been chosen such that differences in the annual reports can be observed and explained by economic developments. For the analysis based on the credit rating categorization, countries with an AAA credit rating are compared to countries with a credit rating ≤BBB.
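Once every paragraph carries a topic label, the topic distribution per category is a matter of counting. The sketch below computes the share of each topic per reporting year. The records are made-up labels used purely for illustration and are not results from the analysis; the function name `topic_shares` is hypothetical as well.

```python
from collections import Counter, defaultdict

def topic_shares(records):
    """Compute the share of each topic per category.

    records is an iterable of (category, topic) pairs, one per paragraph.
    """
    counts = defaultdict(Counter)
    for category, topic in records:
        counts[category][topic] += 1
    shares = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        shares[category] = {topic: n / total for topic, n in counter.items()}
    return shares

# Made-up paragraph labels, for illustration only.
records = [
    (2016, "credit risk"), (2016, "market risk"),
    (2016, "market risk"), (2016, "capital requirements"),
    (2018, "credit risk"), (2018, "credit risk"),
    (2018, "capital requirements"), (2018, "market risk"),
]

shares = topic_shares(records)
print(shares[2016]["credit risk"])  # 0.25
print(shares[2018]["credit risk"])  # 0.5
```

The same function covers the second categorization by passing the rating bucket of the bank's country (for example "AAA" or "≤BBB") as the category instead of the reporting year.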

Figures 4 and 5 show the distributions of the topics per categorization. Under the categorization by reporting year, the topics capital requirements and credit risk make up an increasing share of the annual reports over the years.

The increase in the contribution of the topic capital requirements is explained by the extra buffer banks have to hold as a consequence of the Basel III regulations. The so-called ‘countercyclical capital buffer’ (CCyB) ensures that the capital requirements in the banking sector take account of the macro-financial environment in which the banks operate. This means that banks have to hold additional capital when the average creditworthiness increases, such that when the cycle turns down this additional capital can be used to cover credit losses. The countercyclical capital buffer was phased in between the beginning of 2016 and the end of 2018 and became fully effective on 1 January 2019.

The increase in the contribution of the topic credit risk is explained by the guidelines that the European Banking Authority (EBA) published in the period 2016-2018. In this period the EBA published guidelines on the definition of default and on the modelling and estimation of the probability of default (PD) and Loss Given Default (LGD). Furthermore, as of 1 January 2018 banks have to provide insight into the impact of IFRS 9. Combined with a general shift from market risk to credit risk, this makes the increase in the contribution of the topic credit risk in the annual reports a plausible observation.

Figure 4: The distribution of the topics in the annual reports of banks divided over the reporting years.

Figure 5 shows the differences between the annual reports of banks based on the credit rating categorization. It is immediately apparent that in AAA-countries credit risk is a frequently occurring topic. In the ≤BBB-countries, by contrast, the paragraphs are more evenly distributed among the different risk topics. The credit rating of the obligations of a country is often positively correlated with the creditworthiness of the clients of the banks in that country. As a result, for banks in AAA-countries it is viable to implement an internal model (Advanced Internal Ratings Based (AIRB)) such that capital requirements and credit risk can be reduced. Implementation of an AIRB model in ≤BBB-countries is less viable, which makes applying the Standardized Approach (SA) the more reasonable choice. An AIRB model gives banks more results to report in their annual reports than the less advanced SA.

Figure 5: The distribution of the topics in the annual reports of banks divided by the categorization on credit rating basis.


To answer the question stated in the title: no, from a human point of view it is not possible to read hundreds of annual reports in detail within a day. However, the increased computational power of computers means that techniques like NLP can be applied much more easily and quickly than in the past. With such techniques, analyses of large amounts of text can provide relevant insights within a day. One has to realize, however, that running the algorithm does not by itself produce conclusions: interpreting the results correctly requires expert judgement.

Application of the LDA topic model can be valuable for risk managers. Instead of, for example, going through dozens of regulations, the algorithm can select the specific paragraphs of interest. The LDA model is also used on a large scale to monitor trends in investing news. Other examples where the LDA model can be applied are analyzing client-generated content, such as questions or complaints about a certain product, and investigating all available documents belonging to an investment portfolio. Overall, we see a lot of possibilities to apply this technique in our field of work.

For more information on this topic, contact Remco de Smit (Consultant), Vincent Schothuis (Manager), or Sven de Man (Partner).

If you want to join our RiskQuest team, please check our current job openings.

[1] Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003), Latent Dirichlet Allocation, Journal of Machine Learning Research 3, 993-1022.