Over the past few years, I have noticed rapidly growing interest in data lineage. It has become a hot topic in the data management community. Having been deeply involved in the practical implementation of data lineage for the past two years, I would like to share my knowledge and experience with you.

Why has data lineage become a hot topic?

First of all, ask yourself: who in your (or any other) company might be interested in data lineage, and why? Years ago, only IT knew what it was, but now business stakeholders, especially those from finance and risk, have become the biggest data lineage enthusiasts.

Why this sudden interest in data lineage? There can be a number of reasons:

  • appearance of new legislation requirements
  • business changes
  • an increase in data quality initiatives
  • supervisor and audit requirements

Let’s talk about these in more detail.

Legislation requirements

My professional journey to data lineage started with an investigation of the requirements of the Basel Committee on Banking Supervision’s standard number 239: “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR). Later on, the EU General Data Protection Regulation (GDPR), IFRS 9, TRIM (Targeted Review of Internal Models) and others came into the picture. Many specialists consider data lineage the ultimate ‘remedy’ for meeting these requirements.

The funny thing is that you will never find the term ‘data lineage’ literally mentioned in these regulatory documents.

All conclusions about the necessity of data lineage are based on a careful investigation of the legislation requirements and a subsequent matching of these requirements to the data management arsenal, of which data lineage is a part.

Business changes

Very often, a company has to deal with different types of business changes, such as changes in information needs and requirements, changes in the application landscape, organizational changes, etc. As an example, let us consider a change in the database of a business application. Usually, data is transformed and processed through a chain of applications, as you can see in Figure 1:

Figure 1. A chain of applications.

For convenience, the chain shown consists of just a few applications, but in reality, especially in large companies, such chains consist of dozens of applications.

Let’s assume that the database of one of the applications is changed, for example in ‘Company web-page’ (the starting point of the chain on the left side of Figure 1). This means that professionals will need to estimate all required changes in the downstream applications, including the impact on the end reports and/or dashboards. In this case, data lineage makes the impact analysis of the change much easier.

If the changes touch, for example, information and reporting requirements (the end point of the chain in Figure 1), professionals will need to perform a root-cause analysis to assess which data is required to produce the new information, where that data should come from, and how it should be transformed. In such a case, the root-cause analysis will be much easier to do if data lineage has already been recorded.
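To make the two analyses above a bit more tangible, here is a minimal sketch in Python. It assumes that lineage has been recorded as a simple "feeds into" map between applications. Only ‘Company web-page’ comes from the example above; all other application names, as well as the helper functions, are hypothetical and serve purely as an illustration, not as a description of any real lineage tool.

```python
from collections import deque

# Hypothetical lineage: each key feeds data into the applications in its list.
# Only "Company web-page" is named in the text; the other names are illustrative.
FEEDS = {
    "Company web-page": ["CRM"],
    "CRM": ["Data warehouse"],
    "Data warehouse": ["Finance report", "Risk dashboard"],
    "Finance report": [],
    "Risk dashboard": [],
}


def reachable(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(graph):
    """Reverse every edge so that consumers point back to their sources."""
    reverse = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


def impact_of_change(app):
    """Impact analysis: everything downstream of a changed application."""
    return reachable(FEEDS, app)


def sources_of(report):
    """Root-cause analysis: everything upstream of a report or dashboard."""
    return reachable(invert(FEEDS), report)


print(impact_of_change("Company web-page"))
print(sources_of("Risk dashboard"))
```

Walking the map forwards answers the impact question ("what is affected if this database changes?"), while walking it backwards answers the root-cause question ("where does the data in this report come from?"). Real lineage is usually recorded at a much finer level (tables, columns, transformations), but the idea of traversing the chain in both directions stays the same.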

So far, knowledge about data processing has very often been kept in the minds of professionals or, in the best-case scenario, on local computers in the form of Word or Excel documents.

Data quality

Nowadays there are a lot of initiatives around the quality of data. In large international companies, it can take years to roll out such a program, and even more time and effort to make it a success. What is often overlooked is that knowing the data lineage is one of the key conditions for resolving data quality issues.

Supervisor and audit requirements

Last but not least, let’s talk about supervisor and audit requirements. Increasingly, supervisors require companies to provide granular reporting data alongside aggregated reports. In addition, the finance and risk functions in particular come with requirements to explain how critical metrics and figures in their reports have been derived. For that, you need to be able to trace back the full chain of data transformations and explain the path the data has taken. To be able to do that, you definitely need to know your data lineage.

I hope that by now I have ignited your interest in data lineage and inspired you to dive into this topic. If your question now is ‘But what IS data lineage?’, you’re on the right track.

In the next part of this blog series (Data lineage 202), I will talk about the definition of ‘data lineage’, as well as the challenges it presents.