Data lineage is a complex concept. What is so special about it?

During the past few years, I have noticed a rapid growth of interest in data lineage. The data lineage concept has become a hot topic in the data management community. Being deeply involved in the practical implementation of data lineage for the past two years, I would like to share my knowledge and experience with you.

Why has data lineage become a hot topic?

First of all, ask yourself: who in your (or any other) company might be interested in data lineage and why? Years ago, only IT knew what it was, but now business stakeholders, especially those from finance and risk, have become the most significant data lineage enthusiasts.

Why this sudden interest in the data lineage concept? There can be several reasons:

  • the appearance of new legislation requirements
  • business changes
  • an increase in data quality initiatives
  • supervisor and audit requirements.

Let’s talk about these in more detail.

Legislation requirements

My professional journey into data lineage started with investigating the requirements of the Basel Committee on Banking Supervision’s standard number 239: “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR). Later, the EU General Data Protection Regulation (GDPR), IFRS 9, TRIM (Targeted Review of Internal Models), and others came into the picture. Many specialists consider data lineage the ultimate ‘remedy’ to meet these requirements.

The funny thing is that you will never find the term ‘data lineage’ literally mentioned in these regulatory documents.

All conclusions about the necessity of data lineage are based on a careful investigation of the legislative requirements and the subsequent matching of these requirements to the data management arsenal, of which data lineage is a part.

Business changes

A company often deals with different types of business changes, such as changes in information needs and requirements, changes in the application landscape, organizational changes, etc. For example, let us consider a change in a business application’s database. Usually, data is transformed and processed through a chain of applications, as you can see in Figure 1:

Figure 1. A chain of applications.

The chain consists of just a few applications for convenience, but in reality, especially in large companies, such chains comprise dozens of applications.

Let’s assume that the database of one of the applications is changed, for example, that of the ‘Company web-page’ (the starting point of the chain on the left side of Figure 1). Professionals must then estimate all required changes in the downstream applications, including the impact on the end reports and/or dashboards. In this case, recorded data lineage eases the impact analysis of the change.

Suppose instead that the changes touch information and reporting requirements (the chain endpoint in Figure 1). In that case, professionals will need to use root-cause analysis to assess which data is required to produce this new information, where the data should come from, and how it should be transformed. This analysis is much easier to do if data lineage is already recorded.
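To make these two analyses more tangible, here is a minimal sketch in Python. It assumes that lineage is recorded as a simple "feeds into" mapping between applications. Apart from the ‘Company web-page’, all application names are hypothetical and are not taken from Figure 1, and the helper functions are my own illustration rather than the API of any lineage tool. Walking the mapping forward gives the impact analysis; walking it backward gives the sources needed for a root-cause analysis.

```python
from collections import deque

# Hypothetical lineage record: each key feeds data into the applications
# listed as its value. Only "company_web_page" echoes the text above;
# every other name is an illustrative assumption.
FEEDS = {
    "company_web_page": ["crm"],
    "crm": ["data_warehouse"],
    "data_warehouse": ["risk_report", "finance_dashboard"],
    "risk_report": [],
    "finance_dashboard": [],
}


def reachable(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(graph):
    """Flip every edge so that consumers point back to their sources."""
    inverted = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def impact_of_change(app):
    """Downstream applications that may be affected when `app` changes."""
    return reachable(FEEDS, app)


def sources_of(app):
    """Upstream applications whose data ends up in `app`."""
    return reachable(invert(FEEDS), app)


print(impact_of_change("company_web_page"))
# ['crm', 'data_warehouse', 'risk_report', 'finance_dashboard']
print(sources_of("risk_report"))
# ['data_warehouse', 'crm', 'company_web_page']
```

Real lineage is usually recorded at a finer grain (tables, columns, transformation rules), but the principle of walking the chain in one direction or the other stays the same.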

To date, knowledge about data processing is very often kept in professionals’ minds or, in the best-case scenario, on local computers in the form of Word or Excel documents.

Data quality

Nowadays, there are a lot of initiatives around the quality of data. In large international companies, it could take years to roll out such a program, and it would take even more time and effort to make it a success. What is often not realized is that knowing the data lineage is one of the key conditions for resolving data quality issues.

Supervisor and audit requirements

Last but not least, let’s talk about the supervisor and audit requirements. There is a growing tendency for supervisors to require companies to provide granular reporting data next to aggregated reports. Besides, the finance and risk functions in particular come with requirements to explain how critical metrics and figures in their reports have been derived. For that, you need to trace back the entire chain of data transformations and explain its path. To do that, you need to know the data lineage.

I hope that by now, I have ignited your interest in data lineage and inspired you to dive into this topic. If your question is, ‘But what IS data lineage?’, you’re on the right track.

In the next part of this blog series (Data lineage 202), I discuss the definition of ‘data lineage’ and its challenges.

For more insights, visit the Data Crossroads Academy site: