Data lineage is a complex concept. What is so special about it?
During the past few years, I have noticed a rapid growth of interest in data lineage. The concept has become a hot topic in the data management community. Having been deeply involved in the practical implementation of data lineage for the past two years, I would like to share my knowledge and experience with you.
Why has data lineage become a hot topic?
First of all, ask yourself: who in your (or any other) company might be interested in data lineage and why? Years ago, only IT knew what it was, but now business stakeholders, especially those from finance and risk, have become the biggest data lineage enthusiasts.
Why this sudden interest in the concept of data lineage? There can be several reasons:
- new legislative requirements
- business changes
- an increase in data quality initiatives
- supervisor and audit requirements.
Let’s talk about these in more detail.
New legislative requirements
My professional journey into data lineage started with investigating the Basel Committee on Banking Supervision’s standard number 239: “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR). Later on, the EU General Data Protection Regulation (GDPR), IFRS 9, TRIM (Targeted Review of Internal Models), and others came into the picture. Many specialists consider data lineage the ultimate ‘remedy’ for meeting these requirements.
The funny thing is that you will never find the term ‘data lineage’ literally mentioned in these regulatory documents.
All conclusions about the necessity of data lineage are based on a careful investigation of the legislative requirements and the subsequent matching of these requirements to the data management arsenal, of which data lineage is a part.
Business changes
A company often deals with different types of business changes, such as changes in information needs and requirements, changes in the application landscape, organizational changes, etc. As an example, let us consider a change in the database of a business application. Usually, data is transformed and processed through a chain of applications, as you can see in Figure 1:
For simplicity, the chain shown consists of just a few applications, but in reality, especially in large companies, such chains consist of dozens of applications.
Let’s assume that the database of one of the applications is changed, for example, that of the ‘Company web-page’ (the starting point of the chain on the left side of Figure 1). This means that professionals will need to estimate all required changes in the subsequent (downstream) applications, including the impact on the end reports and/or dashboards. In this case, data lineage makes the impact analysis of the change much easier.
If the changes touch, for example, information and reporting requirements (the endpoint of the chain in Figure 1), professionals will need to perform a root-cause analysis to assess which data is required to produce the new information, where that data should come from, and how it should be transformed. Such a root-cause analysis is much easier to do if data lineage has already been recorded.
Until now, knowledge about data processing has very often been kept in the minds of professionals or, in the best-case scenario, on local computers in the form of Word or Excel documents.
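To make the impact analysis and root-cause analysis described above a bit more tangible, here is a minimal Python sketch. It assumes that lineage has been recorded as a simple "feeds into" graph between applications. Apart from ‘Company web-page’, all application names are hypothetical and do not come from Figure 1, and the helper functions (`reachable`, `reverse`, `impact_analysis`, `root_cause_analysis`) are illustrative names of my own, not part of any lineage tool.

```python
from collections import deque

# Hypothetical lineage: each application maps to the applications it feeds.
# Only 'Company web-page' is taken from the text; the rest is invented.
LINEAGE = {
    "Company web-page": ["CRM"],
    "CRM": ["Data warehouse"],
    "Data warehouse": ["Risk engine", "Finance reports"],
    "Risk engine": ["Risk dashboard"],
    "Finance reports": [],
    "Risk dashboard": [],
}


def reachable(graph, start):
    """Return all nodes reachable from start, breadth-first, excluding start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def reverse(graph):
    """Flip every edge so the graph reads 'is fed by' instead of 'feeds'."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def impact_analysis(graph, changed):
    """Downstream view: what may be affected if `changed` is modified?"""
    return reachable(graph, changed)


def root_cause_analysis(graph, report):
    """Upstream view: which sources contribute to `report`?"""
    return reachable(reverse(graph), report)


if __name__ == "__main__":
    print(impact_analysis(LINEAGE, "Company web-page"))
    print(root_cause_analysis(LINEAGE, "Risk dashboard"))
```

The point of the sketch is that both questions are the same traversal run in opposite directions. Real lineage is usually recorded at a finer grain (tables, columns, transformations), but the idea stays the same.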
Data quality initiatives
Nowadays, there are a lot of initiatives around the quality of data. In large international companies, it can take years to roll out such a program, and even more time and effort to make it a success. What is often overlooked is that knowing the data lineage is one of the key conditions for resolving data quality issues.
Supervisor and audit requirements
Last but not least, let’s talk about the supervisor and audit requirements. There is a growing tendency for supervisors to require companies to provide granular reporting data next to aggregated reports. In addition, finance and risk functions in particular come with requirements to explain how critical metrics and figures in their reports have been derived. For that, you need to trace back the full chain of data transformations and explain the path the data has taken. To be able to do that, you definitely need to know the data lineage.
I hope that by now, I have ignited your interest in data lineage and inspired you to dive into this topic. If your question now is ‘But what IS data lineage?’, you’re on the right track.
In the next part of this blog series (Data lineage 202), I will talk about the definition of ‘data lineage’ and the challenges it presents.