This article is about different approaches to data lineage documentation.

In the first three articles (Data Lineage 101, 102, and 103), we have discussed why we need data lineage, what data lineage actually is, and what are the key legislative requirements for data lineage.

In this article, I would like to discuss and give my answer to the most complicated question: how should data lineage be documented?

Before you even start thinking about documenting data lineage, there are a few crucial decisions to be made beforehand:

  • What is the scope of data lineage in your situation?
  • What method of documentation will you choose: descriptive or automated?
  • Will you select process design data lineage or value data lineage?

Let’s carefully consider the content of each decision and what they would mean for your company.

The scope of data lineage

First of all, let’s specify the meaning of ‘scope.’ In order to do that, we need to recall the concepts of ‘horizontal and vertical data lineage.’ Horizontal data lineage represents the path along which data flows, starting from its point of origin to the point of its usage. Horizontal data lineage can be documented on different data model levels, such as conceptual, logical, and physical. Links between the components of data lineage on these different levels are very often called ‘vertical data lineage.’

In the figure below, you see the illustration of these two concepts:

 An illustration of ‘horizontal’ and ‘vertical’ data lineage.

Figure 1. An illustration of ‘horizontal’ and ‘vertical’ data lineage.

The scope of the data lineage you define is based on the following:

  • A list of (critical) data sets or elements for which you will document data lineage.
    Critical data elements derive from the scope of your data management initiative and make the most significant impact on company performance. The bigger the company, the less probable that you would be able to document data lineage for all existing data sets, as data lineage documentation is a very time- and resource-consuming exercise all at once.
    Therefore, choosing the critical data sets or elements is the best way to make your attempts feasible.
    The question of the definition of CDE is a rather complicated topic which I will not discuss in this article but instead come back to it in one of my future blogs.
  • The ‘length’ of your horizontal data lineage.
    Of course, we always strive to document data lineage from the ‘golden’ source to its ultimate destination. Larger companies will have longer chains. Companies often choose to document only a part of the whole data lineage chain. The starting point of data lineage is very often ‘relative’ and corresponds to the needs and the scope of the entire data lineage initiative. In this respect, your ‘golden’ source could be some application in the middle of the complete chain.
  • The ‘depth’ of your vertical data lineage.
    As we have seen, you can document data lineage on one of the three data model levels.
    It could be only one level, or you might choose to combine two levels. The choice of the number of levels on which you will document data lineage depends mainly on the selected method of documentation.
Method of documentation: descriptive vs. automated

Usually, all companies start their journey with descriptive data lineage. What does descriptive data lineage mean?

Descriptive data lineage

Descriptive data lineage means that you make a description of data lineage manually using one or another application. Microsoft Office PowerPoint, Word, Excel, and Visio are the most used applications. Some well-known data governance applications exist, such as Axon by Informatica or Collibra. Regardless of the tooling you choose, there are several common features of descriptive data lineage:

  • It is very time and resource-consuming, even if you work with designated data governance applications.
  • You usually document data lineage on either conceptual or logical levels.
    • Conceptual level
      On this level, you typically map such components as:
      – business processes
      – applications, including reports and ‘golden’ sources
      – data sets or data terms interlinked with the help of restriction rules; data sets should be provided with business definitions.
    • Logical level
      On this level, you link the following components of data lineage:
      – applications, including databases, interfaces, reports, and ‘golden’ sources
      – data entities and data attributes interlinked with the help of business rules; data entities should be provided with corresponding business definitions; repository of business rules should be kept in a particular repository
      – data checks and controls.
  • Although it would require a tremendous amount of time and effort, some IT professionals manage to document data lineage on a physical level in Excel. Still, I would not recommend you attempt this. In my opinion, you should opt for an automated solution for documentation of the physical level.
  • If you decide to proceed with documenting data lineage at least at two levels, you should also create a vertical data lineage between these two levels:
  • Data sets or terms at a conceptual level are to be linked with data entities & attributes at the logical one.
  • Restriction rules are to be linked to business (transformation) rules.

Automated data lineage

Automated data lineage means that you automate the process of recording metadata at the physical level of data processing using one of the applications available on the market. The most known vendors are SAS, Informatica, Octopai, etc. You can find an extended list of providers of such a solution on metaintegration.com. The company provides meta-integration components to major providers of the metadata lineage function. Of course, these kinds of solutions sound very attractive. But before choosing which one you want to use, keep in mind the following:

  • It would be best if you were very careful when defining the scope of your initiative, as automated data lineage is an expensive and resource-consuming task.
  • The more legacy systems you have, the more difficulties you will experience.
    There are two sources of metadata for automated data lineage (as specified by my colleague, a SAS consultant) :
    ‘as built’ – meaning that metadata that is needed for the solution is already available in the database structure;
    ‘as designed’ implies that information can be read from system design documentation, such as data modeling tools.
    Very often, legacy systems cannot provide either of these metadata sources. It will urge you to face the ‘reverse engineering’ task. I have also experienced situations when even an old application’s database documentation has been lost. So, checking all of your applications before making decisions about automated data lineage is recommended.
  • When purchasing specialized metadata software, you need to think about a particular set of modules to get automated data lineage. The reason is that for different types of metadata, you would need other repositories, for example, metadata and business rules repositories.
  • Existing software on the market usually allows you to document data lineage on physical levels. In some cases, you might also get an option to document data lineage logically. But be aware that documentation on conceptual and logical levels will still have to be done manually, as well as the mapping between physical and logical levels.
  • The number of metadata lineage object types can rise to dozens and hundreds. Therefore, the corresponding amount of metadata could explode your storage capacities.
  • Very often, business stakeholders overestimate the capabilities of automated data lineage.
Process design data lineage vs. value data lineage

Different groups of stakeholders have different requirements for data lineage. There are at least two key stakeholder groups: IT technical professionals and business users such as financial and business controllers, business analysts, and auditors. The key expectations of business users are the ability to follow changes in data values and the ability to get historical information on data processing up to 5-7 months in the past. An automated solution can’t satisfy these requirements. The automated data lineage is basically data processing design documentation. What is expected by business users, I call ‘value data lineage.’ Strictly speaking, data lineage has nothing to do with such requirements. You can call it ‘drill-down capabilities.’ The most challenging is that none of the existing data lineage providers have a solution for the ‘value’ data lineage in the scope of the whole application landscape.

Hopefully, I have managed to give you an overview of what documenting data lineage looks like and what challenges it encompasses. If you are brave enough to proceed, I will provide a few tips on starting and running the data lineage implementation project in the following (and final) article of this series.

For more insights, visit the Data Crossroads Academy site: