In the first three articles (Data Lineage 101, 102 and 103) we have discussed why we need data lineage, what data lineage actually is and what are the key legislative requirements for data lineage.

In this article, I would like to discuss and give my answer to the most complicated question: how should data lineage be documented?

Before you even start thinking about documenting data lineage, there are a few crucial decisions to be made beforehand:

  • What is the scope of data lineage in your situation?
  • What method of documentation will you choose: descriptive or automated?
  • Will you choose process design data lineage or value data lineage?

Let’s carefully consider the content of each decision and what they would mean for your company.

The scope of data lineage

First of all, let’s specify the meaning of ‘scope’. In order to do that, we need to recall the concepts of ‘horizontal and vertical data lineage’. Horizontal data lineage represents the path along which data flows starting from its point of origin to the point of its usage. Horizontal data lineage can be documented on different data model levels such as conceptual, logical and physical. Links between the components of data lineage on these different levels are very often called ‘vertical data lineage’.

In the figure below, you see the illustration of these two concepts:

 An illustration of ‘horizontal’ and ‘vertical’ data lineage.

Figure 1. An illustration of ‘horizontal’ and ‘vertical’ data lineage.

The scope of data lineage you define is based on the following:

  • A list of (critical) data sets or elements for which you will document data lineage.
    Critical data elements are those that derive from the scope of your data management initiative and those that make the greatest impact on company performance. The bigger the company, the less probable that you would be able to document data lineage for all existing data sets as data lineage documentation is very time- and resource consuming exercise all at once.
    Therefore, choosing the critical data sets or elements is the best way to make your attempts feasible.
    The question of the definition of CDE is a rather complicated topic which I will not discuss in this article, but rather come back to it in on of my future blogs.
  • The ‘length’ of your horizontal data lineage.
    Of course, we always strive to document data lineage from the ‘golden’ source to its ultimate destination. Larger companies will have longer chains. Very often, companies choose to document only a part of the whole data lineage chain. The starting point of data lineage is very often ‘relative’ and corresponds to the needs and the scope of the whole data lineage initiative. In this respect, your ‘golden’ source could be some application in the middle of the complete chain.
  • The ‘depth’ of your vertical data lineage.
    As we have seen, you can document data lineage on one of the three data model levels.
    It could be only one level, or you might choose to combine two levels. The choice of the number of levels on which you will document data lineage depends mostly on the chosen method of documentation.
Method of documentation: descriptive vs automated

Usually, all companies start their journey with descriptive data lineage. What does descriptive data lineage mean?

Descriptive data lineage

Descriptive data lineage means that you make a description of data lineage manually using one or another application. The most used applications are Microsoft Office PowerPoint, Word, Excel and Visio. There are some well-known data governance applications such as Axon by Informatica or Collibra. Regardless of the tooling you choose, there are several common features of descriptive data lineage:

  • It is very time and resource consuming, even if you work with designated data governance applications.
  • You usually document data lineage on either conceptual or logical level.
    • Conceptual level
      On this level, you usually map such components as:
      – business processes
      – applications, including reports and ‘golden’ sources
      – data sets or data terms interlinked with help of restriction rules; data sets should be provided with business definitions.
    • Logical level
      On this level, you link the following components of data lineage:
      – applications, including databases, interfaces, reports and ‘golden’ sources
      – data entities and data attributes interlinked with help of business rules; data entities should be provided with corresponding business definitions; repository of business rules should be kept in a special repository
      – data checks and controls.
  • Although it would require a tremendous amount of time and effort, some IT professionals manage to document data lineage on physical level in Excel, but I would not recommend you attempt this. For documentation of the physical level, in my opinion, you should opt for an automated solution.
  • If you decide to proceed with documenting data lineage at least at two levels, you should also create a vertical data lineage between these two levels:
  • Data sets or terms at conceptual level to be linked with data entities & attributes at the logical one.
  • Restriction rules are to be linked to business (transformation) rules.

Automated data lineage

Automated data lineage means that you automate the process of recording of metadata at physical level of data processing using one of application available on the market. The most known vendors are SAS, Informatica, Octopai etc. You can find an extended list of providers of such a solution on metaintegration.com. The company provides meta integration components to major providers of the metadata lineage function. Of course, this kinds of solutions sound very attractive. But before choosing which one you want to use, keep in mind the following:

  • You should be very careful when defining the scope of your initiative, as automated data lineage is an expensive and resource consuming task.
  • The more legacy systems you have the more difficulties you will experience.
    There are two sources of metadata for automated data lineage (as specified by my colleague, a SAS consultant) :
    ‘as built’ – meaning that metadata which is needed for the solution is already available in the database structure;
    ‘as designed’ – meaning that information can be read from documentation of system design, for example, from data modeling tools.
    Very often, legacy systems cannot provide either of these metadata sources. It will urge you to face the ‘reverse engineering’ task. I have also experienced situations when even database documentation has been lost for an old application. So, it is recommended to check all of your applications before making decisions about automated data lineage.
  • When purchasing specialized metadata software, you would need to think about a certain set of modules in order to get automated data lineage. The reason is that for different types of metadata you would need different repositories, for example metadata and business rules repositories.
  • Existing software on the market usually allows you to document data lineage on physical levels. In some cases, you might get an option also to document data lineage on logical level. But be aware that documentation on conceptual and logical levels will still have be done manually, as well as mapping between physical and logical levels.
  • The number of metadata lineage objects types can rise to dozens and hundreds. Therefore, the corresponding amount of metadata could explode your storage capacities.
  • Very often, business stakeholders overestimate capabilities of automated data lineage.
Process design data lineage vs value data lineage

Different groups of stakeholders have different requirements for data lineage. There are at least two key stakeholder groups: IT technical professionals and business users such as financial and business controllers, business analysts, auditors. The key expectations of business users are the ability to follow changes in data values and the ability to get historical information on data processing up to 5-7 months in the past. An automated solution can’t satisfy these requirements. The automated data lineage is basically data processing design documentation. What is expected by business users, I call ‘value data lineage’. Strictly speaking, data lineage has nothing to do with such requirements. You can call it ‘drill-down capabilities’. The most challenging is that none of the existing data lineage providers have a solution for the ‘value’ data lineage in the scope of the whole application landscape.

Hopefully I have managed to give you an overview of what documenting data lineage looks like, and what challenges it encompasses. If you are brave enough to proceed, in the next (and final) article of this series I will provide you with a few tips how to start and run the data lineage implementation project.