The previous article reviewed the specifics of selecting data management solutions and their functionalities. In this article, we will discuss the following:
- The definition of data lineage in the IT tools context
- Business needs and requirements for a data lineage tool
- The state of commercial off-the-shelf (COTS) data lineage tools (based on an analysis of 44 tools) in terms of:
  - IT tool types
  - Deployment options
  - High-level functionalities
  - Industry solutions
Data Lineage Definition
No aligned definition of data lineage exists. Many professionals mistakenly narrow the understanding of data lineage to the physical level and the automated method of its documentation. I have shared my practical experience in data lineage implementation in multiple publications: the book “Data Lineage from a Business Perspective,” articles, and webinars. In this article, I will summarize my approach and the latest discoveries in the area of data lineage (DL) tools.
Data lineage is a description of data movements and transformations at various abstraction levels along data chains and relationships between data elements at these levels.
Figure 1 demonstrates the metamodel of data lineage. This metamodel is the result of an analysis of various data lineage-related concepts. Data lineage can be documented at multiple abstraction levels: business, conceptual, logical, and physical. Each abstraction level is described by its own data lineage components and objects. For example, at the physical level, tables, columns, and ETL jobs are the key data lineage components for SQL databases. Different table types, for example, core and lookup, represent different data lineage objects.
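To make the metamodel more tangible, here is a minimal Python sketch of how abstraction levels, components, and objects could relate to each other. It is an illustration only, not the metamodel from Figure 1: the class names (AbstractionLevel, LineageObject, LevelModel) and the instance names (customer, country_code) are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class AbstractionLevel(Enum):
    """The four abstraction levels named in the text."""
    BUSINESS = "business"
    CONCEPTUAL = "conceptual"
    LOGICAL = "logical"
    PHYSICAL = "physical"


@dataclass(frozen=True)
class LineageObject:
    """A concrete kind of a component, e.g. a core or lookup table."""
    component: str    # e.g. "table", "column", "ETL job"
    object_type: str  # e.g. "core", "lookup"
    name: str         # hypothetical instance name


@dataclass
class LevelModel:
    """Components allowed at one abstraction level and the objects recorded there."""
    level: AbstractionLevel
    components: set[str] = field(default_factory=set)
    objects: list[LineageObject] = field(default_factory=list)

    def add(self, obj: LineageObject) -> None:
        # Reject objects whose component is not part of this level's model.
        if obj.component not in self.components:
            raise ValueError(
                f"{obj.component!r} is not a component of the {self.level.value} level"
            )
        self.objects.append(obj)


# Physical level for SQL databases, using the components mentioned in the text.
physical = LevelModel(AbstractionLevel.PHYSICAL, {"table", "column", "ETL job"})
physical.add(LineageObject("table", "core", "customer"))
physical.add(LineageObject("table", "lookup", "country_code"))

for obj in physical.objects:
    print(physical.level.value, obj.component, obj.object_type, obj.name)
```

A company that documents more than one level would keep one such model per level and decide, per level, which components and object types are in scope.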
We recognize several data lineage types, as shown in Figure 2.
Data lineage classification depends on several factors:
- The object of documentation: metadata and data value lineage
Metadata lineage describes processes that enable data movements and transformations. Data value lineage means describing data movements at the level of data instances.
- The method of documentation: descriptive and automated
The descriptive method means that data lineage is described manually using various IT tools. This method must be used for business, conceptual, and logical layers. Many companies still describe physical data lineage manually in Excel, which is not recommended.
The automated method means ingesting and integrating metadata at the physical level from various repositories. This method requires multiple products and functionalities, including scanners, integration, visualization services, etc.
- The direction of documentation: vertical and horizontal
Horizontal data lineage links data lineage objects along a data chain at the same abstraction level. Vertical data lineage maps data lineage objects between more than one layer. This article focuses only on metadata lineage.
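The difference between the two directions can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the Node class, the link_direction function, and the example element names are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A data lineage object placed at one abstraction level."""
    level: str  # "business", "conceptual", "logical", or "physical"
    name: str   # hypothetical element name


def link_direction(source: Node, target: Node) -> str:
    """Classify a link: same level means horizontal, different levels means vertical."""
    return "horizontal" if source.level == target.level else "vertical"


# Two physical columns along a data chain: a horizontal link.
crm_email = Node("physical", "crm.customer.email")
dwh_email = Node("physical", "dwh.dim_customer.email")

# A logical attribute mapped to its physical column: a vertical link.
logical_email = Node("logical", "Customer.EmailAddress")

for source, target in [(crm_email, dwh_email), (logical_email, dwh_email)]:
    print(f"{source.name} -> {target.name}: {link_direction(source, target)}")
```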
Business Needs and Requirements for a Data Lineage Tool
Various data lineage stakeholders have quite different requirements for data lineage. I described the data lineage requirements of four stakeholder groups in detail in the article: “Data Lineage: the Needs of and Benefits to Various Stakeholders.”
It is essential to understand that the requirements for a data lineage tool depend on the data lineage type. Requirements for descriptive data lineage at the business, conceptual, and logical levels will differ significantly from the requirements for automated data lineage at the physical level.
If a company wants to implement data lineage at multiple layers, then its requirements will combine those for descriptive and automated data lineage.
Let’s briefly consider some examples of the requirements.
Requirements for descriptive data lineage
The key requirements for data lineage documentation at the business, conceptual, and logical levels are:
- Ability to document and link business processes and IT assets
This requirement comes from multiple laws and regulations. Usually, a company starts recording data flows by describing and linking business processes to IT assets like IT systems, applications, databases, ETL tools, etc. These flows are an example of data architecture deliverables.
- Ability to document and link multiple data models with various types of diagrams and notations
Numerous techniques to model data exist. Multiple teams within a company may use different diagrams and notations that must be linked.
- Collaboration and integration capabilities
Multiple teams must be able to work on various models simultaneously. A tool must support internal and external integration capabilities. Internal integration means mapping artifacts produced by multiple teams. External integration assumes the ability to integrate with other IT tools, including automated data lineage solutions.
Requirements for automated data lineage
The key requirements for automated data lineage at the physical level are:
- Ability to discover, ingest, integrate, and visualize various metadata types from various database and language types
One of the key challenges in data lineage documentation is the level of granularity. Documenting data lineage at the table level provides few advantages. To use data lineage outcomes effectively, professionals must see data movements at the column level, enriched with a transparent view of the transformations implemented in ETL jobs. As a company can have various databases and ETL engines, an IT solution must provide multiple scanners to read and ingest metadata from different types of databases and coding languages.
- Collaboration and integration
These requirements are similar to the previously discussed ones. It is worth noting that integration between descriptive and automated data lineage outcomes is also critical. Machine learning capabilities can assist in matching these outcomes.
- Ability to perform root-cause and impact analysis
Root-cause and impact analysis are two key techniques IT and data management professionals require and use in optimizing DevOps operations, implementing data quality, and so on.
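Column-level granularity and the two analysis techniques can be illustrated together. The following Python sketch treats column-level lineage as a directed graph and walks it downstream for impact analysis and upstream for root-cause analysis. It is a simplified illustration, not how any specific tool works: the column names, transformation labels, and function names are assumptions made for this example.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage: (source column, target column, transformation).
EDGES = [
    ("crm.customer.first_name", "stg.customer.full_name", "concat"),
    ("crm.customer.last_name", "stg.customer.full_name", "concat"),
    ("stg.customer.full_name", "dwh.dim_customer.name", "trim"),
    ("dwh.dim_customer.name", "report.customer_list.name", "copy"),
    ("crm.customer.email", "dwh.dim_customer.email", "lowercase"),
]


def build_graph(edges):
    """Return (upstream, downstream) adjacency maps for the lineage edges."""
    upstream, downstream = defaultdict(set), defaultdict(set)
    for source, target, _transformation in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start, neighbours):
    """Breadth-first search: every column reachable from start via neighbours."""
    seen, queue = set(), deque([start])
    while queue:
        column = queue.popleft()
        for nxt in neighbours.get(column, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM, DOWNSTREAM = build_graph(EDGES)


def impact_analysis(column):
    """Which columns are affected if this column changes?"""
    return reachable(column, DOWNSTREAM)


def root_cause_analysis(column):
    """Which columns could have caused an issue observed in this column?"""
    return reachable(column, UPSTREAM)


print(sorted(impact_analysis("crm.customer.first_name")))
print(sorted(root_cause_analysis("report.customer_list.name")))
```

In this example, the root-cause analysis for the report column does not include crm.customer.email. Lineage recorded only at the table level could not make that distinction, because both the name and the email columns travel between the same tables.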
Let’s see how existing commercial-off-the-shelf tools meet these requirements.
Overview of COTS Data Lineage Tools
In this section, I will discuss several challenges IT and data management professionals should be aware of when selecting a data lineage tool.
Challenge 1: Vendors use different approaches in labeling data lineage tools
Figure 3 demonstrates the analysis results of 44 tools that provide data lineage functionality. You can see that these tools have quite different names like “platform,” “solution,” “software,” and so on.
As you can see, 14 tools use the term “platform” in their name. However, the reality is more complex, as these platforms are further qualified in quite different ways: “data lineage,” “data discovery,” “data catalog,” “data management” platforms, and so on.
This labeling diversity can complicate the selection process and will require thorough investigation.
Challenge 2: The same tool can have multiple labels
The same tool can be classified differently depending on the information source. For example, I came across four different classifications for Collibra:
- Metadata management
- Data management
- Data governance
- Data lineage
One reason is that Collibra offers multiple functionalities. You may need only the data lineage-related functionality, which is a small percentage of what is offered. So, you should have a clear vision of your company’s requirements.
Challenge 3: Vendors and third parties use quite different approaches to labeling the same tools
Labeling approaches by vendors and third-party information providers vary greatly. Figure 4 demonstrates this challenge.
Several recognizable IT solution review providers share their expert viewpoints on IT tools. Their sites can be the first information source about data lineage solutions. Assume you are looking for a data lineage tool. You will come across several lists with multiple “best” solutions. My first question is: “What are the criteria of the ‘best’?” Unfortunately, I could never find an answer. Then, you may face a situation where a third-party site presents a data lineage tool that its own vendor labels quite differently. Let’s take Alation as an example. I found it listed as one of the “best” data lineage solutions, while Alation describes itself as “the data intelligence platform.” This situation leads to the following challenge.
Challenge 4: Data lineage is often not a core functionality
Among the 44 tools I analyzed, only 25% provide data lineage as the core functionality. The rest deliver data lineage as a supporting functionality for the technologies available in their solutions. If you need a data lineage solution for multiple database types, you should focus only on this 25% of solutions and search for those that provide scanners for all your database types.
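This shortlisting step can be sketched as a simple coverage check. The Python example below is only an illustration: the tool names, their scanner lists, and the required database types are invented for the example and do not describe any real product.

```python
# Database types used in a hypothetical company.
required_sources = {"Oracle", "PostgreSQL", "Snowflake"}

# Hypothetical tools; real scanner lists must come from vendor documentation.
tools = {
    "Tool A": {"core_lineage": True,
               "scanners": {"Oracle", "PostgreSQL", "Snowflake", "SQL Server"}},
    "Tool B": {"core_lineage": True,
               "scanners": {"Oracle", "PostgreSQL"}},
    "Tool C": {"core_lineage": False,
               "scanners": {"Oracle", "PostgreSQL", "Snowflake"}},
}


def missing_scanners(tools, required):
    """For tools with lineage as core functionality, list the uncovered sources."""
    return {
        name: required - spec["scanners"]
        for name, spec in tools.items()
        if spec["core_lineage"]
    }


def shortlist(tools, required):
    """Keep only core-lineage tools that cover every required source."""
    return sorted(
        name for name, missing in missing_scanners(tools, required).items()
        if not missing
    )


print(missing_scanners(tools, required_sources))
print(shortlist(tools, required_sources))
```

Keeping the set of missing scanners, rather than a plain yes or no, makes it easier to discuss gaps with a vendor.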
Challenge 5: Data lineage solutions often focus on data lineage at the physical level
Only 25% of the 44 data lineage solutions offer integrated capabilities to document data lineage at multiple abstraction levels. Solidatus, Collibra, and Informatica are examples.
Challenge 6: Some providers offer third-party or open-source data lineage solutions
I discovered that some providers don’t offer their own data lineage functionalities. Instead, some of them include third-party or open-source solutions.
Challenge 7: Some providers don’t provide the complete set of data lineage functionalities
The complete set of automated data lineage functionalities includes metadata scanners, integration, and visualization services. I noticed that only 27% of reviewed solutions mentioned scanners as a part of their solutions.
Challenge 8: Data lineage providers offer different deployment options
Information about deployment options can be challenging to find. Of the solution providers, 34% offer both on-premises and cloud deployment. For the rest of the providers, the information is either not available, or they provide only on-premises solutions.
Challenge 9: Data lineage tools offer some other functionalities
Figure 5 demonstrates the variety of functionalities included in data lineage-related tools.
As mentioned earlier, for 75% of reviewed solutions, data lineage is a supporting functionality. These solutions include multiple other functionalities, like data catalogs and business glossaries, data governance, data quality, data lifecycle support, etc. This approach is understandable as data lineage is a supporting capability for other data management capabilities. The key point is matching a company’s needs, requirements, and resources with offered solution functionalities.
Challenge 10: Data lineage tools are implemented in a variety of industries
While choosing a data lineage solution, it is advisable to check its applicability for a particular sector. Figure 6 demonstrates the core industries in which data lineage providers compete.
It is essential to understand that this figure shows how the need for data lineage varies by industry. It does not necessarily mean that an IT tool offers an industry-specific solution. For example, pressure from legislative requirements is the main business driver in the banking, insurance, and financial services industries. That is why multiple data lineage solutions have been implemented in this sector.
The reviewed challenges highlight the key steps a company should undertake as a part of its selection process.
Summary
A company must use a sophisticated approach in choosing an appropriate data lineage tool and perform multiple steps, including the following:
- Identify the business needs and requirements of various data lineage stakeholder groups
Various stakeholders have different needs and requirements regarding data lineage types, outcomes, and the functionality of a data lineage tool.
- Design a metamodel of data lineage
Data lineage is a complex metadata construct as data movements and transformations can be documented at various abstraction levels. A company may choose one or more layers of data lineage documentation. The number and nature of data lineage components and objects per layer may differ. The chosen metamodel of data lineage will significantly impact the scope and success of the data lineage initiative.
- Define an approach and methods for documenting data lineage
The metamodel of data lineage will define the documentation method: descriptive and/or automated. The chosen method will determine the required functionalities of a data lineage tool.
- Perform a detailed investigation and comparison of data lineage tools’ functionalities
A thorough examination will assist in choosing an appropriate solution that fits a company’s needs and resources.
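The chain from metamodel to documentation method to tool requirements can be sketched as follows. This Python example is a simplification under stated assumptions: it treats the physical level as always automated (manual documentation is possible but not recommended), it abbreviates the requirement lists discussed earlier in this article, and all function and variable names are hypothetical.

```python
# Levels documented with each method, following the article's recommendation.
DESCRIPTIVE_LEVELS = {"business", "conceptual", "logical"}
AUTOMATED_LEVELS = {"physical"}

# Abbreviated requirement lists per documentation method.
REQUIREMENTS = {
    "descriptive": [
        "document and link business processes and IT assets",
        "document and link multiple data models",
        "collaboration and integration",
    ],
    "automated": [
        "discover, ingest, integrate, and visualize metadata",
        "collaboration and integration",
        "root-cause and impact analysis",
    ],
}


def required_methods(levels):
    """Derive the documentation methods from the chosen abstraction levels."""
    chosen = set(levels)
    unknown = chosen - DESCRIPTIVE_LEVELS - AUTOMATED_LEVELS
    if unknown:
        raise ValueError(f"unknown abstraction levels: {sorted(unknown)}")
    methods = []
    if chosen & DESCRIPTIVE_LEVELS:
        methods.append("descriptive")
    if chosen & AUTOMATED_LEVELS:
        methods.append("automated")
    return methods


def required_capabilities(levels):
    """Combine the requirements of all needed methods, without duplicates."""
    capabilities = []
    for method in required_methods(levels):
        for capability in REQUIREMENTS[method]:
            if capability not in capabilities:
                capabilities.append(capability)
    return capabilities


# A company documenting lineage at the logical and physical levels.
for capability in required_capabilities(["logical", "physical"]):
    print("-", capability)
```

The resulting list can then serve as the checklist against which candidate tools are compared.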