The title of this article summarizes my 10 years of hands-on experience in data management. Implementing data management frameworks in medium-sized companies and documenting data lineage are the core of my professional experience. These topics might seem quite different, but practice has brought me to the conclusion that they are closely related to each other. In this article, I will explain how, and share my knowledge and techniques.

WHAT are data management and data lineage?

This question might look odd, as we data management professionals should know the answer. One of the key data management tasks is ensuring aligned and unambiguous definitions. Yet this is not the case for the majority of data management terms.

Data management

Each of my workshops on data management-related topics starts with me asking the participants to formulate their understanding of data management in one or two sentences. So far, I have never heard the same answer from two different people. When you try to analyse the definitions of data management in published sources, you face the same challenge. This shows that data management has a variety of meanings that strongly depend on the context.

I consider data management as ‘a business capability that safeguards the company’s data and information resources and optimizes data and information value chains (‘the chain’) to ensure effective conduction of business’ (Source). The word ‘capability’ stresses the ability of data management to reach goals and deliver outcomes. Data and information value chain supports the business in the creation of business value as shown in Figure 1. The key data management sub-capabilities such as data governance, data quality, data modeling, and information systems (data and application) architecture are needed to design the chain.

Figure 1. Core data management sub-capabilities

An IT-related set of data management sub-capabilities enables the functioning of the chain.

Data lineage

In my opinion, data lineage is one of the most abstract and misaligned concepts in data management. As one of my colleagues once said: ‘Everybody needs data lineage, but no one can explain their understanding of it’. Usually, metadata data lineage documents the flows of data across the organization. The concept of data lineage intersects with five other concepts: data chain, data value chain, integration architecture, data flow, and information value chain. Some of these terms are even considered to be synonyms of ‘data lineage’. There are different viewpoints on the constituent components of data lineage. Based on the analysis of the concept of data lineage and the requirements of several legislative documents, I came up with the following set of components required for proper documentation of data lineage:

  1. application landscape
  2. three levels of data models (conceptual, logical, and physical) with corresponding business rules/ETLs, linked vertically with each other.
  3. business processes and roles
  4. reports catalogue
  5. data quality checks and controls catalogues.

(For a more detailed explanation, please check out this article.)
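The five components listed above can be thought of as a simple record that is filled in step by step. The following is only a minimal sketch in Python, not a standard metadata schema: the class name `LineageDocumentation`, its field names, and the sample values are my own assumptions for illustration.

```python
from dataclasses import dataclass, field

# The three data model levels named in component 2.
MODEL_LEVELS = ("conceptual", "logical", "physical")


@dataclass
class LineageDocumentation:
    """Hypothetical container for the five data lineage components."""
    applications: list[str] = field(default_factory=list)             # 1. application landscape
    data_models: dict[str, list[str]] = field(default_factory=dict)   # 2. model level -> documented entities
    processes_and_roles: dict[str, str] = field(default_factory=dict) # 3. business process -> responsible role
    reports: list[str] = field(default_factory=list)                  # 4. reports catalogue
    quality_checks: list[str] = field(default_factory=list)           # 5. data quality checks and controls

    def missing_components(self) -> list[str]:
        """Return the components that have not been documented yet."""
        checks = {
            "application landscape": bool(self.applications),
            "data models (all three levels)": all(
                self.data_models.get(level) for level in MODEL_LEVELS
            ),
            "business processes and roles": bool(self.processes_and_roles),
            "reports catalogue": bool(self.reports),
            "data quality checks and controls": bool(self.quality_checks),
        }
        return [name for name, documented in checks.items() if not documented]


# Invented example: only the applications and reports are documented so far.
doc = LineageDocumentation(applications=["CRM"], reports=["Monthly sales report"])
print(doc.missing_components())
```

A structure like this makes it visible at any moment which components still block the assembly of data lineage.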

The scheme of the metadata lineage components can be seen in Figure 2. More about data lineage can be found in the set of articles Data Lineage 101, 102, 103, 104 and 105.

Figure 2. The scheme of the metadata lineage components

Now let us discuss the way to implement the data management function and demonstrate its relation to data lineage.

HOW are the setup of the data management function and the documentation of data lineage linked?

The answer to this question lies in similarities between:

  • the deliverables of data management sub-capabilities and key components of data lineage
  • the logical steps of implementation of data management and documentation of data lineage.

I will demonstrate the process of data management implementation and data lineage documentation using ‘the data management star’ model by Data Crossroads, shown in Figure 3.

Figure 3. The data management star by Data Crossroads.

Step 1. Defining needs and requirements

Data management has different business stakeholders with specific needs concerning data and information. In Step 1, a company focuses on specifying a feasible scope for the data management initiative. The deliverables include a list of business drivers, stakeholders, and their most urgent information needs. Information is delivered in the form of reports and dashboards. The reports catalogue is one of the data lineage components. The scope of the data management initiative will limit the scope of the documentation of data lineage.

When the scope is clear, the corresponding data management tasks and responsibilities should be defined.

Step 2. Dividing tasks and responsibilities

The set of tasks and responsibilities belongs to the data management framework, which is a set of rules and roles. Rules include, but are not limited to, the data management strategy, policies, standards, processes, procedures, and plans. Roles should be linked to data management processes, tasks, and deliverables.

Data lineage is one of the deliverables of data management. Therefore, a company needs to specify and document its understanding of data lineage and its constituent components, as well as the way to document it (descriptive or automated). Accountabilities regarding data lineage documentation should be assigned to the relevant data management-related roles.

Step 3. Building the data management framework

The implementation of data management will be done in several steps.

Step 3.1. Specify data requirements

To meet the information requirements specified in Step 1, the corresponding data should be found, delivered, and processed. Very often, the relationship between raw data and information is not (fully) known. Data lineage is a means to fill this gap. Usually, data lineage documentation starts with the specification of existing business processes and their mapping to the data sets.

Step 3.2. Document business processes

Business process documentation is not considered to be a part of any of the data management sub-capabilities. Still, this is a required component of data lineage. The majority of companies begin their data lineage documentation with the analysis of business processes. The performance of business processes is closely related to systems and applications used in these processes.

Step 3.3. Document system and application landscape

Data transformation usually takes place in systems and applications. Documentation of applications and data flows is a part of information systems architecture. At the same time, these flows are mandatory components of data lineage.

Step 3.4. Develop conceptual, logical, and physical data models and link them with each other

Data flows can be documented at different levels of the data model: conceptual, logical, and physical. A company might document data flows/data lineage at any one of these levels or at a combination of them. These models and the links between them are deliverables of data modeling and data architecture. At the same time, they are mandatory components of data lineage. The links between different levels of data models are called either ‘vertical data lineage’ or ‘linkage’.
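To make the idea of vertical data lineage more tangible, here is a minimal sketch in Python. It assumes that the links between the levels are kept as plain mappings; all entity, attribute, and column names are invented for illustration and do not come from any real system or tool.

```python
# Hypothetical vertical lineage ("linkage") between the three model levels.
# Conceptual entity -> logical attributes.
conceptual_to_logical = {
    "Customer": ["Customer.name", "Customer.birth_date", "Customer.segment"],
}

# Logical attribute -> physical columns that implement it.
logical_to_physical = {
    "Customer.name": ["crm.cust.cust_nm"],
    "Customer.birth_date": ["crm.cust.dob"],
    # "Customer.segment" is deliberately left unmapped.
}


def physical_columns(concept: str) -> list[str]:
    """Follow the links from a conceptual entity down to physical columns."""
    columns = []
    for attribute in conceptual_to_logical.get(concept, []):
        columns.extend(logical_to_physical.get(attribute, []))
    return columns


def unlinked_attributes() -> list[str]:
    """List logical attributes that have no physical counterpart yet."""
    attributes = [a for attrs in conceptual_to_logical.values() for a in attrs]
    return [a for a in attributes if not logical_to_physical.get(a)]


print(physical_columns("Customer"))
print(unlinked_attributes())
```

The second function shows why the links themselves are a deliverable: an attribute without a physical counterpart is a gap in the lineage, even if each model level looks complete on its own.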

Step 3.5. Identify critical data elements

The definition of critical data elements is a state-of-the-art task. You can read more about practical techniques for the specification of critical data elements in this article. The key reason to apply the concept of critical data is the prioritization of data management initiatives, including building data quality checks and controls. The set of critical data elements is a deliverable of the data modeling sub-capability. Knowing data lineage is a mandatory prerequisite for specifying critical data elements.

The specification of data quality requirements and the corresponding checks and controls belongs to the deliverables of the data quality sub-capability. Data checks and controls are considered to be a component of data lineage.
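Why lineage is a prerequisite for critical data elements can be illustrated with a small sketch in Python. It assumes, purely for illustration, that every data element feeding a key report figure is a candidate critical data element, and that horizontal lineage is stored as a mapping from each element to its direct sources. The element names and the function `upstream_elements` are hypothetical.

```python
from collections import deque

# Hypothetical horizontal lineage: element -> elements it is derived from.
derived_from = {
    "report.revenue": ["dwh.sales.amount", "dwh.fx.rate"],
    "dwh.sales.amount": ["erp.invoice.net_amount"],
    "dwh.fx.rate": ["treasury.rates.eur_usd"],
}


def upstream_elements(element: str) -> set[str]:
    """Collect every element that feeds, directly or indirectly, into `element`."""
    seen: set[str] = set()
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, []):
            if source not in seen:  # also protects against cycles in the mapping
                seen.add(source)
                queue.append(source)
    return seen


# Candidate critical data elements behind one key report figure.
print(sorted(upstream_elements("report.revenue")))
```

Without the `derived_from` mapping, which is the documented lineage, there is nothing to traverse, and the candidates can only be guessed. The same traversal also indicates where data quality checks and controls would protect the report figure.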

Step 3.6. Assemble data lineage

Only when all the above-mentioned steps are done can data lineage be assembled. At this point, the company has de facto completed the implementation of the data management function.

Step 4. Intermediate assessment and gap analysis

This step is required to compare the desired results specified in Step 1 with those actually achieved. This is also the point where a maturity assessment of data management can be performed.
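In its simplest form, such a comparison is a set difference between what was requested in Step 1 and what has been delivered. The sketch below assumes that both are tracked as plain lists of deliverable names; the function `gap_analysis` and the sample deliverables are illustrative assumptions, not part of any method or tool.

```python
def gap_analysis(desired: set[str], achieved: set[str]) -> dict[str, set[str]]:
    """Compare the deliverables planned in Step 1 with those actually delivered."""
    return {
        "delivered": desired & achieved,  # planned and done
        "gaps": desired - achieved,       # planned but still open
        "unplanned": achieved - desired,  # done but never in scope
    }


# Invented example values.
desired = {"reports catalogue", "application landscape", "critical data elements"}
achieved = {"reports catalogue", "application landscape", "business glossary"}

result = gap_analysis(desired, achieved)
print(sorted(result["gaps"]))
```

The ‘gaps’ feed directly into the planning of further actions, while the ‘unplanned’ items are a useful signal that the scope has drifted.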

Step 5. Planning further actions

As soon as the company has achieved the desired results, it might want to extend the scope of its data management initiative, including the scope of data lineage to be documented.

I hope that by now you understand how the initial statement ‘The setup of data management function follows the logic of the documentation of data lineage’ came to be.

If you are interested in learning more, please read about our method here or in our book The Data Management Toolkit.