Data lineage is a complex concept. In this article, we investigate a data lineage model.
In ‘Data Lineage 101: What’s so special about data lineage?’ we discussed why data lineage triggers such an interest among business, IT, and data management professionals. Now it is time to dive into what data lineage actually is.
‘Everyone wants data lineage, but no one can explain what they exactly mean by that and what their expectations are,’ said one of my colleagues a few years ago. This sentence stuck in my mind, as this is precisely what is happening with data lineage nowadays.
So when someone starts talking to me about data lineage, I first ask: ‘What do you mean by “data lineage”? What is your understanding of it?’ Believe it or not, I have yet to meet a person whose definition and interpretation of data lineage perfectly matched mine or the others I have encountered over the years. The reason is that an aligned, unambiguous, and widely accepted definition of data lineage is hard to find. In this article, I would like to touch on the definition of data lineage and its key components. Let’s start with the definitions.
Defining ‘data lineage’
Let’s start by taking a look at reference industry guides and publications issued by DAMA International (DAMA-DMBOK), The Open Group (TOGAF), and the EDM (Enterprise Data Management) Council (DCAM).
Two key documents published by DAMA International contain definitions and information about data lineage: DAMA-DMBOK2 (ref. 1) and the DAMA Dictionary (ref. 2). The key challenge is that the definition of data lineage is ambiguous and intersects with other terms, such as ‘data flow,’ ‘integration architecture,’ and ‘data & information (value) chain.’
Let’s have a look at the following definitions provided by DAMA Dictionary:
Data flow is ‘the transfer of data between systems, applications, or data sets.’ 3
‘Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.’ 4
This part was pretty straightforward: data flow is the process of data transfer, and data lineage describes that process, including the alterations made along the way. But then I opened DAMA-DMBOK2 and found the following:
‘[…] data […] has lineage (i.e., a pathway along which it moves from its point of origin to its point of usage, sometimes called the data chain)’ 5.
‘Data flows are a type of data lineage documentation that depicts how data moves through business processes and systems. End-to-end data flows illustrate where the data originated, is stored, and used, and how it is transformed as it moves inside and between diverse processes and systems’6.
After reading all these definitions, I could hardly see the difference between data flow and data lineage. Can you?
Let’s leave the DAMA world for now and see what the EDM Council says about it.
DCAM
The EDM Council’s definition of ‘data lineage’ reads as follows:
Data lineage is ‘documentation of the sequence of movement and/or transformation of data as it flows between the consumer and the source(s).’ 7
CONCLUSIONS
This brings me to the following conclusion:
- ‘Data flow,’ ‘data lineage,’ and ‘data chain’ describe similar concepts of data movement and transformation. Therefore, these terms are often used interchangeably.
- Data lineage is a description of the path along which data flows from the point of its origin to the point of its use.
Still, the definitions say nothing about how to document data lineage. To understand how to document it, you first need to know which components constitute data lineage.
Data lineage components
To get some clarity on the key components of data lineage, let’s again check the aforementioned industry reference guides.
DAMA-DMBOK
DAMA-DMBOK2 states that:
‘Data flows map and document relationships between data and:
- Applications within a business process
- Data stores or databases in an environment
- Network segments (useful for security mapping)
- Business roles, depicting which roles have responsibility for creating, updating, using, and deleting data
- The location where local differences occur.’ 8
So, according to DAMA-DMBOK2, the key components of data flow / lineage are IT system components (applications, databases, network segments) and business processes.
DCAM
In the Standard Glossary of the EDM Council, you can find that ‘Data Lineage describes the chronology of ownership, custody, and data location. Data Lineage provides a visual mapping of the movement and changes in data from system to system. […] The complete lineage will document the full data flow and capture metadata about the movement and transformation of the data element. Lineage may include a mapping of the data controls’9.
So, according to the EDM Council, data lineage links such components as systems, data controls, ownership, custody, and metadata.
TOGAF
TOGAF 9.1 by The Open Group, the leading guide in enterprise architecture, stipulates that ‘The Data Flow view is concerned with storage, retrieval, processing, archiving, and security of data’10.
The TOGAF 9.1 definition seems to have nothing in common with the definitions of DAMA International and the EDM Council. Instead, it refers to the concept of the data lifecycle (which is a separate topic and will not be discussed in this article).
CONCLUSIONS
So, after the analysis, we may conclude that:
- There is no agreed list of components that constitute data lineage.
- Nevertheless, these are the components that you should take into account when documenting data lineage:
  - IT systems (applications, databases, network segments)
  - Data elements
  - Business processes, including different functional roles (data-related and non-data-related)
  - Data controls.
Figure 1 is a visualization of these conclusions:

Figure 1. Key components of data lineage
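The components listed above can also be written down as a small data model. The following is only a minimal sketch in Python, not a structure prescribed by DAMA-DMBOK2 or DCAM: all class names, field names, and sample values (`ITSystem`, `LineageLink`, `CRM`, `DWH`, and so on) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ITSystem:
    name: str
    kind: str  # e.g. "application", "database", "network segment"


@dataclass(frozen=True)
class DataElement:
    name: str
    system: ITSystem  # the system in which the element resides


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    roles: Tuple[str, ...] = ()  # functional roles involved in the process


@dataclass(frozen=True)
class DataControl:
    name: str


@dataclass
class LineageLink:
    """One documented hop of data between two data elements."""
    source: DataElement
    target: DataElement
    transformation: str
    process: Optional[BusinessProcess] = None
    controls: List[DataControl] = field(default_factory=list)


def describe(link: LineageLink) -> str:
    """Render a link as 'system.element -> system.element (transformation)'."""
    return (
        f"{link.source.system.name}.{link.source.name} -> "
        f"{link.target.system.name}.{link.target.name} "
        f"({link.transformation})"
    )


# Hypothetical sample data: a customer key copied from a CRM into a warehouse.
crm = ITSystem("CRM", "application")
dwh = ITSystem("DWH", "database")
link = LineageLink(
    source=DataElement("customer_id", crm),
    target=DataElement("cust_key", dwh),
    transformation="rename",
    process=BusinessProcess("Customer onboarding", roles=("data steward",)),
    controls=[DataControl("completeness check")],
)

print(describe(link))
```

Each `LineageLink` ties together all four component types, which is one possible way to make explicit what a single documented step of lineage should contain.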
Once again, the key points that you should keep in mind when it comes to discussing data lineage:
- Data lineage is a representation of the path along which data flows from the point of its origin to the point of its usage.
- Data lineage is used to design and describe processes of data transformation and processing.
- Data lineage is recorded by representing a set of interlinked components such as data (elements), business processes, IT systems and applications, and data controls. These components can be presented at different levels of abstraction and detail.
Such lineage is usually also called ‘horizontal’ data lineage.
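To make the idea of a path ‘from the point of origin to the point of usage’ more tangible, here is a minimal sketch in Python that walks such a path backwards from a consumer to its sources. It is an illustration under stated assumptions, not a reference implementation: the element names (`report.revenue`, `dwh.sales_amount`, and so on), the transformations, and the function `trace_paths` are all hypothetical, and the sketch assumes the lineage contains no cycles.

```python
from typing import Dict, List, Tuple

# Hypothetical lineage records:
# target element -> list of (source element, transformation applied)
Edges = Dict[str, List[Tuple[str, str]]]

edges: Edges = {
    "report.revenue": [("dwh.sales_amount", "sum per month")],
    "dwh.sales_amount": [
        ("erp.invoice_total", "currency conversion"),
        ("crm.discount", "subtract"),
    ],
}


def trace_paths(element: str, edges: Edges) -> List[List[str]]:
    """Return every path that ends at `element`, listed origin first.

    An element with no recorded sources is treated as a point of origin.
    Assumes the lineage graph is acyclic.
    """
    sources = edges.get(element, [])
    if not sources:
        return [[element]]
    paths: List[List[str]] = []
    for source, _transformation in sources:
        for path in trace_paths(source, edges):
            paths.append(path + [element])
    return paths


paths = trace_paths("report.revenue", edges)
for path in paths:
    print(" -> ".join(path))
```

With the sample records above, the sketch finds two paths into the report, one starting in the hypothetical ERP system and one in the CRM system. Real lineage would also carry the business processes and data controls attached to each hop.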
In ‘Data Lineage 101: What’s so special about data lineage?’, we discussed that legislative requirements are among the strongest motivators for companies to implement data lineage.
In the next part of the blog series (‘Data Lineage 103’), I will discuss which components of data lineage are required by different pieces of legislation.
——————————————————————————————————————-
References
1. DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017.
2. DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011.
3. DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011, p.75.
4. DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011, p.78.
5. DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p.28.
6. DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p.107.
7. Enterprise Data Management Council. The Standard Glossary of Data Management Concepts, version 0.2.1, 2017, p.9.
8. DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p.108.
9. Enterprise Data Management Council. The Standard Glossary of Data Management Concepts, version 0.2.1, 2017, p.9.
10. The Open Group. “TOGAF Version 9.1, The Open Group Standard” no. G116, 2011, p.426.