Knowledge Graphs, Data Lineage, and Metadata Management: Architecture and Technology

Now, we will look at the similarities and differences in data architecture and technology required to realize these three capabilities.

Metadata management (MM)

As discussed earlier, metadata comes in various types, and different types come from different sources. Examples of metadata sources are business and data modeling solutions, enterprise architecture solutions, data catalogs, relationship repositories, IT applications and databases, and data transformation tools; the list is not exhaustive. The biggest challenge is collecting and integrating metadata from these sources.

To meet this challenge, a company should have a well-developed metadata architecture in place. Depending on the number of metadata sources, the scope of the metadata initiative, and metadata volumes, a company should opt for a centralized or decentralized metadata architecture, as shown in Figure 1. A single metadata repository or lake is an example of a centralized architecture. Shared metadata repositories represent a decentralized metadata architecture. Metadata repositories can be built using both relational and graph databases.

Figure 1: A high-level view of metadata architecture.

The task of metadata management is to plan, collect, integrate, and maintain metadata stored in different repositories.
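To make the centralized option concrete, here is a minimal sketch of a single repository that integrates metadata from two sources. The class, the source names ("data_catalog", "crm_db"), and the metadata kinds are all invented for illustration; a real repository would sit on a relational or graph database rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataObject:
    source: str  # system the metadata came from (hypothetical names)
    kind: str    # e.g. "business_term", "table", "column"
    name: str


class CentralMetadataRepository:
    """In-memory stand-in for a single, centralized metadata repository."""

    def __init__(self):
        self._objects = {}

    def ingest(self, objects):
        # Key on (source, kind, name) so re-ingesting a source adds no duplicates.
        for obj in objects:
            self._objects[(obj.source, obj.kind, obj.name)] = obj

    def find(self, kind):
        # Return the names of all objects of one kind, across all sources.
        return sorted(o.name for o in self._objects.values() if o.kind == kind)

    def __len__(self):
        return len(self._objects)


catalog = [MetadataObject("data_catalog", "business_term", "Customer")]
database = [
    MetadataObject("crm_db", "table", "customers"),
    MetadataObject("crm_db", "column", "customers.email"),
]

repo = CentralMetadataRepository()
repo.ingest(catalog)
repo.ingest(database)
repo.ingest(database)  # a second scan of the same source changes nothing

print(repo.find("table"))
print(len(repo))
```

In a decentralized setup, each source would keep its own repository of this kind, and the integration step would query across them instead of copying everything into one store.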

Knowledge graph (KG)

A distinguishing feature of a knowledge graph is that it connects data and metadata stored in different repositories. However, a company should take into account the volumes of data and metadata to be stored in a repository or a set of repositories. It can use either centralized or decentralized repositories, preferably built on graph databases, as demonstrated in Figure 2.

Figure 2: A high-level view of knowledge graph architecture.
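The idea of connecting data and metadata can be sketched with a handful of subject-predicate-object triples. This is only an illustration, assuming invented node and relationship names; a production knowledge graph would live in a graph database rather than a Python list.

```python
from collections import defaultdict

# (subject, predicate, object) triples; every name here is hypothetical.
triples = [
    ("Customer", "is_defined_by", "business_glossary"),          # business metadata
    ("Customer", "is_implemented_as", "crm_db.customers"),       # technical metadata
    ("crm_db.customers", "has_column", "crm_db.customers.email"),
    ("customer_42", "is_instance_of", "Customer"),               # a data record
]


def build_index(triples):
    """Index outgoing edges by subject for quick lookups."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def neighbours(index, node, predicate=None):
    """Objects reachable from node in one step, optionally filtered by predicate."""
    return [o for p, o in index.get(node, []) if predicate is None or p == predicate]


index = build_index(triples)
print(neighbours(index, "Customer"))
print(neighbours(index, "customer_42", "is_instance_of"))
```

The same structure links a data record (customer_42) to a business term and on to its physical table, which is the cross-repository connection described above.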

Data lineage (DL)

Data lineage is a metadata construct, so it is implemented within the metadata architecture landscape. Data lineage enables metadata management to integrate metadata and to trace and visualize data movements, transformations, and processes across various repositories, as shown in Figure 3. Implementing data lineage requires several software components: scanners that read and ingest metadata into a metadata repository, visualization software for presenting and analyzing metadata objects and their relationships, and metadata and relationship repositories.

Figure 3: A high-level view of data lineage architecture.
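Tracing data movements comes down to walking the relationships stored in the repository. The sketch below assumes lineage is recorded as (source, target, transformation) edges; the data set names and transformations are made up for illustration.

```python
from collections import deque

# (source, target, transformation) edges; all names are hypothetical.
lineage_edges = [
    ("crm_db.customers", "staging.customers", "copy"),
    ("staging.customers", "dwh.dim_customer", "deduplicate"),
    ("erp_db.orders", "dwh.fact_orders", "aggregate"),
    ("dwh.dim_customer", "report.customer_revenue", "join"),
    ("dwh.fact_orders", "report.customer_revenue", "join"),
]


def upstream(edges, target):
    """Return every data set that feeds target, nearest first (breadth-first)."""
    parents = {}
    for src, dst, _transformation in edges:
        parents.setdefault(dst, []).append(src)

    seen, queue = [], deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


print(upstream(lineage_edges, "report.customer_revenue"))
```

A visualization tool would draw the same edges as a diagram; the traversal is what answers questions such as "which sources does this report depend on?".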

The method of implementation

All three capabilities share the same challenge: how to retrieve and integrate metadata. Two methods exist: descriptive and automated.

A descriptive method is a method to retrieve and record metadata and relationships manually.

The descriptive method is suited to documenting business metadata and part of the technical metadata, namely the metadata that belongs to the business, conceptual/semantic, and logical/solution layers.

An automated method is a method to record metadata and relationships by implementing automated processes to scan and ingest metadata into a repository.

This method applies to technical metadata at the physical level.
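As an illustration of the automated method, the following sketch scans a database and extracts its physical metadata (tables, columns, and column types). It uses an in-memory SQLite database from Python's standard library as a stand-in for a real source system; the table and the `scan_sqlite` helper are assumptions made for this example, and commercial scanners cover far more than this.

```python
import sqlite3


def scan_sqlite(conn):
    """Read table and column metadata from a SQLite connection."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append((table, column, col_type))
    return rows


# Hypothetical source system with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

metadata = scan_sqlite(conn)
for row in metadata:
    print(row)
```

The resulting rows are what a scanner would ingest into the metadata repository.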

In my book, “Data Lineage from a business perspective,” I have described these two methods in depth.

Even if a company uses the automated method, a lot of manual work is still required to integrate metadata between the logical and physical layers. Some machine learning algorithms can assist in matching metadata. However, the final mapping remains a task for a human.
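The division of labor between tool and person can be sketched as follows. Note that this example uses plain string similarity from Python's standard library as a simple stand-in, not a machine learning algorithm, and that the attribute names, column names, and threshold are invented. The function only proposes candidates; a person still has to confirm or correct each one.

```python
from difflib import SequenceMatcher


def normalise(name):
    # Make logical and physical names comparable: lower case, spaces for underscores.
    return name.lower().replace("_", " ").strip()


def suggest_matches(logical, physical, threshold=0.6):
    """Propose the most similar physical column for each logical attribute.

    Returns None for an attribute when no candidate reaches the threshold.
    """
    suggestions = {}
    for attribute in logical:
        scored = [
            (SequenceMatcher(None, normalise(attribute), normalise(col)).ratio(), col)
            for col in physical
        ]
        best_score, best_column = max(scored)
        suggestions[attribute] = best_column if best_score >= threshold else None
    return suggestions


# Hypothetical logical attributes and physical column names.
logical_attributes = ["Customer Email", "Date of Birth"]
physical_columns = ["customer_email", "birth_dt", "cust_id"]

suggestions = suggest_matches(logical_attributes, physical_columns)
for attribute, column in suggestions.items():
    print(f"{attribute!r} -> {column!r} (to be reviewed by a person)")
```

Names that differ only in formatting are matched reliably, while abbreviated physical names such as birth_dt are exactly the cases where human review is needed.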

Data fabric as a common distributed data architecture approach

A company can use different data architectures to implement these capabilities. While deciding on the architecture, a company should anticipate the volume of data and metadata to be stored and processed in the foreseeable future.

Nowadays, many data management professionals talk about “data fabric”, defined as “a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes”, as a solution for implementing the three capabilities: MM, KG, and DL. Figure 4 shows a schematic of the data fabric idea.

Figure 4: A high-level overview of data fabric architecture.

Data fabric allows for collecting data from different sources. It includes optimized data chains to process, transform, and deliver data to various data users. Data fabric shares data assets. Among other things, data assets include (meta)data sets and knowledge graphs. The data lineage capability ensures the transparency of data processing and metadata integration. Data fabrics can be realized in various environments: on-premises, cloud, and hybrid.

So, data fabric architecture unites all three capabilities discussed in this article. It is worth mentioning that the data fabric architecture can’t be realized in a single product or even on a single platform.

Conclusions

MM, KG, and DL capabilities deal with similar metadata objects and their relationships from the same data repositories:
- The metadata management capability cares about all types of metadata, including business, technical, and operational.
- The knowledge graph capability maps data, related business and technical metadata, and relationships between data and metadata objects at various abstraction levels.
- The data lineage capability stores and visualizes business and technical metadata objects, relationships, and data transformation rules.
MM, KG, and DL can be implemented using various data architecture types: centralized, decentralized (distributed), and hybrid.
MM and DL repositories can use relational and graph databases, while KG initiatives mainly need a graph database.
Each of these three capabilities has its role in managing data and metadata at different abstraction levels:
- The metadata management capability plans, designs, implements, and maintains metadata of various types in the corresponding (meta)data repositories.
- The data lineage capability plans, designs, implements, and maintains the integration and visualization of metadata and its relationships stored in various (meta)data repositories.
- The knowledge graph capability plans, designs, implements, and maintains the integration of relationships between data and metadata from various (meta)data repositories.
Documentation and integration of data and metadata are performed using automated and descriptive methods.
A data fabric architecture can be used to implement and unite all three capabilities.
