The business environment has become dynamic, uncertain, and volatile. Many companies understand the role of data in gaining a competitive advantage. Business people often declare that “data is their company’s asset” and that they need “to get value from it” by “becoming data-driven.” Both established and new data management capabilities should help in reaching these goals. Knowledge graphs, metadata management, and data lineage are examples of new capabilities that are being actively discussed and developed. Implementing these capabilities poses several challenges:

  • Definitions of these capabilities are ambiguous and depend on the context.
  • These capabilities intersect each other to a great extent and have many dependencies.
  • Their implementation is time- and resource-consuming.

This article aims to:

  • Illustrate relationships between key concepts that form the core of these capabilities: data, metadata, information, and knowledge
  • Analyze and compare knowledge graphs (KG), data lineage (DL), and metadata management (MM) by comparing:
    • Definitions and structures of each capability
    • Business drivers that motivate companies to establish them
    • Architecture and technology
    • Use cases
  • Demonstrate differences and similarities between these three capabilities

We will start by analyzing the core concepts.

Data, metadata, information, and knowledge

Data, metadata, information, and knowledge are core subjects of KG, DL, and MM, as demonstrated in Figure 1.

Figure 1: (Meta)data, information, and knowledge are core subjects of MM, KG, and DL.

These concepts have multiple ambiguous definitions. It is essential to align our understanding of these concepts for the purpose of this article.

Let’s start with definitions.

Data is the physical or electronic representation of signals or facts “in a manner suitable for communication, interpretation, or processing by human beings or by automatic means.” For example, “50” is data. Does it have meaning in this form? No.

If we add “Celsius (C),” we understand that we are talking about temperature. By adding “C,” we created a context that we can understand. In this case, “Celsius” is metadata that creates the context. Thus, metadata is data that defines and describes other data in a particular context.

Information is data in a context that explains its meaning and relational connections. So, “50 C” is information because we can explain its meaning and demonstrate its relations with other data; for example, “50 C” is a daytime temperature in the Sahara. Information answers the questions “who,” “what,” “where,” and “when.”
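
The step from data to information can be sketched in a few lines of Python. This is only an illustration of the definitions above; the class name `Information` and its fields are my own assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Information:
    """A raw value plus the metadata and context that give it meaning."""
    value: float  # data: the bare value, e.g. 50
    unit: str     # metadata: tells us how to read the value
    what: str     # context: what was measured
    where: str    # context: where it was measured

    def describe(self) -> str:
        return f"{self.value:g} {self.unit} is the {self.what} in {self.where}"


raw_value = 50  # data alone: no meaning yet

info = Information(
    value=raw_value,
    unit="C",
    what="daytime temperature",
    where="the Sahara",
)
print(info.describe())
```

The bare `raw_value` answers no question on its own; only the combination with the unit and the “what” and “where” fields turns it into a statement that a reader can interpret.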

Knowledge is a collection of information. It answers the question “why.” Knowledge creates a basis for information analysis. It allows us to understand principles. For example, gathering historical data about the maximum daytime temperature in a specific region during the summer may help us choose the destination for our next vacation.

Wisdom is the result of interpolating and extrapolating processes based on knowledge. In this context, interpolation means an estimation of a value within a sequence of known values. Extrapolation refers to estimating an unknown value based on extending a known sequence of values or facts. Analyzing historical temperature trends may lead to building weather forecasts.
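
The difference between interpolation and extrapolation can be shown with a minimal sketch. It assumes a straight-line trend between two known points, which is the simplest possible model and not how real weather forecasts are built; the years and temperatures are hypothetical values chosen for illustration.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Hypothetical maximum daytime temperatures (C) for two known years.
known = {2010: 40.0, 2020: 42.0}

# Interpolation: 2015 lies inside the range of known years.
interpolated = linear_estimate(2010, known[2010], 2020, known[2020], 2015)

# Extrapolation: 2030 lies outside the range of known years.
extrapolated = linear_estimate(2010, known[2010], 2020, known[2020], 2030)

print(f"2015 (interpolated): {interpolated:.1f} C")
print(f"2030 (extrapolated): {extrapolated:.1f} C")
```

The same function serves both cases; what changes is whether the requested point falls inside or outside the known sequence. Extrapolated values carry more risk because nothing in the known data confirms that the trend continues.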

Now, let’s analyze the relationships between these concepts. You may be familiar with the well-known diagram by Russell L. Ackoff, shown in Figure 2. According to him, the content of the human mind can be classified into the following categories: data, information, knowledge, and wisdom. He indicated that the first three concepts, data, information, and knowledge, relate to the past. They demonstrate what has been known. Wisdom assists people in creating the future.

We can transfer this model from the context of the human mind to a company’s data processing. Figure 2 demonstrates the steps. First, you need to gather data and collect related metadata. By adding metadata to data, we create context and produce information. By adding semantics to information and analyzing it, we create knowledge that allows us to understand rules and principles and apply them to forecast the future.

Figure 2: The relationship between (meta)data, information, knowledge, and wisdom.

Now that we know the definitions of data, metadata, information, and knowledge and the relationships between them, we can proceed with the analysis of three data management (DM) capabilities: metadata management, knowledge graphs, and data lineage.

MM, KG, and DL: definitions and structure

Metadata management (MM)

DAMA-DMBOK2 defines metadata management as “Planning, implementation, and control activities to enable access to high quality, integrated metadata.” Metadata is a complex concept. Therefore, defining the scope of metadata management is not an easy task. Metadata can be classified differently depending on its subject area and user groups. I use the DAMA-DMBOK2 metadata classification in my practice. Figure 3 demonstrates three categories of metadata: business, technical, and operational.

“Business Metadata focuses largely on the content and conditions of the data and includes details related to data governance.” Data models (conceptual and logical), business terms and definitions, data owners, and business rules are examples of business metadata.

“Technical Metadata provides information about technical details of data, the systems that store data, and the processes that move it within and between systems.” Physical data models, database table and column properties, and ETL jobs represent technical metadata.

“Operational Metadata describes details of the processing and accessing data.” Job execution logs and error logs are examples.
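
The three categories can be captured in a small data structure. The sketch below is a hypothetical mini-catalog, not a DAMA-DMBOK2 artifact: the names `MetadataCategory`, `MetadataItem`, and `CATALOG` are assumptions made for illustration, while the example items come from the descriptions above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class MetadataCategory(Enum):
    BUSINESS = "business"
    TECHNICAL = "technical"
    OPERATIONAL = "operational"


@dataclass(frozen=True)
class MetadataItem:
    name: str
    category: MetadataCategory


# Example items from the text; the catalog itself is hypothetical.
CATALOG = [
    MetadataItem("business term definition", MetadataCategory.BUSINESS),
    MetadataItem("data owner", MetadataCategory.BUSINESS),
    MetadataItem("physical data model", MetadataCategory.TECHNICAL),
    MetadataItem("ETL job", MetadataCategory.TECHNICAL),
    MetadataItem("job execution log", MetadataCategory.OPERATIONAL),
    MetadataItem("error log", MetadataCategory.OPERATIONAL),
]


def group_by_category(items):
    """Return a dict that maps each category to the names of its items."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.category].append(item.name)
    return dict(grouped)


grouped = group_by_category(CATALOG)
for category, names in grouped.items():
    print(f"{category.value}: {', '.join(names)}")
```

A repository that stores only the items of the technical category would cover just one third of this classification.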

Data management professionals often unreasonably narrow the scope of metadata management to technical metadata. Even the ISO/IEC 11179 Metadata Registry standard provides recommendations for a metadata repository that documents metadata at the physical level. However, the scope of metadata management can be defined much more broadly to include all metadata types.

The above-discussed definition of metadata management highlights its key tasks:

  • Plan, implement, and maintain metadata
  • Integrate metadata of various types

Figure 3: Metadata classification according to DAMA-DMBOK2.

Knowledge graph (KG)

I investigated multiple sources for this blog and did not find two definitions of a knowledge graph that match. This suggests that professionals understand the concept differently and in different contexts. These definitions can be summarized as follows: a knowledge graph is a graph that describes real-world facts and the relationships between them in a human- and machine-understandable format using semantic models and graph databases.

Let’s elaborate on this definition using a simplified example, presented in Figure 4. A company, XYZ, bought a computer, an Acer Swift 5. In this case, “XYZ” and “Acer Swift 5” are data. The labels “LLC” (limited liability company) and “Laptop,” added to “XYZ” and “Acer Swift 5,” respectively, are metadata. This is a basic model of a knowledge graph.

Figure 4: The simplified example of a basic knowledge graph.
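
The basic graph in Figure 4 can be written down as a list of subject-predicate-object triples. This is a minimal sketch in plain Python rather than a graph database; the predicate names `has_label` and `bought` and the helper functions are assumptions chosen for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("XYZ", "has_label", "LLC"),              # metadata about the company
    ("Acer Swift 5", "has_label", "Laptop"),  # metadata about the product
    ("XYZ", "bought", "Acer Swift 5"),        # relationship between data
]


def objects(subject, predicate):
    """Return all objects linked to a subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def label(node):
    """Return the first label of a node, or 'unlabeled' if it has none."""
    found = objects(node, "has_label")
    return found[0] if found else "unlabeled"


def readable_facts():
    """Render every non-label triple with the labels of both ends."""
    return [
        f"{s} ({label(s)}) {p} {o} ({label(o)})"
        for s, p, o in triples
        if p != "has_label"
    ]


for fact in readable_facts():
    print(fact)
```

The labels stored as separate triples are what turn the bare fact “XYZ bought Acer Swift 5” into a statement about a company buying a laptop.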

We can continue creating the knowledge graph by adding other metadata, as shown in Figure 5.

Figure 5: The simplified example of a taxonomy.

By adding further metadata, “Legal entity” and “Customer type,” we created a taxonomy. A taxonomy is a hierarchical classification scheme. Taxonomies represent collections of topics with “subcategory_of” relationships between them. We have created a hierarchy of categories: LLC is one of the legal entity types. A legal entity is one of the XYZ customer types.
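
A taxonomy of this kind can be sketched as a mapping from each category to its parent. The code below is an illustration only; the names `SUBCATEGORY_OF`, `ancestors`, and `is_subcategory_of` are my own and not part of any taxonomy standard.

```python
# "subcategory_of" relationships from the example: child -> parent.
SUBCATEGORY_OF = {
    "LLC": "Legal entity",
    "Legal entity": "Customer type",
}


def ancestors(category):
    """Follow subcategory_of links upward and return all broader categories."""
    chain = []
    while category in SUBCATEGORY_OF:
        category = SUBCATEGORY_OF[category]
        chain.append(category)
    return chain


def is_subcategory_of(category, broader):
    """Check whether a category sits anywhere below a broader category."""
    return broader in ancestors(category)


print(ancestors("LLC"))
print(is_subcategory_of("LLC", "Customer type"))
```

Because the relationship is hierarchical, a fact recorded at the lowest level (XYZ is an LLC) lets us derive that XYZ also belongs to the broader categories above it.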

If we create a similar taxonomy for the “Acer Swift 5” and “Laptop” entities and link the two taxonomies, we build an ontology, as demonstrated in Figure 6. An ontology is also a classification scheme; it describes the categories and their relations in a business domain.

Figure 6: The simplified schematic of an ontology.

All three schemes discussed above (basic, taxonomy, and ontology) represent knowledge graphs. The most important thing to remember is that knowledge graphs link data and metadata. Metadata creates semantics, adds meaning to data, and puts data into a context.

Data lineage (DL)

The concept of data lineage, like that of knowledge graphs, has multiple definitions in various contexts. My practical experience with data lineage led me to formulate the following definition: data lineage is a description of data movements and transformations at various abstraction levels along data chains and of the relationships between data at these levels.

In my book, “Data Lineage from a Business Perspective,” I presented a model of data lineage that describes data movement at four abstraction levels and links these levels. The levels, presented in Figure 7, are:

  • Business: IT assets, business processes, and people
    You document the movement of data at the level of information technology (IT) assets and link them to business processes, people, and applications.
  • Conceptual/semantic: business subject areas, data entities, restrictions, and constraints
    The conceptual/semantic level is the highest level of a data model. You describe the movement of data between data entities.
  • Logical/solution: data attributes, business rules
    At this data model level, you describe data movements between data attributes.
  • Physical: database (db) schemas at the level of tables/columns (relational db) or labels/nodes (graph db), and ETL (extract, transform, load) jobs
    Data movements are documented between tables and columns in databases.

Figure 7: The metamodel of data lineage.

There are several misunderstandings associated with the term “data lineage”:

  1. Depending on the documentation subject, data lineage can be classified as data value lineage or metadata lineage.
    Data value lineage describes the movement of data instances along data chains. Business stakeholders require this type of data lineage. For example, the monthly revenue in a financial report amounts to EUR 1 million. They want to know all the original contracts and the transformations applied along the data pipeline to derive this amount. Many business stakeholders dream about such a lineage. However, it is hardly achievable in practice.
    Metadata lineage describes data processing: the movement and transformation of data using metadata, not the data itself. Such lineage is a combination of business and technical metadata. In practice, data lineage initiatives focus on documenting metadata lineage.
    In our daily language, we use the term “data lineage” but mean “metadata lineage.” This can be confusing for many business stakeholders.
  2. Many professionals mistakenly limit data lineage to the physical/technical level.
    Data lineage can be documented at four abstraction levels, and these levels should be linked, as we discussed above.
  3. Data lineage maps objects not only along data chains at the same abstraction level but also between different levels.
    Depending on the direction of documentation, we recognize horizontal and vertical data lineage.
    Horizontal data lineage demonstrates how data flows from the point of origin to the point of usage.
    Horizontal data lineage can be documented at each of the four levels.
    Vertical data lineage integrates data lineage components between different levels.
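
Horizontal and vertical lineage can be illustrated with a small sketch. It is metadata lineage in the sense described above: it records which columns feed which and what transformation is applied, without touching any data values. All table, column, and attribute names and both transformations are hypothetical, and only two of the four levels (physical and logical) are shown to keep the example short.

```python
from collections import deque

# Horizontal lineage at the physical level:
# target column -> list of (source column, transformation applied).
HORIZONTAL = {
    "report.monthly_revenue": [("dwh.revenue.amount_eur", "sum per month")],
    "dwh.revenue.amount_eur": [("crm.contracts.amount", "convert currency to EUR")],
}

# Vertical lineage: physical column -> logical data attribute.
VERTICAL = {
    "report.monthly_revenue": "Monthly revenue",
    "dwh.revenue.amount_eur": "Revenue amount",
    "crm.contracts.amount": "Contract amount",
}


def trace_upstream(target):
    """Walk horizontal lineage from a point of usage back to its origins."""
    steps = []
    queue = deque([target])
    seen = {target}
    while queue:
        node = queue.popleft()
        for source, transformation in HORIZONTAL.get(node, []):
            steps.append((source, transformation, node))
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return steps


steps = trace_upstream("report.monthly_revenue")
for source, transformation, node in steps:
    print(f"{VERTICAL[source]} [{source}] --{transformation}--> {VERTICAL[node]} [{node}]")
```

Walking `HORIZONTAL` answers the question of where a report figure comes from, while `VERTICAL` translates each physical column into a term that a business stakeholder recognizes.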

Now, we can draw several conclusions regarding the similarities and differences in the definitions of metadata management (MM), knowledge graphs (KG), and data lineage (DL).

Conclusions

  1. Metadata is the core subject of MM, KG, and DL. Only KG can demonstrate the relationships between data and metadata.
  2. All three capabilities focus on metadata integration.
  3. Only DL aims to describe data transformations expressed in business rules and ETLs.