Knowledge Graphs, Data Lineage, and Metadata Management: Introduction
The business environment has become very dynamic, uncertain, and volatile. Many companies understand the role of data in getting a competitive advantage. Many business people declare that “data is their company’s asset,” and they need “to get value from it” by “becoming data-driven.” Many established and new data management capabilities should assist in reaching these goals. Knowledge graphs, metadata management, and data lineage are examples of these new capabilities that have been actively discussed and developed. There are some challenges associated with the implementation of these capabilities:
Definitions of these capabilities are ambiguous and depend on the context.
These capabilities intersect each other to a great extent and have many dependencies.
Their implementation is time- and resource-consuming.
This article aims to:
Illustrate relationships between key concepts that form the core of these capabilities: data, metadata, information, and knowledge
Analyze and compare knowledge graphs (KG), data lineage (DL), and metadata management (MM) by comparing:
Definitions and structures of each capability
Business drivers that motivate companies to establish them
Architecture and technology
Demonstrate differences and similarities between these three capabilities
We will start with analyzing the core concepts.
Data, metadata, information, and knowledge
Data, metadata, information, and knowledge are core subjects of KG, DL, and MM, as demonstrated in Figure 1.
Figure 1: Meta(data), information, and knowledge are core subjects of MM, KG, and DL.
These concepts have multiple ambiguous definitions. It is essential to align our understanding of these concepts for the purpose of this article.
Information is data in a context that explains its meaning and relational connections. So, “50 C” is information because we can explain its meaning and demonstrate its relations with other data; for example, “50 C” is a day temperature in the Sahara. Information answers the questions “who,” “what,” “where,” and “when.”
Knowledge is a collection of information. It gives answers to the question “why.” Knowledge creates a basis for information analysis. It allows us to understand principles. For example, gathering historical data about the maximum day temperature in a specific region during the summer may assist us in choosing the destination for our next vacation.
Wisdom is the result of interpolating and extrapolating processes based on knowledge. In this context, interpolation means an estimation of a value within a sequence of known values. Extrapolation refers to estimating an unknown value based on extending a known sequence of values or facts. Analyzing historical temperature trends may lead to building weather forecasts.
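To make the difference between interpolation and extrapolation concrete, here is a minimal sketch. It assumes a simple straight-line estimate, and the temperature values and the helper name `linear_estimate` are hypothetical and chosen only for illustration; real weather forecasting uses far more sophisticated models.

```python
# Hypothetical maximum day temperatures (degrees Celsius) by year; not real measurements.
known = {2018: 44.0, 2020: 46.0, 2022: 48.0}


def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Interpolation: 2019 lies inside the known sequence, between 2018 and 2020.
interpolated_2019 = linear_estimate(2019, 2018, known[2018], 2020, known[2020])

# Extrapolation: 2024 lies outside the known sequence, so we extend the last segment.
extrapolated_2024 = linear_estimate(2024, 2020, known[2020], 2022, known[2022])

print(interpolated_2019)
print(extrapolated_2024)
```

The same function serves both purposes; what differs is whether the requested point falls inside or outside the range of known values.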
Now, let’s analyze the relationships between these concepts. You may be familiar with the well-known diagram by Russell L. Ackoff, shown in Figure 2. According to him, the content of the human mind can be classified into the following categories: data, information, knowledge, and wisdom. He indicated that the first three concepts (data, information, and knowledge) relate to the past. They demonstrate what has been known. Wisdom assists people in creating the future.
We can change the human mind’s context to the company’s data processing. Figure 2 demonstrates these steps. First, you need to gather data and collect related metadata. By adding metadata to data, we create context and produce information. Adding semantics to information and analyzing it, we create knowledge that allows us to understand and apply rules and principles to forecasting the future.
Figure 2: The relationship between (meta)data, information, knowledge, and wisdom.
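The first step in this chain, adding metadata to data to produce information, can be sketched in a few lines. This is only an illustration built on the temperature example used earlier; the dictionary keys `what` and `where` and the function name `to_information` are assumptions of this sketch, not part of any standard.

```python
# Raw data: a value without any context.
raw_data = "50 C"

# Metadata: the context that explains what the value means (hypothetical keys).
metadata = {"what": "day temperature", "where": "Sahara"}


def to_information(value, context):
    """Combine a raw value with its metadata into a statement a reader can interpret."""
    return f"{value}: {context['what']} in {context['where']}"


information = to_information(raw_data, metadata)
print(information)
```

Without the metadata dictionary, the string "50 C" could just as well be an oven setting or a product code.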
Now we know the definitions of and relationships between data, metadata, information, and knowledge and can proceed with the analysis of three data management (DM) capabilities: metadata management, knowledge graphs, and data lineage.
Data management professionals often unreasonably narrow the scope of metadata management to technical metadata. Even the ISO/IEC 11179 Metadata Registry standard provides recommendations for a metadata repository that documents metadata at the physical level. However, we see that the scope of metadata management can be defined much more broadly and include all metadata types.
The above-discussed definition of metadata management highlights its key tasks:
Plan, implement, and maintain metadata
Integrate metadata of various types
Figure 3: Metadata classification according to DAMA-DMBOK2.
Knowledge graph (KG)
I investigated multiple sources for this blog and did not find two that define a knowledge graph in the same way. It means that many professionals understand this concept differently and in different contexts. These definitions can be summarized as follows: a knowledge graph is a graph that describes real-world facts and the relationships between them in a human- and machine-understandable format using semantic models and graph databases.
Let’s elaborate on this definition using a simplified example, presented in Figure 4. A company, XYZ, bought a computer, an Acer Swift 5. In this case, “XYZ” and “Acer Swift 5” are data. The labels “LLC” (limited liability company) and “Laptop,” added to “XYZ” and “Acer Swift 5,” respectively, are metadata. This is a basic model of a knowledge graph.
Figure 4: The simplified example of a basic knowledge graph.
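The basic model above can be sketched as a labeled graph stored in plain Python structures. This is a minimal illustration, not a graph database; the predicate name `bought` and the helper `describe` are my own choices for this sketch.

```python
# Data nodes mapped to their metadata labels, as in the example above.
labels = {"XYZ": "LLC", "Acer Swift 5": "Laptop"}

# Facts stored as (subject, predicate, object) triples.
# The predicate name "bought" is an assumption made for this sketch.
triples = [("XYZ", "bought", "Acer Swift 5")]


def describe(triple):
    """Render a triple together with the metadata label of each node."""
    subject, predicate, obj = triple
    return f"{subject} ({labels[subject]}) {predicate} {obj} ({labels[obj]})"


description = describe(triples[0])
print(description)
```

Even in this tiny form, the structure links data ("XYZ") with metadata ("LLC"), which is what makes the fact readable for both people and machines.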
We can continue creating the knowledge graph by adding other metadata, as shown in Figure 5.
Figure 5: The simplified example of a taxonomy.
By adding additional metadata, “Legal entity” and “Customer type,” we created a taxonomy. A taxonomy is a hierarchical classification scheme. Taxonomies represent collections of topics with “subcategory_of” relationships between them. We have created a hierarchy of categories: LLC is one of the legal entity types. A legal entity is one of the XYZ customer types.
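The hierarchy described above can be sketched as a chain of “subcategory_of” links that we follow upward from an entity. The dictionary names and the function `categories_of` are hypothetical and serve only to illustrate the idea.

```python
# "subcategory_of" links between categories, taken from the example above.
subcategory_of = {"LLC": "Legal entity", "Legal entity": "Customer type"}

# The link from a data node to its most specific category.
instance_of = {"XYZ": "LLC"}


def categories_of(entity):
    """Return all categories of an entity, from most specific to most general."""
    path = []
    category = instance_of.get(entity)
    while category is not None:
        path.append(category)
        category = subcategory_of.get(category)
    return path


xyz_categories = categories_of("XYZ")
print(xyz_categories)
```

Walking the chain for “XYZ” yields every level of the taxonomy it belongs to, so a query about customer types can find XYZ even though it is labeled only as an LLC.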
If we create a similar taxonomy for “Acer Swift 5” and “Laptop” entities and link these two taxonomies, we will build an ontology, as demonstrated in Figure 6. An ontology is also a classification scheme that describes the categories and their relations in a business domain.
Figure 6: The simplified schematic of an ontology.
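A rough sketch of how linking two taxonomies produces an ontology follows. The product-side categories (“Computer,” “Product type”) and the category-level relation are assumptions invented for this illustration, as are all function and variable names.

```python
# Two taxonomies in one mapping. The customer side follows the example above;
# the product side ("Computer", "Product type") is hypothetical.
subcategory_of = {
    "LLC": "Legal entity",
    "Legal entity": "Customer type",
    "Laptop": "Computer",
    "Computer": "Product type",
}
instance_of = {"XYZ": "LLC", "Acer Swift 5": "Laptop"}

# A relation defined between categories links the two taxonomies (hypothetical).
allowed_relations = {("Customer type", "bought", "Product type")}


def categories_of(entity):
    """Return all categories of an entity, from most specific to most general."""
    path = []
    category = instance_of.get(entity)
    while category is not None:
        path.append(category)
        category = subcategory_of.get(category)
    return path


def conforms(subject, predicate, obj):
    """Check whether a fact matches a relation allowed between categories."""
    return any(
        (s_cat, predicate, o_cat) in allowed_relations
        for s_cat in categories_of(subject)
        for o_cat in categories_of(obj)
    )


valid_fact = conforms("XYZ", "bought", "Acer Swift 5")
reversed_fact = conforms("Acer Swift 5", "bought", "XYZ")
print(valid_fact, reversed_fact)
```

The ontology adds rules about which categories may be related, so a fact such as a laptop buying a company can be flagged as inconsistent.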
All the schemes discussed above (the basic model, the taxonomy, and the ontology) represent knowledge graphs. The most important thing to remember is that knowledge graphs link data and metadata. Metadata creates semantics, adds meaning to data, and puts data into a context.
In my book, “Data Lineage from a Business Perspective,” I demonstrated the complex model of data lineage that describes data movement at four abstraction levels and then linked these levels. These levels, presented in Figure 7, are:
Business: IT assets, business processes, and people. You document the movement of data at the level of information technology assets and link them to business processes, people, and applications.
Conceptual/semantic: business subject areas, data entities, restrictions, and constraints. The conceptual/semantic level is the highest level of a data model. You describe the movement of data between data entities.
Logical/solution: data attributes and business rules. At this data model level, you describe data movements between data attributes.
Physical: database (db) schemas at the level of tables/columns (relational db) or labels/nodes (graph db), and ETL (extract, transform, load) jobs. Data movements are documented between tables and columns in databases.
Figure 7: The metamodel of data lineage.
There are several misunderstandings associated with the term “data lineage”:
Depending on the documentation subject, data lineage can be classified as data value lineage or metadata lineage. Data value lineage describes the movement of data instances along data chains. Business stakeholders require this type of data lineage. For example, the monthly revenue in a financial report amounts to 1 million EUR. They want to know all the original contracts and the transformations applied along the data pipeline to derive the amount of 1 million EUR. Many business stakeholders dream about such a lineage. However, it is hard to achieve in practice. Metadata lineage describes data processing: the movement and transformation of data using metadata, not the data itself. Data lineage is a combination of business and technical metadata. All data lineage-related initiatives focus on documenting metadata lineage. In our daily language, we use the term “data lineage” but mean “metadata lineage.” This situation can be confusing for many business stakeholders.
Many professionals mistakenly limit data lineage only to the physical/technical level. Data lineage can be documented at four abstraction levels, and these levels should be linked, as we discussed above.
Data lineage maps objects not only along data chains at the same abstraction level but also objects between different layers. Depending on the direction of documentation, we recognize horizontal and vertical data lineage. Horizontal data lineage demonstrates how data flows from the origination point to the point of usage. Horizontal data lineage can be documented at each of the four layers. Vertical data lineage integrates data lineage components between different layers.
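Horizontal and vertical lineage can be sketched with two small structures: edges within one abstraction level, and links between levels. All object names below (tables, columns, attributes) are hypothetical, only two of the four levels are shown to keep the sketch short, and the function `upstream` is my own illustration rather than a feature of any lineage tool.

```python
# Horizontal lineage: (source, target) edges within one abstraction level.
# All object names are hypothetical.
horizontal = {
    "logical": [("Contract amount", "Monthly revenue")],
    "physical": [
        ("crm.contracts.amount", "dwh.revenue.amount"),
        ("dwh.revenue.amount", "report.monthly_revenue"),
    ],
}

# Vertical lineage: links from a logical attribute to the physical columns that implement it.
vertical = {
    "Contract amount": ["crm.contracts.amount"],
    "Monthly revenue": ["dwh.revenue.amount", "report.monthly_revenue"],
}


def upstream(level, target):
    """Walk horizontal lineage backwards from a target to its origins at one level."""
    found = []
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for source, destination in horizontal[level]:
            if destination == current and source not in found:
                found.append(source)
                frontier.append(source)
    return found


physical_sources = upstream("physical", "report.monthly_revenue")
logical_sources = upstream("logical", "Monthly revenue")
implementing_columns = vertical["Monthly revenue"]
print(physical_sources)
print(logical_sources)
print(implementing_columns)
```

Note that the sketch stores only metadata about objects and their connections, not the data values themselves, which matches the point made above that data lineage initiatives document metadata lineage.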
Now, we can draw several conclusions regarding similarities and differences in the definitions of the concepts of metadata management (MM), knowledge graphs (KG), and data lineage (DL).
Metadata is the core subject of MM, KG, and DL. Only KG can demonstrate the relationships between data and metadata.
All three capabilities focus on metadata integration.
Only DL aims to describe data transformations expressed in business rules and ETLs.
Irina is a data management practitioner with more than 10 years of experience. The key areas of her professional expertise are the implementation of data management frameworks and data lineage.
Throughout the years, she has worked for global institutions as well as large- and medium-sized organizations in different sectors, including but not limited to financial institutions, professional services, and IT companies.