The business environment has become very dynamic, uncertain, and volatile. Many companies understand the role of data in getting a competitive advantage. Many business people declare that “data is their company’s asset,” and they need “to get value from it” by “becoming data-driven.” Many established and new data management capabilities should assist in reaching these goals. Knowledge graphs, metadata management, and data lineage are examples of these new capabilities that have been actively discussed and developed. There are some challenges associated with the implementation of these capabilities:
- Definitions of these capabilities are ambiguous and depend on the context.
- These capabilities intersect each other to a great extent and have many dependencies.
- Their implementation is time- and resource-consuming.
This article aims to:
- Illustrate relationships between key concepts that form the core of these capabilities: data, metadata, information, and knowledge
- Analyze and compare knowledge graphs (KG), data lineage (DL), and metadata management (MM) by comparing:
  - Definitions and structures of each capability
  - Business drivers that motivate companies to establish them
  - Architecture and technology
  - Use cases
- Demonstrate differences and similarities between these three capabilities
We will start by analyzing the core concepts.
Data, metadata, information, and knowledge
Data, metadata, information, and knowledge are core subjects of KG, DL, and MM, as demonstrated in Figure 1.
These concepts have multiple ambiguous definitions. It is essential to align our understanding of these concepts for the purpose of this article.
Let’s start with definitions.
Data is the physical or electronic representation of signals or facts “in a manner suitable for communication, interpretation, or processing by human beings or by automatic means.” For example, “50” is data. Has it meaning in this format? No.
If we add “Celsius (C),” we will understand that we are talking about temperature. By adding “C,” we created a context we can understand. In this case, “Celsius” is metadata that has made the context. Thus, metadata is data that defines and describes other data in a particular context.
Information is data in a context that explains its meaning and relational connections. So, “50 C” is information because we can explain the meaning of it and demonstrate relations with other data; for example, “50 C” is the day temperature in the Sahara. Information gives answers to the questions “who,” “what,” “where,” and “when.”
Knowledge is a collection of information. It gives answers to the question “why.” Knowledge creates a basis for information analysis. It allows us to understand principles. For example, gathering historical data about the maximum day temperature in a specific region during the summer may assist us in choosing the destination for our next vacation.
Wisdom is the result of interpolating and extrapolating processes based on knowledge. In this context, interpolation means estimating a value within a sequence of known values. Extrapolation refers to estimating an unknown value based on extending a known sequence of values or facts. Analyzing historical temperature trends may lead to building weather forecasts.
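The difference between interpolation and extrapolation can be illustrated with a minimal sketch. It assumes a simple straight-line estimate between two known points; the years and temperatures are hypothetical values invented for illustration, and real forecasting would use far richer models.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Hypothetical maximum day temperatures (degrees Celsius) by year.
known = {2010: 44.0, 2020: 46.0}

# Interpolation: 2015 lies inside the range of known years.
temp_2015 = linear_estimate(2010, known[2010], 2020, known[2020], 2015)

# Extrapolation: 2030 lies outside the range of known years.
temp_2030 = linear_estimate(2010, known[2010], 2020, known[2020], 2030)

print(temp_2015, temp_2030)
```

The same function serves both purposes; only the position of the requested point relative to the known values differs.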
Now, let’s analyze the relationships between these concepts. You may be familiar with the well-known diagram by Russell L. Ackoff, shown in Figure 2. According to him, the human mind’s content can be classified into the following categories: data, information, knowledge, and wisdom. He has indicated that the first three concepts, data, information, and knowledge, relate to the past. They demonstrate what has been known. Wisdom assists people in creating the future.
We can change the human mind’s context to the company’s data processing. Figure 2 demonstrates these steps. First, you need to gather data and collect related metadata. By adding metadata to data, we create context and produce information. Adding semantics to information and analyzing it, we create knowledge that allows us to understand and apply rules and principles to forecasting the future.
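The step from data to information can be sketched in a few lines of code. The class name Information and the metadata keys (unit, measure, where) are assumptions made for this illustration; they do not come from any standard.

```python
from dataclasses import dataclass


@dataclass
class Information:
    """A data value combined with the metadata that gives it context."""
    value: float      # the data itself
    metadata: dict    # data that describes the value

    def describe(self) -> str:
        m = self.metadata
        return f"{m['measure']} in {m['where']}: {self.value:g} {m['unit']}"


data = 50  # on its own, this value has no meaning
metadata = {"unit": "C", "measure": "day temperature", "where": "Sahara"}

info = Information(value=data, metadata=metadata)
print(info.describe())
```

Without the metadata dictionary, the value 50 cannot be interpreted; with it, the value answers the “what” and “where” questions.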
Now we know the definitions and relationships between data, metadata, information, and knowledge and can proceed with analyzing three data management (DM) capabilities: metadata management, knowledge graphs, and data lineage.
MM, KG, and DL: definitions and structure
Metadata management (MM)
DAMA-DMBOK2 defines metadata management as “Planning, implementation, and control activities to enable access to high quality, integrated metadata.” Metadata is a complex concept. Therefore, defining the scope of metadata management is not an easy task. Metadata can be classified differently depending on its subject area and user groups. I use the DAMA-DMBOK2 metadata classification in my practice. Figure 3 demonstrates three categories of metadata: business, technical, and operational.
“Business Metadata focuses largely on the content and conditions of the data and includes details related to data governance.” Data models (conceptual and logical), business terms and definitions, data owners, and business rules are examples of business metadata.
“Technical Metadata provides information about technical details of data, the systems that store data, and the processes that move it within and between systems.” Physical data models, database table and column properties, and ETL jobs represent technical metadata.
“Operational Metadata describes details of the processing and accessing data.” Logs of job execution and error logs are examples.
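As an illustration only, the three categories can be modeled as tags on catalog items. The enum and the item list below are a hypothetical sketch built from the examples in the text, not a DAMA-DMBOK2 artifact.

```python
from collections import defaultdict
from enum import Enum


class MetadataCategory(Enum):
    BUSINESS = "business"
    TECHNICAL = "technical"
    OPERATIONAL = "operational"


# (item, category) pairs taken from the examples in the text.
items = [
    ("business term definition", MetadataCategory.BUSINESS),
    ("data owner", MetadataCategory.BUSINESS),
    ("physical data model", MetadataCategory.TECHNICAL),
    ("ETL job", MetadataCategory.TECHNICAL),
    ("job execution log", MetadataCategory.OPERATIONAL),
    ("error log", MetadataCategory.OPERATIONAL),
]

# Group the items by category, as a metadata repository might do.
by_category = defaultdict(list)
for name, category in items:
    by_category[category].append(name)

for category in MetadataCategory:
    print(category.value, "->", by_category[category])
```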
Data management professionals often unreasonably narrow the scope of metadata management to technical metadata. Even the ISO/IEC 11179 Metadata Registry Standard provides recommendations for the metadata repository to document metadata at the physical level. However, we see that the scope of metadata management can be defined much more broadly and include all metadata types.
The above-discussed definition of metadata management highlights its key tasks:
- Plan, implement and maintain metadata
- Integrate metadata of various types
Knowledge graph (KG)
I investigated multiple sources for this blog and never came across two similar definitions of a knowledge graph. It means that many professionals understand this concept differently and in different contexts. The summary of these definitions sounds like the following: a knowledge graph is a graph that describes real-world facts and relationships between them in a human- and machine-understandable format using semantic models and graph databases.
Let’s elaborate on this definition using a simplified example in Figure 4. A company, XYZ, bought a computer, Acer Swift 5. In this case, “XYZ” and “Acer Swift 5” are data. The labels “LLC” (limited liability company) and “Laptop,” added to “XYZ” and “Acer Swift 5,” respectively, are metadata. This is a basic model of a knowledge graph.
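One common way to hold such a graph in code is a list of subject-predicate-object triples. The sketch below assumes that representation; the predicate names “bought” and “is_a” are hypothetical and chosen only for this example.

```python
# Each triple is (subject, predicate, object).
triples = [
    ("XYZ", "bought", "Acer Swift 5"),   # fact linking two data items
    ("XYZ", "is_a", "LLC"),              # metadata label for the company
    ("Acer Swift 5", "is_a", "Laptop"),  # metadata label for the product
]


def objects(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects("XYZ", "bought"))
print(objects("Acer Swift 5", "is_a"))
```

Data and metadata live in the same structure, which is what lets a knowledge graph show the relationships between them.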
We can continue creating the knowledge graph by adding other metadata, as shown in Figure 5.
By adding additional metadata, “Legal entity” and “Customer type,” we created a taxonomy. A taxonomy is a hierarchical classification scheme. Taxonomies represent collections of topics with “subcategory_of” relationships between them. We have created a hierarchy of categories: LLC is one of the legal entity types. A legal entity is one of the XYZ customer types.
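The “subcategory_of” hierarchy from this example can be sketched as a simple child-to-parent mapping. The single-parent assumption is a simplification made for this sketch; real taxonomies may be modeled differently.

```python
# Child category -> parent category ("subcategory_of" relationships).
subcategory_of = {
    "LLC": "Legal entity",
    "Legal entity": "Customer type",
}


def ancestors(category):
    """Walk up the hierarchy and return all broader categories, nearest first."""
    chain = []
    while category in subcategory_of:
        category = subcategory_of[category]
        chain.append(category)
    return chain


print(ancestors("LLC"))
```

Walking the chain from “LLC” upward recovers the hierarchy described above: first “Legal entity,” then “Customer type.”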
If we create a similar taxonomy for “Acer Swift 5” and “Laptop” entities and link these two taxonomies, we will build an ontology, as demonstrated in Figure 6. An ontology is a classification scheme that describes the categories and their relations in a business domain.
All schemes discussed above (basic, taxonomy, and ontology) represent knowledge graphs. The most important thing to remember is that knowledge graphs link data and metadata. Metadata creates semantics, adds meaning to data, and puts data into context.
Data lineage (DL)
The concept of data lineage, like knowledge graphs, has multiple definitions in various contexts. My practical experience with data lineage let me formulate the following definition: data lineage is a description of data movements and transformations at various abstraction levels along data chains and relationships between data at these levels.
In my book, “Data Lineage from a Business Perspective,” I demonstrated the complex model of data lineage that describes data movement at four abstraction levels and then linked these levels. These levels, presented in Figure 7, are:
- Business: IT assets, business processes, and people
You document the movement of data at the level of information technology assets and link them to business processes, people, and applications.
- Conceptual/semantic: business subject areas, data entities, restrictions, and constraints
The conceptual/semantic level is the highest level of a data model. You describe the movement of data between data entities.
- Logical/solution: data attributes, business rules
At this data model level, you describe data movements between data attributes.
- Physical: database (db) schemas at the level of tables/columns (relational db) or labels/nodes (graph db), and ETL (extract, transform, load) jobs
Data movements are documented between tables and columns in databases.
There are several misunderstandings associated with the term “data lineage”:
- Data lineage can be classified as data value or metadata lineage, depending on the documentation subject.
Data value lineage describes the movement of data instances along data chains. Business stakeholders require this type of data lineage. For example, a monthly revenue in a financial report amounts to 1 mln EUR. They want to know all original contracts and transformations applied along a data pipeline to derive the amount of 1 mln EUR. Many business stakeholders dream about such a lineage. However, it is hardly realizable.
Metadata lineage describes data processing: the movement and transformation of data using metadata, not data itself. Data lineage is a combination of business and technical metadata. All data lineage-related initiatives focus on documenting metadata lineage.
In our daily language, we say “data lineage” when we mean “metadata lineage.” This situation can be confusing for many business stakeholders.
- Many professionals mistakenly limit data lineage only to the physical/technical level.
Data lineage can be documented at four abstraction levels, which should be linked, as discussed above.
- Data lineage maps objects along data chains at the same abstraction level and objects between different layers.
Depending on the direction of documentation, we recognize horizontal and vertical data lineage.
Horizontal data lineage demonstrates how data flows from the origination point to the point of usage.
Horizontal data lineage can be documented at each of the four layers.
Vertical data lineage integrates data lineage components between different layers.
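Horizontal and vertical lineage can be sketched with two small mappings. Everything in this example is hypothetical: the column names, the attribute names, and the choice of the physical and logical levels as the two layers being linked.

```python
# Horizontal lineage at the physical level:
# each target column maps to its direct source columns.
horizontal = {
    "report.revenue": ["dwh.sales.amount"],
    "dwh.sales.amount": ["crm.contracts.value"],
}

# Vertical lineage: each physical column maps to a logical attribute.
vertical = {
    "report.revenue": "Monthly revenue",
    "dwh.sales.amount": "Sales amount",
    "crm.contracts.value": "Contract value",
}


def upstream(node):
    """Return all sources of a node, nearest first (breadth-first search)."""
    seen, queue = [], list(horizontal.get(node, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(horizontal.get(current, []))
    return seen


path = upstream("report.revenue")
print(path)                         # horizontal: point of usage back to origin
print([vertical[n] for n in path])  # vertical: same path at the logical level
```

Note that this traces metadata lineage only: it shows which columns feed the report, not which individual contract values produced a given figure.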
Now, we can draw several conclusions regarding similarities and differences in the definitions of the concepts of metadata management (MM), knowledge graphs (KG), and data lineage (DL).
Conclusions
- Metadata is the core subject of MM, KG, and DL. Only KG can demonstrate the relationships between data and metadata.
- All three capabilities focus on metadata integration.
- Only DL aims to describe data transformations expressed in business rules and ETLs.
For more insights, visit the Data Crossroads Academy site: //academy.datacrossroads.nl.