The concept of critical data has been around for a long time. Multiple authors and experts in the field, such as Thomas Redman and David Loshin, have developed and discussed the concept of critical data (elements) since 2008. DAMA-DMBOK2 also covers this topic as part of the Data Quality Knowledge Area. I shared my first article on this topic in 2019 and later described the dependencies between critical data and data lineage in my book, “Data Lineage from a Business Perspective.”
Despite this long history, data management professionals still debate the concept extensively when implementing it in practice.
In this article, I share my insights and practical experience in implementing this concept, focusing on the following topics:
- The definition and objectives of critical data and critical data elements
- Data management capabilities that provide input and consume the results of critical data initiatives
- Step-by-step high-level methodology for implementing this concept
Definition and Objectives of Critical Data
DAMA-DMBOK2 provides only general characteristics of critical data, specifying its usage, such as regulatory reporting, financial reporting, business policy, ongoing operations, and business strategy. It also emphasizes that “specific drivers for criticality will differ by industry.”
One of the drivers of this concept was the Basel Committee on Banking Supervision’s standard number 239: “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR). This standard defined critical data as:
- “Data that is critical to enabling the bank to manage the risks it faces
- data critical to risk data aggregation and IT infrastructure initiative
- aggregated information to make critical decisions about risk”
In my practice, I apply the following definition of critical data:
Critical data is “data that is critical for managing business risks, making business decisions, and successfully operating a business.”
However, I must acknowledge that this definition is too broad to be applied directly in the practical implementation of this concept. One of the challenges I observe within the community is that professionals often focus on implementing this concept without fully considering its underlying purpose.
I believe the fundamental goal of using the critical data concept is to prioritize various data management initiatives. Occasionally, I observe discussions in which critical data is seen as serving the purpose of either “prioritization” or “limitation.” Any data management initiative is time- and resource-consuming; therefore, to make the initiative feasible, the organization must set priorities. However, this does not mean that only the particular “critical” data must be managed properly. From this perspective, “prioritization” is the ultimate goal of using the “critical data” concept.
Input-Providing and Consuming Data Management Capabilities
DAMA-DMBOK2 limits the consideration of critical data to the Data Quality Knowledge Area. The challenge is that multiple data management capabilities provide input for defining critical data; in this article, I call them “input-providing.” The data quality capability, in turn, can be considered one of the recipients, or consuming capabilities, of the results of critical data implementation.
Figure 1 shows examples of input-providing and consuming capabilities; I use the data quality capability as an example of a consuming capability.
A data governance capability must provide the regulatory requirements related to critical data, if applicable, along with the definition of criticality and the criteria for determining it. It should also outline the data management roles involved in the identification process. Additionally, this capability can establish the methodology for identifying and managing critical data.
Critical data must be established at a minimum of two data model levels, logical and physical, which are defined by a data modeling capability.
Data and application architecture, together with metadata management, must provide data flows and lineage so that critical data can be defined effectively along data chains.
Only by having these inputs can the data quality capability perform its tasks. For critical data elements, this capability enables gathering data quality requirements, building corresponding data quality checks, and measuring data quality.
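To make this concrete, here is a minimal Python sketch of how a consuming data quality capability might bind a measurable check to a critical data element. The `DQCheck` structure, the `completeness` rule, and the threshold value are illustrative assumptions, not a prescribed implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DQCheck:
    """A data quality check bound to a critical data element (illustrative)."""
    cde_name: str                                    # business name of the CDE
    rule: Callable[[List[Optional[float]]], float]   # returns a score in [0, 1]
    threshold: float                                 # minimum acceptable score

def completeness(values: List[Optional[float]]) -> float:
    """Share of non-null values: a typical completeness measure."""
    return sum(v is not None for v in values) / len(values) if values else 0.0

# Hypothetical check for the "Net Revenue" CDE
check = DQCheck(cde_name="Net Revenue", rule=completeness, threshold=0.98)

sample = [120.5, None, 98.0, 101.3]  # values read from a physical attribute
score = check.rule(sample)
print(f"{check.cde_name}: score={score:.2f}, passed={score >= check.threshold}")
```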
Let me summarize the key steps in identifying critical data (elements), indicating the relevant data management capabilities.
Step-by-Step High-Level Methodology for Implementing the Critical Data Concept
Step 1. Identify the objectives for using the critical data concept
Various business drivers demand different data initiatives. If your organization must comply with regulations that include critical data in their requirements, your objective is clear. Other organizations, however, must define a clear objective for the critical data concept themselves. The key reason is that the objective drives the definition of critical data within the organization.
The primary objective of applying the critical data concept is prioritizing data-related initiatives. The different natures of these initiatives will lead to varying definitions of critical data. For example, personal data will be considered critical for an organization focused on improving the management of this data type, whereas customer data will take precedence for an organization enhancing its customer experience.
This task can be considered a deliverable of data governance.
Step 2. Identify the critical use cases, reports, or dashboards
The identified business drivers and corresponding data initiatives have specific stakeholders and use cases; reports and dashboards are typical examples of such use cases. The next logical step is to document these use cases and identify those critical to the data initiatives’ objectives. Many organizations produce hundreds or even thousands of reports, so the simplest way to understand the volume your organization produces is to begin documenting them.
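As a starting point, the inventory can be as simple as a structured list of records. The following Python sketch shows one possible shape; all field names and example entries are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ReportRecord:
    """One entry in a report/use-case inventory (all fields are illustrative)."""
    report_id: str
    title: str
    owner: str                 # accountable business role
    business_driver: str       # e.g., regulatory reporting, customer experience
    is_critical: bool = False  # set once the criticality criteria are applied

inventory = [
    ReportRecord("R-001", "Monthly Net Revenue", "Finance", "financial reporting", True),
    ReportRecord("R-002", "Campaign Click-Through", "Marketing", "ongoing operations"),
]

# The critical subset scopes the rest of the methodology
critical_reports = [r.title for r in inventory if r.is_critical]
print(critical_reports)
```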
This exercise can take some time, but it is not the end of the story. It can be considered a deliverable of data governance and data architecture.
Step 3. Analyze data elements in use cases and define the critical ones
Reports and dashboards are containers of information, or data elements. These reports include leading business key performance indicators (KPIs); the simplest examples of such KPIs are “Customer profitability” or “Monthly Net Revenue.” I therefore recommend identifying all or some of these KPIs as critical data elements.
This step sounds simple, but it is not always so. Identifying (critical) data elements in tabular reports, in which each column represents a unique data element, is relatively straightforward. However, some reports, including regulatory ones, contain multiple complex schedules, in which you can often find a taxonomy of similar data elements. Figure 2 presents a simplified example.
“Net Revenue” is a data element corresponding to a financial KPI and can be marked as critical. However, “Net Revenue” can also be segmented by customer group, such as “Corporate” or “Retail” segments. In this case, the customer segment becomes another data element with its taxonomy. The organization must then decide whether to recognize “Net Revenue,” “Net Revenue per corporate segment,” and “Net Revenue per retail segment” as separate critical data elements or as one data element with multiple dimensions.
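A minimal Python sketch can contrast these two modeling choices; the `CDE` structure and the `dimensions` field are assumptions made for illustration only:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CDE:
    """A critical data element; 'dimensions' captures the taxonomy (illustrative)."""
    name: str
    dimensions: tuple = ()

# Option A: one CDE with the customer segment modeled as a dimension
net_revenue = CDE("Net Revenue", dimensions=("customer_segment",))

# Option B: a separate CDE per segment value
per_segment = [
    CDE("Net Revenue per corporate segment"),
    CDE("Net Revenue per retail segment"),
]

# Option A keeps the CDE register small; Option B allows segment-specific
# data quality thresholds and ownership.
```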
The answer is not straightforward and depends on several factors. The first factor is the data model level at which critical data elements are identified. The example above corresponds to a conceptual or semantic data model. However, I recommend defining critical data elements at the level of logical data models (application-agnostic or application-related) and/or physical ones. Each approach has its advantages and disadvantages.
The second factor is the purpose of identifying critical data elements. If the purpose is to scope a data quality initiative, then identifying data elements at the physical level is recommended. One of the key goals is building data quality checks and measuring data quality, which can be done only at the level of an attribute in a database.
The preferable approach is to first identify critical data elements at the level of logical models along a data chain and then, for the purpose of building DQ checks, link the logical and physical models using horizontal and vertical data lineage. In any case, each use case or report requires a careful feasibility assessment.
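As a hedged illustration of that logical-to-physical link, the sketch below maps a logical CDE to the physical attributes that implement it (vertical lineage); every system, table, and column name is invented:

```python
# A sketch of vertical lineage: mapping a logical CDE to the physical attributes
# that implement it. Every system, table, and column name below is invented.
vertical_lineage = {
    "Net Revenue": [
        ("dwh", "fact_sales", "net_revenue_amt"),
        ("crm", "orders", "net_rev"),
    ],
}

def physical_targets(logical_cde: str) -> list:
    """Return the (system, table, column) triples where DQ checks can be built."""
    return vertical_lineage.get(logical_cde, [])

print(physical_targets("Net Revenue"))
```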
It is worth mentioning that I recognize the critical data elements (CDEs) in reports as the “ultimate” ones because they reside at the end of the data chain. Identifying these ultimate critical data elements requires ensuring their uniqueness. A precise business definition for each CDE must be developed to achieve this.
This step necessitates robust data modeling and metadata management capabilities. Data models and business glossaries are essential prerequisites for successfully performing this step. Additionally, the criticality criteria for the ultimate CDEs must also be predefined. The criticality criteria depend on the nature of the business drivers and data initiatives.
Step 4. Cross-check ultimate critical data elements across critical use cases
It’s not uncommon to find similar critical data elements (CDEs) across multiple reports. After documenting the (critical) data elements for each report, a team of professionals should cross-check and analyze the uniqueness of these CDEs. This analysis is also valuable for optimizing reports and assessing data quality. It’s important to note that data elements in different reports may have similar titles but distinct meanings. Conversely, data elements with different titles may share the same meaning.
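A simple Python sketch can illustrate this cross-check; the element titles and definitions below are invented, and a real exercise would rely on governed business glossary entries rather than free-text strings:

```python
from collections import defaultdict

# Each tuple: (report, element title, business definition). All values invented.
elements = [
    ("Risk Report",    "Net Revenue", "gross revenue minus returns and discounts"),
    ("Finance Report", "Net Revenue", "gross revenue minus cost of goods sold"),
    ("Board Pack",     "Monthly Net Sales", "gross revenue minus returns and discounts"),
]

by_title, by_definition = defaultdict(set), defaultdict(set)
for report, title, definition in elements:
    by_title[title].add(definition)
    by_definition[definition].add(title)

# Same title, different meanings: homonyms to resolve
homonyms = {t for t, defs in by_title.items() if len(defs) > 1}
# Different titles, same meaning: synonyms to consolidate
synonyms = {d for d, titles in by_definition.items() if len(titles) > 1}
print("Homonyms:", homonyms)
print("Synonyms:", synonyms)
```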
This step will deliver the final list of ultimate critical data elements, which must be appropriately documented and shared. Please be aware that the “critical” status is simply a business metadata element added to data elements. The way you mark data elements as critical therefore depends on the IT tool used for cataloging data elements or data models; it can be a (meta)data catalog, a data governance solution, or even Excel.
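As a minimal illustration, the sketch below exports such a list to a CSV file that Excel or a catalog import could consume; the column names are assumptions, not a standard schema:

```python
import csv

# Final list of ultimate CDEs, with "critical" carried as plain business metadata.
# The column names are assumptions; a metadata catalog would hold the same fields.
cdes = [
    {"element": "Net Revenue",
     "definition": "gross revenue minus returns and discounts",
     "model_level": "logical", "is_critical": True},
    {"element": "Customer segment",
     "definition": "corporate or retail customer grouping",
     "model_level": "logical", "is_critical": True},
]

with open("ultimate_cdes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(cdes[0].keys()))
    writer.writeheader()
    writer.writerows(cdes)
```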
The question often arises: What is the optimal number of CDEs an organization should manage? I don’t believe there is a “one-size-fits-all” answer. First, we should recognize that when defining the number of CDEs, we often mean only the ultimate ones. The optimal number of CDEs depends on various factors, including the organization’s size, industry, business objectives, available resources, and the complexity of its data environment and initiatives. The data model level at which critical data elements are identified also influences the optimal number of elements. Identifying CDEs at a more abstract level, such as a conceptual or logical data model, may result in a smaller set of broader elements. Conversely, identifying CDEs at the physical data model level may lead to a larger number of more detailed elements.
This step also requires robust data governance, analytics, modeling, and metadata management capabilities.
Step 5. Identify CDEs along a data chain
Identifying critical data elements along data chains is challenging. The importance of this exercise becomes apparent when working on improving data quality. It is generally recommended that data quality checks be moved as close to the data sources as possible. However, even if you intend to build a chain of data quality checks along the data chain, you must first understand the relationships between the data elements in different data locations/applications.
When performing this task, you must again decide the data model level at which critical data elements (CDEs) should be documented. Documenting data lineage at the logical level can be time-consuming and resource-intensive, especially if your organization has not already documented logical data models. Additionally, it may not always provide significant business value.
On the other hand, documenting data lineage at the physical level might be even more time- and resource-consuming. The feasibility of this approach depends heavily on your organization’s application landscape. It’s important to note that while automated methods exist for documenting data lineage at the physical level, they may still require manual intervention. Each type of database and ETL (extract, transform, load) tool may require an individual scanner, adding further complexity to the process. Marking the criticality of data attributes at the physical level can also be a problem.
I can recommend a solution to the dilemma of choosing between documenting data lineage at the logical and physical levels. You should leverage a data lineage analytics capability to establish relationships between data elements across different locations and applications along the data chain. Fortunately, modern metadata and data lineage solutions often have this capability built in. However, you’ll be challenged to classify critical data elements at this stage. It’s important to understand that different data elements can have varying levels of criticality when it comes to delivering the ultimate critical data elements.
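To illustrate the idea, the following Python sketch represents horizontal lineage as a directed graph and traces every element that contributes to an ultimate CDE; all node names are invented, and a metadata tool would typically provide this traversal out of the box:

```python
# Horizontal lineage sketched as a directed graph: each edge points from an
# upstream element to the element it feeds. All node names are invented.
feeds = {
    "src.orders.amount":          ["stg.sales.net_amount"],
    "src.returns.amount":         ["stg.sales.net_amount"],
    "stg.sales.net_amount":       ["dwh.fact_sales.net_revenue"],
    "dwh.fact_sales.net_revenue": ["report.net_revenue"],  # the ultimate CDE
}

def upstream_of(target: str) -> set:
    """All elements that contribute, directly or indirectly, to `target`."""
    direct = {src for src, dsts in feeds.items() if target in dsts}
    result = set(direct)
    for src in direct:
        result |= upstream_of(src)
    return result

# Every element in this set is a candidate CDE whose criticality can then be
# classified by its contribution to the ultimate CDE.
print(upstream_of("report.net_revenue"))
```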
In my book, “Data Lineage from a Business Perspective,” I describe four types of critical data elements. It’s important to note that these types must be identified using different criticality criteria.
Performing this step requires several data management capabilities: data modeling, data and application architecture, metadata management (e.g., data lineage, scanners, etc.), and IT capabilities (e.g., database administration, data lifecycle management, etc.).
Table 1 summarizes our discussion by mapping key steps in identifying critical data (elements) and required data management capabilities.
Of course, all the topics discussed in this article require an in-depth analysis of each organization’s particular situation.
Key Takeaways
This article outlines the comprehensive tasks that various business, data management, and IT professionals should undertake to identify critical data elements (CDEs) and leverage them effectively for their intended purposes.
- Prioritization Over Limitation: The goal of identifying critical data elements is to prioritize various data-related initiatives, not to limit them.
- Input-Providing Capabilities: Several data management capabilities are essential for providing input in the process of identifying CDEs. These include data governance, modeling, data and application architecture, and metadata management, collectively serving as input-providing capabilities.
- Output-Consuming Capabilities: Some data management capabilities consume the outputs of a CDE initiative. These include data quality and master data management, which rely on the defined CDEs to ensure accuracy, consistency, and integrity across the organization.
- CDE Classification and Documentation: Critical data elements (CDEs) can vary in type and be identified at different data model levels, depending on their location along the data chain. Ultimate CDEs, found in reports or dashboards at the end of the data chain, can be identified at conceptual/semantic, logical, or physical data model levels. As you move along the data chain, CDEs should be classified based on their contribution to the ultimate CDEs. These CDEs should be recognized at the physical data model level, as determined by data lineage analytics.
- Data Lineage as a Mandatory Input: Identifying critical data elements along a data chain requires robust metadata and data lineage capabilities.