Industrial Data Platform Capability Map (v1)
Which capabilities do I need when building a state-of-the-art Industrial Data System? We have identified the seven most important capabilities, from Connectivity to Data Sharing and everything in between.
This article will help you identify the capabilities needed to build a modern industrial data system. As with any capability map, it will most likely not be complete (feel free to leave your thoughts in the comment section!). You can use this list of capabilities to start your request for information (RFI) or request for proposal (RFP) process.
This article is a first release: it’s not perfect and it’s not complete, but it’s the best way to get the conversation started. Any thoughts? Make sure to comment here or contact us, as we will release a second version in a couple of months.
Some initial things to note:
We deliberately do not put weights on the different categories; you need to figure out which ones matter most to you.
For now, we have decided not to map these capabilities onto vendors, as this would require in-depth knowledge from our side, or we would need to trust the vendors to map themselves (and they will obviously be able “to do it all” 😀). We will, however, list the names we know at the end of the article for you to review.
It’s not an article about the Unified Namespace, because that is just a part of a bigger discussion.
Now, we’ve been thinking about how to name this article for a while… We wanted to avoid terms such as ‘Historian’ (because it’s outdated), ‘IoT Platform’ (because there is more to the OT world than IoT), or ‘DataOps something’ (because that term is too “IT/Data” focused and less known by OT folks).
So to keep in line with our previous parts on Data Platforms, we’ll call this the “Industrial Data Platform Capability Map (v1)”. 🎉
Be sure to review the previous parts for essential background information if you are new to our blog: Part 1 (The IT and OT view on Data), Part 2 (Introducing the Operational Data Platform), Part 3 (The need for Better Data), Part 4 (Breaking the OT Data Barrier: It's the Platform) and Part 5 (The Unified Namespace).
1 - Connectivity
We need a secure and scalable connectivity layer to integrate different data sources into the Industrial Data Platform. Especially in our OT world, it is important to take into account the need to connect to older assets using legacy protocols as well as newer assets equipped with the latest shiny bells and whistles. In some cases you might need to favor high throughput, redundancy, the ability to work offline and buffer data, or all of the above.
Identify your Data sources, for example:
Local Time Series Data Sources: data from Historians, SCADA systems, and PLCs;
Cloud Data: data from IIoT devices and other cloud-based sources;
MOM Data: MES, Quality Systems, Planning Systems, and many more;
Engineering & GIS Data: includes digital twins and GIS systems.
Identify your required Protocols, for example:
OPC (DA, HDA, UA);
MQTT (with or without Sparkplug B);
Machine/sensor protocols (Profibus, Modbus, IO-Link, etc.);
Database connections;
REST APIs (for the new and shiny stuff);
Fetching and parsing text files (such as CSV, JSON, and the like) from a given endpoint.
Identify your additional requirements, for example:
Real-time streaming and/or event-driven operation
Buffering / backfilling capabilities (see the store-and-forward sketch below)
Redundancy
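To make the buffering and backfilling requirement concrete, here is a minimal store-and-forward sketch in Python. It is illustrative only: read_sensor() and publish_upstream() are hypothetical placeholders for whatever connector (OPC UA, Modbus, MQTT, …) your platform actually uses.

```python
import time
from collections import deque

buffer: deque = deque()

def read_sensor() -> dict:
    """Hypothetical placeholder for a real connector read (OPC UA, Modbus, ...)."""
    return {"tag": "reactor2/temperature", "value": 72.4, "ts": time.time()}

def publish_upstream(sample: dict) -> bool:
    """Hypothetical placeholder for a real publish (e.g., MQTT); False when offline."""
    return True

while True:
    buffer.append(read_sensor())
    # Drain oldest-first so history is backfilled in order after an outage;
    # while the connection is down, samples simply accumulate in the buffer.
    while buffer and publish_upstream(buffer[0]):
        buffer.popleft()
    time.sleep(1)
```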
2 - Contextualization & Data Management
At this point, our (time series) data is still raw and unstructured. We want to make life for our users as easy as possible, which means giving them the right context without the hassle of finding data from different sources themselves. Here are some context examples:
Asset Context gives us insight into the physical assets in our plant; it is a rudimentary Digital Twin. This data can often be found in Engineering or Master Data systems.
Production Context allows us to link the data to the actual manufacturing process that took place. This data can often be found in a Manufacturing Execution System (MES). It helps us to identify the product that was made, the order/batch it belongs to, the materials used, the operator/shift/team who worked on it, and much more.
Maintenance Context can give direct insights into the OEE of our equipment. Understanding the relationship between planned or unplanned maintenance and certain process conditions can again be the starting point of a data exploration project.
Identify your data management requirements
Asset Context and Data Management are in general rather static: they don’t change much in a typical manufacturing plant. However, in the IIoT world we do need ways to automatically add new devices to the asset model once they come online. In many cases, these assets can announce themselves to your platform.
Modeling, management, viewing, and versioning capabilities to build your ontology, based on standards, vendor-provided templates, your own templates, or a build-your-own schema;
Ways to manually input or automatically ingest and process the metadata that feeds the model;
Do you need a very simple, straightforward model, or do you need a complex, object-oriented style of modeling?
Identify ways to query the model, e.g., using GraphQL (see the sketch below).
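As an example of what querying an asset model could look like, here is a small GraphQL request issued from Python. The endpoint and the schema (asset, children, tags) are assumptions for illustration; every platform exposes its own model.

```python
import requests

# Hypothetical GraphQL endpoint and asset schema, purely for illustration.
query = """
{
  asset(path: "plant1/line3/reactor2") {
    name
    children { name }
    tags { name unit }
  }
}
"""

resp = requests.post(
    "https://dataplatform.example.com/graphql",
    json={"query": query},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["asset"])
```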
Identify Production / Maintenance and other related contexts
This is typically another way to ingest data; in this case, the information needs to be found in other data sources that are highly dynamic. For example, batch records will update continuously. We often refer to “slicing and dicing” functionality, meaning the ability to slice a set of measurements into parts, e.g., per batch or per shift (see the sketch after this list).
Do you want to only link to these data sources? Or do you want to cache and/or store the relevant events in the data platform? This answer will typically depend on the performance of these data sources and the complexity of understanding where your ‘single source of truth’ can be found.
Do you need certain ETL (extract-transform-load) functionality to map the data from the source system into a format you can use to contextualize your data?
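A minimal sketch of what “slicing and dicing” means in practice, assuming hypothetical MES batch records and a historian time series loaded into pandas:

```python
import pandas as pd

# Hypothetical batch records from an MES (start/end per batch)...
batches = pd.DataFrame({
    "batch": ["B-1041", "B-1042"],
    "start": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00"]),
    "end":   pd.to_datetime(["2024-01-01 06:00", "2024-01-01 12:00"]),
})

# ...and a raw temperature series from the historian (one value per minute).
idx = pd.date_range("2024-01-01", periods=12 * 60, freq="min")
temperature = pd.Series(range(len(idx)), index=idx, name="temperature", dtype=float)

# "Slice" the measurements per batch and compute a per-batch summary.
for _, b in batches.iterrows():
    piece = temperature[b["start"]:b["end"]]
    print(b["batch"], "mean:", piece.mean(), "max:", piece.max())
```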
PS 1: More about Context in this previous IT/OT Insider article:
PS 2: This article on Rhize’s website describes the Ontology concept very well!
PS 3: We plan to dedicate a new article to this capability in the near future, you don’t want to miss this one, so make sure to subscribe 🙂!
3 - Data Quality
The new kid on the block is Data Quality, or more specifically Sensor Data Quality. Data issues, somewhere in the pipeline from sensor to final report, are difficult to detect and correct. You might just sum up data containing an outlier, resulting in totally wrong conclusions. Or a sensor might have been giving a flatline reading for days, weeks, or even months without your knowledge. Depending on your use case, some basic data quality checks might be sufficient; in other cases you might need more advanced functionality (a minimal sketch of such checks follows after the list below).
Identify your data quality requirements:
Which data types & sources are the input?
What are your data monitoring requirements (out of bounds, NaN, spikes, data gaps, drift, undersampling, oversampling, calibration issues, missing metadata, incorrect metadata, etc.)?
Do you have custom/specific data monitoring requirements?
How do you want to expose these observations to your users? Do you want to store quality information in your data platform as a new context layer? Do you want to be able to integrate data quality metrics into your BI reports?
Do you need a solution to clean and augment your data? Who will be responsible for cleaning data? Do you need a user interface? An API?
Do you want to automatically send cleaned data towards a “silver” data store?
Do you need to calculate data quality KPIs?
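As promised above, a minimal sketch of basic sensor data quality checks, assuming the readings of one tag are available as a pandas Series with a datetime index; the thresholds and rules are illustrative, not a standard:

```python
import numpy as np
import pandas as pd

def basic_quality_checks(series: pd.Series, low: float, high: float) -> dict:
    """A few of the monitoring checks listed above, applied to one sensor tag."""
    diffs = series.index.to_series().diff().dropna()
    return {
        "nan_count": int(series.isna().sum()),
        "out_of_bounds": int(((series < low) | (series > high)).sum()),
        # Spikes: points more than 4 standard deviations from the mean.
        "spikes": int((np.abs(series - series.mean()) > 4 * series.std()).sum()),
        # Flatline: longest run of consecutive identical values.
        "longest_flatline": int(series.groupby(series.diff().ne(0).cumsum()).size().max()),
        # Data gaps: sampling intervals much larger than the median interval.
        "gaps": int((diffs > 5 * diffs.median()).sum()),
    }

# Example: flag issues on a temperature tag expected to stay between 0 and 150.
# checks = basic_quality_checks(temperature, low=0.0, high=150.0)
```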
PS: More on Data Quality in this previous article:
4 - Data Broker & Store
We need a way to receive and store the resulting data sets (both the raw data and the prepared, cleaned data sets) in a system which can handle sensor data in context at scale. But storing is just one side of the equation; getting data back out is equally important, which means that we need ways to subscribe to data and to query it at scale.
Identify Data Store Capabilities:
Time-Series Data Handling: The system must be designed for time-series data, supporting continuous streams from industrial sensors with low-latency access and storage capacity for vast datasets.
Event and Alarm Storage: Store events (typically in a relational database format) and potentially also alarms.
Publish-Subscribe (act as a Broker): A central capability of a broker is to handle data in real-time via a publish-subscribe model, enabling data producers (e.g., sensors) to transmit updates as they occur, while consumers (e.g., analytics systems) subscribe to relevant data streams (most popular: an MQTT broker; see the sketch after this list),
or you could still go for a more traditional polling model in which the platform polls data from defined sources (e.g., reads data from an OPC server).
Data Lifecycle Management: Supports “hot-warm-cold” data management for efficient storage (especially useful when using Cloud storage):
Hot: Immediate, short-term storage for real-time access.
Warm: Semi-archival storage, optimized for recent but less frequently accessed data.
Cold: Long-term storage of archival data, often moved to cheaper, high-capacity storage options like object storage.
Multi-Layered Storage (Bronze/Silver/Gold, following the Delta Lake medallion principles):
Bronze: Raw data, stored as-is from the source.
(optional) Silver: Cleaned and validated data, processed for initial insights.
(optional) Gold: Finalized, prepared data ready for analytics and reporting, e.g., to be used in a Power BI dashboard.
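To illustrate the publish-subscribe capability, here is a minimal MQTT consumer sketch using the paho-mqtt client (1.x callback style); the broker address and topic layout are assumptions for illustration:

```python
import paho.mqtt.client as mqtt

# Called once the connection to the broker is established.
def on_connect(client, userdata, flags, rc):
    # Subscribe to every temperature stream on line 3 (hypothetical topic layout).
    client.subscribe("plant1/line3/+/temperature")

# Called for every message published on a topic we subscribed to.
def on_message(client, userdata, msg):
    print(msg.topic, msg.payload.decode())

client = mqtt.Client()  # paho-mqtt 1.x style client
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.com", 1883)
client.loop_forever()  # block and process incoming messages
```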
Identify Querying and Data Retrieval Capabilities:
Subscription and Query Capabilities: Allows users and systems to subscribe to specific data streams or datasets, ensuring relevant data retrieval at scale.
API Accessibility: Data should be accessible through APIs, both simple (e.g., REST) for general use cases and advanced (e.g., GraphQL) for complex queries with specific data constraints.
Contextual Data Retrieval: Enables queries to access data within the operational context (e.g., time range, location, or specific production batch) to support more effective decision-making (see the sketch below).
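What such a contextual query could look like over a simple REST API; the endpoint, parameters, and response shape are all hypothetical here, as every platform defines its own API:

```python
import requests

# Hypothetical endpoint: fetch one tag, bounded by time range and batch context.
resp = requests.get(
    "https://dataplatform.example.com/api/v1/timeseries",
    params={
        "tag": "plant1/line3/reactor2/temperature",
        "start": "2024-01-01T00:00:00Z",
        "end": "2024-01-02T00:00:00Z",
        "batch": "B-1042",  # only the slice belonging to this production batch
    },
    timeout=30,
)
resp.raise_for_status()
for point in resp.json()["values"]:
    print(point["timestamp"], point["value"])
```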
5 - Analytics
In many cases you want the possibility to run analytics directly on the data platform or even at the edge. Examples include virtual tags (values which are calculated in the platform, but not measured) or running algorithms at the edge to pre-process data before it is sent to the platform (e.g., creating per-second statistics on high-frequency data from a vibration sensor). This capability is not to be confused with Data Sharing, in which we make the data available to data users and applications outside the platform.
Identify the need for Analytics on top of the Data Platform:
(Real-time and Batch) Analytics, Advanced Calculations
Edge computing capabilities
Optional: Identify the need for data preprocessing at the Edge:
The need to have certain analytics capabilities at the edge to preprocess events/data streams before they are consumed further up the stack. That might mean, for example, running machine learning models at the edge to preprocess video feeds into features, or sampling high-frequency data into statistical features (see the sketch below).
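A minimal sketch of that last example, reducing a (here simulated) 1 kHz vibration signal to a few per-second statistical features before sending it upstream, using pandas:

```python
import numpy as np
import pandas as pd

# Simulated 1 kHz vibration signal over 10 seconds (stand-in for a real sensor feed).
idx = pd.date_range("2024-01-01", periods=10_000, freq="ms")
signal = pd.Series(np.random.normal(0.0, 1.0, len(idx)), index=idx)

# Reduce 1,000 samples/s to a handful of per-second features before transmission.
features = signal.resample("1s").agg(["mean", "std", "min", "max"])
features["rms"] = signal.pow(2).resample("1s").mean().pow(0.5)
print(features)  # 10 rows instead of 10,000 raw samples
```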
6 - Visualization
This section will be detailed further in a future release (subscribe to get informed!).
When building a Data Platform, you want to create value for all users in your organization. Most people will only be exposed to this capability. This is where your data - of good quality and in context - should find its way to your users. Giving everyone in your organization - from operators to management - easy access will make or break your project.
Visualization, Dashboarding
Sharing and collaboration possibilities of these visuals and dashboards
7 - Data Sharing
This section will be detailed further in a future release (subscribe to get informed!).
The final capability is making sure the platform is open to the outside world. That might be individual users wanting to link their systems to the platform, or it might be other applications.
Including APIs, SDKs, etc.
Data querying capabilities (including full/raw, cleaned, prepared, contextualized data, etc.)
⭐ Important remark regarding open systems
We do not state that all components can or should be based on one single vendor or one single product. Quite the contrary: we believe that for the different capabilities, different solutions could and should be used. Some vendors are really strong in one capability, while others might shine in other domains.
You want to look for interoperability, open protocols, good documentation, proven track records, and standardized implementations as much as possible (which results in interchangeable components instead of vendor lock-in).
Additional Capabilities for Management & Orchestration
This section will be detailed further in a future release (subscribe to get informed!).
This can be different for each capability, but if you are mixing and matching software products, you need to make sure you have some kind of overarching management model as well!
Deployment model
Edge/On-Prem first
Edge/On-Prem first + Cloud capable = Hybrid
Cloud first + Edge/On-Prem Capable = Hybrid
Cloud first
Supporting capabilities
Cyber Security
User Management
Life Cycle Management (Monitor, Deploy, Update)
Make sure to compare apples to apples
Short side step: when comparing prices, make sure to compare apples to apples. Here are some questions to ask:
What are the ongoing license costs? On what factors do they depend (number of users, number of data points, number of servers, amount of storage, amount of data consumed, …)?
Open source is never free, not because of the license cost, but because every product needs know-how and support. Are there support models you can buy? Are there partners who are knowledgeable about the product and can provide you with a support model?
How many people do you need to run this thing? (internal and/or external)
What is the infrastructure cost? What about Cyber Security requirements?
Especially when you run workloads in the cloud: what are the expected costs for storage, compute, and in some cases even data ingress and egress?
And probably many more ;)
Where is my UNS (Unified Namespace) in this capability map?
“Where’s the UNS?”, you might ask… Remember from our previous article on the UNS that the Unified Namespace is a concept to bring data together - organized and contextualized - in a central data broker, representing the business as it is right now. Also remember that MQTT is just one of the potential protocols you can use to achieve that.
If we map the capabilities from the previous paragraphs to our drawing, we can immediately see that capabilities 1 (Connectivity), 2 (Contextualization), and 4 (Data Broker) are the ones linked directly to the core concept of a UNS. We deliberately changed the line to the storage part into a dotted line, as storage is not required by the core concept.
We thus like to state that the Unified Namespace - in the context of a data platform - can only be valuable when it is part of a larger concept; it doesn’t replace that concept.
PS: This video from Flow Software is a great additional resource you might want to review.
List of potential vendors
As stated in the introduction: at this point in time, we do not want to map our capabilities against the functionalities offered by these vendors; it is just too time-consuming. We welcome all input. If any of the vendors listed want to reach out to us and share their own map, please contact David directly.
Updates after publication:
That’s it for now!
Make sure to subscribe to receive future releases and don’t forget to review our other content :)
I invite you to join this conversation on LinkedIn as well: https://www.linkedin.com/feed/update/urn:li:activity:7265001658437222402/