The IT and OT view on data - Part 1 on Industrial Data Platforms

Is it a Data Lake that I need or a Data Warehouse? Why not a Data Ocean! Or a Data Platform? And do we need the “OT version” of those, or the “IT” one?

Oct 09, 2023

In our relentless effort to bring together the IT and OT worlds, it is now time to see whether we can find some common ground when it comes to storing data at scale.

Making data accessible is a really (really) important topic. This is the starting point from which all your data projects will take off. If your data is difficult to access, or not accessible at all, you can immediately set aside your digital ambitions.

Welcome to Part 1 in our Data in IT and OT series. Discover all parts: Part 1 (The IT and OT view on Data), Part 2 (Introducing the Operational Data Platform), Part 3 (The need for Better Data), Part 4 (Breaking the OT Data Barrier: It's the Platform), Part 5 (The Unified Namespace). We also published our Data Platform Capability map, with even more insights.

NEW! We just launched our ITOT.Academy. Learn the language and architecture of IT and OT to push past “just a POC” in our live online academy.

Discover more and save your seat!

The IT view on Data Lakes and Data Warehouses

Credit where it's due: the concept of a Data Lake and Warehouse comes from our friends in the IT world, so let’s first define them both:

Data Lake

- A repository for storing unstructured data from different sources and in different formats,
- Stores raw data,
- Because no structure has been chosen yet and the data is still raw, a data lake supports more advanced analytics use cases (for which you often want the raw, unprocessed data).

Data Warehouse

- A repository for storing structured data from different sources,
- Stores processed data (for example: aggregations),
- Clearly defined data schema per use case (managing the data schemas is a challenge of its own),
- Because of the predefined schemas, a data warehouse can be queried very efficiently, making it a good data source for Business Intelligence tools,
- Because the Data Warehouse typically holds processed (not raw) data, you might lose some fidelity/granularity (for example: only a sum or count of certain records, not the individual ones), making it less suitable for (advanced) data science projects.
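
The loss of granularity mentioned in the last point can be illustrated with a small sketch. The records, the product names and the `aggregate` helper below are made up for illustration; they only show the general idea of keeping a sum and a count instead of the individual records.

```python
from collections import defaultdict

# Made-up raw records: (product, quantity) for each individual production run.
raw_records = [
    ("A", 1200),
    ("A", 300),
    ("B", 800),
    ("A", 1500),
]


def aggregate(records):
    """Reduce raw records to one row per product: run count and total quantity."""
    summary = defaultdict(lambda: {"runs": 0, "total_quantity": 0})
    for product, quantity in records:
        summary[product]["runs"] += 1
        summary[product]["total_quantity"] += quantity
    return dict(summary)


# What a warehouse-style table might keep: only the processed view.
warehouse_view = aggregate(raw_records)
print(warehouse_view["A"])

# A question about individual runs can only be answered from the raw records.
smallest_run_a = min(qty for product, qty in raw_records if product == "A")
print(smallest_run_a)
```

Once only `warehouse_view` is kept, the size of the smallest run can no longer be recovered, which is the kind of detail a data science project often needs.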

Let's consider an example using the context of a cookie factory, Sweet Harmony Treats.

In the data warehouse of Sweet Harmony Treats, there's a structured table that contains information about each batch of cookies produced. The table includes columns like "Batch ID", "Date Produced", "Cookie Type", "Ingredients Used", "Quantity Produced" and "Production Line Operator." Because the data is structured, it is easy to look up things like "How many chocolate chip cookies were produced last month?" or "Which operator produced the most batches of oatmeal raisin cookies?"
In the data lake of Sweet Harmony Treats, there's a diverse collection of data stored in its raw and unprocessed form. This includes data from various sources, such as customer feedback from social media comments, sensor readings from ovens, images of different cookie types, and even audio recordings of customer service calls. This unstructured and semi-structured data can be explored and analyzed to gain insights that might not be captured in a traditional structured data warehouse.
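
To make the warehouse side of this example a bit more tangible, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, the column names, the rows and the `cookies_produced` helper are assumptions for illustration; they are not taken from a real Sweet Harmony Treats system.

```python
import sqlite3

# Hypothetical warehouse table, loosely following the columns named above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE batches (
        batch_id TEXT PRIMARY KEY,
        date_produced TEXT,          -- ISO date, for example 2023-09-14
        cookie_type TEXT,
        quantity_produced INTEGER,
        operator TEXT
    )
    """
)
# Made-up illustration data.
conn.executemany(
    "INSERT INTO batches VALUES (?, ?, ?, ?, ?)",
    [
        ("B001", "2023-09-04", "chocolate chip", 1200, "Alice"),
        ("B002", "2023-09-11", "oatmeal raisin", 800, "Bob"),
        ("B003", "2023-09-18", "chocolate chip", 1500, "Bob"),
        ("B004", "2023-10-02", "chocolate chip", 900, "Alice"),
    ],
)


def cookies_produced(cookie_type, month):
    """Total quantity of one cookie type in a given month (format YYYY-MM)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(quantity_produced), 0) FROM batches "
        "WHERE cookie_type = ? AND substr(date_produced, 1, 7) = ?",
        (cookie_type, month),
    ).fetchone()
    return row[0]


print(cookies_produced("chocolate chip", "2023-09"))
```

Because the schema is fixed up front, the question becomes a single query. The raw data in the lake (images, audio, social media comments) has no such table to query, which is exactly the trade-off described above.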

The OT view on Data

Your control system is designed to capture and store sensor (time-series) data generated by industrial processes and equipment. It collects data points at different intervals from various sensors, devices and machinery. That includes measurements like temperature, pressure, flow, voltage and more.

These data points are timestamped and stored, creating a historical record of how the processes have evolved over time. Some values might be stored every second, others every minute, hour or day, or at completely irregular intervals (this irregularity makes it hard to store these values in a traditional relational database at scale).

In most cases the Time Series data is stored in a Process Historian on Level 2 or 3 of the Purdue Model. This piece of software is designed to store and access time series data in a very efficient manner.
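
As a rough illustration of what "storing and accessing time series data" means, the sketch below keeps irregularly spaced samples for one sensor and looks up the last known value at a given moment. The `TagHistory` class and the sample values are hypothetical; a real Process Historian adds compression, interpolation and much more.

```python
from bisect import bisect_right
from datetime import datetime


class TagHistory:
    """Append-only list of (timestamp, value) samples for one sensor tag."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Assumption: samples arrive in chronological order.
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("samples must be appended in time order")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def value_at(self, timestamp):
        """Return the last recorded value at or before `timestamp`, or None."""
        index = bisect_right(self._timestamps, timestamp)
        return self._values[index - 1] if index else None


oven_temp = TagHistory()
oven_temp.append(datetime(2023, 10, 9, 8, 0, 0), 180.2)
oven_temp.append(datetime(2023, 10, 9, 8, 0, 1), 180.4)
oven_temp.append(datetime(2023, 10, 9, 8, 7, 30), 182.1)  # irregular gap

# No sample exists at 08:05, so the last known value is returned.
print(oven_temp.value_at(datetime(2023, 10, 9, 8, 5, 0)))
```

The lookup uses a binary search on the sorted timestamps, so it does not depend on the samples being evenly spaced.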

Let us go back to Sweet Harmony Treats:

To be more precise, we go to the conveyor belt transporting the raw dough through the oven to bake. This is a process that needs to be adjusted very carefully: the belt needs to run at the right speed, and the oven needs the right temperature distribution and air flow. Finally, the cookies need to be (air) cooled: not too slowly, not too quickly. Here is an example of the data the historian would hold to monitor this line:
- Temperature sensors in the heating and cooling section
- Power consumption of the motors
- Gas or electricity consumption of the heating elements
- Humidity at different places in the oven
And many more…

Let us circle back to the definitions of a Data Lake and a Data Warehouse from the previous chapter and compare them with the functionality of the Historian:

The historian stores raw data.
The data it holds is unstructured as no schemas are applied to it (yet).

Well, doesn’t that sound like a Data Lake?

It does, but that view is also a bit short-sighted. The main problem many companies face is that their historian typically only includes sensor data (no quality reports, no pictures from your maintenance engineer, no thermal video feeds from within the oven, et cetera). As a result, a historian is just one of many data sources needed to prepare analyses and reports.

Figure: Time series data (sensor data) explained with product and quality context.

Towards an OT Data Platform

At this point, our time series data is still raw and unstructured.

Production context - such as the start/end of a batch or the product made - might be available to some extent in a Manufacturing Operations Management (MOM/MES) system, but is notoriously difficult to combine with time series data.
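
Conceptually, combining the two means matching each timestamped sample to the batch that was running at that moment. The sketch below shows that idea in its simplest form; the batch records, the sample values and the function names are made up, and real MOM/MES integrations have to deal with clock differences, overlapping batches and missing records.

```python
from datetime import datetime

# Hypothetical batch records as they might come from a MOM/MES system.
batches = [
    {"batch_id": "B001", "product": "chocolate chip",
     "start": datetime(2023, 10, 9, 8, 0), "end": datetime(2023, 10, 9, 9, 0)},
    {"batch_id": "B002", "product": "oatmeal raisin",
     "start": datetime(2023, 10, 9, 9, 15), "end": datetime(2023, 10, 9, 10, 0)},
]

# Made-up oven temperature samples: (timestamp, value).
samples = [
    (datetime(2023, 10, 9, 8, 30), 181.0),
    (datetime(2023, 10, 9, 9, 5), 150.3),   # falls between two batches
    (datetime(2023, 10, 9, 9, 45), 175.8),
]


def batch_for(timestamp):
    """Return the batch running at `timestamp`, or None if no batch was active."""
    for batch in batches:
        if batch["start"] <= timestamp < batch["end"]:
            return batch
    return None


def add_production_context(raw_samples):
    """Attach batch ID and product to every (timestamp, value) sample."""
    enriched = []
    for timestamp, value in raw_samples:
        batch = batch_for(timestamp)
        enriched.append({
            "timestamp": timestamp,
            "value": value,
            "batch_id": batch["batch_id"] if batch else None,
            "product": batch["product"] if batch else None,
        })
    return enriched


enriched = add_production_context(samples)
for row in enriched:
    print(row["timestamp"], row["value"], row["batch_id"], row["product"])
```

The sample taken between the two batches ends up without production context, which is a small version of the gaps you run into in practice.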

Furthermore, as we would like to encourage everybody in our organization to start working with data (often referred to as “Citizen Data Science”), we now come to the conclusion that only a small part of those Citizens understand that L15.B1.T01A.PV refers to a temperature sensor in the baking oven. In most cases the Asset Context is missing as well.
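
Asset Context can be thought of as a lookup from a cryptic tag name to something a person can read. The sketch below is a minimal, hypothetical version: only the tag `L15.B1.T01A.PV` and the fact that it is a temperature sensor in the baking oven come from the text above, while the `AssetContext` fields and the unit are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetContext:
    asset: str
    measurement: str
    unit: str  # assumed; the real unit depends on the installation


# Hypothetical asset context table, keyed by historian tag name.
ASSET_CONTEXT = {
    "L15.B1.T01A.PV": AssetContext(
        asset="Baking oven", measurement="Temperature", unit="°C"
    ),
}


def describe(tag):
    """Return a readable label for a tag, or the raw tag if no context is known."""
    context = ASSET_CONTEXT.get(tag)
    if context is None:
        return tag
    return f"{context.asset} / {context.measurement} ({context.unit})"


print(describe("L15.B1.T01A.PV"))
print(describe("L15.B1.X99.PV"))  # unknown tag: falls back to the raw name
```

Without such a table, every user is left with the fallback case: the raw tag name and nothing else.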

Finally, so far we have assumed that data is available and usable, which is often not the case, as we highlighted in a previous article about AI in Manufacturing. In summary, we discussed:

Figure: Fixing the data problem in manufacturing.

In Part 2 of this article, we will introduce the concept of an Operational Data Platform, talk about different types of context and highlight some of the most important challenges such as Data Governance and Data Quality.

The Operational Data Platform: How the future of OT Data will look like (Part 2)

David Ariens and Willem van Lammeren

October 24, 2023


OUTTAKE
What is a schemaless database (or NoSQL)?
Most of us are familiar with the classic relational database: columns, unique keys and relations between tables. This model of managing data drives most applications, but it is also restrictive and can be hard to scale.
This is where NoSQL comes in. You store the data in a simple key:value format and let the magic happen when you query the data. This approach powers social media platforms, IIoT and big data analytics.
To learn more about NoSQL, we highly recommend reading about and playing around with MongoDB.
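
To give a feel for the idea without installing anything, here is a minimal sketch in plain Python. It is not MongoDB's API: the `documents` list and the `find` helper are made up to show that records in one collection do not need to share the same fields, and that structure is only applied when you query.

```python
import json

# Documents in one collection do not have to share the same fields.
documents = [
    {"type": "sensor_reading", "tag": "oven_temperature", "value": 181.0},
    {"type": "customer_feedback", "text": "Best cookies ever", "stars": 5},
    {"type": "sensor_reading", "tag": "belt_speed", "value": 0.4, "unit": "m/s"},
]


def find(collection, **criteria):
    """Return the documents whose fields match all given key/value criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == expected for key, expected in criteria.items())
    ]


readings = find(documents, type="sensor_reading")
print(json.dumps(readings, indent=2))
```

Documents that lack a queried field simply do not match, so new kinds of records can be added without changing a schema first.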
