The need to manipulate large volumes of data for informed decision-making has challenged industry for years. Technologies such as BI, data warehouses, big data, and Data Lakes all share the same goal: store data over time and process it so that organizations can drive their businesses more effectively and improve their performance.
It is tempting to centralize all business data, regardless of its source (sales, marketing, R&D, or production), in a single business Data Lake. This approach is appealing for many reasons, including consolidation in a single information system and the possibilities it opens for comparison and standardization.
Data Lakes provide operational staff and data scientists with quick access to massive amounts of data. Yet many industrial data initiatives run into obstacles, including incomplete data, lack of structure, and insufficient performance. This article highlights the particularities of industrial data and their implications for implementing a Data Lake to use it.
Data from industrial production processes is relatively varied, but there are structural consistencies. Industry data includes:
Handling methods vary with the type of data.
Note that this data is very different from the data generated by other departments in the business, which is mostly transactional.
Storing time series or traceability data, for example, requires specific approaches to meet performance, cost, and usage requirements.
For time series, it must be possible to process queries over long periods with potentially fine-grained data: one year of one-minute samples for a single sensor already represents 525,600 points. High volumes must be processed in a short time, with storage efficient enough to limit the space occupied.
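As a minimal sketch of the orders of magnitude involved, the following Python example (using pandas, with simulated sensor values) shows the arithmetic behind that figure and a typical downsampling step that trades raw resolution for compact, query-friendly aggregates:

```python
import pandas as pd
import numpy as np

# One year of one-minute samples for a single sensor:
points_per_year = 365 * 24 * 60  # 525,600 points

# Simulated sensor signal at one-minute resolution.
idx = pd.date_range("2023-01-01", periods=points_per_year, freq="min")
series = pd.Series(
    np.random.default_rng(0).normal(100.0, 5.0, len(idx)), index=idx
)

# Downsampling to hourly min/max/mean divides the volume by 60
# while preserving the aggregates most dashboards actually query.
hourly = series.resample("1h").agg(["min", "max", "mean"])
print(len(hourly))  # 8,760 rows instead of 525,600
```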
Specialized time-series databases have been developed for this purpose, such as TimescaleDB, InfluxDB, KDB+, OpenTSDB (HBase/Hadoop), QuasarDB, Warp 10, Azure Time Series Insights, and AWS Timestream. The field offers a vast choice, yet picking the right time-series database is not easy: the choice largely depends on the intended uses.
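To give a flavor of the data model these databases share, here is a hedged sketch of writing one measurement with the influxdb-client Python package, assuming an InfluxDB 2.x instance; the URL, token, organization, and bucket names are placeholders, not values from this article:

```python
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection parameters are illustrative placeholders.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One reading: sensor identity goes in tags (indexed), the value in a
# field, and the timestamp is explicit.
point = (
    Point("temperature")
    .tag("sensor_id", "TT-101")
    .field("value", 72.4)
    .time(datetime.now(timezone.utc))
)
write_api.write(bucket="process-data", record=point)
client.close()
```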
For traceability data, what matters most is the ability to search efficiently for the requested elements and to rebuild trees and relationships. That is what a solid relational database is usually expected to deliver.
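Rebuilding such trees is a textbook relational task. As an illustration only, this sketch uses Python's built-in sqlite3 and a hypothetical two-column genealogy schema to reconstruct the full lineage of a finished lot with a recursive query:

```python
import sqlite3

# Hypothetical traceability schema: each batch records the batch it was
# produced from, forming a parent/child genealogy tree.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch (id TEXT PRIMARY KEY, parent_id TEXT);
    INSERT INTO batch VALUES
        ('RAW-1', NULL),
        ('MIX-7', 'RAW-1'),
        ('LOT-42', 'MIX-7');
""")

# A recursive CTE walks the tree upward to rebuild the genealogy of a
# finished lot -- the kind of query relational databases handle well.
rows = conn.execute("""
    WITH RECURSIVE lineage(id, parent_id, depth) AS (
        SELECT id, parent_id, 0 FROM batch WHERE id = 'LOT-42'
        UNION ALL
        SELECT b.id, b.parent_id, l.depth + 1
        FROM batch b JOIN lineage l ON b.id = l.parent_id
    )
    SELECT id, depth FROM lineage ORDER BY depth;
""").fetchall()
print(rows)  # [('LOT-42', 0), ('MIX-7', 1), ('RAW-1', 2)]
```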
Clearly, a hybrid storage strategy, chosen by data type, is necessary to strike the right compromise between features, performance, and cost.
Finally, we have our lake full of data. But now we are in danger of drowning. One of the first priorities is to structure the data by linking it to a business context. There are several different approaches, some of which encourage working with fairly unstructured data. We recommend structuring data early in the process, as soon as it is stored. Linking data to a business context makes life easier for all the Data Lake's users. It also continuously raises the level of information, and in turn, what can be extracted from the data.
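What "linking a business context" can look like in practice is sketched below; the registry, tag names, and fields are hypothetical, but the idea is simply to enrich each raw historian tag with its asset hierarchy, unit, and description at ingestion time:

```python
from dataclasses import dataclass, asdict

# Hypothetical business-context registry: raw historian tags are mapped
# to an asset hierarchy, a physical unit, and a human-readable label.
@dataclass(frozen=True)
class TagContext:
    site: str
    line: str
    equipment: str
    unit: str
    description: str

CONTEXT = {
    "TT-101": TagContext("Lyon", "Line 2", "Reactor R1", "degC", "Jacket temperature"),
    "FT-230": TagContext("Lyon", "Line 2", "Reactor R1", "m3/h", "Feed flow"),
}

def contextualize(tag: str, value: float) -> dict:
    """Enrich a raw (tag, value) pair with its business context at ingestion."""
    return {"tag": tag, "value": value, **asdict(CONTEXT[tag])}

print(contextualize("TT-101", 72.4))
```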
But that is still not enough to extract the required information. Cross-referencing data types and transforming them makes it possible to build the information needed. Industrial process data calls for a relatively common set of treatments:
These transformations are often specific to production in the process industries. Getting correct results requires handling many subtleties, which can make a Data Lake very complex to build and use. As with data storage, performance must also be a priority to give users a smooth experience.
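One such common treatment is aligning sensors sampled at different, irregular rates onto a shared time grid before they can be joined and compared. A minimal sketch with pandas, on illustrative data:

```python
import pandas as pd

# Two sensors sampled at different, irregular rates (illustrative data).
temp = pd.Series(
    [71.8, 72.4, 73.1],
    index=pd.to_datetime(
        ["2023-05-01 08:00:07", "2023-05-01 08:01:02", "2023-05-01 08:02:11"]
    ),
)
flow = pd.Series(
    [12.0, 12.5],
    index=pd.to_datetime(["2023-05-01 08:00:30", "2023-05-01 08:01:45"]),
)

# Project both signals onto a common one-minute grid, carrying the last
# known value forward, so they can be joined column by column.
grid = pd.date_range("2023-05-01 08:00", "2023-05-01 08:02", freq="min")
aligned = pd.DataFrame({
    "temp": temp.reindex(temp.index.union(grid)).ffill().reindex(grid),
    "flow": flow.reindex(flow.index.union(grid)).ffill().reindex(grid),
})
print(aligned)
```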
Standardizing all data from a production site in a single Data Lake is still tempting. There are indeed several advantages:
However, as discussed above, the particularities of production process data can make this approach less attractive:
A “Process Data Lake” specializes in industrial data, with appropriate tools that respond quickly to operational needs.
It is adapted to the storage, processing, and use of industry data. It features:
Some examples show the impact of using a generic Data Lake for industrial data. Several industrial groups have used Hadoop technology to implement a global database, but there are pitfalls.
One of the first problems was managing time series in HBase, the Hadoop database: data must be grouped on a fine-grained time mesh, and pre-calculated temporal aggregates (min, max, average, etc.) must be provided at useful intervals (15 minutes, 1 hour, 1 day, etc.). This makes the calculation logic complex, and the duplication is inefficient for storage. Systems such as HBase are not optimized to compress time-stamped data efficiently.
Even with a specific management overlay such as OpenTSDB, time-series processing suffers from the cost of the many technical layers and the complexity of Hadoop.
The second difficulty was choosing indexes for different uses. This type of column-oriented database is very sensitive to such choices, because the design of the row keys (rowKey) is essential for optimizing queries. When a search uses data characteristics that are not in the rowKey (tags), performance deteriorates significantly. For example, to retrieve data from one sensor over a period of time, a “sensorid-timeStamp” rowKey is ideal; it will prove ineffective for other search types unless duplicate tables are created, each with a rowKey adapted to its query. That, however, increases complexity and storage costs.
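The mechanics can be illustrated without an HBase client at all. This sketch uses a plain Python dict as a stand-in for the table's sorted key space, to show why a “sensorid-timestamp” key makes time-range queries cheap while any search on an attribute outside the key degenerates into a full scan:

```python
# Illustration of rowKey design in a column-oriented store such as HBase.
# Rows are stored sorted by key, so a "sensorid-timestamp" key turns
# "all points for sensor X between t1 and t2" into a cheap range scan.

def row_key(sensor_id: str, epoch_seconds: int) -> str:
    # Zero-padding keeps lexicographic order identical to time order.
    return f"{sensor_id}-{epoch_seconds:010d}"

store = {  # stand-in for the table's sorted key space
    row_key("TT-101", t): v
    for t, v in [(1_600_000_000, 71.8), (1_600_000_060, 72.4)]
}
store[row_key("FT-230", 1_600_000_030)] = 12.0

# Range scan for one sensor: walk keys between two bounds (efficient).
start, stop = row_key("TT-101", 1_600_000_000), row_key("TT-101", 1_600_000_100)
print({k: v for k, v in sorted(store.items()) if start <= k <= stop})

# A query on an attribute NOT in the key (e.g. "all sensors on Line 2
# at time t") must visit every row -- a full-table scan -- which is why
# duplicate tables with a second key layout are often introduced.
```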
In conclusion, the complexity of Hadoop and the extensive custom design needed to ensure quality data processing require significant investment. The cost of operating and maintaining this type of complex architecture is also very high.
Authors: Mathieu Cura and Jean-François Hénon