
AI Datasets

By Pat Dixon, PE, PMP

President of DPAS (DPAS-INC.com)

Artificial intelligence (AI) is not just a buzzword of the fourth industrial era. It is a real resource being adopted for a multitude of applications. In industry it is still in its infancy, but as it matures, adoption will grow.

One example of the growth of AI is in the power industry. The November/December issue of IEEE Power & Energy Magazine is entitled "Artificial Intelligence and the Power Grid". The issue is filled with examples of AI applications from around the world for monitoring the grid, forecasting load and generation, managing distribution, predictive maintenance, diagnosing faults, and optimization and control. It is impressive to see how far the power industry has advanced in implementing AI for these purposes.

These advances did not come easily. The development of modeling algorithms has gone through many iterations to reach the point of efficient execution and sufficient accuracy. One of the challenges has been overfitting. Neural networks are a common AI modeling approach because of their universal applicability and relative ease of implementation; you do not need to design a model from first principles. That same flexibility is why overfitting can be a challenge. Overfitting is a problem in neural networks and several other modeling techniques in which the algorithm learns the noise and outliers in the data instead of the real relationships. In an article describing an implementation of convolutional neural networks to forecast electricity consumption in China from transformer data, the authors state, "Transformer-based models have a substantial demand on the volume of the input data; for relatively small-scale datasets, they may introduce unnecessary complexity, leading to overfitting."
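
To make the idea concrete, here is a minimal Python sketch (my own illustration with scikit-learn, not taken from the article) of how overfitting shows up: an oversized network scores well on the data it was trained on but poorly on data it has never seen. The dataset and model sizes are invented for the example.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    # Hypothetical illustration: a small, noisy dataset and an oversized network.
    # All names and sizes here are invented for the example.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=60)   # true signal plus noise

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    big_net = MLPRegressor(hidden_layer_sizes=(200, 200), max_iter=5000,
                           random_state=0).fit(X_train, y_train)

    # A large gap between the training score and the held-out score suggests
    # the model has learned the noise rather than the underlying relationship.
    print("train R^2:", r2_score(y_train, big_net.predict(X_train)))
    print("test  R^2:", r2_score(y_test, big_net.predict(X_test)))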

This can be addressed by incorporating first principles into an AI model. In an article describing an implementation in Denmark, the authors introduce physics-informed neural networks, which "learn directly from the equations describing the physical system." In addition to improving accuracy by mitigating overfitting while still using measurement data, these models are much more computationally efficient.
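
As a rough illustration of the physics-informed idea, the sketch below (in Python with PyTorch, and not the Danish implementation described in the article) combines a data-fit loss with a penalty for violating a governing equation; a simple first-order decay equation stands in for the real plant physics.

    import torch
    import torch.nn as nn

    # Hypothetical illustration of a physics-informed loss for a first-order
    # decay process du/dt = -k*u, where k is known from first principles.
    # This is a generic sketch, not the implementation described in the article.
    k = 0.5
    net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

    # A handful of noisy measurements stands in for sparse plant data.
    t_data = torch.tensor([[0.0], [1.0], [2.0]])
    u_data = torch.exp(-k * t_data) + 0.02 * torch.randn_like(t_data)

    # Collocation points where only the governing equation is enforced.
    t_phys = torch.linspace(0.0, 4.0, 50).reshape(-1, 1).requires_grad_(True)

    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(5000):
        opt.zero_grad()
        loss_data = ((net(t_data) - u_data) ** 2).mean()       # fit the measurements
        u = net(t_phys)
        du_dt = torch.autograd.grad(u, t_phys, torch.ones_like(u),
                                    create_graph=True)[0]
        loss_phys = ((du_dt + k * u) ** 2).mean()               # penalize ODE violation
        (loss_data + loss_phys).backward()
        opt.step()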

Even when you have a sufficiently accurate model with efficient execution, it can be difficult to interpret the results. AI models tend to be black boxes: if a model is derived exclusively from data, with no insight into its internal structure, it is a black box. In an article about virtual power plants in China, the authors state, "AI interpretability is one of the key requirements for industrial AI applications. For example, when using AI to assist users in energy management at the edge, it is vital to explain whether the optimization solution offered by the model adheres to the production rules of the industrial assembly line and matches the user's power usage habits. Only when AI solutions can provide a clear explanation of the reasoning behind their results can a credible and persuasive energy management plan be established." Another article, about South Korean AI applications, states, "Complex neural networks are often considered as hidden mechanisms, making it challenging to interpret their internal workings."

One technique for opening the AI black box is sensitivity analysis. If you hold all of a model's inputs fixed and move just one across its range from lowest to highest, you can generate a plot showing the relationship between that input and the model output. This is very helpful for identifying overfitting in a model, and it can help operations move their process toward an optimal condition. However, if there are correlated inputs, this analysis can be misleading.
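
A minimal one-at-a-time sweep might look like the following sketch; the function and model interface are assumptions for illustration, not any particular product's implementation.

    import numpy as np

    def sensitivity_sweep(model, X, column, n_points=25):
        """One-at-a-time sensitivity: hold every input at its median value and
        sweep a single column from its observed minimum to its maximum."""
        base = np.median(X, axis=0)
        sweep = np.linspace(X[:, column].min(), X[:, column].max(), n_points)
        grid = np.tile(base, (n_points, 1))
        grid[:, column] = sweep
        return sweep, model.predict(grid)

    # Usage, assuming a fitted scikit-learn style model and a 2-D input array X:
    # x_vals, y_vals = sensitivity_sweep(model, X, column=2)
    # An implausible or erratic curve often points to overfitting; a curve that
    # matches first-principles expectations builds confidence in the model.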

Another challenge in many industrial AI applications is the difficulty of obtaining a good dataset. The editor's column in this issue describes the history of AI in industry, including "the difficulty of creating datasets suitable for the learning problems." As we know, our data has noise, outliers, and process dynamics. If this data is not pre-processed properly, your model will learn nonsense. It is also challenging for a dataset to reflect enough process excitation to capture significant change. Some processes are single-setpoint dominant, meaning that when they are operating they run the same way every day. Even processes with grade changes may not have step changes large enough to produce sufficient excitation in the data to learn from. We do not operate our processes with the intention of creating statistically significant datasets.

That is why I developed a dataset generator, which was the subject of a paper I presented at TappiCon 2023. The purpose of the dataset generator is to address scenarios such as:

  • A student or young engineer who wants to learn data analytics
  • An instructor who wants to teach students and young engineers data analytics
  • A vendor that wants to develop a solution using data analytics, test the solution, and demonstrate it
  • An engineer at a manufacturer who wants to develop a methodology for performing data analytics using various techniques

If the use case is to perform data analytics or predictive modeling for an industrial process, then of course the actual data needs to be used; actual process data cannot be replaced with artificially generated data. However, in the use cases considered here, actual process data is not required. The goals of this project were as follows:

  • Provide a general-purpose tool that can be customized and configured to produce a dataset representative of an actual industrial process
  • Reduce the time to generate that dataset from the months required for actual data to hours for simulated data
  • Provide the tool as open source so that it can be obtained, customized, and improved at no cost

With the dataset generator, I can develop AI applications and demonstrate them without asking anyone for their data.
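
A highly simplified sketch of the concept follows; the tag names, relationships, and delays are invented for illustration, and this is not the generator presented at TappiCon 2023.

    import numpy as np
    import pandas as pd

    # A highly simplified sketch of the idea behind a dataset generator:
    # synthesize inputs with some excitation, pass them through an assumed
    # "true" relationship with dead time, and add measurement noise. The tag
    # names, relationship, and delays are invented for illustration.
    rng = np.random.default_rng(42)
    n = 24 * 60                                       # one day of 1-minute samples

    refiner_energy = 100 + np.cumsum(rng.normal(0, 0.5, n))        # slow drift
    freeness = 400 - 0.8 * refiner_energy + rng.normal(0, 5, n)    # correlated input
    dead_time = 45                                    # minutes from refiners to reel

    tensile = np.roll(0.3 * refiner_energy - 0.05 * freeness, dead_time)
    tensile += rng.normal(0, 1.0, n)                  # measurement noise
    tensile[:dead_time] = np.nan                      # the roll wraps; blank it out

    pd.DataFrame({"refiner_energy": refiner_energy,
                  "freeness": freeness,
                  "reel_tensile": tensile}).to_csv("synthetic_mill_data.csv",
                                                   index=False)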

An example of this is an application I have developed in Seeq, a leading data analytics platform. The application automatically performs the dynamic pre-processing so that data from far back in the process (such as the refiners) is aligned with the final product (at the reel). Models can be built very quickly and compared across different techniques, and sensitivity plots are generated automatically so that overfitting and first-principles relationships can be analyzed. The dataset generator made it possible for me to build, test, and demonstrate this Seeq application with a sufficient dataset at low cost.
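
The lag-alignment idea behind that pre-processing can be sketched generically in a few lines of Python; this is not the Seeq implementation, and the tag names, delays, and file carry over from the illustration above.

    import pandas as pd

    # A generic sketch of the dynamic pre-processing step: shift upstream tags
    # forward by their transport/residence delay so each row lines up with the
    # reel measurement it actually influenced. Tag names, delays, and the file
    # are assumptions from the earlier sketch, not the Seeq implementation.
    df = pd.read_csv("synthetic_mill_data.csv")

    lags = {"refiner_energy": 45, "freeness": 30}     # minutes upstream of the reel
    aligned = pd.DataFrame({"reel_tensile": df["reel_tensile"]})
    for tag, lag in lags.items():
        aligned[tag] = df[tag].shift(lag)             # move upstream data forward

    aligned = aligned.dropna()                        # drop rows with no aligned inputs
    # 'aligned' can now be handed to any regression technique for comparison.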

The last challenge is the quality and availability of data. We are often dependent on lab results in our datasets, and some lab tests have inherent variability that can exceed that of process instrumentation. In the paper industry we often lack measurements that are significant for developing a model. An example is freeness. Lab samples taken once a shift will not capture the swings that explain production problems on a machine, and freeness testing in a lab has significant variability even when a standard procedure is carefully followed. There are also measurements, such as crill, that can have a huge impact on modeling paper strength but are not available in most mills. Today there are fiber analyzers that can provide very important information about the furnish, making AI models more feasible.

The power industry has been impressive in its adoption of AI. The paper industry can do the same with the right tools and expertise.



 

