Big data for engineers and scientists, part 2: Analyzing, processing, and creating models

Read “Big data for engineers and scientists, part 1: Introduction” here.

There is a wealth of information and value contained in the enormous sets of data collected from smart sensors embedded in items such as measuring instruments, manufacturing equipment, medical devices, connected cars, aircraft, and power generation equipment. From this data, you can generate models that not only describe physical phenomena but also predict when to perform maintenance on expensive equipment (such as aircraft), saving money on unnecessary maintenance and unplanned downtime. These models can also be used to forecast when to turn on expensive power generation plants, and they can be embedded into medical devices or vehicles to increase their efficacy and performance.

As you probably realize, these models can differentiate your products and services from those of competitors. But, as with any system, especially one operating in the real world, the data collected from these systems and devices is far from perfect. There can be external influences on the data that need to be understood before an effective model can be created.

As a scientist or engineer, you have the domain expertise and knowledge needed to decipher data. What you may still need is a software analysis and modeling tool that enables you to identify trends in your data, clean and correct dirty data, and determine the most influential signals in large datasets so that you can implement a practical model.

Exploration and processing of large sets of data

Before creating a model or theory from your data, it’s important to understand what is in your data, as it may have a major impact on your final result. Exploring the data typically involves:

  • Identifying slow-moving trends or infrequent events spread across your data that are important to take into account in your theory or model.
  • Finding bad or missing data that needs to be cleaned before a valid model or theory can be established.
  • Deriving additional information for use in later analysis and model creation.
  • Finding the data that is most relevant for your theory or model.

Let’s look at some of the capabilities that can help you easily explore and understand data, even if it is too big to fit into the memory of your desktop workstation.


Summary visualizations, such as the binScatterPlot shown below, provide a way to view patterns in large datasets quickly. The binScatterPlot uses changes in color intensity to highlight areas with greater concentrations of data points. A slider control for adjusting the color intensity lets you explore large datasets interactively.


[Figure 1 | binScatterPlot in MATLAB. Copyright: © 1984–2017 The MathWorks, Inc.]
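
The idea behind a binned scatter plot can be sketched in a few lines of Python. This is an illustration only, not MATLAB's binScatterPlot implementation: the function names `bin_scatter` and `render`, the grid size, and the randomly generated points are all assumptions made for the example. Each point is assigned to a grid cell, and only the per-cell counts are kept, so the amount of data to draw depends on the number of bins rather than the number of points.

```python
import random
from collections import Counter

def bin_scatter(points, x_range, y_range, bins):
    """Count how many (x, y) points fall into each cell of a bins x bins grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted window
        # Clamp so that a point exactly on the upper edge lands in the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[(row, col)] += 1
    return counts

def render(counts, bins, shades=" .:*#"):
    """Draw the grid as text; denser cells get darker characters."""
    peak = max(counts.values(), default=1)
    lines = []
    for row in reversed(range(bins)):  # larger y values at the top
        line = ""
        for col in range(bins):
            level = counts.get((row, col), 0) * (len(shades) - 1) // peak
            line += shades[level]
        lines.append(line)
    return "\n".join(lines)

# Hypothetical data: 100,000 points drawn from a 2-D normal distribution.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
counts = bin_scatter(points, (-4, 4), (-4, 4), bins=20)
print(render(counts, bins=20))
```

Because `bin_scatter` reads its points one at a time, it could also be fed from a file or a stream instead of a list, which is one way such a summary can be built for data that does not fit in memory.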

Data cleansing

Nearly all real-world data contains outliers or bad and missing entries. These entries need to be removed or replaced before you can properly understand or interpret the data. Having a way to clean data programmatically also gives you a repeatable method for handling new data as it’s collected and stored.

[Figure 2 | Example of filtering big data with MATLAB. Copyright: © 1984–2017 The MathWorks, Inc.]
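
As a rough illustration of programmatic cleaning, the Python sketch below replaces missing entries and outliers with the median of the valid readings. The function name `clean_signal`, the choice of median replacement, the outlier rule (a modified z-score based on the median absolute deviation), and the threshold of 3.5 are assumptions for this example; the right rule depends on your data and domain knowledge.

```python
import math
import statistics

def clean_signal(values, threshold=3.5):
    """Replace missing entries and outliers with the median of the valid data.

    A value counts as an outlier when its modified z-score, computed from the
    median absolute deviation (MAD), exceeds `threshold`.
    """
    valid = [v for v in values if v is not None and not math.isnan(v)]
    median = statistics.median(valid)  # raises StatisticsError if nothing is valid
    mad = statistics.median(abs(v - median) for v in valid)

    def is_bad(v):
        if v is None or math.isnan(v):
            return True  # missing entry
        if mad == 0:
            return False  # at least half the valid values equal the median
        return 0.6745 * abs(v - median) / mad > threshold

    return [median if is_bad(v) else v for v in values]

# Hypothetical temperature readings with two missing entries and one spike.
readings = [20.1, 20.3, None, 19.8, 250.0, 20.0, float("nan"), 20.2]
print(clean_signal(readings))
```

Wrapping the rule in a function means the same cleaning step can be rerun unchanged whenever new data arrives.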

Data reduction

The large number of signals collected from your systems can make it difficult to find important trends and behaviors in your data. Much of the data may not be correlated with the behavior you are looking to predict or model. Calculating correlations across your data and applying techniques such as principal component analysis (PCA) allow you to reduce your data to only those signals that most influence the behavior you are modeling. By reducing the number of inputs to your model, you create a more compact model that requires less processing when it is embedded into your product or integrated within a service application.
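
The correlation step can be sketched as follows. This example ranks signals by the strength of their linear (Pearson) correlation with the quantity being modeled; it does not implement PCA, which is usually done with a numerical library. The sensor names and values are invented for illustration, and `pearson` and `rank_signals` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_signals(signals, target):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(values, target) for name, values in signals.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical sensor signals and a hypothetical wear measurement.
signals = {
    "vibration": [0.1, 0.4, 0.5, 0.9, 1.2],
    "pressure": [5.0, 4.1, 3.9, 3.0, 2.2],
    "ambient_temp": [21.0, 19.5, 22.0, 20.5, 21.5],
}
wear = [1.0, 2.0, 2.5, 4.0, 5.0]

for name, r in rank_signals(signals, wear):
    print(f"{name}: {r:+.2f}")
```

Signals that end up near the bottom of the ranking are candidates for removal from the model inputs, although a weak linear correlation alone does not prove a signal is irrelevant.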

Data processing at scale

As an engineer or scientist, you may find that you are most efficient when working on your local desktop workstation using tools you are familiar with. However, working efficiently with big data requires a software analysis and modeling tool that not only works with large sets of data on your desktop workstation, but also allows you to run your analysis pipeline or algorithms on an enterprise-class system. The ability to move between systems without changing your code greatly increases efficiency.
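
A small, single-machine analogy of this principle is an analysis function written against a generic stream of values, so that the calling code is the same whether the data sits in memory or is read piece by piece from a much larger source. The sketch below is only that analogy, not a distributed implementation; `running_stats` and `read_values` are hypothetical names, and the statistics are computed in one pass with Welford's algorithm.

```python
import io

def running_stats(values):
    """Single-pass count, mean, and sample variance (Welford's algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

def read_values(lines):
    """Lazily yield one float per non-empty line, so the source is never fully loaded."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)

# Small dataset held in memory.
print(running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))

# The same call over a stream; an open file object would work the same way.
stream = io.StringIO("2.0\n4.0\n4.0\n4.0\n5.0\n5.0\n7.0\n9.0\n")
print(running_stats(read_values(stream)))
```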

Creating models

Assume you have collected months or even years’ worth of data. What is it that is so valuable in this data? In the case of Baker Hughes, information from temperature, pressure, vibration, and other sensors was collected over the lifetimes of many pumps. This data was analyzed to determine which signals in the data had the strongest influence on equipment wear-and-tear. This step included performing Fourier transforms and spectral analysis, as well as filtering out large disturbances to better detect the smaller vibrations of the valves and valve seats.
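
To give a feel for this kind of spectral analysis, the Python sketch below computes a naive discrete Fourier transform of a synthetic signal and reports its strongest frequency. The signal, sample rate, and function names are invented for illustration and are not taken from the Baker Hughes work. Restricting the search to higher frequencies with `min_freq` stands in, very loosely, for filtering out a large low-frequency disturbance so that a smaller vibration becomes visible.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(n^2))."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total) / n)
    return mags

def dominant_frequency(samples, sample_rate, min_freq=0.0):
    """Frequency (Hz) of the strongest bin above min_freq, skipping the DC bin."""
    n = len(samples)
    mags = dft_magnitudes(samples)
    candidates = [k for k in range(1, len(mags)) if k * sample_rate / n > min_freq]
    peak_bin = max(candidates, key=lambda k: mags[k])
    return peak_bin * sample_rate / n

# Hypothetical signal: a large 5 Hz disturbance plus a small 40 Hz vibration.
sample_rate = 200  # samples per second
signal = [
    1.0 * math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.2 * math.sin(2 * math.pi * 40 * t / sample_rate)
    for t in range(200)
]

print(dominant_frequency(signal, sample_rate))               # strongest component overall
print(dominant_frequency(signal, sample_rate, min_freq=10))  # strongest above 10 Hz
```

For real sensor data, a fast Fourier transform from a numerical library would replace the quadratic-time loop used here.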

The engineers discovered that data captured from pressure, vibration, and timing sensors allowed them to accurately predict machine failures. Machine learning was used to create the models that eventually predicted actual failures from these large sets of data. Machine learning is commonly used in these situations because of the large number of observations (samples) and the possibility of many variables (sensor readings/machine data) being present in the data.

Machine learning techniques use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. This ability to train models using the data itself opens up a broad spectrum of use cases for predictive modeling, such as predictive health for complex machinery and systems, physical and natural behaviors, energy load forecasting, and financial credit scoring.

Machine learning is broadly divided into two types of learning methods: supervised and unsupervised learning. Each contains several algorithms tailored for different problems.

[Figure 3 | The two types of machine learning methods provide different algorithms tailored for different problems. Copyright: © 1984–2017 The MathWorks, Inc.]

Supervised learning

Supervised learning is a type of machine learning that uses a training dataset that maps input data to previously known response values. From this training dataset, the supervised learning algorithm seeks to build a model that can predict the response values for a new dataset. Using this technique with large training datasets often yields models with high predictive power that generalize well to new datasets. Supervised learning includes two categories of algorithms:

  • Classification: For response values where the data can be separated into individual “classes.”
  • Regression: For prediction when continuous response values are desired.
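
The two categories can be illustrated with deliberately simple algorithms in Python. The sketch below uses a least-squares line for regression and a nearest-centroid rule for classification; these were chosen for brevity and are not the methods used in the examples discussed in this article. All data values, labels, and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_centroids(features, labels):
    """Classification training: average the feature vectors of each class."""
    sums, counts = {}, {}
    for vector, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, vector):
    """Predict the class whose centroid is closest to the new feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Regression: hypothetical usage (x) against a continuous wear measurement (y).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope * 4 + intercept)  # predicted wear for a new input x = 4

# Classification: hypothetical (vibration, pressure) readings with known labels.
features = [(0.1, 5.0), (0.2, 4.8), (0.9, 2.1), (1.1, 1.9)]
labels = ["healthy", "healthy", "failing", "failing"]
centroids = fit_centroids(features, labels)
print(classify(centroids, (0.95, 2.2)))
```

In both cases the model is built from inputs with known responses and then applied to an input it has not seen before, which is the defining pattern of supervised learning.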

Unsupervised learning

Unsupervised learning is a type of machine learning used to draw inferences from datasets with input data that does not map to a known output response.

  • Cluster analysis: A common unsupervised learning method used to find hidden patterns or groupings in data.
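
One widely used clustering algorithm is k-means, sketched below in Python as an example of cluster analysis. The data points, the choice of two clusters, the fixed iteration count, and the function name `k_means` are assumptions made for illustration. No labels are supplied; the algorithm groups points purely by how close they are to each other.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Basic k-means: returns the final centroids and the clusters from the last assignment."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical two-dimensional readings that form two visible groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The number of clusters has to be chosen in advance with this method, so in practice it is common to try several values of `k` and compare the results.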

Using models

To truly take advantage of the value of big data, you must be able to incorporate the models and insight gained from the data into your products, services, or operations.

A direct path from development to the integration of an algorithm or predictive model into a device, vehicle, IT system, or web-based service allows you to better adapt to changing environmental or business conditions and to address market needs more effectively.

There are numerous applications for which analytics and predictive models are being developed by engineers and scientists; these applications dictate whether you need to integrate your model with an enterprise IT application, use it as part of an IoT system, or incorporate it within an embedded system for local processing or for reducing the amount of data sent to a centralized analytics platform.

  • Connected cars: Large amounts of real-world driving data are used to develop and implement algorithms for use within embedded systems to support driver-assist and self-driving capabilities.
  • Manufacturing and engineering operations: Sensors on machinery are providing up-to-the-second information on the health and operation of refining, energy production, and manufacturing systems. This data is used to optimize the operation, yields, and up-time of these systems and requires integration as part of an enterprise IT application.
  • Design and reliability engineering: Data is being captured from aircraft under test and real-world flight conditions and from mobile and medical devices. This data is being used by engineering and operations groups to improve the reliability, performance, and capabilities of these devices and systems.

Big data has the potential to greatly enhance your products, services, and operations. But you need a software analysis and modeling tool that allows you to explore, process, and create models with big data using familiar syntax and functions, while also providing the ability to integrate these models and insights directly into your products, systems, or operations. Tools like MATLAB that provide scalability and efficiency enable you, as a domain expert, to act as a data scientist while giving your company a competitive advantage in the global marketplace.

As product marketing manager at MathWorks, Dave Oswill works with customers in developing and deploying analytics along with the wide variety of data management and business application technologies in use today.