Big data for engineers and scientists, part 1: Introduction
Working with big data is fast becoming a key step in the process of scientific discovery and engineering. This is happening as technologies such as smart sensors and the Internet of Things (IoT) are enabling the collection of vast amounts of detailed data from scientific instruments, manufacturing systems, connected cars, and aircraft.
There is significant value to this data, as it may show important physical phenomena or provide information on the operating environment, efficiency, and health of a system. With the proper tools and techniques, this data can be used to make rapid scientific discoveries and develop and incorporate more intelligence into your products, services, and manufacturing processes. This can differentiate your company with better performing products or services, as well as help in conforming to regulatory requirements (as in the case of meeting engine fuel efficiency standards or providing assisted driving capabilities).
Gaining access to and working with this data may sound like an intriguing yet daunting task. Because of its value and size, the data is commonly stored and managed in large file shares, databases, or big data systems such as Hadoop, and processed with frameworks such as Spark. Not too long ago, applying advanced techniques such as machine learning to large sets of data required computer scientists with experience in IT systems to work alongside engineering and scientific experts. This team would jointly support a workflow that includes:
- Accessing big data in files, databases, or the Hadoop Distributed File System (HDFS)
- Exploring, processing, and analyzing this data on specialized compute clusters
- Creating algorithms for use in embedded systems, business applications, and other services
Today, software analysis and modeling tools such as MATLAB have been enhanced with new capabilities for working with big data. This lets engineers and scientists, who already have the domain knowledge and experience, make design and business decisions with this data themselves. They can access the data wherever it is stored and work with it using familiar syntax and functions.
Let’s look at an actual example of how engineers are using big data. Engineers at Baker Hughes, a provider of services to oil and gas operators, needed to develop a predictive maintenance system to reduce pump equipment costs and downtime on their oil and gas extraction trucks. If a truck at an active site has a pump failure, Baker Hughes must immediately replace the truck to ensure continuous operation. Keeping spare trucks at each site costs the company tens of millions of dollars in revenue that those trucks could generate if they were in active use at another site. The inability to accurately predict when valves and pumps will require maintenance drives other costs as well. Too-frequent maintenance wastes effort and results in parts being replaced while they are still usable, while too-infrequent maintenance risks damaging pumps beyond repair.
Terabytes of data were collected from the oil and gas extraction trucks, and this data was used to develop an application that predicts when equipment needs maintenance or replacement. MATLAB provided the engineers at Baker Hughes with the functionality needed for developing predictive models and combining multiple kinds of data, including sensor data from a proprietary file format, into one analysis application.
Accessing large sets of data
The first challenge in working with big data is determining how to access large data sets, as they come in many different forms and are stored in various types of systems.
Many big engineering and scientific data sets consist of a large number of small or medium-sized files, although individual files are growing and increasingly do not fit into the memory of a single computer. These files typically reside within one or more directories on a shared drive and may consist of delimited text, spreadsheets, images, videos, and various proprietary formats.
A wide range of database types is used to store and manage big sets of data:
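One common way to handle a directory of files that together exceed available memory is to stream through them one record at a time and keep only running totals. The following is a minimal Python sketch of that idea, not the approach used by any particular tool. The function name `streaming_mean`, the CSV layout, and the column name passed in are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def streaming_mean(directory, column, pattern="*.csv"):
    """Return the mean of `column` across all matching files in `directory`.

    Rows are read one at a time, so memory use does not grow with the
    number or size of the files. (Hypothetical helper for illustration.)
    """
    total = 0.0
    count = 0
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                value = row.get(column)
                if not value:  # skip missing or empty readings
                    continue
                total += float(value)
                count += 1
    if count == 0:
        raise ValueError(f"no values found for column {column!r}")
    return total / count
```

The same pattern extends to other statistics that can be updated incrementally, such as minimum, maximum, or counts per category.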
- Relational (SQL): Widely used for business applications, popular among IT developers.
- Data Warehouse: Based upon relational (SQL) databases; houses business-critical data and provides analytical capabilities and fast access for business-critical applications.
- NoSQL: Optimized for data that doesn’t fit into relational databases.
- Data Historians: Optimized for time-based, production, and process data that is commonly collected from industrial equipment.
- IoT Data Aggregators: Cloud-based services that aggregate time series data from connected sensors and devices. These services are typically accessed via web service calls.
Hadoop is a system for storing and processing big data sets based upon distributed computing and storage principles. It consists of two major subsystems that coexist on a cluster of compute servers:
- HDFS: A large, failure-resistant file system referred to as the Hadoop Distributed File System.
- YARN: Manages applications that run on Hadoop, including batch processing frameworks, such as MapReduce and Spark, and SQL interfaces, such as Hive and Impala.
To efficiently capture the benefits of big data, engineers and scientists need a scalable tool, such as MATLAB, that provides access to the wide variety of systems and formats used to store and manage data. This is especially important when more than one type of system and format is in use. Sensor or image data stored in files on a shared drive may need to be combined with metadata stored in a database. In the case of Baker Hughes, data in many different formats had to be used together to understand the behavior of the system and develop a predictive model.
[Figure 1 | Access a wide range of big data. Copyright: © 1984–2017 The MathWorks, Inc.]
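Combining sensor data held in files with metadata held in a database can be sketched in a few lines. The Python example below is only an illustration under stated assumptions: the `equipment` table, its columns (`equipment_id`, `model`, `site`), the CSV layout, and the function name `attach_metadata` are all hypothetical, and SQLite stands in for whatever database is actually in use. It also assumes equipment identifiers are stored as text in both places.

```python
import csv
import sqlite3


def attach_metadata(csv_path, connection):
    """Yield sensor rows from a CSV file, each merged with equipment metadata.

    `connection` is an open sqlite3 connection containing a hypothetical
    `equipment` table with columns equipment_id, model, and site.
    """
    # Load the (small) metadata table into a dictionary keyed by equipment id.
    metadata = {
        equipment_id: {"model": model, "site": site}
        for equipment_id, model, site in connection.execute(
            "SELECT equipment_id, model, site FROM equipment"
        )
    }
    # Stream the (large) sensor file and attach metadata row by row.
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            info = metadata.get(row["equipment_id"])
            if info is None:  # no matching metadata: skip the reading
                continue
            yield {**row, **info}
```

Because the function is a generator, the sensor file is never loaded in full. Only the metadata table, which is assumed to be small, is held in memory.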
The ability to work with big data is fast becoming an important aspect of scientific discovery and engineering. These data sets contain valuable information that can help differentiate your products and services. As a scientist or engineer, you have the domain knowledge and experience to make design and business decisions with this data, but you may need a software analysis and modeling tool that is easy to work with. Tools such as MATLAB offer scalability and efficiency and can give your company a competitive advantage in the global marketplace.