The ins and outs of embedded databases for the IoT
While many facets of the Internet of Things (IoT) are falling into place, some hurdles still exist for the databases that will be used to manage IoT sensor data. In this roundtable with Christoph Rupp of hamsterdb, Sasan Montaseri of ITTIA, Steve Graves of McObject, and ScaleDB’s Mike Hogan, we explore the factors currently limiting embedded databases, scaling and securing IoT databases, and the available tools and techniques for managing and analyzing sensor inputs from a sea of connected embedded devices.
Where are the current bottlenecks in embedded databases and database management systems (DBMS), particularly as they relate to the IoT?
MONTASERI, ITTIA: An embedded database will face different bottlenecks depending on what type of system it is located on. We see sensors, mobile devices, IoT gateway devices, and embedded systems as important parts of an IoT system, each facing different data management challenges.
For sensors, limited resources such as memory and flash media are the most important bottleneck, since they typically produce data streams that originate from a single source. For IoT gateways, write performance with concurrent read access is important because the device will collect data from a number of sensors or similar devices. For mobile devices, the main bottleneck is the availability of data when there is no connection. For embedded systems, interoperability and maintainability of these subsystems are very important.
GRAVES, McOBJECT: The hurdles for on-device embedded database systems, in many cases, are not hurdles for the DBMS itself so much as limitations of the embedded system (device). For example, while McObject’s eXtremeDB DBMS was written explicitly for embedded systems in 2000 with a focus on high efficiency and a “small footprint,” it still requires at least a 24-bit memory address (a 24-bit pointer) and realistically about 1 MB of RAM. The code size of the core of the eXtremeDB database system is approximately 150 KB, and it will need perhaps 40 KB of RAM at a minimum for the database dictionary and other run-time metadata such as transaction buffers, connection/transaction/object handles, etc. And then you need memory for the stored data itself, or for a cache if it is a persistent database.
A 16-bit system simply cannot address enough memory for a DBMS: its limit is 64 KB. While you might squeeze a DBMS into that space, it wouldn’t leave room for metadata, the application code, etc. On the other hand, a 24-bit pointer can address 16 MB – plenty of room for the DBMS and applications.
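The arithmetic behind these figures can be checked in a few lines. This is only a rough sketch: it plugs in the approximate eXtremeDB numbers quoted above (150 KB of code, 40 KB of run-time metadata) and assumes, for simplicity, that code and metadata share one flat address space. The names `address_space` and `headroom` are illustrative.

```python
KB = 1024
MB = 1024 * KB

# Approximate figures quoted in the answer above.
DBMS_CODE = 150 * KB
DBMS_METADATA = 40 * KB


def address_space(pointer_bits):
    """Number of bytes reachable with a pointer of the given width."""
    return 2 ** pointer_bits


def headroom(pointer_bits):
    """Bytes left for application code and data after the DBMS is placed.

    A negative result means the DBMS alone does not fit.
    """
    return address_space(pointer_bits) - DBMS_CODE - DBMS_METADATA


for bits in (16, 24):
    print(bits, "bit:", address_space(bits) // KB, "KB addressable,",
          headroom(bits) // KB, "KB left over")
```

With a 16-bit pointer the leftover figure is negative; with a 24-bit pointer more than 15 MB remains for the application and its data.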
RUPP, hamsterdb: Collecting sensor data or other data mostly requires storage, but not necessarily a database. Devices with low processing power, in particular, will transfer their data to a server for post-processing and analysis. The bottlenecks are usually I/O write performance or network throughput for transferring the data to a central server. Improving I/O performance is mainly a monetary issue because better devices are more costly.
However, it is often possible to apply strategies that reduce the amount of data without sacrificing data quality, for example by storing only one average value per second instead of many discrete values. Also, sensor data often does not change much over time and therefore can be compressed well (Figure 1, Table 1). Integer compression is not CPU intensive. Even low-cost CPUs can compress millions of integers per second, vastly reducing the storage requirements. With some creativity it is often possible to create tailor-made solutions optimized for specific data patterns.
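Both ideas, per-second averaging and integer compression of slowly changing readings, can be sketched briefly. The example below is an illustration under stated assumptions, not hamsterdb's actual codec: it uses delta encoding followed by zigzag and variable-length byte encoding, one common way to compress integers that stay close to their predecessors. All function names are hypothetical.

```python
from collections import defaultdict


def average_per_second(samples):
    """Collapse (timestamp_in_seconds, value) samples into one mean per whole second."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts)].append(value)
    return [(sec, sum(vals) / len(vals)) for sec, vals in sorted(buckets.items())]


def zigzag(n):
    """Map signed integers to unsigned ones so small negative deltas stay small."""
    return n * 2 if n >= 0 else -n * 2 - 1


def unzigzag(z):
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def compress(values):
    """Delta-encode integers, then store each delta in as few bytes as possible."""
    out = bytearray()
    prev = 0
    for v in values:
        z = zigzag(v - prev)
        prev = v
        while z >= 0x80:                 # 7 payload bits per byte, high bit = "more follows"
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
    return bytes(out)


def decompress(data):
    """Rebuild the original integers from compress() output."""
    values, prev, z, shift = [], 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(z)
            values.append(prev)
            z, shift = 0, 0
    return values


readings = [2150, 2151, 2151, 2149, 2150, 2152]   # e.g. hundredths of a degree
packed = compress(readings)
print(len(packed), "bytes instead of", 4 * len(readings))
```

Because only the first value needs two bytes and every later delta fits in one, the six readings occupy 7 bytes rather than the 24 bytes that fixed 32-bit integers would take.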
Of the popular languages for database development, which is best suited for embedded database deployments in the IoT, and why?
GRAVES: For on-device data management, SQL is probably inappropriate for the vast majority of use cases. We think C/C++ and a DBMS with a fast native API are most appropriate. For embedded systems with sufficient resources, one of the embedded Java machines (such as JamaicaVM from Aicas) could be suitable. SQL will be too resource-intensive. The code size of any SQL implementation is going to be substantially larger than a non-SQL solution – not to be confused with “noSQL” – and consume substantially more CPU cycles for any given unit of work.
On-device embedded database systems will exist primarily to collect data, take some action(s) based on that data, and perform some processing/manipulation of the data. These actions don’t require and wouldn’t benefit from the robustness and complexity of the SQL language. Devices won’t be executing complex, and certainly not ad hoc, queries that involve multiple tables with sophisticated filtering and ordering.
On the other hand, upstream from the devices, DBMSs that are used to collect, aggregate, and otherwise process the profusion of data generated by the IoT will certainly benefit from SQL.
HOGAN, SCALEDB: For backend systems, those that aggregate and process the data (analyze, execute triggers, etc.), much of the challenge is handling the flood of data, unlike data from humans who tweet or post intermittently.
MySQL uses SQL. It is good for online transaction processing (OLTP) use cases, primarily on the backend of the IoT – not the device side, but gateway and backend. Most companies end up using a combination of technologies, for example MySQL for customer/transaction info, NoSQL for fast ingest of the device data, and Hadoop for analytics of the device data. Our technology extends your MySQL infrastructure with fast data, enabling you to eliminate the NoSQL and Hadoop pieces and use MySQL exclusively, which minimizes the expertise, hiring, and number of tools you need and reduces costs dramatically.
RUPP: For those applications that do not require a database with SQL support, the benefits of a key/value store like hamsterdb will be attractive: high performance and low resource requirements. For embedded SQL databases, SQLite is the most obvious choice.
How do current embedded database technologies facilitate the storage and analysis of sensor inputs that could scale from the hundreds or thousands into possibly the millions?
GRAVES: There are many dimensions to managing the massive data sets produced by sensor networks in the Internet of Things. Support for multiple database indexes is a must-have if a DBMS is going to support applications’ varying patterns of data access. At minimum it should offer:
- Hash index for very fast lookup of a specific object by a key (simple or compound)
- B-tree index for pattern matching, range retrieval, and sorted results (the B-tree can be optimized for in-memory data storage)
- R-tree index for geospatial data
- PATRICIA Trie for network communications/telecom systems’ indexing of IP addresses and telephone numbers
- Trigram index for “fuzzy search” use cases
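The difference between the first two index types in this list can be illustrated with a small sketch. It is not how eXtremeDB implements its indexes: a Python dict stands in for the hash index, and a sorted list searched with `bisect` stands in for the ordered access a B-tree provides. The class and field names are hypothetical.

```python
import bisect


class ReadingIndex:
    """Keeps two access paths over the same records."""

    def __init__(self):
        self._by_id = {}        # hash-style index: exact lookup by key
        self._ts_keys = []      # sorted timestamps: ordered, range-friendly index
        self._ts_records = []   # records kept in the same order as _ts_keys

    def insert(self, record):
        self._by_id[record["id"]] = record
        pos = bisect.bisect_right(self._ts_keys, record["ts"])
        self._ts_keys.insert(pos, record["ts"])
        self._ts_records.insert(pos, record)

    def get(self, record_id):
        """Exact-match lookup, the job a hash index does best."""
        return self._by_id.get(record_id)

    def range(self, ts_from, ts_to):
        """All records with ts_from <= ts <= ts_to, returned in timestamp order."""
        lo = bisect.bisect_left(self._ts_keys, ts_from)
        hi = bisect.bisect_right(self._ts_keys, ts_to)
        return self._ts_records[lo:hi]


index = ReadingIndex()
for rid, ts in [("a", 30), ("b", 10), ("c", 20), ("d", 40)]:
    index.insert({"id": rid, "ts": ts})

print(index.get("c"))
print([r["id"] for r in index.range(15, 35)])
```

The hash path answers "give me object c" directly but cannot answer "everything between 15 and 35" without scanning every record, which is why a DBMS that offers only one index type fits only one access pattern well.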
One characteristic of DBMSs that can cause them to bog down at Big Data scale is the depth of the index tree. This can be mitigated by using the hash index. In eXtremeDB, we’ve also modified the B-tree algorithm in order to keep the tree shallower than a conventional B-tree.
Some embedded database systems (such as SQLite) are single-tasking and therefore unable to leverage multiple cores, which are becoming more prevalent in embedded systems. Ideally, a DBMS will be multi-tasking with an optimistic concurrency model that allows embedded systems developers to take full advantage of the resources of the target system.
In some cases, embedded systems engaged in sensor data fusion must give priority to servicing the interrupts that signal the arrival of some data. In a DBMS, the ability to prioritize transactions at run-time accommodates this requirement. Absence of such a feature can mean lost data, for instance when a unit of sensor data is not grabbed before another arrives.
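As a rough illustration of run-time transaction priorities, the sketch below queues pending units of work and always runs the most urgent one first. It is a hypothetical scheduler built on Python's `heapq`, not the eXtremeDB mechanism; the `TransactionQueue` name and the convention that a lower number means higher priority are assumptions.

```python
import heapq
import itertools


class TransactionQueue:
    """Runs queued units of work in priority order (lower number = more urgent)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker: first submitted runs first

    def submit(self, priority, name, work):
        heapq.heappush(self._heap, (priority, next(self._order), name, work))

    def run_all(self):
        """Execute everything that is pending and return the names in run order."""
        done = []
        while self._heap:
            _, _, name, work = heapq.heappop(self._heap)
            work()
            done.append(name)
        return done


stored = []
queue = TransactionQueue()
queue.submit(5, "write-log", lambda: stored.append("log"))
queue.submit(0, "store-sensor-sample", lambda: stored.append("sample"))
queue.submit(5, "update-report", lambda: stored.append("report"))
print(queue.run_all())
```

Although the sensor sample was submitted second, it is stored first, so it is less likely to be overwritten by the next arriving sample.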
RUPP: Expensive operations (like analytical queries) might have to be offloaded to a server. For gathering data and for simple queries, developers can resort to key/value stores, a stripped-down, NoSQL-like approach to databases. Some key/value stores are available as embedded libraries, which avoid the communication overhead of a client/server architecture. These also usually provide a variety of configuration options to optimize for specific use cases.
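To show what an embedded key/value store looks like from the application side, here is a minimal sketch. It uses `dbm.dumb` from the Python standard library purely as a stand-in; it is not the hamsterdb API. The key layout (sensor id plus timestamp) and the helper names are assumptions made for the example.

```python
import dbm.dumb
import os
import struct
import tempfile


def make_key(sensor_id, ts):
    """Pack sensor id and timestamp into a compact 6-byte binary key."""
    return struct.pack(">HI", sensor_id, ts)


def store_and_read_back():
    """Write two readings into a file-backed store and fetch one of them again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "readings")
        # The store runs inside this process: no server and no network round trip.
        with dbm.dumb.open(path, "c") as db:
            db[make_key(7, 1000)] = struct.pack(">f", 21.5)
            db[make_key(7, 1001)] = struct.pack(">f", 21.75)
            raw = db[make_key(7, 1001)]
            count = len(db)
    return struct.unpack(">f", raw)[0], count


print(store_and_read_back())
```

The entire interface is "put bytes under a key, get bytes back", which is why such stores can stay small and fast; anything resembling a query has to be built by the application or pushed to the server.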
I usually recommend performing post-processing on the server. Post-processing logic tends to change frequently as the product evolves or business requirements shift, and will therefore require regular software updates. Deploying updates to IoT devices in the field is far more brittle than deploying to a single server that is directly controlled by the ISV. If the sensor data is too large to transfer to a server, then devices can often apply very simple merge strategies without sacrificing data quality, for example by sending only one averaged value per second instead of many values. Also, data can often be compressed efficiently.
What do you suggest development-wise for database engineers looking to reconcile multiple sensor data formats in the same database environment?
GRAVES: In sensor data fusion, the format of the sensor data is usually known in advance. When it is not, there is a well-known mechanism (SCADA) for discovery. In either case, it is structured data and therefore a DBMS that is designed for structured data is most suitable. That means a DBMS with a data definition (schema) language.
Conversely, DBMSs that are geared toward unstructured data (and are therefore schema-less) are less suitable. Having, and being able to declare, a schema makes formulating queries infinitely easier (with or without SQL) and makes it easier for the DBMS to be an asset instead of an obstacle. Having a schema also enables other high-level features such as event handling. When you have a schema, you can tell the DBMS to “notify me when this field of this object of this class is modified.” Without a schema, the DBMS doesn’t know about fields, and therefore can’t be told to generate a notification when one changes.
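The link between a declared schema and field-level notifications can be sketched in a few lines. This is a toy model rather than any vendor's event API: the `SchemaObject` base class, its `schema` tuple, and the `watch` method are hypothetical names chosen for illustration.

```python
class SchemaObject:
    """Base class whose subclasses declare their fields up front."""

    schema = ()   # declared field names; subclasses override this

    def __init__(self, **values):
        object.__setattr__(self, "_watchers", {})
        for name in self.schema:
            object.__setattr__(self, name, values.get(name))

    def watch(self, field, callback):
        """Ask to be notified when `field` changes. Only declared fields qualify."""
        if field not in self.schema:
            raise KeyError(f"{field!r} is not declared in the schema")
        self._watchers.setdefault(field, []).append(callback)

    def __setattr__(self, name, value):
        if name not in self.schema:
            raise AttributeError(f"{name!r} is not declared in the schema")
        old = getattr(self, name)
        object.__setattr__(self, name, value)
        if old != value:
            for callback in self._watchers.get(name, []):
                callback(name, old, value)


class Thermostat(SchemaObject):
    schema = ("room", "setpoint")


events = []
unit = Thermostat(room="lab", setpoint=20)
unit.watch("setpoint", lambda field, old, new: events.append((field, old, new)))
unit.setpoint = 22
print(events)
```

The notification is possible only because the store knows that `setpoint` exists as a named field; a store that sees each object as an opaque blob has nothing to hang the subscription on.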
How do current database technologies help ensure secure database transactions, and for industries that have certain regulatory obligations, how can databases be architected for compliance?
GRAVES: There are a variety of standards, technologies, and features that an embedded DBMS can incorporate to safeguard transactions and stored data. In the case of eXtremeDB, we have built in cyclic redundancy check (CRC) on the database page level to detect whether unauthorized modification to stored data has occurred; RC4 encryption, which employs a user-provided cipher to prevent access or tampering (and we’ve extended RC4 encryption to safeguard data stored in RAM, which really matters for an in-memory DBMS); and support for SSL in networked DBMS components that are part of clustering and high availability editions.
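A page-level check of this kind can be sketched with the CRC-32 routine in Python's `zlib` module. The page layout here (payload followed by a 4-byte checksum) and the function names are assumptions, not the eXtremeDB on-disk format. Note also that a CRC on its own mainly reveals corruption or naive modification, since anyone able to rewrite the page can recompute it, which is presumably why it is paired with encryption in the answer above.

```python
import zlib


def seal_page(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload to form the stored page."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_page(page: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the value stored in the page."""
    payload, stored = page[:-4], int.from_bytes(page[-4:], "big")
    return zlib.crc32(payload) == stored


page = seal_page(b"sensor=7;value=21.5;" * 10)
print(verify_page(page))                         # page is intact

damaged = bytearray(page)
damaged[5] ^= 0x01                               # flip a single bit in the payload
print(verify_page(bytes(damaged)))               # mismatch is detected
```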
And here again, having a schema is advantageous: a database design can include a table to store an audit trail (with data documenting who changed what data and when). Notifications can be used to validate the data (not allow bogus values in the database). A schema can support this, too, by specifying a range of permitted values, and default values when none are explicitly provided.
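These schema features (permitted ranges, default values, and an audit trail) can be illustrated with SQLite through Python's built-in `sqlite3` module. SQLite is used here only because it is readily available; the table names, the 5 to 35 range, and the `changed_by` column are invented for the example, and because SQLite has no notion of users the application must supply the name itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE setting (
    name       TEXT PRIMARY KEY,
    value      INTEGER NOT NULL DEFAULT 20 CHECK (value BETWEEN 5 AND 35),
    changed_by TEXT NOT NULL
);
CREATE TABLE audit_log (
    name       TEXT,
    old_value  INTEGER,
    new_value  INTEGER,
    changed_by TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Record who changed what, and when, on every update of the value column.
CREATE TRIGGER setting_audit AFTER UPDATE OF value ON setting
BEGIN
    INSERT INTO audit_log (name, old_value, new_value, changed_by)
    VALUES (OLD.name, OLD.value, NEW.value, NEW.changed_by);
END;
""")

# No value supplied, so the schema's default applies.
conn.execute("INSERT INTO setting (name, changed_by) VALUES ('setpoint', 'installer')")
default_value = conn.execute(
    "SELECT value FROM setting WHERE name = 'setpoint'").fetchone()[0]

# A permitted change is applied and logged.
conn.execute("UPDATE setting SET value = 22, changed_by = 'alice' WHERE name = 'setpoint'")

# An out-of-range value is rejected by the CHECK constraint.
try:
    conn.execute("UPDATE setting SET value = 99, changed_by = 'bob' WHERE name = 'setpoint'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

audit_rows = conn.execute(
    "SELECT name, old_value, new_value, changed_by FROM audit_log").fetchall()
print(default_value, rejected, audit_rows)
```

The rejected update leaves no audit entry because the constraint fails before the trigger fires, so the log records only changes that actually took effect.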
How do you see database development evolving to meet the diverse requirements of the IoT? What’s on the horizon?
MONTASERI: Gartner predicts there will be 25 billion devices connected to the Internet of Things by 2020, and IoT developers are looking for easy-to-use software tools to store, manage, analyze, and communicate data for their things. Whether assigning identity to an object or offering sensors the ability to gain inputs, data, and data management, the ability to communicate is very important.
Modern devices often generate a large amount of data that needs to be managed locally and transmitted to other devices in a coherent way. Developers of applications for embedded systems – including home appliances, consumer electronics, medical devices, automotive software, and mobile devices – often encounter serious data management challenges in which a large amount of data is collected in one location but needs to be analyzed in another.
Our general offering is database software for local data management and analysis on networked “things,” so they gain the ability to provide intelligent feedback even when disconnected. ITTIA DB SQL enables developers of Internet of Things software to store information and share local changes through database replication. With ITTIA DB SQL, device applications can continuously exchange select local changes with other interested devices and servers, and data produced on the device or retrieved from other systems can be locally queried for rapid decision making to ensure that each device functions autonomously. Local queries are not interrupted by a background data exchange, which is important for maintaining responsive user interaction, and the flexibility to adapt to shifting network topologies also simplifies data management and distribution on the Internet of Things.
GRAVES: A great deal of innovation is occurring in the ability of IoT nodes to work together, manage Big Data in a coordinated way, and leverage hardware resources in a manner that enables “upstream” applications to extract value from the sea of data that is generated.
This is where we’ve focused enhancements in our technology. We’ve released versions of eXtremeDB that further leverage multicore systems with database sharding and distributed query processing. A single logical database can be divided into an arbitrarily large number of shards and still be viewed as a single database instance by querying applications. The shards can be deployed on a single server and leverage every available core and/or be deployed across multiple servers. This provides for elastic scalability – as the number of “things” increases and the volume of Big Data increases, more shards and compute nodes can be put to work without rewriting a single line of code.
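The idea of one logical database spread over many shards can be sketched as follows. This is a simplified, in-process model and not eXtremeDB's sharding implementation: each shard is a plain dict, keys are routed by a CRC-32 hash, and a query is scattered to every shard before the partial results are merged. The `ShardedStore` name and its methods are hypothetical.

```python
import zlib


class ShardedStore:
    """Presents several independent shards as a single key/value database."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash, so the same key always lands on the same shard.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def query(self, predicate):
        """Run the predicate on every shard and merge the partial results."""
        results = []
        for shard in self.shards:          # in a real system these could run in parallel
            results.extend((k, v) for k, v in shard.items() if predicate(k, v))
        return sorted(results)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"sensor-{i:03d}", i)

print(store.get("sensor-042"))
print(store.query(lambda key, value: value >= 97))
```

The caller never names a shard, which is what lets the shard count change without touching application code. This sketch omits the hard part: simple modulo routing would require moving existing data whenever the number of shards changes.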
RUPP: One interesting aspect of IoT is unstable Internet connections. Imagine a device that sends sensor data to a server. After transmission, it deletes the data from its internal database and updates its internal state, then sends a message to the server. What happens if the Internet connection fails after the sensor data was sent and deleted, but before the server received the final message? Developers will have to provide transactional mechanisms wrapping client/server communication, local database operations, and other business operations into a single atomic unit of work. It will be very interesting to see the solutions for these problems.
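One commonly used way to approach this problem, sketched here with hypothetical `Device` and `Server` classes, is to delete local data only after an acknowledgement arrives and to make the server's receive step idempotent, so that a retried batch is not stored twice. This is an illustration of the general technique, not a solution proposed in the roundtable, and it does not give a true distributed transaction.

```python
import uuid


class Server:
    def __init__(self):
        self.batches = {}

    def receive(self, batch_id, readings):
        # Idempotent: a batch id that was already stored is simply acknowledged again.
        self.batches.setdefault(batch_id, list(readings))
        return batch_id


class Device:
    def __init__(self):
        self.pending = {}                  # batch_id -> readings not yet acknowledged

    def record(self, readings):
        batch_id = str(uuid.uuid4())
        self.pending[batch_id] = list(readings)
        return batch_id

    def sync(self, send):
        """Try to upload every pending batch; delete only what was acknowledged."""
        for batch_id, readings in list(self.pending.items()):
            try:
                ack = send(batch_id, readings)
            except ConnectionError:
                continue                   # keep the data and retry on the next sync
            if ack == batch_id:
                del self.pending[batch_id]


def ack_lost_once(server):
    """Simulated link: the first upload reaches the server, but its ack is lost."""
    state = {"calls": 0}

    def send(batch_id, readings):
        state["calls"] += 1
        ack = server.receive(batch_id, readings)
        if state["calls"] == 1:
            raise ConnectionError("connection dropped before the ack arrived")
        return ack

    return send


server, device = Server(), Device()
batch = device.record([21.5, 21.7, 21.6])
link = ack_lost_once(server)

device.sync(link)      # data arrives, ack is lost: the device keeps its copy
device.sync(link)      # retry: the server recognizes the batch id, the device can delete
print(len(server.batches), len(device.pending))
```

After the second sync the server holds exactly one copy of the batch and the device holds none, even though the first acknowledgement never arrived.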