Eheap Part 3: An embedded system heap with self-healing features

A embedded systems heap can protect data while providing adaptability, safety, and high .

This is the third in a series of three papers on eheap, an embedded system heap that provides self-healing features to help exposed embedded systems better deal with environmental disturbances, malware, cyber-attacks, and other problems. Part 1 covered Configuration, while Part 2 was on Enhanced Debugging. Here, we’ll look at self-healing.

A data structure may contain valuable information, but each data item in it can be subjected to “reasonableness” checks and discarded if the checks fail. Data being transmitted or received is generally protected by error codes. Heaps, on the other hand, are particularly vulnerable to bit flips and overwrites because they have high a density of pointers. One bad pointer can put the system “into the weeds.”

For an average chunk size of 32 bytes (common with object-oriented code), pointer density is 25%. This is probably higher than the pointer density in code and code can probably be put into a safer type of memory (e.g. ROM). Of course, there are other RAM areas with high pointer densities, but heaps seem to be a good place to start. If we can come up with a solution here, there’s hope for solutions in other areas.

Self-healing assumptions

Implementing self-healing requires making reasonable assumptions. For eheap, which is freely available for non-commercial use, it’s assumed that only one word can be damaged at a time. It’s further assumed that a heap has hundreds to hundreds of thousands of chunks and that millions to billions of normal heap operations will occur between events that damage the heap. Hence, the odds are good that there’s sufficient time to fix the heap before a malloc() or free() steps on a rotten pointer. This extra bit of protection could be vital if as many systems are running in exposed environments as is currently envisioned for the IoT. Self-healing is implemented in eheap primarily by heap and bin scanning.

Heap scanning

Because of its physical structure, eheap can be scanned forward and backward. This is necessary to find and fix broken links, as well as to fix other broken fields in chunk headers. The eheap heap scan service is intended to be run during idle time so that it won’t consume valuable processing time. As discussed previously, real-time systems must have significant idle time to handle peak asynchronous event loads and to be expandable for future requirements.

Each chunk in the heap is scanned by eheap from the start to the end of the heap, fixing broken links, flags, sizes, and fences, as it goes. In the debug version, the scan stops on a broken fence so that the fence can be examined for clues concerning what happened. In the release version, the fence is fixed. Fixes are reported and saved in the event buffer for later analysis. Note that during scans, all pointers are heap-range tested before use to avoid data abort exceptions that might stop the system.

If a broken forward link is found, and the chunk is either a free chunk or a debug chunk, the chunk size field is used to attempt a fix; chunk size is effectively a redundant forward link. Otherwise, a backward scan is started from the end of the heap. It proceeds until the break is found and fixed.

While back scanning, other broken forward links are also fixed, if encountered. However, a broken backward link stops the backward scan and a bridge is formed between the chunk with the broken forward link and the chunk with the broken backward link. When this happens, many chunks may be bridged over. Bridging serves to allow the scan to complete and may give the system a chance to limp along and to possibly fix itself, but generally it doesn’t fix the problem and thus a HEAP_BRKN error is reported.

Need for incremental scanning

In a hard real-time system, it’s imperative that high-priority tasks not be blocked from running for so long that they miss their deadlines. However, the heap scan must not be preempted by other heap functions, or errors will occur. Hence, heap scan must be a system service routine (SSR), which is non-pre-emptible. To resolve this conflict, heap scan has been designed to operate in an incremental fashion. It is called as follows:

smx_HeapScan(cp, fnum, bnum);

where cp points at the chunk from which to start this run, fnum is the number of chunks to scan forward, and bnum is the number of chunks to scan backward. bnum is usually larger than fnum because backward scanning is faster and more urgent because a broken forward link has been found that needs to be fixed. For example, fnum might be 2 and bnum might be 50. Hence, higher priority tasks would usually be blocked only for the time needed to scan two chunks.

Heap scan operation

Each call of HeapScan() is called a run; each time the full heap is scanned is called a scan. A scan may require thousands of runs, or more. HeapScan() is usually called as:

smx_HeapScan(NULL, fnum, bnum);

cp = NULL means to use the eheap static forward or backward pointer to start the run. These pointers point to the first chunk to scan when a run begins. If the hs_fwd mode is ON, the scan direction is forward (the normal direction); if it’s OFF, the direction is backward. If a free() merges the chunk pointed to by either pointer, the pointer is moved to point at the resultant merged chunk.

When a broken forward link is found, hs_fwd is turned OFF and the static backward scan pointer is set to point at the end of the heap. The next run starts a backward scan. When the backward scan has fixed or bridged the break, hs_fwd is turned back ON and forward scanning resumes on the next run from where it stopped.

When the end of the heap is reached, or an unfixable error occurs, HeapScan() returns TRUE. Otherwise it returns FALSE, which means it must keep scanning. Hence, calling it from idle can be done as follows:

if (--heap_scan == 0) {
	smx_HeapScan(NULL, 2, 50);
	heap_scan = HEAP_SCAN_CNT;

Then HeapScan() will be called once per HEAP_SCAN_CNT idle passes to limit processing time spent scanning. An unfixable error would normally be handled by the error manager, but it might be useful to deal with it here. In either case, hs_fwd is turned ON and the static forward pointer is set to NULL, causing the next run to start from the beginning of the heap.

Heap scan steadily patrols the heap looking for trouble and fixing what it can. For one run per tick, fnum = 2, and a 100,000 chunk heap, 50,000 ticks would be required to complete a full heap scan. That is 500 seconds at 100 ticks/sec, which should be adequate. As noted, this provides extra protection against the many things that can go wrong in a heap over a long period of time.

Bin scanning

Whereas bins are in local memory and may be less susceptible to bit flips, most bin-free list pointers are in the free chunks out in the heap and thus are just as susceptible to damage as the heap links. Thus, bin-free list scanning is necessary to provide a consistent level of protection. Bin scan is similar to heap scan; it’s incremental, scans doubly-linked chunk lists, and fixes broken links, if it can. It’s called with:

smx_HeapBinScan(binno, fnum, bnum);

Bin scan has three parameters: binno, fnum, and bnum. Like heap scan, it returns FALSE until it’s done with a bin, or an unfixable error is found. If a preempting malloc() uses a bin being scanned, the bin scan is restarted. Note that bin free lists are assumed to be much smaller than the heap.

Like heap scan, bin scan scans backward to fix broken forward links, fixes are reported, and double breaks are bridged. In that case, the bridged chunks can no longer be allocated, but heap operation can continue normally, as long as merging is OFF. Thus, bridging is a partial solution, but HEAP_BRKN is still reported to allow a better solution to be implemented. If merging is ON, free may fail due to encountering a bad pointer when it attempts to merge a bridged chunk.

Bin scan is also intended to be called from idle, such as:

void IdleMain(void)
	static u32  i = 0;
	if (smx_HeapBinScan(i, 2, 10))
		i = (i == smx_top_bin ? 0 : i++);

This results in all bins being scanned, two chunks forward at a time. For simplicity, unfixable errors are assumed to be handled by the error manager.

Heap recovery and extend

Due to its use of selective deferred free chunk merging, eheap runs a higher risk of allocation failure than does a heap like dlmalloc, which enables merging all the time. As a consequence, a heap recovery service has been implemented:

smx_HeapRecover(sz, fnum);

which searches the heap incrementally for adjacent free chunks that can be merged to form a block >= sz. If found, this service merges the chunks and puts the merged chunk into a bin, so that malloc() can be recalled to get it.

eheap also provides a heap extension service:

smx_HeapExtend(xsz, xsp);

to extend the heap into adjacent or non-adjacent RAM to recover from an allocation failure. This could be useful if there’s some unused RAM in a system due to, for example a feature not being installed, provision of expansion RAM for the future, or availability of lower performance RAM.

Error handling

As noted in the debugging paper, eheap does extensive testing of heap service parameters, links, flags, fences, etc., and reports any errors to the Error Manager (EM). In a thoroughly debugged, mature system, these kinds of errors are likely due to malware or cyber-attacks, rather than bugs. Application code to recover from them can be put into smx_EMHook(). Alternatively, most error testing can be turned off by compiling with HEAP_SAFE OFF. This improves performance at the risk of reduced safety.

If all else fails, the HEAP_BRKN error can handled by reinitializing the heap and restarting all tasks that use it. This can be a workable solution, if high-priority, mission-critical tasks use only block pools. In this case, valuable data may be lost, but the system will continue performing its vital functions. By taking action before malloc() or free() encounter a broken link, the system won’t go out of control and fail completely.

Ralph Moore, President and Founder of Micro Digital, graduated with a degree in Physics from Caltech. He spent his early career in computer research, then moved into mainframe design and consulting.

Topics covered in this article