Handling RTOS errors and timeouts

Make time for error-handling aids. It'll pay off in the long run.

Most RTOSs provide monitoring and reporting of service call and other types of errors. Although it's generally recognized that good error handling aids debugging and improves reliability, there is seldom time for it. The problem is that we're already challenged to achieve the main project objectives, and dealing with errors is an unwelcome nuisance. This is unfortunate, because good error detection and handling can reduce debug time and can help to deal with problems in delivered systems.

What's needed are simple ways to incorporate error and timeout management into embedded systems. There are two kinds of error reporting, local and central. Since not all RTOSs offer the latter, let's start with the former.

Local error reporting

There are basically two methods for local error reporting. The first returns status information directly and puts the desired result into an address provided as an argument. For example:
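A minimal sketch of this method might look as follows. The signature and stand-in body of OSTaskCreate(), the STATUS codes, and the TCB fields are assumptions made for illustration, not a real RTOS API:

```c
#include <stddef.h>

/* Hypothetical status codes and TCB layout, for illustration only. */
typedef enum { OK = 0, ERR_INVALID_TCB, ERR_INVALID_FUNC } STATUS;

typedef struct {
    void (*entry)(void);   /* task main function */
    int   priority;
    int   in_use;
} TCB;

/* Stand-in for the RTOS service: the status is the return value and the
   result is written through the address passed as the first argument. */
STATUS OSTaskCreate(TCB *tcb, void (*entry)(void), int priority)
{
    if (tcb == NULL)
        return ERR_INVALID_TCB;
    if (entry == NULL)
        return ERR_INVALID_FUNC;
    tcb->entry    = entry;
    tcb->priority = priority;
    tcb->in_use   = 1;
    return OK;
}

TCB taskA_tcb;                 /* statically defined TCB */

void taskA_Main(void) { }      /* placeholder task body */

STATUS create_taskA(void)
{
    STATUS status = OSTaskCreate(&taskA_tcb, taskA_Main, 2);
    if (status != OK) {
        /* status holds the error type; handle or report it here */
    }
    return status;
}
```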

In this case, the task control block (TCB) has been statically defined. Using its address, OSTaskCreate() fills in the TCB fields and returns status = OK. If an error occurs, such as an invalid TCB address, then status is the error type. Disadvantages of this method are the need for an additional argument telling where to put the results and the need to statically define TCBs.

The second method returns the result directly and puts the status into a field in the current task's TCB. For example:
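Here is a minimal sketch of the second method. Only smx_ct, its err field, and the taskA handle come from the text; task_create(), the error names, the pool size, and the TCB layout are hypothetical stand-ins, not the actual smx definitions:

```c
#include <stddef.h>

/* Hypothetical error codes and TCB layout, for illustration only. */
typedef enum { ERR_NONE = 0, ERR_OUT_OF_TCBS, ERR_INV_PARM } ERR;

typedef struct {
    void (*entry)(void);
    int   pri;
    int   in_use;
    ERR   err;            /* status of the last service this task called */
} TCB;

#define NUM_TCBS 2
TCB  tcb_pool[NUM_TCBS];  /* pool from which TCBs are allocated */
TCB  creator_tcb;         /* TCB of the task doing the creating */
TCB *smx_ct = &creator_tcb;   /* current task */

/* Stand-in for the RTOS service: the result (a TCB handle) is the return
   value and the status goes into the current task's TCB. */
TCB *task_create(void (*entry)(void), int pri)
{
    int i;

    smx_ct->err = ERR_NONE;            /* cleared at the start of the call */
    if (entry == NULL) {
        smx_ct->err = ERR_INV_PARM;
        return NULL;
    }
    for (i = 0; i < NUM_TCBS; i++) {
        if (!tcb_pool[i].in_use) {
            tcb_pool[i].in_use = 1;
            tcb_pool[i].entry  = entry;
            tcb_pool[i].pri    = pri;
            return &tcb_pool[i];
        }
    }
    smx_ct->err = ERR_OUT_OF_TCBS;     /* pool exhausted */
    return NULL;
}

void taskA_Main(void) { }  /* placeholder task body */

TCB *taskA;                /* handle: pointer to the TCB for taskA */

ERR create_taskA(void)
{
    taskA = task_create(taskA_Main, 2);
    if (taskA == NULL)
        return smx_ct->err;    /* why the call failed */
    return ERR_NONE;
}
```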

In this case, the TCB is dynamically allocated from a pool and the RTOS returns its handle, if successful. (The handle is taskA, which is a pointer to the TCB for taskA.) If not, it returns NULL. In that case, an error or timeout has occurred, and the status has been stored in smx_ct->err, where smx_ct is the current task, i.e. the one trying to create taskA. These methods are equivalent, but I prefer the second because it's more direct.

Local error and timeout handling

Here's a way to handle local errors and timeouts, without complicating the main code:
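The following sketch shows one possible shape for such a task. The names taskA_Main, smx_MsgReceive, smx_ct->err, port_in, TMO, and SMXE_TMO come from the text; everything else is an assumption, including the simplified smx_MsgReceive() signature, the other error names, the MAX_TMOS limit, and the scripted stub kernel that stands in for a real RTOS so the logic can be exercised on a host:

```c
#include <stddef.h>

/* Everything except the names taskA_Main, smx_MsgReceive, smx_ct->err,
   port_in, TMO and SMXE_TMO is an assumption made for this sketch. */
typedef enum { SMXE_OK, SMXE_TMO, SMXE_INV_XCB, SMXE_INV_MCB,
               TOO_MANY_TMOS } ERR;
typedef struct { int data; }   MCB;    /* message control block */
typedef struct { int unused; } XCB;    /* exchange control block */
typedef struct { ERR err; }    TCB;    /* task control block */

#define TMO      10   /* receive timeout, in ticks */
#define MAX_TMOS 3    /* consecutive timeouts tolerated before giving up */

TCB  tcbA;
TCB *smx_ct = &tcbA;  /* current task */
XCB  port_in;

/* Scripted stub kernel: each smx_MsgReceive() call plays back one outcome. */
const ERR *script;
int script_len, script_pos;
int processed, restarts;   /* counters for observing behavior */
ERR last_handled;          /* error case that ended taskA_Main() */

MCB *smx_MsgReceive(XCB *xchg, unsigned timeout)
{
    static MCB msg;
    (void)xchg; (void)timeout;
    smx_ct->err = (script_pos < script_len) ? script[script_pos++]
                                            : SMXE_INV_XCB;
    return (smx_ct->err == SMXE_OK) ? &msg : NULL;
}
void msg_process(MCB *m) { (void)m; processed++; }
void msg_release(MCB *m) { (void)m; }
void task_restart(void)  { restarts++; }  /* a real restart would not return */

void taskA_Main(void)
{
    int tmos = 0;

    while (1) {
        MCB *msg = smx_MsgReceive(&port_in, TMO);
        if (msg != NULL) {
            msg_process(msg);
            msg_release(msg);
            tmos = 0;                   /* only consecutive timeouts count */
        } else {
            if (smx_ct->err == SMXE_TMO && ++tmos <= MAX_TMOS)
                continue;               /* notify here, then retry */
            break;                      /* error or too many timeouts */
        }
    }

    /* Error handling lives outside the main loop. */
    last_handled = (smx_ct->err == SMXE_TMO) ? TOO_MANY_TMOS : smx_ct->err;
    switch (last_handled) {
    case SMXE_INV_MCB:
        task_restart();   /* discard the message and start over */
        return;           /* needed only because the stub returns */
    case SMXE_INV_XCB:
        /* report the bad exchange handle */
        break;
    case TOO_MANY_TMOS:
        /* report that the sender appears to have gone quiet */
        break;
    default:
        break;
    }
    /* Returning hands control to the scheduler, which stops taskA. */
}
```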

In this example, the while (1) loop does the main processing. It waits for a message at the port_in exchange. If a message is received within TMO ticks, it's processed, then released and control goes back to message receive. If a message isn't received due to an error or timeout, control goes to the else statement. If a timeout (SMXE_TMO) has occurred, it's handled and control goes back to message receive, unless there have been too many timeouts. Otherwise, control breaks out of the while (1) loop and goes to the switch statement. Here, there are cases for all smx_MsgReceive() error types, as well as a case for too many timeouts.

After error processing, taskA_Main() returns to the scheduler, which stops it. This ensures that taskA won't cause further damage until the problem can be fixed. Once the problem has been fixed, taskA can be restarted and it will go back into its main processing loop. While debugging, stopping the task helps to diagnose the problem and decide how to fix it in released systems.

taskA can also be restarted, without stopping, as shown for the invalid MCB case. In this case, the message is discarded and taskA goes back to its main processing loop. Restarting taskA is the extent of the error processing. Also, there's no break after task restart, because no statements after it will execute.

This example shows how to distinguish a timeout from an error and how to distinguish error types. Note that timeout handling, which probably requires notification then retry, stays in the main while loop, whereas error handling is performed outside of the loop. This makes the main loop easier to understand since it's not cluttered with error handling code.

By separating error handling code from main processing code, it's easier to focus time and effort on the latter. During debug, the switch statement might serve only as a place to put a breakpoint. Later, it might be fleshed out with cases, as shown, or replaced with central error handling.

Central error handling

Central error handling reduces the need for local error handling and should be used if provided by the RTOS. The RTOS error manager, EM(), is called whenever an error is detected by an RTOS service or by the RTOS itself. Ideally, EM() runs in the system stack so handling an error can't cause a task stack to overflow. Such an overflow could be a mess because it's unexpected and EM() isn't likely to be reentrant. EM() should do some or all of the following:

  1. Load the error number into a global variable, errno. This is the last error experienced by the system.
  2. Load the error number into the tcb.err field of the current task. If this field is cleared at the start of every RTOS service, it'll reflect what happened in the last service used by this task. This is necessary for local error management, as shown previously.
  3. Increment a global error counter, errctr. This indicates how many errors have occurred since the system was started.
  4. Increment a specific counter for the error type, errctrs[e]. This helps determine what problems are occurring in delivered systems.
  5. Save error information in an error buffer, EB, such as time of occurrence, errno, and the thread in which the error occurred.
  6. Save error information in an event buffer, EVB, so the error can be displayed relative to other events, when using the kernel-aware plug-in.
  7. Call a user hook function, EMHook(), to add error- and thread-specific processing.
  8. Stop the current task if it's damaged, or call EMExitHook() to reboot the system if the error is irrecoverable.
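The steps above can be sketched roughly as shown here. This is not the smx error manager. The error names, record layout, buffer size, and the policy for deciding which errors stop a task or reboot the system are all assumptions. The global called errno in the list is named em_errno in the code because the C standard library reserves errno, and step 6 is left as a comment since it depends on the debugger tooling:

```c
#include <stddef.h>

/* Error names, record layout, and the stop/reboot policy are assumptions. */
typedef enum { ERR_NONE, ERR_TMO, ERR_STK_OVFL, ERR_HEAP_CORRUPT,
               NUM_ERRS } ERR;
typedef struct { ERR err; int stopped; } TCB;
typedef struct { unsigned long time; ERR err; TCB *task; } EREC;

#define EB_SIZE 8

TCB          *ct;                 /* current task */
unsigned long ticks;              /* system tick counter */
ERR           em_errno;           /* step 1: last error seen by the system */
unsigned      errctr;             /* step 3: total errors since startup */
unsigned      errctrs[NUM_ERRS];  /* step 4: per-type counters */
EREC          EB[EB_SIZE];        /* step 5: circular error buffer */
unsigned      eb_next;
unsigned      hook_calls, reboots;   /* lets a host test observe the hooks */

/* Placeholder hooks; an application would supply its own. */
void EMHook(ERR e, TCB *task) { (void)e; (void)task; hook_calls++; }
void EMExitHook(ERR e)        { (void)e; reboots++; }  /* would reboot */
void task_stop(TCB *task)     { if (task != NULL) task->stopped = 1; }

void EM(ERR e)
{
    em_errno = e;                                   /* step 1 */
    if (ct != NULL)
        ct->err = e;                                /* step 2 */
    errctr++;                                       /* step 3 */
    if ((unsigned)e < (unsigned)NUM_ERRS)
        errctrs[e]++;                               /* step 4 */

    EB[eb_next].time = ticks;                       /* step 5 */
    EB[eb_next].err  = e;
    EB[eb_next].task = ct;
    eb_next = (eb_next + 1) % EB_SIZE;

    /* step 6: log to the event buffer, EVB (omitted in this sketch) */

    EMHook(e, ct);                                  /* step 7 */

    if (e == ERR_HEAP_CORRUPT)                      /* step 8 */
        EMExitHook(e);         /* treated here as irrecoverable */
    else if (e == ERR_STK_OVFL)
        task_stop(ct);         /* treated here as a damaged task */
}
```

All of this work happens only on the error path, so none of it costs anything while services are succeeding.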

EM() adds overhead to the system only when errors occur. Hence, substantial error processing can be put into EM() with negligible impact on normal system operation. Also, the increase in code size is small relative to total code size.

Deciding what to use

Local error management tends to add complexity to the main code and make it larger and slower. However, errors can usually be handled most effectively at the point of call and in some cases, it's essential to do so. For example, in networking software, running out of free blocks for incoming packets might be normal under heavy-load conditions. Hence, dealing with it locally is necessary. As shown previously, this can be done without seriously complicating the main code.

However, once debugging is done, most errors should never occur again. Timeouts might be the only frequent occurrences to deal with, and they can be handled rather simply. Thus, in many systems, relying on central error management might be the best choice because it doesn't add much overhead, nor does it complicate the main code. Yet, it allows seeing what errors are occurring, where they're occurring, and their frequencies.

Using EMHook() provides a middle ground, where additional information can be gathered for specific errors and threads. Then error- or thread-specific recoveries can be initiated.
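As an illustration, a hook along these lines could single out the networking case mentioned earlier. The EMHook() signature, the task names, the error names, and the recovery action are all hypothetical:

```c
#include <stddef.h>

/* Hypothetical error codes, TCB layout, and tasks, for illustration only. */
typedef enum { ERR_NONE, ERR_TMO, ERR_OUT_OF_BLOCKS } ERR;
typedef struct { const char *name; unsigned faults; } TCB;

TCB net_rx_task = { "net_rx", 0 };
TCB ui_task     = { "ui", 0 };

unsigned dropped_packets;     /* extra information gathered by the hook */
int      throttle_requested;  /* recovery action flagged by the hook */

/* Assumed to be called by the central error manager with the error
   number and the task in which the error occurred. */
void EMHook(ERR e, TCB *task)
{
    if (task == NULL)
        return;

    task->faults++;           /* thread-specific bookkeeping */

    /* Error- and thread-specific recovery: running out of blocks in the
       receive task is expected under heavy load, so count the drop and
       ask the sender to slow down rather than treating it as fatal. */
    if (task == &net_rx_task && e == ERR_OUT_OF_BLOCKS) {
        dropped_packets++;
        throttle_requested = 1;
    }
}
```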

In a typical system, many paths flow through the RTOS and thus it's in a good position to detect and handle errors. Generally speaking, return values from functions, especially RTOS services, shouldn't be used without checking them. Failure to do so can result in serious malfunctions and can provide entry points for malware. When the main code is done, attention can turn to what amount and type of error handling is appropriate for the released system. Probably some combination of the above methods will be chosen.

For more information on this design methodology see the smx User's Guide.

Ralph Moore, President and Founder of Micro Digital, graduated with a degree in Physics from Caltech. He spent his early career in computer research, then moved into mainframe design and consulting.