Best practices for designing high-throughput, real-time SoC systems

Altera's Nick Ni explains the pros and cons of emphasizing determinism or latency during SoC development.

Today's System-on-Chip (SoC)-based systems are routinely asked to perform multiple, disparate tasks, such as running a base Operating System (OS) while also handling throughput-intensive applications. As the number of cores within SoCs continues to grow, developers are left with many hardware and software considerations that often result in a tradeoff between real-time determinism and low-latency performance.

Modern SoC software often consists of multiple applications, ranging from hard real-time (such as automotive motor control) to high-throughput (such as HD video streaming). These hybrid system designs are becoming more challenging as the modern SoC evolves rapidly into a high-throughput system with an increasing number of processor cores and high-bandwidth interconnects. Achieving hard real-time performance – µs-level response with less than one µs of jitter – on such a system requires careful tradeoff analysis and system partitioning. It is also essential to consider future-proofing strategies for ever-increasing SoC complexity. Today, three main approaches exist from which designers can choose to optimize hybrid SoC systems: Asymmetric Multi-Processing (AMP), hypervisor, and Symmetric Multi-Processing (SMP) with core isolation (Table 1).

Table 1: Comparison of AMP, hypervisor and SMP with core isolation approaches to hybrid system design.

Asymmetric Multi-Processing

AMP is fundamentally a port of multiple OSs onto physically different processor cores. An example would be to run a bare-metal application or small RTOS dedicated to real-time tasks on the first core, and to execute a full-blown OS, such as embedded Linux, on the other cores. Most of the time, the initial porting of the OSs onto the cores is straightforward. However, the start-up code and the management of shared resources, such as memory, caches, and peripherals, are very error-prone. When multiple OSs access the same peripheral, behavior becomes non-deterministic and can be extremely time consuming to debug. Hence, a careful protection architecture, such as ARM TrustZone, often needs to be in place.

To add more complexity, message passing between the OSs requires shared memory, which must be managed together with the other protection measures. Because caches are usually not shareable between different OSs, message passing has to happen through non-cached memory regions, which adds latency and jitter to overall performance. AMP is also a poor software architecture from a scalability viewpoint, as it requires significant re-porting whenever the number of cores increases.
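To make the cost concrete, below is a minimal C sketch of such a non-cached AMP mailbox. Everything here is illustrative: the base address, the layout, and the barrier are assumptions for a hypothetical ARMv7 platform, and a real design must match the SoC's memory map and protection scheme.

```c
/* Illustrative AMP mailbox: two OSs exchange messages through a
 * reserved, non-cached memory window. MAILBOX_BASE and the layout
 * are hypothetical and must match the real platform memory map. */
#include <stdint.h>

#define MAILBOX_BASE 0x3F000000u   /* hypothetical uncached region */

typedef struct {
    volatile uint32_t ready;       /* 1 = payload valid            */
    volatile uint32_t len;         /* payload length in bytes      */
    volatile uint8_t  payload[248];
} amp_mailbox_t;

static amp_mailbox_t *const mbox = (amp_mailbox_t *)MAILBOX_BASE;

/* ARMv7 data memory barrier: order the payload writes before the flag */
static inline void dmb(void) { __asm__ volatile ("dmb" ::: "memory"); }

/* Producer side (e.g., the real-time core). Every access below goes
 * to external memory, not the cache -- the latency and jitter cost
 * noted above. */
int amp_send(const uint8_t *msg, uint32_t len)
{
    if (len > sizeof(mbox->payload) || mbox->ready)
        return -1;                 /* mailbox busy or message too big */
    for (uint32_t i = 0u; i < len; i++)
        mbox->payload[i] = msg[i];
    mbox->len = len;
    dmb();
    mbox->ready = 1u;              /* publish: consumer polls this flag */
    return 0;
}

/* The consumer on the other OS polls mbox->ready, copies the payload
 * out, then writes ready = 0 to hand the mailbox back. */
```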

Hypervisors

A hypervisor is a low-level software layer that runs directly on the hardware and manages multiple independent OSs on top of it. Though the initial porting effort is similar to AMP, the benefit is that the hypervisor hides the non-trivial details of resource management and message passing. The drawback is that the extra software layer incurs a performance overhead, degrading both throughput and real-time performance.

Symmetric Multi-Processing

SMP with core isolation runs a single OS on multiple cores with internal core partitioning. An example is to instruct an SMP OS to assign a real-time application to the first core and the non-real-time applications to the remaining cores. This approach is very scalable, as an SMP OS is designed to extend seamlessly to an increasing number of cores. Because all cores are managed by a single OS, message passing between cores can happen at the L1 data cache level, resulting in faster communication with less jitter.

Core isolation reserves a core for the hard real-time application, shielding it from the activity of the other, high-throughput cores and preserving the low-jitter real-time response. This is generally a sound software architecture decision because it lets designers focus on choosing the right OS instead of re-inventing error-prone, low-level software to manage multiple OSs. The initial porting may require some effort if multiple OSs are involved from the outset, but this effort is significantly reduced when starting from an SMP architecture.
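As a sketch of how core isolation can be expressed in software, the fragment below pins a real-time task to core 0 using the VxWorks SMP affinity API (taskCpuAffinitySet). The task name, priority, and stack size are placeholder values, and full isolation also assumes the system is configured so that other work stays off the reserved core.

```c
/* Sketch: pin the hard real-time task to core 0 under VxWorks SMP.
 * Priority, stack size, and names are placeholder values. */
#include <vxWorks.h>
#include <taskLib.h>
#include <cpuset.h>

extern void rtLoop(void);               /* the hard real-time loop */

void startRtTask(void)
{
    cpuset_t affinity;
    TASK_ID  tid;

    tid = taskSpawn("tRtLoop", 50, 0, 0x4000, (FUNCPTR)rtLoop,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

    CPUSET_ZERO(affinity);              /* clear the affinity mask      */
    CPUSET_SET(affinity, 0);            /* select core 0 only           */
    taskCpuAffinitySet(tid, affinity);  /* task now runs only on core 0 */
}
```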

Optimizing a high-throughput, real-time SoC with SMP

Based on this analysis, SMP with core isolation offers the best architecture for optimizing high-throughput, real-time SoC systems. However, before further analyzing the tradeoffs of this approach, it is essential to understand what a real-time response (or loop time) consists of:

  1. Transfer new data to the system memory from an I/O (Direct Memory Access (DMA))
  2. Processor detects the new data in the system memory (core isolation)
  3. Copy the data to a private memory (processor memory access (memcpy))
  4. Compute the data
  5. Copy the results back to the system memory (memcpy)
  6. Transfer the results back to an I/O (DMA)

Because latency and jitter accumulate across these six steps, it is essential to optimize each one. For a Real-Time Operating System (RTOS) such as VxWorks with core isolation, the polling/interrupt response can be bounded to the nanosecond range (step 2), and the data computation is application specific and fairly predictable (step 4). Therefore, designers should focus on the tradeoff between the DMA transfers and the memcpy operations (steps 1, 3, 5, and 6).
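A skeleton of the six-step loop in C may help fix the terminology. The DMA helpers and the compute kernel below are hypothetical stand-ins for a platform DMA driver and the application; only the polling step and the memcpy calls correspond one-to-one with the list above.

```c
/* Skeleton of the six-step loop. newDataReady()/dmaSend() are
 * hypothetical stand-ins for a platform DMA driver. */
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 512

extern int  newDataReady(void);        /* step 2: poll DMA status      */
extern void dmaSend(const uint32_t *); /* step 6: system memory -> I/O */

static uint32_t sysBuf[BUF_WORDS];     /* DMA-visible system memory    */
static uint32_t workBuf[BUF_WORDS];    /* processor-private memory     */

void rtLoop(void)
{
    for (;;) {
        /* Step 1 happens in hardware: the DMA engine writes new I/O
         * data into sysBuf and raises a status flag. */
        while (!newDataReady())                    /* step 2: detect   */
            ;
        memcpy(workBuf, sysBuf, sizeof(workBuf));  /* step 3: copy in  */
        for (int i = 0; i < BUF_WORDS; i++)        /* step 4: compute  */
            workBuf[i] = workBuf[i] * 2u + 1u;     /* placeholder math */
        memcpy(sysBuf, workBuf, sizeof(sysBuf));   /* step 5: copy out */
        dmaSend(sysBuf);                           /* step 6: to I/O   */
    }
}
```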

There are two major means of transferring data – with or without cache coherency – and the two methods have very different consequences for DMA and memcpy performance. As shown in Table 2, for example, although using the ARM Accelerator Coherency Port (ACP) results in a longer path for the DMA transfer, the processor only needs to access the L1 cache to obtain the transferred data. Therefore, memcpy time is significantly lower with cache coherency, and this saving outweighs the small degradation in DMA performance (Figure 1). As a result, cache coherent transfers deliver much lower latency and jitter thanks to direct cache access.
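The fragment below sketches how the two transfer styles typically look to VxWorks software, using the standard cacheLib calls; the buffer size is a placeholder, and whether an ordinary cached buffer is DMA-safe is a hardware property that holds only when the DMA master sits behind the ACP.

```c
/* Sketch of buffer handling for the two transfer styles (VxWorks
 * cacheLib). BUF_BYTES is a placeholder size. */
#include <vxWorks.h>
#include <cacheLib.h>

#define BUF_BYTES 2048

/* Non-coherent DMA, option 1: allocate an uncached (cache-safe)
 * buffer, so every memcpy to or from it goes to external memory. */
void *allocNonCoherentBuf(void)
{
    return cacheDmaMalloc(BUF_BYTES);
}

/* Non-coherent DMA, option 2: keep a cached buffer but flush before
 * the device reads it and invalidate before the CPU reads it back. */
void preDmaRead(void *buf)   { cacheFlush(DATA_CACHE, buf, BUF_BYTES); }
void postDmaWrite(void *buf) { cacheInvalidate(DATA_CACHE, buf, BUF_BYTES); }

/* Coherent DMA via the ACP: the engine snoops the processor caches,
 * so a plain cached buffer is safe and memcpy mostly hits in L1 --
 * the gain shown in Figure 1. */
static char coherentBuf[BUF_BYTES];
```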

Table 2: Direct Memory Access (DMA) and memcpy transfer paths with and without cache coherency.

Figure 1: Direct Memory Access (DMA) and memcpy transfer performance with and without cache coherency.

Case study: Best-practices in SoC design

A complete system based on the previous example can be demonstrated with a reference design using a Cyclone V SoC FPGA development kit. The device combines a dual-core ARM Cortex-A9 processor subsystem (called the Hard Processor System (HPS)) and a 28 nm FPGA in a single chip (Figure 2). The system architecture is as follows:

Figure 2: Experimental system reference design using the Cyclone V SoC FPGA development kit.

Hardware architecture

  • Two DMAs to transfer data from the FPGA I/O to the ARM processors and vice versa
  • Both DMAs are connected to the ACP to transfer data directly to and from the ARM processor cache
  • Real-time control unit IP to initiate message passing between the ARM processors and the DMA engines in the fastest way possible
  • A jitter monitor to collect real-time performance and jitter data by directly probing the DMA signals, achieving an accuracy of ±6.7 ns

Software architecture

  • VxWorks RTOS running in SMP mode on the dual-core ARM processors
  • Core isolation used to assign the real-time application to the first core and the rest of non-real-time applications to the second core
  • The real-time application continuously fetches data from I/O, computes the data, and sends the results back to the I/O
  • The non-real-time applications stress the ARM cores and I/O by continuously running FTP transfers and decrypting the data

Loop time and jitter experiments were run on the system using data sizes ranging from 32 bytes to 2,048 bytes. Each data size was run millions of times to collect a histogram of the loop time for jitter analysis, jitter being the difference between the maximum and minimum loop times. As shown in Figure 3, even with heavy FTP traffic running on the second core, µs-level latency with less than 300 ns of jitter was achieved over millions of test runs. There is some variation in the jitter from run to run, but it stays within a 200 ns range, which is insignificant at these throughput levels.
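The measurement itself can be sketched as follows. readTimerNs() is a hypothetical stand-in for the FPGA jitter monitor described above, and runOneLoop() stands in for one pass of the six-step cycle.

```c
/* Sketch of the jitter analysis: time each loop pass, build a
 * histogram, and report jitter as max minus min loop time.
 * readTimerNs() is a hypothetical ns-resolution timestamp source. */
#include <stdint.h>
#include <stdio.h>

#define RUNS    1000000u
#define BIN_NS  10u               /* histogram bin width in ns */
#define NBINS   4096u

extern uint64_t readTimerNs(void);  /* hypothetical timestamp  */
extern void     runOneLoop(void);   /* one pass of the rt loop */

void jitterTest(void)
{
    static uint32_t hist[NBINS];    /* loop-time distribution  */
    uint64_t minT = UINT64_MAX, maxT = 0u;

    for (uint32_t i = 0u; i < RUNS; i++) {
        uint64_t t0 = readTimerNs();
        runOneLoop();
        uint64_t dt = readTimerNs() - t0;

        if (dt < minT) minT = dt;
        if (dt > maxT) maxT = dt;
        uint32_t bin = (uint32_t)(dt / BIN_NS);
        hist[bin < NBINS ? bin : NBINS - 1u]++;
    }
    printf("loop min %llu ns, max %llu ns, jitter %llu ns\n",
           (unsigned long long)minT, (unsigned long long)maxT,
           (unsigned long long)(maxT - minT));
}
```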

Figure 3: Loop time and jitter results of Cyclone V SoC FPGA development kit-based experimental system.

The same FTP application was also run on VxWorks SMP utilizing both cores and achieved close to a 2x speedup. The SMP with core isolation technique therefore does reduce peak throughput, forcing a tradeoff between throughput and hard real-time performance. An AMP solution exhibits the same degradation because of its hard partitioning of the cores, but with far less scalability as the number of cores increases.

Best practices yield tradeoff considerations

Designing a balanced SoC system with high-throughput and real-time applications requires a number of tradeoff considerations, such as:

  • DMA data transfer
  • Cache coherency
  • Message passing between the processor core and the DMA
  • OS partitioning
  • Software scalability with increasing number of processor cores

The “best-practice” system design described here – SMP with core isolation and cache coherent transfers – achieved low-latency, low-jitter real-time performance while maintaining software scalability for future generations of SoCs.

Nick Ni is Embedded Applications Engineer at Altera Corporation.

Altera Corporation

www.altera.com

@alteracorp

www.linkedin.com/company/altera

plus.google.com/+altera