Embedded Computing Design

Managing network traffic flow for multicore x86 processors at 40/100G

Embedded Computing Design — February 7, 2012

Part 2 in a 2-part series: Embedded systems migrating to 40G today and 100G in the next few years demand an intelligent in-line preprocessor capable of handling traffic at this high line rate, while communicating with the x86 CPU subsystem over a high-performance, virtualized PCI Express interface. Part 1 in this series examined the challenges of processing network traffic at 100G and some of the commercially available solutions attempting to solve such challenges. Part 2 highlights the need for a coprocessor that is tightly coupled to a multicore x86 CPU and can manage functions such as intelligent L2/L3 switching, flow classification, in-line security processing, virtualization, and load balancing for x86 CPU cores and virtual machines.

To keep pace with the explosion of traffic in the enterprise and carrier network, embedded designers have tried a variety of methods to meet the demand for 100G secure communication, including embedding hardware accelerators into multicore processors or using devices such as network processors, Ethernet switches, or Ethernet controllers. These approaches each come with their own drawbacks that limit performance and increase complexity. Furthermore, attempts to use a single-chip heterogeneous multicore processor to bypass performance issues have led to proprietary architectures that are not operating system friendly.

A high-performance multicore heterogeneous architecture builds on the idea of a single-chip multicore heterogeneous processor, but divides the solution into two processors: a general-purpose multicore x86 CPU focused on application and control plane processing, and a separate in-line multicore coprocessor focused on L2-L4 processing and accelerating L4-L7 applications. The key to this architecture is a tightly coupled interface between the two processors that is in-line, secure, virtualized, and high-performance (see Figure 1). A good analogy is the use of a graphics processing unit (GPU) alongside a multicore x86 processor in workstations and other graphics-intensive servers.

Figure 1: In a multicore heterogeneous architecture, an external coprocessor integrates all of the hardware acceleration functions to optimize power and performance.

Efficient processing and memory utilization

The coprocessor needs to access each packet and forward it, along with its associated metadata (packet state), with minimal latency. This dictates a hierarchical memory architecture of on- and off-chip memories, with packet data effectively managed through the hierarchy. For example, first-level lookup tables can be in on-chip memories, while large volumes of data can be stored in external memory tables.
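The idea of a small on-chip table in front of a large external table can be sketched in a few lines of Python. This is only an illustration, not a description of any particular coprocessor: the class name `TwoLevelTable`, the least-recently-used eviction policy, and the `external_accesses` counter are all assumptions introduced here.

```python
from collections import OrderedDict


class TwoLevelTable:
    """Toy model: small, fast first-level table backed by a large external table."""

    def __init__(self, on_chip_capacity, external_table):
        self.on_chip = OrderedDict()      # stands in for on-chip memory
        self.capacity = on_chip_capacity  # number of entries that fit on chip
        self.external = external_table    # stands in for external memory (a dict)
        self.external_accesses = 0        # counts slow, off-chip lookups

    def lookup(self, key):
        # Fast path: entry is already held on chip.
        if key in self.on_chip:
            self.on_chip.move_to_end(key)  # mark as most recently used
            return self.on_chip[key]

        # Slow path: go to external memory, then keep a copy on chip.
        self.external_accesses += 1
        value = self.external.get(key)
        if value is not None:
            self.on_chip[key] = value
            if len(self.on_chip) > self.capacity:
                self.on_chip.popitem(last=False)  # evict least recently used
        return value
```

Repeated lookups for the same key touch external memory only once, which is the property the hierarchy is meant to provide for frequently used entries.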

In addition, the use of multiple threads per processing core can mitigate the memory wall problem. A thread continues to execute until an external memory access is needed, at which point another processing thread takes over the core. The resulting asynchronous memory architecture decouples external memory accesses from processing, maximizing overall system performance. This decoupling also allows for bulk memory transactions, where many memory accesses are pooled together into one memory transaction, further increasing the efficiency of the external memory interface.
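A back-of-the-envelope model shows why extra threads help. Assume, purely for illustration, that every thread alternates between a fixed burst of computation and a fixed memory stall, and that switching threads costs nothing. The function names and cycle counts below are hypothetical and not taken from any real device.

```python
def core_utilization(threads: int, compute_cycles: int, memory_cycles: int) -> float:
    """Fraction of cycles the core does useful work in the idealized model.

    Each thread computes for `compute_cycles`, then waits `memory_cycles`
    for external memory. While one thread waits, another may run.
    """
    if threads < 1 or compute_cycles <= 0 or memory_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, memory_cycles >= 0")
    period = compute_cycles + memory_cycles  # one thread's full cycle
    busy = threads * compute_cycles          # work available in that period
    return min(1.0, busy / period)


def threads_to_hide_latency(compute_cycles: int, memory_cycles: int) -> int:
    """Smallest thread count that keeps the core fully busy in this model."""
    total = compute_cycles + memory_cycles
    return -(-total // compute_cycles)  # ceiling division
```

With 20 compute cycles per 180-cycle memory access, a single thread keeps the core busy 10 percent of the time, and ten threads are needed to cover the stall completely. Real cores have switch overhead and variable latencies, so this is an upper bound rather than a prediction.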

In-line or look-aside processing with security and virtualization

The coprocessor is in-line with ingress and egress traffic and should be able to encrypt and decrypt packets on the fly, classify packets into flows, and look up the flow state table to determine the action needed on each flow. The coprocessor also implements I/O virtualization, allowing the x86 cores and their Virtual Machines (VMs) to share the I/O subsystem. In addition, the coprocessor should be able to dynamically load balance the traffic to the x86 cores and VMs based on flows.
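One common way to load balance by flow is to hash the packet's 5-tuple and use the hash to pick a destination, so that every packet of a flow lands on the same core or VM. The sketch below assumes that approach; the article does not say how any specific coprocessor does it. `FiveTuple`, `flow_key`, and `pick_target` are hypothetical names, and CRC-32 is used only because it is deterministic and available in the Python standard library.

```python
import zlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """Fields commonly used to identify a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def flow_key(pkt: FiveTuple) -> bytes:
    """Serialize the 5-tuple into a stable byte string for hashing."""
    return "|".join(str(field) for field in pkt).encode("ascii")


def pick_target(pkt: FiveTuple, targets: Sequence[str]) -> str:
    """Map a packet to one target; packets of the same flow get the same target."""
    if not targets:
        raise ValueError("no targets to balance across")
    return targets[zlib.crc32(flow_key(pkt)) % len(targets)]
```

Note that this hashes each direction of a connection separately. A design that needs both directions on the same core would have to normalize the tuple first, for example by ordering the two endpoints.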

Fast interconnect with x86

Supporting a heterogeneous processing architecture requires a high-performance interconnect to the x86 processor with I/O virtualization capability. For example, an 8-lane PCI Express Gen 2 interface signals at an aggregate 40 GT/s (5 GT/s per lane) to an x86 CPU socket. (Note: 8b/10b line encoding and overhead on read and write cycles bring usable throughput down to the low 20s of Gbps.)
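The arithmetic behind the PCI Express figure can be worked through as follows. The 5 GT/s lane rate and 8b/10b encoding are properties of PCI Express Gen 2. The per-packet overhead of 24 bytes and the 128-byte payload used in the usage note are assumptions chosen for illustration, and the function names are hypothetical.

```python
def pcie_gen2_encoded_rate_gbps(lanes: int) -> float:
    """Per-direction data rate in Gbps after 8b/10b line encoding."""
    gt_per_s_per_lane = 5.0  # PCIe Gen 2 signaling rate per lane
    # 8b/10b carries 8 data bits in every 10 bits on the wire.
    return lanes * gt_per_s_per_lane * 8 / 10


def tlp_payload_efficiency(payload_bytes: int, overhead_bytes: int = 24) -> float:
    """Share of each transaction-layer packet that is payload.

    overhead_bytes is an ASSUMED figure covering header, sequence number,
    link CRC, and framing; the real value depends on the packet format.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)


def estimated_payload_rate_gbps(lanes: int, payload_bytes: int,
                                overhead_bytes: int = 24) -> float:
    """Rough payload throughput: encoded rate scaled by packet efficiency."""
    return (pcie_gen2_encoded_rate_gbps(lanes)
            * tlp_payload_efficiency(payload_bytes, overhead_bytes))
```

For eight lanes the encoded rate is 32 Gbps, and with 128-byte payloads the estimate falls to roughly 27 Gbps. The model leaves out acknowledgment and flow-control traffic and the round trips involved in reads, which presumably account for the rest of the drop toward the low 20s mentioned above.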

Cut through/intra-flow cut through

Ideally, not all flows need to be transmitted to the x86 processor, as the coprocessor is intelligent enough to classify packets into flows. Based on the flow state table, an action can be taken to cut through, drop, or forward to x86. In some cases, the first few packets of a flow are forwarded to the x86 subsystem for inspection. The x86 processor can then instruct the underlying coprocessor to cut through the remainder of the packets in the same flow.
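The intra-flow cut-through behavior described above can be modeled with a small flow state table. This is a sketch under stated assumptions, not a vendor API: `Action`, `FlowState`, `FlowTable`, and the choice to forward unknown flows to the x86 by default are all introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CUT_THROUGH = "cut_through"        # coprocessor handles the packet itself
    DROP = "drop"
    FORWARD_TO_X86 = "forward_to_x86"  # send up for inspection


@dataclass
class FlowState:
    action: Action = Action.FORWARD_TO_X86  # assumed default for new flows
    packets_seen: int = 0


class FlowTable:
    """Toy flow state table kept by the coprocessor."""

    def __init__(self):
        self._flows = {}

    def handle_packet(self, flow_id) -> Action:
        """Look up (or create) the flow's state and return the action to take."""
        state = self._flows.setdefault(flow_id, FlowState())
        state.packets_seen += 1
        return state.action

    def set_action(self, flow_id, action: Action) -> None:
        """Called on behalf of the x86 once it has decided how to treat a flow."""
        self._flows.setdefault(flow_id, FlowState()).action = action

    def packets_seen(self, flow_id) -> int:
        state = self._flows.get(flow_id)
        return state.packets_seen if state else 0
```

In use, the first packets of a flow return `Action.FORWARD_TO_X86`. After the x86 side calls `set_action(flow_id, Action.CUT_THROUGH)`, later packets of that flow are handled on the coprocessor and never cross the interconnect.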

Inter-VM switching

The advent of VMs and the need for I/O virtualization have created a new set of requirements that mandate a more intelligent approach to managing I/O. In particular, VMs running on different cores need an efficient way to exchange the so-called east-west traffic. Such VMs can belong to different tiers of servers in the data center. A VM-aware switch on the coprocessor can provide this VM interconnect.

Passive NIC mode

The coprocessor should be able to support a mode where all network I/O traffic is passed to the x86 processor. This mode is required for monitoring and statistics or for applications requiring 100 percent of x86 CPU processing.

Implementing the OpenFlow protocol

Software-Defined Networking (SDN) allows users to bring the benefits of virtualization – including shared resources, user customization, and fast adaptation – to the switched network by defining traffic flows and deciding how these flows are treated in the network. In other words, it allows the system user to remotely control the network hardware with software in a dynamic and programmable fashion.

SDN puts the intelligence of the network into a hierarchy of controllers. In this hierarchy, switching paths are centrally calculated based on IT-defined parameters and then downloaded to the distributed switching architecture. A hardware-agnostic architecture that uses standard open interfaces to the hardware can change the way we build networking systems today.

The new OpenFlow protocol supports SDN. An OpenFlow controller typically runs on a multicore x86 processor and implements the control plane protocols. It downloads the state information onto multiple flow state tables in the coprocessor, implementing the data-switching plane through a standard OpenFlow API. The coprocessor, being a stateful flow processor, can be optimized to support the OpenFlow architecture.
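The split between a controller that installs entries and a data plane that matches packets against them can be illustrated with a toy match/action table. This is not the OpenFlow wire protocol or any real OpenFlow library. The class names, the string-valued actions, and the `"send_to_controller"` fallback for unmatched packets are simplifications assumed for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FlowEntry:
    priority: int
    match: Dict[str, object]  # field -> required value; absent fields are wildcards
    action: str


class ToyFlowTable:
    """Data-plane side: holds entries installed by the controller."""

    def __init__(self):
        self.entries: List[FlowEntry] = []

    def install(self, entry: FlowEntry) -> None:
        """Controller side: add an entry, keeping highest priority first."""
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.priority, reverse=True)

    def lookup(self, packet: Dict[str, object]) -> str:
        """Return the action of the highest-priority entry the packet matches."""
        for entry in self.entries:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.action
        return "send_to_controller"  # no entry matched; ask the control plane
```

A controller running on the x86 would call `install` as policy changes, while the coprocessor would run `lookup` for each packet at line rate. A hardware implementation would use dedicated lookup structures instead of a linear scan.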

Best of breed for the future

As line speeds continue to grow, it remains to be seen how application workloads will be divided between x86 general-purpose multicore CPUs and external supporting coprocessors. Because the product roadmap for multicore x86 processors is separate from that of coprocessors, designers are free to choose the best of breed in each category to meet ever-increasing future challenges.

Editor’s note: Read Part 1 in this series online at http://embedded-computing.com/managing-processors-40100g-part-of-2.

Nabil G. Damouny is senior director of strategic marketing at Netronome.

Netronome 408-496-0022 info@netronome.com @netronome www.netronome.com