Heterogeneous system architecture: Multicore image processing using a mix of CPU and GPU elements

2Image processing is computationally intensive, requiring immense resources in CPU and memory throughput. Parallelism through multiple CPU cores helps, but even with today's dual-, quad-, and higher count core devices, multimedia tasks demand either a great deal of power or the addition of dedicated image processing hardware. Adding a pool of parallel processing elements as a programmable accelerator to the CPU creates a proper balance of general-purpose processing, high performance, and low power, as demonstrated in an example showing 2x-3x performance and performance-per-watt improvements.

Although multimedia processing has benefited tremendously from the increased availability of processors, until recently, the tendency has been to consider multicore from the standpoint of largely homogeneous CPU architectures. A typical approach offers dual, quad, and higher numbers of cores arranged in Symmetric Multi-Processor (SMP) or cache-coherent Non-Uniform Memory Architecture (ccNUMA) groups. These devices sometimes come with additional hardware acceleration to offload particularly difficult tasks, such as the Context-Adaptive Variable-Length Coding (CAVLC) or processor-intensive Context-Adaptive Binary Arithmetic Coding (CABAC) entropy encoding functions used in H.264 .

Many companies now offer integrated graphics in their devices, largely as a way to save board space and bill of material costs. An additional benefit of having an integrated Graphics Processing Unit (GPU) is that it can provide another path to attacking challenging computational tasks. As an example, AMD’s R-Series processors integrate the Bulldozer CPU architecture with a discrete-class AMD Radeon GPU to produce an Accelerated Processing Unit (APU).

Figure 1 shows a high-level view of the architecture. The GPU is an array of Single Instruction Multiple Data (SIMD) processing elements. Note the common memory controller between the array and x86 cores. This architecture is an early step in the movement to AMD’s (HSA), which breaks significantly from the traditional multicore approach by treating integrated graphics processing elements as a pool of computational resources that can be applied to nongraphics data. It is similar to the way that an attached hardware accelerator can be used, but with the advantage of being a large cluster of programmable devices – up to 384 cores at the high end – with many hundreds of GFLOPS worth of computational throughput.

21
Figure 1: A high-level view of the AMD R-Series APU architecture shows a common memory controller between the array of SIMD processing elements and x86 cores.
(Click graphic to zoom)

Table 1 illustrates the basic operating parameters of an AMD desktop part with the same underlying architecture as the AMD R-Series APU. The Bulldozer-based CPU’s floating-point performance at 121.6 GFLOPS is dwarfed by the GPU, which offers 614 GFLOPS to applications that can utilize the resource effectively. The key to leveraging the added GPU processors in imaging is to consider the time needed to transfer a block of data to the GPU, process it, then return the results to the main thread.

21
Table 1: A comparison of operating parameters demonstrates CPU versus GPU performance in an AMD desktop part.
(Click graphic to zoom by 1.8x)

Performance analysis in a video transcoding application

Although current-generation AMD R-Series processors require a buffer transfer between the GPU and CPU, data-parallel applications can achieve an improvement in throughput and power efficiency that significantly outweighs the cost of the data transfer.

One example of a data-parallel application is Handbrake, a widely used video transcoding utility that runs on a variety of operating systems and hardware platforms. The utility is notable in that it has well-integrated multicore support and therefore serves as an appropriate platform to study the effectiveness of multicore . Handbrake uses x264, a full-featured H.264 encoder used in several open-source and commercial products that provide adequate multicore support.

OpenCL has been used in both of these projects to maximize the GPU compute resources available within the APU. In x264, the quality optimizations associated with the lookahead function were ported completely to OpenCL on the GPU. This can be used in transcoding applications to improve the output video quality and normally consumes up to about 20 percent of CPU time. Handbrake uses x264 directly as the video encoder. In addition, Handbrake uses OpenCL to perform video scaling and color space conversion from the RGB to YUV color space. AMD’s hardware video decoder (UVD) was also used in Handbrake to perform some of the video decoding tasks.

Data was produced on a Trinity 100 W A10 desktop machine; however, the underlying architecture is identical to that used in the embedded product.

To simplify analysis of the APU’s complex power management architecture, the top Thermal Design Power (TDP) was evaluated instead of breaking out the power consumed by each subsystem. The data displayed in Table 2 shows that when processing is performed purely in the CPU, a 14.8 fps frame rate is achieved, and processing is accomplished in 62 seconds. Involving the GPU and UVD blocks in the processing increased the frame rate to 18.3 fps, reducing processing time and hence power consumption by 14 seconds or 22 percent.

22
Table 2: An analysis of a Handbrake application on an A10 desktop machine shows the performance difference when using a CPU alone versus using a GPU and UVD blocks for processing.
(Click graphic to zoom by 1.9x)

The floating-point operations total in Table 2 is the product of the floating-point throughput from Table 1 and the processing time, from which can be drawn a figure of merit: operations per watt. Looking at the total floating-point operations and normalized operations per watt, it becomes clear that processing resources are being left on the table. CPU-only throughput is 75 operations per watt, while it is boosted to 353 operations per watt when the GPU is added for an improvement in theoretical floating-point throughput of nearly 5x (4.7).

Twenty-two percent is nontrivial savings, but while this generation of APU makes significant progress in the direction of HSA, all memory accesses are not uniform from the standpoint of latency and throughput. When the kernel accesses memory, it does so directly. Such operations occur at a theoretical rate of 22 GBps or 16 to 18 GBps in typical applications. When a buffer is created in host memory to access GPU data, however, that data goes through a different path where the effective bandwidth is closer to 8 MBps. This means that the gap in memory bandwidth between what the kernel can expect locally and what it can expect if the programmer sets up a local buffer to GPU memory is nearly 3:1. This is symmetrical. If the GPU needed data residing in CPU memory, the same relationship would hold true.

Future devices adhering to the full HSA architecture will have a uniform memory model, eliminating the need to copy data between the CPU and GPU memory regions. Algorithms such as the lookahead functions of x264, when moved into the GPU’s processing domain, will have near parity from the standpoint of memory bandwidth.

Performance analysis in a benchmarking application

A second example of a data-parallel application is LuxMark, a graphics-oriented benchmarking package that uses OpenCL and illustrates what can be achieved when differences in memory access times are removed from the equation. The default integrated benchmark was run on the AMD Embedded R-464L APU, whose characteristics are shown in Table 3.

23
Table 3: A comparison of operating parameters demonstrates CPU versus GPU performance in an AMD Embedded R-464L APU.
(Click graphic to zoom by 1.8x)

The results in Table 4 clearly demonstrate that the GPU outperforms the CPU by more than 25 percent for this class of processing task. With a full HSA implementation, one would expect to see the addition of the CPU plus GPU to be close to the sum of GPU plus CPU: 344. In practice, the resulting number is 230, which is a nontrivial 63 percent boost but not reflective of fully utilized resources. The gap is largely explained by the overhead of the CPU parsing data and transferring the data back and forth between the CPU and GPU at the lower speeds described earlier. Such features will enable the utilization of both the CPU and the GPU with minimal overhead, allowing the combined performance of both CPU and GPU to be significantly better than either one alone.

24
Table 4: Results from running the LuxMark 2.0 benchmark on the AMD Embedded R-464L APU indicate how the GPU outperforms the CPU in this processing task.
(Click graphic to zoom by 1.9x)

Maximizing computational resources

Newly released APU products provide significant computational resources to designers of embedded computing products. Taking advantage of those resources requires a solid understanding of the underlying hardware architecture, insight into the data flow issues within the target application, and familiarity with the latest tools such as OpenCL.

Frank Altschuler is senior product manager at AMD.

Jonathan Gallmeier is a SMTS/CAS engineer at AMD.

AMD www.amd.com

Follow: @AMDembedded Facebook Linkedin YouTube