Multi-Core FFT Performance on Intel(r) Sandy Bridge Processors

This paper examines how computational performance scales on the Intel(r) Sandy Bridge architecture, particularly when used on Mercury embedded processor boards.

We look specifically at two device types: an 8-core Xeon (E5-2648L clocked at 1.8 GHz) and a 4-core Core i7 (2715QE clocked at 2.1 GHz). These are relatively low-power devices suitable for use on embedded processor boards. The algorithm of interest is a complex FFT (fft_copx) from the Mercury library, which represents the numerically intensive processing typical of many embedded systems. One question to be addressed is whether performance falls off when the FFT algorithm runs on more than one core of the same device simultaneously. Although the cores are independent and each has its own L1 and L2 cache, they share access to an L3 cache and to main memory. As the FFT size increases from 1K (1024 points) to 1M (1,048,576 points), demands on these shared resources increase and the cores compete for them.
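To give a feel for how quickly the data footprint grows over this range of FFT sizes, the following minimal sketch tabulates the bytes occupied by one buffer of complex samples. It assumes single-precision complex data (8 bytes per point), which is an assumption made here for illustration and not a statement about fft_copx. The function names are hypothetical.

```python
# Assumption: single-precision complex samples, i.e. two 4-byte floats per point.
BYTES_PER_POINT = 8


def fft_footprint_bytes(points, buffers=1):
    """Bytes occupied by `buffers` arrays of `points` complex samples."""
    return points * BYTES_PER_POINT * buffers


def footprint_table(min_log2=10, max_log2=20, cores=1):
    """Rows of (points, bytes per core, bytes for all cores) for power-of-two sizes."""
    rows = []
    for log2n in range(min_log2, max_log2 + 1):
        points = 1 << log2n
        per_core = fft_footprint_bytes(points)
        rows.append((points, per_core, per_core * cores))
    return rows


if __name__ == "__main__":
    # Each core is assumed to work on its own independent buffer.
    for points, per_core, total in footprint_table(cores=8):
        print(f"{points:>9} points  {per_core // 1024:>6} KiB per core  "
              f"{total // 1024:>6} KiB for 8 cores")
```

Under this assumption a 1K-point buffer occupies 8 KiB, while a 1M-point buffer occupies 8 MiB, so eight cores each working on their own 1M-point buffer touch 64 MiB in total. An out-of-place transform would need a second buffer per core, which the `buffers` parameter can model.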

This means that for some applications you cannot simply measure the timing on a single core and then scale the results to size your system; the falloff in performance when all cores are utilized must be taken into account.
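The sizing point can be illustrated with a short calculation. The sketch below compares a core count derived from a single-core timing with one derived from a timing taken while all cores are busy. The timings and the required rate are hypothetical placeholders, not measurements from this paper, and the function names are invented for the example.

```python
def scaling_efficiency(t_single_us, t_loaded_us):
    """Per-core throughput with all cores busy, relative to one core running alone."""
    return t_single_us / t_loaded_us


def cores_needed(ffts_per_second, t_per_fft_us):
    """Smallest whole number of cores that sustains the required FFT rate.

    Integer arithmetic (ceiling division) avoids floating-point rounding.
    """
    busy_us_per_second = ffts_per_second * t_per_fft_us
    return (busy_us_per_second + 1_000_000 - 1) // 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    t_single_us = 10_000   # time per FFT with one core active
    t_loaded_us = 16_000   # time per FFT per core with all cores active
    required_rate = 400    # FFTs per second the system must sustain

    print("efficiency under load:", scaling_efficiency(t_single_us, t_loaded_us))
    print("cores (naive, single-core timing):", cores_needed(required_rate, t_single_us))
    print("cores (using loaded timing):", cores_needed(required_rate, t_loaded_us))
```

With these placeholder values the naive estimate undercounts the cores required, which is the kind of sizing error the loaded measurement is meant to prevent.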