We look specifically at two device types: an 8-core Xeon (E5-2648L clocked at 1.8 GHz) and a 4-core Core i7 (2715QE clocked at 2.1 GHz). Both are relatively low-power devices suitable for use on embedded processor boards. The algorithm of interest is a complex FFT (fft_copx) from the Mercury MathPack library, which represents the numerically intensive processing typical of many embedded signal processing systems. One question to be addressed is whether per-core performance falls off as the FFT is run on more cores of the same device simultaneously. Although the cores are independent and each has its own L1 and L2 cache, they share access to an L3 cache and to the DRAM controller. As the FFT size increases from 1K (1024 points) to 1M (1,048,576 points), the demands on the memory subsystem grow, and the cores increasingly compete for these shared resources.
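To make the memory pressure concrete, the following is a minimal sketch of the working-set arithmetic. It assumes single-precision complex samples (8 bytes each) and an out-of-place transform with separate input and output buffers; both are assumptions for illustration, not confirmed properties of fft_copx. The helper name `fft_working_set_bytes` is hypothetical, and twiddle-factor tables and other scratch storage are ignored.

```python
# Assumption: single-precision complex samples, i.e. two 4-byte floats each.
BYTES_PER_COMPLEX_SAMPLE = 8


def fft_working_set_bytes(points, cores=1, buffers=2):
    """Bytes of sample data touched when `cores` cores each run one FFT.

    `buffers=2` models an out-of-place transform (one input buffer and one
    output buffer per FFT); this is an assumption made for illustration.
    """
    return points * BYTES_PER_COMPLEX_SAMPLE * buffers * cores


if __name__ == "__main__":
    # FFT sizes spanning the 1K-to-1M range discussed in the text.
    for points in (1024, 65536, 1048576):
        for cores in (1, 4, 8):
            kib = fft_working_set_bytes(points, cores) / 1024
            print(f"{points:>8} points on {cores} core(s): {kib:>9.0f} KiB")
```

Under these assumptions a single 1K FFT touches only 16 KiB, while eight simultaneous 1M FFTs touch 128 MiB of sample data, which suggests why contention for the shared L3 cache and the DRAM controller would matter far more at the large end of the range.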
This means that for some applications, you cannot simply measure the timing on a single core and then scale the results to size your system; the falloff in performance as all cores are utilized must be taken into account.
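The sizing consequence can be sketched as follows. This is a simplified model, not a method taken from the measurements: the function `cores_needed` and its `parallel_efficiency` parameter are hypothetical, and the numbers in the usage section are invented placeholders rather than benchmark results.

```python
import math


def cores_needed(ffts_per_second, single_core_seconds, parallel_efficiency=1.0):
    """Estimate the number of cores required to sustain a target FFT rate.

    single_core_seconds: time for one FFT measured on a single core running alone.
    parallel_efficiency: per-core throughput with all cores busy, divided by
        the throughput of one core running alone (1.0 means no falloff).
    """
    if not 0.0 < parallel_efficiency <= 1.0:
        raise ValueError("parallel_efficiency must be in (0, 1]")
    # Core-seconds of work per second at the single-core rate, inflated by
    # the slowdown each core sees when the others are active.
    return math.ceil(ffts_per_second * single_core_seconds / parallel_efficiency)


if __name__ == "__main__":
    # Placeholder inputs chosen only to illustrate the calculation.
    naive = cores_needed(1000, 0.0035)
    adjusted = cores_needed(1000, 0.0035, parallel_efficiency=0.6)
    print(f"naive estimate: {naive} cores, with falloff: {adjusted} cores")
```

The efficiency factor has to come from a measurement taken with all cores loaded; a single-core timing alone gives only the naive estimate.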