SpikeFun 0.65 - NUMA Optimizations

SpikeFun v0.65 contains memory and work scheduling optimizations that target NUMA architectures, which are becoming the de facto standard in the x86 server and workstation markets.

NUMA systems (such as the Intel Xeon E5-2600, previously known as the "Romley" platform) consist of separate nodes of CPUs and memory. This effectively means that memory access is not uniform: latency and bandwidth depend on whether the addressed memory is in the local or a remote NUMA node. Memory-intensive applications (such as biological neural network simulations) that ignore the NUMA layout run slower than they could because of remote memory-access penalties.

To ensure optimal performance on NUMA systems, SpikeFun 0.65 employs several optimization strategies:

  • Neuron and synaptic memory is divided among the NUMA nodes present in the system
  • Each neuron has its own parent NUMA node, and all memory related to this neuron is local within the node
  • Worker threads are partitioned so that they reflect the NUMA system layout
  • Work scheduling done by the work manager is NUMA-aware
  • Spike delivery is handled separately for post-synaptic neurons belonging to the same NUMA node and for neurons belonging to foreign NUMA nodes
  • Spikes to post-synaptic neurons belonging to a foreign NUMA node are delivered by worker threads bound to that node
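
The partitioning and spike-routing ideas in the list above can be illustrated with a small sketch. This is not SpikeFun's actual implementation: the names Neuron, assign_nodes and route_spikes are hypothetical, the even block split is an assumed policy, and real NUMA-local allocation and thread binding (done through operating system APIs) are left out. It only shows the bookkeeping: each neuron gets a parent node, and spike deliveries are grouped by the node that owns the post-synaptic neuron.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Neuron:
    neuron_id: int
    node: int                                    # parent NUMA node (hypothetical field)
    targets: list = field(default_factory=list)  # ids of post-synaptic neurons


def assign_nodes(num_neurons, num_nodes):
    """Give each neuron a parent node using contiguous, near-equal blocks.

    Assumed policy for illustration only; the source does not say how
    SpikeFun chooses a neuron's parent node.
    """
    per_node = -(-num_neurons // num_nodes)  # ceiling division
    return [Neuron(i, i // per_node) for i in range(num_neurons)]


def route_spikes(neurons, fired_ids):
    """Group spike deliveries by the node owning the post-synaptic neuron.

    Returns (local, remote). Both map a destination node to a list of
    (source id, target id) pairs. 'local' holds deliveries that stay inside
    the source neuron's node; 'remote' holds deliveries that would be handed
    to worker threads bound to the destination node.
    """
    local = defaultdict(list)
    remote = defaultdict(list)
    for src_id in fired_ids:
        src = neurons[src_id]
        for dst_id in src.targets:
            dst = neurons[dst_id]
            queue = local if dst.node == src.node else remote
            queue[dst.node].append((src_id, dst_id))
    return local, remote


if __name__ == "__main__":
    neurons = assign_nodes(num_neurons=8, num_nodes=2)
    neurons[0].targets = [1, 5]   # one same-node target, one foreign-node target
    neurons[6].targets = [7, 2]
    local, remote = route_spikes(neurons, fired_ids=[0, 6])
    print("local: ", dict(local))
    print("remote:", dict(remote))
```

In a real simulator each per-node queue would be drained only by threads pinned to that node, so that writes to a post-synaptic neuron's state always touch node-local memory.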

Results on the reference platform:

For the performance comparison, we used a dual-CPU Intel(R) Xeon E5-2600 system:

  • CPU: 2 x Intel Xeon E5-2687W (3.1 GHz, 8 cores each, 16 cores total)
  • Memory: 32 GB DDR3-2133 (outside the Romley specification, overclocked IMC), achieving 68.2 GB/s bandwidth between RAM and the CPU
  • Motherboard: ASUS Z9PE D8 WS workstation motherboard
  • Simulation: 32768 multi-compartment neurons, 1.8 million synapses
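
As a rough sanity check, the 68.2 GB/s figure quoted above appears consistent with the theoretical peak of one CPU socket. The sketch below assumes four memory channels per socket (as on the E5-2600 family) and 8 bytes per transfer per channel (a 64-bit DDR3 channel); these assumptions and the function name are ours, not stated in the original.

```python
def peak_bandwidth_gb_s(transfers_per_sec_millions, channels, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return transfers_per_sec_millions * channels * bytes_per_transfer / 1000


# Assumed configuration: DDR3-2133, quad-channel, 64-bit channels.
per_socket = peak_bandwidth_gb_s(2133, channels=4)
print(f"{per_socket:.3f} GB/s per socket")
```

Under these assumptions the result is about 68.3 GB/s per socket, in line with the quoted value. It is a peak rate to node-local memory; accesses to the other socket's memory would not reach it.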

The system architecture can be described with the following diagram:


The following table shows the performance difference between SpikeFun 0.64 and 0.65 on the NUMA platform:

Test Description       # CPUs               NUMA Mode Enabled in BIOS   Simulation Performance
SpikeFun 0.64 (ref)    1 (8 cores total)    N/A                         0.245x Real Time
SpikeFun 0.64          2 (16 cores total)   NO                          0.321x Real Time
SpikeFun 0.64          2 (16 cores total)   YES                         0.276x Real Time
SpikeFun 0.65          2 (16 cores total)   YES                         0.409x Real Time
SpikeFun 0.67          2 (16 cores total)   YES                         0.510x Real Time

As can be seen, SpikeFun 0.65 (and, further, SpikeFun 0.67) delivers a significant simulation speed improvement when running on a NUMA system. Another interesting observation is that SpikeFun 0.64, which is not NUMA-optimized, runs slower when the NUMA partitioning is exposed to the Windows scheduler than when the NUMA hardware layout is not published by the firmware (BIOS/EFI). However, when memory allocation and thread management follow the NUMA system layout, performance starts to approach the theoretical 2x speedup that could be gained from doubling the number of cores in the system.
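
To make the comparison concrete, the speedup of each dual-CPU run over the single-CPU SpikeFun 0.64 reference can be computed directly from the table. The snippet below is only arithmetic on the published figures; the dictionary keys are our own labels.

```python
BASELINE = 0.245  # SpikeFun 0.64, 1 CPU (8 cores), fraction of real time

# Dual-CPU (16 cores) results from the table, as fractions of real time.
results = {
    "0.64, NUMA off": 0.321,
    "0.64, NUMA on": 0.276,
    "0.65, NUMA on": 0.409,
    "0.67, NUMA on": 0.510,
}

# Speedup relative to the single-CPU reference, rounded to two decimals.
speedups = {name: round(perf / BASELINE, 2) for name, perf in results.items()}

for name, s in speedups.items():
    print(f"SpikeFun {name}: {s:.2f}x over the single-CPU reference")
```

By this measure the unoptimized 0.64 reaches about 1.31x with NUMA hidden from the operating system and about 1.13x with NUMA enabled, while 0.65 reaches about 1.67x and 0.67 about 2.08x.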