DigiCortex 1.21 - Support for NVIDIA Pascal - 1.8 Million Neurons Simulated in Real-Time!

Click here to download DigiCortex demo...

Update notice: since this article was written, we have gained access to an NVIDIA P100 ("Big Pascal") system. Please click here for updated benchmarking data that includes the P100.

DigiCortex v1.21 brings improvements in the GPU area: it now supports multi-GPU CUDA-accelerated simulations and offers full support for the latest-generation NVIDIA Pascal GPU architecture. In this article, we provide a short overview of the performance improvements obtained when running DigiCortex on the NVIDIA Pascal architecture compared to the previous GPU generation (Maxwell).

For the purpose of this article, we tested performance using the consumer-class Pascal GPU (GP102) we had at hand. We achieved real-time simulation of 1.8 million neurons with 70 million synapses on a single Pascal GPU.

The results below are not close to the maximum the NVIDIA Pascal architecture can offer, because of a significant difference in memory bandwidth between professional and consumer-level cards: the top-of-the-line Pascal GP100 comes with HBM2 memory (720 GB/s) instead of the GP102's GDDR5X (484 GB/s), theoretically offering almost 50% higher memory bandwidth. To get an early estimate of the impact of memory bandwidth on DigiCortex performance, we simulated higher memory bandwidths on our consumer-level GPU by overclocking its memory clock while keeping the rest of the benchmark environment unchanged.


Benchmark Overview

With 3584 CUDA cores (up to 3840 in certain hardware configurations) coupled with GDDR5X memory, the NVIDIA Pascal architecture represents a significant improvement over the previous Maxwell architecture. Comparing the GTX 1080 Ti "Pascal" and Titan X "Maxwell" GPUs, theoretical FP32 FMA throughput improves by 71% (11.3 vs. 6.6 TFLOPs), while hardware memory bandwidth improves by 44% (484 vs. 336 GB/s).

The table below summarizes the hardware details of the GPUs used in the performance benchmarks:

NVIDIA GPU Specification Comparison:

                      GTX 1080 Ti ("Pascal")   Titan X ("Maxwell")
GPU                   GP102                    GM200
CUDA Cores            3584                     3072
Core Clock            1481 MHz                 1000 MHz
TFLOPs (FP32 FMA)     11.3 TFLOPs              6.6 TFLOPs
Memory Type           GDDR5X (11 Gbps)         GDDR5 (7 Gbps)
Memory Bus Width      352-bit                  384-bit
Memory Bandwidth      484 GB/s                 336 GB/s

For additional details regarding the simulation parameters used in the performance benchmarks, please scroll to the end of this article (Appendix 1).


Simulation Performance using Single-Compartment Neurons

Using a single compartment per neuron, we created simulations with up to 4.2 million neurons / 166.6 million synapses. Simulation performance is shown in the chart below. The "real-time factor" is the ratio of simulated time to the wall-clock time needed to compute it (i.e., 1.5x real-time means that 1 second of simulated time is computed in roughly 0.66 seconds on our hardware).

[Chart: single-compartment simulation performance, GTX 1080 Ti ("Pascal") vs. Titan X ("Maxwell")]

Performance improvements of the Pascal architecture over Maxwell range from 1.28x to 1.48x, with the largest improvement seen in the 4-million-neuron simulation, which runs at 0.55x real-time on the GP102. The largest improvement on the largest simulation is expected, since DigiCortex GPU utilization is highest when running large simulations (time spent in GPU compute dominates overheads such as kernel launches and barriers).

To find the point at which the GP102 GPU reaches real-time performance, we decreased the number of neurons until simulation time matched wall-clock time (i.e., 1000 time steps of 1 ms computed in 1 second). As a result, we were able to simulate 1.8 million single-compartment neurons with 70 million synapses in real-time.
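As a quick numerical illustration of the real-time factor used throughout this article, here is a minimal sketch (the helper function is our own, not part of DigiCortex):

    // Minimal sketch (not DigiCortex code): the real-time factor is the amount
    // of simulated time divided by the wall-clock time needed to compute it.
    double real_time_factor(double simulated_ms, double wallclock_ms)
    {
        return simulated_ms / wallclock_ms;
    }

    // Example: 1000 time steps of 1 ms computed in 666 ms of wall-clock time
    // gives real_time_factor(1000.0, 666.0) ~= 1.5, i.e. "1.5x real-time";
    // a factor of exactly 1.0 is the real-time threshold reported above.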


Simulation Performance using Multi-Compartment Neurons

Following the single-compartment benchmarks, we also analyzed the performance gains of the Pascal architecture when running simulations with multi-compartment neurons. In this case, each neuron is modeled as a tree of electrically coupled compartments. A higher number of compartments increases the level of detail that can be simulated, but also the computational demand, since each compartment can be seen as a separate "point" neuron with its own state variables that need to be updated.
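To make the cost of multi-compartment models concrete, here is a minimal sketch (our own illustration, not DigiCortex source; the struct and kernel names are hypothetical) of how a compartment tree can be stored and traversed on the GPU, with each compartment holding its own state and an axial coupling to its parent:

    // Illustrative sketch only (not DigiCortex source). Each compartment is a
    // "point" neuron with its own membrane potential; compartments are coupled
    // to their parent in the tree by an axial conductance.
    struct CompartmentState {
        float v;        // membrane potential [mV]
        int   parent;   // index of parent compartment, -1 for the soma/root
        float g_axial;  // coupling conductance to the parent [nS]
    };

    __global__ void add_axial_currents(const CompartmentState* comp,
                                       float* I_in, int n_compartments)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_compartments) return;

        int p = comp[i].parent;
        if (p >= 0) {
            // Current flowing from the parent into this compartment.
            float I = comp[i].g_axial * (comp[p].v - comp[i].v);
            atomicAdd(&I_in[i],  I);   // into this compartment
            atomicAdd(&I_in[p], -I);   // equal and opposite into the parent
        }
    }

Because every compartment carries its own state and its own coupling term, the per-step cost grows with the total number of compartments rather than with the number of neurons.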

[Chart: multi-compartment simulation performance, GTX 1080 Ti ("Pascal") vs. Titan X ("Maxwell")]

With multiple compartments per neuron, the GP102 brings performance gains in the range of 1.36x to 2.08x. As with the single-compartment simulations, the largest gains are achieved on large simulations that allow maximum GPU utilization, in this case a simulation of 1.05 million neurons with 6.2 million compartments and 209.1 million synapses.

With DigiCortex 1.21, it is possible to simulate 0.5 million neurons with 2 million compartments in real-time on a single GP102 GPU.


Memory Bandwidth Impact on Simulation Performance

Even with reduced-complexity phenomenological models of spiking neurons, which are far less data-intensive than biophysical models such as Hodgkin-Huxley, biological neural network simulation is a "memory-bound" computational task: updating a model variable for a given neuron, compartment, or synapse requires fetching a large number of memory locations. Available memory bandwidth therefore has the most profound effect on simulation performance. To test how performance scales with hardware memory bandwidth, we ran benchmarks on the same GP102 GPU with two different GDDR5X clock settings, yielding total memory bandwidths of 484.4 GB/s and 532.8 GB/s (a 10% difference). The impact on simulation performance is shown below:

[Chart: simulation performance on GP102 at 484.4 GB/s vs. 532.8 GB/s memory bandwidth]

The performance gain from the extra 10% of memory bandwidth ranges from 6% to 9.89%. Fixing the memory bandwidth at 484.4 GB/s (factory GDDR5X clock) and increasing the GPU core clock by 10% does not result in measurable performance gains regardless of simulation size, showing that simulation performance is indeed bound by RAM bandwidth on this generation of NVIDIA GPUs. One of the next steps will be to measure simulation performance on a GP100 GPU, which has considerably more RAM bandwidth to work with (720 GB/s).
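As an illustration of why such workloads end up memory-bound, consider the following minimal sketch of a synaptic-input gather (our own example, not DigiCortex source; all names are hypothetical). Each synapse costs roughly one floating-point addition but requires reading its weight, its pre- and postsynaptic indices, and the presynaptic spike flag, so the kernel waits on memory rather than on the ALUs:

    // Illustrative sketch only (not DigiCortex source): a synaptic-input gather
    // with very low arithmetic intensity. Per synapse it reads ~13 bytes
    // (pre index, post index, weight, spike flag) but performs only one add.
    __global__ void gather_synaptic_input(const int*           pre_idx,   // presynaptic neuron per synapse
                                          const int*           post_idx,  // postsynaptic neuron per synapse
                                          const float*         weight,    // synaptic weight per synapse
                                          const unsigned char* fired,     // 1 if the neuron spiked this step
                                          float*               I_syn,     // accumulated input per neuron
                                          int                  n_synapses)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= n_synapses) return;

        if (fired[pre_idx[s]])
            atomicAdd(&I_syn[post_idx[s]], weight[s]);  // one addition per active synapse
    }

With so little arithmetic per byte of traffic, raising the core clock leaves a kernel like this waiting on memory, which is consistent with the measurements above; only faster RAM moves the needle.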

Conclusion

The NVIDIA Pascal GP102 GPU brings considerable performance gains (up to 2.08x) over the previous-generation Maxwell GPUs, allowing DigiCortex to simulate, in real-time, 1.8 million single-compartment neurons or 0.5 million multi-compartment neurons with >2 million compartments on a single GPU. Due to the high memory-bandwidth demands of biological neural network simulation, further gains are expected on GPUs equipped with high-bandwidth HBM2 RAM.


Appendix 1: Simulation Details

Simulation Details:
Integration Step Size: 1 ms
Numerical Solver: Forward + Backward Euler Method
Neuron Models: Single- and multi-compartment Adaptive Quadratic Integrate & Fire ("Izhikevich"); a minimal sketch of this model follows the table
Neuron Types Simulated: Excitatory: Pyramidal, Spiny Stellate, Thalamocortical Relay; Inhibitory: Basket, Non-basket, Reticular Thalamic, Thalamic Interneuron
Number of Cells: Test-specific, please see the charts above
Excitatory / Inhibitory Cell Ratio: 77% excitatory, 23% inhibitory
Synaptic Receptor Types: AMPA, NMDA, GABAa, GABAb
Short-Term Synaptic Plasticity: Yes (Tsodyks & Markram model), enabled for all synapses
Long-Term Synaptic Plasticity: Yes (voltage-based STDP), enabled for excitatory synapses
Network Geometry: Thalamocortical system with 6-layer cortex, 3D cell positioning based on anatomical MRI data, axonal guidance using diffusion MRI data
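
For readers unfamiliar with the adaptive quadratic integrate-and-fire ("Izhikevich") model, the sketch below shows a forward-Euler update with a 1 ms step using the textbook parameterization from Izhikevich (2003). It is an illustration only: DigiCortex's exact parameters, compartment coupling, synapse handling, and the backward-Euler portion of its solver are not reproduced here.

    // Textbook Izhikevich (2003) "simple model" update, forward Euler, dt = 1 ms.
    // Illustration only; DigiCortex uses its own parameterization and a mixed
    // forward/backward Euler scheme not shown here.
    __global__ void izhikevich_step(float* v, float* u, const float* I,
                                    unsigned char* fired, int n_neurons)
    {
        const float dt = 1.0f;                  // integration step [ms]
        const float a = 0.02f, b = 0.2f;        // recovery time scale and coupling
        const float c = -65.0f, d = 8.0f;       // after-spike reset (regular spiking)

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_neurons) return;

        float vi = v[i], ui = u[i];
        float dv = 0.04f * vi * vi + 5.0f * vi + 140.0f - ui + I[i];  // quadratic I&F
        float du = a * (b * vi - ui);                                 // adaptation variable
        vi += dt * dv;
        ui += dt * du;

        fired[i] = (vi >= 30.0f);               // spike detection threshold [mV]
        if (fired[i]) { vi = c; ui += d; }      // reset and adapt after a spike
        v[i] = vi;
        u[i] = ui;
    }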