DigiCortex NVIDIA P100 ("Pascal") Update: 5M Neurons Simulated in Real-time on Dual GP100 GPU


In our previous article we tested the DigiCortex engine on NVIDIA's consumer "Pascal" GPUs and observed significant improvements in simulation performance compared to the previous GPU generation ("Maxwell"). Consumer "Pascal" GPUs from NVIDIA are equipped with GDDR5X memory, with a maximum memory bandwidth of 484 GB/s. This is significantly lower than what is available on the P100 accelerator, which uses HBM2 memory with a total bandwidth of 720 GB/s.
Today we show the results of DigiCortex 1.22 running on single and dual NVIDIA P100 accelerators.


Benchmark Overview

The table below summarizes the hardware details of the GPUs used for the performance benchmarks:

NVIDIA GPU Specification Comparison:
                      Tesla P100 ("Pascal")     GTX 1080 Ti ("Pascal")
GPU                   GP100                     GP102
CUDA Cores            3584                      3584
Core Clock            1328 MHz                  1481 MHz
TFLOPs (FP32 FMA)     10.6 TFLOPs               11.3 TFLOPs
Memory Type           HBM2 (1.4 Gbps)           GDDR5X (11 Gbps)
Memory Bus Width      4096-bit                  352-bit
Memory Bandwidth      720 GB/s                  484 GB/s
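
The two bandwidth figures in the table follow directly from bus width and per-pin data rate; a quick back-of-the-envelope check (plain host code, nothing DigiCortex-specific):

    // Peak memory bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps.
    #include <cstdio>

    static double peak_bandwidth_gbs(int bus_width_bits, double data_rate_gbps) {
        return (bus_width_bits / 8.0) * data_rate_gbps;
    }

    int main() {
        // Tesla P100: 4096-bit HBM2 at ~1.4 Gbps per pin -> ~717 GB/s (quoted as 720 GB/s).
        std::printf("Tesla P100:  %.0f GB/s\n", peak_bandwidth_gbs(4096, 1.4));
        // GTX 1080 Ti: 352-bit GDDR5X at 11 Gbps per pin -> 484 GB/s.
        std::printf("GTX 1080 Ti: %.0f GB/s\n", peak_bandwidth_gbs(352, 11.0));
        return 0;
    }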
For additional details regarding the simulation parameters used in the performance benchmarks, please see Appendix 1 at the end of this article.


Single GPU Simulation Performance using "Point" (Single-Compartment) Neurons

Using a single compartment per neuron, we show results for simulations of up to 4.2 million neurons / 166.6 million synapses. Simulation performance is shown in the chart below. The "real-time factor" is the ratio of simulated time to wall-clock run time (i.e., 1.5x real-time means that 1 second of simulated time is computed in roughly 0.67 seconds on our hardware).

[Chart: single-GPU simulation performance, Tesla P100 vs. GTX 1080 Ti]
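
The article does not describe DigiCortex's internal neuron model or kernels, so purely as an illustration of what a single-compartment ("point") neuron update can look like on the GPU, here is a minimal CUDA sketch using an Izhikevich-style model (the model choice, names, and parameters are assumptions, not DigiCortex code):

    // Minimal, illustrative single-compartment ("point") neuron update.
    // Izhikevich-style dynamics are used only as a common example of a
    // point-neuron model; this is NOT DigiCortex's actual implementation.
    __global__ void update_point_neurons(float* v, float* u, const float* i_syn,
                                         unsigned char* fired, int n, float dt)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;

        float vm = v[idx];
        float um = u[idx];

        // Membrane potential and recovery variable (Izhikevich 2003 form, Euler step).
        float dv = 0.04f * vm * vm + 5.0f * vm + 140.0f - um + i_syn[idx];
        float du = 0.02f * (0.2f * vm - um);
        vm += dt * dv;
        um += dt * du;

        // Spike detection and reset.
        if (vm >= 30.0f) {
            fired[idx] = 1;
            vm = -65.0f;   // reset potential (parameter "c")
            um += 8.0f;    // recovery increment (parameter "d")
        } else {
            fired[idx] = 0;
        }

        v[idx] = vm;
        u[idx] = um;
    }

Even a simple update like this streams several words per neuron (plus synaptic input) every timestep, which is why throughput at these network sizes tends to be bound by memory bandwidth rather than raw FLOPs.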

As can be seen in the chart above, there is a significant performance gain when using the P100 accelerator. Under certain workloads, the speed-up compared to the consumer GP102 "Pascal" GPU is almost 2x. This can be partially explained by the significantly higher memory bandwidth (720 GB/s vs. 484 GB/s), as well as by Tesla-specific features such as overlapped data transfers between the host and the GPU. This allows us to simulate up to 2.6 million neurons and 105 million synapses in real time on a single P100 accelerator.
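
How DigiCortex overlaps transfers with compute is not detailed in the article; the general CUDA pattern relies on pinned host memory, separate streams, and cudaMemcpyAsync, roughly as sketched below (buffer names are illustrative, and the kernel is the point-neuron sketch from above):

    #include <cuda_runtime.h>

    // Generic CUDA pattern for overlapping host->device copies with kernel work
    // using pinned memory and separate streams. Not DigiCortex source code.
    void run_steps(float* d_v, float* d_u, float* d_isyn, unsigned char* d_fired,
                   int n, float dt, int num_steps)
    {
        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreate(&copy_stream);
        cudaStreamCreate(&compute_stream);

        // Pinned (page-locked) host buffer so cudaMemcpyAsync can truly overlap.
        float* h_stimulus = nullptr;
        cudaMallocHost(&h_stimulus, n * sizeof(float));
        float* d_stimulus = nullptr;
        cudaMalloc(&d_stimulus, n * sizeof(float));

        for (int step = 0; step < num_steps; ++step) {
            // Stage the next stimulus chunk while the compute stream is busy.
            // (In a full implementation the stimulus would feed into i_syn;
            // its consumption is omitted in this sketch.)
            cudaMemcpyAsync(d_stimulus, h_stimulus, n * sizeof(float),
                            cudaMemcpyHostToDevice, copy_stream);

            // Neuron update for the current timestep.
            update_point_neurons<<<(n + 255) / 256, 256, 0, compute_stream>>>(
                d_v, d_u, d_isyn, d_fired, n, dt);

            // Wait for both the copy and the update before the next step.
            cudaStreamSynchronize(copy_stream);
            cudaStreamSynchronize(compute_stream);
        }

        cudaFree(d_stimulus);
        cudaFreeHost(h_stimulus);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
    }

Tesla-class GPUs such as the P100 expose dual copy engines, so copies can proceed in both directions while kernels run, which is the kind of capability referred to above.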


Multi-GPU Simulation Performance using "Point" (Single-Compartment) Neurons

Since the system available to us had two P100 accelerators, and DigiCortex recently gained multi-GPU support (for up to 32 GPUs), we also tested performance scaling with both P100 accelerators in use. The chart below shows the results:

[Chart: dual-GPU vs. single-GPU P100 simulation performance]

The performance increase on the dual-GPU setup depends on the size of the workload: at least 0.5 million neurons are needed to reach the same speed as a single GPU, due to the additional inter-GPU communication and data-transfer overhead. However, once the simulation becomes large enough to minimize the relative impact of the additional I/O kernel launches, performance scales almost linearly, reaching 1.211x real-time for a simulation of 4.2 million neurons and 166.6 million synapses.
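
DigiCortex's multi-GPU partitioning scheme is not described here. A common approach is to give each GPU a slice of the neuron population and exchange spike flags every timestep, via peer-to-peer copies where available; a minimal sketch of that pattern (all names illustrative, not DigiCortex code):

    #include <cuda_runtime.h>
    #include <vector>

    // Illustrative multi-GPU spike exchange: each GPU owns a slice of the neuron
    // population and broadcasts its spike flags to the other GPUs every timestep.
    void exchange_spikes(const std::vector<unsigned char*>& d_fired, // per-GPU buffers, full size n
                         const std::vector<int>& offset,             // start index of each GPU's slice
                         const std::vector<int>& count,              // neurons owned by each GPU
                         int num_gpus)
    {
        for (int src = 0; src < num_gpus; ++src) {
            for (int dst = 0; dst < num_gpus; ++dst) {
                if (dst == src) continue;
                // Copy the source GPU's slice of spike flags into the destination
                // GPU's buffer. Uses peer-to-peer transfers when enabled via
                // cudaDeviceEnablePeerAccess, otherwise staged through the host.
                cudaMemcpyPeer(d_fired[dst] + offset[src], dst,
                               d_fired[src] + offset[src], src,
                               count[src] * sizeof(unsigned char));
            }
        }
    }

This per-step exchange is the overhead mentioned above: for small networks the copies dominate, while for millions of neurons the update kernels run long enough to amortize or hide them.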

With two P100 accelerators and DigiCortex 1.22, it is possible to simulate 5 million neurons with 200 million synapses in real time.


Conclusion

The NVIDIA P100 ("Pascal") accelerator provides a significant gain in simulation performance for large-scale spiking neural networks, thanks to its very high-bandwidth HBM2 memory as well as support for overlapping data transfers, a feature not available on consumer-class GPUs. The DigiCortex engine efficiently utilizes the P100's resources and allows millions of neurons to be simulated in real time.