NAMD

Test Cases

We ran the three standard NAMD benchmark simulations listed below.

  • ApoA1 benchmark (92,224 atoms, periodic, PME). ApoA1 has been the standard NAMD cross-platform benchmark for years.
  • ATPase benchmark (327,506 atoms, periodic, PME)
  • STMV (virus) benchmark (1,066,628 atoms, periodic, PME). STMV is useful for demonstrating scaling to thousands of processors.

To make sure the simulations run long enough to give meaningful timings, we increased the number of steps from 500 (500 fs) to 3000 (3000 fs) in each test case.
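In each benchmark's NAMD configuration file this is a one-line change; a minimal sketch, assuming the stock input files distributed with the benchmarks (numsteps is NAMD's standard keyword for the number of integration steps):

# the stock benchmark inputs use "numsteps 500"; we ran
numsteps          3000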

The simulations are available at /home/btemelso/benchmarks/NAMD/Marcy until they are moved to a more standard location.

Code/binaries

We used the precompiled NAMD 2.9 binaries provided on the NAMD website:

  • Linux-x86_64-ibverbs (InfiniBand via OpenFabrics OFED, no MPI needed) - for CPU-only tests
  • Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration) - for hybrid CPU+GPU tests
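On Marcy these builds are exposed through environment modules (the module names below are the ones used in the batch scripts later on this page); an illustrative way to check which binary a module provides:

# CPU-only ibverbs build
module load namd/2.9
which namd2
# hybrid CPU+GPU build (on a GPU node)
module switch namd/2.9 namd/2.9-cuda
which namd2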

To test the scaling and efficiency of NAMD on Marcy, we ran the simulations on CPU cores first and proceeded to run them in a hybrid CPU+GPU setup. Those benchmarks are reported below.

Benchmarks

NAMD (CPU only)

A typical batch submission file for these calculations looks like this. We varied the NUMBER_OF_NODES parameter as needed, set up the path to the CPU version of NAMD 2.9 by loading the 'namd/2.9' module, and ran each simulation.

#!/bin/tcsh
#PBS -q mercury
#PBS -l mem=30gb
#PBS -l nodes=__NUMBER_OF_NODES__:ppn=16
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -e jNAMD-CPU
#PBS -N jNAMD-CPU
#PBS -V

set echo
cd $PBS_O_WORKDIR

source /usr/local/Modules/3.2.10/init/tcsh
setenv CONV_RSH ssh

module load namd/2.9

# Uncomment the run for the desired core count (STMV shown; multi-node runs at 32+ cores launch through charmrun ++mpiexec)
#namd2 +idlepoll +p1 stmv.namd > 1.log
#charmrun +p2 /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 2.log
#charmrun +p4 /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 4.log
#charmrun +p8 /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 8.log
#charmrun +p16 /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 16.log
#charmrun +p32 ++mpiexec /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 32.log
#charmrun +p64 ++mpiexec /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 64.log
#charmrun +p128 ++mpiexec /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 128.log
#charmrun +p192 ++mpiexec /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 192.log
#charmrun +p256 ++mpiexec /usr/local/Dist/NAMD-2.9/namd2 stmv.namd > 256.log
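To sweep over node counts, we substituted __NUMBER_OF_NODES__ before each submission. A minimal sketch of one way to do that (the run.pbs file name and the sed-based substitution are illustrative, not the exact commands we used):

# generate and submit one job per node count (1-16 nodes covers 16-256 cores at ppn=16)
foreach n (1 2 4 8 12 16)
    sed "s/__NUMBER_OF_NODES__/$n/" run.pbs > run_${n}nodes.pbs
    qsub run_${n}nodes.pbs
end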

The parallel scaling, defined as t_1/t_n, the ratio of the wall time on one core to the wall time on n cores, is given below. (For example, ApoA1 takes 4,404 s on one core and 296 s on 16 cores in the wall-time tables further down, giving a scaling of roughly 14.9.)

Parallel Scaling of NAMD benchmark tests (t_1/t_n)
| n CPU cores | ApoA1  | ATPase | STMV   |
| 1           | 1.00   | 1.00   | 1.00   |
| 2           | 1.96   | 2.04   | 1.92   |
| 4           | 3.87   | 3.94   | 3.81   |
| 8           | 7.49   | 7.72   | 7.21   |
| 16          | 14.90  | 14.64  | 13.76  |
| 32          | 27.04  | 28.05  | 25.55  |
| 64          | 49.09  | 54.03  | 47.79  |
| 128         | 95.97  | 94.27  | 77.64  |
| 192         | 130.42 | 144.30 | 117.15 |
| 256         | 152.60 | 177.45 | 139.16 |

The parallel efficiency is defined as t_1/(n*t_n) where n is the number of cores used.

Parallel Efficiency of NAMD benchmark tests (linear = 1.00)
| n CPU cores | ApoA1 | ATPase | STMV |
| 1           | 1.00  | 1.00   | 1.00 |
| 2           | 0.98  | 1.02   | 0.96 |
| 4           | 0.97  | 0.99   | 0.95 |
| 8           | 0.94  | 0.97   | 0.90 |
| 16          | 0.93  | 0.91   | 0.86 |
| 32          | 0.84  | 0.88   | 0.80 |
| 64          | 0.77  | 0.84   | 0.75 |
| 128         | 0.75  | 0.74   | 0.61 |
| 192         | 0.68  | 0.75   | 0.61 |
| 256         | 0.60  | 0.69   | 0.54 |
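For reference, both tables can be regenerated from the NAMD log files. A minimal tcsh sketch, assuming each run's output is saved as <n>.log as in the script above and taking the wall time from the 'WallClock:' line NAMD prints at the end of a run:

set t1 = `grep "WallClock:" 1.log | awk '{print $2}'`
foreach n (2 4 8 16 32 64 128 192 256)
    set tn = `grep "WallClock:" ${n}.log | awk '{print $2}'`
    # print core count, parallel scaling t_1/t_n, and efficiency t_1/(n*t_n)
    echo $t1 $tn $n | awk '{printf "%4d  %7.2f  %5.2f\n", $3, $1/$2, $1/($3*$2)}'
end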

NAMD's scaling is quite impressive. We may need to run other simulations to see if this type of scaling holds true outside the benchmark tests provided with NAMD.

NAMD (CPU + GPU)

We then ran these same three simulations on one of our two GPU-containing nodes. All of the tests were run on node22 since node21 still appears to have issues.

A typical batch submission file for these calculations looks like this. We set up the path to the hybrid CPU+GPU version of NAMD 2.9 by loading the 'namd/2.9-cuda' module and ran each simulation. There is also a CUDA build of NAMD 2.9 that can run across multiple GPUs and nodes (Linux-x86_64-ibverbs-smp-CUDA, NVIDIA CUDA with InfiniBand), but we did not experiment with it, so all of the hybrid CPU+GPU tests reported here are limited to a single node.

#!/bin/tcsh
#PBS -q gpu
#PBS -l mem=30gb
#PBS -l nodes=1:ppn=16
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -e jNAMD-CPU+GPU
#PBS -N jNAMD-CPU+GPU
#PBS -V

set echo
cd $PBS_O_WORKDIR

source /usr/local/Modules/3.2.10/init/tcsh
setenv CONV_RSH ssh

module load namd/2.9-cuda

nvidia-smi --loop=1 > gputil.log &
namd2 +p16 +idlepoll stmv.namd > 16.log
namd2 +p8 +idlepoll stmv.namd > 8.log
namd2 +p4 +idlepoll stmv.namd > 4.log
namd2 +p2 +idlepoll stmv.namd > 2.log
namd2 +p1 +idlepoll stmv.namd > 1.log
kill `pgrep  nvidia-smi`

The 'nvidia-smi --loop=1 > gputil.log &' line was added to monitor the GPU utilization once per second while the calculations were running.
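As an aside, on a node with more than one GPU the multicore-CUDA binary can be restricted to particular devices with NAMD's +devices flag; node22 has a single K20, so we did not need it. An illustrative example:

# hypothetical multi-GPU node: run on GPU 0 only (device IDs as reported by nvidia-smi)
namd2 +p16 +idlepoll +devices 0 stmv.namd > 16.gpu0.log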

APoA1 (wall time in seconds)
| n CPU cores | t_CPU (s) | t_CPU + 1 K20 GPU (s) | GPU speedup |
| 1           | 4,404     | 384                   | 11          |
| 2           | 2,242     | 194                   | 12          |
| 4           | 1,138     | 128                   | 9           |
| 8           | 588       | 121                   | 5           |
| 16          | 296       | 119                   | 2           |
| 32          | 163       |                       |             |
| 64          | 90        |                       |             |

ATPase (wall time in seconds)
| n CPU cores | t_CPU (s) | t_CPU + 1 K20 GPU (s) | GPU speedup |
| 1           | 13,404    | 1,200                 | 11          |
| 2           | 6,575     | 603                   | 11          |
| 4           | 3,399     | 413                   | 8           |
| 8           | 1,735     | 381                   | 5           |
| 16          | 916       | 373                   | 2           |
| 32          | 478       |                       |             |
| 64          | 248       |                       |             |

STMV (wall time in seconds)
| n CPU cores | t_CPU (s) | t_CPU + 1 K20 GPU (s) | GPU speedup |
| 1           | 7,530     | 693                   | 11          |
| 2           | 3,930     | 364                   | 11          |
| 4           | 1,978     | 219                   | 9           |
| 8           | 1,045     | 177                   | 6           |
| 16          | 547       | 167                   | 3           |
| 32          | 295       |                       |             |
| 64          | 158       |                       |             |

The timings of the hybrid CPU+GPU runs are very similar across the three test cases. One GPU-containing node (16 CPU cores + 1 NVIDIA Tesla K20 GPU) performs roughly like three CPU-only nodes: the hybrid wall times fall between the 32-core and 64-core CPU-only times (e.g., ApoA1 takes 119 s hybrid vs. 163 s on 32 cores and 90 s on 64 cores).

(16 CPU cores + 1 NVIDIA Tesla K20 GPU) ~ 3 * (16 CPU cores)

Conclusions

  • NAMD scales very well up to the 256 cores we tested, so one can use a large number of cores without a dramatic decline in parallel efficiency (ApoA1 still runs at 60% efficiency on 256 cores).
  • If GPU nodes are available, use them to get about a 3x performance boost: (16 CPU cores + 1 NVIDIA Tesla K20 GPU) ~ 3 * (16 CPU cores)

That's a solid performance boost, but not as impressive as expected. Based on Adam's benchmarks, which showed that one node containing a cheaper ($600) and older GTX 680 performed equivalently to 128 CPU cores (~8 16-core nodes), we would have expected our newer and more expensive ($3,500) GPU to perform a lot better. Perhaps the difference between expected and actual performance can be attributed to:

  • Different benchmark cases
  • Different codes (AMBER vs. NAMD)
  • Compilation and optimization of the codes (we used the plain binaries from the NAMD site instead of compiling and optimizing the code for our hardware)