NAMD#
NAMD is a parallel molecular dynamics code for simulating large biomolecular systems, developed by the Theoretical and Computational Biophysics Group at the University of Illinois Urbana-Champaign. It uses the Charm++ parallel programming model with an SMP (Symmetric Multi-Processing) runtime, enabling efficient scaling across both CPU cores and GPU accelerators. Common use cases include protein folding, membrane dynamics, and free energy calculations. For more information and documentation, see the NAMD website.
GPU Nodes
This page covers the GPU-accelerated NAMD module (namd/3.0.2-mpi-smp-cuda). Jobs using this module must be submitted from a GPU login node and must request GPU resources via --gres=gpu. CPU-only modules are also available; see Accessing NAMD on Kestrel below.
Accessing NAMD on Kestrel#
NAMD is available through the module system on both GPU and CPU nodes.
GPU module (covered by this guide)#
module load namd/3.0.2-mpi-smp-cuda
Loading this module automatically pulls in the required dependencies:
- PrgEnv-gnu (GCC compiler environment)
- cray-mpich/8.1.28 (MPI library with OFI/CXI transport)
- libfabric (Slingshot-11 fabric interface)
- cuda/12.9
CPU modules#
For CPU-only runs (no GPU required), the following modules are available on standard compute nodes:
namd/2.14
namd/2.14_cray
namd/2.14_cray_abi
namd/3.0_cray
namd/3.0_intel
namd/3.0_intel2
namd/3.0_intel_mpich
The rest of this guide focuses on the GPU module.
Running NAMD on Kestrel#
Required GPU Configuration Flags
Specifying +devices on the command line tells NAMD which GPUs are available, but GPU offloading is not enabled by default. Without the following lines in your .namd configuration file, the simulation will run on CPU only:
GPUresident off
usePMEGPU on
GPUresident off enables the standard GPU force-offloading mode, where non-bonded forces and PME electrostatics are computed on the GPU while integration remains on the CPU. usePMEGPU on additionally offloads particle mesh Ewald (long-range electrostatics) to the GPU. Both flags are required together to make effective use of the H100 GPUs on Kestrel.
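For context, here is a minimal sketch of how these lines sit in a .namd file. The structure and coordinate file names below are placeholders for illustration, not part of this guide:

```
# Placeholder input files for illustration only
structure       mysystem.psf
coordinates     mysystem.pdb

# Required for GPU offloading on Kestrel
GPUresident     off
usePMEGPU       on
```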
NAMD uses Charm++ SMP mode: one MPI rank is launched per node, and that rank spawns multiple worker threads (PEs) across the node's CPU cores. The key launch parameters are:
- --ntasks-per-node=1: one MPI rank per node
- --cpus-per-task=<N>: expose N cores to each rank
- +ppn <N-1>: worker PE thread count (one less than cpus-per-task, reserving one core for the Charm++ communication thread)
- +setcpuaffinity: bind threads to cores for better NUMA locality
- +devices <list>: comma-separated list of GPU device indices to use (e.g., 0 for one GPU, 0,1 for two)
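To keep +ppn consistent with --cpus-per-task automatically, the worker-thread count can be derived inside the job script. This is a sketch: the helper name ppn_for is our own, and it relies on Slurm exporting SLURM_CPUS_PER_TASK within the allocation:

```shell
# Worker PEs = allocated cores minus one (reserved for the Charm++ comm thread)
ppn_for() {
  echo $(( $1 - 1 ))
}

PPN=$(ppn_for "${SLURM_CPUS_PER_TASK:-26}")   # 26 is an illustrative fallback
# Inside the job script you would then launch:
#   srun namd3 +ppn "$PPN" +setcpuaffinity +devices 0 input.namd
echo "$PPN"
```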
Single-Node, Single-GPU#
Allocate a single GPU node interactively:
salloc -A <account> -t 00:30:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=26 --mem=80G --gres=gpu:1 -p debug
Once on the node:
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
srun -N1 --ntasks-per-node=1 --cpus-per-task=26 \
namd3 +ppn 25 +setcpuaffinity +devices 0 input.namd
Single-Node, Two GPUs#
Using 2 GPUs on a single node can improve throughput for larger systems. Allocate a node with 2 GPUs:
salloc -A <account> -t 00:30:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=52 --mem=160G --gres=gpu:2 -p debug
Once on the node:
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
srun -N1 --ntasks-per-node=1 --cpus-per-task=52 \
namd3 +ppn 51 +setcpuaffinity +devices 0,1 input.namd
When to use multiple GPUs
Using 2 GPUs per node is beneficial when your system is large enough to fully utilize both. NAMD distributes work in units called patches; if the total number of patches is too small relative to the number of GPUs, performance can decrease rather than improve. As a rule of thumb, aim for at least ~100 patches per GPU. For small systems, using a single GPU is often faster.
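The patch heuristic above can be turned into a back-of-the-envelope check. This sketch is not NAMD's actual spatial decomposition; it simply assumes roughly cubic patches of about cutoff-plus-margin size (13 Å here) in a cubic box:

```shell
# Very rough patch-count estimate for a cubic box (NOT NAMD's exact algorithm)
estimate_patches() {
  # $1 = box edge length (Angstroms), $2 = approximate patch edge (Angstroms)
  per_dim=$(( ($1 + $2 - 1) / $2 ))   # ceiling of box/patch
  echo $(( per_dim * per_dim * per_dim ))
}

estimate_patches 150 13   # prints 1728: comfortably enough for 2 GPUs
estimate_patches 60 13    # prints 125: marginal for 2 GPUs (~62 patches each)
```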
Multi-Node#
For larger systems that benefit from more than one node, use srun across multiple nodes. NAMD requires one MPI rank per node in SMP mode.
salloc -A <account> -t 01:00:00 --nodes=2 --ntasks-per-node=1 --cpus-per-task=26 --mem=80G --gres=gpu:1 -p debug
Once the allocation is granted:
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
srun -N2 --ntasks-per-node=1 --cpus-per-task=26 \
namd3 +ppn 25 +setcpuaffinity +devices 0 input.namd
Sample Slurm Scripts#
Single-Node, Single-GPU Batch Job#
#!/bin/bash
#SBATCH -J namd_1node_1gpu
#SBATCH -A <account>
#SBATCH -t 04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=26
#SBATCH --mem=80G
#SBATCH --gres=gpu:1
#SBATCH -o namd_%j.out
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
srun namd3 +ppn 25 +setcpuaffinity +devices 0 input.namd
Single-Node, Two-GPU Batch Job#
#!/bin/bash
#SBATCH -J namd_1node_2gpu
#SBATCH -A <account>
#SBATCH -t 04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=52
#SBATCH --mem=160G
#SBATCH --gres=gpu:2
#SBATCH -o namd_%j.out
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
# 51 worker threads + 1 communication thread = 52 cores total
srun namd3 +ppn 51 +setcpuaffinity +devices 0,1 input.namd
Multi-Node Batch Job#
#!/bin/bash
#SBATCH -J namd_multinode
#SBATCH -A <account>
#SBATCH -t 04:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=26
#SBATCH --mem=80G
#SBATCH --gres=gpu:1
#SBATCH -o namd_%j.out
# Loads PrgEnv-gnu, cray-mpich/8.1.28, libfabric, and cuda automatically
module load namd/3.0.2-mpi-smp-cuda
cd /path/to/your/simulation
# 1 MPI rank per node, 25 worker threads per rank, 1 GPU per node
srun namd3 +ppn 25 +setcpuaffinity +devices 0 input.namd
Checking Performance#
NAMD prints periodic timing summaries during the run. To extract performance data from the output log:
grep "PERFORMANCE\|Benchmark time" namd_output.log | tail -10
The output will look similar to:
Info: Benchmark time: 25 CPUs X.XXXXX s/step X.XXXXX days/ns XXX MB memory
PERFORMANCE: 500 averaging XX.X ns/day, X.XXXXXX sec/step ...
Higher ns/day and lower s/step values indicate better performance.
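Building on the grep shown above, a small helper can isolate the most recent throughput figure. This assumes the "averaging ... ns/day" wording from the sample PERFORMANCE line; adjust the pattern if your NAMD version prints a different format:

```shell
# Print the latest ns/day value from NAMD log text on stdin
latest_ns_per_day() {
  grep -o 'averaging [0-9.]* ns/day' | tail -n 1 | awk '{print $2}'
}

# Usage: latest_ns_per_day < namd_output.log
```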
Hints and Additional Resources#
- +ppn tuning: Set +ppn to one less than --cpus-per-task. This reserves exactly one core per node for Charm++'s communication thread. Using the full core count for worker threads can cause communication stalls and degrade performance.
- Multiple GPUs per node: For multi-node runs, whether +devices 0,1 helps depends on the size of your system. Enabling 2 GPUs on each node increases the total GPU count, which reduces the number of patches per GPU. If your system is not large enough to saturate all GPUs, this can slow the simulation down. Test on a short run first.
- GPU offloading flags: The two configuration file settings required to actually use the GPU are GPUresident off and usePMEGPU on. GPUresident off enables standard GPU force-offloading (non-bonded forces computed on the GPU, integration on the CPU). usePMEGPU on additionally moves long-range electrostatics (PME) to the GPU. Without both flags, NAMD will run on CPU only even when +devices is specified. For fully GPU-resident execution (all computation on GPU, including integration), use GPUresident on instead; this can give further speedups on large systems but has some restrictions on supported features.
- Input files: NAMD accepts a .namd configuration file that references your PSF, PDB, and parameter files. To generate NAMD-compatible input files, see VMD (Visual Molecular Dynamics), which is developed by the same group.
- For additional documentation, tutorials, and mailing list support, see the NAMD documentation page and the NAMD mailing list.
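As a concrete illustration of the GPU-resident alternative mentioned in the hints above, the switch is a single line in the configuration file (a sketch; check the NAMD release notes for feature restrictions before using it in production):

```
# Fully GPU-resident execution: integration runs on the GPU as well
GPUresident     on
```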