GPU Algorithms

PandaDock provides three GPU-accelerated docking algorithms that deliver a 50-200x speedup over their CPU equivalents while maintaining comparable accuracy.

Prerequisites

Hardware Requirements:

NVIDIA GPU with CUDA Compute Capability 6.0+ (Pascal architecture or newer)
Minimum 4 GB GPU memory (8 GB+ recommended for large libraries)

Software Requirements:

CUDA Toolkit 11.0 or higher
CuPy build matching the installed CUDA version:

# Install CuPy for CUDA 11.x
pip install cupy-cuda11x

# Or for CUDA 12.x
pip install cupy-cuda12x

Verify GPU availability:

pandadock list-algorithms
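
If the GPU algorithms do not appear, it can help to check the driver and CuPy separately. The following is a minimal sketch, not part of PandaDock: check_gpu_env is a hypothetical helper name, and it assumes python3 is the interpreter into which CuPy was installed.

```shell
# Hypothetical helper (not a PandaDock command): checks the two layers
# that GPU docking depends on, the NVIDIA driver and the CuPy library.
check_gpu_env() {
    # Driver-level check: nvidia-smi ships with the NVIDIA driver.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    else
        echo "nvidia-smi not found: NVIDIA driver missing or not on PATH"
    fi

    # Library-level check: CuPy must import and report its device count.
    # Assumption: python3 is the interpreter that has CuPy installed.
    if ! python3 -c 'import cupy; print("CuPy devices:", cupy.cuda.runtime.getDeviceCount())' 2>/dev/null; then
        echo "CuPy could not be imported or found no usable CUDA device"
    fi
}

check_gpu_env
```

A GPU name from the first check together with a non-zero device count from the second suggests the prerequisites are in place.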

Available GPU Algorithms

Enhanced Hierarchical GPU

Algorithm ID: enhanced_hierarchical_gpu

Speedup: 50-100x over CPU version

GPU Memory: 1-4 GB

Best for: High-throughput, high-accuracy docking

Implements the same 3-stage hierarchical search as the CPU version, but with massive parallelization:

Parallel pose generation (thousands of poses simultaneously)
Batch scoring on GPU
Asynchronous CPU-GPU communication
Optimized memory management

Usage:

pandadock dock -r protein.pdb -l ligand.sdf \
               --algorithm enhanced_hierarchical_gpu \
               --gpu \
               --center 10 20 30 --box 20 20 20

GPU Parameters:

--gpu-batch-size: Poses per GPU batch (default: 1000)
--gpu-memory-limit: GPU memory limit in GB (default: 4.0)
--gpuid: GPU device ID (default: 0)

Performance:

RMSD: ~0.08 Å (same as CPU)
Runtime: 2-5 seconds per ligand
Throughput: 720-1800 ligands/hour
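
The throughput range follows from the per-ligand runtime: 3600 seconds per hour divided by the seconds spent on each ligand. A minimal sketch of that conversion, where ligands_per_hour is a hypothetical helper name and not a PandaDock command:

```shell
# Hypothetical helper: convert seconds per ligand into ligands per hour.
ligands_per_hour() {
    # $1 = runtime per ligand in seconds (may be fractional)
    awk -v s="$1" 'BEGIN { printf "%.0f\n", 3600 / s }'
}

ligands_per_hour 5   # slow end of the range -> 720
ligands_per_hour 2   # fast end of the range -> 1800
```

The same conversion applies to the runtime figures of the other GPU algorithms.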

CUDA Monte Carlo

Algorithm ID: cuda_monte_carlo

Speedup: 100-200x over CPU version

GPU Memory: 0.5-2 GB

Best for: Ultra-fast virtual screening

Massively parallel Monte Carlo with 10,000+ independent walkers:

Each GPU thread runs an independent MC walker
Shared memory for receptor grids
Warp-level optimizations
Minimal divergence

Usage:

pandadock dock -r protein.pdb -l ligand.sdf \
               --algorithm cuda_monte_carlo \
               --gpu \
               --center 10 20 30 --box 20 20 20

Performance:

RMSD: ~0.5-1.5 Å
Runtime: 0.5-2 seconds per ligand
Throughput: 1800-7200 ligands/hour

CUDA Genetic Algorithm

Algorithm ID: cuda_genetic_algorithm

Speedup: 80-150x over CPU version

GPU Memory: 1-3 GB

Best for: GPU-accelerated complex site docking

GPU-parallelized evolutionary algorithm:

Population stored in GPU memory
Parallel fitness evaluation
GPU-accelerated crossover/mutation
Efficient device-side selection

Usage:

pandadock dock -r protein.pdb -l ligand.sdf \
               --algorithm cuda_genetic_algorithm \
               --gpu \
               --center 10 20 30 --box 20 20 20

Performance:

RMSD: ~0.3-0.8 Å
Runtime: 1-3 seconds per ligand
Throughput: 1200-3600 ligands/hour

GPU Performance Optimization

Batch Size Tuning

Optimize GPU batch size for your hardware:

# For 8GB GPU (larger batches)
pandadock dock --algorithm enhanced_hierarchical_gpu \
               --gpu-batch-size 2000 \
               --gpu-memory-limit 6.0

# For 4GB GPU (smaller batches)
pandadock dock --algorithm enhanced_hierarchical_gpu \
               --gpu-batch-size 1000 \
               --gpu-memory-limit 3.0
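
The two examples above use a memory limit of 75% of the card's memory and about 250 poses per GB. As a rough starting point for other cards, the sketch below extrapolates linearly from those two data points. This is an assumption, not a documented PandaDock rule, and suggest_gpu_settings is a hypothetical helper name.

```shell
# Hypothetical helper: linear extrapolation from the two examples above
# (8 GB -> 2000 poses / 6.0 GB limit, 4 GB -> 1000 poses / 3.0 GB limit).
# Treat the result as a first guess to refine by measurement.
suggest_gpu_settings() {
    # $1 = total GPU memory in GB
    awk -v gb="$1" 'BEGIN {
        printf "--gpu-batch-size %d --gpu-memory-limit %.1f\n", gb * 250, gb * 0.75
    }'
}

suggest_gpu_settings 8   # -> --gpu-batch-size 2000 --gpu-memory-limit 6.0
suggest_gpu_settings 4   # -> --gpu-batch-size 1000 --gpu-memory-limit 3.0
```

The best values still depend on ligand size and grid dimensions, so it is worth watching memory use on a short test run before committing to a full screen.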

Multi-GPU Support

Run on a specific GPU device:

# Use GPU 0
pandadock dock --algorithm enhanced_hierarchical_gpu \
               --gpu --gpuid 0

# Use GPU 1
pandadock dock --algorithm enhanced_hierarchical_gpu \
               --gpu --gpuid 1

For parallel screening across multiple GPUs, launch multiple processes:

# Terminal 1 (GPU 0)
CUDA_VISIBLE_DEVICES=0 pandadock dock -l library_part1.sdf --gpu

# Terminal 2 (GPU 1)
CUDA_VISIBLE_DEVICES=1 pandadock dock -l library_part2.sdf --gpu
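
The per-GPU library parts used above have to be created first. A minimal sketch of one way to do that, assuming the library is a multi-record SDF file in which every record ends with a `$$$$` line (the standard SDF delimiter); split_sdf is a hypothetical helper name and not a PandaDock command:

```shell
# Hypothetical helper: distribute SDF records round-robin over N files
# named <prefix>1.sdf ... <prefix>N.sdf.
split_sdf() {
    # $1 = input SDF, $2 = number of parts, $3 = output file prefix
    awk -v n="$2" -v prefix="$3" '
        BEGIN { part = 0 }
        { print > (prefix (part + 1) ".sdf") }      # write line to current part
        /^\$\$\$\$/ { part = (part + 1) % n }       # record ended: move to next part
    ' "$1"
}

# Usage (file names are placeholders; remaining dock options omitted):
#   split_sdf library.sdf 2 library_part
#   CUDA_VISIBLE_DEVICES=0 pandadock dock -l library_part1.sdf --gpu &
#   CUDA_VISIBLE_DEVICES=1 pandadock dock -l library_part2.sdf --gpu &
#   wait
```

Round-robin assignment keeps the parts close to equal in size, so both GPUs should finish at roughly the same time if ligands are of similar complexity.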

Memory Management

Monitor GPU memory usage:

# Watch GPU utilization
watch -n 1 nvidia-smi

If out-of-memory errors occur:

Reduce batch size: --gpu-batch-size 500
Lower memory limit: --gpu-memory-limit 2.0
Reduce number of poses: --num-poses 10
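
The first remedy can be automated by retrying with progressively smaller batches. The sketch below assumes that pandadock exits with a non-zero status when it runs out of memory, which this page does not confirm; run_with_backoff is a hypothetical wrapper, and it retries on any failure, not only on out-of-memory errors.

```shell
# Hypothetical wrapper: rerun a command, halving --gpu-batch-size after
# each failure until it succeeds or the batch size drops below a minimum.
run_with_backoff() {
    # $1 = starting batch size, $2 = minimum batch size, rest = command
    bs=$1
    bs_min=$2
    shift 2
    while [ "$bs" -ge "$bs_min" ]; do
        if "$@" --gpu-batch-size "$bs"; then
            echo "succeeded with batch size $bs"
            return 0
        fi
        bs=$((bs / 2))
    done
    echo "gave up: batch size fell below $bs_min" >&2
    return 1
}

# Usage (assumes a non-zero exit status on out-of-memory):
#   run_with_backoff 2000 250 pandadock dock -r protein.pdb -l ligand.sdf \
#       --algorithm enhanced_hierarchical_gpu --gpu \
#       --center 10 20 30 --box 20 20 20
```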

Performance Comparison

Throughput Comparison (ligands/hour):

Algorithm	CPU	GPU
Enhanced Hierarchical	14-24	720-1800
Monte Carlo	60-120	1800-7200
Genetic Algorithm	18-30	1200-3600

Accuracy Comparison:

GPU algorithms maintain the same accuracy as CPU versions:

Enhanced Hierarchical GPU: RMSD ~0.08 Å (same as CPU)
CUDA Monte Carlo: RMSD ~0.5-1.5 Å (same as CPU)
CUDA Genetic Algorithm: RMSD ~0.3-0.8 Å (same as CPU)

Best Practices

Start with a small test: run 10 ligands first to verify the GPU setup
Monitor GPU memory: use nvidia-smi to check utilization
Tune the batch size: find the optimal batch size for your GPU
Use fast GPUs: modern GPUs (RTX 30/40 series, A100, etc.) provide the best performance
Minimize data transfer: keep the receptor on the GPU across multiple ligands
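
For the first practice, a small test set can be cut from the full library. A minimal sketch, assuming a multi-record SDF file in which every record ends with a `$$$$` line; first_n_ligands is a hypothetical helper name and not a PandaDock command:

```shell
# Hypothetical helper: print the first N records of a multi-record SDF.
first_n_ligands() {
    # $1 = input SDF, $2 = number of records to keep
    awk -v n="$2" '
        { print }
        /^\$\$\$\$/ { if (++seen >= n) exit }   # stop after the Nth delimiter
    ' "$1"
}

# Usage (file names are placeholders):
#   first_n_ligands library_10k.sdf 10 > test_10.sdf
```

Docking the resulting file first gives a quick check of the GPU setup before the full library is queued.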

Virtual Screening Example

High-throughput screening of a large library:

# Screen a 10,000-compound library
pandadock dock -r target.pdb -l library_10k.sdf \
               --algorithm cuda_monte_carlo \
               --gpu \
               --num-poses 5 \
               --fast \
               --center 15 20 25 --box 20 20 20 \
               -o screening_results/

Expected runtime: 1-2 hours on a modern GPU (vs. 30-60 hours on CPU)

Troubleshooting

CUDA not available:

Error: GPU acceleration requested but CUDA not available

Solution: Install CuPy matching your CUDA version

Out of memory:

RuntimeError: out of memory

Solutions:

1. Reduce batch size: --gpu-batch-size 500
2. Lower memory limit: --gpu-memory-limit 2.0
3. Use a smaller grid box
4. Reduce the number of poses

Slow performance:

Possible causes:

1. GPU throttling due to temperature
2. Batch size too small to keep the GPU busy
3. PCIe bottleneck (old PCIe version)
4. Competing processes on the GPU
