GPU Algorithms
PandaDock provides three GPU-accelerated docking algorithms that deliver a 50-200x speedup over their CPU equivalents while maintaining comparable accuracy.
Prerequisites
Hardware Requirements:
NVIDIA GPU with CUDA Compute Capability 6.0+ (Pascal architecture or newer)
Minimum 4GB GPU memory (8GB+ recommended for large libraries)
Software Requirements:
CUDA Toolkit 11.0 or higher
CuPy build matching the installed CUDA version:
# Install CuPy for CUDA 11.x
pip install cupy-cuda11x
# Or for CUDA 12.x
pip install cupy-cuda12x
Verify GPU availability:
pandadock list-algorithms
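Independently of the PandaDock CLI, CuPy itself can report whether a CUDA device is visible. The following is a minimal sketch under that assumption; `visible_gpu_count` is a hypothetical helper name, not part of PandaDock, and it returns `None` when CuPy is missing or no usable CUDA runtime is found.

```python
def visible_gpu_count():
    """Return the number of CUDA devices CuPy can see, or None if unavailable."""
    try:
        import cupy  # installed via cupy-cuda11x or cupy-cuda12x
        return cupy.cuda.runtime.getDeviceCount()
    except Exception:
        # Covers a missing CuPy install as well as driver/runtime errors.
        return None


if __name__ == "__main__":
    count = visible_gpu_count()
    if count:
        print(f"CuPy sees {count} CUDA device(s)")
    else:
        print("No CUDA device visible to CuPy")
```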
Available GPU Algorithms
Enhanced Hierarchical GPU
Algorithm ID: enhanced_hierarchical_gpu
Speedup: 50-100x over CPU version
GPU Memory: 1-4 GB
Best for: High-throughput, high-accuracy docking
Implements the same 3-stage hierarchical search as the CPU version, but with massive parallelization:
Parallel pose generation (thousands of poses simultaneously)
Batch scoring on GPU
Asynchronous CPU-GPU communication
Optimized memory management
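The batch scoring idea in the list above can be illustrated on the CPU. This is a sketch, not PandaDock's implementation: the `Pose` encoding, `iter_batches`, `best_pose` and `toy_score_batch` are hypothetical names, the scoring function is a toy, and "lower score is better" is an assumption. Each call to `score_batch` stands in for one GPU launch over a batch of poses.

```python
from typing import Iterator, Sequence, Tuple

Pose = Tuple[float, ...]  # hypothetical pose encoding, for illustration only


def iter_batches(poses: Sequence[Pose], batch_size: int) -> Iterator[Sequence[Pose]]:
    """Yield consecutive slices of at most batch_size poses."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(poses), batch_size):
        yield poses[start:start + batch_size]


def best_pose(poses, score_batch, batch_size=1000):
    """Score poses batch by batch and return (lowest_score, pose)."""
    best = None
    for batch in iter_batches(poses, batch_size):
        scores = score_batch(batch)  # one call per batch stands in for one GPU launch
        for score, pose in zip(scores, batch):
            if best is None or score < best[0]:
                best = (score, pose)
    return best


def toy_score_batch(batch):
    """Toy scoring: squared distance from the origin (assumed: lower is better)."""
    return [sum(x * x for x in pose) for pose in batch]


poses = [(float(i), float(i % 7)) for i in range(-1200, 1300)]
score, pose = best_pose(poses, toy_score_batch, batch_size=1000)
```

With 2,500 toy poses and a batch size of 1000, the scorer is called three times (1000, 1000 and 500 poses), which is the role `--gpu-batch-size` plays conceptually.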
Usage:
pandadock dock -r protein.pdb -l ligand.sdf \
--algorithm enhanced_hierarchical_gpu \
--gpu \
--center 10 20 30 --box 20 20 20
GPU Parameters:
--gpu-batch-size: Poses per GPU batch (default: 1000)
--gpu-memory-limit: GPU memory limit in GB (default: 4.0)
--gpuid: GPU device ID (default: 0)
Performance:
RMSD: ~0.08 Å (same as CPU)
Runtime: 2-5 seconds per ligand
Throughput: 720-1800 ligands/hour
CUDA Monte Carlo
Algorithm ID: cuda_monte_carlo
Speedup: 100-200x over CPU version
GPU Memory: 0.5-2 GB
Best for: Ultra-fast virtual screening
Massively parallel Monte Carlo with 10,000+ independent walkers:
Each GPU thread runs an independent MC walker
Shared memory for receptor grids
Warp-level optimizations
Minimal divergence
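The independent-walker scheme described above can be sketched in plain Python. This is an illustration under stated assumptions, not the CUDA kernel: the 1-D `energy` function is a toy stand-in for a docking score, the walkers run one after another instead of one per GPU thread, and all names and parameter values are hypothetical.

```python
import math
import random


def energy(x: float) -> float:
    """Toy 1-D energy surface with its minimum at x = 2."""
    return (x - 2.0) ** 2


def run_walker(seed: int, steps: int = 500, step_size: float = 0.5,
               temperature: float = 0.1):
    """Run one Metropolis walker and return (best_energy, best_x)."""
    rng = random.Random(seed)            # per-walker RNG keeps walkers independent
    x = rng.uniform(-10.0, 10.0)         # random start inside the toy search range
    e = energy(x)
    best = (e, x)
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = candidate, e_new
            if e < best[0]:
                best = (e, x)
    return best


def run_walkers(n_walkers: int = 64):
    """Run walkers sequentially (a GPU would run them concurrently)."""
    results = [run_walker(seed) for seed in range(n_walkers)]
    return min(results)                  # final reduction: lowest energy wins


best_energy, best_x = run_walkers()
```

Because the walkers share no state, the only cross-walker step is the final reduction, which is what makes this pattern map well onto one walker per thread.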
Usage:
pandadock dock -r protein.pdb -l ligand.sdf \
--algorithm cuda_monte_carlo \
--gpu \
--center 10 20 30 --box 20 20 20
Performance:
RMSD: ~0.5-1.5 Å
Runtime: 0.5-2 seconds per ligand
Throughput: 1800-7200 ligands/hour
CUDA Genetic Algorithm
Algorithm ID: cuda_genetic_algorithm
Speedup: 80-150x over CPU version
GPU Memory: 1-3 GB
Best for: GPU-accelerated complex site docking
GPU-parallelized evolutionary algorithm:
Population stored in GPU memory
Parallel fitness evaluation
GPU-accelerated crossover/mutation
Efficient device-side selection
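The steps listed above (fitness evaluation, selection, crossover, mutation) can be sketched as one generation of a simple evolutionary loop. This is a CPU illustration with a toy fitness function; tournament selection, uniform crossover, Gaussian mutation and elitism are choices made for the sketch, not confirmed details of `cuda_genetic_algorithm`, and all names are hypothetical.

```python
import random


def fitness(genome):
    """Toy fitness: squared distance to a fixed target (assumed: lower is better)."""
    target = (1.0, -2.0, 0.5)
    return sum((g - t) ** 2 for g, t in zip(genome, target))


def tournament(population, scores, rng, k=3):
    """Pick the best of k randomly chosen individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: scores[i])]


def next_generation(population, rng, mutation_sigma=0.1):
    """One generation: evaluate, keep the elite, then select, cross over and mutate."""
    scores = [fitness(g) for g in population]   # the step a GPU would parallelize
    elite = population[min(range(len(population)), key=lambda i: scores[i])]
    children = [elite]                          # elitism: the best genome survives
    while len(children) < len(population):
        a = tournament(population, scores, rng)
        b = tournament(population, scores, rng)
        # Uniform crossover followed by Gaussian mutation, gene by gene.
        child = tuple(
            (x if rng.random() < 0.5 else y) + rng.gauss(0.0, mutation_sigma)
            for x, y in zip(a, b)
        )
        children.append(child)
    return children


rng = random.Random(0)
population = [tuple(rng.uniform(-5, 5) for _ in range(3)) for _ in range(50)]
history = []
for _ in range(40):
    history.append(min(fitness(g) for g in population))
    population = next_generation(population, rng)
history.append(min(fitness(g) for g in population))
```

Carrying the elite genome forward unchanged guarantees that the best fitness in `history` never gets worse from one generation to the next.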
Usage:
pandadock dock -r protein.pdb -l ligand.sdf \
--algorithm cuda_genetic_algorithm \
--gpu \
--center 10 20 30 --box 20 20 20
Performance:
RMSD: ~0.3-0.8 Å
Runtime: 1-3 seconds per ligand
Throughput: 1200-3600 ligands/hour
GPU Performance Optimization
Batch Size Tuning
Optimize GPU batch size for your hardware:
# For 8GB GPU (larger batches)
pandadock dock --algorithm enhanced_hierarchical_gpu \
--gpu-batch-size 2000 \
--gpu-memory-limit 6.0
# For 4GB GPU (smaller batches)
pandadock dock --algorithm enhanced_hierarchical_gpu \
--gpu-batch-size 1000 \
--gpu-memory-limit 3.0
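One way to find a good batch size is to time a short run at several candidate values and keep the one with the highest rate. The sketch below assumes the timings are supplied by the caller; `pick_batch_size` is a hypothetical helper, and the numbers in `fake_timings` are made up purely to exercise the function.

```python
def pick_batch_size(candidates, seconds_per_batch):
    """Return (batch_size, rate) for the candidate with the highest poses per second.

    seconds_per_batch(batch_size) is a caller-supplied measurement, for example
    the wall-clock time of one short docking run at that batch size.
    """
    best_size, best_rate = None, 0.0
    for size in candidates:
        elapsed = seconds_per_batch(size)
        rate = size / elapsed
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size, best_rate


# Made-up timings for illustration only; real values must be measured.
fake_timings = {500: 0.30, 1000: 0.50, 2000: 0.90, 4000: 2.40}
size, rate = pick_batch_size(sorted(fake_timings), fake_timings.__getitem__)
```

In this invented data set the rate peaks at a batch size of 2000 and drops again at 4000, the kind of curve one would look for when sweeping `--gpu-batch-size` on real hardware.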
Multi-GPU Support
Run on a specific GPU device:
# Use GPU 0
pandadock dock --algorithm enhanced_hierarchical_gpu \
--gpu --gpuid 0
# Use GPU 1
pandadock dock --algorithm enhanced_hierarchical_gpu \
--gpu --gpuid 1
For parallel screening across multiple GPUs, launch multiple processes:
# Terminal 1 (GPU 0)
CUDA_VISIBLE_DEVICES=0 pandadock dock -l library_part1.sdf --gpu
# Terminal 2 (GPU 1)
CUDA_VISIBLE_DEVICES=1 pandadock dock -l library_part2.sdf --gpu
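The part files used above have to be produced somehow. The sketch below splits SDF text on the standard `$$$$` record terminator and pairs each part with a `CUDA_VISIBLE_DEVICES` value; `split_sdf` and `gpu_jobs` are hypothetical helpers, not PandaDock functions, and the argument list mirrors the abbreviated commands shown above (a real run would also need the receptor and box options). Note that a process restricted to a single GPU through `CUDA_VISIBLE_DEVICES` sees that GPU as device 0.

```python
def split_sdf(text: str, n_parts: int):
    """Split SDF text into n_parts strings, distributing records round-robin."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "$$$$":            # SDF record terminator
            records.append("".join(current))
            current = []
    if any(item.strip() for item in current):  # tolerate a final unterminated record
        records.append("".join(current))
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return ["".join(part) for part in parts]


def gpu_jobs(part_paths):
    """Pair each part file with an environment and an abbreviated command line."""
    jobs = []
    for gpu_id, path in enumerate(part_paths):
        env = {"CUDA_VISIBLE_DEVICES": str(gpu_id)}
        argv = ["pandadock", "dock", "-l", path, "--gpu"]
        jobs.append((env, argv))
    return jobs


demo = "mol_a\n...\n$$$$\nmol_b\n...\n$$$$\nmol_c\n...\n$$$$\n"
part1, part2 = split_sdf(demo, 2)
jobs = gpu_jobs(["library_part1.sdf", "library_part2.sdf"])
```

Each `(env, argv)` pair could then be started with `subprocess.Popen(argv, env={**os.environ, **env})` so that the processes run side by side.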
Memory Management
Monitor GPU memory usage:
# Watch GPU utilization
watch -n 1 nvidia-smi
If out-of-memory errors occur:
Reduce batch size:
--gpu-batch-size 500
Lower memory limit:
--gpu-memory-limit 2.0
Reduce number of poses:
--num-poses 10
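When screening unattended, the "reduce batch size" step can be automated by retrying with a smaller batch after each failure. The following is a sketch under assumptions: `dock_with_backoff` and `fake_run` are hypothetical, and the docking call is modeled as a Python callable that raises `MemoryError`; how an out-of-memory condition surfaces from the real CLI would need to be mapped onto this.

```python
def dock_with_backoff(run, batch_size=1000, min_batch_size=125):
    """Call run(batch_size), halving the batch size after each out-of-memory error.

    run is a caller-supplied callable (hypothetical) that raises MemoryError when
    the GPU runs out of memory. Returns (result, batch_size_that_worked).
    """
    while True:
        try:
            return run(batch_size), batch_size
        except MemoryError:
            if batch_size // 2 < min_batch_size:
                raise                 # give up: halving again would go below the minimum
            batch_size //= 2


def fake_run(batch_size):
    """Stand-in for a docking call: pretend anything above 300 poses does not fit."""
    if batch_size > 300:
        raise MemoryError("out of memory")
    return f"docked with batch size {batch_size}"


result, used = dock_with_backoff(fake_run)
```

Starting from the default of 1000, the stand-in fails at 1000 and 500 and succeeds at 250.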
Performance Comparison
Throughput Comparison (ligands/hour):
| Algorithm | CPU | GPU |
|---|---|---|
| Enhanced Hierarchical | 14-24 | 720-1800 |
| Monte Carlo | 60-120 | 1800-7200 |
| Genetic Algorithm | 18-30 | 1200-3600 |
Accuracy Comparison:
GPU algorithms maintain the same accuracy as CPU versions:
Enhanced Hierarchical GPU: RMSD ~0.08 Å (same as CPU)
CUDA Monte Carlo: RMSD ~0.5-1.5 Å (same as CPU)
CUDA Genetic Algorithm: RMSD ~0.3-0.8 Å (same as CPU)
Best Practices
Start with a small test: Run 10 ligands first to verify the GPU setup
Monitor GPU memory: Use nvidia-smi to check utilization
Tune batch size: Find the optimal batch size for your GPU
Use fast GPUs: Modern GPUs (RTX 30/40 series, A100, etc.) provide the best performance
Minimize data transfer: Keep the receptor on the GPU across multiple ligands
Virtual Screening Example
High-throughput screening of a large library:
# Screen 10,000 compound library
pandadock dock -r target.pdb -l library_10k.sdf \
--algorithm cuda_monte_carlo \
--gpu \
--num-poses 5 \
--fast \
--center 15 20 25 --box 20 20 20 \
-o screening_results/
Expected runtime: 1-2 hours on a modern GPU (vs 30-60 hours on CPU)
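A runtime estimate for any library size follows directly from a throughput range. The sketch below uses the 1800-7200 ligands/hour range listed for `cuda_monte_carlo` on this page; `screening_hours` is a hypothetical helper name.

```python
def screening_hours(n_ligands, ligands_per_hour_range):
    """Return (fastest, slowest) wall-clock hours for a throughput range."""
    low, high = ligands_per_hour_range
    return n_ligands / high, n_ligands / low


# Throughput range listed for cuda_monte_carlo earlier on this page.
fastest, slowest = screening_hours(10_000, (1800, 7200))
print(f"{fastest:.1f} to {slowest:.1f} hours")  # prints "1.4 to 5.6 hours"
```

The 1-2 hour figure quoted above therefore corresponds to the fast end of the listed throughput range; the `--fast` and `--num-poses 5` options in the command may contribute to that, though this page does not quantify their effect.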
Troubleshooting
CUDA not available:
Error: GPU acceleration requested but CUDA not available
Solution: Install CuPy matching your CUDA version
Out of memory:
RuntimeError: out of memory
Solutions:
1. Reduce batch size: --gpu-batch-size 500
2. Lower memory limit: --gpu-memory-limit 2.0
3. Use smaller grid box
4. Reduce number of poses
Slow performance:
Possible causes:
1. GPU throttling due to temperature
2. Insufficient batch size (too small)
3. PCIe bottleneck (old PCIe version)
4. Competing processes on GPU
See Also
CPU Algorithms - CPU algorithm documentation
Performance optimization guide
GPU tutorial