GPU Algorithms
==============

PandaDock provides three GPU-accelerated docking algorithms that deliver a 50-200x speedup over their CPU equivalents while maintaining comparable accuracy.

Prerequisites
-------------

**Hardware Requirements:**

* NVIDIA GPU with CUDA Compute Capability 6.0+ (Pascal architecture or newer)
* Minimum 4 GB GPU memory (8 GB+ recommended for large libraries)

**Software Requirements:**

* CUDA Toolkit 11.0 or higher
* CuPy build matching your CUDA version:

.. code-block:: bash

    # Install CuPy for CUDA 11.x
    pip install cupy-cuda11x

    # Or for CUDA 12.x
    pip install cupy-cuda12x

Verify GPU availability:

.. code-block:: bash

    pandadock list-algorithms

Available GPU Algorithms
------------------------

Enhanced Hierarchical GPU
^^^^^^^^^^^^^^^^^^^^^^^^^

**Algorithm ID:** ``enhanced_hierarchical_gpu``

**Speedup:** 50-100x over CPU version

**GPU Memory:** 1-4 GB

**Best for:** High-throughput, high-accuracy docking

Implements the same 3-stage hierarchical search as the CPU version, but with massive parallelization:

* Parallel pose generation (1000s simultaneously)
* Batch scoring on GPU
* Asynchronous CPU-GPU communication
* Optimized memory management

**Usage:**

.. code-block:: bash

    pandadock dock -r protein.pdb -l ligand.sdf \
        --algorithm enhanced_hierarchical_gpu \
        --gpu \
        --center 10 20 30 --box 20 20 20

**GPU Parameters:**

* ``--gpu-batch-size``: Poses per GPU batch (default: 1000)
* ``--gpu-memory-limit``: GPU memory limit in GB (default: 4.0)
* ``--gpuid``: GPU device ID (default: 0)

**Performance:**

* RMSD: ~0.08 Å (same as CPU)
* Runtime: 2-5 seconds per ligand
* Throughput: 720-1800 ligands/hour

CUDA Monte Carlo
^^^^^^^^^^^^^^^^

**Algorithm ID:** ``cuda_monte_carlo``

**Speedup:** 100-200x over CPU version

**GPU Memory:** 0.5-2 GB

**Best for:** Ultra-fast virtual screening

Massively parallel Monte Carlo with 10,000+ independent walkers:

* Each GPU thread runs an independent MC walker
* Shared memory for receptor grids
* Warp-level optimizations
* Minimal divergence

**Usage:**

.. code-block:: bash

    pandadock dock -r protein.pdb -l ligand.sdf \
        --algorithm cuda_monte_carlo \
        --gpu \
        --center 10 20 30 --box 20 20 20

**Performance:**

* RMSD: ~0.5-1.5 Å
* Runtime: 0.5-2 seconds per ligand
* Throughput: 1800-7200 ligands/hour

CUDA Genetic Algorithm
^^^^^^^^^^^^^^^^^^^^^^

**Algorithm ID:** ``cuda_genetic_algorithm``

**Speedup:** 80-150x over CPU version

**GPU Memory:** 1-3 GB

**Best for:** GPU-accelerated docking of complex binding sites

GPU-parallelized evolutionary algorithm:

* Population stored in GPU memory
* Parallel fitness evaluation
* GPU-accelerated crossover/mutation
* Efficient device-side selection

**Usage:**

.. code-block:: bash

    pandadock dock -r protein.pdb -l ligand.sdf \
        --algorithm cuda_genetic_algorithm \
        --gpu \
        --center 10 20 30 --box 20 20 20

**Performance:**

* RMSD: ~0.3-0.8 Å
* Runtime: 1-3 seconds per ligand
* Throughput: 1200-3600 ligands/hour

GPU Performance Optimization
-----------------------------

Batch Size Tuning
^^^^^^^^^^^^^^^^^

Tune the GPU batch size and memory limit to your hardware:

.. code-block:: bash

    # For 8GB GPU (larger batches)
    pandadock dock --algorithm enhanced_hierarchical_gpu \
        --gpu-batch-size 2000 \
        --gpu-memory-limit 6.0

    # For 4GB GPU (smaller batches)
    pandadock dock --algorithm enhanced_hierarchical_gpu \
        --gpu-batch-size 1000 \
        --gpu-memory-limit 3.0

Multi-GPU Support
^^^^^^^^^^^^^^^^^

Run on a specific GPU device:

.. code-block:: bash

    # Use GPU 0
    pandadock dock --algorithm enhanced_hierarchical_gpu \
        --gpu --gpuid 0

    # Use GPU 1
    pandadock dock --algorithm enhanced_hierarchical_gpu \
        --gpu --gpuid 1

For parallel screening across multiple GPUs, launch one process per GPU:

.. code-block:: bash

    # Terminal 1 (GPU 0)
    CUDA_VISIBLE_DEVICES=0 pandadock dock -l library_part1.sdf --gpu

    # Terminal 2 (GPU 1)
    CUDA_VISIBLE_DEVICES=1 pandadock dock -l library_part2.sdf --gpu

Memory Management
^^^^^^^^^^^^^^^^^

Monitor GPU memory usage:

.. code-block:: bash

    # Watch GPU utilization
    watch -n 1 nvidia-smi

If out-of-memory errors occur:

1. Reduce batch size: ``--gpu-batch-size 500``
2. Lower memory limit: ``--gpu-memory-limit 2.0``
3. Reduce number of poses: ``--num-poses 10``

Performance Comparison
----------------------

**Throughput Comparison (ligands/hour):**

+------------------------------+-------------+---------------+
| Algorithm                    | CPU         | GPU           |
+==============================+=============+===============+
| Enhanced Hierarchical        | 14-24       | 720-1800      |
+------------------------------+-------------+---------------+
| Monte Carlo                  | 60-120      | 1800-7200     |
+------------------------------+-------------+---------------+
| Genetic Algorithm            | 18-30       | 1200-3600     |
+------------------------------+-------------+---------------+

**Accuracy Comparison:**

GPU algorithms maintain the same accuracy as CPU versions:

* Enhanced Hierarchical GPU: RMSD ~0.08 Å (same as CPU)
* CUDA Monte Carlo: RMSD ~0.5-1.5 Å (same as CPU)
* CUDA Genetic Algorithm: RMSD ~0.3-0.8 Å (same as CPU)

Best Practices
--------------

1. **Start with a small test**: Run 10 ligands first to verify the GPU setup
2. **Monitor GPU memory**: Use ``nvidia-smi`` to check utilization
3. **Tune batch size**: Find the optimal batch size for your GPU
4. **Use fast GPUs**: Modern GPUs (RTX 30/40 series, A100, etc.) provide the best performance
5. **Minimize data transfer**: Keep the receptor on the GPU across multiple ligands

Virtual Screening Example
--------------------------

High-throughput screening of a large library:

.. code-block:: bash

    # Screen 10,000 compound library
    pandadock dock -r target.pdb -l library_10k.sdf \
        --algorithm cuda_monte_carlo \
        --gpu \
        --num-poses 5 \
        --fast \
        --center 15 20 25 --box 20 20 20 \
        -o screening_results/

Expected runtime: 1-2 hours on a modern GPU (vs. 30-60 hours on CPU)

Troubleshooting
---------------

**CUDA not available:**

.. code-block:: text

    Error: GPU acceleration requested but CUDA not available

Solution: Install the CuPy package matching your CUDA version.

**Out of memory:**

.. code-block:: text

    RuntimeError: out of memory

Solutions:

1. Reduce batch size: ``--gpu-batch-size 500``
2. Lower memory limit: ``--gpu-memory-limit 2.0``
3. Use a smaller grid box
4. Reduce number of poses

**Slow performance:**

Possible causes:

1. GPU throttling due to temperature
2. Batch size too small to keep the GPU busy
3. PCIe bottleneck (old PCIe version)
4. Competing processes on GPU

See Also
--------

* :doc:`cpu_algorithms` - CPU algorithm documentation
* :doc:`../guide/performance` - Performance optimization guide
* :doc:`../tutorials/gpu_acceleration` - GPU tutorial
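
Example: Splitting a Library Across GPUs
----------------------------------------

The multi-process workflow under Multi-GPU Support assumes the library has already been split into files such as ``library_part1.sdf`` and ``library_part2.sdf``. The following is a minimal sketch of one way to produce those parts, assuming a standard SDF file in which every record ends with a ``$$$$`` line. The helper names ``split_sdf`` and ``write_parts`` are hypothetical and are not part of PandaDock; records are dealt out round-robin so that each part holds a similar number of ligands.

.. code-block:: python

    from pathlib import Path
    from typing import List


    def split_sdf(text: str, n_parts: int) -> List[str]:
        """Split SDF text into n_parts chunks, dealing records round-robin.

        Hypothetical helper (not a PandaDock API). Assumes each SDF record
        is terminated by a line containing only ``$$$$``.
        """
        if n_parts < 1:
            raise ValueError("n_parts must be at least 1")

        records = []
        current = []
        for line in text.splitlines():
            current.append(line)
            if line.strip() == "$$$$":  # SDF record terminator
                records.append("\n".join(current) + "\n")
                current = []
        # Keep a trailing record that lacks a terminator, if it has content.
        if any(line.strip() for line in current):
            records.append("\n".join(current) + "\n")

        parts = [[] for _ in range(n_parts)]
        for index, record in enumerate(records):
            parts[index % n_parts].append(record)
        return ["".join(part) for part in parts]


    def write_parts(src_path: str, n_parts: int) -> List[Path]:
        """Write <stem>_part1.sdf ... <stem>_partN.sdf next to the source file."""
        src = Path(src_path)
        out_paths = []
        for i, part in enumerate(split_sdf(src.read_text(), n_parts), start=1):
            out = src.with_name(f"{src.stem}_part{i}{src.suffix}")
            out.write_text(part)
            out_paths.append(out)
        return out_paths

For two GPUs, ``write_parts("library.sdf", 2)`` would write ``library_part1.sdf`` and ``library_part2.sdf``, matching the file names used in the multi-process example. Round-robin assignment is only one choice; if ligand sizes vary widely, balancing the parts by atom count may give more even runtimes.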