pandadock-ml - ML-Enhanced Docking Command
===========================================

The ``pandadock-ml`` command performs machine learning-enhanced molecular docking, using deep learning models for scoring and pose prediction. It supports graph neural network (GNN), 3D convolutional, hybrid, and experimental transformer models.

Synopsis
--------

.. code-block:: bash

   pandadock-ml [OPTIONS]

Description
-----------

Performs molecular docking with ML-enhanced scoring:

* **Deep learning scoring function** - Graph Neural Network (GNN) or 3D CNN
* **Pose ranking refinement** - ML-based re-ranking of docked poses
* **Transfer learning** - Pre-trained on PDBBind dataset
* **Uncertainty quantification** - Confidence estimates for predictions
* **Ensemble models** - Multiple models for robust predictions

Best reported accuracy: correlation coefficient R = 0.91 against experimental binding affinities (hybrid model with ensemble).

Required Options
----------------

``-r, --receptor PATH``
    Receptor PDB file (protein structure)

``-l, --ligand PATH``
    Ligand file (SDF, MOL2, or PDB format)

``--center X Y Z``
    Grid box center coordinates (X Y Z in Angstroms)

``--box X Y Z``
    Grid box dimensions (X Y Z in Angstroms)
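
When the binding site is known from a reference ligand, one common way to pick values for ``--center`` and ``--box`` is to pad the ligand's bounding box. The sketch below is an illustration only and is not part of PandaDock: the helper name ``box_from_coords`` and the 5 Angstrom padding are assumptions.

.. code-block:: python

   def box_from_coords(coords, padding=5.0):
       """Return (center, size) of an axis-aligned box around coords.

       coords  -- iterable of (x, y, z) atom positions in Angstroms
       padding -- margin added on every side (assumed value, tune per target)
       """
       xs, ys, zs = zip(*coords)
       mins = (min(xs), min(ys), min(zs))
       maxs = (max(xs), max(ys), max(zs))
       center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
       size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
       return center, size


   # Hypothetical reference-ligand atom positions.
   ligand_atoms = [(8.0, 18.0, 28.0), (12.0, 22.0, 32.0), (10.0, 20.0, 30.0)]
   center, size = box_from_coords(ligand_atoms)
   print("--center", *center)
   print("--box", *size)

The printed values can be pasted after ``--center`` and ``--box`` on the command line.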

ML Model Options
----------------

``--model-type TYPE``
    ML model architecture. Default: ``gnn``

    Options:

    * ``gnn`` - Graph Neural Network (recommended, fastest)
    * ``cnn3d`` - 3D Convolutional Network (higher accuracy, slower)
    * ``hybrid`` - Combined GNN + CNN (best accuracy)
    * ``transformer`` - Transformer-based model (experimental)

``--ml-scoring-mode MODE``
    How to use ML scoring. Default: ``combined``

    Options:

    * ``combined`` - Combine physics-based + ML scoring
    * ``ml_only`` - Use only ML scoring
    * ``refinement`` - Use ML for pose re-ranking only

``--use-ensemble / --no-ensemble``
    Use ensemble of ML models for robust predictions. Default: enabled

    Ensemble averages predictions from 5 models trained on different data splits.

``--model-weights PATH``
    Path to custom model weights (optional)

    Use pre-trained weights or your own fine-tuned model.

ML Feature Options
------------------

``--include-protein-features / --no-protein-features``
    Include protein pocket features. Default: enabled

    Protein features: pocket shape, hydrophobicity, electrostatics

``--include-interaction-features / --no-interaction-features``
    Include protein-ligand interaction features. Default: enabled

    Interaction features: H-bonds, π-stacking, hydrophobic contacts

``--include-pharmacophore / --no-pharmacophore``
    Include pharmacophore features. Default: enabled

``--grid-resolution FLOAT``
    Grid resolution for 3D CNN (Angstroms). Default: 0.5

    Only used with ``--model-type cnn3d``

Docking Algorithm
-----------------

``-a, --algorithm ALGORITHM``
    Docking algorithm for pose generation. Default: ``enhanced_hierarchical_cpu``

    ML scoring can be combined with any docking algorithm.

Scoring Options
---------------

``-s, --scoring FUNCTION``
    Physics-based scoring for initial docking. Default: ``physics_based``

``--ml-weight FLOAT``
    Weight for ML score in combined mode. Default: 0.6

    Final score = (1 - weight) × physics + weight × ML, where ``weight`` is the value of ``--ml-weight``

``--physics-weight FLOAT``
    Weight for physics score in combined mode. Default: 0.4
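
The combined-mode formula given for ``--ml-weight`` can be checked by hand. The following is a minimal sketch of that arithmetic, not PandaDock's internal code; the function name ``combined_score`` and the input scores are made up for illustration.

.. code-block:: python

   def combined_score(physics, ml, ml_weight=0.6):
       """Blend two scores as (1 - weight) * physics + weight * ML."""
       if not 0.0 <= ml_weight <= 1.0:
           raise ValueError("ml_weight must be between 0 and 1")
       return (1.0 - ml_weight) * physics + ml_weight * ml


   # Hypothetical physics-based and ML scores for one pose.
   score = combined_score(physics=-8.0, ml=-10.0)
   print(round(score, 2))

With the default weight of 0.6, the result lies closer to the ML score than to the physics score; a weight of 1.0 reproduces the ML score alone.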

Uncertainty Quantification
---------------------------

``--estimate-uncertainty / --no-estimate-uncertainty``
    Estimate prediction uncertainty. Default: enabled when the ensemble is enabled

``--uncertainty-threshold FLOAT``
    Maximum uncertainty for accepting predictions. Default: 1.0

    Predictions with uncertainty > threshold are flagged as low confidence.

``--monte-carlo-dropout / --no-monte-carlo-dropout``
    Use Monte Carlo dropout for uncertainty estimation. Default: disabled

    More accurate but slower uncertainty estimates.

Output Options
--------------

``-o, --output-dir PATH``
    Output directory. Default: ``ml_docking_output``

``-n, --num-poses N``
    Number of poses to generate. Default: 20

``--visualize / --no-visualize``
    Generate visualization plots. Default: enabled

``--save-ml-features``
    Save extracted ML features for analysis

``--save-attention-maps``
    Save attention maps (for GNN/Transformer models)

Performance Options
-------------------

``--cpuworkers N``
    Number of CPU workers. Default: auto-detect

``--gpu``
    Enable GPU acceleration for ML inference

    **Highly recommended** - 10-50x speedup for ML models

``--gpu-batch-size N``
    Batch size for GPU ML inference. Default: 32

``--fast``
    Fast mode with reduced sampling

Examples
--------

Basic ML Docking
^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r protein.pdb -l ligand.sdf \
                --center 10 20 30 --box 20 20 20 \
                -o ml_results/

Uses default GNN model with ensemble scoring.

High-Accuracy ML Docking
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r protein.pdb -l ligand.sdf \
                --center 10 20 30 --box 20 20 20 \
                --model-type hybrid \
                --use-ensemble \
                --algorithm enhanced_hierarchical_cpu \
                --num-poses 50 \
                -o high_accuracy_ml/

GPU-Accelerated ML Docking
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r target.pdb -l ligands.sdf \
                --center 10 20 30 --box 20 20 20 \
                --model-type gnn \
                --gpu \
                --gpu-batch-size 64 \
                -o gpu_ml_docking/

3D CNN Model
^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r protein.pdb -l ligand.sdf \
                --center 10 20 30 --box 20 20 20 \
                --model-type cnn3d \
                --grid-resolution 0.5 \
                --gpu \
                -o cnn3d_results/

ML-Only Scoring
^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r protein.pdb -l ligand.sdf \
                --center 10 20 30 --box 20 20 20 \
                --ml-scoring-mode ml_only \
                --model-type gnn \
                --use-ensemble \
                -o ml_only/

ML Pose Refinement
^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # First: Standard docking
   pandadock dock -r protein.pdb -l ligands.sdf \
                  --num-poses 100 \
                  -o initial_docking/

   # Second: ML re-ranking (--center and --box are required options)
   pandadock-ml -r protein.pdb -l ligands.sdf \
                --center 10 20 30 --box 20 20 20 \
                --ml-scoring-mode refinement \
                --model-type hybrid \
                --use-ensemble \
                -o ml_refined/

With Uncertainty Filtering
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r protein.pdb -l library.sdf \
                --center 10 20 30 --box 20 20 20 \
                --use-ensemble \
                --estimate-uncertainty \
                --uncertainty-threshold 0.8 \
                -o filtered_predictions/

Predictions with uncertainty above 0.8 are flagged as low confidence.

Custom Model Weights
^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock-ml -r kinase.pdb -l inhibitors.sdf \
                --center 10 20 30 --box 20 20 20 \
                --model-type gnn \
                --model-weights kinase_finetuned.pt \
                -o custom_model/

Target-Specific Fine-Tuned Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Kinase-specific model
   pandadock-ml -r kinase.pdb -l ligands.sdf \
                --model-type gnn \
                --model-weights models/kinase_specialist.pt \
                --center 10 20 30 --box 20 20 20

   # GPCR-specific model
   pandadock-ml -r gpcr.pdb -l ligands.sdf \
                --model-type gnn \
                --model-weights models/gpcr_specialist.pt \
                --center 10 20 30 --box 20 20 20

Output Files
------------

**Structures:**

* ``complex1.pdb, complex2.pdb, ...`` - Protein-ligand complexes
* ``pose1.pdb, pose2.pdb, ...`` - Ligand poses only

**Analysis:**

* ``ml_docking_results.json`` - Complete results with ML scores
* ``ml_predictions.csv`` - ML scores, uncertainties, features
* ``uncertainty_analysis.json`` - Uncertainty quantification results
* ``feature_importance.json`` - ML feature importance
* ``summary.txt`` - Human-readable summary

**ML-Specific:**

* ``attention_maps/`` - Attention visualizations (if requested)
* ``ml_features/`` - Extracted features (if requested)

**Visualizations:**

* ``ml_scores.png`` - ML score distribution
* ``uncertainty_plot.png`` - Uncertainty vs score
* ``feature_importance.png`` - Important features visualization

ML Predictions Output
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: json

   {
     "pose_1": {
       "ml_score": -9.8,
       "physics_score": -8.5,
       "combined_score": -9.28,
       "uncertainty": 0.45,
       "confidence": "high",
       "predicted_pKd": 8.5,
       "predicted_Ki_nM": 3.2,
       "feature_importance": {
         "hydrophobic_contacts": 0.35,
         "hydrogen_bonds": 0.28,
         "shape_complementarity": 0.22,
         "electrostatics": 0.15
       }
     }
   }
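
Output of this shape is straightforward to post-process. The sketch below assumes the results are a JSON object mapping pose names to records with the fields shown above, and that more negative combined scores are better; ``pose_2`` and ``pose_3`` are hypothetical records added for illustration, and both helper functions are illustrative names rather than part of PandaDock.

.. code-block:: python

   import json

   # Inline sample mirroring the structure shown above; in practice the text
   # would be read from the results file in the output directory.
   sample = """
   {
     "pose_1": {"combined_score": -9.2, "uncertainty": 0.45, "predicted_pKd": 8.5},
     "pose_2": {"combined_score": -9.6, "uncertainty": 1.30, "predicted_pKd": 8.9},
     "pose_3": {"combined_score": -7.1, "uncertainty": 0.20, "predicted_pKd": 6.4}
   }
   """


   def pkd_to_nanomolar(pkd):
       """Convert pKd = -log10(Kd in mol/L) to Kd in nM."""
       return 10 ** (9 - pkd)


   def rank_confident_poses(results, max_uncertainty=1.0):
       """Drop poses above the uncertainty cutoff, then sort best score first."""
       kept = {name: rec for name, rec in results.items()
               if rec["uncertainty"] <= max_uncertainty}
       # Assumption: more negative combined scores indicate better poses.
       return sorted(kept, key=lambda name: kept[name]["combined_score"])


   results = json.loads(sample)
   for name in rank_confident_poses(results):
       kd = pkd_to_nanomolar(results[name]["predicted_pKd"])
       print(f"{name}: score={results[name]['combined_score']}, Kd ~ {kd:.1f} nM")

A pKd of 8.5 corresponds to a dissociation constant of about 3.2 nM, which is the same order as the ``predicted_Ki_nM`` value in the example record.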

Performance Characteristics
----------------------------

**Accuracy:**

* R = 0.91 correlation with experimental data (hybrid model, ensemble)
* R = 0.88 (GNN model)
* R = 0.89 (3D CNN model)

**Speed:**

+------------------+-----------+------------+
| Model Type       | CPU Time  | GPU Time   |
+==================+===========+============+
| GNN              | 0.1-0.2 s | 0.01 s     |
+------------------+-----------+------------+
| 3D CNN           | 0.5-1.0 s | 0.05 s     |
+------------------+-----------+------------+
| Hybrid           | 0.3-0.5 s | 0.02 s     |
+------------------+-----------+------------+
| Ensemble (x5)    | 0.5-2.0 s | 0.05-0.1 s |
+------------------+-----------+------------+

**Throughput:**

* CPU: 30-120 ligands/hour
* GPU: 300-600 ligands/hour (10-20x speedup)
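
These throughput ranges translate directly into wall-clock estimates for a screening campaign. The sketch below is plain arithmetic under the assumption of a single machine running at constant throughput; the library size and the function name ``screening_hours`` are illustrative.

.. code-block:: python

   def screening_hours(num_ligands, ligands_per_hour):
       """Wall-clock hours needed at a constant throughput."""
       return num_ligands / ligands_per_hour


   library_size = 100_000  # hypothetical library
   for label, low, high in [("CPU", 30, 120), ("GPU", 300, 600)]:
       fastest = screening_hours(library_size, high)
       slowest = screening_hours(library_size, low)
       print(f"{label}: {fastest / 24:.0f} to {slowest / 24:.0f} days")

For 100,000 ligands this works out to roughly 7 to 14 days on a single GPU, which helps when deciding whether a library should be pre-filtered with a faster method first.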

ML Models Details
-----------------

Graph Neural Network (GNN)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture:**

* Node features: Atomic properties (element, hybridization, charge)
* Edge features: Bond type, distance, angle
* Graph convolutions: 6 layers
* Attention mechanism: Multi-head attention
* Output: Binding affinity prediction

**Advantages:**

* Fastest ML model
* Rotationally/translationally invariant
* Captures long-range interactions
* Good generalization
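
Rotational and translational invariance follows when a model's geometric inputs are internal quantities, such as the interatomic distances listed among the edge features, rather than raw coordinates. The sketch below demonstrates that property on a toy set of three atoms; it is not PandaDock's featurization code, and the function names and coordinates are made up.

.. code-block:: python

   import math


   def pairwise_distances(coords):
       """All interatomic distances, in a fixed pair order."""
       return [
           math.dist(a, b)
           for i, a in enumerate(coords)
           for b in coords[i + 1:]
       ]


   def rotate_z_and_shift(coords, angle, shift):
       """Rotate about the z axis by angle (radians), then translate by shift."""
       c, s = math.cos(angle), math.sin(angle)
       return [
           (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
           for x, y, z in coords
       ]


   atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]
   moved = rotate_z_and_shift(atoms, angle=0.7, shift=(5.0, -2.0, 3.0))

   before = pairwise_distances(atoms)
   after = pairwise_distances(moved)
   # The coordinates differ, but every distance matches to rounding error.
   print(all(math.isclose(d0, d1, abs_tol=1e-9) for d0, d1 in zip(before, after)))

A voxel-based model sees different input grids for the original and the moved atoms, so it has to learn this invariance from data instead.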

3D Convolutional Network (CNN)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture:**

* Input: 3D voxel grid (protein + ligand channels)
* Convolution layers: 8 layers with batch normalization
* Pooling: Max pooling between layers
* Fully connected: 3 dense layers
* Output: Binding affinity

**Advantages:**

* Captures 3D spatial patterns
* Good for shape complementarity
* Handles electrostatics well
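
The size of the voxel grid follows from the box dimensions and ``--grid-resolution``. The sketch below shows one plausible mapping from coordinates to voxel indices; it is an assumption-laden illustration (box centered on ``--center``, voxels indexed from the lower corner), not the actual PandaDock voxelizer, and both function names are hypothetical.

.. code-block:: python

   import math


   def grid_shape(box, resolution=0.5):
       """Number of voxels along each axis for a box given in Angstroms."""
       return tuple(math.ceil(side / resolution) for side in box)


   def voxel_index(coord, center, box, resolution=0.5):
       """Map an (x, y, z) position to voxel indices, or None if outside the box."""
       shape = grid_shape(box, resolution)
       index = []
       for value, mid, side, n in zip(coord, center, box, shape):
           # Offset from the lower corner of the box, in units of voxels.
           i = math.floor((value - (mid - side / 2)) / resolution)
           if not 0 <= i < n:
               return None
           index.append(i)
       return tuple(index)


   center, box = (10.0, 20.0, 30.0), (20.0, 20.0, 20.0)
   shape = grid_shape(box)
   print(shape, math.prod(shape))
   print(voxel_index((10.2, 20.0, 29.9), center, box))

A 20 Angstrom cube at 0.5 Angstrom resolution gives 40 x 40 x 40 = 64,000 voxels per channel, and halving the resolution to 0.25 Angstroms multiplies that by eight, which is one reason the 3D CNN is slower than the GNN.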

Hybrid Model
^^^^^^^^^^^^

Combines GNN + 3D CNN:

* GNN branch: Graph-based features
* CNN branch: Spatial features
* Late fusion: Concatenate features before final layers
* Best accuracy but slower

Uncertainty Quantification
---------------------------

**Methods:**

1. **Ensemble Disagreement**

   Uncertainty = standard deviation across ensemble predictions

2. **Monte Carlo Dropout**

   Multiple forward passes with dropout enabled

3. **Evidential Deep Learning**

   Direct uncertainty estimation (experimental)
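
Monte Carlo dropout can be illustrated with a toy model. The sketch below applies dropout to the inputs of a hypothetical linear scorer and summarizes repeated stochastic passes; a real network applies dropout to hidden units, and none of the names, features, or weights here come from PandaDock.

.. code-block:: python

   import random
   import statistics


   def dropout_forward(features, weights, drop_prob, rng):
       """One stochastic pass of a toy linear scorer with inverted dropout."""
       keep_prob = 1.0 - drop_prob
       total = 0.0
       for x, w in zip(features, weights):
           if rng.random() < keep_prob:
               # Rescale kept inputs so the expected output matches the
               # dropout-free prediction.
               total += w * x / keep_prob
       return total


   def mc_dropout_predict(features, weights, passes=200, drop_prob=0.2, seed=0):
       """Mean and standard deviation over repeated dropout passes."""
       rng = random.Random(seed)
       samples = [dropout_forward(features, weights, drop_prob, rng)
                  for _ in range(passes)]
       return statistics.mean(samples), statistics.stdev(samples)


   features = [0.35, 0.28, 0.22, 0.15]   # hypothetical input features
   weights = [-12.0, -9.0, -7.0, -5.0]   # hypothetical learned weights
   mean, spread = mc_dropout_predict(features, weights)
   print(f"prediction = {mean:.2f}, uncertainty = {spread:.2f}")

The spread across passes plays the same role as the ensemble standard deviation, at the cost of one forward pass per sample.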

**Interpretation:**

* Low uncertainty (<0.5): High confidence
* Medium uncertainty (0.5-1.0): Moderate confidence
* High uncertainty (>1.0): Low confidence, novel chemical space
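
The ensemble-disagreement method and the bands above can be combined in a few lines. This is a minimal sketch assuming the sample standard deviation is used and that the band edges are treated as listed; the member scores and function names are hypothetical.

.. code-block:: python

   import statistics


   def ensemble_summary(predictions):
       """Mean prediction and ensemble disagreement (sample standard deviation)."""
       return statistics.mean(predictions), statistics.stdev(predictions)


   def confidence_label(uncertainty):
       """Map an uncertainty value to the bands listed above."""
       if uncertainty < 0.5:
           return "high"
       if uncertainty <= 1.0:
           return "moderate"
       return "low"


   # Hypothetical scores for one pose from the five ensemble members.
   member_scores = [-9.6, -9.9, -10.1, -9.4, -10.0]
   mean, uncertainty = ensemble_summary(member_scores)
   label = confidence_label(uncertainty)
   print(f"score = {mean:.2f} +/- {uncertainty:.2f} ({label} confidence)")

Here the five members agree to within about 0.3 score units, so the pose falls in the high-confidence band.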

**Use uncertainty to:**

* Filter unreliable predictions
* Identify compounds requiring experimental validation
* Detect out-of-distribution samples

Best Practices
--------------

When to Use ML Docking
^^^^^^^^^^^^^^^^^^^^^^^

* **Maximum accuracy required** - Lead optimization, critical predictions
* **Novel scaffolds** - ML can capture patterns physics-based scoring misses
* **Large datasets available** - Can fine-tune models
* **GPU available** - Makes ML inference fast

When Not to Use
^^^^^^^^^^^^^^^

* **Ultra-large screening** (>100k compounds) - Too slow even with GPU
* **Very novel chemical space** - Models may not generalize well beyond their training data; check the uncertainty estimates
* **No GPU available and speed is critical** - Use a faster scoring function

Optimization Tips
^^^^^^^^^^^^^^^^^

**For maximum accuracy:**

.. code-block:: bash

   --model-type hybrid \
   --use-ensemble \
   --estimate-uncertainty

**For maximum speed:**

.. code-block:: bash

   --model-type gnn \
   --no-ensemble \
   --gpu \
   --gpu-batch-size 128

**Balanced:**

.. code-block:: bash

   --model-type gnn \
   --use-ensemble \
   --gpu

Troubleshooting
---------------

Slow ML Inference
^^^^^^^^^^^^^^^^^

**Problem:** ML scoring very slow

**Solutions:**

1. Use GPU: ``--gpu``
2. Increase batch size: ``--gpu-batch-size 64``
3. Use simpler model: ``--model-type gnn``
4. Disable ensemble: ``--no-ensemble``

High Uncertainty
^^^^^^^^^^^^^^^^

**Problem:** Many predictions have high uncertainty

**Possible causes:**

* Novel chemical scaffolds not in training data
* Unusual binding modes
* Protein family not well-represented in training

**Solutions:**

* Use physics-based or hybrid scoring as fallback
* Fine-tune model on your target family
* Flag high-uncertainty predictions for manual review

Model Loading Errors
^^^^^^^^^^^^^^^^^^^^^

**Problem:** Cannot load ML model weights

**Solutions:**

1. Verify model file exists
2. Check PyTorch/TensorFlow version compatibility
3. Re-download default models
4. Check file permissions

Out of GPU Memory
^^^^^^^^^^^^^^^^^

**Problem:** GPU out of memory during ML inference

**Solutions:**

1. Reduce batch size: ``--gpu-batch-size 16``
2. Use smaller model: ``--model-type gnn`` (not hybrid)
3. Disable ensemble: ``--no-ensemble``
4. Use CPU inference (remove ``--gpu``)

Fine-Tuning ML Models
----------------------

You can fine-tune models on your own data:

.. code-block:: bash

   # Train on custom dataset
   pandadock-ml-train \
       --training-data my_protein_ligand_complexes.csv \
       --model-type gnn \
       --output-weights custom_model.pt

   # Use fine-tuned model (--center and --box are required options)
   pandadock-ml -r protein.pdb -l ligands.sdf \
                --center 10 20 30 --box 20 20 20 \
                --model-weights custom_model.pt

Exit Status
-----------

Returns 0 on success, non-zero on error.
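
In batch scripts the exit status is the simplest way to detect failed runs. The sketch below wraps a command with Python's ``subprocess.run``; the helper name ``run_checked`` is hypothetical, and a stand-in command that exits with status 3 is used so the example runs without PandaDock installed.

.. code-block:: python

   import subprocess
   import sys


   def run_checked(command):
       """Run a command and return (ok, exit_code); ok is True only for code 0."""
       completed = subprocess.run(command, capture_output=True, text=True)
       if completed.returncode != 0:
           print(f"command failed with status {completed.returncode}",
                 file=sys.stderr)
       return completed.returncode == 0, completed.returncode


   # A real call would pass the pandadock-ml argument list, for example:
   #   ["pandadock-ml", "-r", "protein.pdb", "-l", "ligand.sdf",
   #    "--center", "10", "20", "30", "--box", "20", "20", "20"]
   # The stand-in below only exits with status 3 to exercise the error branch.
   ok, code = run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
   print(ok, code)

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.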

See Also
--------

* :doc:`pandadock` - Standard docking
* :doc:`pandadock_flex` - Flexible docking
* :doc:`../scoring/hybrid` - Hybrid ML scoring
* :doc:`../algorithms/specialized_modes` - Specialized docking modes