GNN Training
============

This guide covers training the PandaDock-GNN model on protein-ligand datasets.

Prerequisites
-------------

Install the GNN dependencies:

.. code-block:: bash

   pip install -e ".[gnn]"

This installs PyTorch, PyTorch Geometric, and related packages.

Supported Datasets
------------------

PandaDock-GNN supports training on multiple datasets:

* **ULVSH**: 942 compounds, 10 protein targets (pEC50 values)
* **BindingDB**: 8,891 protein-ligand complexes with experimental pK values
* **PDBbind**: 5,316 complexes with pKd/pKi values (v2020 refined set)

Dataset Preparation
-------------------

**ULVSH Format**

PandaDock-GNN is designed for the ULVSH dataset format:

.. code-block:: text

   ULVSH/
   ├── TARGET1/
   │   └── raw/
   │       ├── protein.mol2          # Protein structure
   │       ├── vitro.tsv             # EC50 values
   │       └── COMPOUND_ID/
   │           ├── ligand.mol2       # Ligand structure
   │           └── site.mol2         # Binding site atoms
   ├── TARGET2/
   │   └── ...
   └── ...

The ``vitro.tsv`` file should contain:

.. code-block:: text

   ID          EC50[uM]    Activity
   compound1   0.5         1
   compound2   10.0        0
   ...

**BindingDB Format**

For BindingDB training, prepare a TSV file with protein-ligand complex paths:

.. code-block:: text

   complex_id  protein_file        ligand_file        pK
   complex_1   proteins/1abc.mol2  ligands/lig1.mol2  7.5
   complex_2   proteins/2def.mol2  ligands/lig2.mol2  6.2
   ...

Training Command
----------------

Basic training:

.. code-block:: bash

   pandadock gnn train -d ULVSH/ -o models/ --epochs 100

With the main options spelled out (the values shown are the defaults):

.. code-block:: bash

   pandadock gnn train \
       --dataset ULVSH/ \
       --output models/ \
       --epochs 100 \
       --batch-size 32 \
       --lr 1e-4 \
       --hidden-dim 256 \
       --num-layers 6 \
       --dropout 0.1 \
       --patience 20 \
       --gpu

BindingDB Training
------------------

Train on the BindingDB dataset:

.. code-block:: bash

   # BindingDB-only training
   python BindingDB_training/train_bindingdb.py \
       --bindingdb BindingDB_training/bindingdb_affinity.tsv \
       --output models/ \
       --epochs 100 \
       --batch-size 16

Combined training with ULVSH (recommended for generalization):

.. code-block:: bash

   # BindingDB + ULVSH combined training
   python BindingDB_training/train_bindingdb.py \
       --bindingdb BindingDB_training/bindingdb_affinity.tsv \
       --ulvsh ULVSH/ \
       --combined \
       --output models/ \
       --epochs 100

**Benchmark Results:**

+---------------------------+----------------+-----------+
| Training Configuration    | Test Pearson R | Test RMSE |
+===========================+================+===========+
| BindingDB Only            | 0.81           | -         |
+---------------------------+----------------+-----------+
| BindingDB + ULVSH         | 0.79           | 0.96      |
+---------------------------+----------------+-----------+

Training Options
----------------

+------------------+----------+---------------------------------------------+
| Option           | Default  | Description                                 |
+==================+==========+=============================================+
| ``--epochs``     | 100      | Number of training epochs                   |
+------------------+----------+---------------------------------------------+
| ``--batch-size`` | 32       | Batch size                                  |
+------------------+----------+---------------------------------------------+
| ``--lr``         | 1e-4     | Learning rate                               |
+------------------+----------+---------------------------------------------+
| ``--hidden-dim`` | 256      | Hidden dimension                            |
+------------------+----------+---------------------------------------------+
| ``--num-layers`` | 6        | Number of EGNN layers                       |
+------------------+----------+---------------------------------------------+
| ``--dropout``    | 0.1      | Dropout rate                                |
+------------------+----------+---------------------------------------------+
| ``--patience``   | 20       | Early stopping patience (epochs)            |
+------------------+----------+---------------------------------------------+
| ``--split``      | random   | Data split strategy (random or target)      |
+------------------+----------+---------------------------------------------+
| ``--gpu/--cpu``  | --gpu    | Use GPU if available                        |
+------------------+----------+---------------------------------------------+
| ``--seed``       | 42       | Random seed for reproducibility             |
+------------------+----------+---------------------------------------------+

Training Output
---------------

After training, the output directory contains:

* ``best_model.pt``: Best model checkpoint (by validation loss)
* ``final_model.pt``: Final model checkpoint
* ``training_log.csv``: Training metrics per epoch
* ``training_results.json``: Final metrics summary

Model Checkpoint Format
-----------------------

The checkpoint contains:

.. code-block:: python

   {
       'config': ModelConfig,        # Model configuration
       'state_dict': dict,           # Model weights
       'training_config': dict,      # Training configuration
       'best_metrics': dict,         # Best validation metrics
       'epoch': int                  # Checkpoint epoch
   }

Monitoring Training
-------------------

The training loop outputs:

* Loss values (total, affinity, activity)
* Validation metrics (Pearson R, Spearman ρ, RMSE, MAE)
* Early stopping status
* Best model updates

Example output:

.. code-block:: text

   Epoch 1/100 [====================] Train Loss: 0.5234 | Val Loss: 0.4123 | Val R: 0.45
   Epoch 2/100 [====================] Train Loss: 0.3421 | Val Loss: 0.3012 | Val R: 0.58
     * New best model saved
   ...

Tips for Better Training
------------------------

1. **Use a GPU**: Training is roughly 10x faster on a GPU.
2. **Start with the default hyperparameters**: They work well for ULVSH.
3. **Monitor validation R**: It should trend upward over the epochs.
4. **Early stopping**: A patience of 20 epochs is usually sufficient.
5. **Batch size**: 32 works well; reduce it if you run out of memory.

Evaluating the Model
--------------------

After training, benchmark the model on the test set:

.. code-block:: bash

   pandadock gnn benchmark -m models/best_model.pt \
       -d ULVSH/ -o results/

Compare against baselines:

.. code-block:: bash

   pandadock gnn compare -m models/best_model.pt \
       -d ULVSH/ -o comparison/
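
The reported metrics (Pearson R, RMSE, MAE) can also be recomputed by hand from a list of predictions, which is useful for sanity-checking benchmark output. The sketch below uses only the Python standard library. The function names are hypothetical and not part of the PandaDock API, and the example values are made up. The conversion from ``EC50[uM]`` to pEC50 uses the standard definition (the negative base-10 logarithm of the molar EC50); whether PandaDock-GNN derives its ULVSH labels in exactly this way is an assumption.

.. code-block:: python

   import math


   def pec50_from_ec50_um(ec50_um):
       """Convert an EC50 in micromolar to pEC50 = -log10(EC50 in mol/L)."""
       if ec50_um <= 0:
           raise ValueError("EC50 must be positive")
       # 1 uM = 1e-6 M, so -log10(x * 1e-6) = 6 - log10(x)
       return 6.0 - math.log10(ec50_um)


   def pearson_r(y_true, y_pred):
       """Pearson correlation coefficient of two equal-length sequences."""
       n = len(y_true)
       mean_t = sum(y_true) / n
       mean_p = sum(y_pred) / n
       cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
       var_t = sum((t - mean_t) ** 2 for t in y_true)
       var_p = sum((p - mean_p) ** 2 for p in y_pred)
       if var_t == 0 or var_p == 0:
           raise ValueError("Pearson R is undefined for constant input")
       return cov / math.sqrt(var_t * var_p)


   def rmse(y_true, y_pred):
       """Root mean squared error."""
       squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
       return math.sqrt(squared / len(y_true))


   def mae(y_true, y_pred):
       """Mean absolute error."""
       return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


   # Hypothetical EC50 labels (uM) and hypothetical predicted pEC50 values.
   ec50_um = [0.5, 10.0, 1.0, 100.0]
   y_true = [pec50_from_ec50_um(v) for v in ec50_um]
   y_pred = [6.0, 5.2, 5.9, 4.3]

   print(f"Pearson R: {pearson_r(y_true, y_pred):.3f}")
   print(f"RMSE:      {rmse(y_true, y_pred):.3f}")
   print(f"MAE:       {mae(y_true, y_pred):.3f}")

Pearson R measures how well the predictions rank and scale with the labels, while RMSE and MAE measure absolute error in pK units, so a model can score well on one and poorly on the other.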