GNN Training
This guide covers training the PandaDock-GNN model on protein-ligand datasets.
Prerequisites
Install the GNN dependencies:
pip install -e ".[gnn]"
This installs PyTorch, PyTorch Geometric, and related packages.
Supported Datasets
PandaDock-GNN supports training on multiple datasets:
ULVSH: 942 compounds, 10 protein targets (pEC50 values)
BindingDB: 8,891 protein-ligand complexes with experimental pK values
PDBbind: 5,316 complexes with pKd/pKi values (v2020 refined set)
Dataset Preparation
ULVSH Format
PandaDock-GNN expects ULVSH data in the following directory layout:
ULVSH/
├── TARGET1/
│ └── raw/
│ ├── protein.mol2 # Protein structure
│ ├── vitro.tsv # EC50 values
│ └── COMPOUND_ID/
│ ├── ligand.mol2 # Ligand structure
│ └── site.mol2 # Binding site atoms
├── TARGET2/
│ └── ...
└── ...
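Before starting a long training run, it can help to check that a dataset directory matches the layout above. The following is a minimal sketch that relies only on the file names shown in the tree; the function name `find_complexes` is hypothetical and not part of PandaDock.

```python
from pathlib import Path


def find_complexes(root):
    """Yield (target, compound_id, ligand_path, site_path) for each complete complex.

    Follows the layout above: <root>/<TARGET>/raw/<COMPOUND_ID>/{ligand,site}.mol2
    """
    for target_dir in sorted(Path(root).iterdir()):
        raw = target_dir / "raw"
        # Skip anything that is not a target directory with a protein structure.
        if not (raw / "protein.mol2").is_file():
            continue
        for compound_dir in sorted(p for p in raw.iterdir() if p.is_dir()):
            ligand = compound_dir / "ligand.mol2"
            site = compound_dir / "site.mol2"
            # Only report compounds that have both required files.
            if ligand.is_file() and site.is_file():
                yield target_dir.name, compound_dir.name, ligand, site
```

Calling `list(find_complexes("ULVSH/"))` and comparing its length with the number of compounds you expect is a quick way to spot missing `ligand.mol2` or `site.mol2` files.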
The vitro.tsv file should contain:
ID EC50[uM] Activity
compound1 0.5 1
compound2 10.0 0
...
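The dataset summary lists ULVSH labels as pEC50, while vitro.tsv stores EC50 in µM. The two are related by pEC50 = -log10(EC50 in mol/L), which equals 6 - log10(EC50 in µM). The sketch below parses the file and applies that conversion. It assumes a tab-separated file with the header shown above; whether PandaDock performs the conversion in exactly this way internally is an assumption, and `read_vitro` is a hypothetical helper.

```python
import csv
import io
import math


def read_vitro(handle):
    """Parse vitro.tsv rows into dicts with an added pEC50 value.

    Assumes a tab-separated file with the header: ID, EC50[uM], Activity.
    """
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        ec50_um = float(row["EC50[uM]"])
        records.append({
            "id": row["ID"],
            "ec50_um": ec50_um,
            # pEC50 = -log10(EC50 in mol/L); 1 uM = 1e-6 mol/L
            "pec50": 6.0 - math.log10(ec50_um),
            "active": row["Activity"] == "1",
        })
    return records


# The two example rows from the format description above.
sample = "ID\tEC50[uM]\tActivity\ncompound1\t0.5\t1\ncompound2\t10.0\t0\n"
for rec in read_vitro(io.StringIO(sample)):
    print(rec["id"], round(rec["pec50"], 2), rec["active"])
```

For the example rows, 0.5 µM corresponds to a pEC50 of about 6.30 and 10.0 µM to 5.0.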
BindingDB Format
For BindingDB training, prepare a TSV file with protein-ligand complex paths:
complex_id protein_file ligand_file pK
complex_1 proteins/1abc.mol2 ligands/lig1.mol2 7.5
complex_2 proteins/2def.mol2 ligands/lig2.mol2 6.2
...
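A malformed row or a missing structure file is easier to catch before training than during it. The sketch below checks a TSV with the four columns shown above. It assumes the file is tab-separated and that the paths are relative to a base directory you pass in; `check_affinity_table` is a hypothetical helper, not a PandaDock function.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("complex_id", "protein_file", "ligand_file", "pK")


def check_affinity_table(tsv_path, base_dir="."):
    """Return a list of human-readable problems found in the TSV (empty if none)."""
    problems = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            for key in ("protein_file", "ligand_file"):
                value = row[key] or ""  # short rows yield None for missing fields
                if not (Path(base_dir) / value).is_file():
                    problems.append(f"line {line_no}: {key} not found: {value}")
            try:
                float(row["pK"])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: pK is not a number: {row['pK']!r}")
    return problems
```

An empty return value means every row has readable structure files and a numeric pK.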
Training Command
Basic training:
pandadock gnn train -d ULVSH/ -o models/ --epochs 100
Full options:
pandadock gnn train \
--dataset ULVSH/ \
--output models/ \
--epochs 100 \
--batch-size 32 \
--lr 1e-4 \
--hidden-dim 256 \
--num-layers 6 \
--dropout 0.1 \
--patience 20 \
--gpu
BindingDB Training
Train on the BindingDB dataset:
# BindingDB only training
python BindingDB_training/train_bindingdb.py \
--bindingdb BindingDB_training/bindingdb_affinity.tsv \
--output models/ \
--epochs 100 \
--batch-size 16
Combined training with ULVSH (recommended for generalization):
# BindingDB + ULVSH combined training
python BindingDB_training/train_bindingdb.py \
--bindingdb BindingDB_training/bindingdb_affinity.tsv \
--ulvsh ULVSH/ \
--combined \
--output models/ \
--epochs 100
Benchmark Results:
| Training Configuration | Test Pearson R | Test RMSE |
|---|---|---|
| BindingDB Only | 0.81 | |
| BindingDB + ULVSH | 0.79 | 0.96 |
Training Options
| Option | Default | Description |
|---|---|---|
| --epochs | 100 | Number of training epochs |
| --batch-size | 32 | Batch size |
| --lr | 1e-4 | Learning rate |
| --hidden-dim | 256 | Hidden dimension |
| --num-layers | 6 | Number of EGNN layers |
| --dropout | 0.1 | Dropout rate |
| --patience | 20 | Early stopping patience |
| | random | Data split strategy (random or target) |
| | --gpu | Use GPU if available |
| | 42 | Random seed for reproducibility |
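The two data split strategies differ in what is held out. The sketch below illustrates one common interpretation: a random split shuffles individual complexes, while a target split assigns whole protein targets to either the training or the test set, so no target appears in both. This reading of "target", the 20% test fraction, and both function names are assumptions for illustration, not PandaDock's confirmed behavior.

```python
import random


def random_split(samples, test_fraction=0.2, seed=42):
    """Shuffle individual samples and cut off a test fraction."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def target_split(samples, test_fraction=0.2, seed=42):
    """Hold out whole targets so no target appears in both train and test.

    Each sample is a (target, compound_id) pair.
    """
    rng = random.Random(seed)
    targets = sorted({target for target, _ in samples})
    rng.shuffle(targets)
    n_test = max(1, round(len(targets) * test_fraction))
    held_out = set(targets[:n_test])
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test
```

Under this interpretation, a target split gives a stricter estimate of how the model generalizes to proteins it has never seen, which is why its test metrics are often lower than those from a random split.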
Training Output
After training, the output directory contains:
best_model.pt: Best model checkpoint (by validation loss)
final_model.pt: Final model checkpoint
training_log.csv: Training metrics per epoch
training_results.json: Final metrics summary
Model Checkpoint Format
The checkpoint contains:
{
'config': ModelConfig, # Model configuration
'state_dict': dict, # Model weights
'training_config': dict, # Training configuration
'best_metrics': dict, # Best validation metrics
'epoch': int # Checkpoint epoch
}
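A checkpoint with the keys listed above can be inspected from Python. The sketch below is a minimal example with two hypothetical helpers. It assumes the file was written with `torch.save`, that the `pandadock` package is importable (so the stored `ModelConfig` object can be unpickled), and that the installed PyTorch version accepts the `weights_only` argument of `torch.load`.

```python
def summarize_checkpoint(checkpoint):
    """Return a short summary of a checkpoint dict with the keys listed above."""
    expected = ("config", "state_dict", "training_config", "best_metrics", "epoch")
    missing = [key for key in expected if key not in checkpoint]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "epoch": checkpoint["epoch"],
        "best_metrics": dict(checkpoint["best_metrics"]),
        "num_tensors": len(checkpoint["state_dict"]),
    }


def load_checkpoint(path):
    """Load a checkpoint onto the CPU. Requires PyTorch."""
    import torch  # imported lazily so summarize_checkpoint works without it

    # weights_only=False because the checkpoint stores a ModelConfig object,
    # not just tensors. Only use this with checkpoint files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)
```

A typical call would be `summarize_checkpoint(load_checkpoint("models/best_model.pt"))`, which reports the saved epoch, the best validation metrics, and the number of entries in the state dict.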
Monitoring Training
The training loop outputs:
Loss values (total, affinity, activity)
Validation metrics (Pearson R, Spearman ρ, RMSE, MAE)
Early stopping status
Best model updates
Example output:
Epoch 1/100 [====================]
Train Loss: 0.5234 | Val Loss: 0.4123 | Val R: 0.45
Epoch 2/100 [====================]
Train Loss: 0.3421 | Val Loss: 0.3012 | Val R: 0.58
* New best model saved
...
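If you capture the console output, lines in the format above can be turned into per-epoch records for plotting or comparison. The sketch below assumes the output looks exactly like the example, which may not hold for every version; `parse_training_output` is a hypothetical helper. The training_log.csv file is likely the more robust source, but its column names are not documented here.

```python
import re

EPOCH_RE = re.compile(r"Epoch (\d+)/(\d+)")
METRICS_RE = re.compile(
    r"Train Loss: ([\d.]+) \| Val Loss: ([\d.]+) \| Val R: (-?[\d.]+)"
)


def parse_training_output(text):
    """Turn console output in the format above into one dict per epoch."""
    records = []
    current = None
    for line in text.splitlines():
        if (m := EPOCH_RE.search(line)):
            # A new "Epoch N/M" line starts a new record.
            current = {"epoch": int(m.group(1)), "new_best": False}
            records.append(current)
        elif current is not None and (m := METRICS_RE.search(line)):
            current["train_loss"] = float(m.group(1))
            current["val_loss"] = float(m.group(2))
            current["val_r"] = float(m.group(3))
        elif current is not None and "New best model saved" in line:
            current["new_best"] = True
    return records
```

Each returned dict holds the epoch number, the three reported values, and whether that epoch produced a new best model.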
Tips for Better Training
Use a GPU: Training is roughly 10x faster than on CPU
Start with the default hyperparameters: They work well for ULVSH
Monitor validation R: It should increase steadily over the epochs
Early stopping: A patience of 20 epochs is usually sufficient
Batch size: 32 works well; reduce it if you run out of memory
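To make the patience setting concrete, here is a generic sketch of patience-based early stopping. It assumes validation loss is the monitored quantity, which matches how the best checkpoint is selected, but the class, its `min_delta` parameter, and the exact stopping rule are illustrative and not taken from PandaDock's code.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: minimum drop that counts
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch. Returns (improved, should_stop)."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
            return True, False
        self.epochs_without_improvement += 1
        return False, self.epochs_without_improvement >= self.patience


# Toy run with a short patience so the stop is visible.
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([0.52, 0.41, 0.43, 0.42, 0.44, 0.40], start=1):
    improved, stop = stopper.step(loss)
    if improved:
        print(f"epoch {epoch}: new best {loss}")
    if stop:
        print(f"epoch {epoch}: stopping")
        break
```

In the toy run the loss improves at epochs 1 and 2, then fails to improve for three epochs in a row, so training stops at epoch 5 and never sees the lower value at epoch 6. A larger patience trades extra training time for a lower chance of stopping during a temporary plateau.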
Evaluating the Model
After training, benchmark the model on the test set:
pandadock gnn benchmark -m models/best_model.pt \
-d ULVSH/ -o results/
Compare against baselines:
pandadock gnn compare -m models/best_model.pt \
-d ULVSH/ -o comparison/