GNN Training

This guide covers training the PandaDock-GNN model on protein-ligand datasets.

Prerequisites

Install the GNN dependencies:

pip install -e ".[gnn]"

This installs PyTorch, PyTorch Geometric, and related packages.

Supported Datasets

PandaDock-GNN supports training on multiple datasets:

ULVSH: 942 compounds, 10 protein targets (pEC50 values)
BindingDB: 8,891 protein-ligand complexes with experimental pK values
PDBbind: 5,316 complexes with pKd/pKi values (v2020 refined set)

Dataset Preparation

ULVSH Format

PandaDock-GNN expects ULVSH data in the following directory layout:

ULVSH/
├── TARGET1/
│   └── raw/
│       ├── protein.mol2      # Protein structure
│       ├── vitro.tsv         # EC50 values
│       └── COMPOUND_ID/
│           ├── ligand.mol2   # Ligand structure
│           └── site.mol2     # Binding site atoms
├── TARGET2/
│   └── ...
└── ...
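Before training, it can help to confirm that every compound folder is complete. The following is a minimal sketch that walks the layout shown above; `list_compounds` is a hypothetical helper written for this guide, not part of the PandaDock API.

```python
from pathlib import Path


def list_compounds(root):
    """Yield (target, compound_id, compound_dir) for each complete compound folder.

    Assumes the layout shown above:
    <root>/<TARGET>/raw/<COMPOUND_ID>/{ligand.mol2, site.mol2}
    """
    root = Path(root)
    for target_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        raw = target_dir / "raw"
        # Skip targets that lack the shared protein structure
        if not (raw / "protein.mol2").is_file():
            continue
        for compound_dir in sorted(p for p in raw.iterdir() if p.is_dir()):
            has_ligand = (compound_dir / "ligand.mol2").is_file()
            has_site = (compound_dir / "site.mol2").is_file()
            if has_ligand and has_site:
                yield target_dir.name, compound_dir.name, compound_dir
```

Calling `list(list_compounds("ULVSH/"))` returns only the compounds that have both a ligand and a binding-site file, so incomplete folders are easy to spot by comparison with the directory listing.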

The vitro.tsv file should contain:

ID         EC50[uM]    Activity
compound1  0.5         1
compound2  10.0        0
...
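The dataset summary lists ULVSH labels as pEC50, while vitro.tsv stores EC50 in micromolar. By definition pEC50 is the negative base-10 logarithm of EC50 in mol/L, so for micromolar input pEC50 = 6 - log10(EC50[uM]). The sketch below applies that conversion while parsing the file. It assumes whitespace-separated columns named as in the example above; whether PandaDock-GNN performs the conversion internally in exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def ec50_um_to_pec50(ec50_um):
    """Convert EC50 in micromolar to pEC50 = -log10(EC50 in mol/L)."""
    if ec50_um <= 0:
        raise ValueError("EC50 must be positive")
    return 6.0 - math.log10(ec50_um)


def read_vitro(lines):
    """Parse vitro.tsv-style lines into a list of dicts.

    Assumes a header row with the columns ID, EC50[uM], and Activity.
    """
    rows = []
    iterator = iter(lines)
    header = next(iterator).split()
    for line in iterator:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed rows
        record = dict(zip(header, fields))
        rows.append({
            "id": record["ID"],
            "pec50": ec50_um_to_pec50(float(record["EC50[uM]"])),
            "active": int(record["Activity"]),
        })
    return rows
```

For the two example rows, 0.5 uM corresponds to a pEC50 of about 6.30 and 10.0 uM to exactly 5.0.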

BindingDB Format

For BindingDB training, prepare a TSV file with protein-ligand complex paths:

complex_id    protein_file              ligand_file           pK
complex_1     proteins/1abc.mol2        ligands/lig1.mol2     7.5
complex_2     proteins/2def.mol2        ligands/lig2.mol2     6.2
...
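A quick consistency check on the manifest can catch missing files before a long training run. This is a minimal sketch, assuming the file is tab-delimited with the four columns shown above and that paths are relative to a base directory; `check_manifest` is a hypothetical helper, not part of the training script.

```python
import csv
from pathlib import Path


def check_manifest(tsv_path, base_dir="."):
    """Return a list of human-readable problems found in the manifest."""
    problems = []
    base = Path(base_dir)
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=2):  # header is line 1
            cid = row.get("complex_id") or f"line {line_no}"
            # Both structure files must exist on disk
            for column in ("protein_file", "ligand_file"):
                rel = row.get(column)
                if not rel or not (base / rel).is_file():
                    problems.append(f"{cid}: missing {column} ({rel})")
            # The affinity label must parse as a number
            try:
                float(row.get("pK") or "")
            except ValueError:
                problems.append(f"{cid}: pK is not numeric ({row.get('pK')})")
    return problems
```

An empty return value means every row points at existing files and carries a numeric pK.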

Training Command

Basic training:

pandadock gnn train -d ULVSH/ -o models/ --epochs 100

Full options:

pandadock gnn train \
    --dataset ULVSH/ \
    --output models/ \
    --epochs 100 \
    --batch-size 32 \
    --lr 1e-4 \
    --hidden-dim 256 \
    --num-layers 6 \
    --dropout 0.1 \
    --patience 20 \
    --gpu

BindingDB Training

Train on the BindingDB dataset alone:

# BindingDB only training
python BindingDB_training/train_bindingdb.py \
    --bindingdb BindingDB_training/bindingdb_affinity.tsv \
    --output models/ \
    --epochs 100 \
    --batch-size 16

Combined training with ULVSH (recommended for generalization):

# BindingDB + ULVSH combined training
python BindingDB_training/train_bindingdb.py \
    --bindingdb BindingDB_training/bindingdb_affinity.tsv \
    --ulvsh ULVSH/ \
    --combined \
    --output models/ \
    --epochs 100

Benchmark Results:

Training Configuration	Test Pearson R	Test RMSE
BindingDB Only	0.81
BindingDB + ULVSH	0.79	0.96

Training Options

Option	Default	Description
`--epochs`	100	Number of training epochs
`--batch-size`	32	Batch size
`--lr`	1e-4	Learning rate
`--hidden-dim`	256	Hidden dimension
`--num-layers`	6	Number of EGNN layers
`--dropout`	0.1	Dropout rate
`--patience`	20	Early stopping patience
`--split`	random	Data split strategy (random or target)
`--gpu/--cpu`	`--gpu`	Use GPU if available
`--seed`	42	Random seed for reproducibility
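The difference between the two `--split` strategies can be illustrated with a small sketch. It assumes that "target" means holding out whole protein targets so that no target appears in both train and test; that reading, the `split_samples` helper, and the `test_fraction` parameter are assumptions for illustration, not the documented behaviour of the CLI.

```python
import random


def split_samples(samples, strategy="random", test_fraction=0.2, seed=42):
    """Split (target, compound_id) pairs into (train, test) lists."""
    rng = random.Random(seed)
    if strategy == "random":
        # Shuffle individual complexes; targets may appear on both sides
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        return shuffled[n_test:], shuffled[:n_test]
    if strategy == "target":
        # Hold out entire targets; tests generalization to unseen proteins
        targets = sorted({target for target, _ in samples})
        rng.shuffle(targets)
        n_test = max(1, round(len(targets) * test_fraction))
        held_out = set(targets[:n_test])
        train = [s for s in samples if s[0] not in held_out]
        test = [s for s in samples if s[0] in held_out]
        return train, test
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this reading, a target split usually gives lower test scores than a random split because the model never sees the held-out proteins during training.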

Training Output

After training, the output directory contains:

best_model.pt: Best model checkpoint (by validation loss)
final_model.pt: Final model checkpoint
training_log.csv: Training metrics per epoch
training_results.json: Final metrics summary
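To find which epoch produced best_model.pt, the per-epoch log can be scanned for the lowest validation loss. The sketch below assumes training_log.csv is comma-separated with a header row; the column names `epoch` and `val_loss` are hypothetical, so check the actual header before using it.

```python
import csv


def best_epoch(log_path, metric="val_loss"):
    """Return (epoch, value) for the row with the lowest `metric`.

    The column names are assumptions; adjust them to the real log header.
    """
    with open(log_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    best = min(rows, key=lambda row: float(row[metric]))
    return int(best["epoch"]), float(best[metric])
```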

Model Checkpoint Format

The checkpoint contains:

{
    'config': ModelConfig,       # Model configuration
    'state_dict': dict,          # Model weights
    'training_config': dict,     # Training configuration
    'best_metrics': dict,        # Best validation metrics
    'epoch': int                 # Checkpoint epoch
}
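A checkpoint with this structure can be inspected outside the CLI. The following is a minimal sketch, assuming the file was written with `torch.save` (suggested by the .pt extension but not confirmed here) and that the PandaDock package is importable so the stored `ModelConfig` object can be unpickled. The helper names are hypothetical.

```python
EXPECTED_KEYS = ("config", "state_dict", "training_config", "best_metrics", "epoch")


def load_checkpoint(path):
    """Load a checkpoint onto the CPU. Requires PyTorch."""
    import torch  # imported lazily so missing_keys() works without PyTorch

    # weights_only=False because the checkpoint stores a ModelConfig object,
    # not only tensors. Use this only with checkpoint files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def missing_keys(checkpoint):
    """Return the documented keys that are absent from a checkpoint dict."""
    return [key for key in EXPECTED_KEYS if key not in checkpoint]
```

For example, `missing_keys(load_checkpoint("models/best_model.pt"))` should return an empty list for a checkpoint that follows the format above.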

Monitoring Training

The training loop outputs:

Loss values (total, affinity, activity)
Validation metrics (Pearson R, Spearman ρ, RMSE, MAE)
Early stopping status
Best model updates

Example output:

Epoch 1/100 [====================]
Train Loss: 0.5234 | Val Loss: 0.4123 | Val R: 0.45

Epoch 2/100 [====================]
Train Loss: 0.3421 | Val Loss: 0.3012 | Val R: 0.58
* New best model saved

...
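For reference, the regression metrics named above follow their standard definitions. The sketch below computes Pearson R, RMSE, and MAE in plain Python; it is not the project's implementation, and it omits Spearman ρ, which is the same Pearson formula applied to ranks.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return Pearson R, RMSE, and MAE for two equal-length sequences.

    Pearson R is undefined (division by zero) if either input is constant.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    return {
        "pearson_r": cov / math.sqrt(var_t * var_p),
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
    }
```

Note that Pearson R measures only linear correlation: predictions that are perfectly correlated with the labels but offset or scaled still score R = 1 while showing a large RMSE, so the two should be read together.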

Tips for Better Training

Use GPU: Training is ~10x faster on a GPU
Start with default hyperparameters: They work well for ULVSH
Monitor validation R: It should increase steadily as training progresses
Early stopping: A patience of 20 epochs is usually sufficient
Batch size: 32 works well; reduce it if you run out of memory
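The patience setting mentioned in the tips can be pictured as a counter of epochs without improvement. This is a minimal sketch of patience-based early stopping on validation loss, not the PandaDock training loop; the `EarlyStopping` class is hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With `patience=20`, training continues until 20 epochs in a row fail to beat the best validation loss seen so far, which is why a larger value costs time but tolerates noisier validation curves.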

Evaluating the Model

After training, benchmark the model on the test set:

pandadock gnn benchmark -m models/best_model.pt \
                        -d ULVSH/ -o results/

Compare against baselines:

pandadock gnn compare -m models/best_model.pt \
                      -d ULVSH/ -o comparison/