Dataset Preparation Guide

This guide covers how to prepare datasets for training PandaDock-GNN models. PandaDock supports three dataset formats: ULVSH, PDBbind, and BindingDB.

Overview 

PandaDock-GNN can be trained on:

Dataset	Complexes	Affinity Type	Best Test R
PDBbind	5,316	pKd/pKi	0.88
ULVSH	942	pEC50	0.82
BindingDB	8,891+	pK (various)	0.81

You can train on any single dataset or combine them for better generalization.

—

ULVSH Dataset 

The ULVSH (Ultra-Large Virtual Screening Hits) dataset contains 942 compounds across 10 protein targets with experimental EC50 values.

Directory Structure 

ULVSH/
├── ADRA2B/                    # Target name (one of 10 targets)
│   ├── raw/
│   │   └── vitro.tsv          # Experimental binding data
│   └── minimized/
│       ├── ZINC000001234567/  # Compound directory (ZINC ID)
│       │   ├── protein.mol2   # Full protein structure
│       │   ├── ligand.mol2    # Ligand structure (docked pose)
│       │   └── site.mol2      # Binding site atoms only
│       └── ZINC000007654321/
│           └── ...
├── CNR1/
│   └── ...
└── DRD4/
    └── ...

Required Files 

vitro.tsv - Tab or whitespace-separated file with experimental data:

ID                  EC50[uM]    Active
ZINC000001234567    0.5         Yes
ZINC000007654321    15.2        No
ZINC000009876543    n.d.        No

ID: Compound identifier (must match directory name in minimized/)
EC50[uM]: EC50 value in micromolar (use “n.d.” for not determined)
Active: “Yes” or “No” for activity classification
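As a minimal sketch of how these columns could be read, the following Python parses a vitro.tsv file and converts EC50 in micromolar to pEC50 (-log10 of the molar value, so pEC50 = 6 - log10(EC50[uM])). The function name `load_vitro` is hypothetical and not part of PandaDock; PandaDock's own parser may behave differently.

```python
import math

def load_vitro(path):
    """Parse a vitro.tsv file into {compound_id: (pEC50 or None, is_active)}."""
    records = {}
    with open(path) as handle:
        header = handle.readline().split()
        id_col = header.index("ID")
        ec50_col = header.index("EC50[uM]")
        active_col = header.index("Active")
        for line in handle:
            fields = line.split()  # splits on tabs or runs of spaces
            if len(fields) < len(header):
                continue  # skip blank or truncated lines
            try:
                ec50_um = float(fields[ec50_col])
                # pEC50 = -log10(EC50 in mol/L); 1 uM = 1e-6 M
                pec50 = 6.0 - math.log10(ec50_um) if ec50_um > 0 else None
            except ValueError:
                pec50 = None  # "n.d." or any other non-numeric entry
            records[fields[id_col]] = (pec50, fields[active_col] == "Yes")
    return records
```

With the example rows above, the 0.5 uM compound maps to a pEC50 of about 6.30, and the "n.d." compound has no pEC50 value.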

MOL2 Files - Standard Tripos MOL2 format with:

protein.mol2: Full protein structure with all atoms
ligand.mol2: Ligand in docked/bound pose
site.mol2: Binding site atoms (typically within 5-10Å of ligand)
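Before training, it can help to confirm that every compound directory contains all three files. The sketch below assumes the directory layout shown earlier (a `minimized/` folder per target); `find_incomplete_compounds` is a hypothetical helper, not a PandaDock function.

```python
from pathlib import Path

REQUIRED_FILES = ("protein.mol2", "ligand.mol2", "site.mol2")

def find_incomplete_compounds(target_dir):
    """Return {compound_name: [missing or empty files]} for one target directory."""
    problems = {}
    minimized = Path(target_dir) / "minimized"
    for compound_dir in sorted(p for p in minimized.iterdir() if p.is_dir()):
        missing = [
            name for name in REQUIRED_FILES
            if not (compound_dir / name).is_file()
            or (compound_dir / name).stat().st_size == 0
        ]
        if missing:
            problems[compound_dir.name] = missing
    return problems
```

Calling `find_incomplete_compounds("ULVSH/ADRA2B")` would return an empty dictionary when every compound is complete.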

Preparing ULVSH Data 

Obtain structures: Get protein-ligand complexes from docking or crystal structures

Extract binding site:

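from rdkit import Chem
import numpy as np

def extract_binding_site(protein_file, ligand_file, radius=10.0):
    """Return indices of protein atoms within radius (Å) of the ligand centroid."""
    # Load structures; keep hydrogens and skip protein sanitization,
    # which frequently fails on full protein structures
    protein = Chem.MolFromMol2File(protein_file, sanitize=False, removeHs=False)
    ligand = Chem.MolFromMol2File(ligand_file, removeHs=False)
    if protein is None or ligand is None:
        raise ValueError("Could not parse protein or ligand MOL2 file")

    # Get ligand centroid
    ligand_coords = ligand.GetConformer().GetPositions()
    centroid = ligand_coords.mean(axis=0)

    # Filter protein atoms by distance to the centroid
    protein_coords = protein.GetConformer().GetPositions()
    distances = np.linalg.norm(protein_coords - centroid, axis=1)
    site_atoms = np.where(distances <= radius)[0]

    # RDKit has no MOL2 writer, so write the selected atoms to site.mol2
    # with another tool (for example Open Babel)
    return site_atoms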

Create vitro.tsv: Compile experimental binding data from literature or assays
Organize directories: Follow the structure shown above

Training on ULVSH 

# Basic training
pandadock gnn train -d ULVSH/ -o models/ulvsh_model/

# With custom parameters
pandadock gnn train -d ULVSH/ -o models/ \
    --epochs 100 \
    --batch-size 32 \
    --hidden-dim 256 \
    --num-layers 6

—

PDBbind Dataset 

PDBbind is a curated database of protein-ligand complexes with experimentally measured binding affinities. The v2020 refined set contains 5,316 complexes.

Obtaining PDBbind 

Register at http://www.pdbbind.org.cn/
Download “PDBbind v2020 refined set”
Extract to create the directory structure below

Directory Structure 

PDBbind/
├── PDBbind_v2020/              # or just place files directly in PDBbind/
│   ├── index/
│   │   └── INDEX_refined_data.2020   # Binding affinity index
│   ├── 1a1e/                   # PDB ID directory
│   │   ├── 1a1e_protein.pdb    # Protein structure
│   │   ├── 1a1e_pocket.pdb     # Binding pocket (pre-extracted)
│   │   ├── 1a1e_ligand.mol2    # Ligand structure
│   │   └── 1a1e_ligand.sdf     # Ligand in SDF format
│   ├── 1a28/
│   │   └── ...
│   └── ...

Required Files 

INDEX_refined_data.2020 - Space-separated index file:

# PDB   resolution  year  -logKd/Ki  reference
1a1e   2.00        1998  6.52       // reference info
1a28   1.90        1997  4.62       // reference info
...

Column 1: PDB ID
Column 2: Resolution (Å)
Column 3: Year of structure
Column 4: Binding affinity as pKd or pKi (-log10 of Kd/Ki in M)
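The column layout above can be read with a few lines of Python. This is a sketch under the assumption that comment lines start with `#` and that only the first four whitespace-separated fields are needed; `parse_pdbbind_index` is a hypothetical name, not PandaDock's loader.

```python
def parse_pdbbind_index(path):
    """Return {pdb_id: (resolution, year, pK)} from a PDBbind-style index file."""
    entries = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and header comments
            fields = line.split()
            if len(fields) < 4:
                continue
            pdb_id, resolution, year, pk = fields[:4]
            try:
                entries[pdb_id] = (float(resolution), int(year), float(pk))
            except ValueError:
                continue  # skip rows whose first fields are not numeric
    return entries
```

For the two example rows, the result maps `1a1e` to `(2.0, 1998, 6.52)` and `1a28` to `(1.9, 1997, 4.62)`.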

Structure Files (per complex):

{pdb_id}_protein.pdb: Full protein structure
{pdb_id}_pocket.pdb: Pre-extracted binding pocket (used for training)
{pdb_id}_ligand.mol2: Ligand structure

Training on PDBbind 

# PDBbind only
pandadock gnn train -p PDBbind/ -o models/pdbbind_model/

# Combined with ULVSH (recommended for generalization)
pandadock gnn train -d ULVSH/ -p PDBbind/ -o models/combined/ --balanced

—

BindingDB Dataset 

BindingDB is a public database of measured binding affinities. We provide tools to prepare BindingDB data for PandaDock training.

Obtaining BindingDB Data 

Download from https://www.bindingdb.org/bind/index.jsp
Select “Download” → “BindingDB_All_2D_NoSmi.tsv.zip” or use the API
Filter for entries with 3D structures

TSV File Format 

Create a TSV file with paths to protein and ligand structures:

complex_id    protein_file              ligand_file           pK       pdb_id
complex_001   proteins/1abc_protein.mol2   ligands/lig001.mol2   7.52     1abc
complex_002   proteins/2def_protein.mol2   ligands/lig002.mol2   6.31     2def
...

Required columns:

complex_id: Unique identifier for the complex
protein_file: Path to protein MOL2 file
ligand_file: Path to ligand MOL2 file
pK: Binding affinity as pKd, pKi, or pIC50 (-log10 scale)
pdb_id (optional): PDB ID if available
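A quick way to catch a malformed index before training is to check for the required columns and confirm that every pK value is numeric. The helper below is an illustrative sketch (`read_affinity_index` is a hypothetical name) and assumes the file is tab-separated with a header row.

```python
import csv

REQUIRED_COLUMNS = ("complex_id", "protein_file", "ligand_file", "pK")

def read_affinity_index(path):
    """Read a tab-separated affinity index and return a list of row dicts."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        rows = []
        for row in reader:
            row["pK"] = float(row["pK"])  # raises ValueError on a non-numeric pK
            rows.append(row)
    return rows
```

The optional `pdb_id` column is passed through unchanged when present.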

Directory Structure 

BindingDB_data/
├── bindingdb_affinity.tsv     # Main index file
├── proteins/                  # Protein structures
│   ├── 1abc_protein.mol2
│   ├── 2def_protein.mol2
│   └── ...
└── ligands/                   # Ligand structures
    ├── lig001.mol2
    ├── lig002.mol2
    └── ...

Preparing BindingDB Data 

Step 1: Download and filter BindingDB

import pandas as pd

# Load BindingDB TSV
df = pd.read_csv('BindingDB_All.tsv', sep='\t', low_memory=False)

# Filter for entries with:
# - Ki, Kd, or IC50 values
# - Associated PDB structures
# - Standard conditions

filtered = df[
    (df['Ki (nM)'].notna() | df['Kd (nM)'].notna() | df['IC50 (nM)'].notna()) &
    (df['PDB ID(s) of Target Chain'].notna())
]

print(f"Filtered to {len(filtered)} entries with binding data and structures")

Step 2: Obtain 3D structures

For each entry, you need protein and ligand 3D structures:

import urllib.request

from rdkit import Chem
from rdkit.Chem import AllChem

def prepare_ligand(smiles, output_file):
    """Generate a 3D ligand structure from SMILES and write it as a MOL file."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    mol = Chem.AddHs(mol)
    # EmbedMolecule returns -1 when no conformer could be generated
    if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
        raise RuntimeError(f"3D embedding failed for: {smiles}")
    AllChem.MMFFOptimizeMolecule(mol)
    Chem.MolToMolFile(mol, output_file)

def download_protein(pdb_id, output_file):
    """Download a protein structure from the RCSB PDB."""
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    urllib.request.urlretrieve(url, output_file)

Step 3: Convert to MOL2 format

PandaDock requires MOL2 format for atom typing:

# Using Open Babel
obabel protein.pdb -O protein.mol2
obabel ligand.sdf -O ligand.mol2

Step 4: Create the TSV index

import pandas as pd
import numpy as np

def convert_to_pk(value_nm):
    """Convert nM to pK (-log10 M); return None for missing or invalid values."""
    # Values read from the TSV may be strings (for example with ">" or "<"
    # qualifiers); coerce anything non-numeric to NaN
    value_nm = pd.to_numeric(value_nm, errors='coerce')
    if pd.isna(value_nm) or value_nm <= 0:
        return None
    return -np.log10(value_nm * 1e-9)

# Create index file
records = []
for idx, row in filtered.iterrows():
    # Get pK value (prefer Ki > Kd > IC50), using the first column that converts
    pk = None
    for column in ('Ki (nM)', 'Kd (nM)', 'IC50 (nM)'):
        pk = convert_to_pk(row.get(column))
        if pk is not None:
            break

    # The PDB column may list several comma-separated IDs; use the first one
    pdb_id = str(row['PDB ID(s) of Target Chain']).split(',')[0].strip()

    if pk is not None and 2.0 <= pk <= 12.0:  # Filter reasonable range
        records.append({
            'complex_id': f'complex_{idx}',
            'protein_file': f'proteins/{pdb_id}_protein.mol2',
            'ligand_file': f'ligands/lig_{idx}.mol2',
            'pK': pk,
            'pdb_id': pdb_id
        })

df_index = pd.DataFrame(records)
df_index.to_csv('bindingdb_affinity.tsv', sep='\t', index=False)

Training on BindingDB 

# BindingDB only
pandadock gnn train -b bindingdb_affinity.tsv -o models/bindingdb_model/

# Combined with ULVSH (recommended)
pandadock gnn train -b bindingdb_affinity.tsv -d ULVSH/ -o models/combined/ \
    --balanced --epochs 100

—

Combined Training 

Training on multiple datasets improves generalization. Use the --balanced flag to prevent larger datasets from dominating training.
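This guide does not describe how --balanced is implemented. One common way to balance sampling, shown here purely as an illustration and not as PandaDock's method, is to weight each sample inversely to the size of its dataset so that every dataset contributes the same total sampling weight. The function name `balanced_sample_weights` is hypothetical.

```python
def balanced_sample_weights(dataset_sizes):
    """Return per-sample weights so each dataset has equal total sampling weight."""
    n_datasets = len(dataset_sizes)
    return {name: 1.0 / (n_datasets * size) for name, size in dataset_sizes.items()}

# Dataset sizes taken from the overview table
weights = balanced_sample_weights({"ULVSH": 942, "PDBbind": 5316})
```

Under this scheme each ULVSH sample is drawn more often than each PDBbind sample, and both datasets account for half of the draws in expectation.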

Recommended Combinations 

Best accuracy on diverse targets:

pandadock gnn train -d ULVSH/ -p PDBbind/ -o models/ \
    --balanced --epochs 200 --batch-size 32

Best for BindingDB-like screening:

pandadock gnn train -b bindingdb.tsv -d ULVSH/ -o models/ \
    --balanced --epochs 100

All datasets combined:

pandadock gnn train -d ULVSH/ -p PDBbind/ -b bindingdb.tsv -o models/ \
    --balanced --epochs 200

Warning

Combining PDBbind with other datasets may reduce performance due to affinity scale differences (pKd vs pEC50). Test on your specific use case.

—

Training Options Reference 

Option	Default	Description
`-d/--dataset`	None	Path to ULVSH dataset directory
`-p/--pdbbind`	None	Path to PDBbind dataset directory
`-b/--bindingdb`	None	Path to BindingDB TSV file
`-o/--output`	Required	Output directory for model checkpoints
`--epochs`	100	Number of training epochs
`--batch-size`	32	Batch size (reduce if out of memory)
`--lr`	1e-4	Learning rate
`--hidden-dim`	256	Hidden layer dimension
`--num-layers`	6	Number of EGNN message passing layers
`--dropout`	0.1	Dropout rate for regularization
`--patience`	20	Early stopping patience (epochs without improvement)
`--balanced`	False	Balance sampling across datasets
`--gpu/--cpu`	--gpu	Use GPU if available
`--seed`	42	Random seed for reproducibility

—

Troubleshooting 

“No valid complexes found”

Check that MOL2 files exist at the paths specified
Verify MOL2 files are valid (not empty, proper format)
Check that protein and ligand files are paired correctly
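These checks can be automated for a TSV index. The sketch below assumes that the paths in `protein_file` and `ligand_file` are relative to the directory containing the index file, which matches the BindingDB_data/ layout shown earlier but is not confirmed by this guide; `find_missing_structures` is a hypothetical helper.

```python
import csv
from pathlib import Path

def find_missing_structures(index_path):
    """Return (complex_id, path) pairs whose structure file is missing or empty."""
    index_path = Path(index_path)
    base_dir = index_path.parent  # assumption: paths are relative to the index file
    problems = []
    with open(index_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in ("protein_file", "ligand_file"):
                structure = base_dir / row[column]
                if not structure.is_file() or structure.stat().st_size == 0:
                    problems.append((row["complex_id"], str(structure)))
    return problems
```

An empty list means every referenced file exists and is non-empty; it does not verify that the MOL2 content itself is well formed.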

“Edge dimension mismatch”

Ensure you’re using the latest version of PandaDock
Edge features should be 23 dimensions (check featurizer)

Out of memory errors

Reduce --batch-size (try 16 or 8)
Use --cpu if GPU memory is limited
Filter out very large proteins (>1000 atoms)
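To apply the size filter suggested above, the atom count can be read from the MOL2 header without loading the whole structure. In the Tripos MOL2 format, the second line after `@<TRIPOS>MOLECULE` starts with the number of atoms. `count_mol2_atoms` is a hypothetical helper, and the 1000-atom threshold is simply the one quoted in this guide.

```python
def count_mol2_atoms(path):
    """Return the atom count declared in a MOL2 file's MOLECULE record."""
    with open(path) as handle:
        lines = iter(handle)
        for line in lines:
            if line.strip() == "@<TRIPOS>MOLECULE":
                next(lines)                    # molecule name line
                counts = next(lines).split()   # num_atoms [num_bonds ...]
                return int(counts[0])
    raise ValueError(f"No @<TRIPOS>MOLECULE record found in {path}")
```

A filter could then keep only files for which `count_mol2_atoms(path) <= 1000`.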

Poor correlation (R < 0.5)

Check data quality (correct binding affinity values?)
Ensure structures are properly prepared (hydrogens added?)
Try training longer (--epochs 200)
Use --balanced when combining datasets

—

Best Practices 

Data Quality: Ensure binding affinities are accurate and structures are properly minimized
Structure Preparation:
- Add hydrogens to all structures
- Minimize/optimize ligand geometry
- Use consistent protonation states
Binding Site Extraction:
- Use 10Å radius around ligand centroid
- Include all residues with atoms in the cutoff
Training:
- Start with default hyperparameters
- Monitor validation R during training
- Use early stopping to prevent overfitting
Evaluation:
- Always evaluate on held-out test set
- Report Pearson R, Spearman ρ, RMSE, and MAE
- Compare against baseline methods

—

Example: End-to-End Training 

Here’s a complete example of preparing data and training a model:

# 1. Prepare your data (assuming you have structures ready)
mkdir -p my_dataset/proteins my_dataset/ligands

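# 2. Create index file (printf writes real tab characters between columns)
printf 'complex_id\tprotein_file\tligand_file\tpK\n' > my_dataset/affinity.tsv
printf 'complex_1\tproteins/prot1.mol2\tligands/lig1.mol2\t7.5\n' >> my_dataset/affinity.tsv
printf 'complex_2\tproteins/prot2.mol2\tligands/lig2.mol2\t6.2\n' >> my_dataset/affinity.tsv
printf 'complex_3\tproteins/prot3.mol2\tligands/lig3.mol2\t8.1\n' >> my_dataset/affinity.tsv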

# 3. Train the model
pandadock gnn train -b my_dataset/affinity.tsv -o my_model/ \
    --epochs 100 --batch-size 16

# 4. Evaluate on test set
pandadock gnn benchmark -m my_model/best_model.pt -b my_dataset/affinity.tsv

# 5. Use for prediction
pandadock gnn predict -m my_model/best_model.pt \
    -p new_protein.mol2 -l new_ligand.mol2