Dataset Preparation Guide
=========================

This guide covers how to prepare datasets for training PandaDock-GNN models.
PandaDock supports three dataset formats: ULVSH, PDBbind, and BindingDB.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

PandaDock-GNN can be trained on:

+------------+------------------+------------------+-------------------+
| Dataset    | Complexes        | Affinity Type    | Best Test R       |
+============+==================+==================+===================+
| PDBbind    | 5,316            | pKd/pKi          | 0.88              |
+------------+------------------+------------------+-------------------+
| ULVSH      | 942              | pEC50            | 0.82              |
+------------+------------------+------------------+-------------------+
| BindingDB  | 8,891+           | pK (various)     | 0.81              |
+------------+------------------+------------------+-------------------+

You can train on any single dataset or combine them for better generalization.

----

ULVSH Dataset
-------------

The ULVSH (Ultra-Large Virtual Screening Hits) dataset contains 942 compounds
across 10 protein targets with experimental EC50 values.

Directory Structure
~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   ULVSH/
   ├── ADRA2B/                    # Target name (one of 10 targets)
   │   ├── raw/
   │   │   └── vitro.tsv          # Experimental binding data
   │   └── minimized/
   │       ├── ZINC000001234567/  # Compound directory (ZINC ID)
   │       │   ├── protein.mol2   # Full protein structure
   │       │   ├── ligand.mol2    # Ligand structure (docked pose)
   │       │   └── site.mol2      # Binding site atoms only
   │       └── ZINC000007654321/
   │           └── ...
   ├── CNR1/
   │   └── ...
   └── DRD4/
       └── ...

Required Files
~~~~~~~~~~~~~~

**vitro.tsv** - Tab- or whitespace-separated file with experimental data:

.. code-block:: text

   ID                  EC50[uM]    Active
   ZINC000001234567    0.5         Yes
   ZINC000007654321    15.2        No
   ZINC000009876543    n.d.        No

- ``ID``: Compound identifier (must match the compound directory name under ``minimized/``)
- ``EC50[uM]``: EC50 value in micromolar (use "n.d." for not determined)
- ``Active``: "Yes" or "No" for activity classification
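
As a minimal sketch of how such a file could be read, the snippet below parses
whitespace-separated rows and converts EC50 values in µM to pEC50 using
pEC50 = 6 - log10(EC50 in µM). The helper name ``parse_vitro`` is hypothetical
and not part of PandaDock. How PandaDock itself treats "n.d." entries is not
specified here, so the sketch simply maps them to ``None``.

.. code-block:: python

   import math

   SAMPLE = """\
   ID                  EC50[uM]    Active
   ZINC000001234567    0.5         Yes
   ZINC000007654321    15.2        No
   ZINC000009876543    n.d.        No
   """

   def parse_vitro(text):
       """Map compound ID -> pEC50 (None when EC50 is not a positive number)."""
       lines = [line for line in text.splitlines() if line.strip()]
       header = lines[0].split()
       id_col = header.index("ID")
       ec50_col = header.index("EC50[uM]")

       pec50 = {}
       for line in lines[1:]:
           fields = line.split()
           try:
               ec50_um = float(fields[ec50_col])
           except ValueError:
               ec50_um = None  # for example "n.d."
           if ec50_um is not None and ec50_um > 0:
               # EC50 in M = EC50 in uM * 1e-6, so pEC50 = 6 - log10(EC50 in uM)
               pec50[fields[id_col]] = 6.0 - math.log10(ec50_um)
           else:
               pec50[fields[id_col]] = None
       return pec50

   values = parse_vitro(SAMPLE)
   for compound_id, value in values.items():
       print(compound_id, value)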

**MOL2 files** - Three files per compound, all in standard Tripos MOL2 format:

- ``protein.mol2``: Full protein structure with all atoms
- ``ligand.mol2``: Ligand in docked/bound pose
- ``site.mol2``: Binding site atoms (typically within 5-10Å of ligand)
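
A quick sanity check before training is to confirm that each MOL2 file has a
``@<TRIPOS>ATOM`` record with at least one atom. The sketch below uses only the
Python standard library. ``read_mol2_atoms`` is a hypothetical helper name, and
the code assumes the usual Tripos atom line layout (ID, name, x, y, z, type, ...).

.. code-block:: python

   def read_mol2_atoms(text):
       """Return a list of (atom_name, x, y, z) from the @<TRIPOS>ATOM record."""
       atoms = []
       in_atom_record = False
       for line in text.splitlines():
           stripped = line.strip()
           if stripped.startswith("@<TRIPOS>"):
               # A new record starts; only the ATOM record is read here
               in_atom_record = stripped == "@<TRIPOS>ATOM"
               continue
           if in_atom_record and stripped:
               fields = stripped.split()
               x, y, z = (float(value) for value in fields[2:5])
               atoms.append((fields[1], x, y, z))
       return atoms

   EXAMPLE = """\
   @<TRIPOS>MOLECULE
   water
   3 2 1
   SMALL
   NO_CHARGES
   @<TRIPOS>ATOM
   1 O1   0.000  0.000  0.000 O.3 1 HOH 0.0
   2 H1   0.957  0.000  0.000 H   1 HOH 0.0
   3 H2  -0.240  0.927  0.000 H   1 HOH 0.0
   @<TRIPOS>BOND
   1 1 2 1
   2 1 3 1
   """

   atoms = read_mol2_atoms(EXAMPLE)
   print(len(atoms), "atoms read")

An empty result for any of the three files is a sign that the file is
truncated or not in MOL2 format.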

Preparing ULVSH Data
~~~~~~~~~~~~~~~~~~~~

1. **Obtain structures**: Get protein-ligand complexes from docking or crystal structures

2. **Extract binding site**:

   .. code-block:: python

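      from rdkit import Chem
      import numpy as np

      def extract_binding_site(protein_file, ligand_file, radius=10.0):
          """Return indices of protein atoms within radius (Å) of the ligand centroid."""
          # Load structures, keeping explicit hydrogens
          protein = Chem.MolFromMol2File(protein_file, removeHs=False)
          ligand = Chem.MolFromMol2File(ligand_file, removeHs=False)
          if protein is None or ligand is None:
              raise ValueError("RDKit could not parse one of the MOL2 files")

          # Get ligand centroid
          ligand_coords = ligand.GetConformer().GetPositions()
          centroid = ligand_coords.mean(axis=0)

          # Filter protein atoms by distance to the centroid
          protein_coords = protein.GetConformer().GetPositions()
          distances = np.linalg.norm(protein_coords - centroid, axis=1)
          site_atoms = np.flatnonzero(distances <= radius)

          # RDKit has no MOL2 writer, so write the selected atoms to
          # site.mol2 with another tool (for example Open Babel)
          return site_atoms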

3. **Create vitro.tsv**: Compile experimental binding data from literature or assays

4. **Organize directories**: Follow the structure shown above
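
Once the directories are in place, a short script can confirm that the layout
matches the structure shown above. This is a sketch using only the standard
library; ``check_ulvsh_layout`` is a hypothetical helper and not a PandaDock
command.

.. code-block:: python

   from pathlib import Path

   REQUIRED_FILES = ("protein.mol2", "ligand.mol2", "site.mol2")

   def check_ulvsh_layout(root):
       """Return a list of human-readable problems found under a ULVSH root."""
       problems = []
       for target_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
           if not (target_dir / "raw" / "vitro.tsv").is_file():
               problems.append(f"{target_dir.name}: missing raw/vitro.tsv")
           minimized = target_dir / "minimized"
           if not minimized.is_dir():
               problems.append(f"{target_dir.name}: missing minimized/")
               continue
           for compound_dir in sorted(p for p in minimized.iterdir() if p.is_dir()):
               for filename in REQUIRED_FILES:
                   if not (compound_dir / filename).is_file():
                       problems.append(
                           f"{target_dir.name}/{compound_dir.name}: missing {filename}"
                       )
       return problems

Calling ``check_ulvsh_layout("ULVSH/")`` and printing the returned list gives a
quick overview of missing files before a training run is started.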

Training on ULVSH
~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Basic training
   pandadock gnn train -d ULVSH/ -o models/ulvsh_model/

   # With custom parameters
   pandadock gnn train -d ULVSH/ -o models/ \
       --epochs 100 \
       --batch-size 32 \
       --hidden-dim 256 \
       --num-layers 6

----

PDBbind Dataset
---------------

PDBbind is a curated database of protein-ligand complexes with experimentally
measured binding affinities. The v2020 refined set contains 5,316 complexes.

Obtaining PDBbind
~~~~~~~~~~~~~~~~~

1. Register at http://www.pdbbind.org.cn/
2. Download "PDBbind v2020 refined set"
3. Extract to create the directory structure below

Directory Structure
~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   PDBbind/
   ├── PDBbind_v2020/              # or just place files directly in PDBbind/
   │   ├── index/
   │   │   └── INDEX_refined_data.2020   # Binding affinity index
   │   ├── 1a1e/                   # PDB ID directory
   │   │   ├── 1a1e_protein.pdb    # Protein structure
   │   │   ├── 1a1e_pocket.pdb     # Binding pocket (pre-extracted)
   │   │   ├── 1a1e_ligand.mol2    # Ligand structure
   │   │   └── 1a1e_ligand.sdf     # Ligand in SDF format
   │   ├── 1a28/
   │   │   └── ...
   │   └── ...

Required Files
~~~~~~~~~~~~~~

**INDEX_refined_data.2020** - Space-separated index file:

.. code-block:: text

   # PDB   resolution  year  -logKd/Ki  reference
   1a1e   2.00        1998  6.52       // reference info
   1a28   1.90        1997  4.62       // reference info
   ...

- Column 1: PDB ID
- Column 2: Resolution (Å)
- Column 3: Year of structure
- Column 4: Binding affinity as pKd or pKi (-log10 of Kd/Ki in M)
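
To illustrate the column layout, the sketch below reads index content into a
dictionary of PDB ID to affinity. It assumes the simplified layout shown above
(comment lines starting with ``#`` and a reference field introduced by ``//``);
``parse_pdbbind_index`` is a hypothetical helper, not part of PandaDock.

.. code-block:: python

   def parse_pdbbind_index(text):
       """Map PDB ID -> affinity (-logKd/Ki) from index file content."""
       affinities = {}
       for line in text.splitlines():
           line = line.strip()
           if not line or line.startswith("#"):
               continue  # skip blank lines and header comments
           # Drop the trailing reference field, which starts at "//"
           data = line.split("//", 1)[0].split()
           if len(data) < 4:
               continue  # not a complete data row
           try:
               affinities[data[0]] = float(data[3])
           except ValueError:
               continue  # non-numeric affinity field
       return affinities

   EXAMPLE = """\
   # PDB   resolution  year  -logKd/Ki  reference
   1a1e   2.00        1998  6.52       // reference info
   1a28   1.90        1997  4.62       // reference info
   """

   index = parse_pdbbind_index(EXAMPLE)
   print(index)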

**Structure Files** (per complex):

- ``{pdb_id}_protein.pdb``: Full protein structure
- ``{pdb_id}_pocket.pdb``: Pre-extracted binding pocket (used for training)
- ``{pdb_id}_ligand.mol2``: Ligand structure

Training on PDBbind
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # PDBbind only
   pandadock gnn train -p PDBbind/ -o models/pdbbind_model/

   # Combined with ULVSH (recommended for generalization)
   pandadock gnn train -d ULVSH/ -p PDBbind/ -o models/combined/ --balanced

----

BindingDB Dataset
-----------------

BindingDB is a public database of measured binding affinities. We provide
tools to prepare BindingDB data for PandaDock training.

Obtaining BindingDB Data
~~~~~~~~~~~~~~~~~~~~~~~~

1. Download from https://www.bindingdb.org/bind/index.jsp
2. Select "Download" → "BindingDB_All_2D_NoSmi.tsv.zip" or use the API
3. Filter for entries with 3D structures

TSV File Format
~~~~~~~~~~~~~~~

Create a TSV file with paths to protein and ligand structures:

.. code-block:: text

   complex_id    protein_file              ligand_file           pK       pdb_id
   complex_001   proteins/1abc_protein.mol2   ligands/lig001.mol2   7.52     1abc
   complex_002   proteins/2def_protein.mol2   ligands/lig002.mol2   6.31     2def
   ...

Required columns:

- ``complex_id``: Unique identifier for the complex
- ``protein_file``: Path to protein MOL2 file
- ``ligand_file``: Path to ligand MOL2 file
- ``pK``: Binding affinity as pKd, pKi, or pIC50 (-log10 scale)
- ``pdb_id`` (optional): PDB ID if available
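
Before training, it can help to verify that the index has the required columns,
that every ``pK`` is numeric, and that the referenced files exist. The sketch
below assumes that file paths are relative to the directory containing the TSV
file, as the example paths suggest; ``check_index`` is a hypothetical helper.

.. code-block:: python

   import csv
   from pathlib import Path

   REQUIRED_COLUMNS = ("complex_id", "protein_file", "ligand_file", "pK")

   def check_index(tsv_path):
       """Return a list of problems found in a TSV index file."""
       tsv_path = Path(tsv_path)
       base_dir = tsv_path.parent  # assumption: paths are relative to the TSV
       problems = []
       with tsv_path.open(newline="") as handle:
           reader = csv.DictReader(handle, delimiter="\t")
           fieldnames = reader.fieldnames or []
           missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
           if missing:
               return [f"missing columns: {', '.join(missing)}"]
           for row in reader:
               complex_id = row["complex_id"]
               try:
                   float(row["pK"])
               except (TypeError, ValueError):
                   problems.append(f"{complex_id}: pK is not numeric")
               for column in ("protein_file", "ligand_file"):
                   relative = row[column] or ""
                   if not (base_dir / relative).is_file():
                       problems.append(f"{complex_id}: {relative} not found")
       return problems

An empty list from ``check_index("BindingDB_data/bindingdb_affinity.tsv")``
means every row passed these basic checks.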

Directory Structure
~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   BindingDB_data/
   ├── bindingdb_affinity.tsv     # Main index file
   ├── proteins/                  # Protein structures
   │   ├── 1abc_protein.mol2
   │   ├── 2def_protein.mol2
   │   └── ...
   └── ligands/                   # Ligand structures
       ├── lig001.mol2
       ├── lig002.mol2
       └── ...

Preparing BindingDB Data
~~~~~~~~~~~~~~~~~~~~~~~~

**Step 1: Download and filter BindingDB**

.. code-block:: python

   import pandas as pd

   # Load BindingDB TSV
   df = pd.read_csv('BindingDB_All.tsv', sep='\t', low_memory=False)

   # Keep entries that have a Ki, Kd, or IC50 value and
   # an associated PDB structure for the target chain

   filtered = df[
       (df['Ki (nM)'].notna() | df['Kd (nM)'].notna() | df['IC50 (nM)'].notna()) &
       (df['PDB ID(s) of Target Chain'].notna())
   ]

   print(f"Filtered to {len(filtered)} entries with binding data and structures")

**Step 2: Obtain 3D structures**

For each entry, you need protein and ligand 3D structures:

.. code-block:: python

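   import urllib.request

   from rdkit import Chem
   from rdkit.Chem import AllChem

   def prepare_ligand(smiles, output_file):
       """Generate a 3D ligand conformer from SMILES and write it as a MOL file.

       The conformer is in an arbitrary coordinate frame, not a bound pose.
       """
       mol = Chem.MolFromSmiles(smiles)
       if mol is None:
           raise ValueError(f"Could not parse SMILES: {smiles}")
       mol = Chem.AddHs(mol)
       # EmbedMolecule returns -1 when no conformer could be generated
       if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
           raise RuntimeError(f"3D embedding failed for: {smiles}")
       AllChem.MMFFOptimizeMolecule(mol)
       Chem.MolToMolFile(mol, output_file)

   def download_protein(pdb_id, output_file):
       """Download protein structure from PDB."""
       url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
       urllib.request.urlretrieve(url, output_file)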

**Step 3: Convert to MOL2 format**

PandaDock requires MOL2 format for atom typing:

.. code-block:: bash

   # Using Open Babel
   obabel protein.pdb -O protein.mol2
   obabel ligand.sdf -O ligand.mol2

**Step 4: Create the TSV index**

.. code-block:: python

   import pandas as pd
   import numpy as np

   def convert_to_pk(value_nm):
       """Convert a value in nM to pK (-log10 of the molar value)."""
       # BindingDB stores some values as text such as ">10000";
       # anything that is not a plain number is treated as missing
       value_nm = pd.to_numeric(value_nm, errors='coerce')
       if pd.isna(value_nm) or value_nm <= 0:
           return None
       return float(-np.log10(value_nm * 1e-9))

   # Create index file from the ``filtered`` DataFrame built in Step 1
   records = []
   for idx, row in filtered.iterrows():
       # Get pK value (prefer Ki > Kd > IC50)
       pk = None
       for column in ('Ki (nM)', 'Kd (nM)', 'IC50 (nM)'):
           pk = convert_to_pk(row[column])
           if pk is not None:
               break

       # The PDB column can list several comma-separated IDs; use the first
       pdb_id = str(row['PDB ID(s) of Target Chain']).split(',')[0].strip()

       if pk is not None and 2.0 <= pk <= 12.0:  # Filter reasonable range
           records.append({
               'complex_id': f'complex_{idx}',
               'protein_file': f'proteins/{pdb_id}_protein.mol2',
               'ligand_file': f'ligands/lig_{idx}.mol2',
               'pK': pk,
               'pdb_id': pdb_id
           })

   df_index = pd.DataFrame(records)
   df_index.to_csv('bindingdb_affinity.tsv', sep='\t', index=False)
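
The conversion in ``convert_to_pk`` follows directly from the unit change from
nM to M:

.. math::

   \mathrm{p}K = -\log_{10}\left(c_{\mathrm{nM}} \times 10^{-9}\right)
               = 9 - \log_{10} c_{\mathrm{nM}}

For example, a value of 1 nM gives pK = 9 and a value of 1000 nM (1 µM) gives
pK = 6. The filter range of 2.0 to 12.0 therefore corresponds to
concentrations between 10 mM and 1 pM.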

Training on BindingDB
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # BindingDB only
   pandadock gnn train -b bindingdb_affinity.tsv -o models/bindingdb_model/

   # Combined with ULVSH (recommended)
   pandadock gnn train -b bindingdb_affinity.tsv -d ULVSH/ -o models/combined/ \
       --balanced --epochs 100

----

Combined Training
-----------------

Training on multiple datasets can improve generalization. Use the ``--balanced``
flag to prevent larger datasets from dominating training.
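
One common way to balance datasets of different sizes is to draw each sample
with a probability inversely proportional to the size of its source dataset,
so that every dataset contributes equally in expectation. The sketch below
illustrates that idea with the complex counts from the overview table. It is
an assumption about what balancing can look like, not a description of how
``--balanced`` is implemented in PandaDock, and ``balanced_weights`` is a
hypothetical helper.

.. code-block:: python

   import random

   def balanced_weights(dataset_sizes):
       """Return per-sample weights so each dataset has the same total weight."""
       n_datasets = len(dataset_sizes)
       return {
           name: 1.0 / (n_datasets * size)
           for name, size in dataset_sizes.items()
       }

   sizes = {"PDBbind": 5316, "ULVSH": 942}
   weights = balanced_weights(sizes)

   # One (dataset, index) entry per complex, each with its dataset's weight
   samples = [(name, i) for name, size in sizes.items() for i in range(size)]
   sample_weights = [weights[name] for name, _ in samples]

   # Draw one batch; both datasets are equally likely for every draw
   rng = random.Random(42)
   batch = rng.choices(samples, weights=sample_weights, k=32)

With these weights, a single ULVSH complex is drawn roughly 5.6 times as often
as a single PDBbind complex (5316 / 942), which offsets the size difference.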

Recommended Combinations
~~~~~~~~~~~~~~~~~~~~~~~~

**Best accuracy on diverse targets:**

.. code-block:: bash

   pandadock gnn train -d ULVSH/ -p PDBbind/ -o models/ \
       --balanced --epochs 200 --batch-size 32

**Best for BindingDB-like screening:**

.. code-block:: bash

   pandadock gnn train -b bindingdb_affinity.tsv -d ULVSH/ -o models/ \
       --balanced --epochs 100

**All datasets combined:**

.. code-block:: bash

   pandadock gnn train -d ULVSH/ -p PDBbind/ -b bindingdb_affinity.tsv -o models/ \
       --balanced --epochs 200

.. warning::

   Combining PDBbind with other datasets may reduce performance due to
   affinity scale differences (pKd vs pEC50). Test on your specific use case.

----

Training Options Reference
--------------------------

+--------------------+-----------+------------------------------------------------------+
| Option             | Default   | Description                                          |
+====================+===========+======================================================+
| ``-d/--dataset``   | None      | Path to ULVSH dataset directory                      |
+--------------------+-----------+------------------------------------------------------+
| ``-p/--pdbbind``   | None      | Path to PDBbind dataset directory                    |
+--------------------+-----------+------------------------------------------------------+
| ``-b/--bindingdb`` | None      | Path to BindingDB TSV file                           |
+--------------------+-----------+------------------------------------------------------+
| ``-o/--output``    | Required  | Output directory for model checkpoints               |
+--------------------+-----------+------------------------------------------------------+
| ``--epochs``       | 100       | Number of training epochs                            |
+--------------------+-----------+------------------------------------------------------+
| ``--batch-size``   | 32        | Batch size (reduce if out of memory)                 |
+--------------------+-----------+------------------------------------------------------+
| ``--lr``           | 1e-4      | Learning rate                                        |
+--------------------+-----------+------------------------------------------------------+
| ``--hidden-dim``   | 256       | Hidden layer dimension                               |
+--------------------+-----------+------------------------------------------------------+
| ``--num-layers``   | 6         | Number of EGNN message passing layers                |
+--------------------+-----------+------------------------------------------------------+
| ``--dropout``      | 0.1       | Dropout rate for regularization                      |
+--------------------+-----------+------------------------------------------------------+
| ``--patience``     | 20        | Early stopping patience (epochs without improvement) |
+--------------------+-----------+------------------------------------------------------+
| ``--balanced``     | False     | Balance sampling across datasets                     |
+--------------------+-----------+------------------------------------------------------+
| ``--gpu/--cpu``    | ``--gpu`` | Use GPU if available                                 |
+--------------------+-----------+------------------------------------------------------+
| ``--seed``         | 42        | Random seed for reproducibility                      |
+--------------------+-----------+------------------------------------------------------+

----

Troubleshooting
---------------

**"No valid complexes found"**

- Check that MOL2 files exist at the paths specified
- Verify MOL2 files are valid (not empty, proper format)
- Check that protein and ligand files are paired correctly

**"Edge dimension mismatch"**

- Ensure you're using the latest version of PandaDock
- Edge features should have 23 dimensions (check the featurizer output)

**Out of memory errors**

- Reduce ``--batch-size`` (try 16 or 8)
- Use ``--cpu`` if GPU memory is limited
- Filter out very large proteins (>1000 atoms)

**Poor correlation (R < 0.5)**

- Check data quality (are the binding affinity values correct and on the -log10 scale?)
- Ensure structures are properly prepared (have hydrogens been added?)
- Try training longer (``--epochs 200``)
- Use ``--balanced`` when combining datasets

----

Best Practices
--------------

1. **Data Quality**: Ensure binding affinities are accurate and structures are properly minimized

2. **Structure Preparation**:

   - Add hydrogens to all structures
   - Minimize/optimize ligand geometry
   - Use consistent protonation states

3. **Binding Site Extraction**:

   - Use a 10 Å radius around the ligand centroid
   - Include every residue that has at least one atom within the cutoff

4. **Training**:

   - Start with default hyperparameters
   - Monitor validation R during training
   - Use early stopping to prevent overfitting

5. **Evaluation**:

   - Always evaluate on held-out test set
   - Report Pearson R, Spearman ρ, RMSE, and MAE
   - Compare against baseline methods
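
The four metrics named above can be computed without extra dependencies. The
following is a minimal reference sketch in plain Python; the function names
are hypothetical, and in practice an established library such as SciPy can be
used instead.

.. code-block:: python

   import math

   def _ranks(values):
       """Average ranks (1-based); tied values share their mean rank."""
       order = sorted(range(len(values)), key=lambda i: values[i])
       ranks = [0.0] * len(values)
       start = 0
       while start < len(order):
           end = start
           while (end + 1 < len(order)
                  and values[order[end + 1]] == values[order[start]]):
               end += 1
           mean_rank = (start + end) / 2.0 + 1.0
           for k in range(start, end + 1):
               ranks[order[k]] = mean_rank
           start = end + 1
       return ranks

   def pearson(x, y):
       """Pearson correlation coefficient of two equal-length sequences."""
       mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
       cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
       var_x = sum((a - mean_x) ** 2 for a in x)
       var_y = sum((b - mean_y) ** 2 for b in y)
       return cov / math.sqrt(var_x * var_y)

   def regression_metrics(y_true, y_pred):
       """Return Pearson R, Spearman rho, RMSE, and MAE as a dictionary."""
       errors = [p - t for t, p in zip(y_true, y_pred)]
       return {
           "pearson_r": pearson(y_true, y_pred),
           # Spearman rho is the Pearson correlation of the ranks
           "spearman_rho": pearson(_ranks(y_true), _ranks(y_pred)),
           "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
           "mae": sum(abs(e) for e in errors) / len(errors),
       }

   metrics = regression_metrics([6.5, 4.6, 7.5, 8.1], [6.1, 5.0, 7.2, 8.4])
   print(metrics)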

----

Example: End-to-End Training
----------------------------

Here's a complete example of preparing data and training a model:

.. code-block:: bash

   # 1. Prepare your data (assuming you have structures ready)
   mkdir -p my_dataset/proteins my_dataset/ligands

   # 2. Create index file
   cat > my_dataset/affinity.tsv << EOF
   complex_id	protein_file	ligand_file	pK
   complex_1	proteins/prot1.mol2	ligands/lig1.mol2	7.5
   complex_2	proteins/prot2.mol2	ligands/lig2.mol2	6.2
   complex_3	proteins/prot3.mol2	ligands/lig3.mol2	8.1
   EOF

   # 3. Train the model
   pandadock gnn train -b my_dataset/affinity.tsv -o my_model/ \
       --epochs 100 --batch-size 16

   # 4. Evaluate on test set
   pandadock gnn benchmark -m my_model/best_model.pt -b my_dataset/affinity.tsv

   # 5. Use for prediction
   pandadock gnn predict -m my_model/best_model.pt \
       -p new_protein.mol2 -l new_ligand.mol2