Empirical Scoring
=================

The empirical scoring function is optimized for **ultra-fast virtual screening**. It uses statistical potentials derived from protein-ligand databases to evaluate binding poses quickly, trading some accuracy for speed (R = 0.72 against experimental binding affinities).

Overview
--------

**Scoring ID:** ``empirical``

**Type:** Knowledge-based statistical scoring

**Accuracy:** R = 0.72 correlation with experimental binding affinities (PDBBind Core)

**Speed:** 0.001-0.005 seconds per pose (10-50x faster than physics-based scoring)

**Best for:** Virtual screening, large library docking, initial filtering, rapid pose evaluation

Algorithm
---------

The empirical scoring function uses statistical potentials derived from known protein-ligand complexes:

.. math::

   S_{total} = S_{contact} + S_{lipophilic} + S_{hbond} + S_{metal} + S_{flexibility}

Scoring Components
^^^^^^^^^^^^^^^^^^

1. **Contact Score**

   .. math::

      S_{contact} = \sum_{i,j} w_{ij} \cdot f(d_{ij})

   * Atom-type pair potentials
   * Distance-dependent statistical preferences
   * Derived from observed contact frequencies in PDB

2. **Lipophilic Score**

   * Hydrophobic-hydrophobic contact rewards
   * Surface complementarity bonus
   * Burial of hydrophobic surface area

3. **Hydrogen Bond Score**

   * Geometry-independent H-bond detection
   * Fixed weight per hydrogen bond
   * Faster than physics-based H-bond evaluation

4. **Metal Coordination Score**

   * Bonus for coordinating metal ions
   * Simple distance-based detection
   * Fixed weights per metal type

5. **Flexibility Penalty**

   * Penalty for rotatable bonds
   * Accounts for conformational entropy loss
   * Simpler than torsional energy calculation
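
The additive form of the five terms above can be sketched in a few lines of Python. This is only an illustration of the structure, not PandaDock's implementation: the pair weights, the linear ramp used for ``f(d)``, the cutoff, and the per-bond penalty are all hypothetical placeholders.

.. code-block:: python

   import math

   # Hypothetical parameters, chosen only for illustration.
   PAIR_WEIGHTS = {("C", "C"): -0.2, ("N", "O"): -0.5}  # w_ij per atom-type pair
   CONTACT_CUTOFF = 6.0   # angstroms; pairs beyond this contribute nothing
   ROTOR_PENALTY = 0.3    # penalty per rotatable bond


   def distance_term(d, cutoff=CONTACT_CUTOFF):
       """Hypothetical f(d): 1 at zero distance, falling linearly to 0 at the cutoff."""
       return max(0.0, 1.0 - d / cutoff)


   def contact_score(ligand_atoms, protein_atoms):
       """S_contact = sum over ligand/protein atom pairs of w_ij * f(d_ij)."""
       total = 0.0
       for l_type, l_xyz in ligand_atoms:
           for p_type, p_xyz in protein_atoms:
               # Look up the weight independent of atom order; unknown pairs score 0.
               w = PAIR_WEIGHTS.get(tuple(sorted((l_type, p_type))), 0.0)
               total += w * distance_term(math.dist(l_xyz, p_xyz))
       return total


   def total_score(contact, lipophilic, hbond, metal, n_rotatable_bonds):
       """S_total as the plain sum of the five components."""
       flexibility = ROTOR_PENALTY * n_rotatable_bonds
       return contact + lipophilic + hbond + metal + flexibility

With these placeholder values, a single C-C contact at 3 Å contributes -0.2 × 0.5 = -0.1, and two rotatable bonds add a penalty of 0.6.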

Training Data
^^^^^^^^^^^^^

The empirical parameters were optimized on:

* **PDBBind General Set:** 10,000+ protein-ligand complexes
* **Refined Set:** High-quality structures with experimental affinities
* **Diverse Set:** Covering all protein families
* **Validation:** CASF-2016, Astex Diverse Set

Usage
-----

Basic Usage
^^^^^^^^^^^

.. code-block:: bash

   pandadock dock -r protein.pdb -l ligand.sdf \
                  --scoring empirical \
                  --center 10 20 30 --box 20 20 20

Virtual Screening
^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock dock -r target.pdb -l library_10k.sdf \
                  --algorithm monte_carlo_cpu \
                  --scoring empirical \
                  --fast \
                  --num-poses 3 \
                  -o screening_results/

With GPU Acceleration
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock dock -r target.pdb -l library.sdf \
                  --algorithm cuda_monte_carlo \
                  --scoring empirical \
                  --gpu \
                  --gpu-batch-size 2000 \
                  -o ultra_fast_screening/

Expected throughput: 5000-7200 ligands/hour

Performance Characteristics
---------------------------

Accuracy Benchmarks
^^^^^^^^^^^^^^^^^^^

+------------------+------------------+----------------+
| Dataset          | Correlation (R)  | RMSE (kcal/mol)|
+==================+==================+================+
| PDBBind Core     | 0.72             | 2.35           |
+------------------+------------------+----------------+
| CASF-2016        | 0.68             | 2.58           |
+------------------+------------------+----------------+
| Astex Diverse    | 0.70             | 2.42           |
+------------------+------------------+----------------+

**Note:** Lower accuracy than physics-based, but 10-50x faster
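
For readers who want to reproduce this kind of table on their own data, the two metrics can be computed as in the sketch below. It assumes paired lists of predicted scores and experimental affinities in the same units; the function names are illustrative and are not part of PandaDock.

.. code-block:: python

   import math


   def pearson_r(predicted, experimental):
       """Pearson correlation coefficient between two equal-length sequences."""
       n = len(predicted)
       mean_p = sum(predicted) / n
       mean_e = sum(experimental) / n
       cov = sum((p - mean_p) * (e - mean_e)
                 for p, e in zip(predicted, experimental))
       var_p = sum((p - mean_p) ** 2 for p in predicted)
       var_e = sum((e - mean_e) ** 2 for e in experimental)
       return cov / math.sqrt(var_p * var_e)


   def rmse(predicted, experimental):
       """Root-mean-square error, in the units of the inputs (e.g. kcal/mol)."""
       n = len(predicted)
       squared = sum((p - e) ** 2 for p, e in zip(predicted, experimental))
       return math.sqrt(squared / n)

The two numbers answer different questions: R measures how well the ranking trend is preserved, while RMSE measures the absolute error, so a scoring function can rank well and still be offset from experimental values.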

Speed Benchmarks
^^^^^^^^^^^^^^^^

* **Small ligand (<20 atoms):** 0.001-0.002 seconds/pose
* **Medium ligand (20-40 atoms):** 0.002-0.003 seconds/pose
* **Large ligand (>40 atoms):** 0.003-0.005 seconds/pose

Screening throughput:

* **CPU (monte_carlo_cpu):** 200-400 ligands/hour
* **GPU (cuda_monte_carlo):** 3600-7200 ligands/hour
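
These rates translate directly into wall-clock estimates for a given library size. The short sketch below does that arithmetic for a 50,000-compound library using the ranges quoted above; the helper name is illustrative.

.. code-block:: python

   def screening_hours(n_ligands, rate_low, rate_high):
       """Return (best_case, worst_case) hours for a ligands/hour range."""
       return n_ligands / rate_high, n_ligands / rate_low


   # Throughput ranges (ligands/hour) taken from the list above.
   CPU_RATE = (200, 400)
   GPU_RATE = (3600, 7200)

   for label, (low, high) in (("CPU", CPU_RATE), ("GPU", GPU_RATE)):
       best, worst = screening_hours(50_000, low, high)
       print(f"{label}: {best:.1f}-{worst:.1f} hours")
   # CPU: 125.0-250.0 hours
   # GPU: 6.9-13.9 hours

Actual runtimes depend on ligand size, hardware, and the number of poses requested, so these figures are planning estimates only.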

Pose Prediction Accuracy
^^^^^^^^^^^^^^^^^^^^^^^^^

* **RMSD < 2Å:** 80-85% (with monte_carlo algorithm)
* **RMSD < 2Å:** 88-92% (with enhanced_hierarchical algorithm)
* **Top pose RMSD < 2Å:** 65-75%

Lower pose prediction accuracy than physics-based, but sufficient for filtering.
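
The success criterion used here can be sketched as follows. The sketch assumes that pose and reference coordinates list the same atoms in the same order, computes RMSD in place (no superposition), and ignores ligand symmetry; this page does not state how PandaDock handles those details.

.. code-block:: python

   import math


   def pose_rmsd(pose, reference):
       """RMSD in angstroms between matched atoms, without superposition."""
       if len(pose) != len(reference):
           raise ValueError("pose and reference must have the same number of atoms")
       squared = sum(math.dist(a, b) ** 2 for a, b in zip(pose, reference))
       return math.sqrt(squared / len(pose))


   def success_rate(rmsds, threshold=2.0):
       """Fraction of poses whose RMSD is below the threshold (default 2 angstroms)."""
       return sum(r < threshold for r in rmsds) / len(rmsds)

For example, RMSD values of 0.8, 1.5, 2.4, and 3.0 Å give a success rate of 0.5 at the 2 Å threshold.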

Strengths and Limitations
--------------------------

Strengths
^^^^^^^^^

**Ultra-Fast Evaluation**
   10-50x faster than physics-based scoring

**Good Pose Recognition**
   Can distinguish near-native from incorrect poses

**Robust**
   Works across diverse protein families

**Simple**
   Few parameters, easy to use

**Parallelizes Well**
   Excellent GPU acceleration

Limitations
^^^^^^^^^^^

**Lower Accuracy**
   R = 0.72 vs 0.85 for physics-based

**Coarse Granularity**
   Less sensitive to subtle differences

**No Energy Decomposition**
   Can't analyze individual interaction contributions

**Training Set Bias**
   May perform poorly on novel binding modes

**No Solvation Model**
   Doesn't explicitly account for desolvation

Best Practices
--------------

Recommended Use Cases
^^^^^^^^^^^^^^^^^^^^^

1. **Large Library Screening**

   .. code-block:: bash

      pandadock dock -r target.pdb -l library_50k.sdf \
                     --algorithm cuda_monte_carlo \
                     --scoring empirical \
                     --gpu \
                     --num-poses 3 \
                     -o initial_screening/

   Screen 50,000 compounds in 7-14 hours (GPU)

2. **Initial Filtering Before Detailed Docking**

   .. code-block:: bash

      # Step 1: Fast empirical screening
      pandadock dock -r target.pdb -l library_10k.sdf \
                     --scoring empirical \
                     --fast \
                     --num-poses 1 \
                     -o empirical_filter/

      # Step 2: Rescore top 500 with physics-based
      pandadock dock -r target.pdb -l top_500.sdf \
                     --scoring physics_based \
                     --num-poses 20 \
                     -o refined_results/

3. **Pose Filtering**

   Use empirical scoring to quickly identify poor poses

4. **Fragment Screening**

   Fast evaluation of small fragment libraries

Not Recommended For
^^^^^^^^^^^^^^^^^^^

**Critical Lead Optimization**
   Use ``physics_based`` or ``hybrid`` scoring

**Quantitative Affinity Prediction**
   Lower correlation with experimental data

**Detailed Interaction Analysis**
   No energy decomposition available

**Novel Binding Modes**
   May not generalize beyond training set

**Charged Ligands**
   Electrostatics not well-represented

Optimization Tips
^^^^^^^^^^^^^^^^^

**Maximize Throughput:**

.. code-block:: bash

   pandadock dock -r target.pdb -l library.sdf \
                  --algorithm cuda_monte_carlo \
                  --scoring empirical \
                  --gpu \
                  --gpu-batch-size 2000 \
                  --fast \
                  --num-poses 1

Target: 7000+ ligands/hour

**Balance Speed and Accuracy:**

.. code-block:: bash

   pandadock dock -r target.pdb -l library.sdf \
                  --algorithm monte_carlo_cpu \
                  --scoring empirical \
                  --num-poses 5 \
                  --cpuworkers 16

**Two-Stage Screening:**

.. code-block:: bash

   # Stage 1: Empirical screening (fast)
   pandadock dock -r target.pdb -l library_100k.sdf \
                  --scoring empirical \
                  --fast \
                  -o stage1/

   # Extract top 1000 by score
   # Stage 2: Rescore with physics-based
   pandadock dock -r target.pdb -l top_1000.sdf \
                  --scoring physics_based \
                  --rescoring mmgbsa \
                  -o stage2/

Output Format
-------------

Scoring Output
^^^^^^^^^^^^^^

.. code-block:: json

   {
     "binding_score": -6.8,
     "components": {
       "contact_score": -8.5,
       "lipophilic_score": -2.3,
       "hbond_score": -3.2,
       "metal_score": 0.0,
       "flexibility_penalty": 1.8
     }
   }

**Note:** Empirical scores are unitless and calibrated to approximate kcal/mol
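
A result in this shape can be inspected with the standard ``json`` module. The sketch below parses the sample above from a string (where PandaDock writes the file is not specified here) and lists the components from most to least favorable.

.. code-block:: python

   import json

   # Sample output copied from the block above.
   sample = """
   {
     "binding_score": -6.8,
     "components": {
       "contact_score": -8.5,
       "lipophilic_score": -2.3,
       "hbond_score": -3.2,
       "metal_score": 0.0,
       "flexibility_penalty": 1.8
     }
   }
   """

   result = json.loads(sample)

   # Sort components so the most negative (most favorable) term comes first.
   ranked = sorted(result["components"].items(), key=lambda kv: kv[1])
   for name, value in ranked:
       print(f"{name:20s} {value:6.1f}")

The sketch deliberately does not assume that the components add up to ``binding_score``: in the sample they sum to -12.2 rather than -6.8, so any scaling between the two is outside what this page describes.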

Ranking Output
^^^^^^^^^^^^^^

.. code-block:: text

   Rank  Ligand_ID      Score    RMSD
   1     compound_1523  -8.5     1.2
   2     compound_0942  -8.2     0.8
   3     compound_2341  -7.9     1.5
   ...
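
Selecting the top compounds for a second docking stage can be done by parsing a ranking in this layout. The following is a minimal sketch that assumes whitespace-separated columns exactly as shown above and that lower (more negative) scores are better; the function name is illustrative.

.. code-block:: python

   def top_ligands(ranking_text, n):
       """Return the n best-scoring ligand IDs (most negative score first)."""
       rows = []
       for line in ranking_text.splitlines():
           fields = line.split()
           # Skip the header, blank lines, and anything that is not a data row.
           if len(fields) < 3 or not fields[0].isdigit():
               continue
           rows.append((float(fields[2]), fields[1]))
       rows.sort()
       return [ligand_id for _, ligand_id in rows[:n]]


   ranking = """\
   Rank  Ligand_ID      Score    RMSD
   1     compound_1523  -8.5     1.2
   2     compound_0942  -8.2     0.8
   3     compound_2341  -7.9     1.5
   """

   print(top_ligands(ranking, 2))
   # ['compound_1523', 'compound_0942']

The resulting IDs can then be used to build the smaller input file for a second stage with a tool of your choice.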

Comparison with Other Scoring Functions
----------------------------------------

vs Physics-Based
^^^^^^^^^^^^^^^^

+-------------------+------------------------+------------------------+
| Aspect            | Empirical              | Physics-Based          |
+===================+========================+========================+
| Speed             | 10-50x faster          | Baseline               |
+-------------------+------------------------+------------------------+
| Accuracy          | R = 0.72               | R = 0.85               |
+-------------------+------------------------+------------------------+
| Interpretability  | No decomposition       | Energy decomposition   |
+-------------------+------------------------+------------------------+
| Throughput        | Higher                 | Lower                  |
+-------------------+------------------------+------------------------+

**Choose empirical when:** Speed is paramount, screening large libraries

**Choose physics-based when:** Accuracy matters, need energy decomposition

vs Hybrid Scoring
^^^^^^^^^^^^^^^^^

+-------------------+--------------+--------------+
| Aspect            | Empirical    | Hybrid       |
+===================+==============+==============+
| Speed             |              |              |
+-------------------+--------------+--------------+
| Accuracy          |              |              |
+-------------------+--------------+--------------+
| Setup             |              |              |
+-------------------+--------------+--------------+

**Choose empirical when:** Ultra-fast initial screening

**Choose hybrid when:** Final ranking and lead optimization

Examples
--------

Ultra-Fast Virtual Screening
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Screen 100,000 compound library
   pandadock dock -r kinase.pdb -l library_100k.sdf \
                  --algorithm cuda_monte_carlo \
                  --scoring empirical \
                  --gpu \
                  --fast \
                  --num-poses 1 \
                  -o empirical_screening/

Expected runtime: 14-28 hours (GPU). Output: top-scoring compounds.

Fragment Library Screening
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pandadock dock -r protein.pdb -l fragments_5k.sdf \
                  --algorithm monte_carlo_cpu \
                  --scoring empirical \
                  --fast \
                  --num-poses 3 \
                  --cpuworkers 8 \
                  -o fragment_hits/

Two-Stage High-Throughput Screening
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Stage 1: Rapid empirical filter (10,000 -> 500)
   pandadock dock -r target.pdb -l library_10k.sdf \
                  --scoring empirical \
                  --fast \
                  --num-poses 1 \
                  -o stage1_empirical/

   # Extract top 500 compounds by empirical score

   # Stage 2: Detailed physics-based rescoring (500 -> 50)
   pandadock dock -r target.pdb -l top_500.sdf \
                  --scoring physics_based \
                  --num-poses 20 \
                  -o stage2_physics/

   # Extract top 50 for experimental validation

Expected Workflow Results
^^^^^^^^^^^^^^^^^^^^^^^^^

**Input:** 10,000 compound library

**Stage 1 (empirical):**

* Time: 25-50 hours (CPU) or 1.5-3 hours (GPU)
* Output: Ranked list, select top 500

**Stage 2 (physics-based):**

* Time: 1-2 hours (top 500)
* Output: Refined ranking, select top 50

**Stage 3 (experimental):**

* Test top 50 compounds
* Expected hit rate: 10-30% (5-15 active compounds)

Validation Studies
------------------

Enrichment Performance
^^^^^^^^^^^^^^^^^^^^^^

Tested on DUD-E (Database of Useful Decoys: Enhanced):

* **Top 1% enrichment:** 12-18x
* **Top 5% enrichment:** 8-12x
* **AUC (ROC):** 0.72-0.78

**Conclusion:** Good enrichment for initial filtering, not optimal for final ranking
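
The enrichment factor reported above compares the fraction of actives in the top-ranked slice with the fraction in the whole library. A minimal sketch follows, assuming lower scores rank first and a boolean active/decoy label per compound; the function name is illustrative.

.. code-block:: python

   def enrichment_factor(scores, is_active, fraction):
       """EF = (active rate in the top fraction) / (active rate overall)."""
       # Rank compounds by score only, best (most negative) first.
       ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
       n_top = max(1, round(len(ranked) * fraction))
       actives_top = sum(active for _, active in ranked[:n_top])
       actives_all = sum(is_active)
       return (actives_top / n_top) / (actives_all / len(ranked))

For example, if 2 of 10 compounds are active and both land in the top 20% of the ranking, the enrichment factor at 20% is 1.0 / 0.2 = 5. A value of 1 corresponds to random selection.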

Pose Reproduction
^^^^^^^^^^^^^^^^^

Tested on Astex Diverse Set (85 complexes):

* **Success rate (RMSD < 2Å):** 80-85%
* **Top pose success:** 65-75%

**Conclusion:** Adequate pose recognition for screening

See Also
--------

* :doc:`overview` - Scoring functions overview
* :doc:`physics_based` - Physics-based scoring
* :doc:`hybrid` - Hybrid ML scoring
* :doc:`gpu_scoring` - GPU scoring
* :doc:`../algorithms/gpu_algorithms` - GPU algorithms for maximum throughput