Empirical Scoring
The empirical scoring function is optimized for high-throughput virtual screening. It uses statistical potentials derived from protein-ligand structure databases to evaluate binding poses in milliseconds, trading some accuracy for speed.
Overview
Scoring ID: empirical
Type: Knowledge-based statistical scoring
Accuracy: R = 0.72 correlation with experimental binding affinities (PDBBind Core)
Speed: 0.001-0.005 seconds per pose (10-50x faster than physics-based scoring)
Best for: Virtual screening, large library docking, initial filtering, rapid pose evaluation
Algorithm
The empirical scoring function uses statistical potentials derived from known protein-ligand complexes:
Scoring Components
Contact Score
\[S_{contact} = \sum_{i,j} w_{ij} \cdot f(d_{ij})\]
Atom-type pair potentials
Distance-dependent statistical preferences
Derived from observed contact frequencies in PDB
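The contact term above can be sketched in a few lines of Python. This is a minimal illustration, not PandaDock's implementation: the `PAIR_WEIGHTS` table, the Gaussian shape of `distance_term`, and the 8 Å cutoff are all assumptions made for the example.

```python
import math
from itertools import product

# Hypothetical pair weights w_ij keyed by (protein atom type, ligand atom type).
# Real values would be derived from contact frequencies observed in the PDB.
PAIR_WEIGHTS = {
    ("C", "C"): -0.20,
    ("N", "O"): -0.50,
    ("O", "N"): -0.50,
}


def distance_term(d, d_opt=4.0, width=1.0, cutoff=8.0):
    """Assumed f(d): a Gaussian centred on d_opt, zero beyond the cutoff."""
    if d > cutoff:
        return 0.0
    return math.exp(-((d - d_opt) ** 2) / (2.0 * width ** 2))


def contact_score(protein_atoms, ligand_atoms):
    """Sum w_ij * f(d_ij) over all protein-ligand atom pairs.

    Each atom is a (type, (x, y, z)) tuple; unknown type pairs contribute 0.
    """
    total = 0.0
    for (p_type, p_xyz), (l_type, l_xyz) in product(protein_atoms, ligand_atoms):
        w = PAIR_WEIGHTS.get((p_type, l_type), 0.0)
        total += w * distance_term(math.dist(p_xyz, l_xyz))
    return total


# Toy coordinates (angstroms), invented for illustration only.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (5.0, 0.0, 0.0))]
ligand = [("C", (4.0, 0.0, 0.0)), ("O", (9.0, 0.0, 0.0))]
score = contact_score(protein, ligand)
```

In this toy case only the C-C and N-O pairs have a weight, and both sit at the assumed optimal distance, so each contributes its full weight.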
Lipophilic Score
Hydrophobic-hydrophobic contact rewards
Surface complementarity bonus
Burial of hydrophobic surface area
Hydrogen Bond Score
Geometry-independent H-bond detection
Fixed weight per hydrogen bond
Faster than physics-based H-bond evaluation
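One way to read "geometry-independent" is that a hydrogen bond is counted from the donor-acceptor distance alone, with no angle term. The sketch below follows that reading; the 3.5 Å cutoff and the weight of -1.0 per bond are assumptions for illustration, not PandaDock's parameters.

```python
import math

HBOND_WEIGHT = -1.0    # assumed fixed reward per hydrogen bond
HBOND_MAX_DIST = 3.5   # assumed donor-acceptor cutoff in angstroms


def hbond_score(donors, acceptors):
    """Count donor-acceptor pairs within the cutoff and apply a fixed weight.

    Only distances are checked; no donor-hydrogen-acceptor angle is evaluated.
    Returns (score, number_of_bonds).
    """
    count = sum(
        1
        for d in donors
        for a in acceptors
        if math.dist(d, a) <= HBOND_MAX_DIST
    )
    return count * HBOND_WEIGHT, count


# Toy heavy-atom coordinates, invented for illustration only.
donors = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
acceptors = [(2.9, 0.0, 0.0), (10.0, 3.0, 0.0), (20.0, 0.0, 0.0)]
hb_score, hb_count = hbond_score(donors, acceptors)
```

Skipping the angular check is what makes this cheaper than a physics-based evaluation, at the cost of counting some contacts that are not well-formed hydrogen bonds.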
Metal Coordination Score
Bonus for coordinating metal ions
Simple distance-based detection
Fixed weights per metal type
Flexibility Penalty
Penalty for rotatable bonds
Accounts for conformational entropy loss
Simpler than torsional energy calculation
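A rotatable-bond penalty of this kind reduces to a single multiplication. The per-bond value of 0.3 below is a hypothetical number chosen for the example; the actual weight is not stated here.

```python
ROTOR_PENALTY = 0.3  # hypothetical penalty per rotatable bond


def flexibility_penalty(n_rotatable_bonds, per_bond=ROTOR_PENALTY):
    """Return a positive (unfavourable) penalty that grows linearly
    with the number of rotatable bonds in the ligand."""
    if n_rotatable_bonds < 0:
        raise ValueError("rotatable bond count cannot be negative")
    return n_rotatable_bonds * per_bond


rigid = flexibility_penalty(0)      # a fully rigid ligand pays nothing
flexible = flexibility_penalty(6)   # six rotors pay six times the unit penalty
```

Because the penalty depends only on a bond count, it can be computed once per ligand rather than once per pose.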
Training Data
Empirical parameters optimized on:
PDBBind General Set: 10,000+ protein-ligand complexes
Refined Set: High-quality structures with experimental affinities
Diverse Set: Covering all protein families
Validation: CASF-2016, Astex Diverse Set
Usage
Basic Usage
pandadock dock -r protein.pdb -l ligand.sdf \
    --scoring empirical \
    --center 10 20 30 --box 20 20 20
Virtual Screening
pandadock dock -r target.pdb -l library_10k.sdf \
    --algorithm monte_carlo_cpu \
    --scoring empirical \
    --fast \
    --num-poses 3 \
    -o screening_results/
With GPU Acceleration
pandadock dock -r target.pdb -l library.sdf \
    --algorithm cuda_monte_carlo \
    --scoring empirical \
    --gpu \
    --gpu-batch-size 2000 \
    -o ultra_fast_screening/
Expected throughput: 5000-7200 ligands/hour
Performance Characteristics
Accuracy Benchmarks
| Dataset | Correlation (R) | RMSE (kcal/mol) |
|---|---|---|
| PDBBind Core | 0.72 | 2.35 |
| CASF-2016 | 0.68 | 2.58 |
| Astex Diverse | 0.70 | 2.42 |
Note: Lower accuracy than physics-based, but 10-50x faster
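For readers who want to reproduce this kind of benchmark on their own data, the two metrics can be computed as follows. The sketch assumes R is the Pearson correlation coefficient, which is not stated explicitly here, and the four score/affinity pairs are invented purely to exercise the functions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(sq_err / len(predicted))


# Made-up values in kcal/mol, not taken from any benchmark.
predicted = [-6.0, -7.0, -8.0, -9.0]
experimental = [-5.0, -8.0, -7.0, -10.0]

r = pearson_r(predicted, experimental)
error = rmse(predicted, experimental)
```

A high R with a large RMSE means the ranking is useful even though the absolute values are off, which is the regime that matters most for screening.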
Speed Benchmarks
Small ligand (<20 atoms): 0.001-0.002 seconds/pose
Medium ligand (20-40 atoms): 0.002-0.003 seconds/pose
Large ligand (>40 atoms): 0.003-0.005 seconds/pose
Screening throughput:
CPU (monte_carlo_cpu): 200-400 ligands/hour
GPU (cuda_monte_carlo): 3600-7200 ligands/hour
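These throughput ranges translate directly into wall-clock estimates for a given library size. The helper below is a small planning sketch; the function name is hypothetical and the rates are simply the figures quoted above.

```python
# Throughput ranges in ligands/hour, taken from the figures above.
CPU_RATE = (200, 400)
GPU_RATE = (3600, 7200)


def screening_hours(n_ligands, rate_low, rate_high):
    """Return (best_case, worst_case) runtime in hours for a library,
    given a throughput range in ligands per hour."""
    if rate_low <= 0 or rate_high < rate_low:
        raise ValueError("expected 0 < rate_low <= rate_high")
    return n_ligands / rate_high, n_ligands / rate_low


cpu_best, cpu_worst = screening_hours(10_000, *CPU_RATE)
gpu_best, gpu_worst = screening_hours(50_000, *GPU_RATE)
```

For a 10,000-compound library on CPU this gives 25 to 50 hours, and for 50,000 compounds on GPU roughly 7 to 14 hours.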
Pose Prediction Accuracy
RMSD < 2 Å: 80-85% (with monte_carlo algorithm)
RMSD < 2 Å: 88-92% (with enhanced_hierarchical algorithm)
Top pose RMSD < 2 Å: 65-75%
Lower pose prediction accuracy than physics-based, but sufficient for filtering.
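The success-rate metric used above can be computed as in the sketch below. It assumes both poses list the same atoms in the same order and are already in the receptor frame, so no superposition is done; symmetry-equivalent atoms are not handled. The coordinates and RMSD values are invented for illustration.

```python
import math


def pose_rmsd(predicted, reference):
    """Heavy-atom RMSD between two poses with identical atom ordering.

    No alignment is performed, since docked and crystal poses share
    the receptor's coordinate frame.
    """
    if len(predicted) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    sq = sum(
        (a - b) ** 2
        for p, r in zip(predicted, reference)
        for a, b in zip(p, r)
    )
    return math.sqrt(sq / len(predicted))


def success_rate(rmsds, threshold=2.0):
    """Fraction of complexes whose pose RMSD is below the threshold."""
    return sum(1 for r in rmsds if r < threshold) / len(rmsds)


# Toy example: every atom is displaced by 1 angstrom along z.
docked = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd = pose_rmsd(docked, crystal)

rate = success_rate([0.8, 1.5, 2.4, 3.1])
```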
Strengths and Limitations
Strengths
- Ultra-Fast Evaluation
10-50x faster than physics-based scoring
- Good Pose Recognition
Can distinguish near-native from incorrect poses
- Robust
Works across diverse protein families
- Simple
Few parameters, easy to use
- Parallelizes Well
Excellent GPU acceleration
Limitations
- Lower Accuracy
R = 0.72 vs 0.85 for physics-based
- Coarse Granularity
Less sensitive to subtle differences
- No Energy Decomposition
Can’t analyze individual interaction contributions
- Training Set Bias
May perform poorly on novel binding modes
- No Solvation Model
Doesn’t explicitly account for desolvation
Best Practices
Recommended Use Cases
Large Library Screening
pandadock dock -r target.pdb -l library_50k.sdf \
    --algorithm cuda_monte_carlo \
    --scoring empirical \
    --gpu \
    --num-poses 3 \
    -o initial_screening/
Screen 50,000 compounds in 7-14 hours (GPU)
Initial Filtering Before Detailed Docking
# Step 1: Fast empirical screening
pandadock dock -r target.pdb -l library_10k.sdf \
    --scoring empirical \
    --fast \
    --num-poses 1 \
    -o empirical_filter/
# Step 2: Rescore top 500 with physics-based
pandadock dock -r target.pdb -l top_500.sdf \
    --scoring physics_based \
    --num-poses 20 \
    -o refined_results/
Pose Filtering
Use empirical scoring to quickly identify poor poses
Fragment Screening
Fast evaluation of small fragment libraries
Not Recommended For
- Critical Lead Optimization
Use physics_based or hybrid scoring
- Quantitative Affinity Prediction
Lower correlation with experimental data
- Detailed Interaction Analysis
No energy decomposition available
- Novel Binding Modes
May not generalize beyond training set
- Charged Ligands
Electrostatics not well-represented
Optimization Tips
Maximize Throughput:
pandadock dock -r target.pdb -l library.sdf \
    --algorithm cuda_monte_carlo \
    --scoring empirical \
    --gpu \
    --gpu-batch-size 2000 \
    --fast \
    --num-poses 1
Target: 7000+ ligands/hour
Balance Speed and Accuracy:
pandadock dock -r target.pdb -l library.sdf \
    --algorithm monte_carlo_cpu \
    --scoring empirical \
    --num-poses 5 \
    --cpuworkers 16
Two-Stage Screening:
# Stage 1: Empirical screening (fast)
pandadock dock -r target.pdb -l library_100k.sdf \
    --scoring empirical \
    --fast \
    -o stage1/
# Extract top 1000 by score
# Stage 2: Rescore with physics-based
pandadock dock -r target.pdb -l top_1000.sdf \
    --scoring physics_based \
    --rescoring mmgbsa \
    -o stage2/
Output Format
Scoring Output
{
"binding_score": -6.8,
"components": {
"contact_score": -8.5,
"lipophilic_score": -2.3,
"hbond_score": -3.2,
"metal_score": 0.0,
"flexibility_penalty": 1.8
}
}
Note: Empirical scores are formally unitless; they are calibrated so that their magnitudes approximate binding free energies in kcal/mol.
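The sample output above can be read with Python's standard `json` module. The sketch below embeds that sample as a string; in practice the text would be read from wherever the results are written, which is not specified here. It assumes the sign convention suggested by the sample: negative values are favourable and positive values are penalties.

```python
import json

# The sample scoring output shown above, embedded as a string.
raw = """
{
  "binding_score": -6.8,
  "components": {
    "contact_score": -8.5,
    "lipophilic_score": -2.3,
    "hbond_score": -3.2,
    "metal_score": 0.0,
    "flexibility_penalty": 1.8
  }
}
"""

result = json.loads(raw)
components = result["components"]

# Order components from most favourable (most negative) to least.
ranked = sorted(components.items(), key=lambda item: item[1])

# Names of the terms that contribute favourably under the assumed convention.
favourable = [name for name, value in ranked if value < 0]

for name, value in ranked:
    print(f"{name:<22}{value:>6.1f}")
```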
Ranking Output
Rank Ligand_ID Score RMSD
1 compound_1523 -8.5 1.2
2 compound_0942 -8.2 0.8
3 compound_2341 -7.9 1.5
...
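The "extract top N by score" step of a two-stage workflow can be scripted against a ranking table like the one above. This sketch assumes the table is whitespace-delimited with exactly the four columns shown and that more negative scores are better; the function names are hypothetical, and writing the selected ligands back to an SDF file is left out.

```python
def parse_ranking(text):
    """Parse a whitespace-delimited ranking table into
    (ligand_id, score, rmsd) tuples, skipping the header and filler lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # header row, blank line, or "..." continuation marker
        _rank, ligand_id, score, rmsd = fields
        rows.append((ligand_id, float(score), float(rmsd)))
    return rows


def top_n(rows, n):
    """Return the IDs of the n best-scoring ligands (most negative first)."""
    best = sorted(rows, key=lambda row: row[1])[:n]
    return [ligand_id for ligand_id, _score, _rmsd in best]


# The ranking sample shown above.
sample = """Rank Ligand_ID Score RMSD
1 compound_1523 -8.5 1.2
2 compound_0942 -8.2 0.8
3 compound_2341 -7.9 1.5
...
"""

rows = parse_ranking(sample)
selected = top_n(rows, 2)
```

Re-sorting by score rather than trusting the rank column keeps the selection correct even if several ranking files are concatenated first.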
Comparison with Other Scoring Functions
vs Physics-Based
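| Aspect | Empirical | Physics-Based |
|---|---|---|
| Speed | Faster (10-50x) | Slower |
| Accuracy | Lower (R = 0.72) | Higher (R = 0.85) |
| Interpretability | No energy decomposition | Energy decomposition available |
| Throughput | Higher | Lower |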
Choose empirical when: Speed is paramount, screening large libraries
Choose physics-based when: Accuracy matters, need energy decomposition
vs Hybrid Scoring
| Aspect | Empirical | Hybrid |
|---|---|---|
| Speed | | |
| Accuracy | | |
| Setup | | |
Choose empirical when: Ultra-fast initial screening
Choose hybrid when: Final ranking and lead optimization
Examples
Ultra-Fast Virtual Screening
# Screen 100,000 compound library
pandadock dock -r kinase.pdb -l library_100k.sdf \
    --algorithm cuda_monte_carlo \
    --scoring empirical \
    --gpu \
    --fast \
    --num-poses 1 \
    -o empirical_screening/
Expected runtime: 14-28 hours (GPU). Output: top-scoring compounds.
Fragment Library Screening
pandadock dock -r protein.pdb -l fragments_5k.sdf \
    --algorithm monte_carlo_cpu \
    --scoring empirical \
    --fast \
    --num-poses 3 \
    --cpuworkers 8 \
    -o fragment_hits/
Two-Stage High-Throughput Screening
# Stage 1: Rapid empirical filter (10,000 → 500)
pandadock dock -r target.pdb -l library_10k.sdf \
    --scoring empirical \
    --fast \
    --num-poses 1 \
    -o stage1_empirical/
# Extract top 500 compounds by empirical score
# Stage 2: Detailed physics-based rescoring (500 → 50)
pandadock dock -r target.pdb -l top_500.sdf \
    --scoring physics_based \
    --num-poses 20 \
    -o stage2_physics/
# Extract top 50 for experimental validation
Expected Workflow Results
Input: 10,000 compound library
Stage 1 (empirical):
Time: 25-50 hours (CPU) or 1.5-3 hours (GPU)
Output: Ranked list, select top 500
Stage 2 (physics-based):
Time: 1-2 hours (top 500)
Output: Refined ranking, select top 50
Stage 3 (experimental):
Test top 50 compounds
Expected hit rate: 10-30% (5-15 active compounds)
Validation Studies
Enrichment Performance
Tested on DUD-E (Database of Useful Decoys: Enhanced):
Top 1% enrichment: 12-18x
Top 5% enrichment: 8-12x
AUC (ROC): 0.72-0.78
Conclusion: Good enrichment for initial filtering, not optimal for final ranking
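An enrichment factor of this kind compares the hit rate in the top-ranked fraction with the hit rate of the whole library. The sketch below uses the common definition EF = (actives in top x% / compounds in top x%) / (total actives / total compounds); whether the figures above were computed exactly this way is an assumption. The 100-compound library is synthetic.

```python
def enrichment_factor(scores, is_active, fraction):
    """Enrichment factor at the given top fraction of a ranked library.

    scores: docking scores, where more negative means better.
    is_active: booleans marking known actives, parallel to scores.
    fraction: e.g. 0.01 for the top 1%.
    """
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: scores[i])  # best score first
    hits_top = sum(is_active[i] for i in order[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Synthetic library: 100 compounds, 10 actives, 5 of them ranked in the top 5.
scores = [float(i) - 100.0 for i in range(100)]
active_idx = set(range(5)) | set(range(50, 55))
labels = [i in active_idx for i in range(100)]

ef_10 = enrichment_factor(scores, labels, 0.10)
ef_05 = enrichment_factor(scores, labels, 0.05)
```

An EF of 1 corresponds to random selection, so values well above 1 in the top few percent are what make a fast first-pass filter worthwhile.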
Pose Reproduction
Tested on Astex Diverse Set (85 complexes):
Success rate (RMSD < 2 Å): 80-85%
Top pose success: 65-75%
Conclusion: Adequate pose recognition for screening
See Also
Scoring Functions Overview - Scoring functions overview
Physics-Based Scoring - Physics-based scoring
Hybrid ML Scoring - Hybrid ML scoring
GPU Scoring Functions - GPU scoring
GPU Algorithms - GPU algorithms for maximum throughput