
🎯 BGE Model Validation System - Streamlined & Simplified

Quick Start - One Command for Everything

The old, confusing collection of validation scripts has been replaced with a single, simple interface:

# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data

🧹 What We Cleaned Up

REMOVED (Redundant/Confusing Files):

  • scripts/validate_m3.py - Simple validation (redundant)
  • scripts/validate_reranker.py - Simple validation (redundant)
  • scripts/evaluate.py - Main evaluation (overlapping scope)
  • scripts/compare_retriever.py - Retriever comparison (merged)
  • scripts/compare_reranker.py - Reranker comparison (merged)

NEW STREAMLINED SYSTEM:

  • scripts/validate.py - Single entry point for all validation
  • scripts/compare_models.py - Unified model comparison
  • scripts/quick_validation.py - Improved quick validation
  • scripts/comprehensive_validation.py - Enhanced comprehensive validation
  • scripts/benchmark.py - Performance benchmarking
  • evaluation/ - Core evaluation modules (kept clean)

🚀 Complete Usage Examples

1. After Training - Quick Sanity Check

python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model

Time: 5 minutes | Purpose: Verify models work correctly
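
If you want to sanity-check the models by hand first, a minimal smoke test is sketched below. It assumes the FlagEmbedding package (which BGE models are built on) is installed; the queries and passages are illustrative placeholders:

from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Load the fine-tuned models (paths from the example above).
retriever = BGEM3FlagModel("./output/bge_m3_三国演义/final_model", use_fp16=True)
reranker = FlagReranker("./output/bge_reranker_三国演义/final_model", use_fp16=True)

# Dense embeddings should come back with a fixed dimension (1024 for BGE-M3).
out = retriever.encode(["刘备是谁?", "曹操是谁?"])
print(out["dense_vecs"].shape)  # expect (2, 1024)

# A relevant query-passage pair should score higher than an irrelevant one.
print(reranker.compute_score(["刘备是谁?", "刘备,字玄德,蜀汉开国皇帝。"]))
print(reranker.compute_score(["刘备是谁?", "今天天气很好。"]))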

2. Measure Improvements - Baseline Comparison

python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"

Time: 15 minutes | Purpose: Quantify how much you improved vs baseline
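
Both comparison datasets are JSONL files. Assuming the splits follow the standard BGE/FlagEmbedding fine-tuning layout (an assumption - inspect your own split files to confirm), each line pairs a query with positive and negative passages:

{"query": "刘备是谁?", "pos": ["刘备,字玄德,蜀汉开国皇帝。"], "neg": ["今天天气很好。"]}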

3. Thorough Testing - Comprehensive Validation

python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"

Time: 30 minutes | Purpose: Detailed accuracy evaluation

4. Production Ready - Complete Suite

python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"

Time: 1 hour | Purpose: Run every validation stage and confirm the models are deployment-ready


📊 Understanding Your Results

Validation Status (thresholds sketched below):

  • 🌟 EXCELLENT - Significant improvements (>5% average)
  • ✅ GOOD - Clear improvements (2-5% average)
  • 👌 FAIR - Modest improvements (0-2% average)
  • ❌ POOR - No improvement or degradation
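
The status is derived from the average metric improvement over the baseline. A minimal sketch of the mapping, using the thresholds above (a hypothetical helper, not the project's actual code):

def validation_status(avg_improvement_pct: float) -> str:
    # Map the average improvement (in percent) to the labels above.
    if avg_improvement_pct > 5:
        return "EXCELLENT"
    if avg_improvement_pct >= 2:
        return "GOOD"
    if avg_improvement_pct >= 0:
        return "FAIR"
    return "POOR"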

Key Metrics:

Retriever: Recall@5, Recall@10, MAP, MRR
Reranker: Accuracy, Precision, Recall, F1
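
For reference, the retriever metrics are standard ranking measures. A minimal sketch of Recall@k and MRR (hypothetical helpers; the real implementations presumably live in the evaluation/ modules):

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results.
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant result; 0 if none is found.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0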

Output Files:

  • validation_summary.md - Main summary report
  • validation_results.json - Complete detailed results
  • comparison/ - Baseline comparison results
  • comprehensive/ - Detailed validation metrics

🛠️ Advanced Options

Single Model Testing:

# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model

Performance Tuning:

# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl

Benchmarking Only:

# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
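
For a rough manual throughput check outside the script, something like the following works (a sketch using FlagEmbedding; the batch size and sample count mirror the flags above):

import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("./output/model", use_fp16=True)
sentences = ["这是一条用于测速的示例文本。"] * 1000

start = time.perf_counter()
model.encode(sentences, batch_size=32)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")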

🎯 Integration with Your Workflow

Complete Training → Validation Pipeline:

# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"

🚨 Troubleshooting

Common Issues:

  1. "Model not found" → Check if training completed and model path exists
  2. "Out of memory" → Reduce --batch_size or use --max_samples
  3. "No test data" → Ensure you ran split_datasets.py first
  4. Import errors → Run from project root directory

Performance:

  • Slow validation → Use --max_samples 1000 for quick testing
  • High memory → Reduce batch size to 8-16
  • GPU not used → Check CUDA/device configuration

💡 Best Practices

  1. Always start with quick mode - Verify models work before deeper testing
  2. Use proper test/train splits - Don't validate on training data
  3. Compare against baselines - Know how much you actually improved
  4. Keep validation results - Track progress across different experiments
  5. Test with representative data - Use diverse test sets
  6. Monitor resource usage - Adjust batch sizes for your hardware

🎉 Benefits of New System

  • Single entry point - No more confusion about which script to use
  • Clear modes - quick, compare, comprehensive, all
  • Unified output - Consistent result formats and summaries
  • Better error handling - Clear error messages and troubleshooting
  • Integrated workflow - Works seamlessly with training scripts
  • Comprehensive reporting - Detailed summaries and recommendations
  • Performance aware - Built-in benchmarking and optimization

The validation system is now clear, powerful, and easy to use! 🚀


Need help? The validation system provides detailed error messages and suggestions. Check the generated validation_summary.md for specific recommendations!