
🎯 BGE Model Validation System - Streamlined & Simplified

Quick Start - One Command for Everything

The old, confusing collection of validation scripts has been replaced with a single, simple interface:

# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data

🧹 What We Cleaned Up

REMOVED (Redundant/Confusing Files):

  • scripts/validate_m3.py - Simple validation (redundant)
  • scripts/validate_reranker.py - Simple validation (redundant)
  • scripts/evaluate.py - Main evaluation (overlapping scope)
  • scripts/compare_retriever.py - Retriever comparison (merged)
  • scripts/compare_reranker.py - Reranker comparison (merged)

NEW STREAMLINED SYSTEM:

  • scripts/validate.py - Single entry point for all validation
  • scripts/compare_models.py - Unified model comparison
  • scripts/quick_validation.py - Improved quick validation
  • scripts/comprehensive_validation.py - Enhanced comprehensive validation
  • scripts/benchmark.py - Performance benchmarking
  • evaluation/ - Core evaluation modules (kept clean)

🚀 Complete Usage Examples

1. After Training - Quick Sanity Check

python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model

Time: 5 minutes | Purpose: Verify models work correctly
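
If you want to sanity-check the models by hand first, a minimal smoke test is sketched below. It assumes the FlagEmbedding package (which BGE models are built on) is installed; the queries and passages are illustrative placeholders:

from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Load the fine-tuned models (paths from the example above).
retriever = BGEM3FlagModel("./output/bge_m3_三国演义/final_model", use_fp16=True)
reranker = FlagReranker("./output/bge_reranker_三国演义/final_model", use_fp16=True)

# Dense embeddings should come back with a fixed dimension (1024 for BGE-M3).
out = retriever.encode(["刘备是谁?", "曹操是谁?"])
print(out["dense_vecs"].shape)  # expect (2, 1024)

# A relevant query-passage pair should score higher than an irrelevant one.
print(reranker.compute_score(["刘备是谁?", "刘备,字玄德,蜀汉开国皇帝。"]))
print(reranker.compute_score(["刘备是谁?", "今天天气很好。"]))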

2. Measure Improvements - Baseline Comparison

python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"

Time: 15 minutes | Purpose: Quantify how much you improved vs baseline
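
Both comparison datasets are JSONL files. Assuming the splits follow the standard BGE/FlagEmbedding fine-tuning layout (an assumption - inspect your own split files to confirm), each line pairs a query with positive and negative passages:

{"query": "刘备是谁?", "pos": ["刘备,字玄德,蜀汉开国皇帝。"], "neg": ["今天天气很好。"]}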

3. Thorough Testing - Comprehensive Validation

python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"

Time: 30 minutes | Purpose: Detailed accuracy evaluation

4. Production Ready - Complete Suite

python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"

Time: 1 hour | Purpose: Run every validation stage and confirm the models are deployment-ready


📊 Understanding Your Results

Validation Status (thresholds sketched below):

  • 🌟 EXCELLENT - Significant improvements (>5% average)
  • ✅ GOOD - Clear improvements (2-5% average)
  • 👌 FAIR - Modest improvements (0-2% average)
  • ❌ POOR - No improvement or degradation
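
The status is derived from the average metric improvement over the baseline. A minimal sketch of the mapping, using the thresholds above (a hypothetical helper, not the project's actual code):

def validation_status(avg_improvement_pct: float) -> str:
    # Map the average improvement (in percent) to the labels above.
    if avg_improvement_pct > 5:
        return "EXCELLENT"
    if avg_improvement_pct >= 2:
        return "GOOD"
    if avg_improvement_pct >= 0:
        return "FAIR"
    return "POOR"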

Key Metrics:

Retriever: Recall@5, Recall@10, MAP, MRR
Reranker: Accuracy, Precision, Recall, F1
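
For reference, the retriever metrics are standard ranking measures. A minimal sketch of Recall@k and MRR (hypothetical helpers; the real implementations presumably live in the evaluation/ modules):

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results.
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant result; 0 if none is found.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0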

Output Files:

  • validation_summary.md - Main summary report
  • validation_results.json - Complete detailed results
  • comparison/ - Baseline comparison results
  • comprehensive/ - Detailed validation metrics

🛠️ Advanced Options

Single Model Testing:

# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model

Performance Tuning:

# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl

Benchmarking Only:

# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
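
For a rough manual throughput check outside the script, something like the following works (a sketch using FlagEmbedding; the batch size and sample count mirror the flags above):

import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("./output/model", use_fp16=True)
sentences = ["这是一条用于测速的示例文本。"] * 1000

start = time.perf_counter()
model.encode(sentences, batch_size=32)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")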

🎯 Integration with Your Workflow

Complete Training → Validation Pipeline:

# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"

🚨 Troubleshooting

Common Issues:

  1. "Model not found" → Check if training completed and model path exists
  2. "Out of memory" → Reduce --batch_size or use --max_samples
  3. "No test data" → Ensure you ran split_datasets.py first
  4. Import errors → Run from project root directory

Performance:

  • Slow validation → Use --max_samples 1000 for quick testing
  • High memory → Reduce batch size to 8-16
  • GPU not used → Check CUDA/device configuration

💡 Best Practices

  1. Always start with quick mode - Verify models work before deeper testing
  2. Use proper test/train splits - Don't validate on training data
  3. Compare against baselines - Know how much you actually improved
  4. Keep validation results - Track progress across different experiments
  5. Test with representative data - Use diverse test sets
  6. Monitor resource usage - Adjust batch sizes for your hardware

🎉 Benefits of New System

  • Single entry point - No more confusion about which script to use
  • Clear modes - quick, compare, comprehensive, all
  • Unified output - Consistent result formats and summaries
  • Better error handling - Clear error messages and troubleshooting
  • Integrated workflow - Works seamlessly with training scripts
  • Comprehensive reporting - Detailed summaries and recommendations
  • Performance aware - Built-in benchmarking and optimization

The validation system is now clear, powerful, and easy to use! 🚀


Need help? The validation system provides detailed error messages and suggestions. Check the generated validation_summary.md for specific recommendations!