🎯 BGE Model Validation System - Streamlined & Simplified
⚡ Quick Start - One Command for Everything
The old, overlapping validation scripts have been replaced with a single interface:
```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```
🧹 What We Cleaned Up
❌ REMOVED (Redundant/Confusing Files):
- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapped)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)
✅ NEW STREAMLINED SYSTEM:
- `scripts/validate.py` - Single entry point for all validation
- `scripts/compare_models.py` - Unified model comparison
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)
🚀 Complete Usage Examples
1. After Training - Quick Sanity Check
```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```
Time: 5 minutes | Purpose: Verify models work correctly
2. Measure Improvements - Baseline Comparison
```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```
Time: 15 minutes | Purpose: Quantify how much you improved vs. the baseline models
3. Thorough Testing - Comprehensive Validation
```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
Time: 30 minutes | Purpose: Detailed accuracy evaluation
4. Production Ready - Complete Suite
```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
Time: 1 hour | Purpose: Everything - ready for deployment
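The four modes above fit naturally into a subcommand-style CLI. The following is an illustrative sketch only, not the actual `scripts/validate.py` (whose real flags are shown in the examples above); the function name `build_parser` is hypothetical:

```python
import argparse

def build_parser():
    """Sketch of a mode-based validation CLI (illustrative only)."""
    parser = argparse.ArgumentParser(prog="validate.py")
    sub = parser.add_subparsers(dest="mode", required=True)
    # One subcommand per validation mode described above
    for mode in ("quick", "compare", "comprehensive", "all", "benchmark"):
        p = sub.add_parser(mode)
        p.add_argument("--retriever_model")
        p.add_argument("--reranker_model")
    return parser

args = build_parser().parse_args(["quick", "--retriever_model", "./m"])
print(args.mode, args.retriever_model)  # quick ./m
```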
📊 Understanding Your Results
Validation Status:
- 🌟 EXCELLENT - Significant improvements (>5% average)
- ✅ GOOD - Clear improvements (2-5% average)
- 👌 FAIR - Modest improvements (0-2% average)
- ❌ POOR - No improvement or degradation
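The status tiers above boil down to thresholds on the average improvement percentage. A minimal sketch of that mapping (illustrative; the real logic lives inside the validation scripts and may differ):

```python
def classify_improvement(avg_improvement_pct: float) -> str:
    """Map an average improvement percentage to a validation status.

    Thresholds mirror the tiers listed above; function name is hypothetical.
    """
    if avg_improvement_pct > 5.0:
        return "EXCELLENT"
    elif avg_improvement_pct >= 2.0:
        return "GOOD"
    elif avg_improvement_pct >= 0.0:
        return "FAIR"
    return "POOR"

print(classify_improvement(6.3))  # EXCELLENT
```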
Key Metrics:
Retriever: Recall@5, Recall@10, MAP, MRR
Reranker: Accuracy, Precision, Recall, F1
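To make the retriever metrics concrete, here is a minimal, self-contained sketch of Recall@k and MRR over a ranked list of document IDs (standard definitions, not the project's evaluation code):

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k ranking."""
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(ranking, relevant, 5))  # 1.0
print(mrr(ranking, relevant))             # 0.5
```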
Output Files:
- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics
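Since `validation_results.json` is plain JSON, it is easy to inspect programmatically, e.g. when tracking results across experiments. A short sketch (the exact schema depends on which modes were run, so treat any specific key names as hypothetical):

```python
import json

def summarize_results(path):
    """Return the top-level section names of a validation_results.json file.

    Hypothetical helper: the actual keys depend on which validation
    modes produced the file.
    """
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    return sorted(results.keys())
```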
🛠️ Advanced Options
Single Model Testing:
```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```
Performance Tuning:
```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```
Benchmarking Only:
```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```
🎯 Integration with Your Workflow
Complete Training → Validation Pipeline:
```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```
🚨 Troubleshooting
Common Issues:
- "Model not found" → Check if training completed and model path exists
- "Out of memory" → Reduce `--batch_size` or use `--max_samples`
- "No test data" → Ensure you ran `split_datasets.py` first
- Import errors → Run from project root directory
Performance:
- Slow validation → Use `--max_samples 1000` for quick testing
- High memory → Reduce batch size to 8-16
- GPU not used → Check CUDA/device configuration
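For the last point, a quick way to check the device configuration is `torch.cuda.is_available()`. A small helper that degrades gracefully when PyTorch is not installed (the function name `cuda_status` is hypothetical):

```python
def cuda_status():
    """Return a coarse device status string for troubleshooting.

    "cuda" means PyTorch sees a GPU, "cpu" means it does not,
    "torch-missing" means PyTorch is not importable at all.
    """
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(cuda_status())
```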
💡 Best Practices
- Always start with `quick` mode - Verify models work before deeper testing
- Use proper test/train splits - Don't validate on training data
- Compare against baselines - Know how much you actually improved
- Keep validation results - Track progress across different experiments
- Test with representative data - Use diverse test sets
- Monitor resource usage - Adjust batch sizes for your hardware
🎉 Benefits of New System
✅ Single entry point - No more confusion about which script to use
✅ Clear modes - `quick`, `compare`, `comprehensive`, `all`
✅ Unified output - Consistent result formats and summaries
✅ Better error handling - Clear error messages and troubleshooting
✅ Integrated workflow - Works seamlessly with training scripts
✅ Comprehensive reporting - Detailed summaries and recommendations
✅ Performance aware - Built-in benchmarking and optimization
The validation system is now clear, powerful, and easy to use! 🚀
📚 Related Documentation
- Training Guide - How to train BGE models
- Data Formats - Dataset format specifications
- Configuration - System configuration options
Need help? The validation system provides detailed error messages and suggestions. Check the generated validation_summary.md for specific recommendations!