BGE Fine-tuning Validation Guide
This guide provides comprehensive instructions for validating your fine-tuned BGE models to ensure they actually perform better than the baseline models.
🎯 Overview
After fine-tuning BGE models, it's crucial to validate that your models have actually improved. This guide covers multiple validation approaches:
- Quick Validation - Fast sanity checks
- Comprehensive Validation - Detailed performance analysis
- Comparison Benchmarks - Head-to-head baseline comparisons
- Pipeline Validation - End-to-end retrieval + reranking tests
- Statistical Analysis - Significance testing and detailed metrics
🚀 Quick Start
Option 1: Run Full Validation Suite (Recommended)
The easiest way to validate your models is using the validation suite:
# Auto-discover trained models and run complete validation
python scripts/run_validation_suite.py --auto-discover
# Or specify models explicitly
python scripts/run_validation_suite.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--reranker_model ./output/bge-reranker/final_model
This runs all validation tests and generates a comprehensive report.
Option 2: Quick Validation Only
To quickly check whether your models are working correctly:
python scripts/quick_validation.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--reranker_model ./output/bge-reranker/final_model \
--test_pipeline
📊 Validation Tools
1. Quick Validation (quick_validation.py)
Purpose: Fast sanity checks to verify models are working and show basic improvements.
Features:
- Tests model loading and basic functionality
- Uses predefined test queries in multiple languages
- Compares relevance scoring between baseline and fine-tuned models
- Tests pipeline performance (retrieval + reranking)
Usage:
# Test both models
python scripts/quick_validation.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--reranker_model ./output/bge-reranker/final_model \
--test_pipeline
# Test only retriever
python scripts/quick_validation.py \
--retriever_model ./output/bge-m3-enhanced/final_model
# Save results to file
python scripts/quick_validation.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--output_file ./results.json
Output: Console summary + optional JSON results file
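For a quick manual spot-check outside the script, a minimal sketch like the one below can be used, assuming the saved model directory is loadable with sentence-transformers (if your training pipeline saved it in another format, e.g. for FlagEmbedding, load it with that library instead); the query and documents are purely illustrative:
```python
# Hand-rolled sanity check: does the fine-tuned retriever separate a relevant
# passage from an irrelevant one more clearly than the baseline does?
from sentence_transformers import SentenceTransformer, util

query = "How do I reset my password?"
docs = [
    "Open Settings and choose 'Forgot password' to reset it.",  # relevant
    "Our office is closed on public holidays.",                 # irrelevant
]

for name, path in [("baseline", "BAAI/bge-m3"),
                   ("fine-tuned", "./output/bge-m3-enhanced/final_model")]:
    model = SentenceTransformer(path)
    q_emb = model.encode(query, normalize_embeddings=True)
    d_emb = model.encode(docs, normalize_embeddings=True)
    rel, irrel = util.cos_sim(q_emb, d_emb)[0].tolist()
    print(f"{name}: relevant={rel:.4f}, irrelevant={irrel:.4f}, margin={rel - irrel:.4f}")
```
A fine-tuned retriever should typically show a larger margin between relevant and irrelevant passages than the baseline on queries from your domain.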
2. Comprehensive Validation (comprehensive_validation.py)
Purpose: Detailed validation across multiple datasets with statistical analysis.
Features:
- Tests on all available datasets automatically
- Computes comprehensive metrics (Recall@K, Precision@K, MAP, MRR, NDCG)
- Statistical significance testing
- Performance degradation detection
- Generates HTML reports with visualizations
- Cross-dataset performance analysis
Usage:
# Comprehensive validation with auto-discovered datasets
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/bge-m3-enhanced/final_model \
--reranker_finetuned ./output/bge-reranker/final_model
# Use specific datasets
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/bge-m3-enhanced/final_model \
--test_datasets data/datasets/examples/embedding_data.jsonl \
data/datasets/Reranker_AFQMC/dev.json \
--output_dir ./validation_results
# Custom evaluation settings
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/bge-m3-enhanced/final_model \
--batch_size 16 \
--k_values 1 3 5 10 20 \
--retrieval_top_k 100 \
--rerank_top_k 10
Output:
- JSON results file
- HTML report with detailed analysis
- Performance comparison tables
- Statistical significance analysis
3. Model Comparison Scripts
Purpose: Head-to-head performance comparisons with baseline models.
Unified Model Comparison (compare_models.py)
# Compare retriever
python scripts/compare_models.py \
--model_type retriever \
--finetuned_model ./output/bge-m3-enhanced/final_model \
--baseline_model BAAI/bge-m3 \
--data_path data/datasets/examples/embedding_data.jsonl \
--batch_size 16 \
--max_samples 1000
# Compare reranker
python scripts/compare_models.py \
--model_type reranker \
--finetuned_model ./output/bge-reranker/final_model \
--baseline_model BAAI/bge-reranker-base \
--data_path data/datasets/examples/reranker_data.jsonl \
--batch_size 16 \
--max_samples 1000
# Compare both at once
python scripts/compare_models.py \
--model_type both \
--finetuned_retriever ./output/bge-m3-enhanced/final_model \
--finetuned_reranker ./output/bge-reranker/final_model \
--retriever_data data/datasets/examples/embedding_data.jsonl \
--reranker_data data/datasets/examples/reranker_data.jsonl
Output: Tab-separated comparison tables with metrics and performance deltas.
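Purely as an illustration of how the delta column relates to the raw metrics (the numbers below are placeholders, not real results):
```python
# Illustration only: how a baseline vs. fine-tuned delta column is derived.
baseline  = {"recall@10": 0.72, "mrr": 0.55, "ndcg@10": 0.61}   # placeholder values
finetuned = {"recall@10": 0.79, "mrr": 0.60, "ndcg@10": 0.66}   # placeholder values

print("metric\tbaseline\tfine-tuned\tdelta")
for metric, base in baseline.items():
    delta = (finetuned[metric] - base) / base * 100
    print(f"{metric}\t{base:.4f}\t{finetuned[metric]:.4f}\t{delta:+.1f}%")
```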
4. Validation Suite Runner (run_validation_suite.py)
Purpose: Orchestrates all validation tests in a systematic way.
Features:
- Auto-discovers trained models
- Runs multiple validation approaches
- Generates comprehensive reports
- Provides final verdict on model quality
Usage:
# Full auto-discovery and validation
python scripts/run_validation_suite.py --auto-discover
# Specify models and validation types
python scripts/run_validation_suite.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--reranker_model ./output/bge-reranker/final_model \
--validation_types quick comprehensive comparison
# Custom settings
python scripts/run_validation_suite.py \
--retriever_model ./output/bge-m3-enhanced/final_model \
--batch_size 32 \
--max_samples 500 \
--output_dir ./my_validation_results
Output:
- Multiple result files from each validation type
- HTML summary report
- Final verdict and recommendations
5. Validation Utilities (validation_utils.py)
Purpose: Helper utilities for validation preparation and analysis.
Features:
- Discover trained models in workspace
- Analyze available test datasets
- Create sample test data
- Analyze validation results
Usage:
# Discover trained models
python scripts/validation_utils.py --discover-models
# Analyze available test datasets
python scripts/validation_utils.py --analyze-datasets
# Create sample test data
python scripts/validation_utils.py --create-sample-data --output-dir ./test_data
# Analyze validation results
python scripts/validation_utils.py --analyze-results ./validation_results
📁 Understanding Results
Result Files
After running validation, you'll get several result files:
- validation_suite_report.html - Main HTML report with visual summary
- validation_results.json - Complete detailed results in JSON format
- quick_validation_results.json - Quick validation results
- *_comparison_*.txt - Model comparison tables
- Comprehensive validation folder with detailed metrics
Interpreting Results
Overall Verdict
- 🌟 EXCELLENT - Significant improvements across most metrics
- ✅ GOOD - Clear improvements with minor degradations
- 👌 FAIR - Mixed results, some improvements
- ⚠️ PARTIAL - Some validation tests failed
- ❌ POOR - More degradations than improvements
Key Metrics to Watch
For Retrievers:
- Recall@K - How many relevant documents are in top-K results
- MAP (Mean Average Precision) - Overall ranking quality
- MRR (Mean Reciprocal Rank) - Position of first relevant result
- NDCG@K - Normalized ranking quality with graded relevance
For Rerankers:
- Accuracy - Correct classification of relevant/irrelevant pairs
- MRR@K - Mean reciprocal rank in top-K results
- NDCG@K - Ranking quality with position discounting
For Pipeline:
- End-to-End Recall@K - Overall system performance
- Pipeline Accuracy - Correct top results after retrieval + reranking
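The project's actual implementations live in evaluation/metrics.py; as a reference for what these numbers mean, here is a minimal binary-relevance sketch of Recall@K, MRR, and NDCG@K:
```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal_dcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: five documents ranked by the retriever, two of them relevant
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 5),   # 1.0 (both relevant docs are in the top-5)
      round(mrr(ranked, relevant), 3),    # 0.5 (first relevant doc at rank 2)
      round(ndcg_at_k(ranked, relevant, 5), 3))
```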
Statistical Significance
The comprehensive validation includes statistical tests to determine whether improvements are significant (a minimal example of such a test follows the list below):
- Improvement % - Percentage improvement over baseline
- Statistical Significance - Whether improvements are statistically meaningful
- Effect Size - Magnitude of the improvement
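As an illustration of what such a test can look like, here is a generic paired bootstrap over per-query scores; it is not necessarily the exact procedure comprehensive_validation.py uses:
```python
# Generic paired bootstrap over per-query scores (e.g. per-query NDCG@10 for
# baseline vs. fine-tuned). Illustrative only; the score values below are made up.
import numpy as np

def paired_bootstrap(baseline_scores, finetuned_scores, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(finetuned_scores) - np.asarray(baseline_scores)
    wins = sum(rng.choice(diffs, size=len(diffs), replace=True).mean() > 0
               for _ in range(n_resamples))
    p_value = 1.0 - wins / n_resamples                        # one-sided: fine-tuned > baseline
    effect_size = diffs.mean() / (diffs.std(ddof=1) + 1e-12)  # Cohen's d for paired samples
    return p_value, effect_size

p, d = paired_bootstrap([0.61, 0.55, 0.70, 0.58], [0.66, 0.59, 0.71, 0.63])
print(f"p-value: {p:.4f}, effect size (Cohen's d): {d:.2f}")
```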
🔧 Troubleshooting
Common Issues
1. Models Not Found
❌ Retriever model not found: ./output/bge-m3-enhanced/final_model
Solution:
- Check that training completed successfully
- Verify the model path exists
- Use --discover-models to find available models
2. No Test Datasets Found
❌ No test datasets found!
Solution:
- Place test data in the data/datasets/ directory
- Use --create-sample-data to generate sample datasets
- Specify datasets explicitly with --test_datasets
3. Validation Failures
❌ Comprehensive validation failed
Solution:
- Check model compatibility (tokenizer issues)
- Reduce batch size if running out of memory
- Verify dataset format is correct
- Check logs for detailed error messages
4. Poor Performance Results
⚠️ More degradations than improvements
Analysis Steps:
- Check training data quality and format
- Verify hyperparameters (learning rate might be too high)
- Ensure sufficient training data and epochs
- Check for data leakage or format mismatches
- Consider different base models
Validation Best Practices
1. Before Running Validation
- ✅ Ensure training completed without errors
- ✅ Verify model files exist and are complete
- ✅ Check that test datasets are in correct format
- ✅ Have baseline models available for comparison
2. During Validation
- 📊 Start with quick validation for fast feedback
- 🔬 Run comprehensive validation on representative datasets
- 📈 Use multiple metrics to get a complete picture
- ⏱️ Be patient - comprehensive validation takes time
3. Analyzing Results
- 🎯 Focus on metrics most relevant to your use case
- 📊 Look at trends across multiple datasets
- ⚖️ Consider statistical significance, not just raw improvements
- 🔍 Investigate datasets where performance degraded
🎯 Validation Checklist
Use this checklist to ensure thorough validation:
Pre-Validation Setup
- Models trained and saved successfully
- Test datasets available and formatted correctly
- Baseline models accessible (BAAI/bge-m3, BAAI/bge-reranker-base)
- Sufficient compute resources for validation
Quick Validation (5-10 minutes)
- Run quick validation script
- Verify models load correctly
- Check basic performance improvements
- Test pipeline functionality (if both models available)
Comprehensive Validation (30-60 minutes)
- Run comprehensive validation script
- Test on multiple datasets
- Analyze statistical significance
- Review HTML report for detailed insights
Comparison Benchmarks (15-30 minutes)
- Run retriever comparison (if applicable)
- Run reranker comparison (if applicable)
- Analyze performance deltas
- Check throughput and memory usage
Final Analysis
- Review overall verdict and recommendations
- Identify best and worst performing datasets
- Check for any critical issues or degradations
- Document findings and next steps
🚀 Advanced Usage
Custom Test Datasets
Create your own test datasets for domain-specific validation:
// For retriever testing (embedding_format)
{"query": "Your query", "pos": ["relevant doc 1", "relevant doc 2"], "neg": ["irrelevant doc 1"]}
// For reranker testing (pairs format)
{"query": "Your query", "passage": "Document text", "label": 1}
{"query": "Your query", "passage": "Irrelevant text", "label": 0}
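A small sketch of generating such files programmatically (file names and contents are just examples):
```python
# Write tiny test files in the two formats shown above; paths and texts are
# illustrative. Place real files under data/datasets/ for the scripts to find them.
import json

retriever_rows = [
    {"query": "how to reset a password",
     "pos": ["Open Settings and choose 'Forgot password' to reset it."],
     "neg": ["Our office is closed on public holidays."]},
]
reranker_rows = [
    {"query": "how to reset a password",
     "passage": "Open Settings and choose 'Forgot password' to reset it.", "label": 1},
    {"query": "how to reset a password",
     "passage": "Our office is closed on public holidays.", "label": 0},
]

def write_jsonl(path, rows):
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

write_jsonl("my_retriever_test.jsonl", retriever_rows)
write_jsonl("my_reranker_test.jsonl", reranker_rows)
```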
Cross-Domain Validation
Test model generalization across different domains:
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/bge-m3-enhanced/final_model \
--test_datasets data/medical_qa.jsonl data/legal_docs.jsonl data/tech_support.jsonl
Performance Profiling
Monitor resource usage during validation:
# Reduce batch_size if you run into memory issues; cap max_samples for a faster run
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/bge-m3-enhanced/final_model \
--batch_size 8 \
--max_samples 100
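For a rough standalone measurement of retriever throughput and peak GPU memory, something along these lines works, assuming a CUDA device and a model directory loadable with sentence-transformers:
```python
# Rough throughput / peak-GPU-memory measurement for the retriever alone.
import time
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./output/bge-m3-enhanced/final_model", device="cuda")
texts = ["a sample passage about password resets"] * 256  # dummy workload

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
model.encode(texts, batch_size=8, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print(f"{len(texts) / elapsed:.1f} texts/s, "
      f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```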
A/B Testing Multiple Models
Compare different fine-tuning approaches:
# Test Model A
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/model_A/final_model \
--output_dir ./validation_A
# Test Model B
python scripts/comprehensive_validation.py \
--retriever_finetuned ./output/model_B/final_model \
--output_dir ./validation_B
# Compare results
python scripts/validation_utils.py --analyze-results ./validation_A
python scripts/validation_utils.py --analyze-results ./validation_B
📚 Additional Resources
- BGE Model Documentation: BAAI/bge
- Evaluation Metrics Guide: See evaluation/metrics.py for detailed metric implementations
- Dataset Format Guide: See data/datasets/examples/format_usage_examples.py
- Training Guide: See main README for training instructions
💡 Tips for Better Validation
- Use Multiple Datasets: Don't rely on a single test set
- Check Statistical Significance: Small improvements might not be meaningful
- Monitor Resource Usage: Ensure models are efficient enough for production
- Validate Domain Generalization: Test on data similar to production use case
- Document Results: Keep validation reports for future reference
- Iterate Based on Results: Use validation insights to improve training
🤝 Getting Help
If you encounter issues with validation:
- Check the troubleshooting section above
- Review the logs for detailed error messages
- Use --discover-models and --analyze-datasets to debug setup issues
- Start with quick validation to isolate problems
- Check model and dataset formats are correct
Remember: Good validation is key to successful fine-tuning. Take time to thoroughly test your models before deploying them!