# BGE Fine-tuning Validation Guide

This guide provides comprehensive instructions for validating your fine-tuned BGE models to ensure they actually perform better than the baseline models.

## 🎯 Overview

After fine-tuning BGE models, it's crucial to validate that your models have actually improved. This guide covers multiple validation approaches:

1. **Quick Validation** - Fast sanity checks
2. **Comprehensive Validation** - Detailed performance analysis
3. **Comparison Benchmarks** - Head-to-head baseline comparisons
4. **Pipeline Validation** - End-to-end retrieval + reranking tests
5. **Statistical Analysis** - Significance testing and detailed metrics

## 🚀 Quick Start

### Option 1: Run Full Validation Suite (Recommended)

The easiest way to validate your models is with the validation suite:

```bash
# Auto-discover trained models and run complete validation
python scripts/run_validation_suite.py --auto-discover

# Or specify models explicitly
python scripts/run_validation_suite.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --reranker_model ./output/bge-reranker/final_model
```

This runs all validation tests and generates a comprehensive report.

### Option 2: Quick Validation Only

For a fast check that your models load and work correctly:

```bash
python scripts/quick_validation.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --reranker_model ./output/bge-reranker/final_model \
    --test_pipeline
```

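
The `--test_pipeline` flag exercises the full two-stage flow: dense retrieval with the fine-tuned retriever, followed by cross-encoder reranking. Below is a minimal sketch of that flow, assuming the FlagEmbedding package is installed; the corpus, query, and model paths are purely illustrative:

```python
# Sketch of the retrieve-then-rerank pipeline that --test_pipeline exercises.
# Assumes `pip install FlagEmbedding numpy`; corpus and paths are illustrative.
import numpy as np
from FlagEmbedding import BGEM3FlagModel, FlagReranker

corpus = [
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our office is closed on public holidays.",
    "Billing questions are handled by the finance team.",
]
query = "How do I reset my password?"

retriever = BGEM3FlagModel("./output/bge-m3-enhanced/final_model", use_fp16=True)
reranker = FlagReranker("./output/bge-reranker/final_model", use_fp16=True)

# Stage 1: dense retrieval (dot product on normalized dense embeddings).
doc_vecs = retriever.encode(corpus)["dense_vecs"]
q_vec = retriever.encode([query])["dense_vecs"][0]
top_k = np.argsort(doc_vecs @ q_vec)[::-1][:2]

# Stage 2: rerank the retrieved candidates with the cross-encoder.
pairs = [[query, corpus[i]] for i in top_k]
scores = reranker.compute_score(pairs)
for i, s in sorted(zip(top_k, scores), key=lambda x: -x[1]):
    print(f"{s:+.3f}  {corpus[i]}")
```
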

## 📊 Validation Tools

### 1. Quick Validation (`quick_validation.py`)

**Purpose**: Fast sanity checks to verify models are working and show basic improvements.

**Features**:
- Tests model loading and basic functionality
- Uses predefined test queries in multiple languages
- Compares relevance scoring between baseline and fine-tuned models
- Tests pipeline performance (retrieval + reranking)

**Usage**:
```bash
# Test both models
python scripts/quick_validation.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --reranker_model ./output/bge-reranker/final_model \
    --test_pipeline

# Test only retriever
python scripts/quick_validation.py \
    --retriever_model ./output/bge-m3-enhanced/final_model

# Save results to file
python scripts/quick_validation.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --output_file ./results.json
```

**Output**: Console summary + optional JSON results file

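
To see what the relevance-scoring comparison amounts to, here is a minimal sketch of that kind of check, assuming the FlagEmbedding package; the query, documents, and model paths are illustrative:

```python
# Minimal sketch: compare baseline vs. fine-tuned relevance margins.
# Assumes `pip install FlagEmbedding`; query, documents, and paths are illustrative.
from FlagEmbedding import BGEM3FlagModel

query = "How do I reset my password?"
pos = "To reset your password, open Settings and choose 'Reset password'."
neg = "Our office is closed on public holidays."

def margin(model_path: str) -> float:
    """Return sim(query, pos) - sim(query, neg) for one model."""
    model = BGEM3FlagModel(model_path, use_fp16=True)
    q, p, n = model.encode([query, pos, neg])["dense_vecs"]  # normalized vectors
    return float(q @ p - q @ n)

baseline = margin("BAAI/bge-m3")
finetuned = margin("./output/bge-m3-enhanced/final_model")
print(f"baseline margin:   {baseline:+.4f}")
print(f"fine-tuned margin: {finetuned:+.4f} ({'better' if finetuned > baseline else 'worse'})")
```

A fine-tuned retriever should widen the positive-negative margin on queries from its training domain.
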
### 2. Comprehensive Validation (`comprehensive_validation.py`)

**Purpose**: Detailed validation across multiple datasets with statistical analysis.

**Features**:
- Tests on all available datasets automatically
- Computes comprehensive metrics (Recall@K, Precision@K, MAP, MRR, NDCG)
- Statistical significance testing
- Performance degradation detection
- Generates HTML reports with visualizations
- Cross-dataset performance analysis

**Usage**:
```bash
# Comprehensive validation with auto-discovered datasets
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/bge-m3-enhanced/final_model \
    --reranker_finetuned ./output/bge-reranker/final_model

# Use specific datasets
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/bge-m3-enhanced/final_model \
    --test_datasets data/datasets/examples/embedding_data.jsonl \
        data/datasets/Reranker_AFQMC/dev.json \
    --output_dir ./validation_results

# Custom evaluation settings
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/bge-m3-enhanced/final_model \
    --batch_size 16 \
    --k_values 1 3 5 10 20 \
    --retrieval_top_k 100 \
    --rerank_top_k 10
```

**Output**:
- JSON results file
- HTML report with detailed analysis
- Performance comparison tables
- Statistical significance analysis

### 3. Model Comparison Scripts

**Purpose**: Head-to-head performance comparisons with baseline models.

#### Unified Model Comparison (`compare_models.py`)

```bash
# Compare retriever
python scripts/compare_models.py \
    --model_type retriever \
    --finetuned_model ./output/bge-m3-enhanced/final_model \
    --baseline_model BAAI/bge-m3 \
    --data_path data/datasets/examples/embedding_data.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Compare reranker
python scripts/compare_models.py \
    --model_type reranker \
    --finetuned_model ./output/bge-reranker/final_model \
    --baseline_model BAAI/bge-reranker-base \
    --data_path data/datasets/examples/reranker_data.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Compare both at once
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge-m3-enhanced/final_model \
    --finetuned_reranker ./output/bge-reranker/final_model \
    --retriever_data data/datasets/examples/embedding_data.jsonl \
    --reranker_data data/datasets/examples/reranker_data.jsonl
```

**Output**: Tab-separated comparison tables with metrics and performance deltas.

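
For the reranker case, a head-to-head comparison boils down to scoring the same labeled pairs with both models. A minimal sketch of that idea, assuming FlagEmbedding's `FlagReranker` and the pairs format shown later in this guide; the decision threshold of 0 and the file path are assumptions:

```python
# Sketch: head-to-head reranker accuracy on labeled (query, passage) pairs.
# Assumes `pip install FlagEmbedding`; the threshold and file path are illustrative.
import json
from FlagEmbedding import FlagReranker

def accuracy(model_path: str, pairs_file: str, threshold: float = 0.0) -> float:
    reranker = FlagReranker(model_path, use_fp16=True)
    pairs, labels = [], []
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            pairs.append([ex["query"], ex["passage"]])
            labels.append(ex["label"])
    scores = reranker.compute_score(pairs)  # raw logits; > 0 suggests relevant
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

data = "data/datasets/examples/reranker_data.jsonl"
base = accuracy("BAAI/bge-reranker-base", data)
tuned = accuracy("./output/bge-reranker/final_model", data)
print(f"baseline: {base:.3f}  fine-tuned: {tuned:.3f}  delta: {tuned - base:+.3f}")
```
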
### 4. Validation Suite Runner (`run_validation_suite.py`)

**Purpose**: Orchestrates all validation tests in a systematic way.

**Features**:
- Auto-discovers trained models
- Runs multiple validation approaches
- Generates comprehensive reports
- Provides a final verdict on model quality

**Usage**:
```bash
# Full auto-discovery and validation
python scripts/run_validation_suite.py --auto-discover

# Specify models and validation types
python scripts/run_validation_suite.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --reranker_model ./output/bge-reranker/final_model \
    --validation_types quick comprehensive comparison

# Custom settings
python scripts/run_validation_suite.py \
    --retriever_model ./output/bge-m3-enhanced/final_model \
    --batch_size 32 \
    --max_samples 500 \
    --output_dir ./my_validation_results
```

**Output**:
- Multiple result files from each validation type
- HTML summary report
- Final verdict and recommendations

### 5. Validation Utilities (`validation_utils.py`)

**Purpose**: Helper utilities for validation preparation and analysis.

**Features**:
- Discovers trained models in the workspace
- Analyzes available test datasets
- Creates sample test data
- Analyzes validation results

**Usage**:
```bash
# Discover trained models
python scripts/validation_utils.py --discover-models

# Analyze available test datasets
python scripts/validation_utils.py --analyze-datasets

# Create sample test data
python scripts/validation_utils.py --create-sample-data --output-dir ./test_data

# Analyze validation results
python scripts/validation_utils.py --analyze-results ./validation_results
```

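
As a rough illustration, model discovery can be as simple as scanning the output directory for completed runs. This is a hypothetical sketch, not the actual logic in `validation_utils.py`; the `./output/**/final_model` layout is assumed from the paths used throughout this guide:

```python
# Hypothetical sketch of model discovery: find saved models under ./output.
# The directory layout is an assumption based on the examples in this guide.
from pathlib import Path

def discover_models(root: str = "./output"):
    """Yield directories that look like completed fine-tuning runs."""
    for cfg in Path(root).rglob("config.json"):
        if cfg.parent.name == "final_model":
            yield cfg.parent

for model_dir in discover_models():
    print(model_dir)
```
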
## 📁 Understanding Results

### Result Files

After running validation, you'll get several result files:

1. **`validation_suite_report.html`** - Main HTML report with visual summary
2. **`validation_results.json`** - Complete detailed results in JSON format
3. **`quick_validation_results.json`** - Quick validation results
4. **`*_comparison_*.txt`** - Model comparison tables
5. **Comprehensive validation folder** with detailed metrics

### Interpreting Results

#### Overall Verdict

- **🌟 EXCELLENT** - Significant improvements across most metrics
- **✅ GOOD** - Clear improvements with minor degradations
- **👌 FAIR** - Mixed results, some improvements
- **⚠️ PARTIAL** - Some validation tests failed
- **❌ POOR** - More degradations than improvements

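
To make the verdict concrete, here is a hypothetical sketch of how such a grade could be derived from per-metric deltas; the suite's actual thresholds may differ:

```python
# Hypothetical sketch: derive an overall verdict from per-metric deltas
# (fine-tuned minus baseline). The real thresholds live in the suite itself.
def verdict(deltas: list[float]) -> str:
    improved = sum(d > 0 for d in deltas)
    degraded = sum(d < 0 for d in deltas)
    if degraded == 0 and improved >= 0.8 * len(deltas):
        return "🌟 EXCELLENT"
    if improved > degraded and degraded <= 0.25 * len(deltas):
        return "✅ GOOD"
    if improved >= degraded:
        return "👌 FAIR"
    return "❌ POOR"

print(verdict([0.05, 0.02, 0.08, -0.01]))  # ✅ GOOD
```
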
#### Key Metrics to Watch

**For Retrievers**:
- **Recall@K** - How many relevant documents are in the top-K results
- **MAP (Mean Average Precision)** - Overall ranking quality
- **MRR (Mean Reciprocal Rank)** - Position of the first relevant result
- **NDCG@K** - Normalized ranking quality with graded relevance

**For Rerankers**:
- **Accuracy** - Correct classification of relevant/irrelevant pairs
- **MRR@K** - Mean reciprocal rank in the top-K results
- **NDCG@K** - Ranking quality with position discounting

**For Pipeline**:
- **End-to-End Recall@K** - Overall system performance
- **Pipeline Accuracy** - Correct top results after retrieval + reranking

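
For reference, these ranking metrics are straightforward to compute from a ranked list of document IDs and the set of relevant IDs. The project's actual implementations live in `evaluation/metrics.py`; this is just a compact sketch for binary relevance:

```python
# Compact reference implementations of common ranking metrics (binary relevance).
# See evaluation/metrics.py for the project's actual implementations.
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is found)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

ranked, relevant = ["d3", "d1", "d9"], {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5
print(mrr(ranked, relevant))               # 0.5 (first hit at rank 2)
print(round(ndcg_at_k(ranked, relevant, k=3), 3))
```
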
#### Statistical Significance

The comprehensive validation includes statistical tests to determine whether improvements are significant:

- **Improvement %** - Percentage improvement over baseline
- **Statistical Significance** - Whether improvements are statistically meaningful
- **Effect Size** - Magnitude of the improvement

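
A minimal sketch of what a paired significance test looks like on per-query scores, assuming SciPy is installed; the numbers below are purely illustrative, not real results:

```python
# Sketch: paired significance test on per-query metric values (e.g. NDCG@10).
# Assumes `pip install scipy`; the scores below are illustrative, not real results.
from scipy import stats

baseline  = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59, 0.70, 0.52]
finetuned = [0.68, 0.57, 0.75, 0.55, 0.65, 0.66, 0.74, 0.58]

# Wilcoxon signed-rank test: non-parametric and paired per query.
stat, p_value = stats.wilcoxon(finetuned, baseline)
mean_gain = sum(f - b for f, b in zip(finetuned, baseline)) / len(baseline)

print(f"mean gain: {mean_gain:+.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Improvement is statistically significant at the 5% level.")
```
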
## 🔧 Troubleshooting

### Common Issues

#### 1. Models Not Found

```
❌ Retriever model not found: ./output/bge-m3-enhanced/final_model
```

**Solution**:
- Check that training completed successfully
- Verify that the model path exists
- Use `--discover-models` to find available models

#### 2. No Test Datasets Found

```
❌ No test datasets found!
```

**Solution**:
- Place test data in the `data/datasets/` directory
- Use `--create-sample-data` to generate sample datasets
- Specify datasets explicitly with `--test_datasets`

#### 3. Validation Failures

```
❌ Comprehensive validation failed
```

**Solution**:
- Check model compatibility (tokenizer issues)
- Reduce the batch size if you run out of memory
- Verify that the dataset format is correct
- Check the logs for detailed error messages

#### 4. Poor Performance Results

```
⚠️ More degradations than improvements
```

**Analysis Steps**:
1. Check training data quality and format
2. Verify hyperparameters (the learning rate might be too high)
3. Ensure sufficient training data and epochs
4. Check for data leakage or format mismatches
5. Consider different base models

### Validation Best Practices

#### 1. Before Running Validation
- ✅ Ensure training completed without errors
- ✅ Verify that model files exist and are complete
- ✅ Check that test datasets are in the correct format
- ✅ Have baseline models available for comparison

#### 2. During Validation
- 📊 Start with quick validation for fast feedback
- 🔬 Run comprehensive validation on representative datasets
- 📈 Use multiple metrics to get a complete picture
- ⏱️ Be patient - comprehensive validation takes time

#### 3. Analyzing Results
- 🎯 Focus on the metrics most relevant to your use case
- 📊 Look at trends across multiple datasets
- ⚖️ Consider statistical significance, not just raw improvements
- 🔍 Investigate datasets where performance degraded

## 🎯 Validation Checklist

Use this checklist to ensure thorough validation:

### Pre-Validation Setup
- [ ] Models trained and saved successfully
- [ ] Test datasets available and formatted correctly
- [ ] Baseline models accessible (BAAI/bge-m3, BAAI/bge-reranker-base)
- [ ] Sufficient compute resources for validation

### Quick Validation (5-10 minutes)
- [ ] Run quick validation script
- [ ] Verify models load correctly
- [ ] Check basic performance improvements
- [ ] Test pipeline functionality (if both models available)

### Comprehensive Validation (30-60 minutes)
- [ ] Run comprehensive validation script
- [ ] Test on multiple datasets
- [ ] Analyze statistical significance
- [ ] Review HTML report for detailed insights

### Comparison Benchmarks (15-30 minutes)
- [ ] Run retriever comparison (if applicable)
- [ ] Run reranker comparison (if applicable)
- [ ] Analyze performance deltas
- [ ] Check throughput and memory usage

### Final Analysis
- [ ] Review overall verdict and recommendations
- [ ] Identify best and worst performing datasets
- [ ] Check for any critical issues or degradations
- [ ] Document findings and next steps

## 🚀 Advanced Usage

### Custom Test Datasets

Create your own test datasets for domain-specific validation. For retriever testing, use the embedding format:

```jsonl
{"query": "Your query", "pos": ["relevant doc 1", "relevant doc 2"], "neg": ["irrelevant doc 1"]}
```

For reranker testing, use the pairs format:

```jsonl
{"query": "Your query", "passage": "Document text", "label": 1}
{"query": "Your query", "passage": "Irrelevant text", "label": 0}
```

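
Before launching a long validation job, it can be worth checking that every line of a custom dataset parses and carries the required fields. A small sketch; the field names follow the formats above, and the file path is illustrative:

```python
# Sketch: sanity-check a custom JSONL test set before running validation.
# Field names follow the formats shown above; the file path is illustrative.
import json

REQUIRED_FIELDS = {
    "embedding": {"query", "pos", "neg"},
    "pairs": {"query", "passage", "label"},
}

def check_jsonl(path: str, fmt: str) -> None:
    required = REQUIRED_FIELDS[fmt]
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            example = json.loads(line)  # raises on malformed JSON
            missing = required - example.keys()
            if missing:
                raise ValueError(f"{path}:{lineno} is missing fields {missing}")
    print(f"{path}: OK ({fmt} format)")

check_jsonl("data/datasets/examples/embedding_data.jsonl", "embedding")
```
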
### Cross-Domain Validation

Test model generalization across different domains:

```bash
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/bge-m3-enhanced/final_model \
    --test_datasets data/medical_qa.jsonl data/legal_docs.jsonl data/tech_support.jsonl
```

### Performance Profiling

Monitor resource usage during validation; reduce the batch size or sample count if you hit memory or time limits:

```bash
# Reduce --batch_size if memory is tight; cap --max_samples for speed.
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/bge-m3-enhanced/final_model \
    --batch_size 8 \
    --max_samples 100
```

### A/B Testing Multiple Models

Compare different fine-tuning approaches:

```bash
# Test Model A
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/model_A/final_model \
    --output_dir ./validation_A

# Test Model B
python scripts/comprehensive_validation.py \
    --retriever_finetuned ./output/model_B/final_model \
    --output_dir ./validation_B

# Compare results
python scripts/validation_utils.py --analyze-results ./validation_A
python scripts/validation_utils.py --analyze-results ./validation_B
```

## 📚 Additional Resources

- **BGE Model Documentation**: [BAAI/bge](https://github.com/FlagOpen/FlagEmbedding)
- **Evaluation Metrics Guide**: See `evaluation/metrics.py` for detailed metric implementations
- **Dataset Format Guide**: See `data/datasets/examples/format_usage_examples.py`
- **Training Guide**: See the main README for training instructions

## 💡 Tips for Better Validation

1. **Use Multiple Datasets**: Don't rely on a single test set
2. **Check Statistical Significance**: Small improvements might not be meaningful
3. **Monitor Resource Usage**: Ensure models are efficient enough for production
4. **Validate Domain Generalization**: Test on data similar to your production use case
5. **Document Results**: Keep validation reports for future reference
6. **Iterate Based on Results**: Use validation insights to improve training

## 🤝 Getting Help

If you encounter issues with validation:

1. Check the troubleshooting section above
2. Review the logs for detailed error messages
3. Use `--discover-models` and `--analyze-datasets` to debug setup issues
4. Start with quick validation to isolate problems
5. Check that model and dataset formats are correct

Remember: Good validation is key to successful fine-tuning. Take time to thoroughly test your models before deploying them!