# 🎯 BGE Model Validation System - Streamlined & Simplified
## ⚡ **Quick Start - One Command for Everything**
The confusing collection of separate validation scripts has been **completely replaced** with one simple interface:
```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```
## 🧹 **What We Cleaned Up**
### **❌ REMOVED (Redundant/Confusing Files):**
- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapped)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)
### **✅ NEW STREAMLINED SYSTEM:**
- **`scripts/validate.py`** - **Single entry point for all validation**
- **`scripts/compare_models.py`** - **Unified model comparison**
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)
---
## 🚀 **Complete Usage Examples**
### **1. After Training - Quick Sanity Check**
```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```
**Time**: 5 minutes | **Purpose**: Verify models work correctly
### **2. Measure Improvements - Baseline Comparison**
```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```
**Time**: 15 minutes | **Purpose**: Quantify how much you improved over the baselines
### **3. Thorough Testing - Comprehensive Validation**
```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation
### **4. Production Ready - Complete Suite**
```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 1 hour | **Purpose**: Everything - ready for deployment
---
## 📊 **Understanding Your Results**
### **Validation Status:**
- **🌟 EXCELLENT** - Significant improvements (>5% average)
- **✅ GOOD** - Clear improvements (2-5% average)
- **👌 FAIR** - Modest improvements (0-2% average)
- **❌ POOR** - No improvement or degradation
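For reference, the status is just a thresholding of the average metric improvement. A minimal sketch of that mapping (illustrative only, not the actual implementation inside `scripts/validate.py`):
```python
def validation_status(avg_improvement_pct: float) -> str:
    """Map the average metric improvement (in %) to a status tier.

    Illustrative sketch of the thresholds listed above; the real
    logic lives in the validation scripts.
    """
    if avg_improvement_pct > 5:
        return "🌟 EXCELLENT"
    if avg_improvement_pct >= 2:
        return "✅ GOOD"
    if avg_improvement_pct > 0:
        return "👌 FAIR"
    return "❌ POOR"
```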
### **Key Metrics:**
- **Retriever**: Recall@5, Recall@10, MAP, MRR
- **Reranker**: Accuracy, Precision, Recall, F1
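If any of these are unfamiliar, here is a toy computation of the two rank-based retriever metrics for a single query (standalone helpers for illustration, not part of the `evaluation/` modules; the reported MRR averages the per-query value over all test queries):
```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document (0 if none was retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]  # retriever output, best first
relevant = {"d1", "d4"}                  # gold labels for this query
print(recall_at_k(ranked, relevant, 5))  # 1.0  - both relevant docs in the top 5
print(reciprocal_rank(ranked, relevant)) # 0.33 - first relevant doc at rank 3
```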
### **Output Files:**
- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics
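All of these are plain Markdown/JSON files, so they are easy to post-process. A hypothetical snippet for inspecting `validation_results.json` (the exact keys depend on which modes you ran, so this just lists the top-level sections):
```python
import json

# Hypothetical inspection of validation_results.json; the path and key
# names depend on your run, so adjust before relying on this.
with open("validation_results.json") as f:
    results = json.load(f)

# List the top-level sections the run produced.
for section, payload in results.items():
    print(f"{section}: {type(payload).__name__}")
```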
---
## 🛠️ **Advanced Options**
### **Single Model Testing:**
```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```
### **Performance Tuning:**
```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```
### **Benchmarking Only:**
```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```
---
## 🎯 **Integration with Your Workflow**
### **Complete Training → Validation Pipeline:**
```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```
---
## 🚨 **Troubleshooting**
### **Common Issues:**
1. **"Model not found"** → Check if training completed and model path exists
2. **"Out of memory"** → Reduce `--batch_size` or use `--max_samples`
3. **"No test data"** → Ensure you ran `split_datasets.py` first
4. **Import errors** → Run from project root directory
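For issue 1, a quick way to confirm that training actually produced a final model (the path is an example; substitute the `--output_dir` you used for training):
```python
from pathlib import Path

# Example path - substitute the --output_dir you used for training.
model_dir = Path("./output/bge_m3_三国演义/final_model")
if model_dir.exists():
    print("Model files:", sorted(p.name for p in model_dir.iterdir()))
else:
    print(f"{model_dir} does not exist - did training finish?")
```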
### **Performance:**
- **Slow validation** → Use `--max_samples 1000` for quick testing
- **High memory** → Reduce `--batch_size` to 8-16
- **GPU not used** → Check your CUDA/device configuration (see the PyTorch check below)
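For the GPU issue, a quick sanity check that PyTorch can actually see your device (assuming a PyTorch install, which the BGE models require):
```python
import torch

# Confirm PyTorch sees a CUDA device before digging into script configuration.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```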
---
## 💡 **Best Practices**
1. **Always start with `quick` mode** - Verify models work before deeper testing
2. **Use proper test/train splits** - Don't validate on training data
3. **Compare against baselines** - Know how much you actually improved
4. **Keep validation results** - Track progress across different experiments
5. **Test with representative data** - Use diverse test sets
6. **Monitor resource usage** - Adjust batch sizes for your hardware
---
## 🎉 **Benefits of New System**
- **Single entry point** - No more confusion about which script to use
- **Clear modes** - `quick`, `compare`, `comprehensive`, `all`
- **Unified output** - Consistent result formats and summaries
- **Better error handling** - Clear error messages and troubleshooting
- **Integrated workflow** - Works seamlessly with training scripts
- **Comprehensive reporting** - Detailed summaries and recommendations
- **Performance aware** - Built-in benchmarking and optimization
The validation system is now **clear, powerful, and easy to use**! 🚀
---
## 📚 **Related Documentation**
- [Training Guide](./docs/usage_guide.md) - How to train BGE models
- [Data Formats](./docs/data_formats.md) - Dataset format specifications
- [Configuration](./config.toml) - System configuration options
**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!