# 🎯 BGE Model Validation System - Streamlined & Simplified

## ⚡ **Quick Start - One Command for Everything**

The previous set of overlapping validation scripts has been **replaced** with a single interface:

```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```

## 🧹 **What We Cleaned Up**

### **❌ REMOVED (Redundant/Confusing Files):**
- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapped)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)

### **✅ NEW STREAMLINED SYSTEM:**
- **`scripts/validate.py`** - **Single entry point for all validation**
- **`scripts/compare_models.py`** - **Unified model comparison**
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)

---

## 🚀 **Complete Usage Examples**

### **1. After Training - Quick Sanity Check**
```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```
**Time**: 5 minutes | **Purpose**: Verify models work correctly

### **2. Measure Improvements - Baseline Comparison**
```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```
**Time**: 15 minutes | **Purpose**: Quantify how much you improved vs baseline

### **3. Thorough Testing - Comprehensive Validation**
```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation

### **4. Production Ready - Complete Suite**
```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 1 hour | **Purpose**: Everything - ready for deployment

---

## 📊 **Understanding Your Results**

### **Validation Status:**
- **🌟 EXCELLENT** - Significant improvements (>5% average)
- **✅ GOOD** - Clear improvements (2-5% average)
- **👌 FAIR** - Modest improvements (0-2% average)
- **❌ POOR** - No improvement or degradation

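The status bands above can be read as a simple threshold function. The sketch below is illustrative only: `validation_status` is a hypothetical helper, not a function from this repository, and the handling of the exact boundaries (5%, 2%, 0%) is an assumption, since the bands do not say which side each boundary falls on. How the average improvement itself is computed is not specified here, so it is taken as an input.

```python
def validation_status(avg_improvement_pct: float) -> str:
    """Map an average improvement over baseline (in percent) to a status label.

    Assumed boundary handling (not stated in the README):
    exactly 5% and exactly 2% count as GOOD, exactly 0% counts as POOR.
    """
    if avg_improvement_pct > 5.0:
        return "EXCELLENT"
    if avg_improvement_pct >= 2.0:
        return "GOOD"
    if avg_improvement_pct > 0.0:
        return "FAIR"
    return "POOR"  # no improvement or degradation


# Example: a 3.2% average gain falls in the GOOD band.
print(validation_status(3.2))
```
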
### **Key Metrics:**
- **Retriever**: Recall@5, Recall@10, MAP, MRR
- **Reranker**: Accuracy, Precision, Recall, F1

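For readers who want to sanity-check the reported numbers, here is a minimal sketch of the standard textbook definitions of these metrics. It is not the code in `evaluation/`; the function names are hypothetical, document IDs in a ranking are assumed to be unique, and reranker predictions are assumed to be already thresholded to 0/1 labels.

```python
from typing import Dict, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all test queries.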
### **Output Files:**
- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics

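To inspect `validation_results.json` programmatically, something like the following sketch may help. The JSON schema is not documented here, so the code assumes only that the file holds a JSON object and makes no assumption about key names; `output_dir` is a placeholder for wherever your run wrote its results, and both helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Tuple


def load_results(output_dir: str) -> dict:
    """Load validation_results.json from a validation output directory."""
    path = Path(output_dir) / "validation_results.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def flatten_numeric(node: Any, prefix: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (dotted_key, value) for every numeric leaf in nested dicts.

    Lists, strings and booleans are skipped; no key names are assumed.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_numeric(value, f"{prefix}{key}.")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix[:-1], node


# Usage (path is a placeholder):
# for name, value in flatten_numeric(load_results("./validation_output")):
#     print(f"{name}: {value}")
```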
---

## 🛠️ **Advanced Options**

### **Single Model Testing:**
```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```

### **Performance Tuning:**
```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```

### **Benchmarking Only:**
```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```

---

## 🎯 **Integration with Your Workflow**

### **Complete Training → Validation Pipeline:**

```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```

---

## 🚨 **Troubleshooting**

### **Common Issues:**
1. **"Model not found"** → Check that training completed and that the model path exists
2. **"Out of memory"** → Reduce `--batch_size` or limit the data with `--max_samples`
3. **"No test data"** → Make sure you ran `scripts/split_datasets.py` first
4. **Import errors** → Run the scripts from the project root directory

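Most of the issues above can be caught before starting a long run. The following is a minimal pre-flight sketch under stated assumptions: `preflight_problems` and its `project_root` parameter are hypothetical and not part of `scripts/validate.py`, model paths are assumed to be local directories (a Hugging Face Hub ID such as `BAAI/bge-m3` would be flagged as missing), and the test file names are the ones used in the examples in this README.

```python
from pathlib import Path
from typing import List, Optional


def preflight_problems(
    retriever_model: Optional[str] = None,
    reranker_model: Optional[str] = None,
    test_data_dir: Optional[str] = None,
    project_root: str = ".",
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []

    # "Model not found": each model that was given must be an existing directory.
    for label, model_path in (("retriever", retriever_model), ("reranker", reranker_model)):
        if model_path is not None and not Path(model_path).is_dir():
            problems.append(f"{label} model directory not found: {model_path}")

    # "No test data": look for the split files used in this README's examples.
    if test_data_dir is not None:
        for name in ("m3_test.jsonl", "reranker_test.jsonl"):
            candidate = Path(test_data_dir) / name
            if not candidate.is_file():
                problems.append(f"test data file missing: {candidate}")

    # Import errors: the scripts expect to be launched from the project root.
    if not (Path(project_root) / "scripts" / "validate.py").is_file():
        problems.append("scripts/validate.py not found; run from the project root")

    return problems
```

Calling `preflight_problems("./output/model", "./output/model", "./test_data")` from the project root and printing the returned list gives a quick checklist before launching `validate.py`.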
### **Performance:**
- **Slow validation** → Use `--max_samples 1000` for quick testing
- **High memory** → Reduce batch size to 8-16
- **GPU not used** → Check CUDA/device configuration

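For the "GPU not used" case, a quick device check can narrow things down. This sketch assumes the validation scripts run on PyTorch, which this README does not state explicitly; `describe_device` is a hypothetical helper.

```python
def describe_device() -> str:
    """Report whether PyTorch can see a CUDA device in this environment."""
    try:
        import torch  # assumption: the validation scripts use PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if not torch.cuda.is_available():
        return f"CUDA not available (torch {torch.__version__}); running on CPU"

    index = torch.cuda.current_device()
    return f"CUDA available: {torch.cuda.get_device_name(index)}"


print(describe_device())
```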
---

## 💡 **Best Practices**

1. **Always start with `quick` mode** - Verify models work before deeper testing
2. **Use proper test/train splits** - Don't validate on training data
3. **Compare against baselines** - Know how much you actually improved
4. **Keep validation results** - Track progress across different experiments
5. **Test with representative data** - Use diverse test sets
6. **Monitor resource usage** - Adjust batch sizes for your hardware

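One way to check practice 2 is to look for records shared between the train and test splits. The sketch below assumes the split files are JSON Lines (one JSON object per line, as the `.jsonl` extension suggests) and detects only exact duplicate records, not near-duplicates or shared queries with different passages. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Set


def canonical_records(jsonl_path: str) -> Set[str]:
    """Read a JSONL file and return each record as a key-sorted JSON string."""
    records: Set[str] = set()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                record = json.loads(line)
                records.add(json.dumps(record, sort_keys=True, ensure_ascii=False))
    return records


def count_overlap(train_path: str, test_path: str) -> int:
    """Number of records that appear verbatim in both files."""
    return len(canonical_records(train_path) & canonical_records(test_path))


# Usage (paths follow the examples in this README):
# leaked = count_overlap("data/datasets/三国演义/splits/m3_train.jsonl",
#                        "data/datasets/三国演义/splits/m3_test.jsonl")
# print(f"{leaked} test records also appear in the training split")
```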
---

## 🎉 **Benefits of New System**

- ✅ **Single entry point** - No more confusion about which script to use
- ✅ **Clear modes** - `quick`, `compare`, `comprehensive`, `benchmark`, `all`
- ✅ **Unified output** - Consistent result formats and summaries
- ✅ **Better error handling** - Clear error messages and troubleshooting
- ✅ **Integrated workflow** - Works seamlessly with training scripts
- ✅ **Comprehensive reporting** - Detailed summaries and recommendations
- ✅ **Performance aware** - Built-in benchmarking and optimization

The validation system is now **clear, powerful, and easy to use**! 🚀

---

## 📚 **Related Documentation**

- [Training Guide](./docs/usage_guide.md) - How to train BGE models
- [Data Formats](./docs/data_formats.md) - Dataset format specifications
- [Configuration](./config.toml) - System configuration options

**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!