Revised for training
README_VALIDATION.md
# 🎯 BGE Model Validation System - Streamlined & Simplified

## ⚡ **Quick Start - One Command for Everything**

The previous set of overlapping validation scripts has been **completely replaced** by a single interface:

```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```

## 🧹 **What We Cleaned Up**

### **❌ REMOVED (Redundant/Confusing Files):**
- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapped)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)

### **✅ NEW STREAMLINED SYSTEM:**
- **`scripts/validate.py`** - **Single entry point for all validation**
- **`scripts/compare_models.py`** - **Unified model comparison**
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)

---

## 🚀 **Complete Usage Examples**

### **1. After Training - Quick Sanity Check**
```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```
**Time**: 5 minutes | **Purpose**: Verify models work correctly

### **2. Measure Improvements - Baseline Comparison**
```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```
**Time**: 15 minutes | **Purpose**: Quantify the improvement over the baseline models

### **3. Thorough Testing - Comprehensive Validation**
```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation

### **4. Production Ready - Complete Suite**
```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 1 hour | **Purpose**: Run every validation mode before deployment

---

## 📊 **Understanding Your Results**

### **Validation Status:**
- **🌟 EXCELLENT** - Significant improvements (>5% average)
- **✅ GOOD** - Clear improvements (2-5% average)
- **👌 FAIR** - Modest improvements (0-2% average)
- **❌ POOR** - No improvement or degradation
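
The status bands above can be read as a simple mapping from average improvement to a label. The following is a minimal sketch, not the logic in `scripts/validate.py`: `classify_status` is a hypothetical helper, the input is assumed to be the average improvement over the baseline in percent, and the treatment of the exact boundary values (2% and 5%) is an assumption because the list does not specify it.

```python
def classify_status(avg_improvement_pct: float) -> str:
    """Map an average improvement (in percent) to a validation status.

    Assumption: exactly 5% counts as GOOD (the list says ">5%" for EXCELLENT)
    and exactly 2% counts as GOOD; the README does not define the boundaries.
    """
    if avg_improvement_pct > 5.0:
        return "EXCELLENT"
    if avg_improvement_pct >= 2.0:
        return "GOOD"
    if avg_improvement_pct > 0.0:
        return "FAIR"
    # Zero or negative change: no improvement or degradation.
    return "POOR"


# Illustrative inputs only, not measured results.
for value in (7.5, 3.0, 0.4, -1.2):
    print(f"{value:+.1f}% -> {classify_status(value)}")
```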

### **Key Metrics:**
- **Retriever**: Recall@5, Recall@10, MAP, MRR
- **Reranker**: Accuracy, Precision, Recall, F1
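
For readers unfamiliar with these metrics, here is a small sketch of their standard textbook definitions for a single query (retriever) and for binary labels (reranker). It is illustrative only: the function names are hypothetical, and the project's `evaluation/` modules may differ in details such as tie handling or how scores are averaged. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant document appears.

    Assumes ranked_ids contains no duplicate document IDs.
    """
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = relevant, 0 = not relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```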

### **Output Files:**
- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics
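
If you want to inspect `validation_results.json` programmatically, a sketch along these lines may help. The schema of the file is not documented here, so the code makes no assumption about key names: it simply walks whatever JSON structure it finds and lists every numeric value with a dotted path. `load_results` and `numeric_leaves` are hypothetical helpers, not part of the validation scripts.

```python
import json
from pathlib import Path


def load_results(output_dir):
    """Load validation_results.json from the given output directory."""
    path = Path(output_dir) / "validation_results.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def numeric_leaves(node, prefix=""):
    """Yield (path, value) for every number nested anywhere in the JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from numeric_leaves(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        yield prefix, node
```

For example, `for path, value in numeric_leaves(load_results("./my_output_dir")): print(path, value)` prints one line per metric, where `./my_output_dir` stands for wherever your run wrote its results.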

---

## 🛠️ **Advanced Options**

### **Single Model Testing:**
```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```


### **Performance Tuning:**
```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```


### **Benchmarking Only:**
```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```

---

## 🎯 **Integration with Your Workflow**

### **Complete Training → Validation Pipeline:**

```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```

---

## 🚨 **Troubleshooting**

### **Common Issues:**
1. **"Model not found"** → Check that training completed and the model path exists
2. **"Out of memory"** → Reduce `--batch_size` or use `--max_samples`
3. **"No test data"** → Ensure you ran `split_datasets.py` first
4. **Import errors** → Run from the project root directory
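
The first and third issues can be caught before launching a long run. The sketch below is a hypothetical pre-flight check, not something `scripts/validate.py` is known to do: it only verifies that local model directories exist and are non-empty and that the test files are present. It would wrongly flag Hugging Face Hub IDs such as `BAAI/bge-m3`, which are not local paths.

```python
from pathlib import Path


def preflight(model_dirs, data_files):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for model_dir in model_dirs:
        path = Path(model_dir)
        if not path.is_dir():
            problems.append(f"Model not found: {path}")
        elif not any(path.iterdir()):
            # An empty directory usually means training did not finish saving.
            problems.append(f"Model directory is empty: {path}")
    for data_file in data_files:
        path = Path(data_file)
        if not path.is_file():
            problems.append(f"No test data: {path} (was split_datasets.py run?)")
    return problems
```

Calling `preflight(["./output/bge_m3/final_model"], ["./test_data/m3_test.jsonl"])` and printing the returned list gives a quick view of what is missing.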

### **Performance:**
- **Slow validation** → Use `--max_samples 1000` for quick testing
- **High memory** → Reduce batch size to 8-16
- **GPU not used** → Check CUDA/device configuration
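
The advice to reduce the batch size can also be automated. The following sketch shows the general back-off idea under stated assumptions: `run_batch` is a hypothetical callable that processes one batch and raises `MemoryError` when it does not fit. On a GPU you would catch your framework's out-of-memory exception instead. Nothing here reflects how the validation scripts handle memory internally.

```python
def run_with_backoff(run_batch, items, batch_size, min_batch_size=1):
    """Process items in batches, halving the batch size after each MemoryError.

    Returns the collected results and the batch size that finally worked.
    """
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # Cannot shrink any further; surface the error.
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # Retry the same position with the smaller batch.
        start += len(batch)
    return results, batch_size
```

Starting from a generous batch size and letting the loop settle is convenient for one-off validation runs; for repeated runs it is cheaper to pass the size that worked via `--batch_size` directly.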

---

## 💡 **Best Practices**

1. **Always start with `quick` mode** - Verify models work before deeper testing
2. **Use proper test/train splits** - Don't validate on training data
3. **Compare against baselines** - Know how much you actually improved
4. **Keep validation results** - Track progress across different experiments
5. **Test with representative data** - Use diverse test sets
6. **Monitor resource usage** - Adjust batch sizes for your hardware
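
Practice 2 can be spot-checked by looking for queries that appear in both the training and the test split. This is a minimal sketch with an important assumption: each JSONL line is taken to be an object with a `query` field. That key name is a guess for illustration; check `docs/data_formats.md` for the actual format and pass a different `key` if needed. Both helpers are hypothetical.

```python
import json


def load_queries(path, key="query"):
    """Collect the value of `key` from every non-empty line of a JSONL file."""
    queries = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                queries.add(json.loads(line)[key])
    return queries


def leaked_queries(train_path, test_path, key="query"):
    """Return the queries that occur in both the training and the test file."""
    return load_queries(train_path, key) & load_queries(test_path, key)
```

An empty result from `leaked_queries(train_file, test_file)` does not prove the splits are clean (near-duplicates are not detected), but a non-empty one is a clear sign that validation scores will be inflated.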

---

## 🎉 **Benefits of New System**

- ✅ **Single entry point** - No more confusion about which script to use
- ✅ **Clear modes** - `quick`, `compare`, `comprehensive`, `benchmark`, `all`
- ✅ **Unified output** - Consistent result formats and summaries
- ✅ **Better error handling** - Clear error messages and troubleshooting
- ✅ **Integrated workflow** - Works seamlessly with training scripts
- ✅ **Comprehensive reporting** - Detailed summaries and recommendations
- ✅ **Performance aware** - Built-in benchmarking and optimization

The validation system is now **clear, powerful, and easy to use**! 🚀

---

## 📚 **Related Documentation**

- [Training Guide](./docs/usage_guide.md) - How to train BGE models
- [Data Formats](./docs/data_formats.md) - Dataset format specifications
- [Configuration](./config.toml) - System configuration options

**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!