Revised for training

2025-07-23 14:54:46 +08:00
parent 6dbd2f3281
commit 229f6bb027
32 changed files with 59884 additions and 11081 deletions
--- a/README_VALIDATION.md
+++ b/README_VALIDATION.md
@@ -0,0 +1,223 @@
+# 🎯 BGE Model Validation System - Streamlined & Simplified
+
+## ⚡ **Quick Start - One Command for Everything**
+
+The separate, overlapping validation scripts have been **replaced** by a single interface:
+
+```bash
+# Quick check (5 minutes) - Did my training work?
+python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model
+
+# Compare with baselines - How much did I improve?
+python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
+    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl
+
+# Complete validation suite (1 hour) - Is my model production-ready?
+python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
+    --test_data_dir ./test_data
+```
+
+## 🧹 **What We Cleaned Up**
+
+### **❌ REMOVED (Redundant/Confusing Files):**
+- `scripts/validate_m3.py` - Simple validation (redundant)
+- `scripts/validate_reranker.py` - Simple validation (redundant)
+- `scripts/evaluate.py` - Main evaluation (overlapped with the other scripts)
+- `scripts/compare_retriever.py` - Retriever comparison (merged)
+- `scripts/compare_reranker.py` - Reranker comparison (merged)
+
+### **✅ NEW STREAMLINED SYSTEM:**
+- **`scripts/validate.py`** - **Single entry point for all validation**
+- **`scripts/compare_models.py`** - **Unified model comparison**
+- `scripts/quick_validation.py` - Improved quick validation
+- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
+- `scripts/benchmark.py` - Performance benchmarking
+- `evaluation/` - Core evaluation modules (kept clean)
+
+---
+
+## 🚀 **Complete Usage Examples**
+
+### **1. After Training - Quick Sanity Check**
+```bash
+python scripts/validate.py quick \
+    --retriever_model ./output/bge_m3_三国演义/final_model \
+    --reranker_model ./output/bge_reranker_三国演义/final_model
+```
+**Time**: 5 minutes | **Purpose**: Verify models work correctly
+
+### **2. Measure Improvements - Baseline Comparison**  
+```bash
+python scripts/validate.py compare \
+    --retriever_model ./output/bge_m3_三国演义/final_model \
+    --reranker_model ./output/bge_reranker_三国演义/final_model \
+    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
+    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
+```
+**Time**: 15 minutes | **Purpose**: Quantify the improvement over the baseline models
+
+### **3. Thorough Testing - Comprehensive Validation**
+```bash
+python scripts/validate.py comprehensive \
+    --retriever_model ./output/bge_m3_三国演义/final_model \
+    --reranker_model ./output/bge_reranker_三国演义/final_model \
+    --test_data_dir "data/datasets/三国演义/splits"
+```
+**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation
+
+### **4. Production Ready - Complete Suite**
+```bash
+python scripts/validate.py all \
+    --retriever_model ./output/bge_m3_三国演义/final_model \
+    --reranker_model ./output/bge_reranker_三国演义/final_model \
+    --test_data_dir "data/datasets/三国演义/splits"
+```
+**Time**: 1 hour | **Purpose**: Run the complete suite as a pre-deployment check
+
+---
+
+## 📊 **Understanding Your Results**
+
+### **Validation Status:**
+- **🌟 EXCELLENT** - Significant improvements (>5% average)
+- **✅ GOOD** - Clear improvements (2-5% average)
+- **👌 FAIR** - Modest improvements (0-2% average)  
+- **❌ POOR** - No improvement or degradation
+
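The status thresholds above can be read as a mapping from average improvement to a label. The sketch below is illustrative only and is not taken from `scripts/validate.py`: both function names are hypothetical, the average is assumed to be a mean relative gain over the baseline, and the boundary handling (exactly 2% and exactly 5% count as GOOD, exactly 0% counts as POOR) is an assumption, since the list does not say which side the boundaries fall on.

```python
def average_improvement(baseline: dict, finetuned: dict) -> float:
    """Mean relative improvement (in percent) over the metrics both dicts share.

    Assumption: improvement is measured relative to the baseline value.
    """
    shared = [m for m in baseline if m in finetuned and baseline[m] > 0]
    if not shared:
        raise ValueError("no shared metrics with a positive baseline value")
    gains = [(finetuned[m] - baseline[m]) / baseline[m] * 100.0 for m in shared]
    return sum(gains) / len(gains)


def validation_status(avg_improvement_pct: float) -> str:
    """Map an average improvement (percent) to a status label.

    Boundary handling is an assumption, not documented behaviour.
    """
    if avg_improvement_pct > 5.0:
        return "EXCELLENT"
    if avg_improvement_pct >= 2.0:
        return "GOOD"
    if avg_improvement_pct > 0.0:
        return "FAIR"
    return "POOR"


if __name__ == "__main__":
    base = {"recall@5": 0.50, "mrr": 0.40}
    tuned = {"recall@5": 0.55, "mrr": 0.44}
    print(validation_status(average_improvement(base, tuned)))
```

Whether the real reports use relative or absolute gains is worth confirming in `validation_summary.md` before comparing numbers across experiments.
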
+### **Key Metrics:**
+**Retriever**: Recall@5, Recall@10, MAP, MRR  
+**Reranker**: Accuracy, Precision, Recall, F1
+
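For readers less familiar with the metrics listed above, here is a minimal sketch of their textbook definitions. It is not the code in `evaluation/`; the function names are hypothetical, and details such as tie handling or how queries without relevant documents are treated may differ in the project. MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all test queries.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant document occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall and F1 for reranker labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
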
+### **Output Files:**
+- `validation_summary.md` - Main summary report
+- `validation_results.json` - Complete detailed results
+- `comparison/` - Baseline comparison results
+- `comprehensive/` - Detailed validation metrics
+
+---
+
+## 🛠️ **Advanced Options**
+
+### **Single Model Testing:**
+```bash
+# Test only retriever
+python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model
+
+# Test only reranker
+python scripts/validate.py quick --reranker_model ./output/reranker/final_model
+```
+
+### **Performance Tuning:**
+```bash
+# Speed up validation (for testing)
+python scripts/validate.py compare \
+    --retriever_model ./output/model \
+    --retriever_data ./test_data/m3_test.jsonl \
+    --batch_size 16 \
+    --max_samples 1000
+
+# Detailed comparison with custom baselines
+python scripts/compare_models.py \
+    --model_type both \
+    --finetuned_retriever ./output/bge_m3/final_model \
+    --finetuned_reranker ./output/reranker/final_model \
+    --baseline_retriever "BAAI/bge-m3" \
+    --baseline_reranker "BAAI/bge-reranker-base" \
+    --retriever_data ./test_data/m3_test.jsonl \
+    --reranker_data ./test_data/reranker_test.jsonl
+```
+
+### **Benchmarking Only:**
+```bash
+# Test inference performance
+python scripts/validate.py benchmark \
+    --retriever_model ./output/model \
+    --reranker_model ./output/model \
+    --batch_size 32 \
+    --max_samples 1000
+```
+
+---
+
+## 🎯 **Integration with Your Workflow**
+
+### **Complete Training → Validation Pipeline:**
+
+```bash
+# 1. Split your datasets properly
+python scripts/split_datasets.py \
+    --input_dir "data/datasets/三国演义" \
+    --output_dir "data/datasets/三国演义/splits"
+
+# 2. Quick training test (optional)
+python scripts/quick_train_test.py \
+    --data_dir "data/datasets/三国演义/splits" \
+    --samples_per_model 1000
+
+# 3. Full training
+python scripts/train_m3.py \
+    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
+    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
+    --output_dir "./output/bge_m3_三国演义"
+
+python scripts/train_reranker.py \
+    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
+    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
+    --output_dir "./output/bge_reranker_三国演义"
+
+# 4. Complete validation
+python scripts/validate.py all \
+    --retriever_model "./output/bge_m3_三国演义/final_model" \
+    --reranker_model "./output/bge_reranker_三国演义/final_model" \
+    --test_data_dir "data/datasets/三国演义/splits"
+```
+
+---
+
+## 🚨 **Troubleshooting**
+
+### **Common Issues:**
+1. **"Model not found"** → Check if training completed and model path exists
+2. **"Out of memory"** → Reduce `--batch_size`, or limit the evaluation set with `--max_samples`
+3. **"No test data"** → Ensure you ran `split_datasets.py` first
+4. **Import errors** → Run from project root directory
+
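The "Model not found" and "No test data" cases above can be caught before a long run starts. The following is a hedged sketch of such a pre-flight check, not part of the validation scripts: `preflight` is a hypothetical helper, and looking for `config.json` assumes the models were saved in Hugging Face format.

```python
from pathlib import Path


def preflight(model_dirs, data_files):
    """Return a list of problems; an empty list means every path looks usable."""
    problems = []
    for model_dir in model_dirs:
        path = Path(model_dir)
        if not path.is_dir():
            problems.append(f"model directory not found: {path}")
        elif not (path / "config.json").is_file():
            # Assumption: a finished model directory contains config.json.
            problems.append(f"no config.json in {path} (training may not have finished)")
    for data_file in data_files:
        path = Path(data_file)
        if not path.is_file():
            problems.append(f"test data not found: {path} (was split_datasets.py run?)")
        elif path.stat().st_size == 0:
            problems.append(f"test data is empty: {path}")
    return problems


if __name__ == "__main__":
    issues = preflight(
        ["./output/bge_m3/final_model", "./output/reranker/final_model"],
        ["./test_data/m3_test.jsonl", "./test_data/reranker_test.jsonl"],
    )
    for issue in issues:
        print("WARNING:", issue)
```
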
+### **Performance:**
+- **Slow validation** → Use `--max_samples 1000` for quick testing
+- **High memory** → Reduce batch size to 8-16  
+- **GPU not used** → Check CUDA/device configuration
+
+---
+
+## 💡 **Best Practices**
+
+1. **Always start with `quick` mode** - Verify models work before deeper testing
+2. **Use proper test/train splits** - Don't validate on training data
+3. **Compare against baselines** - Know how much you actually improved  
+4. **Keep validation results** - Track progress across different experiments
+5. **Test with representative data** - Use diverse test sets
+6. **Monitor resource usage** - Adjust batch sizes for your hardware
+
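Practice 2 above (do not validate on training data) can be spot-checked by looking for queries that occur in both splits. This is a minimal sketch under stated assumptions: each JSONL line is taken to be a JSON object with a `query` field, which should be confirmed against the data format documentation, and `load_queries` and `leaked_queries` are hypothetical helper names.

```python
import json


def load_queries(path, field="query"):
    """Collect the distinct values of `field` from a JSONL file."""
    queries = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if field in record:
                queries.add(record[field])
    return queries


def leaked_queries(train_path, test_path, field="query"):
    """Return the queries that appear in both the training and the test split."""
    return load_queries(train_path, field) & load_queries(test_path, field)


if __name__ == "__main__":
    overlap = leaked_queries(
        "data/datasets/三国演义/splits/m3_train.jsonl",
        "data/datasets/三国演义/splits/m3_test.jsonl",
    )
    print(f"{len(overlap)} queries appear in both splits")
```

An exact-match check like this only finds verbatim duplicates; near-duplicate queries would need a fuzzier comparison.
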
+---
+
+## 🎉 **Benefits of New System**
+
+✅ **Single entry point** - No more confusion about which script to use  
+✅ **Clear modes** - `quick`, `compare`, `comprehensive`, `benchmark`, `all`  
+✅ **Unified output** - Consistent result formats and summaries  
+✅ **Better error handling** - Clear error messages and troubleshooting  
+✅ **Integrated workflow** - Works seamlessly with training scripts  
+✅ **Comprehensive reporting** - Detailed summaries and recommendations  
+✅ **Performance aware** - Built-in benchmarking and optimization  
+
+The validation system is now **clear, powerful, and easy to use**! 🚀
+
+---
+
+## 📚 **Related Documentation**
+
+- [Training Guide](./docs/usage_guide.md) - How to train BGE models
+- [Data Formats](./docs/data_formats.md) - Dataset format specifications  
+- [Configuration](./config.toml) - System configuration options
+
+**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!