# 🎯 BGE Model Validation System - Streamlined & Simplified

## ⚡ **Quick Start - One Command for Everything**

The old, confusing collection of validation scripts has been **completely replaced** with one simple interface:

```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```

## 🧹 **What We Cleaned Up**

### **❌ REMOVED (Redundant/Confusing Files):**

- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapping functionality)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)

### **✅ NEW STREAMLINED SYSTEM:**

- **`scripts/validate.py`** - **Single entry point for all validation**
- **`scripts/compare_models.py`** - **Unified model comparison**
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)

---

## 🚀 **Complete Usage Examples**

### **1. After Training - Quick Sanity Check**

```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```

**Time**: 5 minutes | **Purpose**: Verify the models work correctly

### **2. Measure Improvements - Baseline Comparison**

```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```

**Time**: 15 minutes | **Purpose**: Quantify how much you improved over the baseline

### **3. Thorough Testing - Comprehensive Validation**

```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```

**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation

### **4. Production Ready - Complete Suite**

```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```

**Time**: 1 hour | **Purpose**: Everything - ready for deployment

---

## 📊 **Understanding Your Results**

### **Validation Status:**

- **🌟 EXCELLENT** - Significant improvements (>5% average)
- **✅ GOOD** - Clear improvements (2-5% average)
- **👌 FAIR** - Modest improvements (0-2% average)
- **❌ POOR** - No improvement or degradation

### **Key Metrics:**

- **Retriever**: Recall@5, Recall@10, MAP, MRR
- **Reranker**: Accuracy, Precision, Recall, F1
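These metric names follow standard IR definitions. As a rough illustration (not the actual `evaluation/` module API - the function names and data shapes here are assumptions), Recall@k and MRR can be computed like this:

```python
# Illustrative sketch of standard retrieval metrics; the real
# evaluation/ modules may compute these differently.

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# One query: documents as ranked by the retriever, plus the gold labels
ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=5))  # 1.0    - both relevant docs in top 5
print(mrr(ranked, relevant))               # ≈0.333 - first relevant doc at rank 3
```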
### **Output Files:**

- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics

---

## 🛠️ **Advanced Options**

### **Single Model Testing:**

```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```

### **Performance Tuning:**

```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```

### **Benchmarking Only:**

```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```

---

## 🎯 **Integration with Your Workflow**

### **Complete Training → Validation Pipeline:**

```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```
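If you re-run this pipeline often, a small driver script keeps the steps in order and stops on the first failure. This is a minimal sketch that assumes the script names and flags shown above; the `run()` helper is hypothetical, not part of the project:

```python
# Hypothetical driver for steps 3-4 of the pipeline above.
# Assumes the scripts/ layout and flags shown in this README.
import subprocess
import sys

DATA = "data/datasets/三国演义/splits"

def run(args):
    """Run one pipeline step; abort the whole pipeline if it fails."""
    print("+", " ".join(args))
    result = subprocess.run([sys.executable] + args)
    if result.returncode != 0:
        sys.exit(result.returncode)

run(["scripts/train_m3.py",
     "--train_data", f"{DATA}/m3_train.jsonl",
     "--eval_data", f"{DATA}/m3_val.jsonl",
     "--output_dir", "./output/bge_m3_三国演义"])

run(["scripts/train_reranker.py",
     "--reranker_data", f"{DATA}/reranker_train.jsonl",
     "--eval_data", f"{DATA}/reranker_val.jsonl",
     "--output_dir", "./output/bge_reranker_三国演义"])

run(["scripts/validate.py", "all",
     "--retriever_model", "./output/bge_m3_三国演义/final_model",
     "--reranker_model", "./output/bge_reranker_三国演义/final_model",
     "--test_data_dir", DATA])
```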
---

## 🚨 **Troubleshooting**

### **Common Issues:**

1. **"Model not found"** → Check that training completed and the model path exists
2. **"Out of memory"** → Reduce `--batch_size` or use `--max_samples`
3. **"No test data"** → Ensure you ran `split_datasets.py` first
4. **Import errors** → Run from the project root directory

### **Performance:**

- **Slow validation** → Use `--max_samples 1000` for quick testing
- **High memory** → Reduce the batch size to 8-16
- **GPU not used** → Check your CUDA/device configuration

---

## 💡 **Best Practices**

1. **Always start with `quick` mode** - Verify models work before deeper testing
2. **Use proper test/train splits** - Don't validate on training data
3. **Compare against baselines** - Know how much you actually improved
4. **Keep validation results** - Track progress across different experiments
5. **Test with representative data** - Use diverse test sets
6. **Monitor resource usage** - Adjust batch sizes for your hardware

---

## 🎉 **Benefits of New System**

✅ **Single entry point** - No more confusion about which script to use
✅ **Clear modes** - `quick`, `compare`, `comprehensive`, `all`
✅ **Unified output** - Consistent result formats and summaries
✅ **Better error handling** - Clear error messages and troubleshooting
✅ **Integrated workflow** - Works seamlessly with the training scripts
✅ **Comprehensive reporting** - Detailed summaries and recommendations
✅ **Performance aware** - Built-in benchmarking and optimization

The validation system is now **clear, powerful, and easy to use**! 🚀

---

## 📚 **Related Documentation**

- [Training Guide](./docs/usage_guide.md) - How to train BGE models
- [Data Formats](./docs/data_formats.md) - Dataset format specifications
- [Configuration](./config.toml) - System configuration options

**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!
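If you want to track results programmatically across experiments (Best Practice #4 above), the generated `validation_results.json` can be consumed from a short script. The exact JSON schema is an assumption here - inspect a real results file and adjust the keys accordingly:

```python
# Hedged sketch: reading validation output and mapping the average
# improvement onto the status levels described above. The key name
# "average_improvement_pct" is an assumption, not a documented field.
import json
from pathlib import Path

results = json.loads(Path("validation_results.json").read_text(encoding="utf-8"))
avg = results.get("average_improvement_pct", 0.0)

# Thresholds from "Understanding Your Results":
# >5% EXCELLENT, 2-5% GOOD, 0-2% FAIR, otherwise POOR.
if avg > 5:
    status = "EXCELLENT"
elif avg >= 2:
    status = "GOOD"
elif avg >= 0:
    status = "FAIR"
else:
    status = "POOR"

print(f"Average improvement: {avg:.1f}% -> {status}")
```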