# 🎯 BGE Model Validation System - Streamlined & Simplified
## ⚡ **Quick Start - One Command for Everything**
The confusing collection of separate validation scripts has been **completely replaced** with one simple interface:
```bash
# Quick check (5 minutes) - Did my training work?
python scripts/validate.py quick --retriever_model ./output/model --reranker_model ./output/model

# Compare with baselines - How much did I improve?
python scripts/validate.py compare --retriever_model ./output/model --reranker_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Is my model production-ready?
python scripts/validate.py all --retriever_model ./output/model --reranker_model ./output/model \
    --test_data_dir ./test_data
```
## 🧹 **What We Cleaned Up**
### **❌ REMOVED (Redundant/Confusing Files):**
- `scripts/validate_m3.py` - Simple validation (redundant)
- `scripts/validate_reranker.py` - Simple validation (redundant)
- `scripts/evaluate.py` - Main evaluation (overlapped)
- `scripts/compare_retriever.py` - Retriever comparison (merged)
- `scripts/compare_reranker.py` - Reranker comparison (merged)
### **✅ NEW STREAMLINED SYSTEM:**
- **`scripts/validate.py`** - **Single entry point for all validation**
- **`scripts/compare_models.py`** - **Unified model comparison**
- `scripts/quick_validation.py` - Improved quick validation
- `scripts/comprehensive_validation.py` - Enhanced comprehensive validation
- `scripts/benchmark.py` - Performance benchmarking
- `evaluation/` - Core evaluation modules (kept clean)
---
## 🚀 **Complete Usage Examples**
### **1. After Training - Quick Sanity Check**
```bash
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model
```
**Time**: 5 minutes | **Purpose**: Verify models work correctly
### **2. Measure Improvements - Baseline Comparison**
```bash
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --retriever_data "data/datasets/三国演义/splits/m3_test.jsonl" \
    --reranker_data "data/datasets/三国演义/splits/reranker_test.jsonl"
```
**Time**: 15 minutes | **Purpose**: Quantify how much you improved over the baselines
### **3. Thorough Testing - Comprehensive Validation**
```bash
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 30 minutes | **Purpose**: Detailed accuracy evaluation
### **4. Production Ready - Complete Suite**
```bash
python scripts/validate.py all \
    --retriever_model ./output/bge_m3_三国演义/final_model \
    --reranker_model ./output/bge_reranker_三国演义/final_model \
    --test_data_dir "data/datasets/三国演义/splits"
```
**Time**: 1 hour | **Purpose**: Everything - ready for deployment
---
## 📊 **Understanding Your Results**
### **Validation Status:**
- **🌟 EXCELLENT** - Significant improvements (>5% average)
- **✅ GOOD** - Clear improvements (2-5% average)
- **👌 FAIR** - Modest improvements (0-2% average)
- **❌ POOR** - No improvement or degradation
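For reference, the status is just a thresholding of the average metric improvement. A minimal sketch of that mapping (illustrative only, not the actual implementation inside `scripts/validate.py`):
```python
def validation_status(avg_improvement_pct: float) -> str:
    """Map the average metric improvement (in %) to a status tier.

    Illustrative sketch of the thresholds listed above; the real
    logic lives in the validation scripts.
    """
    if avg_improvement_pct > 5:
        return "🌟 EXCELLENT"
    if avg_improvement_pct >= 2:
        return "✅ GOOD"
    if avg_improvement_pct > 0:
        return "👌 FAIR"
    return "❌ POOR"
```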
### **Key Metrics:**
- **Retriever**: Recall@5, Recall@10, MAP, MRR
- **Reranker**: Accuracy, Precision, Recall, F1
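If any of these are unfamiliar, here is a toy computation of the two rank-based retriever metrics for a single query (standalone helpers for illustration, not part of the `evaluation/` modules; the reported MRR averages the per-query value over all test queries):
```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document (0 if none was retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]  # retriever output, best first
relevant = {"d1", "d4"}                  # gold labels for this query
print(recall_at_k(ranked, relevant, 5))  # 1.0  - both relevant docs in the top 5
print(reciprocal_rank(ranked, relevant)) # 0.33 - first relevant doc at rank 3
```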
### **Output Files:**
- `validation_summary.md` - Main summary report
- `validation_results.json` - Complete detailed results
- `comparison/` - Baseline comparison results
- `comprehensive/` - Detailed validation metrics
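All of these are plain Markdown/JSON files, so they are easy to post-process. A hypothetical snippet for inspecting `validation_results.json` (the exact keys depend on which modes you ran, so this just lists the top-level sections):
```python
import json

# Hypothetical inspection of validation_results.json; the path and key
# names depend on your run, so adjust before relying on this.
with open("validation_results.json") as f:
    results = json.load(f)

# List the top-level sections the run produced.
for section, payload in results.items():
    print(f"{section}: {type(payload).__name__}")
```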
---
## 🛠️ **Advanced Options**
### **Single Model Testing:**
```bash
# Test only retriever
python scripts/validate.py quick --retriever_model ./output/bge_m3/final_model

# Test only reranker
python scripts/validate.py quick --reranker_model ./output/reranker/final_model
```
### **Performance Tuning:**
```bash
# Speed up validation (for testing)
python scripts/validate.py compare \
    --retriever_model ./output/model \
    --retriever_data ./test_data/m3_test.jsonl \
    --batch_size 16 \
    --max_samples 1000

# Detailed comparison with custom baselines
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --baseline_retriever "BAAI/bge-m3" \
    --baseline_reranker "BAAI/bge-reranker-base" \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl
```
### **Benchmarking Only:**
```bash
# Test inference performance
python scripts/validate.py benchmark \
    --retriever_model ./output/model \
    --reranker_model ./output/model \
    --batch_size 32 \
    --max_samples 1000
```
---
## 🎯 **Integration with Your Workflow**
### **Complete Training → Validation Pipeline:**
```bash
# 1. Split your datasets properly
python scripts/split_datasets.py \
    --input_dir "data/datasets/三国演义" \
    --output_dir "data/datasets/三国演义/splits"

# 2. Quick training test (optional)
python scripts/quick_train_test.py \
    --data_dir "data/datasets/三国演义/splits" \
    --samples_per_model 1000

# 3. Full training
python scripts/train_m3.py \
    --train_data "data/datasets/三国演义/splits/m3_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/m3_val.jsonl" \
    --output_dir "./output/bge_m3_三国演义"

python scripts/train_reranker.py \
    --reranker_data "data/datasets/三国演义/splits/reranker_train.jsonl" \
    --eval_data "data/datasets/三国演义/splits/reranker_val.jsonl" \
    --output_dir "./output/bge_reranker_三国演义"

# 4. Complete validation
python scripts/validate.py all \
    --retriever_model "./output/bge_m3_三国演义/final_model" \
    --reranker_model "./output/bge_reranker_三国演义/final_model" \
    --test_data_dir "data/datasets/三国演义/splits"
```
---
## 🚨 **Troubleshooting**
### **Common Issues:**
1. **"Model not found"** → Check if training completed and model path exists
2. **"Out of memory"** → Reduce `--batch_size` or use `--max_samples`
3. **"No test data"** → Ensure you ran `split_datasets.py` first
4. **Import errors** → Run from project root directory
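For issue 1, a quick way to confirm that training actually produced a final model (the path is an example; substitute the `--output_dir` you used for training):
```python
from pathlib import Path

# Example path - substitute the --output_dir you used for training.
model_dir = Path("./output/bge_m3_三国演义/final_model")
if model_dir.exists():
    print("Model files:", sorted(p.name for p in model_dir.iterdir()))
else:
    print(f"{model_dir} does not exist - did training finish?")
```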
### **Performance:**
- **Slow validation** → Use `--max_samples 1000` for quick testing
- **High memory** → Reduce `--batch_size` to 8-16
- **GPU not used** → Check your CUDA/device configuration (see the PyTorch check below)
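For the GPU issue, a quick sanity check that PyTorch can actually see your device (assuming a PyTorch install, which the BGE models require):
```python
import torch

# Confirm PyTorch sees a CUDA device before digging into script configuration.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```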
---
## 💡 **Best Practices**
1. **Always start with `quick` mode** - Verify models work before deeper testing
2. **Use proper test/train splits** - Don't validate on training data
3. **Compare against baselines** - Know how much you actually improved
4. **Keep validation results** - Track progress across different experiments
5. **Test with representative data** - Use diverse test sets
6. **Monitor resource usage** - Adjust batch sizes for your hardware
---
## 🎉 **Benefits of New System**
- **Single entry point** - No more confusion about which script to use
- **Clear modes** - `quick`, `compare`, `comprehensive`, `all`
- **Unified output** - Consistent result formats and summaries
- **Better error handling** - Clear error messages and troubleshooting
- **Integrated workflow** - Works seamlessly with training scripts
- **Comprehensive reporting** - Detailed summaries and recommendations
- **Performance aware** - Built-in benchmarking and optimization
The validation system is now **clear, powerful, and easy to use**! 🚀
---
## 📚 **Related Documentation**
- [Training Guide](./docs/usage_guide.md) - How to train BGE models
- [Data Formats](./docs/data_formats.md) - Dataset format specifications
- [Configuration](./config.toml) - System configuration options
**Need help?** The validation system provides detailed error messages and suggestions. Check the generated `validation_summary.md` for specific recommendations!