Revised for training

Commit 229f6bb027 (parent 6dbd2f3281) by ldy, 2025-07-23 14:54:46 +08:00
32 changed files with 59884 additions and 11081 deletions

readme.md
The evaluation system provides comprehensive metrics for both retrieval and reranking.

### Usage Examples
#### Basic Evaluation - **NEW STREAMLINED SYSTEM** 🎯
```bash
# Quick validation (5 minutes)
python scripts/validate.py quick \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned

# Comprehensive evaluation (30 minutes)
python scripts/validate.py comprehensive \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned \
--test_data_dir ./data/test

# Compare with baselines
python scripts/validate.py compare \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned \
--retriever_data data/test_retriever.jsonl \
--reranker_data data/test_reranker.jsonl

# Complete validation suite
python scripts/validate.py all \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned \
--test_data_dir ./data/test
```
#### Advanced Evaluation with Graded Relevance
```bash
# Comprehensive evaluation with detailed metrics (includes NDCG, graded relevance)
python scripts/validate.py comprehensive \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned \
--test_data_dir data/test \
--batch_size 32
```
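For reference, the NDCG metric that the graded-relevance evaluation reports can be sketched as below. This is a minimal illustration using the common `2^rel - 1` gain function, not the repository's actual implementation:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain with the common 2^rel - 1 gain function
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels (0 = irrelevant, 1 = partially relevant, 2 = relevant), in ranked order
print(round(ndcg_at_k([2, 0, 1, 2, 0], k=5), 4))  # → 0.8886
```

A perfect ranking scores 1.0; penalties grow the further a highly relevant document sits from the top.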
#### Statistical Significance Testing
```bash
# Compare models with statistical analysis
python scripts/compare_models.py \
--model_type retriever \
--baseline_model BAAI/bge-m3 \
--finetuned_model ./output/bge-m3-finetuned \
--data_path data/test.jsonl \
--num_bootstrap_samples 1000 \
--confidence_level 0.95
```
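Conceptually, the bootstrap comparison resamples per-query metric deltas to estimate a confidence interval for the improvement. A minimal sketch of the idea (illustrative only, not the actual `compare_models.py` implementation; the per-query scores are made up):

```python
import random

def paired_bootstrap_ci(baseline_scores, finetuned_scores,
                        n_samples=1000, confidence=0.95, seed=0):
    """Confidence interval for the mean per-query improvement (paired bootstrap)."""
    rng = random.Random(seed)
    deltas = [f - b for b, f in zip(baseline_scores, finetuned_scores)]
    means = sorted(
        sum(rng.choice(deltas) for _ in deltas) / len(deltas)
        for _ in range(n_samples)
    )
    lo = means[int((1 - confidence) / 2 * n_samples)]
    hi = means[min(int((1 + confidence) / 2 * n_samples), n_samples - 1)]
    return sum(deltas) / len(deltas), (lo, hi)

# Made-up per-query MRR scores for illustration
mean_delta, (lo, hi) = paired_bootstrap_ci(
    [0.50, 0.55, 0.48, 0.60], [0.58, 0.62, 0.57, 0.66]
)
# The improvement is significant at the chosen level if the interval excludes 0
```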
---
## 📈 Model Evaluation & Benchmarking - **NEW STREAMLINED SYSTEM** 🎯
> **⚡ SIMPLIFIED**: The old collection of confusing validation scripts has been replaced with **one simple command**!

### **Quick Start - Single Command for Everything:**
```bash
# Quick validation (5 minutes) - Did my training work?
python scripts/validate.py quick \
--retriever_model ./output/bge_m3/final_model \
--reranker_model ./output/reranker/final_model

# Compare with baselines (15 minutes) - How much did I improve?
python scripts/validate.py compare \
--retriever_model ./output/bge_m3/final_model \
--reranker_model ./output/reranker/final_model \
--retriever_data ./test_data/m3_test.jsonl \
--reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Production ready?
python scripts/validate.py all \
--retriever_model ./output/bge_m3/final_model \
--reranker_model ./output/reranker/final_model \
--test_data_dir ./test_data
```
### **Validation Modes:**

| Mode | Time | Purpose | Command |
|------|------|---------|---------|
| `quick` | 5 min | Sanity check | `python scripts/validate.py quick --retriever_model ... --reranker_model ...` |
| `compare` | 15 min | Baseline comparison | `python scripts/validate.py compare --retriever_model ... --reranker_data ...` |
| `comprehensive` | 30 min | Detailed metrics | `python scripts/validate.py comprehensive --test_data_dir ...` |
| `benchmark` | 10 min | Performance only | `python scripts/validate.py benchmark --retriever_model ...` |
| `all` | 1 hour | Complete suite | `python scripts/validate.py all --test_data_dir ...` |
### **Advanced Usage:**
```bash
# Unified model comparison
python scripts/compare_models.py \
--model_type both \
--finetuned_retriever ./output/bge_m3/final_model \
--finetuned_reranker ./output/reranker/final_model \
--retriever_data ./test_data/m3_test.jsonl \
--reranker_data ./test_data/reranker_test.jsonl

# Performance benchmarking
python scripts/benchmark.py \
--model_type retriever \
--model_path ./output/bge_m3/final_model \
--data_path ./test_data/m3_test.jsonl
```
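The comparison output reports baseline and fine-tuned metrics side by side together with the improvement (delta). A toy sketch of that delta computation (metric names and values below are hypothetical, for illustration only):

```python
def metric_deltas(baseline, finetuned):
    """Absolute and relative improvement for each shared metric."""
    return {
        m: (finetuned[m] - baseline[m], (finetuned[m] - baseline[m]) / baseline[m])
        for m in baseline
    }

deltas = metric_deltas(
    {"MRR@10": 0.512, "Recall@10": 0.701},  # hypothetical baseline metrics
    {"MRR@10": 0.587, "Recall@10": 0.764},  # hypothetical fine-tuned metrics
)
for metric, (abs_delta, rel_delta) in deltas.items():
    print(f"{metric}: {abs_delta:+.3f} ({rel_delta:+.1%})")
```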
**📋 See [README_VALIDATION.md](./README_VALIDATION.md) for complete documentation**
---
bge_finetune/
│ ├── train_m3.py # Train BGE-M3 embeddings
│ ├── train_reranker.py # Train BGE-Reranker
│ ├── train_joint.py # Joint training (RocketQAv2)
│ ├── evaluate.py # Comprehensive evaluation
│ ├── validate.py # **NEW: Single validation entry point**
│ ├── compare_models.py # **NEW: Unified model comparison**
│ ├── benchmark.py # Performance benchmarking
│ ├── test_installation.py # Environment verification
│ └── setup.py # Initial setup script