The evaluation system provides comprehensive metrics for both retrieval and reranking.
### Usage Examples

#### Basic Evaluation - **NEW STREAMLINED SYSTEM** 🎯

```bash
# Evaluate retriever
python scripts/evaluate.py \
    --eval_type retriever \
    --model_path ./output/bge-m3-finetuned \
    --eval_data data/test.jsonl \
    --k_values 1 5 10 20 50 \
    --output_file results/retriever_metrics.json

# Evaluate reranker
python scripts/evaluate.py \
    --eval_type reranker \
    --model_path ./output/bge-reranker-finetuned \
    --eval_data data/test.jsonl \
    --output_file results/reranker_metrics.json

# Evaluate full pipeline
python scripts/evaluate.py \
    --eval_type pipeline \
    --retriever_path ./output/bge-m3-finetuned \
    --reranker_path ./output/bge-reranker-finetuned \
    --eval_data data/test.jsonl \
    --retrieval_top_k 50 \
    --rerank_top_k 10

# Quick validation (5 minutes)
python scripts/validate.py quick \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned

# Comprehensive evaluation (30 minutes)
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned \
    --test_data_dir ./data/test

# Compare with baselines
python scripts/validate.py compare \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned \
    --retriever_data data/test_retriever.jsonl \
    --reranker_data data/test_reranker.jsonl

# Complete validation suite
python scripts/validate.py all \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned \
    --test_data_dir ./data/test
```
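Under the hood these evaluations boil down to standard rank-based metrics. A minimal illustrative sketch of Recall@k and MRR — the function names here are hypothetical, not the repo's actual API:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant docs found in the top-k ranked results."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy query: two relevant docs, the first appears at rank 2
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 2))  # 0.5
print(mrr(ranked, relevant))             # 0.5
```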
#### Advanced Evaluation with Graded Relevance
```bash
# Evaluate with graded relevance
python scripts/evaluate.py \
    --eval_type retriever \
    --model_path ./output/bge-m3-finetuned \
    --eval_data data/test_graded.jsonl \
    --use_graded_relevance \
    --relevance_threshold 1 \
    --ndcg_k_values 5 10 20

# Comprehensive evaluation with detailed metrics (includes NDCG, graded relevance)
python scripts/validate.py comprehensive \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned \
    --test_data_dir data/test \
    --batch_size 32
```
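NDCG with graded relevance rewards ranking highly relevant documents above marginally relevant ones. A sketch of the common exponential-gain formulation (illustrative only; the evaluator's exact gain function may differ):

```python
import math

def ndcg_at_k(gains, all_gains, k):
    """NDCG@k over graded labels: gain (2^rel - 1), discount log2(rank + 1)."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))  # best possible ordering
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Labels of the returned docs in ranked order vs. all labels for the query
print(round(ndcg_at_k([2, 0, 1], [2, 1, 0], 3), 3))  # 0.964
```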

#### Statistical Significance Testing

```bash
# Compare two models with significance testing
python scripts/compare_retriever.py \
    --baseline_model_path BAAI/bge-m3 \
    --finetuned_model_path ./output/bge-m3-finetuned \
    --data_path data/test.jsonl

# Compare models with statistical analysis
python scripts/compare_models.py \
    --model_type retriever \
    --baseline_model BAAI/bge-m3 \
    --finetuned_model ./output/bge-m3-finetuned \
    --data_path data/test.jsonl \
    --num_bootstrap_samples 1000 \
    --confidence_level 0.95
```
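The significance testing behind `--num_bootstrap_samples` and `--confidence_level` can be pictured as a percentile bootstrap over per-query metric deltas (fine-tuned minus baseline). A minimal sketch with a hypothetical helper, not the repo's actual API:

```python
import random

def bootstrap_ci(per_query_deltas, num_samples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean per-query delta."""
    rng = random.Random(seed)
    n = len(per_query_deltas)
    means = sorted(
        sum(rng.choice(per_query_deltas) for _ in range(n)) / n
        for _ in range(num_samples)
    )
    lo = means[int((1 - confidence) / 2 * num_samples)]
    hi = means[int((1 + confidence) / 2 * num_samples) - 1]
    return lo, hi

# Toy per-query MRR improvements; if the CI excludes 0, the gain is significant
deltas = [0.1, 0.05, 0.0, 0.2, 0.08, 0.12, -0.02, 0.15]
lo, hi = bootstrap_ci(deltas)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```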
---
## 📈 Model Evaluation & Benchmarking - **NEW STREAMLINED SYSTEM** 🎯

> **⚡ SIMPLIFIED**: The old tangle of separate validation scripts has been replaced with **one simple command**!

### **Quick Start - Single Command for Everything:**

```bash
# Quick validation (5 minutes) - Did my training work?
python scripts/validate.py quick \
    --retriever_model ./output/bge_m3/final_model \
    --reranker_model ./output/reranker/final_model

# Compare with baselines (15 minutes) - How much did I improve?
python scripts/validate.py compare \
    --retriever_model ./output/bge_m3/final_model \
    --reranker_model ./output/reranker/final_model \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl

# Complete validation suite (1 hour) - Production ready?
python scripts/validate.py all \
    --retriever_model ./output/bge_m3/final_model \
    --reranker_model ./output/reranker/final_model \
    --test_data_dir ./test_data
```

### 1. Detailed Accuracy Evaluation

Use `scripts/evaluate.py` for comprehensive accuracy evaluation of the retriever, the reranker, or the full pipeline. It supports all standard metrics (MRR, Recall@k, MAP, NDCG, etc.) as well as baseline comparison.

**Example:**

```bash
python scripts/evaluate.py --retriever_model path/to/finetuned --eval_data path/to/data.jsonl --eval_retriever --k_values 1 5 10
python scripts/evaluate.py --reranker_model path/to/finetuned --eval_data path/to/data.jsonl --eval_reranker
python scripts/evaluate.py --retriever_model path/to/finetuned --reranker_model path/to/finetuned --eval_data path/to/data.jsonl --eval_pipeline --compare_baseline --base_retriever path/to/baseline
```
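The retrieve-then-rerank pipeline these scripts evaluate is a two-stage funnel: a fast retriever narrows the corpus to `retrieval_top_k` candidates, then a slower cross-encoder reorders only those survivors. A toy sketch with made-up scoring functions (nothing here is the repo's API):

```python
def run_pipeline(query, retriever_score, reranker_score, corpus,
                 retrieval_top_k=50, rerank_top_k=10):
    # Stage 1: cheap retriever scores the whole corpus, keeps the top candidates
    candidates = sorted(corpus, key=lambda d: retriever_score(query, d),
                        reverse=True)[:retrieval_top_k]
    # Stage 2: expensive reranker reorders only those candidates
    return sorted(candidates, key=lambda d: reranker_score(query, d),
                  reverse=True)[:rerank_top_k]

# Toy corpus of integer "docs"; both stages prefer ids close to the query
corpus = list(range(100))
top = run_pipeline(42, lambda q, d: -abs(d - q), lambda q, d: -2 * abs(d - q),
                   corpus, retrieval_top_k=50, rerank_top_k=10)
print(top[0])  # 42 — the best match survives both stages
```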

### **Validation Modes:**

| Mode | Time | Purpose | Command |
|------|------|---------|---------|
| `quick` | 5 min | Sanity check | `python scripts/validate.py quick --retriever_model ... --reranker_model ...` |
| `compare` | 15 min | Baseline comparison | `python scripts/validate.py compare --retriever_model ... --reranker_data ...` |
| `comprehensive` | 30 min | Detailed metrics | `python scripts/validate.py comprehensive --test_data_dir ...` |
| `benchmark` | 10 min | Performance only | `python scripts/validate.py benchmark --retriever_model ...` |
| `all` | 1 hour | Complete suite | `python scripts/validate.py all --test_data_dir ...` |

### 2. Quick Performance Benchmarking

Use `scripts/benchmark.py` for fast, single-model performance profiling (throughput, latency, memory). This script does **not** report accuracy metrics.

**Example:**

```bash
python scripts/benchmark.py --model_type retriever --model_path path/to/model --data_path path/to/data.jsonl --batch_size 16 --device cuda
python scripts/benchmark.py --model_type reranker --model_path path/to/model --data_path path/to/data.jsonl --batch_size 16 --device cuda
```

### **Advanced Usage:**

```bash
# Unified model comparison
python scripts/compare_models.py \
    --model_type both \
    --finetuned_retriever ./output/bge_m3/final_model \
    --finetuned_reranker ./output/reranker/final_model \
    --retriever_data ./test_data/m3_test.jsonl \
    --reranker_data ./test_data/reranker_test.jsonl

# Performance benchmarking
python scripts/benchmark.py \
    --model_type retriever \
    --model_path ./output/bge_m3/final_model \
    --data_path ./test_data/m3_test.jsonl
```
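Performance profiling of this kind amounts to timing batches and aggregating throughput and latency percentiles. A minimal, dependency-free sketch (the real script's metrics and output format may differ):

```python
import time
import statistics

def profile(encode_fn, batches):
    """Time each batch, then report throughput and latency percentiles."""
    latencies = []
    total_items = 0
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        encode_fn(batch)  # stand-in for a model forward pass
        latencies.append(time.perf_counter() - t0)
        total_items += len(batch)
    elapsed = time.perf_counter() - start
    return {
        "throughput_items_per_s": total_items / elapsed,
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * len(latencies)) - 1],
    }

# Dummy "encoder" and 20 batches of 16 items each
stats = profile(lambda b: [s.lower() for s in b], [["a"] * 16] * 20)
print(stats)
```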

### 3. Fine-tuned vs. Baseline Comparison (Accuracy & Performance)

Use `scripts/compare_retriever.py` and `scripts/compare_reranker.py` to compare a fine-tuned model against a baseline; both report accuracy and performance metrics side by side, including the delta (improvement) for each metric.

**Example:**

```bash
python scripts/compare_retriever.py --finetuned_model_path path/to/finetuned --baseline_model_path path/to/baseline --data_path path/to/data.jsonl --batch_size 16 --device cuda
python scripts/compare_reranker.py --finetuned_model_path path/to/finetuned --baseline_model_path path/to/baseline --data_path path/to/data.jsonl --batch_size 16 --device cuda
```

Results are saved to a file and include a table of metrics for both models together with the improvement (delta) on each metric.

**📋 See [README_VALIDATION.md](./README_VALIDATION.md) for complete documentation**
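The side-by-side table with deltas can be reproduced from any two metric dicts. A tiny sketch (`delta_table` is a hypothetical helper, not the repo's API):

```python
def delta_table(baseline, finetuned):
    """Pair baseline and fine-tuned scores per metric and compute the delta."""
    return {
        m: {
            "baseline": baseline[m],
            "finetuned": finetuned[m],
            "delta": round(finetuned[m] - baseline[m], 4),
        }
        for m in baseline
    }

baseline = {"mrr": 0.61, "recall@10": 0.78}
finetuned = {"mrr": 0.70, "recall@10": 0.86}
print(delta_table(baseline, finetuned))
```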
---

```
bge_finetune/
├── scripts/
│   ├── train_m3.py                  # Train BGE-M3 embeddings
│   ├── train_reranker.py            # Train BGE-Reranker
│   ├── train_joint.py               # Joint training (RocketQAv2)
│   ├── evaluate.py                  # Comprehensive evaluation
│   ├── validate.py                  # **NEW: Single validation entry point**
│   ├── compare_models.py            # **NEW: Unified model comparison**
│   ├── benchmark.py                 # Performance benchmarking
│   ├── compare_retriever.py         # Retriever comparison
│   ├── compare_reranker.py          # Reranker comparison
│   ├── comprehensive_validation.py  # Detailed validation suite
│   ├── quick_validation.py          # Quick validation checks
│   ├── test_installation.py         # Environment verification
│   └── setup.py                     # Initial setup script
│
```