# BGE Fine-tuning for Chinese Book Retrieval

Robust, production-ready tooling to fine-tune the BAAI **BGE-M3** embedding model and **BGE-Reranker** cross-encoder for high-performance retrieval of Chinese book content (with English query support).

---

## Table of Contents
1. Overview
2. Requirements & Compatibility
3. Core Features
4. Installation
5. Quick Start
6. Configuration System
7. Training & Evaluation
8. Distributed & Ascend NPU Support
9. Troubleshooting & FAQ
10. References

---

## 1. Overview
* Unified repository for dense, sparse (SPLADE-style), and multi-vector (ColBERT-style) retrieval.
* Joint retriever-reranker training (RocketQAv2-style) and ANCE-style hard-negative mining included.
* Built-in evaluation, benchmarking, and statistical significance testing.

---

## 2. Requirements & Compatibility
* **Python ≥ 3.10**
* **PyTorch ≥ 2.2.0** (nightly recommended for CUDA 12.8+)
* **CUDA ≥ 12.8** *(RTX 5090 / A100 / H100)* or **Torch-NPU** *(Ascend 910 / 910B)*

See `requirements.txt` for the full dependency list.

---
## 3. Core Features

* **InfoNCE (τ = 0.02) & Listwise Cross-Entropy** losses with self-distillation (see the sketch after this list).
* **ANCE-style Hard-Negative Mining** with FAISS index refresh (also sketched below).
* **Atomic Checkpointing** & robust failure recovery.
* **Automatic Mixed Precision** (fp16 / bf16) & gradient checkpointing.
* **Comprehensive Metrics**: Recall@k, MRR@k, NDCG@k, MAP, Success-Rate.
* **Cross-hardware Support**: CUDA (NCCL), Ascend NPU (HCCL), CPU (Gloo).

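For reference, the loss and mining features reduce to a few lines each. First, a minimal sketch of temperature-scaled InfoNCE with in-batch negatives (τ = 0.02 as above; tensor names are illustrative, not this repository's API):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, p: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of `p` is the positive for
    query i; every other row in the batch acts as a negative."""
    q = F.normalize(q, dim=-1)                 # (B, D) query embeddings
    p = F.normalize(p, dim=-1)                 # (B, D) positive passage embeddings
    logits = q @ p.T / tau                     # (B, B) cosine similarities, sharpened by tau
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Second, a simplified view of ANCE-style mining: periodically re-encode the corpus, rebuild a FAISS index, and keep top-ranked non-positives as hard negatives (the helper below is hypothetical, not the repository's implementation):

```python
import faiss
import numpy as np

def mine_hard_negatives(q_emb: np.ndarray, c_emb: np.ndarray,
                        positives: list[set[int]], depth: int = 30) -> list[list[int]]:
    """Search a freshly built index and keep the top-ranked passages
    that are not labelled positive for each query."""
    index = faiss.IndexFlatIP(c_emb.shape[1])  # inner product == cosine on normalized vectors
    index.add(c_emb.astype(np.float32))
    _, ranked = index.search(q_emb.astype(np.float32), depth)
    return [[int(d) for d in row if d not in positives[i]]
            for i, row in enumerate(ranked)]
```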
---
## 4. Installation

```bash
# 1. Install PyTorch nightly (CUDA 12.8)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

# 2. Install project dependencies
pip install -r requirements.txt

# 3. (Optional) Ascend NPU backend
pip install torch_npu  # then edit config.toml [ascend]
```
Models auto-download on first use, or fetch manually:

```bash
huggingface-cli download BAAI/bge-m3 --local-dir ./models/bge-m3
huggingface-cli download BAAI/bge-reranker-base --local-dir ./models/bge-reranker-base
```
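To sanity-check the downloaded weights, a quick smoke test via the upstream `FlagEmbedding` package works (an assumption: it is not listed in `requirements.txt` above, and this snippet is independent of the training code here):

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

model = BGEM3FlagModel("./models/bge-m3", use_fp16=True)
reranker = FlagReranker("./models/bge-reranker-base", use_fp16=True)

# Dense-embed a Chinese query and score a query-passage pair.
dense = model.encode(["红楼梦的作者是谁？"])["dense_vecs"]  # "Who wrote Dream of the Red Chamber?"
score = reranker.compute_score(["红楼梦的作者是谁？", "《红楼梦》是曹雪芹创作的长篇小说。"])
print(dense.shape, score)
```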
---
## 5. Quick Start
### Fine-tune BGE-M3 (Embedding)

```bash
python scripts/train_m3.py \
    --train_data data/train.jsonl \
    --output_dir output/bge-m3 \
    --per_device_train_batch_size 8 \
    --learning_rate 1e-5
```
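`--train_data` expects JSONL with one example per line. The field layout below follows the common BGE/FlagEmbedding convention (`query` / `pos` / `neg`) and is an assumption about this repo's loader, written as a runnable snippet:

```python
import json

# One hypothetical training example (assumed schema: query, positive passages, hard negatives).
record = {
    "query": "贾宝玉的性格特点",                 # query: Jia Baoyu's character traits
    "pos": ["宝玉生性多情，蔑视功名利禄……"],      # passages that should rank high
    "neg": ["《三国演义》以东汉末年为背景……"],    # hard negatives that should rank low
}
with open("data/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```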
### Fine-tune BGE-Reranker (Cross-encoder)

```bash
python scripts/train_reranker.py \
    --train_data data/train.jsonl \
    --output_dir output/bge-reranker \
    --learning_rate 6e-5 \
    --train_group_size 16
```
### Joint Training (RocketQAv2-style)

```bash
python scripts/train_joint.py \
    --train_data data/train.jsonl \
    --output_dir output/joint-training \
    --bf16 --gradient_checkpointing
```
---
## 6. Configuration System
Configuration is **TOML-driven** with clear precedence (earlier sources override later ones):

1. Command-line arguments
2. Model sections (`[m3]` / `[reranker]`)
3. Hardware sections (`[hardware]` / `[ascend]`)
4. Training defaults (`[training]`)
5. Dataclass field defaults

Load & validate:

```python
from config.base_config import BaseConfig

cfg = BaseConfig.from_toml("config.toml")
```

Key sections: `[training]`, `[hardware]`, `[distributed]`, `[m3]`, `[reranker]`, `[augmentation]`, `[checkpoint]`, `[logging]`.

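To make the override order concrete, here is a toy illustration of layered merging (purely illustrative; the real logic lives inside `BaseConfig.from_toml`):

```python
# Apply layers from lowest to highest precedence; later updates win.
layers = [
    {"learning_rate": 2e-5, "batch_size": 4},  # dataclass field defaults
    {"learning_rate": 1e-5},                   # [training]
    {"batch_size": 8},                         # [hardware]
    {},                                        # [m3] (nothing overridden here)
    {"learning_rate": 5e-6},                   # command-line flags
]
resolved: dict = {}
for layer in layers:
    resolved.update(layer)
print(resolved)  # {'learning_rate': 5e-06, 'batch_size': 8}
```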
---
## 7. Training & Evaluation
* **Scripts**: `train_m3.py`, `train_reranker.py`, `train_joint.py`
* **Evaluation**: `scripts/evaluate.py` evaluates the retriever, the reranker, or the full pipeline.
* **Benchmarking**: `scripts/benchmark.py` for latency / throughput; `compare_*` scripts for baseline comparisons.

Example evaluation:

```bash
python scripts/evaluate.py \
    --retriever_model output/bge-m3 \
    --eval_data data/test.jsonl \
    --k_values 1 5 10
```
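The headline metrics reduce to a few lines each; a self-contained sketch of Recall@k and MRR@k (for orientation only, not the code in `evaluate.py`):

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# A query whose only relevant doc is ranked third scores MRR@5 = 1/3:
assert mrr_at_k(["d9", "d4", "d7"], {"d7"}, k=5) == 1 / 3
```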
---
## 8. Distributed & Ascend NPU Support
### Multi-GPU (NVIDIA)

```bash
torchrun --nproc_per_node=4 scripts/train_m3.py --train_data data/train.jsonl --output_dir output/distributed
```

Backend auto-selects **NCCL**. Use the `[distributed]` section in `config.toml` for fine-grained control.
### Huawei Ascend NPU

```bash
pip install torch_npu
python scripts/train_m3.py --device npu --train_data data/train.jsonl --output_dir output/ascend
```

Backend auto-selects **HCCL**.

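The choice reduces to probing the accelerator stack; a simplified sketch of such dispatch logic (illustrative, not necessarily this repo's exact code):

```python
import torch

def pick_backend() -> str:
    """NCCL on NVIDIA GPUs, HCCL on Ascend NPUs, Gloo as the CPU fallback."""
    if torch.cuda.is_available():
        return "nccl"
    try:
        import torch_npu  # noqa: F401 -- registers the 'npu' device with torch
        if torch.npu.is_available():
            return "hccl"
    except ImportError:
        pass
    return "gloo"
```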
---
## 9. Troubleshooting & FAQ
* **CUDA mismatch**: install a PyTorch build that matches your driver's CUDA version (verify with `nvidia-smi`).
* **Ascend setup**: ensure `torch_npu` is installed and the `[ascend]` section is configured.
* **NCCL errors**: set `NCCL_DEBUG=INFO`; disable InfiniBand if needed (`NCCL_IB_DISABLE=1`).
* **Memory issues**: enable `--gradient_checkpointing` and mixed precision.
---
## 10. References

1. BGE-M3 Embeddings – [arXiv:2402.03216](https://arxiv.org/abs/2402.03216)
2. RocketQAv2 – EMNLP 2021, [arXiv:2110.07367](https://arxiv.org/abs/2110.07367)
3. ANCE – [arXiv:2007.00808](https://arxiv.org/abs/2007.00808)
4. ColBERT – SIGIR 2020, [arXiv:2004.12832](https://arxiv.org/abs/2004.12832)
5. SPLADE – [arXiv:2109.10086](https://arxiv.org/abs/2109.10086)