init commit

2025-07-22 16:55:25 +08:00
commit 36003b83e2
67 changed files with 76613 additions and 0 deletions
--- a/readme_new.md
+++ b/readme_new.md
@@ -0,0 +1,154 @@
+# BGE Fine-tuning for Chinese Book Retrieval
+
+Robust, production-ready tooling to fine-tune the BAAI **BGE-M3** embedding model and **BGE-Reranker** cross-encoder for high-performance retrieval of Chinese book content (with English query support).
+
+---
+
+## Table of Contents
+1. Overview
+2. Requirements & Compatibility
+3. Core Features
+4. Installation
+5. Quick Start
+6. Configuration System
+7. Training & Evaluation
+8. Distributed & Ascend NPU Support
+9. Troubleshooting & FAQ
+10. References
+11. Roadmap
+
+---
+
+## 1. Overview
+* Unified repository for dense, sparse (SPLADE-style) and multi-vector (ColBERT-style) retrieval.
+* Joint training (RocketQAv2) and ANCE-style hard-negative mining included.
+* Built-in evaluation, benchmarking and statistical significance testing.
+
+---
+
+## 2. Requirements & Compatibility
+* **Python ≥ 3.10**
+* **PyTorch ≥ 2.2.0** (nightly recommended for CUDA 12.8+)
+* **CUDA ≥ 12.8** *(RTX 5090 / A100 / H100)*  or  **Torch-NPU** *(Ascend 910 / 910B)*
+
+See `requirements.txt` for the full dependency list.
+
+---
+
+## 3. Core Features
+* **InfoNCE (τ = 0.02) & Listwise CE** losses with self-distillation.
+* **ANCE-style Hard-Negative Mining** with FAISS index refresh.
+* **Atomic Checkpointing** & robust failure recovery.
+* **Automatic Mixed Precision** (fp16 / bf16) & gradient checkpointing.
+* **Comprehensive Metrics**: Recall@k, MRR@k, NDCG@k, MAP, Success-Rate.
+* **Cross-hardware Support**: CUDA (NCCL), Ascend NPU (HCCL), CPU (Gloo).
+
+---
+
+## 4. Installation
+```bash
+# 1. Install PyTorch nightly (CUDA 12.8)
+pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
+
+# 2. Install project dependencies
+pip install -r requirements.txt
+
+# 3. (Optional) Ascend NPU backend
+pip install torch_npu   # then edit config.toml [ascend]
+```
+Models auto-download on first use, or fetch manually:
+```bash
+huggingface-cli download BAAI/bge-m3 --local-dir ./models/bge-m3
+huggingface-cli download BAAI/bge-reranker-base --local-dir ./models/bge-reranker-base
+```
+
+---
+
+## 5. Quick Start
+### Fine-tune BGE-M3 (Embedding)
+```bash
+python scripts/train_m3.py \
+  --train_data data/train.jsonl \
+  --output_dir output/bge-m3 \
+  --per_device_train_batch_size 8 \
+  --learning_rate 1e-5
+```
+
+### Fine-tune BGE-Reranker (Cross-encoder)
+```bash
+python scripts/train_reranker.py \
+  --train_data data/train.jsonl \
+  --output_dir output/bge-reranker \
+  --learning_rate 6e-5 \
+  --train_group_size 16
+```
+
+### Joint Training (RocketQAv2-style)
+```bash
+python scripts/train_joint.py \
+  --train_data data/train.jsonl \
+  --output_dir output/joint-training \
+  --bf16 --gradient_checkpointing
+```
+
+---
+
+## 6. Configuration System
+Configuration is **TOML-driven** with clear precedence:
+1. Command-line → 2. Model (`[m3]` / `[reranker]`) → 3. Hardware (`[hardware]` / `[ascend]`) → 4. Training defaults (`[training]`) → 5. Dataclass field defaults.
+
+Load & validate:
+```python
+from config.base_config import BaseConfig
+cfg = BaseConfig.from_toml("config.toml")
+```
+
+Key sections: `[training]`, `[hardware]`, `[distributed]`, `[m3]`, `[reranker]`, `[augmentation]`, `[checkpoint]`, `[logging]`.
+
+---
+
+## 7. Training & Evaluation
+* **Scripts**: `train_m3.py`, `train_reranker.py`, `train_joint.py`
+* **Evaluation**: `scripts/evaluate.py` supports retriever, reranker or full pipeline.
+* **Benchmarking**: `scripts/benchmark.py` for latency / throughput; `compare_*` for baseline comparisons.
+
+Example evaluation:
+```bash
+python scripts/evaluate.py \
+  --retriever_model output/bge-m3 \
+  --eval_data data/test.jsonl \
+  --k_values 1 5 10
+```
+
+---
+
+## 8. Distributed & Ascend NPU Support
+### Multi-GPU (NVIDIA)
+```bash
+torchrun --nproc_per_node=4 scripts/train_m3.py --train_data data/train.jsonl --output_dir output/distributed
+```
+Backend auto-selects **NCCL**. Use `[distributed]` in `config.toml` for fine-tuning.
+
+### Huawei Ascend NPU
+```bash
+pip install torch_npu
+python scripts/train_m3.py --device npu --train_data data/train.jsonl --output_dir output/ascend
+```
+Backend auto-selects **HCCL**.
+
+---
+
+## 9. Troubleshooting & FAQ
+* **CUDA mismatch**: install the matching PyTorch + CUDA build (`nvidia-smi` to verify).
+* **Ascend setup**: ensure `torch_npu` and `[ascend]` config.
+* **NCCL errors**: set `NCCL_DEBUG=INFO` and disable IB if needed (`NCCL_IB_DISABLE=1`).
+* **Memory issues**: enable `--gradient_checkpointing` & mixed precision.
+
+---
+
+## 10. References
+1. BGE-M3 Embeddings – [arXiv:2402.03216](https://arxiv.org/abs/2402.03216)
+2. RocketQAv2 – ACL 2021
+3. ANCE – [arXiv:2007.00808](https://arxiv.org/abs/2007.00808)
+4. ColBERT – SIGIR 2020
+5. SPLADE – [arXiv:2109.10086](https://arxiv.org/abs/2109.10086)