
BGE Fine-tuning Usage Guide

Overview

This guide provides detailed instructions on fine-tuning BGE (BAAI General Embedding) models using our enhanced training framework. The framework supports both BGE-M3 (embedding) and BGE-reranker (cross-encoder) models with state-of-the-art training techniques.

Table of Contents

  1. Installation
  2. Quick Start
  3. Data Preparation
  4. Training Scripts
  5. Advanced Usage
  6. Troubleshooting
  7. Best Practices
  8. Additional Resources

Installation

# Clone the repository
git clone https://github.com/yourusername/bge-finetune.git
cd bge-finetune

# Install dependencies
pip install -r requirements.txt

# Optional: Install Ascend NPU support
# pip install torch-npu  # If using Huawei Ascend NPUs

Quick Start

1. Prepare Your Data

The framework supports multiple data formats:

  • JSONL (recommended): One JSON object per line
  • CSV/TSV: Tabular data with headers
  • JSON: Single JSON file with array of samples

Example JSONL format for the embedding model:

{"query": "What is machine learning?", "pos": ["ML is a subset of AI..."], "neg": ["The weather today..."]}
{"query": "How to cook pasta?", "pos": ["Boil water and add pasta..."], "neg": ["Machine learning uses..."]}

2. Fine-tune BGE-M3 (Embedding Model)

python scripts/train_m3.py \
    --model_name_or_path BAAI/bge-m3 \
    --train_data data/train.jsonl \
    --output_dir ./output/bge-m3-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 1e-5
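
Once training finishes, a quick sanity check is to encode a query and a passage with the fine-tuned model and compare their similarity. The sketch below assumes the output directory is a standard Hugging Face checkpoint and uses CLS pooling with L2 normalization, the usual convention for BGE embedding models:

import torch
from transformers import AutoTokenizer, AutoModel

model_dir = "./output/bge-m3-finetuned"  # path from the command above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
model.eval()

texts = ["What is machine learning?", "ML is a subset of AI..."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # CLS pooling + L2 normalization
    embeddings = model(**batch).last_hidden_state[:, 0]
    embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

similarity = embeddings[0] @ embeddings[1]
print(f"cosine similarity: {similarity.item():.4f}")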

3. Fine-tune BGE-Reranker

python scripts/train_reranker.py \
    --model_name_or_path BAAI/bge-reranker-base \
    --train_data data/train_pairs.jsonl \
    --output_dir ./output/bge-reranker-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5

Data Preparation

Data Formats

1. Embedding Model (Triplets Format)

{
    "query": "query text",
    "pos": ["positive passage 1", "positive passage 2"],
    "neg": ["negative passage 1", "negative passage 2"],
    "pos_scores": [1.0, 0.9],  // Optional: relevance scores
    "neg_scores": [0.1, 0.2]   // Optional: relevance scores
}

2. Reranker Model (Pairs Format)

{
    "query": "query text",
    "passage": "passage text",
    "label": 1,  // 1 for relevant, 0 for not relevant
    "score": 0.95  // Optional: relevance score
}
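
If your data is already in the triplet format, reranker pairs can be derived from it. A minimal conversion sketch, assuming the field names shown above:

import json

def triplets_to_pairs(triplet_path, pairs_path):
    """Expand each triplet record into labeled (query, passage) pairs."""
    with open(triplet_path, encoding="utf-8") as fin, \
         open(pairs_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            for passage in record["pos"]:
                pair = {"query": record["query"], "passage": passage, "label": 1}
                fout.write(json.dumps(pair, ensure_ascii=False) + "\n")
            for passage in record.get("neg", []):
                pair = {"query": record["query"], "passage": passage, "label": 0}
                fout.write(json.dumps(pair, ensure_ascii=False) + "\n")

triplets_to_pairs("data/train.jsonl", "data/train_pairs.jsonl")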

Performance Optimization

For small datasets (< 100k samples), enable in-memory caching:

# In your custom script
dataset = BGEM3Dataset(
    data_path="train.jsonl",
    tokenizer=tokenizer,
    cache_in_memory=True,  # Pre-tokenize and cache in memory
    max_cache_size=100000  # Maximum samples to cache
)

Training Scripts

BGE-M3 Training Options

python scripts/train_m3.py \
    --model_name_or_path BAAI/bge-m3 \
    --train_data data/train.jsonl \
    --eval_data data/eval.jsonl \
    --output_dir ./output/bge-m3-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --gradient_accumulation_steps 2 \
    --fp16 \
    --train_group_size 8 \
    --use_hard_negatives \
    --temperature 0.02 \
    --save_steps 500 \
    --eval_steps 500 \
    --logging_steps 100

Key flags: --fp16 enables mixed precision training, --train_group_size sets the number of passages per query, --use_hard_negatives enables hard negative mining, and --temperature sets the contrastive loss temperature.
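
To make --temperature and --train_group_size concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over one positive and several negatives per query. It illustrates the idea only and is not the framework's exact implementation (which, for example, may also use in-batch negatives):

import torch
import torch.nn.functional as F

def contrastive_loss(queries, passages, temperature=0.02):
    """queries: (batch, dim) query embeddings; passages: (batch, group_size, dim),
    where index 0 in each group is the positive and the rest are negatives."""
    queries = F.normalize(queries, dim=-1)
    passages = F.normalize(passages, dim=-1)
    # Similarity of each query to the passages in its group, scaled by temperature
    logits = torch.einsum("bd,bgd->bg", queries, passages) / temperature
    labels = torch.zeros(queries.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Example with random tensors: batch of 4 queries, train_group_size of 8
loss = contrastive_loss(torch.randn(4, 768), torch.randn(4, 8, 768))
print(loss.item())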

BGE-Reranker Training Options

python scripts/train_reranker.py \
    --model_name_or_path BAAI/bge-reranker-base \
    --train_data data/train_pairs.jsonl \
    --eval_data data/eval_pairs.jsonl \
    --output_dir ./output/bge-reranker-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --max_length 512 \
    --train_group_size 16 \
    --gradient_checkpointing \
    --logging_steps 50

Key flags: --train_group_size sets the number of pairs per query, and --gradient_checkpointing reduces memory usage at some extra compute cost.
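
For reference, the reranker is a cross-encoder: it scores each (query, passage) pair jointly. A minimal scoring sketch after training, assuming the checkpoint follows the standard BGE-reranker layout with a single relevance logit:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "./output/bge-reranker-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

queries = ["What is machine learning?", "What is machine learning?"]
passages = ["ML is a subset of AI...", "The weather today..."]
batch = tokenizer(queries, passages, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

with torch.no_grad():
    scores = model(**batch).logits.view(-1)  # one relevance logit per pair
print(scores.tolist())  # higher score = more relevant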

Joint Training (RocketQAv2 Approach)

Train retriever and reranker together:

python scripts/train_joint.py \
    --retriever_model BAAI/bge-m3 \
    --reranker_model BAAI/bge-reranker-base \
    --train_data data/train.jsonl \
    --output_dir ./output/joint-training \
    --num_train_epochs 3 \
    --alternate_steps 100  # Switch between models every N steps
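
Conceptually, --alternate_steps controls which model receives gradient updates at any given time. The loop below is a simplified sketch of that alternation; the step functions are hypothetical placeholders, not the script's actual API:

def train_retriever_step(retriever, batch, teacher):
    """Hypothetical placeholder: one optimizer step on the retriever,
    optionally distilling from the reranker's scores."""

def train_reranker_step(reranker, batch, retriever):
    """Hypothetical placeholder: one optimizer step on the reranker,
    using hard negatives mined by the retriever."""

def joint_training_loop(retriever, reranker, batches, alternate_steps=100):
    for step, batch in enumerate(batches):
        if (step // alternate_steps) % 2 == 0:
            train_retriever_step(retriever, batch, teacher=reranker)   # retriever phase
        else:
            train_reranker_step(reranker, batch, retriever=retriever)  # reranker phase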

Advanced Usage

Configuration Files

Use TOML configuration for complex setups:

# config.toml
[training]
default_batch_size = 16
default_num_epochs = 3
gradient_accumulation_steps = 2

[m3]
model_name_or_path = "BAAI/bge-m3"
train_group_size = 8
temperature = 0.02

[hardware]
device_type = "cuda"  # or "npu" for Ascend

Run with config:

python scripts/train_m3.py --config_path config.toml --train_data data/train.jsonl
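
For custom tooling, a TOML file like the one above can be read with Python's standard tomllib (3.11+). This is a generic sketch, not necessarily how the training scripts parse it internally:

import tomllib  # Python 3.11+; use the tomli package on older versions

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

batch_size = config["training"]["default_batch_size"]  # 16
temperature = config["m3"]["temperature"]              # 0.02
device_type = config["hardware"]["device_type"]        # "cuda" or "npu"
print(batch_size, temperature, device_type)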

Multi-GPU Training

# Single node, multiple GPUs
torchrun --nproc_per_node=4 scripts/train_m3.py \
    --train_data data/train.jsonl \
    --output_dir ./output/bge-m3-multigpu

# Multiple nodes
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=100 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    scripts/train_m3.py --train_data data/train.jsonl

Custom Data Preprocessing

from data.preprocessing import DataPreprocessor

# Preprocess custom format data
preprocessor = DataPreprocessor(
    tokenizer=tokenizer,
    max_query_length=64,
    max_passage_length=512
)

# Convert and clean data
output_path, val_path = preprocessor.preprocess_file(
    input_path="raw_data.csv",
    output_path="processed_data.jsonl",
    file_format="csv",
    validation_split=0.1
)

Model Evaluation

python scripts/evaluate.py \
    --retriever_model ./output/bge-m3-finetuned \
    --reranker_model ./output/bge-reranker-finetuned \
    --eval_data data/test.jsonl \
    --output_dir ./evaluation_results \
    --metrics ndcg mrr recall \
    --top_k 10 20 50
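
The reported metrics follow their standard information retrieval definitions. A minimal sketch of MRR and Recall@k over ranked result lists (illustrative only; the evaluation script's internals may differ):

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=10):
    """Fraction of relevant documents found in the top k, averaged over queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits = len(set(ranked[:k]) & relevant)
        total += hits / max(len(relevant), 1)
    return total / len(ranked_lists)

# Tiny example: one query whose relevant document "d2" appears at rank 2
print(mrr([["d1", "d2", "d3"]], [{"d2"}]),
      recall_at_k([["d1", "d2", "d3"]], [{"d2"}], k=2))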

Troubleshooting

Common Issues

  1. Out of Memory

    • Reduce batch size
    • Enable gradient checkpointing: --gradient_checkpointing
    • Use fp16 training: --fp16
    • Enable in-memory caching for small datasets
  2. Slow Training

    • Increase number of data loader workers: --dataloader_num_workers 8
    • Enable in-memory caching for datasets < 100k samples
    • Use SSD for data storage
  3. Poor Performance

    • Check data quality and format
    • Adjust learning rate and warmup
    • Use hard negative mining: --use_hard_negatives
    • Increase training epochs

Debugging Tips

# Enable debug logging
export LOG_LEVEL=DEBUG
python scripts/train_m3.py ...

# Profile training
python scripts/benchmark.py \
    --model_type retriever \
    --model_path ./output/bge-m3-finetuned \
    --data_path data/test.jsonl

Best Practices

  1. Data Quality

    • Ensure balanced positive/negative samples
    • Use relevance scores when available
    • Clean and normalize text data
  2. Training Strategy

    • Start with small learning rate (1e-5 to 2e-5)
    • Use warmup (10% of steps)
    • Monitor evaluation metrics
  3. Resource Optimization

    • Use gradient accumulation for large batches
    • Enable mixed precision training
    • Consider in-memory caching for small datasets
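
As a concrete example of the gradient accumulation point above, the effective batch size is the per-device batch size multiplied by the accumulation steps and the number of devices:

per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 4  # e.g. the torchrun --nproc_per_node=4 example above

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)  # 128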

Additional Resources