BGE Fine-tuning Usage Guide
Overview
This guide provides detailed instructions on fine-tuning BGE (BAAI General Embedding) models using our enhanced training framework. The framework supports both BGE-M3 (embedding) and BGE-reranker (cross-encoder) models with state-of-the-art training techniques.
Table of Contents
- Installation
- Quick Start
- Data Preparation
- Training Scripts
- Advanced Usage
- Model Evaluation
- Troubleshooting
- Best Practices
Installation
# Clone the repository
git clone https://github.com/yourusername/bge-finetune.git
cd bge-finetune
# Install dependencies
pip install -r requirements.txt
# Optional: Install Ascend NPU support
# pip install torch-npu # If using Huawei Ascend NPUs
Quick Start
1. Prepare Your Data
The framework supports multiple data formats:
- JSONL (recommended): One JSON object per line
- CSV/TSV: Tabular data with headers
- JSON: Single JSON file with array of samples
Example JSONL format for embedding model:
{"query": "What is machine learning?", "pos": ["ML is a subset of AI..."], "neg": ["The weather today..."]}
{"query": "How to cook pasta?", "pos": ["Boil water and add pasta..."], "neg": ["Machine learning uses..."]}
2. Fine-tune BGE-M3 (Embedding Model)
python scripts/train_m3.py \
--model_name_or_path BAAI/bge-m3 \
--train_data data/train.jsonl \
--output_dir ./output/bge-m3-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--learning_rate 1e-5
3. Fine-tune BGE-Reranker
python scripts/train_reranker.py \
--model_name_or_path BAAI/bge-reranker-base \
--train_data data/train_pairs.jsonl \
--output_dir ./output/bge-reranker-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5
Data Preparation
Data Formats
1. Embedding Model (Triplets Format)
{
"query": "query text",
"pos": ["positive passage 1", "positive passage 2"],
"neg": ["negative passage 1", "negative passage 2"],
"pos_scores": [1.0, 0.9], // Optional: relevance scores
"neg_scores": [0.1, 0.2] // Optional: relevance scores
}
2. Reranker Model (Pairs Format)
{
"query": "query text",
"passage": "passage text",
"label": 1, // 1 for relevant, 0 for not relevant
"score": 0.95 // Optional: relevance score
}
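If you only have triplet-format data, reranker pairs can be derived from it. The helper below is a hypothetical example, not part of the framework:
# Derive reranker pairs from triplet records (hypothetical helper, not part of the framework)
import json
def triplet_to_pairs(record):
    """Expand one {"query", "pos", "neg"} record into labeled query-passage pairs."""
    pairs = [{"query": record["query"], "passage": p, "label": 1} for p in record["pos"]]
    pairs += [{"query": record["query"], "passage": n, "label": 0} for n in record.get("neg", [])]
    return pairs
with open("data/train.jsonl", encoding="utf-8") as fin, \
     open("data/train_pairs.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        for pair in triplet_to_pairs(json.loads(line)):
            fout.write(json.dumps(pair, ensure_ascii=False) + "\n")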
Performance Optimization
For small datasets (< 100k samples), enable in-memory caching:
# In your custom script (BGEM3Dataset is provided by this framework; adjust the import to your repo layout)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")  # tokenizer of the model being fine-tuned
dataset = BGEM3Dataset(
    data_path="train.jsonl",
    tokenizer=tokenizer,
    cache_in_memory=True,   # Pre-tokenize and cache in memory
    max_cache_size=100000   # Maximum samples to cache
)
Training Scripts
BGE-M3 Training Options
python scripts/train_m3.py \
--model_name_or_path BAAI/bge-m3 \
--train_data data/train.jsonl \
--eval_data data/eval.jsonl \
--output_dir ./output/bge-m3-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--learning_rate 1e-5 \
--warmup_ratio 0.1 \
--gradient_accumulation_steps 2 \
--fp16 \
--train_group_size 8 \
--use_hard_negatives \
--temperature 0.02 \
--save_steps 500 \
--eval_steps 500 \
--logging_steps 100
Key options: --fp16 enables mixed precision training, --train_group_size sets the number of passages per query, --use_hard_negatives enables hard negative mining, and --temperature sets the contrastive loss temperature.
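For intuition, the temperature value scales query-passage similarities inside an InfoNCE-style contrastive loss. The sketch below is illustrative only, not the framework's actual loss implementation:
# Illustrative InfoNCE loss with temperature (not the framework's exact code)
import torch
import torch.nn.functional as F
def info_nce_loss(query_emb, passage_emb, temperature=0.02):
    """query_emb: (batch, dim); passage_emb: (batch, group, dim), positive passage at index 0."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = torch.einsum("bd,bgd->bg", query_emb, passage_emb) / temperature  # cosine similarity / T
    labels = torch.zeros(query_emb.size(0), dtype=torch.long, device=query_emb.device)  # positive at index 0
    return F.cross_entropy(scores, labels)
A lower temperature (e.g. 0.02) sharpens the score distribution, so the loss concentrates on the hardest negatives in each group.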
BGE-Reranker Training Options
python scripts/train_reranker.py \
--model_name_or_path BAAI/bge-reranker-base \
--train_data data/train_pairs.jsonl \
--eval_data data/eval_pairs.jsonl \
--output_dir ./output/bge-reranker-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--max_length 512 \
--train_group_size 16 \
--gradient_checkpointing \
--logging_steps 50
Here --train_group_size is the number of pairs per query, and --gradient_checkpointing trades extra compute for lower memory use.
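After training, the reranker can be used as a standard cross-encoder. A minimal inference sketch with Hugging Face transformers, assuming the fine-tuned checkpoint keeps the base model's single-logit sequence-classification head:
# Score query-passage pairs with the fine-tuned reranker (minimal sketch)
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_path = "./output/bge-reranker-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()
pairs = [["What is machine learning?", "ML is a subset of AI..."],
         ["What is machine learning?", "The weather today..."]]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)  # higher score = more relevant
print(scores.tolist())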
Joint Training (RocketQAv2 Approach)
Train retriever and reranker together:
python scripts/train_joint.py \
--retriever_model BAAI/bge-m3 \
--reranker_model BAAI/bge-reranker-base \
--train_data data/train.jsonl \
--output_dir ./output/joint-training \
--num_train_epochs 3 \
--alternate_steps 100 # Switch between models every N steps
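Conceptually, the alternating schedule optimizes one model at a time and switches every alternate_steps steps. The sketch below is illustrative pseudocode of that schedule, not the actual trainer:
# Illustrative alternating schedule (not the actual joint trainer)
def train_jointly(retriever_step, reranker_step, batches, alternate_steps=100):
    """retriever_step / reranker_step run one optimization step for their model."""
    train_retriever = True
    for step, batch in enumerate(batches, start=1):
        if train_retriever:
            retriever_step(batch)   # e.g. contrastive loss, optionally distilled from reranker scores
        else:
            reranker_step(batch)    # e.g. pairwise loss on retriever-mined hard negatives
        if step % alternate_steps == 0:
            train_retriever = not train_retriever  # switch which model trains every N steps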
Advanced Usage
Configuration Files
Use TOML configuration for complex setups:
# config.toml
[training]
default_batch_size = 16
default_num_epochs = 3
gradient_accumulation_steps = 2
[m3]
model_name_or_path = "BAAI/bge-m3"
train_group_size = 8
temperature = 0.02
[hardware]
device_type = "cuda" # or "npu" for Ascend
Run with config:
python scripts/train_m3.py --config_path config.toml --train_data data/train.jsonl
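The training scripts load the config for you via --config_path. If you write a custom entry point, the same file can be read with Python's standard tomllib (3.11+; the tomli backport works on older versions):
# Read the TOML config shown above (Python 3.11+)
import tomllib
with open("config.toml", "rb") as f:
    config = tomllib.load(f)
print(config["training"]["default_batch_size"])  # 16
print(config["m3"]["temperature"])               # 0.02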
Multi-GPU Training
# Single node, multiple GPUs
torchrun --nproc_per_node=4 scripts/train_m3.py \
--train_data data/train.jsonl \
--output_dir ./output/bge-m3-multigpu
# Multiple nodes
torchrun --nnodes=2 --nproc_per_node=4 \
--rdzv_id=100 --rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
scripts/train_m3.py --train_data data/train.jsonl
Custom Data Preprocessing
from transformers import AutoTokenizer
from data.preprocessing import DataPreprocessor
# Tokenizer for the model you plan to fine-tune
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# Preprocess custom-format data
preprocessor = DataPreprocessor(
    tokenizer=tokenizer,
    max_query_length=64,
    max_passage_length=512
)
# Convert and clean the data, splitting off a validation set
output_path, val_path = preprocessor.preprocess_file(
    input_path="raw_data.csv",
    output_path="processed_data.jsonl",
    file_format="csv",
    validation_split=0.1
)
Model Evaluation
python scripts/evaluate.py \
--retriever_model ./output/bge-m3-finetuned \
--reranker_model ./output/bge-reranker-finetuned \
--eval_data data/test.jsonl \
--output_dir ./evaluation_results \
--metrics ndcg mrr recall \
--top_k 10 20 50
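For reference, MRR@k over ranked results can be computed as in the sketch below; this is illustrative and may differ from what scripts/evaluate.py does internally:
# Minimal MRR@k computation (illustrative)
def mrr_at_k(ranked_results, k=10):
    """ranked_results: one list per query of 0/1 relevance labels in ranked order."""
    total = 0.0
    for labels in ranked_results:
        for rank, label in enumerate(labels[:k], start=1):
            if label == 1:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
print(mrr_at_k([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # (1/2 + 1 + 0) / 3 = 0.5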
Troubleshooting
Common Issues
- Out of Memory
  - Reduce batch size
  - Enable gradient checkpointing: --gradient_checkpointing
  - Use fp16 training: --fp16
  - Enable in-memory caching for small datasets
- Slow Training
  - Increase number of data loader workers: --dataloader_num_workers 8
  - Enable in-memory caching for datasets < 100k samples
  - Use SSD for data storage
- Poor Performance
  - Check data quality and format
  - Adjust learning rate and warmup
  - Use hard negative mining: --use_hard_negatives
  - Increase training epochs
Debugging Tips
# Enable debug logging
export LOG_LEVEL=DEBUG
python scripts/train_m3.py ...
# Profile training
python scripts/benchmark.py \
--model_type retriever \
--model_path ./output/bge-m3-finetuned \
--data_path data/test.jsonl
Best Practices
- Data Quality
  - Ensure balanced positive/negative samples
  - Use relevance scores when available
  - Clean and normalize text data
- Training Strategy
  - Start with a small learning rate (1e-5 to 2e-5)
  - Use warmup (10% of steps)
  - Monitor evaluation metrics
- Resource Optimization
  - Use gradient accumulation for large batches (see the note below)
  - Enable mixed precision training
  - Consider in-memory caching for small datasets
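As a rule of thumb, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; with the values used above (16 × 2 on 4 GPUs), that gives an effective batch of 128.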