# BGE Data Pipeline Format Specifications
The refactored BGE data pipeline supports four distinct input formats with automatic detection and conversion:
## 📊 Format Overview
| Format | Use Case | Detection Key | Output |
|---|---|---|---|
| Triplets | BGE-M3 embedding training | `pos` + `neg` | Contrastive learning batches |
| Pairs | BGE-reranker training | `passage` + `label` | Cross-encoder pairs |
| Candidates | Unified threshold conversion | `candidates` | Auto-exploded pairs |
| Legacy Nested | Backward compatibility | `pos` only | Converted to pairs |
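The key-presence detection in the table can be sketched as a simple check over each JSONL record. This is a hypothetical illustration, not the pipeline's actual API; note that the legacy example later in this document also carries `neg`, so a real implementation presumably disambiguates legacy data through extra context such as the `legacy_support=True` flag shown in the usage examples.

```python
def detect_format(record: dict) -> str:
    """Classify one JSONL record by which keys are present.

    Illustrative sketch following the detection keys in the table above;
    the function name and return labels are assumptions.
    """
    if "candidates" in record:
        return "candidates"
    if "passage" in record and "label" in record:
        return "pairs"
    if "pos" in record and "neg" in record:
        return "triplets"
    if "pos" in record:
        return "legacy_nested"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")
```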
## 1. 🎯 Triplets Format (BGE-M3 Embedding)
For contrastive learning with multiple positives and negatives per query.
```json
{
  "query": "What is machine learning?",
  "pos": [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
    "ML algorithms use statistical techniques to learn patterns from data and make predictions."
  ],
  "neg": [
    "Weather forecasting uses meteorological data to predict atmospheric conditions.",
    "Cooking recipes require specific ingredients and step-by-step instructions."
  ],
  "pos_scores": [0.95, 0.88],
  "neg_scores": [0.15, 0.08],
  "prompt": "为此查询生成表示:",
  "type": "definition"
}
```
### Required Fields

- `query`: Query text
- `pos`: List of positive passages

### Optional Fields

- `neg`: List of negative passages
- `pos_scores`: Relevance scores for positives
- `neg_scores`: Relevance scores for negatives
- `prompt`: Query instruction prefix
- `type`: Query type classification
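Records in this format are plain JSONL, one object per line. A minimal sketch for producing a valid file (the helper name `write_triplets` is illustrative, not part of the pipeline):

```python
import json

def write_triplets(path, records):
    """Write triplet records to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # query and pos are the only required fields per the spec above
            assert "query" in rec and "pos" in rec, "query and pos are required"
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_triplets("triplets_data.jsonl", [
    {"query": "What is machine learning?",
     "pos": ["ML is a subset of AI."],
     "neg": ["Cooking recipes need ingredients."]},
])
```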
## 2. 🔍 Pairs Format (BGE-Reranker)
For cross-encoder training with individual query-passage pairs.
```json
{
  "query": "What is machine learning?",
  "passage": "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
  "label": 1,
  "score": 0.95,
  "qid": "q1",
  "pid": "p1"
}
```
### Required Fields

- `query`: Query text
- `passage`: Passage text
- `label`: Relevance label (0 or 1)

### Optional Fields

- `score`: Relevance score (0.0-1.0)
- `qid`: Query identifier
- `pid`: Passage identifier
## 3. 🎲 Candidates Format (Unified)
For datasets with multiple candidates per query. Each candidate is automatically exploded into a separate pair based on score threshold.
```json
{
  "query": "What is artificial intelligence?",
  "qid": "q1",
  "candidates": [
    {
      "text": "Artificial intelligence is the simulation of human intelligence in machines.",
      "score": 0.94,
      "pid": "c1"
    },
    {
      "text": "AI systems can perform tasks that typically require human intelligence.",
      "score": 0.87,
      "pid": "c2"
    },
    {
      "text": "Mountain climbing requires proper equipment and training.",
      "score": 0.23,
      "pid": "c3"
    }
  ]
}
```
### Required Fields

- `query`: Query text
- `candidates`: List of candidate objects

### Candidate Object Fields

- `text`: Candidate passage text
- `score`: Relevance score (optional, defaults to 1.0)
- `pid`: Passage identifier (optional, auto-generated)

### Processing Logic

- **Threshold Assignment**: Candidates with `score >= threshold` become `label=1`; the rest become `label=0`
- **Default Threshold**: 0.5 (configurable)
- **Output**: Multiple pairs, one per candidate
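The explosion step can be sketched as below. This is a minimal illustration of the threshold logic, not the pipeline's actual code; the `auto_{i}` naming for missing `pid` values is an assumption standing in for whatever auto-generation scheme the pipeline uses.

```python
def explode_candidates(record: dict, threshold: float = 0.5) -> list:
    """Explode one candidates record into flat (query, passage, label) pairs."""
    pairs = []
    for i, cand in enumerate(record["candidates"]):
        score = cand.get("score", 1.0)  # score is optional, defaults to 1.0
        pairs.append({
            "query": record["query"],
            "passage": cand["text"],
            "label": 1 if score >= threshold else 0,  # threshold assignment
            "score": score,
            "qid": record.get("qid"),
            "pid": cand.get("pid", f"auto_{i}"),  # hypothetical auto-generated id
        })
    return pairs
```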
## 4. 🔄 Legacy Nested Format (Backward Compatibility)
Legacy format automatically converted to flat pairs for reranker training.
```json
{
  "query": "What is natural language processing?",
  "pos": [
    "Natural language processing is a field of AI that helps computers understand human language.",
    "NLP combines computational linguistics with machine learning to process text and speech."
  ],
  "neg": [
    "Automobile manufacturing involves assembly lines and quality control processes.",
    "Photography captures light through camera lenses to create images."
  ],
  "pos_scores": [0.93, 0.86],
  "neg_scores": [0.14, 0.22]
}
```
### Required Fields

- `query`: Query text
- `pos`: List of positive passages

### Optional Fields

- `neg`: List of negative passages
- `pos_scores`: Scores for positives
- `neg_scores`: Scores for negatives

### Processing Logic

- Each `pos[i]` becomes `(query, passage, label=1)`
- Each `neg[j]` becomes `(query, passage, label=0)`
- Maintains original scores
- Tracks source as `legacy_nested`
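The nested-to-flat conversion can be sketched as below. The function name and exact output schema are assumptions for illustration; the `source` field mirrors the `legacy_nested` tracking tag described above.

```python
def flatten_legacy(record: dict) -> list:
    """Flatten one legacy nested record into (query, passage, label) pairs."""
    pos_scores = record.get("pos_scores", [])
    neg_scores = record.get("neg_scores", [])
    pairs = []
    for i, passage in enumerate(record["pos"]):  # each pos[i] -> label=1
        pairs.append({"query": record["query"], "passage": passage, "label": 1,
                      "score": pos_scores[i] if i < len(pos_scores) else None,
                      "source": "legacy_nested"})
    for j, passage in enumerate(record.get("neg", [])):  # each neg[j] -> label=0
        pairs.append({"query": record["query"], "passage": passage, "label": 0,
                      "score": neg_scores[j] if j < len(neg_scores) else None,
                      "source": "legacy_nested"})
    return pairs
```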
## 🔧 Usage Examples

### Fine-tuning Data Loader
```python
from data.dataset import BGEM3Dataset, BGERerankerDataset, BGEDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Triplets for BGE-M3
triplets_dataset = BGEM3Dataset(
    data_path="triplets_data.jsonl",
    tokenizer=tokenizer,
    format_type="triplets"
)

# Pairs for BGE-reranker
pairs_dataset = BGERerankerDataset(
    data_path="pairs_data.jsonl",
    tokenizer=tokenizer,
    format_type="pairs"
)

# Candidates with threshold conversion
candidates_dataset = BGEDataset(
    data_path="candidates_data.jsonl",
    tokenizer=tokenizer,
    format_type="candidates",
    candidates_threshold=0.6  # Custom threshold
)

# Legacy with automatic conversion
legacy_dataset = BGERerankerDataset(
    data_path="legacy_data.jsonl",
    tokenizer=tokenizer,
    legacy_support=True
)
```
### Optimization Pipeline
```bash
# BGE-M3 optimization (triplets → cached triplets)
python -m data.optimization bge-m3 triplets_data.jsonl \
    --output_dir ./cache/data \
    --hard_negative_ratio 0.8 \
    --difficulty_threshold 0.2

# BGE-reranker optimization (pairs → cached pairs)
python -m data.optimization bge-reranker pairs_data.jsonl \
    --output_dir ./cache/data \
    --output_format flat

# Candidates optimization (candidates → cached pairs)
python -m data.optimization bge-reranker candidates_data.jsonl \
    --output_dir ./cache/data

# Legacy optimization (nested → cached pairs)
python -m data.optimization bge-reranker legacy_data.jsonl \
    --output_dir ./cache/data \
    --output_format nested
```
## ⚡ Performance Features

### Enhanced Processing

- **10-15x faster loading** through pre-tokenization
- **Automatic format detection** based on key presence
- **Score-weighted sampling** for better training quality
- **Hard negative mining** with difficulty thresholds
### Conversion Utilities

- **Candidates → Pairs**: Threshold-based label assignment
- **Legacy → Pairs**: Automatic nested-to-flat conversion
- **Pairs → Triplets**: Grouping by query for embedding training
- **Format validation**: Comprehensive error checking
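The Pairs → Triplets grouping can be sketched as below. This is a minimal illustration under the assumption that grouping only needs `query`, `passage`, and `label`; the real utility presumably also carries scores and instruction prefixes through.

```python
from collections import defaultdict

def pairs_to_triplets(pairs: list) -> list:
    """Group flat pairs by query into triplet records for embedding training."""
    grouped = defaultdict(lambda: {"pos": [], "neg": []})
    for p in pairs:
        # label=1 passages become positives, label=0 become negatives
        bucket = "pos" if p["label"] == 1 else "neg"
        grouped[p["query"]][bucket].append(p["passage"])
    return [{"query": q, **sides} for q, sides in grouped.items()]
```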
### Backward Compatibility

- **Legacy support**: Full compatibility with existing datasets
- **Gradual migration**: Convert formats incrementally
- **Source tracking**: Know the original format of converted data
- **Flexible thresholds**: Configurable conversion parameters
## 🚀 Key Benefits

- **Unified Pipeline**: Single codebase handles all formats
- **Automatic Detection**: No manual format specification needed
- **Seamless Conversion**: Transparent format transformations
- **Performance Optimized**: Significant speedup through caching
- **Backward Compatible**: Works with existing datasets
- **Flexible Configuration**: Customizable thresholds and parameters