bge_finetune/docs/data_formats.md

BGE Data Pipeline Format Specifications

The refactored BGE data pipeline supports four distinct input formats with automatic detection and conversion:

📊 Format Overview

| Format | Use Case | Detection Key | Output |
| --- | --- | --- | --- |
| Triplets | BGE-M3 embedding training | pos + neg | Contrastive learning batches |
| Pairs | BGE-reranker training | passage + label | Cross-encoder pairs |
| Candidates | Unified threshold conversion | candidates | Auto-exploded pairs |
| Legacy Nested | Backward compatibility | pos only | Converted to pairs |
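
Key-based detection could be sketched as follows. Note that detect_format is a hypothetical helper, not the pipeline's actual API, and that triplets and legacy records can share keys in practice, so the real pipeline may also consult an explicitly configured format_type:

```python
def detect_format(record: dict) -> str:
    """Guess the input format of a single JSONL record from its keys,
    following the detection-key column of the table above."""
    if "candidates" in record:
        return "candidates"
    if "passage" in record and "label" in record:
        return "pairs"
    if "pos" in record and "neg" in record:
        return "triplets"
    if "pos" in record:
        return "legacy_nested"
    raise ValueError(f"unrecognized record keys: {sorted(record)}")
```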

1. 🎯 Triplets Format (BGE-M3 Embedding)

For contrastive learning with multiple positives and negatives per query.

{
  "query": "What is machine learning?",
  "pos": [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
    "ML algorithms use statistical techniques to learn patterns from data and make predictions."
  ],
  "neg": [
    "Weather forecasting uses meteorological data to predict atmospheric conditions.", 
    "Cooking recipes require specific ingredients and step-by-step instructions."
  ],
  "pos_scores": [0.95, 0.88],
  "neg_scores": [0.15, 0.08],
  "prompt": "为此查询生成表示:",
  "type": "definition"
}

Required Fields

  • query: Query text
  • pos: List of positive passages

Optional Fields

  • neg: List of negative passages
  • pos_scores: Relevance scores for positives
  • neg_scores: Relevance scores for negatives
  • prompt: Query instruction prefix
  • type: Query type classification

2. 🔍 Pairs Format (BGE-Reranker)

For cross-encoder training with individual query-passage pairs.

{
  "query": "What is machine learning?",
  "passage": "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
  "label": 1,
  "score": 0.95,
  "qid": "q1", 
  "pid": "p1"
}

Required Fields

  • query: Query text
  • passage: Passage text
  • label: Relevance label (0 or 1)

Optional Fields

  • score: Relevance score (0.0-1.0)
  • qid: Query identifier
  • pid: Passage identifier

3. 🎲 Candidates Format (Unified)

For datasets with multiple candidates per query. Each candidate is automatically exploded into a separate pair, with its label assigned by a score threshold.

{
  "query": "What is artificial intelligence?",
  "qid": "q1",
  "candidates": [
    {
      "text": "Artificial intelligence is the simulation of human intelligence in machines.",
      "score": 0.94,
      "pid": "c1"
    },
    {
      "text": "AI systems can perform tasks that typically require human intelligence.",
      "score": 0.87, 
      "pid": "c2"
    },
    {
      "text": "Mountain climbing requires proper equipment and training.",
      "score": 0.23,
      "pid": "c3"
    }
  ]
}

Required Fields

  • query: Query text
  • candidates: List of candidate objects

Candidate Object Fields

  • text: Candidate passage text
  • score: Relevance score (optional, defaults to 1.0)
  • pid: Passage identifier (optional, auto-generated)

Processing Logic

  • Threshold Assignment: Candidates with score >= threshold become label=1, others become label=0
  • Default Threshold: 0.5 (configurable)
  • Output: Multiple pairs, one per candidate
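
The threshold logic above can be sketched as follows (explode_candidates is an illustrative name, not the pipeline's actual function):

```python
def explode_candidates(record: dict, threshold: float = 0.5) -> list[dict]:
    """Explode one candidates record into flat (query, passage, label) pairs.

    Candidates scoring >= threshold get label=1, the rest label=0.
    """
    pairs = []
    for i, cand in enumerate(record["candidates"]):
        score = cand.get("score", 1.0)            # score is optional, defaults to 1.0
        pairs.append({
            "query": record["query"],
            "passage": cand["text"],
            "label": 1 if score >= threshold else 0,
            "score": score,
            "qid": record.get("qid"),
            "pid": cand.get("pid", f"auto_{i}"),  # pid auto-generated when missing
        })
    return pairs
```

Applied to the example above with the default threshold of 0.5, this yields three pairs with labels 1, 1, and 0.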

4. 🔄 Legacy Nested Format (Backward Compatibility)

A legacy nested format that is automatically converted to flat pairs for reranker training.

{
  "query": "What is natural language processing?",
  "pos": [
    "Natural language processing is a field of AI that helps computers understand human language.",
    "NLP combines computational linguistics with machine learning to process text and speech."
  ],
  "neg": [
    "Automobile manufacturing involves assembly lines and quality control processes.",
    "Photography captures light through camera lenses to create images."
  ],
  "pos_scores": [0.93, 0.86],
  "neg_scores": [0.14, 0.22]
}

Required Fields

  • query: Query text
  • pos: List of positive passages

Optional Fields

  • neg: List of negative passages
  • pos_scores: Scores for positives
  • neg_scores: Scores for negatives

Processing Logic

  • Each pos[i] becomes (query, passage, label=1)
  • Each neg[j] becomes (query, passage, label=0)
  • Maintains original scores
  • Tracks source as legacy_nested
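
A minimal sketch of this conversion (flatten_legacy is a hypothetical name; the real converter may differ in detail):

```python
def flatten_legacy(record: dict) -> list[dict]:
    """Flatten a legacy nested record into (query, passage, label) pairs,
    preserving any per-passage scores and tagging the source format."""
    pos_scores = record.get("pos_scores", [])
    neg_scores = record.get("neg_scores", [])
    pairs = []
    for i, passage in enumerate(record["pos"]):          # each pos[i] -> label=1
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 1,
            "score": pos_scores[i] if i < len(pos_scores) else None,
            "source": "legacy_nested",                   # track original format
        })
    for j, passage in enumerate(record.get("neg", [])):  # each neg[j] -> label=0
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 0,
            "score": neg_scores[j] if j < len(neg_scores) else None,
            "source": "legacy_nested",
        })
    return pairs
```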

🔧 Usage Examples

Fine-tuning Data Loader

from data.dataset import BGEM3Dataset, BGERerankerDataset, BGEDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Triplets for BGE-M3
triplets_dataset = BGEM3Dataset(
    data_path="triplets_data.jsonl",
    tokenizer=tokenizer,
    format_type="triplets"
)

# Pairs for BGE-reranker  
pairs_dataset = BGERerankerDataset(
    data_path="pairs_data.jsonl", 
    tokenizer=tokenizer,
    format_type="pairs"
)

# Candidates with threshold conversion
candidates_dataset = BGEDataset(
    data_path="candidates_data.jsonl",
    tokenizer=tokenizer, 
    format_type="candidates",
    candidates_threshold=0.6  # Custom threshold
)

# Legacy with automatic conversion
legacy_dataset = BGERerankerDataset(
    data_path="legacy_data.jsonl",
    tokenizer=tokenizer,
    legacy_support=True
)

Optimization Pipeline

# BGE-M3 optimization (triplets → cached triplets)
python -m data.optimization bge-m3 triplets_data.jsonl \
  --output_dir ./cache/data \
  --hard_negative_ratio 0.8 \
  --difficulty_threshold 0.2

# BGE-reranker optimization (pairs → cached pairs)  
python -m data.optimization bge-reranker pairs_data.jsonl \
  --output_dir ./cache/data \
  --output_format flat

# Candidates optimization (candidates → cached pairs)
python -m data.optimization bge-reranker candidates_data.jsonl \
  --output_dir ./cache/data

# Legacy optimization (nested → cached pairs)
python -m data.optimization bge-reranker legacy_data.jsonl \
  --output_dir ./cache/data \
  --output_format nested

Performance Features

Enhanced Processing

  • 10-15x faster loading through pre-tokenization
  • Automatic format detection based on key presence
  • Score-weighted sampling for better training quality
  • Hard negative mining with difficulty thresholds
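
The last two ideas can be illustrated with small sketches. These are hypothetical helpers under simple assumptions (a "hard" negative is one whose score is above the difficulty threshold), not the pipeline's actual implementation:

```python
import random

def mine_hard_negatives(negatives, neg_scores, difficulty_threshold=0.2):
    """Keep only 'hard' negatives: those whose relevance score exceeds the
    difficulty threshold, i.e. the negatives most easily confused with positives."""
    return [n for n, s in zip(negatives, neg_scores) if s >= difficulty_threshold]

def sample_score_weighted(passages, scores, k=1, seed=None):
    """Sample passages with probability proportional to their relevance score."""
    rng = random.Random(seed)
    return rng.choices(passages, weights=scores, k=k)
```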

Conversion Utilities

  • Candidates → Pairs: Threshold-based label assignment
  • Legacy → Pairs: Automatic nested-to-flat conversion
  • Pairs → Triplets: Grouping by query for embedding training
  • Format validation: Comprehensive error checking
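
The pairs → triplets grouping can be sketched as follows (pairs_to_triplets is an illustrative name):

```python
from collections import defaultdict

def pairs_to_triplets(pairs: list[dict]) -> list[dict]:
    """Regroup flat (query, passage, label) pairs by query into
    triplets-style records for embedding training."""
    grouped = defaultdict(lambda: {"pos": [], "neg": []})
    for p in pairs:
        bucket = "pos" if p["label"] == 1 else "neg"  # label=1 -> positive passage
        grouped[p["query"]][bucket].append(p["passage"])
    return [{"query": q, "pos": g["pos"], "neg": g["neg"]} for q, g in grouped.items()]
```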

Backward Compatibility

  • Legacy support: Full compatibility with existing datasets
  • Gradual migration: Convert formats incrementally
  • Source tracking: Know the original format of converted data
  • Flexible thresholds: Configurable conversion parameters

🚀 Key Benefits

  1. Unified Pipeline: Single codebase handles all formats
  2. Automatic Detection: No manual format specification needed
  3. Seamless Conversion: Transparent format transformations
  4. Performance Optimized: Significant speedup through caching
  5. Backward Compatible: Works with existing datasets
  6. Flexible Configuration: Customizable thresholds and parameters