# BGE Data Pipeline Format Specifications

The refactored BGE data pipeline supports four distinct input formats with automatic detection and conversion:

## 📊 Format Overview

| Format | Use Case | Detection Key | Output |
|--------|----------|---------------|--------|
| **Triplets** | BGE-M3 embedding training | `pos` + `neg` | Contrastive learning batches |
| **Pairs** | BGE-reranker training | `passage` + `label` | Cross-encoder pairs |
| **Candidates** | Unified threshold conversion | `candidates` | Auto-exploded pairs |
| **Legacy Nested** | Backward compatibility | `pos` only | Converted to pairs |
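The detection keys above can be sketched as a simple key-presence check. This is an illustrative sketch, not the pipeline's actual implementation; the function name `detect_format` is assumed:

```python
def detect_format(record: dict) -> str:
    """Guess the input format of a single JSONL record from its keys."""
    if "candidates" in record:
        return "candidates"
    if "passage" in record and "label" in record:
        return "pairs"
    if "pos" in record and "neg" in record:
        return "triplets"
    if "pos" in record:
        return "legacy_nested"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")
```

Note that triplets and legacy nested records can share the same keys, so in practice the target model (embedding vs. reranker) may also inform which interpretation applies.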
---

## 1. 🎯 Triplets Format (BGE-M3 Embedding)

For contrastive learning with multiple positives and negatives per query.

```json
{
  "query": "What is machine learning?",
  "pos": [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
    "ML algorithms use statistical techniques to learn patterns from data and make predictions."
  ],
  "neg": [
    "Weather forecasting uses meteorological data to predict atmospheric conditions.",
    "Cooking recipes require specific ingredients and step-by-step instructions."
  ],
  "pos_scores": [0.95, 0.88],
  "neg_scores": [0.15, 0.08],
  "prompt": "为此查询生成表示:",
  "type": "definition"
}
```

### Required Fields
- `query`: Query text
- `pos`: List of positive passages

### Optional Fields
- `neg`: List of negative passages
- `pos_scores`: Relevance scores for positives
- `neg_scores`: Relevance scores for negatives
- `prompt`: Query instruction prefix
- `type`: Query type classification
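A minimal reader for this format might look like the following. This is an illustrative sketch, assuming one JSON object per line; `load_triplets` is not part of the pipeline's API:

```python
import json

REQUIRED = {"query", "pos"}

def load_triplets(path: str) -> list[dict]:
    """Read a triplets JSONL file, checking required fields per record."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            missing = REQUIRED - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            record.setdefault("neg", [])  # negatives are optional
            records.append(record)
    return records
```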
---

## 2. 🔍 Pairs Format (BGE-Reranker)

For cross-encoder training with individual query-passage pairs.

```json
{
  "query": "What is machine learning?",
  "passage": "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
  "label": 1,
  "score": 0.95,
  "qid": "q1",
  "pid": "p1"
}
```

### Required Fields
- `query`: Query text
- `passage`: Passage text
- `label`: Relevance label (0 or 1)

### Optional Fields
- `score`: Relevance score (0.0-1.0)
- `qid`: Query identifier
- `pid`: Passage identifier
---

## 3. 🎲 Candidates Format (Unified)

For datasets with multiple candidates per query. Each candidate is automatically exploded into a separate pair based on a score threshold.

```json
{
  "query": "What is artificial intelligence?",
  "qid": "q1",
  "candidates": [
    {
      "text": "Artificial intelligence is the simulation of human intelligence in machines.",
      "score": 0.94,
      "pid": "c1"
    },
    {
      "text": "AI systems can perform tasks that typically require human intelligence.",
      "score": 0.87,
      "pid": "c2"
    },
    {
      "text": "Mountain climbing requires proper equipment and training.",
      "score": 0.23,
      "pid": "c3"
    }
  ]
}
```

### Required Fields
- `query`: Query text
- `candidates`: List of candidate objects

### Candidate Object Fields
- `text`: Candidate passage text
- `score`: Relevance score (optional, defaults to 1.0)
- `pid`: Passage identifier (optional, auto-generated)

### Processing Logic
- **Threshold Assignment**: Candidates with `score >= threshold` become `label=1`; others become `label=0`
- **Default Threshold**: 0.5 (configurable)
- **Output**: Multiple pairs, one per candidate
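The explosion step described above can be sketched as follows. This is illustrative only; `explode_candidates` is an assumed name, not the pipeline's actual function:

```python
def explode_candidates(record: dict, threshold: float = 0.5) -> list[dict]:
    """Explode one candidates record into flat (query, passage, label) pairs."""
    pairs = []
    for i, cand in enumerate(record["candidates"]):
        score = cand.get("score", 1.0)  # score is optional, defaults to 1.0
        pairs.append({
            "query": record["query"],
            "passage": cand["text"],
            "label": 1 if score >= threshold else 0,  # threshold assignment
            "score": score,
            "qid": record.get("qid"),
            "pid": cand.get("pid", f"auto_{i}"),  # auto-generate pid if absent
        })
    return pairs
```

Applied to the example record above with the default threshold of 0.5, the first two candidates become `label=1` pairs and the third becomes `label=0`.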
---

## 4. 🔄 Legacy Nested Format (Backward Compatibility)

A legacy format that is automatically converted to flat pairs for reranker training.

```json
{
  "query": "What is natural language processing?",
  "pos": [
    "Natural language processing is a field of AI that helps computers understand human language.",
    "NLP combines computational linguistics with machine learning to process text and speech."
  ],
  "neg": [
    "Automobile manufacturing involves assembly lines and quality control processes.",
    "Photography captures light through camera lenses to create images."
  ],
  "pos_scores": [0.93, 0.86],
  "neg_scores": [0.14, 0.22]
}
```

### Required Fields
- `query`: Query text
- `pos`: List of positive passages

### Optional Fields
- `neg`: List of negative passages
- `pos_scores`: Scores for positives
- `neg_scores`: Scores for negatives

### Processing Logic
- Each `pos[i]` becomes `(query, passage, label=1)`
- Each `neg[j]` becomes `(query, passage, label=0)`
- Maintains original scores
- Tracks source as `legacy_nested`
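The nested-to-flat conversion described above can be sketched like this (a minimal illustration; `flatten_legacy` is an assumed name):

```python
def flatten_legacy(record: dict) -> list[dict]:
    """Convert one legacy nested record into flat reranker pairs."""
    pairs = []
    pos_scores = record.get("pos_scores", [])
    neg_scores = record.get("neg_scores", [])
    for i, passage in enumerate(record["pos"]):
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 1,
            "score": pos_scores[i] if i < len(pos_scores) else None,
            "source": "legacy_nested",  # source tracking, per the list above
        })
    for j, passage in enumerate(record.get("neg", [])):
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 0,
            "score": neg_scores[j] if j < len(neg_scores) else None,
            "source": "legacy_nested",
        })
    return pairs
```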
---

## 🔧 Usage Examples

### Fine-tuning Data Loader

```python
from data.dataset import BGEM3Dataset, BGERerankerDataset, BGEDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Triplets for BGE-M3
triplets_dataset = BGEM3Dataset(
    data_path="triplets_data.jsonl",
    tokenizer=tokenizer,
    format_type="triplets"
)

# Pairs for BGE-reranker
pairs_dataset = BGERerankerDataset(
    data_path="pairs_data.jsonl",
    tokenizer=tokenizer,
    format_type="pairs"
)

# Candidates with threshold conversion
candidates_dataset = BGEDataset(
    data_path="candidates_data.jsonl",
    tokenizer=tokenizer,
    format_type="candidates",
    candidates_threshold=0.6  # Custom threshold
)

# Legacy with automatic conversion
legacy_dataset = BGERerankerDataset(
    data_path="legacy_data.jsonl",
    tokenizer=tokenizer,
    legacy_support=True
)
```

### Optimization Pipeline

```bash
# BGE-M3 optimization (triplets → cached triplets)
python -m data.optimization bge-m3 triplets_data.jsonl \
    --output_dir ./cache/data \
    --hard_negative_ratio 0.8 \
    --difficulty_threshold 0.2

# BGE-reranker optimization (pairs → cached pairs)
python -m data.optimization bge-reranker pairs_data.jsonl \
    --output_dir ./cache/data \
    --output_format flat

# Candidates optimization (candidates → cached pairs)
python -m data.optimization bge-reranker candidates_data.jsonl \
    --output_dir ./cache/data

# Legacy optimization (nested → cached pairs)
python -m data.optimization bge-reranker legacy_data.jsonl \
    --output_dir ./cache/data \
    --output_format nested
```

---

## ⚡ Performance Features

### Enhanced Processing
- **10-15x faster loading** through pre-tokenization
- **Automatic format detection** based on key presence
- **Score-weighted sampling** for better training quality
- **Hard negative mining** with difficulty thresholds

### Conversion Utilities
- **Candidates → Pairs**: Threshold-based label assignment
- **Legacy → Pairs**: Automatic nested-to-flat conversion
- **Pairs → Triplets**: Grouping by query for embedding training
- **Format validation**: Comprehensive error checking

### Backward Compatibility
- **Legacy support**: Full compatibility with existing datasets
- **Gradual migration**: Convert formats incrementally
- **Source tracking**: Know the original format of converted data
- **Flexible thresholds**: Configurable conversion parameters

---

## 🚀 Key Benefits

1. **Unified Pipeline**: Single codebase handles all formats
2. **Automatic Detection**: No manual format specification needed
3. **Seamless Conversion**: Transparent format transformations
4. **Performance Optimized**: Significant speedup through caching
5. **Backward Compatible**: Works with existing datasets
6. **Flexible Configuration**: Customizable thresholds and parameters