# BGE Data Pipeline Format Specifications

The refactored BGE data pipeline supports four distinct input formats with automatic detection and conversion:

## 📊 Format Overview

| Format | Use Case | Detection Key | Output |
|--------|----------|---------------|--------|
| **Triplets** | BGE-M3 embedding training | `pos` + `neg` | Contrastive learning batches |
| **Pairs** | BGE-reranker training | `passage` + `label` | Cross-encoder pairs |
| **Candidates** | Unified threshold conversion | `candidates` | Auto-exploded pairs |
| **Legacy Nested** | Backward compatibility | `pos` only | Converted to pairs |
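The detection keys above can be sketched as a simple key-presence check. This is an illustrative sketch, not the pipeline's actual implementation; the function name `detect_format` is assumed:

```python
def detect_format(record: dict) -> str:
    """Guess the input format of a single JSONL record from its keys."""
    if "candidates" in record:
        return "candidates"
    if "passage" in record and "label" in record:
        return "pairs"
    if "pos" in record and "neg" in record:
        return "triplets"
    if "pos" in record:
        return "legacy_nested"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")
```

Note that triplets and legacy nested records can share the same keys, so in practice the target model (embedding vs. reranker) may also inform which interpretation applies.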
---

## 1. 🎯 Triplets Format (BGE-M3 Embedding)

For contrastive learning with multiple positives and negatives per query.

```json
{
  "query": "What is machine learning?",
  "pos": [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
    "ML algorithms use statistical techniques to learn patterns from data and make predictions."
  ],
  "neg": [
    "Weather forecasting uses meteorological data to predict atmospheric conditions.",
    "Cooking recipes require specific ingredients and step-by-step instructions."
  ],
  "pos_scores": [0.95, 0.88],
  "neg_scores": [0.15, 0.08],
  "prompt": "为此查询生成表示:",
  "type": "definition"
}
```

### Required Fields
- `query`: Query text
- `pos`: List of positive passages

### Optional Fields
- `neg`: List of negative passages
- `pos_scores`: Relevance scores for positives
- `neg_scores`: Relevance scores for negatives
- `prompt`: Query instruction prefix
- `type`: Query type classification
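A minimal reader for this format might look like the following. This is an illustrative sketch, assuming one JSON object per line; `load_triplets` is not part of the pipeline's API:

```python
import json

REQUIRED = {"query", "pos"}

def load_triplets(path: str) -> list[dict]:
    """Read a triplets JSONL file, checking required fields per record."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            missing = REQUIRED - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            record.setdefault("neg", [])  # negatives are optional
            records.append(record)
    return records
```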
---

## 2. 🔍 Pairs Format (BGE-Reranker)

For cross-encoder training with individual query-passage pairs.

```json
{
  "query": "What is machine learning?",
  "passage": "Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
  "label": 1,
  "score": 0.95,
  "qid": "q1",
  "pid": "p1"
}
```

### Required Fields
- `query`: Query text
- `passage`: Passage text
- `label`: Relevance label (0 or 1)

### Optional Fields
- `score`: Relevance score (0.0-1.0)
- `qid`: Query identifier
- `pid`: Passage identifier
---

## 3. 🎲 Candidates Format (Unified)

For datasets with multiple candidates per query. Each candidate is automatically exploded into a separate pair based on a score threshold.

```json
{
  "query": "What is artificial intelligence?",
  "qid": "q1",
  "candidates": [
    {
      "text": "Artificial intelligence is the simulation of human intelligence in machines.",
      "score": 0.94,
      "pid": "c1"
    },
    {
      "text": "AI systems can perform tasks that typically require human intelligence.",
      "score": 0.87,
      "pid": "c2"
    },
    {
      "text": "Mountain climbing requires proper equipment and training.",
      "score": 0.23,
      "pid": "c3"
    }
  ]
}
```

### Required Fields
- `query`: Query text
- `candidates`: List of candidate objects

### Candidate Object Fields
- `text`: Candidate passage text
- `score`: Relevance score (optional, defaults to 1.0)
- `pid`: Passage identifier (optional, auto-generated)

### Processing Logic
- **Threshold Assignment**: Candidates with `score >= threshold` become `label=1`; others become `label=0`
- **Default Threshold**: 0.5 (configurable)
- **Output**: Multiple pairs, one per candidate
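The explosion step described above can be sketched as follows. This is illustrative only; `explode_candidates` is an assumed name, not the pipeline's actual function:

```python
def explode_candidates(record: dict, threshold: float = 0.5) -> list[dict]:
    """Explode one candidates record into flat (query, passage, label) pairs."""
    pairs = []
    for i, cand in enumerate(record["candidates"]):
        score = cand.get("score", 1.0)  # score is optional, defaults to 1.0
        pairs.append({
            "query": record["query"],
            "passage": cand["text"],
            "label": 1 if score >= threshold else 0,  # threshold assignment
            "score": score,
            "qid": record.get("qid"),
            "pid": cand.get("pid", f"auto_{i}"),  # auto-generate pid if absent
        })
    return pairs
```

Applied to the example record above with the default threshold of 0.5, the first two candidates become `label=1` pairs and the third becomes `label=0`.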
---

## 4. 🔄 Legacy Nested Format (Backward Compatibility)

A legacy format that is automatically converted to flat pairs for reranker training.

```json
{
  "query": "What is natural language processing?",
  "pos": [
    "Natural language processing is a field of AI that helps computers understand human language.",
    "NLP combines computational linguistics with machine learning to process text and speech."
  ],
  "neg": [
    "Automobile manufacturing involves assembly lines and quality control processes.",
    "Photography captures light through camera lenses to create images."
  ],
  "pos_scores": [0.93, 0.86],
  "neg_scores": [0.14, 0.22]
}
```

### Required Fields
- `query`: Query text
- `pos`: List of positive passages

### Optional Fields
- `neg`: List of negative passages
- `pos_scores`: Scores for positives
- `neg_scores`: Scores for negatives

### Processing Logic
- Each `pos[i]` becomes `(query, passage, label=1)`
- Each `neg[j]` becomes `(query, passage, label=0)`
- Maintains original scores
- Tracks source as `legacy_nested`
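The nested-to-flat conversion described above can be sketched like this (a minimal illustration; `flatten_legacy` is an assumed name):

```python
def flatten_legacy(record: dict) -> list[dict]:
    """Convert one legacy nested record into flat reranker pairs."""
    pairs = []
    pos_scores = record.get("pos_scores", [])
    neg_scores = record.get("neg_scores", [])
    for i, passage in enumerate(record["pos"]):
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 1,
            "score": pos_scores[i] if i < len(pos_scores) else None,
            "source": "legacy_nested",  # source tracking, per the list above
        })
    for j, passage in enumerate(record.get("neg", [])):
        pairs.append({
            "query": record["query"],
            "passage": passage,
            "label": 0,
            "score": neg_scores[j] if j < len(neg_scores) else None,
            "source": "legacy_nested",
        })
    return pairs
```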
---

## 🔧 Usage Examples

### Fine-tuning Data Loader

```python
from data.dataset import BGEM3Dataset, BGERerankerDataset, BGEDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Triplets for BGE-M3
triplets_dataset = BGEM3Dataset(
    data_path="triplets_data.jsonl",
    tokenizer=tokenizer,
    format_type="triplets"
)

# Pairs for BGE-reranker
pairs_dataset = BGERerankerDataset(
    data_path="pairs_data.jsonl",
    tokenizer=tokenizer,
    format_type="pairs"
)

# Candidates with threshold conversion
candidates_dataset = BGEDataset(
    data_path="candidates_data.jsonl",
    tokenizer=tokenizer,
    format_type="candidates",
    candidates_threshold=0.6  # Custom threshold
)

# Legacy with automatic conversion
legacy_dataset = BGERerankerDataset(
    data_path="legacy_data.jsonl",
    tokenizer=tokenizer,
    legacy_support=True
)
```

### Optimization Pipeline

```bash
# BGE-M3 optimization (triplets → cached triplets)
python -m data.optimization bge-m3 triplets_data.jsonl \
    --output_dir ./cache/data \
    --hard_negative_ratio 0.8 \
    --difficulty_threshold 0.2

# BGE-reranker optimization (pairs → cached pairs)
python -m data.optimization bge-reranker pairs_data.jsonl \
    --output_dir ./cache/data \
    --output_format flat

# Candidates optimization (candidates → cached pairs)
python -m data.optimization bge-reranker candidates_data.jsonl \
    --output_dir ./cache/data

# Legacy optimization (nested → cached pairs)
python -m data.optimization bge-reranker legacy_data.jsonl \
    --output_dir ./cache/data \
    --output_format nested
```

---

## ⚡ Performance Features

### Enhanced Processing
- **10-15x faster loading** through pre-tokenization
- **Automatic format detection** based on key presence
- **Score-weighted sampling** for better training quality
- **Hard negative mining** with difficulty thresholds

### Conversion Utilities
- **Candidates → Pairs**: Threshold-based label assignment
- **Legacy → Pairs**: Automatic nested-to-flat conversion
- **Pairs → Triplets**: Grouping by query for embedding training
- **Format validation**: Comprehensive error checking

### Backward Compatibility
- **Legacy support**: Full compatibility with existing datasets
- **Gradual migration**: Convert formats incrementally
- **Source tracking**: Know the original format of converted data
- **Flexible thresholds**: Configurable conversion parameters

---

## 🚀 Key Benefits

1. **Unified Pipeline**: Single codebase handles all formats
2. **Automatic Detection**: No manual format specification needed
3. **Seamless Conversion**: Transparent format transformations
4. **Performance Optimized**: Significant speedup through caching
5. **Backward Compatible**: Works with existing datasets
6. **Flexible Configuration**: Customizable thresholds and parameters