# BGE Fine-tuning Directory Structure

This document clarifies the directory structure and cache usage in the BGE fine-tuning project.

## 📂 Project Directory Structure

```
bge_finetune/
├── 📁 cache/                     # Cache directories
│   ├── 📁 data/                  # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl            # Preprocessed JSONL files
│   │   └── 📄 *.json             # Preprocessed JSON files
│   └── 📁 models/                # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/        # Cached tokenizer files
│       ├── 📁 config/            # Cached model config files
│       └── 📁 downloads/         # Temporary download files
│
├── 📁 models/                    # Local model storage
│   ├── 📁 bge-m3/                # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/     # Local BGE-reranker model files
│
├── 📁 data/                      # Raw training data
│   ├── 📄 train.jsonl            # Input training data
│   ├── 📄 eval.jsonl             # Input evaluation data
│   └── 📁 processed/             # Preprocessed (but not cached) data
│
├── 📁 output/                    # Training outputs
│   ├── 📁 bge-m3/                # M3 training results
│   ├── 📁 bge-reranker/          # Reranker training results
│   └── 📁 checkpoints/           # Model checkpoints
│
└── 📁 logs/                      # Training logs
    ├── 📁 tensorboard/           # TensorBoard logs
    └── 📄 training.log           # Text logs
```

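On a fresh checkout, this layout can be created up front so training scripts never fail on a missing directory. A minimal sketch; the `make_skeleton` helper is illustrative and not part of the project:

```python
import tempfile
from pathlib import Path

# Leaf directories of the project layout; mkdir(parents=True) creates the rest
DIRS = [
    "cache/data", "cache/models", "models",
    "data/processed", "output/checkpoints", "logs/tensorboard",
]

def make_skeleton(root: Path) -> None:
    """Create the project directory skeleton (idempotent)."""
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)

# Demo on a throwaway directory; use Path(".") inside the real project
root = Path(tempfile.mkdtemp())
make_skeleton(root)
print((root / "cache" / "data").is_dir())  # True
```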
## 🎯 Directory Purposes

### 1. **`./cache/models/`** - Model Downloads Cache
- **Purpose**: HuggingFace/ModelScope model download cache
- **Contents**: Tokenizers, config files, temporary downloads
- **Usage**: Automatic caching of remote model downloads
- **Config**: `model_paths.cache_dir = "./cache/models"`

### 2. **`./cache/data/`** - Processed Dataset Cache
- **Purpose**: Preprocessed and cleaned dataset files
- **Contents**: Validated JSONL/JSON files
- **Usage**: Store cleaned and validated training data
- **Config**: `data.cache_dir = "./cache/data"`

### 3. **`./models/`** - Local Model Storage
- **Purpose**: Complete local model files (when available)
- **Contents**: Full model directories with all files
- **Usage**: Direct loading without downloads
- **Config**: `model_paths.bge_m3 = "./models/bge-m3"`

### 4. **`./data/`** - Raw Input Data
- **Purpose**: Original training/evaluation datasets
- **Contents**: `.jsonl` files in various formats
- **Usage**: Input to preprocessing and optimization
- **Config**: User-specified paths in training scripts

### 5. **`./output/`** - Training Results
- **Purpose**: Fine-tuned models and training artifacts
- **Contents**: Saved models, checkpoints, metrics
- **Usage**: Results of training scripts
- **Config**: `--output_dir` in training scripts

## ⚙️ Configuration Mapping

### Config File Settings
```toml
[model_paths]
cache_dir = "./cache/models"                 # Model download cache
bge_m3 = "./models/bge-m3"                   # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"  # Local reranker path

[data]
cache_dir = "./cache/data"                   # Dataset cache
optimization_output_dir = "./cache/data"     # Optimization default
```

### Script Defaults
```bash
# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data

# Model training (output models)
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models

# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results
```


## 🔄 Data Flow

```
Raw Data  →  Preprocessing  →  Optimization  →  Training  →  Output
 data/    →       —         →   cache/data/  →   models   →  output/
   ↓                                ↓              ↓            ↓
 Input                       Pre-tokenized       Model      Fine-tuned
 JSONL                         .pkl files       Loading       models
```
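The middle step of this flow (raw JSONL in `data/` to a `.pkl` cache in `cache/data/`) can be sketched in a few lines. The exact cache format is project-specific; this sketch assumes a pickled list of parsed JSON records, and `cache_dataset` is an illustrative name:

```python
import json
import pickle
import tempfile
from pathlib import Path

def cache_dataset(jsonl_path: Path, cache_dir: Path) -> Path:
    """Read raw JSONL examples and write them to a pickled cache file."""
    examples = [
        json.loads(line)
        for line in jsonl_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    cache_dir.mkdir(parents=True, exist_ok=True)
    out = cache_dir / f"{jsonl_path.stem}_cached.pkl"
    out.write_bytes(pickle.dumps(examples))
    return out

# Demo on a throwaway directory mirroring data/ → cache/data/
root = Path(tempfile.mkdtemp())
raw = root / "train.jsonl"
raw.write_text('{"query": "q1", "pos": ["p1"]}\n{"query": "q2", "pos": ["p2"]}\n')
cached = cache_dataset(raw, root / "cache" / "data")
print(cached.name)  # train_cached.pkl
```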

## 🛠️ Usage Examples

### Model Cache (Download Cache)
```python
from transformers import AutoTokenizer

# Automatic model caching
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models",  # Downloads cached here
)
```

### Data Cache (Optimization Cache)
```bash
# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
    --output_dir ./cache/data  # .pkl files created here

# Use cached datasets in training
python scripts/train_m3.py \
    --train_data ./cache/data/train_m3_cached.pkl
```

### Local Models (No Downloads)
```python
# Use a local model directory (no downloads)
model = BGEM3Model.from_pretrained("./models/bge-m3")
```

## 📋 Best Practices

1. **Separate Concerns**: Keep the model cache, data cache, and training outputs in separate directories
2. **Clear Naming**: Use descriptive suffixes (`.pkl` for cache files, `_cached` for optimized datasets)
3. **Consistent Paths**: Use `config.toml` for centralized path management
4. **Cache Management**:
   - Model cache: can be shared across projects
   - Data cache: project-specific optimized datasets
   - Output: training results and checkpoints

## 🧹 Cleanup Commands

```bash
# Clean data cache (re-optimization needed)
rm -rf ./cache/data/

# Clean model cache (re-download needed)
rm -rf ./cache/models/

# Clean training outputs
rm -rf ./output/

# Clean all caches
rm -rf ./cache/
```
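Before deleting a cache, it can be useful to check how much disk space it actually holds. A small hedged helper (not part of the project) that sums file sizes under a directory:

```python
import tempfile
from pathlib import Path

def dir_size_bytes(path: Path) -> int:
    """Total size in bytes of all files under path (0 if it does not exist)."""
    if not path.exists():
        return 0
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())

# Demo on a throwaway layout mirroring cache/data/
root = Path(tempfile.mkdtemp())
(root / "cache" / "data").mkdir(parents=True)
(root / "cache" / "data" / "train_cached.pkl").write_bytes(b"x" * 1024)
print(dir_size_bytes(root / "cache"))  # 1024
```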

This structure ensures clear separation of concerns and efficient caching for both model downloads and processed datasets.